[RFC PATCH net-next 0/5] tcp: TCP tracer

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [RFC PATCH net-next 0/5] tcp: TCP tracer
@ 2014-12-15  1:56 Martin KaFai Lau
  2014-12-15  1:56 ` [RFC PATCH net-next 1/5] tcp: Add TCP TRACE_EVENTs Martin KaFai Lau
                   ` (4 more replies)
  0 siblings, 5 replies; 10+ messages in thread
From: Martin KaFai Lau @ 2014-12-15  1:56 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Hannes Frederic Sowa, Steven Rostedt,
	Lawrence Brakmo, Josef Bacik, Kernel Team

Hi,

We have been using the kernel ftrace infra to collect TCP per-flow statistics.
The following patch set is a first slim-down version of our
existing implementation. We would like to get some early feedback
and make it useful for others.

[RFC PATCH net-next 1/5] tcp: Add TCP TRACE_EVENTs:
Defines some basic tracepoints (by TRACE_EVENT).

[RFC PATCH net-next 2/5] tcp: A perf script for TCP tracepoints:
A sample perf script with simple ip/port filtering and summary output.

[RFC PATCH net-next 3/5] tcp: Add a few more tracepoints for tcp tracer:
Declares a few more tracepoints (by DECLARE_TRACE) which are
used by the tcp_tracer.  The tcp_tracer is in the patch 5/5.

[RFC PATCH net-next 4/5] tcp: Introduce tcp_sk_trace and related structs:
Defines a few tcp_trace structs which are used to collect statistics
on each tcp_sock.

[RFC PATCH net-next 5/5] tcp: Add TCP tracer:
It introduces a tcp_tracer which hooks onto the tracepoints defined in the
patch 1/5 and 3/5.  It collects data defined in patch 4/5. We currently
use this tracer to collect per-flow statistics.  The commit log has
some more details.

Thanks,
--Martin

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [RFC PATCH net-next 1/5] tcp: Add TCP TRACE_EVENTs
  2014-12-15  1:56 [RFC PATCH net-next 0/5] tcp: TCP tracer Martin KaFai Lau
@ 2014-12-15  1:56 ` Martin KaFai Lau
  2014-12-17 15:08   ` David Ahern
  2014-12-15  1:56 ` [RFC PATCH net-next 2/5] tcp: A perf script for TCP tracepoints Martin KaFai Lau
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 10+ messages in thread
From: Martin KaFai Lau @ 2014-12-15  1:56 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Hannes Frederic Sowa, Steven Rostedt,
	Lawrence Brakmo, Josef Bacik, Kernel Team

Add TRACE_EVENT when:
1. connection established
2. segs received
3. segs sending out
4. connection close

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/trace/events/tcp.h | 175 +++++++++++++++++++++++++++++++++++++++++++++
 net/core/net-traces.c      |   1 +
 net/ipv4/tcp.c             |   6 +-
 net/ipv4/tcp_input.c       |   3 +
 net/ipv4/tcp_output.c      |   3 +
 5 files changed, 187 insertions(+), 1 deletion(-)
 create mode 100644 include/trace/events/tcp.h

diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h
new file mode 100644
index 0000000..81b40ef
--- /dev/null
+++ b/include/trace/events/tcp.h
@@ -0,0 +1,175 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM tcp
+
+#if !defined(_TRACE_TCP_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_TCP_H
+
+#include <net/sock.h>
+#include <net/inet_sock.h>
+#include <net/tcp.h>
+#include <linux/skbuff.h>
+#include <linux/ipv6.h>
+#include <linux/tracepoint.h>
+#include <uapi/linux/in6.h>
+
+#define TCP_TRACE_ASSIGN_SA(e, sk)	do {				\
+	(e)->lport = inet_sk((sk))->inet_sport;				\
+	(e)->rport = inet_sk((sk))->inet_dport;				\
+	if ((sk)->sk_family == AF_INET) {				\
+		(e)->ipv6 = 0;						\
+		memset((e)->laddr, 0, sizeof((e)->laddr));		\
+		memset((e)->raddr, 0, sizeof((e)->raddr));		\
+		memcpy((e)->laddr, &inet_sk((sk))->inet_saddr,		\
+		       sizeof(inet_sk((sk))->inet_saddr));		\
+		memcpy((e)->raddr, &inet_sk((sk))->inet_daddr,		\
+		       sizeof(inet_sk((sk))->inet_daddr));		\
+	} else {							\
+		(e)->ipv6 = 1;						\
+		memcpy((e)->laddr, inet6_sk((sk))->saddr.s6_addr,	\
+		       sizeof((e)->laddr));				\
+		memcpy((e)->raddr, (sk)->sk_v6_daddr.s6_addr,		\
+		       sizeof((e)->raddr));				\
+	}								\
+} while (0)
+
+DECLARE_EVENT_CLASS(tcp,
+	TP_PROTO(struct sock *sk),
+	TP_ARGS(sk),
+	TP_STRUCT__entry(
+		__field(u8, ipv6)
+		__array(u8, laddr, 16)
+		__array(u8, raddr, 16)
+		__field(u16, lport)
+		__field(u16, rport)
+		__field(u32, snd_cwnd)
+		__field(u32, mss_cache)
+		__field(u32, ssthresh)
+		__field(u64, srtt_us)
+		__field(u32, rto_ms)
+	),
+	TP_fast_assign(
+		TCP_TRACE_ASSIGN_SA(__entry, sk);
+		__entry->snd_cwnd = tcp_sk(sk)->snd_cwnd;
+		__entry->mss_cache = tcp_sk(sk)->mss_cache;
+		__entry->ssthresh = tcp_current_ssthresh(sk);
+		__entry->srtt_us = tcp_sk(sk)->srtt_us >> 3;
+		__entry->rto_ms = jiffies_to_msecs(inet_csk(sk)->icsk_rto);
+	),
+	TP_printk("local=%s:%d remote=%s:%d snd_cwnd=%u mss_cache=%u "
+		  "ssthresh=%u srtt_us=%llu rto_ms=%u",
+		  __print_hex(__entry->laddr, 16),
+		  __entry->lport,
+		  __print_hex(__entry->raddr, 16),
+		  __entry->rport,
+		  __entry->snd_cwnd, __entry->mss_cache,
+		  __entry->ssthresh, __entry->srtt_us, __entry->rto_ms)
+);
+
+DEFINE_EVENT(tcp,
+	     tcp_established,
+	     TP_PROTO(struct sock *sk),
+	     TP_ARGS(sk)
+);
+
+DEFINE_EVENT(tcp,
+	     tcp_close,
+	     TP_PROTO(struct sock *sk),
+	     TP_ARGS(sk)
+);
+
+TRACE_EVENT(tcp_transmit_skb,
+	TP_PROTO(struct sock *sk, struct sk_buff *skb),
+	TP_ARGS(sk, skb),
+	TP_STRUCT__entry(
+		__field(u8, ipv6)
+		__array(u8, laddr, 16)
+		__array(u8, raddr, 16)
+		__field(u16, lport)
+		__field(u16, rport)
+		__field(u32, seq)
+		__field(u32, end_seq)
+		__field(u32, pcount)
+		__field(u8, ca_state)
+		__field(u32, snd_nxt)
+		__field(u32, snd_una)
+		__field(u32, snd_wnd)
+		__field(u32, snd_cwnd)
+		__field(u32, mss_cache)
+		__field(u32, ssthresh)
+		__field(u64, srtt_us)
+		__field(u32, rto_ms)
+	),
+	TP_fast_assign(
+		TCP_TRACE_ASSIGN_SA(__entry, sk);
+		__entry->seq = TCP_SKB_CB(skb)->seq;
+		__entry->end_seq = TCP_SKB_CB(skb)->end_seq;
+		__entry->pcount = tcp_skb_pcount(skb);
+		__entry->ca_state = inet_csk(sk)->icsk_ca_state;
+		__entry->snd_nxt = tcp_sk(sk)->snd_nxt;
+		__entry->snd_una = tcp_sk(sk)->snd_una;
+		__entry->snd_wnd = tcp_sk(sk)->snd_wnd;
+		__entry->snd_cwnd = tcp_sk(sk)->snd_cwnd;
+		__entry->mss_cache = tcp_sk(sk)->mss_cache;
+		__entry->ssthresh = tcp_current_ssthresh(sk);
+		__entry->srtt_us = tcp_sk(sk)->srtt_us >> 3;
+		__entry->rto_ms = jiffies_to_msecs(inet_csk(sk)->icsk_rto);
+	),
+	TP_printk("local=%s:%d remote=%s:%d "
+		  "skb_seq=%u skb_end_seq=%u pcount=%u ca_state=%x "
+		  "snd_nxt=%u snd_una=%u snd_wnd=%u snd_cwnd=%u mss_cache=%u "
+		  "ssthresh=%u srtt_us=%llu rto_ms=%u",
+		  __print_hex(__entry->laddr, 16), __entry->lport,
+		  __print_hex(__entry->raddr, 16), __entry->rport,
+
+		  __entry->seq, __entry->end_seq, __entry->pcount,
+		  __entry->ca_state,
+
+		  __entry->snd_nxt, __entry->snd_una, __entry->snd_wnd,
+		  __entry->snd_cwnd, __entry->mss_cache,
+
+		  __entry->ssthresh, __entry->srtt_us, __entry->rto_ms)
+);
+
+TRACE_EVENT(tcp_rcv_established,
+	    TP_PROTO(struct sock *sk, struct sk_buff *skb),
+	    TP_ARGS(sk, skb),
+	TP_STRUCT__entry(
+		__field(u8, ipv6)
+		__array(u8, laddr, 16)
+		__array(u8, raddr, 16)
+		__field(u16, lport)
+		__field(u16, rport)
+		__field(u32, seq)
+		__field(u32, end_seq)
+		__field(u32, ack_seq)
+		__field(u32, snd_una)
+		__field(u32, rcv_nxt)
+		__field(u32, rcv_wnd)
+	),
+	TP_fast_assign(
+		TCP_TRACE_ASSIGN_SA(__entry, sk);
+		__entry->seq = TCP_SKB_CB(skb)->seq;
+		__entry->end_seq = TCP_SKB_CB(skb)->end_seq;
+		__entry->ack_seq = TCP_SKB_CB(skb)->ack_seq;
+		__entry->snd_una = tcp_sk(sk)->snd_una;
+		__entry->rcv_nxt = tcp_sk(sk)->rcv_nxt;
+		__entry->rcv_wnd = tcp_sk(sk)->rcv_wnd;
+	),
+	TP_printk("local=%s:%d remote=%s:%d "
+		  "skb_seq=%u skb_end_seq=%u skb_ack_seq=%u snd_una=%u "
+		  "rcv_nxt=%u, rcv_wnd=%u",
+		  __print_hex(__entry->laddr, 16), __entry->lport,
+		  __print_hex(__entry->raddr, 16), __entry->rport,
+
+		  __entry->seq, __entry->end_seq, __entry->ack_seq,
+		  __entry->snd_una,
+
+		  __entry->rcv_nxt, __entry->rcv_wnd)
+);
+
+#undef TCP_TRACE_ASSIGN_SA
+
+#endif /* _TRACE_TCP_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/net/core/net-traces.c b/net/core/net-traces.c
index ba3c012..63f966b 100644
--- a/net/core/net-traces.c
+++ b/net/core/net-traces.c
@@ -31,6 +31,7 @@
 #include <trace/events/napi.h>
 #include <trace/events/sock.h>
 #include <trace/events/udp.h>
+#include <trace/events/tcp.h>
 
 EXPORT_TRACEPOINT_SYMBOL_GPL(kfree_skb);
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 3075723..3b887fa 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -275,6 +275,7 @@
 #include <net/xfrm.h>
 #include <net/ip.h>
 #include <net/sock.h>
+#include <trace/events/tcp.h>
 
 #include <asm/uaccess.h>
 #include <asm/ioctls.h>
@@ -1901,8 +1902,10 @@ void tcp_set_state(struct sock *sk, int state)
 
 	switch (state) {
 	case TCP_ESTABLISHED:
-		if (oldstate != TCP_ESTABLISHED)
+		if (oldstate != TCP_ESTABLISHED) {
 			TCP_INC_STATS(sock_net(sk), TCP_MIB_CURRESTAB);
+			trace_tcp_established(sk);
+		}
 		break;
 
 	case TCP_CLOSE:
@@ -1913,6 +1916,7 @@ void tcp_set_state(struct sock *sk, int state)
 		if (inet_csk(sk)->icsk_bind_hash &&
 		    !(sk->sk_userlocks & SOCK_BINDPORT_LOCK))
 			inet_put_port(sk);
+		trace_tcp_close(sk);
 		/* fall through */
 	default:
 		if (oldstate == TCP_ESTABLISHED)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 075ab4d..808fad7 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -75,6 +75,7 @@
 #include <linux/ipsec.h>
 #include <asm/unaligned.h>
 #include <linux/errqueue.h>
+#include <trace/events/tcp.h>
 
 int sysctl_tcp_timestamps __read_mostly = 1;
 int sysctl_tcp_window_scaling __read_mostly = 1;
@@ -5076,6 +5077,8 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
+	trace_tcp_rcv_established(sk, skb);
+
 	if (unlikely(sk->sk_rx_dst == NULL))
 		inet_csk(sk)->icsk_af_ops->sk_rx_dst_set(sk, skb);
 	/*
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 7f18262..9832512 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -41,6 +41,7 @@
 #include <linux/compiler.h>
 #include <linux/gfp.h>
 #include <linux/module.h>
+#include <trace/events/tcp.h>
 
 /* People can turn this off for buggy TCP's found in printers etc. */
 int sysctl_tcp_retrans_collapse __read_mostly = 1;
@@ -1014,6 +1015,8 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
 	/* Our usage of tstamp should remain private */
 	skb->tstamp.tv64 = 0;
 
+	trace_tcp_transmit_skb(sk, skb);
+
 	/* Cleanup our debris for IP stacks */
 	memset(skb->cb, 0, max(sizeof(struct inet_skb_parm),
 			       sizeof(struct inet6_skb_parm)));
-- 
1.8.1

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [RFC PATCH net-next 2/5] tcp: A perf script for TCP tracepoints
  2014-12-15  1:56 [RFC PATCH net-next 0/5] tcp: TCP tracer Martin KaFai Lau
  2014-12-15  1:56 ` [RFC PATCH net-next 1/5] tcp: Add TCP TRACE_EVENTs Martin KaFai Lau
@ 2014-12-15  1:56 ` Martin KaFai Lau
  2014-12-15  1:56 ` [RFC PATCH net-next 3/5] tcp: Add a few more tracepoints for tcp tracer Martin KaFai Lau
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 10+ messages in thread
From: Martin KaFai Lau @ 2014-12-15  1:56 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Hannes Frederic Sowa, Steven Rostedt,
	Lawrence Brakmo, Josef Bacik, Kernel Team

A sample perf script.  It has a simple ip/port filtering and a summary output.

Here is a test with netem delay 100ms loss 0.1% and comparing the tcp-summary
output between reno and cubic.  It was run in a kvm environment:

[root@qemu1-centos65 perf]# sysctl -w net.ipv4.tcp_congestion_control=reno
net.ipv4.tcp_congestion_control = reno
[root@qemu1-centos65 perf]# ./perf record -a -e 'tcp:*' netperf -c -C -H 192.168.168.254 -l 60 -p 8888 -- -s 1M -S 1M -m 64K -P ,8889
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.168.254 () port 8889 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

2097152 425984  65536    60.78         2.91   0.00     0.86     0.000   771.397
[ perf record: Woken up 13 times to write data ]
[ perf record: Captured and wrote 3.231 MB perf.data (~141185 samples) ]
[root@qemu1 perf]# PYTHONPATH="$PYTHONPATH:/root/devhostshare/fb-kernel/linux/tools/perf/scripts/python/Perf-Trace-Util/lib/Perf/Trace" ./perf script -s scripts/python/tcp-summary.py  -- --rport 8889
snd_cwnd(avg/min/max): 27/8/47
segs out: 15290
data segs out: 15275
octets out: 22116754
loss rxmits: 0
other rxmits: 13
rxmits%: 0.085
dup_acks: 402
established: 1
close: 1

[root@qemu1 perf]# sysctl -w net.ipv4.tcp_congestion_control=cubic
net.ipv4.tcp_congestion_control = cubic
[root@qemu1 perf]# ./perf record -a -e 'tcp:*' netperf -c -C -H 192.168.168.254 -l 60 -p 8888 -- -s 1M -S 1M -m 64K -P ,8889
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.168.254 () port 8889 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

2097152 425984  65536    60.25         4.97   0.02     0.86     1.096   454.051
[ perf record: Woken up 21 times to write data ]
[ perf record: Captured and wrote 5.525 MB perf.data (~241393 samples) ]
[root@qemu1 perf]# PYTHONPATH="$PYTHONPATH:/root/devhostshare/fb-kernel/linux/tools/perf/scripts/python/Perf-Trace-Util/lib/Perf/Trace" ./perf script -s scripts/python/tcp-summary.py  -- --rport 8889
snd_cwnd(avg/min/max): 47/10/78
segs out: 25869
data segs out: 25848
octets out: 37426458
loss rxmits: 0
other rxmits: 19
rxmits%: 0.074
dup_acks: 957
established: 1
close: 1

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 tools/perf/scripts/python/tcp-summary.py | 262 +++++++++++++++++++++++++++++++
 1 file changed, 262 insertions(+)
 create mode 100644 tools/perf/scripts/python/tcp-summary.py

diff --git a/tools/perf/scripts/python/tcp-summary.py b/tools/perf/scripts/python/tcp-summary.py
new file mode 100644
index 0000000..fe85a43
--- /dev/null
+++ b/tools/perf/scripts/python/tcp-summary.py
@@ -0,0 +1,262 @@
+import os
+import sys
+import argparse
+import struct
+import socket
+
+sys.path.append(os.environ['PERF_EXEC_PATH'] + \
+	'/scripts/python/Perf-Trace-Util/lib/Perf/Trace')
+
+from perf_trace_context import *
+from Core import *
+
+def ntohip(nip):
+	hip = 0L
+	for i in nip:
+		hip <<= 8
+		hip |= i
+	return hip
+
+def unpack_nip(nip_in_c):
+	fmt = "!" + "B" * len(nip_in_c)
+	return struct.unpack(fmt, nip_in_c)
+
+def unpack_and_hostify_sa(laddr, lport, raddr, rport):
+	return (ntohip(unpack_nip(laddr)), socket.ntohs(lport),
+		ntohip(unpack_nip(raddr)), socket.ntohs(rport))
+
+class SimpleSubnet(object):
+	def _parse_ip4(self):
+		hip = 0L
+		dec_u8 = self.ip_str.split(".")
+		for i in dec_u8:
+			hip = hip << 8
+			hip |= long(i)
+		self.hip = hip
+
+	def _parse_ip6(self):
+		hip = 0L;
+		hex_u16	 = self.ip_str.split(":")
+		if self.ip_str.endswith("::"):
+			if hex_u16.count("") > 2:
+				raise Exception("Error in parsing IPv6")
+			cpz_zeros = 8 - len(hex_u16) + 2
+			hex_u16 = hex_u16[:-1]
+		elif "" in hex_u16:
+			if hex_u16.count("") > 1:
+				raise Exception("Error in parsing IPv6")
+			cpz_zeros = 8 - len(hex_u16) + 1
+		for i in hex_u16:
+			if len(i) == 0:
+				for j in range(cpz_zeros):
+					hip = hip << 16
+			else:
+				hip = hip << 16;
+				hip |= long(i, 16)
+		self.hip = hip
+
+	def _parse_ip(self, ip_str):
+		slash_start = ip_str.find("/")
+		if slash_start == -1:
+			self.ip_str = ip_str
+			self.plen = 0
+		else:
+			self.ip_str = ip_str[0:slash_start]
+			self.plen = int(ip_str[slash_start+1:])
+
+		if ':' in self.ip_str:
+			self._parse_ip6()
+			self.is_ip6 = True
+			if self.plen == 0:
+				self.plen = 128
+			self.netmask = 0xffffffffffffffffffffffffffffffffL >> (128 - self.plen) << (128 - self.plen)
+		else:
+			self._parse_ip4()
+			self.is_ip6 = False
+			if self.plen == 0:
+				self.plen = 32
+			self.netmask = 0xffffffffL >> (32 - self.plen) << (32 - self.plen)
+
+	def _min_ip(self):
+		return self.hip & self.netmask
+
+	def _max_ip(self):
+		if self.is_ip6:
+			return self._min_ip() | (0xffffffffffffffffffffffffffffffffL - self.netmask)
+		else:
+			return self._min_ip() | (0xffffffffL - self.netmask)
+
+	def __contains__(self, hip):
+		return self._min_ip() <= hip <= self._max_ip()
+
+	def __init__(self, ip_str):
+		self._parse_ip(ip_str)
+
+class SimpleFilter(object):
+	def __init__(self, lsn, lhport, rsn, rhport):
+		self.lsn = lsn
+		self.rsn = rsn
+		self.lhport = lhport
+		self.rhport = rhport
+
+	def match(self, lhip, lhport, rhip, rhport):
+		if self.lsn and lhip not in self.lsn:
+			return False
+		if self.rsn and rhip not in self.rsn:
+			return False
+		if self.lhport and self.lhport != lhport:
+			return False
+		if self.rhport and self.rhport != rhport:
+			return False
+
+		return True
+
+class TcpSummary(object):
+	def __init__(self):
+		self.snd_cwnd_sum = 0
+		self.snd_cwnd_count = 0
+		self.max_snd_cwnd = 0
+		self.min_snd_cwnd = 0xffffffff
+		self.segs_out = 0
+		self.data_segs_out = 0
+		self.data_octets_out = 0
+		self.loss_rxmits = 0
+		self.other_rxmits = 0
+		self.dup_acks = 0
+		self.established = 0
+		self.close = 0
+
+	def add_loss_rxmits(self, n):
+		self.loss_rxmits += n
+
+	def add_other_rxmits(self, n):
+		self.other_rxmits += n
+
+	def add_segs_out(self, n):
+		self.segs_out += n
+
+	def add_data_segs_out(self, n):
+		self.data_segs_out += n
+
+	def add_data_octets_out(self, n):
+		self.data_octets_out += n
+
+	def add_snd_cwnd_sample(self, n):
+		self.snd_cwnd_sum += n
+		self.snd_cwnd_count += 1
+		self.min_snd_cwnd = min(self.min_snd_cwnd, n)
+		self.max_snd_cwnd = max(self.max_snd_cwnd, n)
+
+	def inc_dup_acks(self):
+		self.dup_acks += 1
+
+	def inc_established(self):
+		self.established += 1
+
+	def inc_close(self):
+		self.close += 1
+
+	def report(self):
+		if self.snd_cwnd_count == 0:
+			avg_cwnd = 0
+			min_snd_cwnd = 0
+		else:
+			avg_cwnd = self.snd_cwnd_sum / self.snd_cwnd_count
+			min_snd_cwnd = self.min_snd_cwnd
+		rxmit_rate = (self.loss_rxmits + self.other_rxmits) * 100.0 / \
+			     self.data_segs_out if self.data_segs_out else 0
+		print "snd_cwnd(avg/min/max): %u/%u/%u" % (avg_cwnd,
+							   min_snd_cwnd,
+							   self.max_snd_cwnd)
+		print "segs out: %u" % self.segs_out
+		print "data segs out: %u" % self.data_segs_out
+		print "octets out: %u" % self.data_octets_out
+		print "loss rxmits: %u" % self.loss_rxmits
+		print "other rxmits: %u" % self.other_rxmits
+		print "rxmits%%: %.3f" % rxmit_rate
+		print "dup_acks: %u" % self.dup_acks
+		print "established: %u" % self.established
+		print "close: %u" % self.close
+
+parser = argparse.ArgumentParser()
+parser.add_argument("--laddr", help="Local address in IP[/prefix-len]")
+parser.add_argument("--raddr", help="Remote address in IP[/prefix-len]")
+parser.add_argument("--lport", type=int, default=0, help="Local port")
+parser.add_argument("--rport", type=int, default=0, help="Remote port")
+args = parser.parse_args()
+lsn = SimpleSubnet(args.laddr) if args.laddr else None
+rsn = SimpleSubnet(args.raddr) if args.raddr else None
+addr_filter = SimpleFilter(lsn, args.lport, rsn, args.rport)
+tcp_summary = TcpSummary()
+
+def trace_begin():
+	pass
+
+def trace_end():
+	tcp_summary.report();
+
+def tcp__tcp_transmit_skb(event_name, context, common_cpu,
+	common_secs, common_nsecs, common_pid, common_comm,
+	common_callchain, ipv6, laddr, raddr, lport,
+	rport, seq, end_seq, pcount, ca_state,
+	snd_nxt, snd_una, snd_wnd, snd_cwnd, mss_cache,
+	ssthresh, srtt_us, rto_ms):
+
+	if not addr_filter.match(*unpack_and_hostify_sa(laddr, lport,
+							raddr, rport)):
+		return
+
+	tcp_summary.add_segs_out(pcount)
+	if end_seq > seq:
+		tcp_summary.add_snd_cwnd_sample(snd_cwnd)
+		if seq < snd_nxt:
+			if ca_state == 4:
+				tcp_summary.add_loss_rxmits(pcount)
+			else:
+				tcp_summary.add_other_rxmits(pcount)
+		else:
+			tcp_summary.add_data_segs_out(pcount)
+			tcp_summary.add_data_octets_out(end_seq - seq)
+
+def tcp__tcp_rcv_established(event_name, context, common_cpu,
+	common_secs, common_nsecs, common_pid, common_comm,
+	common_callchain, ipv6, laddr, raddr, lport,
+	rport, seq, end_seq, ack_seq, snd_una,
+	rcv_nxt, rcv_wnd):
+	if not addr_filter.match(*unpack_and_hostify_sa(laddr, lport,
+							raddr, rport)):
+		return
+
+	if seq == end_seq and ack_seq == snd_una:
+		tcp_summary.inc_dup_acks()
+
+def tcp__tcp_established(event_name, context, common_cpu,
+	common_secs, common_nsecs, common_pid, common_comm,
+	common_callchain, ipv6, laddr, raddr, lport,
+	rport, snd_cwnd, mss_cache, ssthresh, srtt_us,
+	rto_ms):
+
+	if not addr_filter.match(*unpack_and_hostify_sa(laddr, lport,
+							raddr, rport)):
+		return
+
+	tcp_summary.inc_established()
+
+def tcp__tcp_close(event_name, context, common_cpu,
+	common_secs, common_nsecs, common_pid, common_comm,
+	common_callchain, ipv6, laddr, raddr, lport,
+	rport, snd_cwnd, mss_cache, ssthresh, srtt_us,
+	rto_ms):
+
+	if not addr_filter.match(*unpack_and_hostify_sa(laddr, lport,
+							raddr, rport)):
+		return
+
+	tcp_summary.inc_close()
+
+def trace_unhandled(event_name, context, event_fields_dict):
+	print ' '.join(['%s=%s'%(k,str(v))for k,v in sorted(event_fields_dict.items())])
+
+def print_header(event_name, cpu, secs, nsecs, pid, comm):
+	print "%-20s %5u %05u.%09u %8u %-20s " % \
+	(event_name, cpu, secs, nsecs, pid, comm),
-- 
1.8.1

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [RFC PATCH net-next 3/5] tcp: Add a few more tracepoints for tcp tracer
  2014-12-15  1:56 [RFC PATCH net-next 0/5] tcp: TCP tracer Martin KaFai Lau
  2014-12-15  1:56 ` [RFC PATCH net-next 1/5] tcp: Add TCP TRACE_EVENTs Martin KaFai Lau
  2014-12-15  1:56 ` [RFC PATCH net-next 2/5] tcp: A perf script for TCP tracepoints Martin KaFai Lau
@ 2014-12-15  1:56 ` Martin KaFai Lau
  2014-12-17 15:33   ` Arnaldo Carvalho de Melo
  2014-12-15  1:56 ` [RFC PATCH net-next 4/5] tcp: Introduce tcp_sk_trace and related structs Martin KaFai Lau
  2014-12-15  1:56 ` [RFC PATCH net-next 5/5] tcp: Add TCP tracer Martin KaFai Lau
  4 siblings, 1 reply; 10+ messages in thread
From: Martin KaFai Lau @ 2014-12-15  1:56 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Hannes Frederic Sowa, Steven Rostedt,
	Lawrence Brakmo, Josef Bacik, Kernel Team

The tcp tracer, which will be added in the later patch, depends
on them to collect statistics.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/trace/events/tcp.h | 15 +++++++++++++++
 net/ipv4/tcp_input.c       |  6 ++++++
 2 files changed, 21 insertions(+)

diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h
index 81b40ef..440a26b 100644
--- a/include/trace/events/tcp.h
+++ b/include/trace/events/tcp.h
@@ -167,6 +167,21 @@ TRACE_EVENT(tcp_rcv_established,
 		  __entry->rcv_nxt, __entry->rcv_wnd)
 );
 
+DECLARE_TRACE(tcp_ooo_rcv,
+	      TP_PROTO(struct sock *sk),
+	      TP_ARGS(sk)
+);
+
+DECLARE_TRACE(tcp_sacks_rcv,
+	     TP_PROTO(struct sock *sk, int num_sacks),
+	     TP_ARGS(sk, num_sacks)
+);
+
+DECLARE_TRACE(tcp_rtt_sample,
+	      TP_PROTO(struct sock *sk, long rtt_sample_us),
+	      TP_ARGS(sk, rtt_sample_us)
+);
+
 #undef TCP_TRACE_ASSIGN_SA
 
 #endif /* _TRACE_TCP_H */
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 808fad7..1f82987 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1650,6 +1650,8 @@ tcp_sacktag_write_queue(struct sock *sk, const struct sk_buff *ack_skb,
 	int i, j;
 	int first_sack_index;
 
+	trace_tcp_sacks_rcv(sk, num_sacks);
+
 	state.flag = 0;
 	state.reord = tp->packets_out;
 	state.rtt_us = -1L;
@@ -2932,6 +2934,9 @@ static inline bool tcp_ack_update_rtt(struct sock *sk, const int flag,
 
 	/* RFC6298: only reset backoff on valid RTT measurement. */
 	inet_csk(sk)->icsk_backoff = 0;
+
+	trace_tcp_rtt_sample(sk, seq_rtt_us);
+
 	return true;
 }
 
@@ -4232,6 +4237,7 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
 	NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPOFOQUEUE);
 	SOCK_DEBUG(sk, "out of order segment: rcv_next %X seq %X - %X\n",
 		   tp->rcv_nxt, TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq);
+	trace_tcp_ooo_rcv(sk);
 
 	skb1 = skb_peek_tail(&tp->out_of_order_queue);
 	if (!skb1) {
-- 
1.8.1

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [RFC PATCH net-next 4/5] tcp: Introduce tcp_sk_trace and related structs.
  2014-12-15  1:56 [RFC PATCH net-next 0/5] tcp: TCP tracer Martin KaFai Lau
                   ` (2 preceding siblings ...)
  2014-12-15  1:56 ` [RFC PATCH net-next 3/5] tcp: Add a few more tracepoints for tcp tracer Martin KaFai Lau
@ 2014-12-15  1:56 ` Martin KaFai Lau
  2014-12-15  1:56 ` [RFC PATCH net-next 5/5] tcp: Add TCP tracer Martin KaFai Lau
  4 siblings, 0 replies; 10+ messages in thread
From: Martin KaFai Lau @ 2014-12-15  1:56 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Hannes Frederic Sowa, Steven Rostedt,
	Lawrence Brakmo, Josef Bacik, Kernel Team

The tcp_sk_trace and its related structs define what will be
collected and recorded to the tracing's ring_buffer by
the TCP tracer (in the following patch).

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/linux/tcp.h            |  4 +++
 include/net/tcp_trace.h        | 18 ++++++++++
 include/uapi/linux/tcp_trace.h | 78 ++++++++++++++++++++++++++++++++++++++++++
 kernel/trace/Kconfig           | 11 ++++++
 kernel/trace/Makefile          |  1 +
 kernel/trace/tcp_trace.c       | 37 ++++++++++++++++++++
 net/ipv4/tcp.c                 |  4 +++
 7 files changed, 153 insertions(+)
 create mode 100644 include/net/tcp_trace.h
 create mode 100644 include/uapi/linux/tcp_trace.h
 create mode 100644 kernel/trace/tcp_trace.c

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 67309ec..8d25cb3 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -315,6 +315,10 @@ struct tcp_sock {
 	 * socket. Used to retransmit SYNACKs etc.
 	 */
 	struct request_sock *fastopen_rsk;
+
+#ifdef CONFIG_TCP_TRACE
+	struct tcp_sk_trace *trace;
+#endif
 };
 
 enum tsq_flags {
diff --git a/include/net/tcp_trace.h b/include/net/tcp_trace.h
new file mode 100644
index 0000000..f800cc7
--- /dev/null
+++ b/include/net/tcp_trace.h
@@ -0,0 +1,18 @@
+#ifndef TCP_TRACE_H
+#define TCP_TRACE_H
+
+struct sock;
+
+#ifdef CONFIG_TCP_TRACE
+
+void tcp_sk_trace_init(struct sock *sk);
+void tcp_sk_trace_destruct(struct sock *sk);
+
+#else /* CONFIG_TCP_TRACE */
+
+static inline void tcp_sk_trace_init(struct sock *sk) {}
+static inline void tcp_sk_trace_destruct(struct sock *sk) {}
+
+#endif
+
+#endif /* TCP_TRACE_H */
diff --git a/include/uapi/linux/tcp_trace.h b/include/uapi/linux/tcp_trace.h
new file mode 100644
index 0000000..4f91056
--- /dev/null
+++ b/include/uapi/linux/tcp_trace.h
@@ -0,0 +1,78 @@
+#ifndef UAPI_TCP_TRACE_H
+#define UAPI_TCP_TRACE_H
+
+#include <linux/kernel.h>
+
+#define TCP_TRACE_MAGIC		0x54435000
+#define TCP_TRACE_VERSION	0x01
+#define TCP_TRACE_MAGIC_VERSION	(TCP_TRACE_MAGIC | TCP_TRACE_VERSION)
+
+enum tcp_trace_events {
+	TCP_TRACE_EVENT_ESTABLISHED,
+	TCP_TRACE_EVENT_PERIODIC,	/* Periodic event every 2s */
+	TCP_TRACE_EVENT_RETRANS,	/* Retrans (not in TCP_CA_Loss) */
+	TCP_TRACE_EVENT_RETRANS_LOSS,	/* Retrans in TCP_CA_Loss */
+	TCP_TRACE_EVENT_CLOSE,	/* Connection close */
+};
+
+struct tcp_stats {
+	/* outing packets */
+	__u32	segs_out;
+	__u32	data_segs_out;
+	__u64	data_octets_out;
+
+	/* retrans */
+	__u32	other_segs_retrans;
+	__u32	other_octets_retrans;
+	__u32	loss_segs_retrans;
+	__u32	loss_octets_retrans;
+
+	/* incoming packets */
+	__u32	segs_in;
+	__u32	data_segs_in;
+	__u64	data_octets_in;
+
+	/* RTT */
+	__u64	rtt_sample_us;
+	__u64	max_rtt_us;
+	__u64	min_rtt_us;
+	__u64	sum_rtt_us;
+	__u32	count_rtt;
+
+	/* RTO */
+	__u32	max_rto_ms;
+	__u32	min_rto_ms;
+
+	/* OOO or Loss */
+	__u32	dup_acks_in;
+	__u32	sacks_in;
+	__u32	sack_blks_in;
+	__u32	ooo_in;
+} __packed;
+
+struct tcp_trace {
+	__u32	magic;
+	__u8	event:7,
+		ipv6:1;
+	__u32	local_addr[4];
+	__u32	remote_addr[4];
+	__u16	local_port;
+	__u16	remote_port;
+} __packed;
+
+struct tcp_trace_basic {
+	struct tcp_trace event;
+	/* current values from tcp_sock */
+	__u32	snd_cwnd;
+	__u32	mss;
+	__u32	ssthresh;
+	__u64	srtt_us;
+	__u32	rto_ms;
+} __packed;
+
+struct tcp_trace_stats {
+	struct tcp_trace_basic basic;
+	struct tcp_stats stats;
+} __packed;
+
+#endif /* UAPI_TCP_TRACE_H */
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index a5da09c..f30835c 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -599,6 +599,17 @@ config RING_BUFFER_STARTUP_TEST
 
 	 If unsure, say N
 
+config TCP_TRACE
+	bool "TCP tracing"
+	depends on NET && INET
+	select DEBUG_FS
+	select TRACEPOINTS
+	select GENERIC_TRACER
+	help
+	  This tracer collects per-flow statistics and events.
+
+	  If unsure, say N.
+
 endif # FTRACE
 
 endif # TRACING_SUPPORT
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index 67d6369..71d008a 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -65,5 +65,6 @@ obj-$(CONFIG_PROBE_EVENTS) += trace_probe.o
 obj-$(CONFIG_UPROBE_EVENT) += trace_uprobe.o
 
 obj-$(CONFIG_TRACEPOINT_BENCHMARK) += trace_benchmark.o
+obj-$(CONFIG_TCP_TRACE) += tcp_trace.o
 
 libftrace-y := ftrace.o
diff --git a/kernel/trace/tcp_trace.c b/kernel/trace/tcp_trace.c
new file mode 100644
index 0000000..9d09fd0
--- /dev/null
+++ b/kernel/trace/tcp_trace.c
@@ -0,0 +1,37 @@
+#include <net/tcp_trace.h>
+#include <linux/tcp.h>
+#include <uapi/linux/tcp_trace.h>
+
+static bool tcp_trace_enabled __read_mostly;
+
+struct tcp_sk_trace {
+	struct tcp_stats stats;
+	unsigned long start_ts;
+	unsigned long last_ts;
+};
+
+void tcp_sk_trace_init(struct sock *sk)
+{
+	struct tcp_sk_trace *sktr;
+
+	tcp_sk(sk)->trace = NULL;
+	if (!tcp_trace_enabled)
+		return;
+
+	sktr  = kzalloc(sizeof(*sktr), gfp_any());
+	if (unlikely(!sktr))
+		return;
+
+	tcp_sk(sk)->trace = sktr;
+	sk->sk_destruct = tcp_sock_destruct;
+
+	sktr->stats.min_rtt_us = U64_MAX;
+	sktr->stats.min_rto_ms = U32_MAX;
+
+	sktr->last_ts = sktr->start_ts = jiffies;
+}
+
+void tcp_sk_trace_destruct(struct sock *sk)
+{
+	kfree(tcp_sk(sk)->trace);
+}
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 3b887fa..41871c4 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -275,6 +275,7 @@
 #include <net/xfrm.h>
 #include <net/ip.h>
 #include <net/sock.h>
+#include <net/tcp_trace.h>
 #include <trace/events/tcp.h>
 
 #include <asm/uaccess.h>
@@ -1904,6 +1905,7 @@ void tcp_set_state(struct sock *sk, int state)
 	case TCP_ESTABLISHED:
 		if (oldstate != TCP_ESTABLISHED) {
 			TCP_INC_STATS(sock_net(sk), TCP_MIB_CURRESTAB);
+			tcp_sk_trace_init(sk);
 			trace_tcp_established(sk);
 		}
 		break;
@@ -2254,6 +2256,8 @@ EXPORT_SYMBOL(tcp_disconnect);
 
 void tcp_sock_destruct(struct sock *sk)
 {
+	tcp_sk_trace_destruct(sk);
+
 	inet_sock_destruct(sk);
 
 	kfree(inet_csk(sk)->icsk_accept_queue.fastopenq);
-- 
1.8.1

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [RFC PATCH net-next 5/5] tcp: Add TCP tracer
  2014-12-15  1:56 [RFC PATCH net-next 0/5] tcp: TCP tracer Martin KaFai Lau
                   ` (3 preceding siblings ...)
  2014-12-15  1:56 ` [RFC PATCH net-next 4/5] tcp: Introduce tcp_sk_trace and related structs Martin KaFai Lau
@ 2014-12-15  1:56 ` Martin KaFai Lau
  4 siblings, 0 replies; 10+ messages in thread
From: Martin KaFai Lau @ 2014-12-15  1:56 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Hannes Frederic Sowa, Steven Rostedt,
	Lawrence Brakmo, Josef Bacik, Kernel Team

Define probes and register them to the TCP tracepoints.  The probes
collect the data defined in struct tcp_sk_trace and record them to
the tracing's ring_buffer.

We currently use the TCP tracer to collect per-flow TCP statistics. Here is our
sample usage:
1. compare performance of different subnet, connection-type (like wire vs
   wireless), application-protocol...etc.
2. Uncover uplink/backbone/subnet issue, e.g. by tracking the rxmit rate.

This patch concludes the series. It is still missing a few things that
we currently have, like:
- why the sender is blocked? and how long for each reason?
- some TCP Congestion Control data
- ...etc.
We are interested to complete them in the next few versions.

[
Some background:
It was inspired by the Web10G effort (RFC4898) on collecting TCP
per-flow stats.  We initially took the Web10G kernel patch, capture only a
subset of RFC4898 that we monitor continuously in production and added codes to
leverage the ftrace infra.
]

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 kernel/trace/tcp_trace.c | 489 ++++++++++++++++++++++++++++++++++++++++++++++-
 kernel/trace/trace.h     |   1 +
 2 files changed, 488 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/tcp_trace.c b/kernel/trace/tcp_trace.c
index 9d09fd0..da9b748 100644
--- a/kernel/trace/tcp_trace.c
+++ b/kernel/trace/tcp_trace.c
@@ -1,12 +1,29 @@
 #include <net/tcp_trace.h>
+#include <net/tcp.h>
+#include <trace/events/tcp.h>
 #include <linux/tcp.h>
+#include <linux/ipv6.h>
+#include <linux/ftrace_event.h>
+#include <linux/jiffies.h>
 #include <uapi/linux/tcp_trace.h>
 
+#include "trace_output.h"
+
+#define REPORT_INTERVAL_MS 2000
+
+static struct trace_array *tcp_tr;
 static bool tcp_trace_enabled __read_mostly;
 
+static struct trace_print_flags tcp_trace_event_names[] = {
+	{ TCP_TRACE_EVENT_ESTABLISHED, "established" },
+	{ TCP_TRACE_EVENT_PERIODIC, "periodic" },
+	{ TCP_TRACE_EVENT_RETRANS, "retrans" },
+	{ TCP_TRACE_EVENT_RETRANS_LOSS, "retrans_loss" },
+	{ TCP_TRACE_EVENT_CLOSE, "close" }
+};
+
 struct tcp_sk_trace {
 	struct tcp_stats stats;
-	unsigned long start_ts;
 	unsigned long last_ts;
 };
 
@@ -28,10 +45,478 @@ void tcp_sk_trace_init(struct sock *sk)
 	sktr->stats.min_rtt_us = U64_MAX;
 	sktr->stats.min_rto_ms = U32_MAX;
 
-	sktr->last_ts = sktr->start_ts = jiffies;
+	sktr->last_ts = jiffies;
 }
 
 void tcp_sk_trace_destruct(struct sock *sk)
 {
 	kfree(tcp_sk(sk)->trace);
 }
+
+static void tcp_trace_init(struct tcp_trace *tr,
+			   enum tcp_trace_events trev,
+			   struct sock *sk)
+{
+	tr->event = trev;
+	if (sk->sk_family == AF_INET) {
+		tr->ipv6 = 0;
+		tr->local_addr[0] = inet_sk(sk)->inet_saddr;
+		tr->remote_addr[0] = inet_sk(sk)->inet_daddr;
+	} else {
+		BUG_ON(sk->sk_family != AF_INET6);
+		tr->ipv6 = 1;
+		memcpy(tr->local_addr, inet6_sk(sk)->saddr.s6_addr32,
+		       sizeof(tr->local_addr));
+		memcpy(tr->remote_addr, sk->sk_v6_daddr.s6_addr32,
+		       sizeof(tr->remote_addr));
+	}
+	tr->local_port = inet_sk(sk)->inet_sport;
+	tr->remote_port = inet_sk(sk)->inet_dport;
+}
+
+static void tcp_trace_basic_init(struct tcp_trace_basic *trb,
+				 enum tcp_trace_events trev,
+				 struct sock *sk)
+{
+	tcp_trace_init((struct tcp_trace *)trb, trev, sk);
+	trb->snd_cwnd = tcp_sk(sk)->snd_cwnd;
+	trb->mss = tcp_sk(sk)->mss_cache;
+	trb->ssthresh = tcp_current_ssthresh(sk);
+	trb->srtt_us = tcp_sk(sk)->srtt_us >> 3;
+	trb->rto_ms = jiffies_to_msecs(inet_csk(sk)->icsk_rto);
+}
+
+static void tcp_trace_basic_add(enum tcp_trace_events trev, struct sock *sk)
+{
+	struct ring_buffer *buffer;
+	int pc;
+	struct ring_buffer_event *event;
+	struct tcp_trace_basic *trb;
+	struct tcp_sk_trace *sktr = tcp_sk(sk)->trace;
+
+	if (!sktr)
+		return;
+
+	tracing_record_cmdline(current);
+	buffer = tcp_tr->trace_buffer.buffer;
+	pc = preempt_count();
+	event = trace_buffer_lock_reserve(buffer, TRACE_TCP,
+					  sizeof(*trb), 0, pc);
+	if (!event)
+		return;
+	trb = ring_buffer_event_data(event);
+	tcp_trace_basic_init(trb, trev, sk);
+	trace_buffer_unlock_commit(buffer, event, 0, pc);
+}
+
+static void tcp_trace_stats_init(struct tcp_trace_stats *trs,
+				 enum tcp_trace_events trev,
+				 struct sock *sk)
+{
+	struct tcp_sk_trace *sktr = tcp_sk(sk)->trace;
+
+	tcp_trace_basic_init((struct tcp_trace_basic *)trs, trev, sk);
+	memcpy(&trs->stats, &sktr->stats, sizeof(sktr->stats));
+}
+
+static void tcp_trace_stats_add(enum tcp_trace_events trev, struct sock *sk)
+{
+	struct ring_buffer *buffer;
+	int pc;
+	struct ring_buffer_event *event;
+	struct tcp_trace_stats *trs;
+	struct tcp_sk_trace *sktr = tcp_sk(sk)->trace;
+
+	if (!sktr)
+		return;
+
+	tracing_record_cmdline(current);
+	buffer = tcp_tr->trace_buffer.buffer;
+	pc = preempt_count();
+	event = trace_buffer_lock_reserve(buffer, TRACE_TCP,
+					  sizeof(*trs), 0, pc);
+	if (!event)
+		return;
+	trs = ring_buffer_event_data(event);
+
+	tcp_trace_stats_init(trs, trev, sk);
+
+	trace_buffer_unlock_commit(buffer, event, 0, pc);
+}
+
+static void tcp_trace_established(void *ignore, struct sock *sk)
+{
+	tcp_trace_basic_add(TCP_TRACE_EVENT_ESTABLISHED, sk);
+}
+
+static void tcp_trace_transmit_skb(void *ignore, struct sock *sk,
+				   struct sk_buff *skb)
+{
+	int pcount;
+	struct tcp_sk_trace *sktr;
+	struct tcp_skb_cb *tcb;
+	unsigned int data_len;
+	bool retrans = false;
+
+	sktr = tcp_sk(sk)->trace;
+	if (!sktr)
+		return;
+
+	tcb = TCP_SKB_CB(skb);
+	pcount = tcp_skb_pcount(skb);
+	data_len = tcb->end_seq - tcb->seq;
+
+	sktr->stats.segs_out += pcount;
+
+	if (!data_len)
+		goto out;
+
+	sktr->stats.data_segs_out += pcount;
+	sktr->stats.data_octets_out += data_len;
+
+	if (before(tcb->seq, tcp_sk(sk)->snd_nxt)) {
+		enum tcp_trace_events trev;
+
+		retrans = true;
+		if (inet_csk(sk)->icsk_ca_state == TCP_CA_Loss) {
+			sktr->stats.loss_segs_retrans += pcount;
+			sktr->stats.loss_octets_retrans += data_len;
+			trev = TCP_TRACE_EVENT_RETRANS_LOSS;
+		} else {
+			sktr->stats.other_segs_retrans += pcount;
+			sktr->stats.other_octets_retrans += data_len;
+			trev = TCP_TRACE_EVENT_RETRANS;
+		}
+		tcp_trace_stats_add(trev, sk);
+		return;
+	}
+
+out:
+	if (jiffies_to_msecs(jiffies - sktr->last_ts) >=
+	    REPORT_INTERVAL_MS) {
+		sktr->last_ts = jiffies;
+		tcp_trace_stats_add(TCP_TRACE_EVENT_PERIODIC, sk);
+	}
+}
+
+static void tcp_trace_rcv_established(void *ignore, struct sock *sk,
+				      struct sk_buff *skb)
+{
+	struct tcp_sk_trace *sktr;
+	unsigned int data_len;
+	struct tcphdr *th;
+
+	sktr = tcp_sk(sk)->trace;
+	if (!sktr)
+		return;
+
+	th = tcp_hdr(skb);
+	WARN_ON_ONCE(skb->len < th->doff << 2);
+
+	sktr->stats.segs_in++;
+	data_len = skb->len - (th->doff << 2);
+	if (data_len) {
+		sktr->stats.data_segs_in++;
+		sktr->stats.data_octets_in += data_len;
+	} else {
+		if (TCP_SKB_CB(skb)->ack_seq == tcp_sk(sk)->snd_una)
+			sktr->stats.dup_acks_in++;
+	}
+
+	if (jiffies_to_msecs(jiffies - sktr->last_ts) >=
+	    REPORT_INTERVAL_MS) {
+		sktr->last_ts = jiffies;
+		tcp_trace_stats_add(TCP_TRACE_EVENT_PERIODIC, sk);
+	}
+}
+
+static void tcp_trace_close(void *ignore, struct sock *sk)
+{
+	struct tcp_sk_trace *sktr;
+
+	sktr = tcp_sk(sk)->trace;
+	if (!sktr)
+		return;
+
+	tcp_trace_stats_add(TCP_TRACE_EVENT_CLOSE, sk);
+}
+
+static void tcp_trace_ooo_rcv(void *ignore, struct sock *sk)
+{
+	struct tcp_sk_trace *sktr;
+
+	sktr = tcp_sk(sk)->trace;
+	if (!sktr)
+		return;
+
+	sktr->stats.ooo_in++;
+}
+
+static void tcp_trace_sacks_rcv(void *ignore, struct sock *sk, int num_sacks)
+{
+	struct tcp_sk_trace *sktr;
+
+	sktr = tcp_sk(sk)->trace;
+	if (!sktr)
+		return;
+
+	sktr->stats.sacks_in++;
+	sktr->stats.sack_blks_in += num_sacks;
+}
+
+void tcp_trace_rtt_sample(void *ignore, struct sock *sk,
+			  long rtt_sample_us)
+{
+	struct tcp_sk_trace *sktr;
+	u32 rto_ms;
+
+	sktr = tcp_sk(sk)->trace;
+	if (!sktr)
+		return;
+
+	rto_ms = jiffies_to_msecs(inet_csk(sk)->icsk_rto);
+
+	sktr->stats.rtt_sample_us = rtt_sample_us;
+	sktr->stats.max_rtt_us = max_t(u64, sktr->stats.max_rtt_us,
+				       rtt_sample_us);
+	sktr->stats.min_rtt_us = min_t(u64, sktr->stats.min_rtt_us,
+				       rtt_sample_us);
+
+	sktr->stats.count_rtt++;
+	sktr->stats.sum_rtt_us += rtt_sample_us;
+
+	sktr->stats.max_rto_ms = max_t(u32, sktr->stats.max_rto_ms, rto_ms);
+	sktr->stats.min_rto_ms = min_t(u32, sktr->stats.min_rto_ms, rto_ms);
+}
+
+static enum print_line_t
+tcp_trace_print(struct trace_iterator *iter)
+{
+	struct trace_seq *s = &iter->seq;
+	struct tcp_trace *tr = (struct tcp_trace *)iter->ent;
+	struct tcp_trace_basic *trb;
+	struct tcp_stats *stats;
+
+	union {
+		struct sockaddr_in v4;
+		struct sockaddr_in6 v6;
+	} local_sa, remote_sa;
+
+	local_sa.v4.sin_port = tr->local_port;
+	remote_sa.v4.sin_port = tr->remote_port;
+	if (tr->ipv6) {
+		local_sa.v6.sin6_family = AF_INET6;
+		remote_sa.v6.sin6_family = AF_INET6;
+		memcpy(local_sa.v6.sin6_addr.s6_addr, tr->local_addr, 4);
+		memcpy(remote_sa.v6.sin6_addr.s6_addr, tr->remote_addr, 4);
+	} else {
+		local_sa.v4.sin_family = AF_INET;
+		remote_sa.v4.sin_family = AF_INET;
+		local_sa.v4.sin_addr.s_addr =  tr->local_addr[0];
+		remote_sa.v4.sin_addr.s_addr = tr->remote_addr[0];
+	}
+
+	ftrace_print_symbols_seq(s, tr->event, tcp_trace_event_names);
+	if (trace_seq_has_overflowed(s))
+		goto out;
+
+	trb = (struct tcp_trace_basic *)tr;
+	trace_seq_printf(s,
+			 " %pISpc %pISpc snd_cwnd=%u mss=%u ssthresh=%u"
+			 " srtt_us=%llu rto_ms=%u",
+			 &local_sa, &remote_sa,
+			 trb->snd_cwnd, trb->mss, trb->ssthresh,
+			 trb->srtt_us, trb->rto_ms);
+
+	if (tr->event == TCP_TRACE_EVENT_ESTABLISHED ||
+	    trace_seq_has_overflowed(s))
+		goto out;
+
+	stats = &(((struct tcp_trace_stats *)tr)->stats);
+	trace_seq_printf(s,
+		" segs_out=%u data_segs_out=%u data_octets_out=%llu"
+		" other_segs_retrans=%u other_octets_retrans=%u"
+		" loss_segs_retrans=%u loss_octets_retrans=%u"
+		" segs_in=%u data_segs_in=%u data_octets_in=%llu"
+		" max_rtt_us=%llu min_rtt_us=%llu"
+		" count_rtt=%u sum_rtt_us=%llu"
+		" rtt_sample_us=%llu"
+		" max_rto_ms=%u min_rto_ms=%u"
+		" dup_acks_in=%u sacks_in=%u"
+		" sack_blks_in=%u ooo_in=%u",
+		stats->segs_out, stats->data_segs_out, stats->data_octets_out,
+		stats->other_segs_retrans, stats->other_octets_retrans,
+		stats->loss_segs_retrans, stats->loss_octets_retrans,
+		stats->segs_in, stats->data_segs_in, stats->data_octets_in,
+		stats->max_rtt_us, stats->min_rtt_us,
+		stats->count_rtt, stats->sum_rtt_us,
+		stats->rtt_sample_us,
+		stats->max_rto_ms, stats->min_rto_ms,
+		stats->dup_acks_in, stats->sacks_in,
+		stats->sack_blks_in, stats->ooo_in);
+	if (trace_seq_has_overflowed(s))
+		goto out;
+
+out:
+	trace_seq_putc(s, '\n');
+
+	return trace_seq_has_overflowed(s) ?
+		TRACE_TYPE_PARTIAL_LINE : TRACE_TYPE_HANDLED;
+}
+
+static enum print_line_t
+tcp_trace_print_binary(struct trace_iterator *iter)
+{
+	struct trace_seq *s = &iter->seq;
+	struct tcp_trace *tr = (struct tcp_trace *)iter->ent;
+
+	tr->magic = TCP_TRACE_MAGIC_VERSION;
+	if (tr->event == TCP_TRACE_EVENT_ESTABLISHED)
+		trace_seq_putmem(s, tr, sizeof(struct tcp_trace_basic));
+	else
+		trace_seq_putmem(s, tr, sizeof(struct tcp_trace_stats));
+
+	return trace_seq_has_overflowed(s) ?
+		TRACE_TYPE_PARTIAL_LINE : TRACE_TYPE_HANDLED;
+}
+
+static enum print_line_t
+tcp_tracer_print_line(struct trace_iterator *iter)
+{
+	return (trace_flags & TRACE_ITER_BIN) ?
+		tcp_trace_print_binary(iter) :
+		tcp_trace_print(iter);
+}
+
+static void tcp_unregister_tracepoints(void)
+{
+	unregister_trace_tcp_established(tcp_trace_established, NULL);
+	unregister_trace_tcp_close(tcp_trace_close, NULL);
+	unregister_trace_tcp_rcv_established(tcp_trace_rcv_established, NULL);
+	unregister_trace_tcp_transmit_skb(tcp_trace_transmit_skb, NULL);
+	unregister_trace_tcp_ooo_rcv(tcp_trace_ooo_rcv, NULL);
+	unregister_trace_tcp_sacks_rcv(tcp_trace_sacks_rcv, NULL);
+	unregister_trace_tcp_rtt_sample(tcp_trace_rtt_sample, NULL);
+
+	tracepoint_synchronize_unregister();
+}
+
+static int tcp_register_tracepoints(void)
+{
+	int ret;
+
+	ret = register_trace_tcp_established(tcp_trace_established, NULL);
+	if (ret)
+		return ret;
+
+	ret = register_trace_tcp_close(tcp_trace_close, NULL);
+	if (ret)
+		goto err1;
+
+	ret = register_trace_tcp_rcv_established(tcp_trace_rcv_established,
+						 NULL);
+	if (ret)
+		goto err2;
+
+	ret = register_trace_tcp_transmit_skb(tcp_trace_transmit_skb, NULL);
+	if (ret)
+		goto err3;
+
+	ret = register_trace_tcp_ooo_rcv(tcp_trace_ooo_rcv, NULL);
+	if (ret)
+		goto err4;
+
+	ret = register_trace_tcp_sacks_rcv(tcp_trace_sacks_rcv, NULL);
+	if (ret)
+		goto err5;
+
+	ret = register_trace_tcp_rtt_sample(tcp_trace_rtt_sample, NULL);
+	if (ret)
+		goto err6;
+
+	return ret;
+
+err6:
+	unregister_trace_tcp_sacks_rcv(tcp_trace_sacks_rcv, NULL);
+err5:
+	unregister_trace_tcp_ooo_rcv(tcp_trace_ooo_rcv, NULL);
+err4:
+	unregister_trace_tcp_transmit_skb(tcp_trace_transmit_skb, NULL);
+err3:
+	unregister_trace_tcp_rcv_established(tcp_trace_rcv_established, NULL);
+err2:
+	unregister_trace_tcp_close(tcp_trace_close, NULL);
+err1:
+	unregister_trace_tcp_established(tcp_trace_established, NULL);
+
+	return ret;
+}
+
+static void tcp_tracer_start(struct trace_array *tr)
+{
+	int ret;
+
+	if (tcp_trace_enabled)
+		return;
+
+	ret = tcp_register_tracepoints();
+	if (ret == 0) {
+		tcp_trace_enabled = true;
+		pr_info("tcp_trace: enabled\n");
+	}
+}
+
+static void tcp_tracer_stop(struct trace_array *tr)
+{
+	if (!tcp_trace_enabled)
+		return;
+
+	tcp_unregister_tracepoints();
+	tcp_trace_enabled = false;
+	pr_info("tcp_trace: disabled\n");
+}
+
+static void tcp_tracer_reset(struct trace_array *tr)
+{
+	tcp_tracer_stop(tr);
+}
+
+static int tcp_tracer_init(struct trace_array *tr)
+{
+	tcp_tr = tr;
+	tcp_tracer_start(tr);
+	return 0;
+}
+
+static struct tracer tcp_tracer __read_mostly = {
+	.name           = "tcp",
+	.init           = tcp_tracer_init,
+	.reset          = tcp_tracer_reset,
+	.start          = tcp_tracer_start,
+	.stop           = tcp_tracer_stop,
+	.print_line	= tcp_tracer_print_line,
+};
+
+static struct trace_event_functions tcp_trace_event_funcs;
+
+static struct trace_event tcp_trace_event = {
+	.type           = TRACE_TCP,
+	.funcs          = &tcp_trace_event_funcs,
+};
+
+static int __init init_tcp_tracer(void)
+{
+	if (!register_ftrace_event(&tcp_trace_event)) {
+		pr_warn("Cannot register TCP trace event\n");
+		return 1;
+	}
+
+	if (register_tracer(&tcp_tracer) != 0) {
+		pr_warn("Cannot register TCP tracer\n");
+		unregister_ftrace_event(&tcp_trace_event);
+		return 1;
+	}
+	return 0;
+}
+
+device_initcall(init_tcp_tracer);
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 3255dfb..d88bdd2 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -38,6 +38,7 @@ enum trace_type {
 	TRACE_USER_STACK,
 	TRACE_BLK,
 	TRACE_BPUTS,
+	TRACE_TCP,
 
 	__TRACE_LAST_TYPE,
 };
-- 
1.8.1

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH net-next 1/5] tcp: Add TCP TRACE_EVENTs
  2014-12-15  1:56 ` [RFC PATCH net-next 1/5] tcp: Add TCP TRACE_EVENTs Martin KaFai Lau
@ 2014-12-17 15:08   ` David Ahern
  0 siblings, 0 replies; 10+ messages in thread
From: David Ahern @ 2014-12-17 15:08 UTC (permalink / raw)
  To: Martin KaFai Lau, netdev
  Cc: David S. Miller, Hannes Frederic Sowa, Steven Rostedt,
	Lawrence Brakmo, Josef Bacik, Kernel Team

On 12/14/14 6:56 PM, Martin KaFai Lau wrote:
> +DECLARE_EVENT_CLASS(tcp,
> +	TP_PROTO(struct sock *sk),
> +	TP_ARGS(sk),
> +	TP_STRUCT__entry(
> +		__field(u8, ipv6)
> +		__array(u8, laddr, 16)
> +		__array(u8, raddr, 16)

You could store the addresses as
union {
     struct sockaddr_in      v4;
     struct sockaddr_in6     v6;
} sa;


and then use:

> +	TP_printk("local=%s:%d remote=%s:%d snd_cwnd=%u mss_cache=%u "
> +		  "ssthresh=%u srtt_us=%llu rto_ms=%u",

%pIS to print the addresses in a more readable format than what 
__print_hex will show.

I have a patch to perf (and by extension it applies to trace-cmd) to 
handle pI4, pI6 and pI6c. It readily extends to pIS.

David

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH net-next 3/5] tcp: Add a few more tracepoints for tcp tracer
  2014-12-15  1:56 ` [RFC PATCH net-next 3/5] tcp: Add a few more tracepoints for tcp tracer Martin KaFai Lau
@ 2014-12-17 15:33   ` Arnaldo Carvalho de Melo
  2014-12-17 15:59     ` David Ahern
  0 siblings, 1 reply; 10+ messages in thread
From: Arnaldo Carvalho de Melo @ 2014-12-17 15:33 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: netdev, David S. Miller, Hannes Frederic Sowa, Steven Rostedt,
	Lawrence Brakmo, Josef Bacik, Kernel Team

Em Sun, Dec 14, 2014 at 05:56:44PM -0800, Martin KaFai Lau escreveu:
> The tcp tracer, which will be added in the later patch, depends
> on them to collect statistics.

> --- a/include/trace/events/tcp.h
<SNIP>
> +DECLARE_TRACE(tcp_sacks_rcv,
> +	     TP_PROTO(struct sock *sk, int num_sacks),
> +	     TP_ARGS(sk, num_sacks)
> +);

<SNIP>

> +++ b/net/ipv4/tcp_input.c
> @@ -1650,6 +1650,8 @@ tcp_sacktag_write_queue(struct sock *sk, const struct sk_buff *ack_skb,
>  	int i, j;
>  	int first_sack_index;
>  
> +	trace_tcp_sacks_rcv(sk, num_sacks);
> +

In another message someone pointed out that we want some tracepoints, but
others would imply ABI, a drag on upstream to keep tons of set in stone
tracepoints, so what I was saying was like below, where one of the above
proposed tracepoints is implemented as a "wannabe tracepoint", i.e. a
dynamic probe, that will be as optimized as the kprobes_tracer can make it,
sometimes even using, IIRC, the ftrace mechanizms, if put on some suitable
place (function entry, etc, IIRC, Steven?).

On a random RHEL7 kernel I had laying around on a test machine:

[root@ssdandy ~]# perf probe -L tcp_sacktag_write_queue | head -20
<tcp_sacktag_write_queue@/usr/src/debug/kernel-3.10.0-123.el7/linux-3.10.0-123.el7.x86_64/net/ipv4/tcp_input.c:0>
      0  tcp_sacktag_write_queue(struct sock *sk, const struct sk_buff *ack_skb,
         			u32 prior_snd_una)
      2  {
         	struct tcp_sock *tp = tcp_sk(sk);
      4  	const unsigned char *ptr = (skb_transport_header(ack_skb) +
         				    TCP_SKB_CB(ack_skb)->sacked);
         	struct tcp_sack_block_wire *sp_wire = (struct tcp_sack_block_wire *)(ptr+2);
         	struct tcp_sack_block sp[TCP_NUM_SACKS];
         	struct tcp_sack_block *cache;
         	struct tcp_sacktag_state state;
         	struct sk_buff *skb;
     11  	int num_sacks = min(TCP_NUM_SACKS, (ptr[1] - TCPOLEN_SACK_BASE) >> 3);
         	int used_sacks;
         	bool found_dup_sack = false;
         	int i, j;
         	int first_sack_index;
         
     17  	state.flag = 0;
     18  	state.reord = tp->packets_out;
[root@ssdandy ~]#

Available variables at tcp_sacktag_write_queue:17
        @<tcp_sacktag_write_queue+77>
                int     num_sacks
                struct sk_buff* ack_skb
                struct sock*    sk
                struct tcp_sack_block_wire*     sp_wire
                u32     prior_snd_una
                unsigned char*  ptr
[root@ssdandy ~]#

Ok, so we can insert a probe at that point and also we can collect the values of the
sk and num_sacks variables, so:

[root@ssdandy ~]# perf probe 'tcp_sacks_rcv=tcp_sacktag_write_queue:17 sk num_sacks'
Added new event:
  probe:tcp_sacks_rcv  (on tcp_sacktag_write_queue:17 with sk num_sacks)

You can now use it in all perf tools, such as:

	perf record -e probe:tcp_sacks_rcv -aR sleep 1

[root@ssdandy ~]

There you go, you have your wannabe tracepoint, dynamic:

[root@ssdandy ~]# perf record -a -g -e probe:tcp_sacks_rcv
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.362 MB perf.data (~15799 samples) ]

[root@ssdandy ~]# perf script
swapper 0 [000] 184175.932790: probe:tcp_sacks_rcv: (ffffffff8151e59d) sk=ffff880425761e00 num_sacks=0
              71e59e tcp_sacktag_write_queue (/usr/lib/debug/lib/modules/3.10.0-123.el7.x86_64/vmlinux)
              72216e tcp_ack (/usr/lib/debug/lib/modules/3.10.0-123.el7.x86_64/vmlinux)
              723cf8 tcp_rcv_state_process (/usr/lib/debug/lib/modules/3.10.0-123.el7.x86_64/vmlinux)
              72d158 tcp_v4_do_rcv (/usr/lib/debug/lib/modules/3.10.0-123.el7.x86_64/vmlinux)
              72f5c7 tcp_v4_rcv (/usr/lib/debug/lib/modules/3.10.0-123.el7.x86_64/vmlinux)
              709584 ip_local_deliver_finish (/usr/lib/debug/lib/modules/3.10.0-123.el7.x86_64/vmlinux)
              709858 ip_local_deliver (/usr/lib/debug/lib/modules/3.10.0-123.el7.x86_64/vmlinux)
              7091fd ip_rcv_finish (/usr/lib/debug/lib/modules/3.10.0-123.el7.x86_64/vmlinux)
              709ac4 ip_rcv (/usr/lib/debug/lib/modules/3.10.0-123.el7.x86_64/vmlinux)
              6cfdb6 __netif_receive_skb_core (/usr/lib/debug/lib/modules/3.10.0-123.el7.x86_64/vmlinux)
              6cffc8 __netif_receive_skb (/usr/lib/debug/lib/modules/3.10.0-123.el7.x86_64/vmlinux)
              6d0050 netif_receive_skb (/usr/lib/debug/lib/modules/3.10.0-123.el7.x86_64/vmlinux)
              6d0aa8 napi_gro_receive (/usr/lib/debug/lib/modules/3.10.0-123.el7.x86_64/vmlinux)
               1b5bf e1000_receive_skb (/lib/modules/3.10.0-123.el7.x86_64/kernel/drivers/net/ethernet/intel/e1000e/e1000e.ko)
               1cbda e1000_clean_rx_irq (/lib/modules/3.10.0-123.el7.x86_64/kernel/drivers/net/ethernet/intel/e1000e/e1000e.ko)
               247dc e1000e_poll (/lib/modules/3.10.0-123.el7.x86_64/kernel/drivers/net/ethernet/intel/e1000e/e1000e.ko)
              6d041a net_rx_action (/usr/lib/debug/lib/modules/3.10.0-123.el7.x86_64/vmlinux)
              267047 __do_softirq (/usr/lib/debug/lib/modules/3.10.0-123.el7.x86_64/vmlinux)
              7f3a5c call_softirq (/usr/lib/debug/lib/modules/3.10.0-123.el7.x86_64/vmlinux)
              214d25 do_softirq (/usr/lib/debug/lib/modules/3.10.0-123.el7.x86_64/vmlinux)
              2673e5 irq_exit (/usr/lib/debug/lib/modules/3.10.0-123.el7.x86_64/vmlinux)
              7f4358 do_IRQ (/usr/lib/debug/lib/modules/3.10.0-123.el7.x86_64/vmlinux)
              7e94ad ret_from_intr (/usr/lib/debug/lib/modules/3.10.0-123.el7.x86_64/vmlinux)
              7c3927 rest_init (/usr/lib/debug/lib/modules/3.10.0-123.el7.x86_64/vmlinux)
              e06fa7 start_kernel ([kernel.vmlinux].init.text)
              e065ee x86_64_start_reservations ([kernel.vmlinux].init.text)
              e06742 x86_64_start_kernel ([kernel.vmlinux].init.text)
<SNIP>

[root@ssdandy ~]# perf script -g python
generated Python script: perf-script.py
[root@ssdandy ~]# vim perf-script.py # Edit it to remove callchain printing, simplify some stuff
[root@ssdandy ~]# mv perf-script.py tcp_sack_rcv.py
[root@ssdandy ~]# cat tcp_sack_rcv.py 
import os, sys

sys.path.append(os.environ['PERF_EXEC_PATH'] + \
	'/scripts/python/Perf-Trace-Util/lib/Perf/Trace')

from perf_trace_context import *
from Core import *

def probe__tcp_sacks_rcv(event_name, context, common_cpu,
	common_secs, common_nsecs, common_pid, common_comm,
	common_callchain, __probe_ip, sk, num_sacks):
		print_header(event_name, common_cpu, common_secs, common_nsecs,
			common_pid, common_comm)

		print "__probe_ip=%#x, sk=%#x, num_sacks=%d" % \
		(__probe_ip, sk, num_sacks)

def print_header(event_name, cpu, secs, nsecs, pid, comm):
	print "%-18s %3u %05u.%09u %1u %-8s " % \
	(event_name, cpu, secs, nsecs, pid, comm),
[root@ssdandy ~]#

[root@ssdandy ~]# perf script -s tcp_sack_rcv.py  | head -10
Failed to open 64/libfreebl3.so, continuing without symbols
probe__tcp_sacks_rcv   0 184175.932790461 0 swapper   __probe_ip=0xffffffff8151e59d, sk=0xffff880425761e00, num_sacks=0
probe__tcp_sacks_rcv   0 184177.487455369 0 swapper   __probe_ip=0xffffffff8151e59d, sk=0xffff8804047e0780, num_sacks=0
probe__tcp_sacks_rcv   0 184177.588593040 0 swapper   __probe_ip=0xffffffff8151e59d, sk=0xffff8804256af800, num_sacks=0
probe__tcp_sacks_rcv   0 184178.741298627 0 swapper   __probe_ip=0xffffffff8151e59d, sk=0xffff8804256acb00, num_sacks=0
probe__tcp_sacks_rcv   0 184179.902089365 0 swapper   __probe_ip=0xffffffff8151e59d, sk=0xffff8804256ad280, num_sacks=0
probe__tcp_sacks_rcv   0 184180.802761942 0 swapper   __probe_ip=0xffffffff8151e59d, sk=0xffff8804256acb00, num_sacks=0
probe__tcp_sacks_rcv   0 184180.961373503 0 swapper   __probe_ip=0xffffffff8151e59d, sk=0xffff8804256af800, num_sacks=0
probe__tcp_sacks_rcv   0 184182.123660739 0 swapper   __probe_ip=0xffffffff8151e59d, sk=0xffff8804256ad280, num_sacks=0
probe__tcp_sacks_rcv   0 184182.387640636 0 swapper   __probe_ip=0xffffffff8151e59d, sk=0xffff8804256acb00, num_sacks=0
probe__tcp_sacks_rcv   0 184182.859420892 0 swapper   __probe_ip=0xffffffff8151e59d, sk=0xffff8804256af800, num_sacks=0
[root@ssdandy ~]#


>  	state.flag = 0;
>  	state.reord = tp->packets_out;
>  	state.rtt_us = -1L;
> @@ -2932,6 +2934,9 @@ static inline bool tcp_ack_update_rtt(struct sock *sk, const int flag,
>  
>  	/* RFC6298: only reset backoff on valid RTT measurement. */
>  	inet_csk(sk)->icsk_backoff = 0;
> +
> +	trace_tcp_rtt_sample(sk, seq_rtt_us);
> +
>  	return true;
>  }
>  
> @@ -4232,6 +4237,7 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
>  	NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPOFOQUEUE);
>  	SOCK_DEBUG(sk, "out of order segment: rcv_next %X seq %X - %X\n",
>  		   tp->rcv_nxt, TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq);
> +	trace_tcp_ooo_rcv(sk);
>  
>  	skb1 = skb_peek_tail(&tp->out_of_order_queue);
>  	if (!skb1) {
> -- 
> 1.8.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH net-next 3/5] tcp: Add a few more tracepoints for tcp tracer
  2014-12-17 15:33   ` Arnaldo Carvalho de Melo
@ 2014-12-17 15:59     ` David Ahern
  2014-12-17 17:02       ` Arnaldo Carvalho de Melo
  0 siblings, 1 reply; 10+ messages in thread
From: David Ahern @ 2014-12-17 15:59 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Martin KaFai Lau
  Cc: netdev, David S. Miller, Hannes Frederic Sowa, Steven Rostedt,
	Lawrence Brakmo, Josef Bacik, Kernel Team

On 12/17/14 8:33 AM, Arnaldo Carvalho de Melo wrote:
>
> On a random RHEL7 kernel I had laying around on a test machine:
>
> [root@ssdandy ~]# perf probe -L tcp_sacktag_write_queue | head -20
> <tcp_sacktag_write_queue@/usr/src/debug/kernel-3.10.0-123.el7/linux-3.10.0-123.el7.x86_64/net/ipv4/tcp_input.c:0>
>        0  tcp_sacktag_write_queue(struct sock *sk, const struct sk_buff *ack_skb,
>           			u32 prior_snd_una)
>        2  {
>           	struct tcp_sock *tp = tcp_sk(sk);
>        4  	const unsigned char *ptr = (skb_transport_header(ack_skb) +
>           				    TCP_SKB_CB(ack_skb)->sacked);
>           	struct tcp_sack_block_wire *sp_wire = (struct tcp_sack_block_wire *)(ptr+2);
>           	struct tcp_sack_block sp[TCP_NUM_SACKS];
>           	struct tcp_sack_block *cache;
>           	struct tcp_sacktag_state state;
>           	struct sk_buff *skb;
>       11  	int num_sacks = min(TCP_NUM_SACKS, (ptr[1] - TCPOLEN_SACK_BASE) >> 3);
>           	int used_sacks;
>           	bool found_dup_sack = false;
>           	int i, j;
>           	int first_sack_index;
>
>       17  	state.flag = 0;
>       18  	state.reord = tp->packets_out;

But there are limitations/hassles with this approach. For starters I 
believe it requires vmlinux on box. The products I work on do not have 
vmlinux available in the runtime environment. I recall someone (Masami?) 
suggesting the ability to write the probe data to a file (ie., create 
the probe definition off box) and load the file to create the probe, so 
yes a solvable problem.

But with this approach it could very be that the function name and 
variable names differ with kernel version and that makes it hard to 
impossible to create a set of analysis commands.

David

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH net-next 3/5] tcp: Add a few more tracepoints for tcp tracer
  2014-12-17 15:59     ` David Ahern
@ 2014-12-17 17:02       ` Arnaldo Carvalho de Melo
  0 siblings, 0 replies; 10+ messages in thread
From: Arnaldo Carvalho de Melo @ 2014-12-17 17:02 UTC (permalink / raw)
  To: David Ahern
  Cc: Martin KaFai Lau, netdev, David S. Miller, Hannes Frederic Sowa,
	Steven Rostedt, Lawrence Brakmo, Josef Bacik, Kernel Team

Em Wed, Dec 17, 2014 at 08:59:40AM -0700, David Ahern escreveu:
> On 12/17/14 8:33 AM, Arnaldo Carvalho de Melo wrote:

> >On a random RHEL7 kernel I had laying around on a test machine:

> >[root@ssdandy ~]# perf probe -L tcp_sacktag_write_queue | head -20
> ><tcp_sacktag_write_queue@/usr/src/debug/kernel-3.10.0-123.el7/linux-3.10.0-123.el7.x86_64/net/ipv4/tcp_input.c:0>
> >       0  tcp_sacktag_write_queue(struct sock *sk, const struct sk_buff *ack_skb,
> >          			u32 prior_snd_una)
> >       2  {
> >          	struct tcp_sock *tp = tcp_sk(sk);
> >       4  	const unsigned char *ptr = (skb_transport_header(ack_skb) +
> >          				    TCP_SKB_CB(ack_skb)->sacked);
> >          	struct tcp_sack_block_wire *sp_wire = (struct tcp_sack_block_wire *)(ptr+2);
> >          	struct tcp_sack_block sp[TCP_NUM_SACKS];
> >          	struct tcp_sack_block *cache;
> >          	struct tcp_sacktag_state state;
> >          	struct sk_buff *skb;
> >      11  	int num_sacks = min(TCP_NUM_SACKS, (ptr[1] - TCPOLEN_SACK_BASE) >> 3);
> >          	int used_sacks;
> >          	bool found_dup_sack = false;
> >          	int i, j;
> >          	int first_sack_index;
> >
> >      17  	state.flag = 0;
> >      18  	state.reord = tp->packets_out;
 
> But there are limitations/hassles with this approach. For starters I believe
> it requires vmlinux on box. The products I work on do not have vmlinux
> available in the runtime environment. I recall someone (Masami?) suggesting

Not necessarily, one can do all this on a development machine that has
that info and then end up with kprobe_trace expressions as described on:

Documentation/trace/kprobetrace.txt

> the ability to write the probe data to a file (ie., create the probe
> definition off box) and load the file to create the probe, so yes a solvable
> problem.

Exactly.
 
> But with this approach it could very be that the function name and variable
> names differ with kernel version and that makes it hard to impossible to
> create a set of analysis commands.

Well, if this is in a very much in flux code, then, there is no place
for a tracepoint there :-)

For instance: I've been using this definition:

commit c522739d72a341a3e74a369ce6298b9412813d3f
Author: Arnaldo Carvalho de Melo <acme@redhat.com>
Date:   Fri Sep 27 18:06:19 2013 -0300

    perf trace: Use vfs_getname hook if available
    
    Initially it tries to find a probe:vfs_getname that should be setup
    with:
    
     perf probe 'vfs_getname=getname_flags:65 pathname=result->name:string'

-----------------

For quite a while in tools/perf/builtin-trace.c

I.e. if this thing is in place, then I get mappings from a pointer to a
pathname that I use in a tool, 'perf trace', if not, it tries reading
/proc/, etc, which is suboptimal but then the only way to map a fd to a
pathname.

On RHEL7:

[root@ssdandy ~]# perf probe 'vfs_getname=getname_flags:65 pathname=result->name:string'
Added new event:
  probe:vfs_getname    (on getname_flags:65 with pathname=result->name:string)

You can now use it in all perf tools, such as:

	perf record -e probe:vfs_getname -aR sleep 1

[root@ssdandy ~]#
[root@ssdandy ~]# perf record -e probe:vfs_getname cat /etc/passwd > /dev/null
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.019 MB perf.data (~812 samples) ]
[root@ssdandy ~]# perf script
 cat 24805 [005] 188716.046396: probe:vfs_getname: (ffffffff811bb7e3) pathname="/etc/ld.so.preload"
 cat 24805 [005] 188716.046408: probe:vfs_getname: (ffffffff811bb7e3) pathname="/etc/ld.so.cache"
 cat 24805 [005] 188716.046428: probe:vfs_getname: (ffffffff811bb7e3) pathname="/lib64/libc.so.6"
 cat 24805 [005] 188716.046675: probe:vfs_getname: (ffffffff811bb7e3) pathname="/usr/lib/locale/locale-archive"
 cat 24805 [005] 188716.046718: probe:vfs_getname: (ffffffff811bb7e3) pathname="/etc/passwd"
[root@ssdandy ~]# uname -r
3.10.0-123.el7.x86_64

And in fedora there was a change in how we must set up the probe:

[root@zoo ~]# perf probe 'vfs_getname=getname_flags:65 pathname=filename:string'
Added new event:
  probe:vfs_getname    (on getname_flags:65 with pathname=filename:string)

You can now use it in all perf tools, such as:

	perf record -e probe:vfs_getname -aR sleep 1

[root@zoo ~]# perf record -e probe:vfs_getname cat /etc/passwd > /dev/null
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.016 MB perf.data (~687 samples) ]
[root@zoo ~]# perf script
 cat 27485 [000] 931.973139: probe:vfs_getname: (ffffffff8120ad13) pathname="/etc/ld.so.preload"
 cat 27485 [000] 931.973148: probe:vfs_getname: (ffffffff8120ad13) pathname="/etc/ld.so.cache"
 cat 27485 [000] 931.973169: probe:vfs_getname: (ffffffff8120ad13) pathname="/lib64/libc.so.6"
 cat 27485 [000] 931.973417: probe:vfs_getname: (ffffffff8120ad13) pathname="/usr/lib/locale/locale-archive"
 cat 27485 [000] 931.973488: probe:vfs_getname: (ffffffff8120ad13) pathname="/etc/passwd"
[root@zoo ~]# uname -r
3.17.4-200.fc20.x86_64
[root@zoo ~]#

What I could find in 'result->name'  changed to 'filename', but since I gave it
a standard name ("pathname"), its just when setting up the probes, something we
do at system start, i.e. just once, before the first 'perf record', we can deal
with that with some scripting that will do probe setup fallbacks.

All this while we're prototyping a tool, when we're 100% sure that a tracepoint
would provide any performance gains and that that area is so set in stone that
we can guarantee an ABI, go ahead and add the tracepoint.

- Arnaldo

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2014-12-17 17:02 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-12-15  1:56 [RFC PATCH net-next 0/5] tcp: TCP tracer Martin KaFai Lau
2014-12-15  1:56 ` [RFC PATCH net-next 1/5] tcp: Add TCP TRACE_EVENTs Martin KaFai Lau
2014-12-17 15:08   ` David Ahern
2014-12-15  1:56 ` [RFC PATCH net-next 2/5] tcp: A perf script for TCP tracepoints Martin KaFai Lau
2014-12-15  1:56 ` [RFC PATCH net-next 3/5] tcp: Add a few more tracepoints for tcp tracer Martin KaFai Lau
2014-12-17 15:33   ` Arnaldo Carvalho de Melo
2014-12-17 15:59     ` David Ahern
2014-12-17 17:02       ` Arnaldo Carvalho de Melo
2014-12-15  1:56 ` [RFC PATCH net-next 4/5] tcp: Introduce tcp_sk_trace and related structs Martin KaFai Lau
2014-12-15  1:56 ` [RFC PATCH net-next 5/5] tcp: Add TCP tracer Martin KaFai Lau

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).