* TCP event tracking via netlink...
@ 2007-12-05 13:30 David Miller
2007-12-05 14:11 ` John Heffner
` (2 more replies)
0 siblings, 3 replies; 21+ messages in thread
From: David Miller @ 2007-12-05 13:30 UTC (permalink / raw)
To: ilpo.jarvinen; +Cc: netdev
Ilpo, I was pondering the kind of debugging one does to find
congestion control issues and even SACK bugs and it's currently too
painful because there is no standard way to track state changes.
I assume you're using something like carefully crafted printk's,
kprobes, or even ad-hoc statistic counters. That's what I used to do
:-)
With that in mind it occurred to me that we might want to do something
like a state change event generator.
Basically some application or even a daemon listens on this generic
netlink socket family we create. The header of each event packet
indicates what socket the event is for and then there is some state
information.
Then you can look at a tcpdump and this state dump side by side and
see what the kernel decided to do.
Now there is the question of granularity.
A very important consideration in this is that we want this thing to
be enabled in the distributions, therefore it must be cheap. Perhaps
one test at the end of the packet input processing.
So I say we pick some state to track (perhaps start with tcp_info)
and just push that at the end of every packet input run. Also,
we add some minimal filtering capability (match on specific IP
address and/or port, for example).
Maybe if we want to get really fancy we can have some more-expensive
debug mode where detailed specific events get generated via some
macros we can scatter all over the place. This won't be useful
for general user problem analysis, but it will be excellent for
developers.
Let me know if you think this is useful enough and I'll work on
an implementation we can start playing with.
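To make the "cheap when disabled" requirement above concrete, the hook
could be one predictable test in the input path plus the minimal filter
just described.  A sketch follows; everything in it is illustrative
rather than an existing kernel API (the inet field names follow the
2.6.24-era struct inet_sock, and tcp_log_push_info() is a hypothetical
helper that would push a tcp_info snapshot over netlink):

struct tcp_log_filter {
	__be32	addr;	/* zero matches any address */
	__be16	port;	/* zero matches any port */
};

static struct tcp_log_filter tcp_log_filter;
static int tcp_log_enabled __read_mostly;

static inline void tcp_log_event(struct sock *sk)
{
	const struct inet_sock *inet = inet_sk(sk);

	/* One predictable branch in the fast path when logging is off. */
	if (likely(!tcp_log_enabled))
		return;

	/* Minimal filtering: match on remote address and/or port. */
	if (tcp_log_filter.addr && tcp_log_filter.addr != inet->daddr)
		return;
	if (tcp_log_filter.port && tcp_log_filter.port != inet->dport)
		return;

	tcp_log_push_info(sk);	/* hypothetical: push a tcp_info snapshot */
}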
* Re: TCP event tracking via netlink...
  2007-12-05 13:30 TCP event tracking via netlink David Miller
@ 2007-12-05 14:11 ` John Heffner
  2007-12-05 14:48   ` Evgeniy Polyakov
  2007-12-06  5:00   ` David Miller
  2007-12-05 16:53 ` Joe Perches
  2007-12-05 23:18 ` Ilpo Järvinen
  2 siblings, 2 replies; 21+ messages in thread
From: John Heffner @ 2007-12-05 14:11 UTC (permalink / raw)
To: David Miller; +Cc: ilpo.jarvinen, netdev

David Miller wrote:
> Ilpo, I was pondering the kind of debugging one does to find
> congestion control issues and even SACK bugs and it's currently too
> painful because there is no standard way to track state changes.
>
> I assume you're using something like carefully crafted printk's,
> kprobes, or even ad-hoc statistic counters.  That's what I used to do
> :-)
>
> With that in mind it occurred to me that we might want to do something
> like a state change event generator.
>
> Basically some application or even a daemon listens on this generic
> netlink socket family we create.  The header of each event packet
> indicates what socket the event is for and then there is some state
> information.
>
> Then you can look at a tcpdump and this state dump side by side and
> see what the kernel decided to do.
>
> Now there is the question of granularity.
>
> A very important consideration in this is that we want this thing to
> be enabled in the distributions, therefore it must be cheap.  Perhaps
> one test at the end of the packet input processing.
>
> So I say we pick some state to track (perhaps start with tcp_info)
> and just push that at the end of every packet input run.  Also,
> we add some minimal filtering capability (match on specific IP
> address and/or port, for example).
>
> Maybe if we want to get really fancy we can have some more-expensive
> debug mode where detailed specific events get generated via some
> macros we can scatter all over the place.  This won't be useful
> for general user problem analysis, but it will be excellent for
> developers.
>
> Let me know if you think this is useful enough and I'll work on
> an implementation we can start playing with.

FWIW, sounds similar to what these guys are doing with SIFTR for FreeBSD:
http://caia.swin.edu.au/urp/newtcp/tools.html
http://caia.swin.edu.au/reports/070824A/CAIA-TR-070824A.pdf

  -John
* Re: TCP event tracking via netlink...
  2007-12-05 14:11 ` John Heffner
@ 2007-12-05 14:48 ` Evgeniy Polyakov
  2007-12-05 15:12   ` Samir Bellabes
  2007-12-06  5:03   ` David Miller
  2007-12-06  5:00 ` David Miller
  1 sibling, 2 replies; 21+ messages in thread
From: Evgeniy Polyakov @ 2007-12-05 14:48 UTC (permalink / raw)
To: John Heffner; +Cc: David Miller, ilpo.jarvinen, netdev

Hi.

On Wed, Dec 05, 2007 at 09:11:01AM -0500, John Heffner (jheffner@psc.edu) wrote:
> >Maybe if we want to get really fancy we can have some more-expensive
> >debug mode where detailed specific events get generated via some
> >macros we can scatter all over the place.  This won't be useful
> >for general user problem analysis, but it will be excellent for
> >developers.
> >
> >Let me know if you think this is useful enough and I'll work on
> >an implementation we can start playing with.
>
> FWIW, sounds similar to what these guys are doing with SIFTR for FreeBSD:
> http://caia.swin.edu.au/urp/newtcp/tools.html
> http://caia.swin.edu.au/reports/070824A/CAIA-TR-070824A.pdf

And even more similar to this patch from Samir Bellabes of Mandriva:
http://lwn.net/Articles/202255/

--
	Evgeniy Polyakov
* Re: TCP event tracking via netlink...
  2007-12-05 14:48 ` Evgeniy Polyakov
@ 2007-12-05 15:12 ` Samir Bellabes
  2007-12-06  5:03 ` David Miller
  1 sibling, 0 replies; 21+ messages in thread
From: Samir Bellabes @ 2007-12-05 15:12 UTC (permalink / raw)
To: Evgeniy Polyakov; +Cc: John Heffner, David Miller, ilpo.jarvinen, netdev

Evgeniy Polyakov <johnpol@2ka.mipt.ru> writes:

> Hi.
>
> On Wed, Dec 05, 2007 at 09:11:01AM -0500, John Heffner (jheffner@psc.edu) wrote:
>> >Maybe if we want to get really fancy we can have some more-expensive
>> >debug mode where detailed specific events get generated via some
>> >macros we can scatter all over the place.  This won't be useful
>> >for general user problem analysis, but it will be excellent for
>> >developers.
>> >
>> >Let me know if you think this is useful enough and I'll work on
>> >an implementation we can start playing with.
>>
>> FWIW, sounds similar to what these guys are doing with SIFTR for FreeBSD:
>> http://caia.swin.edu.au/urp/newtcp/tools.html
>> http://caia.swin.edu.au/reports/070824A/CAIA-TR-070824A.pdf
>
> And even more similar to this patch from Samir Bellabes of Mandriva:
> http://lwn.net/Articles/202255/

Indeed, I was thinking about this idea.  But my goal is not to deal
with specific protocols like TCP, just with the LSM hooks.  Anyway, the
idea is the same: having a daemon in userspace to catch the
information.  So why not an extension?

Lately, I've been moving the code from connector to generic netlink.

regards,
sam
* Re: TCP event tracking via netlink...
  2007-12-05 14:48 ` Evgeniy Polyakov
  2007-12-05 15:12   ` Samir Bellabes
@ 2007-12-06  5:03 ` David Miller
  2007-12-06 10:58   ` Evgeniy Polyakov
  1 sibling, 1 reply; 21+ messages in thread
From: David Miller @ 2007-12-06 5:03 UTC (permalink / raw)
To: johnpol; +Cc: jheffner, ilpo.jarvinen, netdev

From: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
Date: Wed, 5 Dec 2007 17:48:43 +0300

> On Wed, Dec 05, 2007 at 09:11:01AM -0500, John Heffner (jheffner@psc.edu) wrote:
> > >Maybe if we want to get really fancy we can have some more-expensive
> > >debug mode where detailed specific events get generated via some
> > >macros we can scatter all over the place.  This won't be useful
> > >for general user problem analysis, but it will be excellent for
> > >developers.
> > >
> > >Let me know if you think this is useful enough and I'll work on
> > >an implementation we can start playing with.
> >
> > FWIW, sounds similar to what these guys are doing with SIFTR for FreeBSD:
> > http://caia.swin.edu.au/urp/newtcp/tools.html
> > http://caia.swin.edu.au/reports/070824A/CAIA-TR-070824A.pdf
>
> And even more similar to this patch from Samir Bellabes of Mandriva:
> http://lwn.net/Articles/202255/

I think this work is very different.

When I say "state" I mean something more significant than
CLOSE, ESTABLISHED, etc., which is what Samir's patches are
tracking.

I'm talking about all of the sequence numbers, SACK information,
congestion control knobs, etc., whose values are nearly impossible to
track on a packet-to-packet basis in order to diagnose problems.

Web100 provided facilities along these lines as well.
* Re: TCP event tracking via netlink...
  2007-12-06  5:03 ` David Miller
@ 2007-12-06 10:58 ` Evgeniy Polyakov
  0 siblings, 0 replies; 21+ messages in thread
From: Evgeniy Polyakov @ 2007-12-06 10:58 UTC (permalink / raw)
To: David Miller; +Cc: jheffner, ilpo.jarvinen, netdev

On Wed, Dec 05, 2007 at 09:03:43PM -0800, David Miller (davem@davemloft.net) wrote:
> I think this work is very different.
>
> When I say "state" I mean something more significant than
> CLOSE, ESTABLISHED, etc., which is what Samir's patches are
> tracking.
>
> I'm talking about all of the sequence numbers, SACK information,
> congestion control knobs, etc., whose values are nearly impossible to
> track on a packet-to-packet basis in order to diagnose problems.

I pointed to that work as a possible basis for collecting more info if
you need it, including sequence numbers, window sizes and so on.  It
just requires a useful structure layout in place, so that one would not
have to recreate the same bits again and it could be called from any
place inside the stack.

--
	Evgeniy Polyakov
* Re: TCP event tracking via netlink...
  2007-12-05 14:11 ` John Heffner
  2007-12-05 14:48   ` Evgeniy Polyakov
@ 2007-12-06  5:00 ` David Miller
  1 sibling, 0 replies; 21+ messages in thread
From: David Miller @ 2007-12-06 5:00 UTC (permalink / raw)
To: jheffner; +Cc: ilpo.jarvinen, netdev

From: John Heffner <jheffner@psc.edu>
Date: Wed, 05 Dec 2007 09:11:01 -0500

> FWIW, sounds similar to what these guys are doing with SIFTR for FreeBSD:
> http://caia.swin.edu.au/urp/newtcp/tools.html
> http://caia.swin.edu.au/reports/070824A/CAIA-TR-070824A.pdf

Yes, my proposal is very similar to this SIFTR work.

In their work they tap into the stack using the packet filtering
hooks.  In this way they avoid having to make TCP stack modifications:
they just look up the PCB and dump state, whereas we have more liberty
to do more serious surgery :-)
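Translated to Linux, the SIFTR approach would look roughly like the
sketch below: a netfilter hook that looks up the PCB for every TCP
segment and can then snapshot its state, with no TCP stack
modifications.  The hook and lookup signatures here are the 2.6.24-era
ones, and error handling is omitted; this is an illustration, not a
working SIFTR port.

#include <linux/netfilter.h>
#include <linux/netfilter_ipv4.h>
#include <net/inet_hashtables.h>
#include <net/inet_timewait_sock.h>
#include <net/tcp.h>

static unsigned int siftr_hook(unsigned int hooknum,
			       struct sk_buff *skb,
			       const struct net_device *in,
			       const struct net_device *out,
			       int (*okfn)(struct sk_buff *))
{
	const struct iphdr *iph = ip_hdr(skb);
	const struct tcphdr *th;
	struct sock *sk;

	if (iph->protocol != IPPROTO_TCP)
		return NF_ACCEPT;
	th = (const struct tcphdr *)((const u8 *)iph + iph->ihl * 4);

	/* Look up the connection, as SIFTR looks up the PCB. */
	sk = inet_lookup(&tcp_hashinfo, iph->saddr, th->source,
			 iph->daddr, th->dest, inet_iif(skb));
	if (sk) {
		if (sk->sk_state != TCP_TIME_WAIT) {
			/* ... snapshot tcp_sk(sk) state here ... */
			sock_put(sk);
		} else
			inet_twsk_put(inet_twsk(sk));
	}
	return NF_ACCEPT;
}

static struct nf_hook_ops siftr_ops = {
	.hook		= siftr_hook,
	.pf		= PF_INET,
	.hooknum	= NF_IP_LOCAL_IN,
	.priority	= NF_IP_PRI_FIRST,
};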
* Re: TCP event tracking via netlink...
  2007-12-05 13:30 TCP event tracking via netlink David Miller
  2007-12-05 14:11 ` John Heffner
@ 2007-12-05 16:53 ` Joe Perches
  2007-12-05 21:33   ` Stephen Hemminger
  2007-12-05 23:18 ` Ilpo Järvinen
  2 siblings, 1 reply; 21+ messages in thread
From: Joe Perches @ 2007-12-05 16:53 UTC (permalink / raw)
To: David Miller; +Cc: ilpo.jarvinen, netdev

> it occurred to me that we might want to do something
> like a state change event generator.

This could be a basis for an interesting TCP
performance tester.
* Re: TCP event tracking via netlink...
  2007-12-05 16:53 ` Joe Perches
@ 2007-12-05 21:33 ` Stephen Hemminger
  2007-12-05 22:15   ` Ilpo Järvinen
  2007-12-06 10:20   ` David Miller
  0 siblings, 2 replies; 21+ messages in thread
From: Stephen Hemminger @ 2007-12-05 21:33 UTC (permalink / raw)
To: Joe Perches; +Cc: David Miller, ilpo.jarvinen, netdev

On Wed, 05 Dec 2007 08:53:07 -0800
Joe Perches <joe@perches.com> wrote:

> > it occurred to me that we might want to do something
> > like a state change event generator.
>
> This could be a basis for an interesting TCP
> performance tester.

That is what tcpprobe does but it isn't detailed enough to address SACK
issues.
* Re: TCP event tracking via netlink...
  2007-12-05 21:33 ` Stephen Hemminger
@ 2007-12-05 22:15 ` Ilpo Järvinen
  2007-12-06  4:06   ` Stephen Hemminger
  1 sibling, 1 reply; 21+ messages in thread
From: Ilpo Järvinen @ 2007-12-05 22:15 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: Joe Perches, David Miller, Netdev

On Wed, 5 Dec 2007, Stephen Hemminger wrote:

> On Wed, 05 Dec 2007 08:53:07 -0800
> Joe Perches <joe@perches.com> wrote:
>
> > > it occurred to me that we might want to do something
> > > like a state change event generator.
> >
> > This could be a basis for an interesting TCP
> > performance tester.
>
> That is what tcpprobe does but it isn't detailed enough to address SACK
> issues.

...It would be nice if that could be generalized so that the probe could
be attached to functions other than tcp_rcv_established.

If we convert the remaining functions that don't have sk or tp as first
argument so that sk is listed first (there shouldn't be many with wrong
ordering, if any), then maybe a generic handler could be of type:

  jtcp_entry(struct sock *sk, ...)

or, when available:

  jtcp_entry(struct sock *sk, struct sk_buff *ack, ...)

--
 i.
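A generic handler of that type could be hooked up as a jprobe module in
the style of net/ipv4/tcp_probe.c.  Below is a minimal sketch against
the 2.6.24-era jprobe API, attaching to tcp_ack() instead; note that
tcp_ack() is static, so this relies on kallsyms resolving the symbol,
and the handler prototype must match the probed function exactly.

#include <linux/module.h>
#include <linux/kprobes.h>
#include <net/tcp.h>

/* Must mirror the signature of the probed function. */
static int jtcp_ack(struct sock *sk, struct sk_buff *skb, int flag)
{
	const struct tcp_sock *tp = tcp_sk(sk);

	printk(KERN_DEBUG "tcp_ack: snd_una %u cwnd %u ssthresh %u\n",
	       tp->snd_una, tp->snd_cwnd, tp->snd_ssthresh);

	jprobe_return();	/* mandatory for jprobe handlers */
	return 0;		/* never reached */
}

static struct jprobe tcp_ack_probe = {
	.kp	= { .symbol_name = "tcp_ack" },
	.entry	= JPROBE_ENTRY(jtcp_ack),
};

static int __init jtcp_init(void)
{
	return register_jprobe(&tcp_ack_probe);
}

static void __exit jtcp_exit(void)
{
	unregister_jprobe(&tcp_ack_probe);
}

module_init(jtcp_init);
module_exit(jtcp_exit);
MODULE_LICENSE("GPL");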
* Re: TCP event tracking via netlink...
  2007-12-05 22:15 ` Ilpo Järvinen
@ 2007-12-06  4:06 ` Stephen Hemminger
  0 siblings, 0 replies; 21+ messages in thread
From: Stephen Hemminger @ 2007-12-06 4:06 UTC (permalink / raw)
To: Ilpo Järvinen; +Cc: Joe Perches, David Miller, Netdev

On Thu, 6 Dec 2007 00:15:49 +0200 (EET)
"Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi> wrote:

> On Wed, 5 Dec 2007, Stephen Hemminger wrote:
>
> > On Wed, 05 Dec 2007 08:53:07 -0800
> > Joe Perches <joe@perches.com> wrote:
> >
> > > > it occurred to me that we might want to do something
> > > > like a state change event generator.
> > >
> > > This could be a basis for an interesting TCP
> > > performance tester.
> >
> > That is what tcpprobe does but it isn't detailed enough to address SACK
> > issues.
>
> ...It would be nice if that could be generalized so that the probe could
> be attached to functions other than tcp_rcv_established.
>
> If we convert the remaining functions that don't have sk or tp as first
> argument so that sk is listed first (there shouldn't be many with wrong
> ordering, if any), then maybe a generic handler could be of type:
>
>   jtcp_entry(struct sock *sk, ...)
>
> or, when available:
>
>   jtcp_entry(struct sock *sk, struct sk_buff *ack, ...)
>
> --
>  i.

An earlier version had hooks in send as well; it is trivial to extend.
As long as the prototypes match, any function arg ordering is okay.
* Re: TCP event tracking via netlink...
  2007-12-05 21:33 ` Stephen Hemminger
  2007-12-05 22:15   ` Ilpo Järvinen
@ 2007-12-06 10:20 ` David Miller
  2007-12-06 13:28   ` Arnaldo Carvalho de Melo
  1 sibling, 1 reply; 21+ messages in thread
From: David Miller @ 2007-12-06 10:20 UTC (permalink / raw)
To: shemminger; +Cc: joe, ilpo.jarvinen, netdev

From: Stephen Hemminger <shemminger@linux-foundation.org>
Date: Wed, 5 Dec 2007 16:33:38 -0500

> On Wed, 05 Dec 2007 08:53:07 -0800
> Joe Perches <joe@perches.com> wrote:
>
> > > it occurred to me that we might want to do something
> > > like a state change event generator.
> >
> > This could be a basis for an interesting TCP
> > performance tester.
>
> That is what tcpprobe does but it isn't detailed enough to address SACK
> issues.

Indeed, this could be done via the jprobe there.

Silly me, I didn't do this in the implementation I whipped
up, which I'll likely correct.
* Re: TCP event tracking via netlink...
  2007-12-06 10:20 ` David Miller
@ 2007-12-06 13:28 ` Arnaldo Carvalho de Melo
  0 siblings, 0 replies; 21+ messages in thread
From: Arnaldo Carvalho de Melo @ 2007-12-06 13:28 UTC (permalink / raw)
To: David Miller; +Cc: shemminger, joe, ilpo.jarvinen, netdev

On Thu, Dec 06, 2007 at 02:20:58AM -0800, David Miller wrote:
> From: Stephen Hemminger <shemminger@linux-foundation.org>
> Date: Wed, 5 Dec 2007 16:33:38 -0500
>
> > On Wed, 05 Dec 2007 08:53:07 -0800
> > Joe Perches <joe@perches.com> wrote:
> >
> > > > it occurred to me that we might want to do something
> > > > like a state change event generator.
> > >
> > > This could be a basis for an interesting TCP
> > > performance tester.
> >
> > That is what tcpprobe does but it isn't detailed enough to address SACK
> > issues.
>
> Indeed, this could be done via the jprobe there.
>
> Silly me, I didn't do this in the implementation I whipped
> up, which I'll likely correct.

I have some experiments from the past in this area:

This is what is produced by ctracer + the ostra callgrapher when
tracking many sk_buff objects, tracing sk_buff routines as well as all
other structs that have a pointer to a sk_buff, i.e. where the sk_buff
can be got from the struct that has a pointer to it; tcp_sock is an
"alias" to struct inet_sock, which is an "alias" to struct sock, etc,
so when tracing tcp_sock you also trace inet_connection_sock,
inet_sock and sock methods:

http://oops.ghostprotocols.net:81/acme/dwarves/callgraphs/sk_buff/many_objects/

With just one object (that is reused, so appears many times):

http://oops.ghostprotocols.net:81/acme/dwarves/callgraphs/sk_buff/0xffff8101013130e8/

Following struct sock methods:

http://oops.ghostprotocols.net:81/acme/dwarves/callgraphs/sock/many_objects/
http://oops.ghostprotocols.net:81/acme/dwarves/callgraphs/sock/0xf61bf500/

struct socket:

http://oops.ghostprotocols.net:81/acme/dwarves/callgraphs/socket/many_objects/

It works by using the DWARF information to generate a systemtap module
that in turn will create a relayfs channel where we store the traces,
and an automatically reorganized struct with just the base types (int,
char, long, etc) and typedefs that end up being base types.

For an example of the mini struct recreated from the debugging
information and reorganized using the algorithms in pahole to save
space, go to the bottom of this file, where you'll find struct
ctracer__mini_sock and the collector that creates the mini struct from
a full sized object:

http://oops.ghostprotocols.net:81/acme/dwarves/callgraphs/ctracer_collector.struct.sock.c

And the systemtap module (the tcpprobe on steroids) automatically
generated:

http://oops.ghostprotocols.net:81/acme/dwarves/callgraphs/ctracer_methods.struct.sock.stp

This requires more work to:

 . reduce the overhead
 . filter out undesired functions, creating a "project" with the
   desired functions using some GUI editor
 . specify lists of fields to put on the internal state to be
   collected, again using a GUI or plain ctracer-edit using vi,
   instead of getting just base types
 . be able to say: collect just the fields on the second and fourth
   cacheline
 . collectors for complex objects such as spinlocks, socket lock,
   mutexes

But since people are wanting to work on tools to watch state
transitions, fields changing, etc, I thought I should dust off the
ostra experiments and the more recent dwarves ctracer work I'm doing
in my copious spare time 8)

In the callgrapher there is some more interesting stuff:

Interface to see where fields changed:

http://oops.ghostprotocols.net:81/acme/dwarves/callgraphs/sock/0xf61bf500/changes.html

On this page, clicking on a field name gives you graphs over time,
such as:

http://oops.ghostprotocols.net:81/acme/dwarves/callgraphs/sock/0xf61bf500/sk_forward_alloc.png

Code is in the dwarves repo at:

http://master.kernel.org/git/?p=linux/kernel/git/acme/pahole.git;a=summary

Thanks,

- Arnaldo
* Re: TCP event tracking via netlink...
  2007-12-05 13:30 TCP event tracking via netlink David Miller
  2007-12-05 14:11 ` John Heffner
  2007-12-05 16:53 ` Joe Perches
@ 2007-12-05 23:18 ` Ilpo Järvinen
  2007-12-06 10:33   ` David Miller
  2 siblings, 1 reply; 21+ messages in thread
From: Ilpo Järvinen @ 2007-12-05 23:18 UTC (permalink / raw)
To: David Miller; +Cc: Netdev

On Wed, 5 Dec 2007, David Miller wrote:

> Ilpo, I was pondering the kind of debugging one does to find
> congestion control issues and even SACK bugs and it's currently too
> painful because there is no standard way to track state changes.

That's definitely true.

> I assume you're using something like carefully crafted printk's,
> kprobes, or even ad-hoc statistic counters.  That's what I used to do
> :-)

No, that's not at all what I do :-). I usually look at time-seq graphs
except for the cases when I just find things out by reading code (or
by just thinking of it).  I'm so used to all things in the graphs that
I can quite easily spot any inconsistencies & TCP events and then look
at the interesting parts in greater detail; very rarely does something
remain uncertain...

However, instead of directly going to printks, etc., I almost always
read the code first (usually it's not just a couple of lines but tens
of potential TCP execution paths involving more than a handful of
functions to check what the end result would be).  This has a nice
side-effect that other things tend to show up as well.  Only when
things get nasty and I cannot figure out what it does wrong do I add
specially placed ad-hoc printks.

One trick I also use is to get the vars of the relevant flow from
/proc/net/tcp in a while loop, but it only works in my case because I
use links that are slow (even a small-value sleep in the loop does not
hide much).

For other people's reports, I occasionally have to write validator
patches, as you might have noticed, because in a typical miscount case
our BUG_TRAPs are too late: they trigger only after the outstanding
window becomes zero, which might already be a very distant point in
time from the cause.

Also, I'm planning an experiment with the markers to see if they are
of any use when trying to gather some latency data about SACK
processing, because they seem lightweight enough not to be disturbing.

> With that in mind it occurred to me that we might want to do something
> like a state change event generator.
>
> Basically some application or even a daemon listens on this generic
> netlink socket family we create.  The header of each event packet
> indicates what socket the event is for and then there is some state
> information.
>
> Then you can look at a tcpdump and this state dump side by side and
> see what the kernel decided to do.

Much of the info is available in tcpdump already, it's just hard to read
without graphing it first because there are so many overlapping things
to track in two-dimensional space.

...But yes, I have to admit that a couple of problems come to mind
where having some variable from tcp_sock would have made the problem
more obvious.

> Now there is the question of granularity.
>
> A very important consideration in this is that we want this thing to
> be enabled in the distributions, therefore it must be cheap.  Perhaps
> one test at the end of the packet input processing.

Not sure what the benefit of having it in distributions is because
those people hardly ever report problems here, they're just too
happy with TCP performance unless we print something to their logs,
which implies that we must set up a *_ON() condition :-(.

Yes, an often neglected problem is that most people are just too happy
even with something as prehistoric as TCP Tahoe.  I've been surprised
how badly TCP can break without anybody complaining, as long as it
doesn't crash (not even any of the devs).  Two key things seem to
surface most of the TCP-related bugs: research people really staring
at strange packet patterns (or code), and reports triggered by
automatic WARN/BUG_ON checks.  The latter reports also include corner
cases which nobody would otherwise ever have noticed (or at least not
before Linus releases 3.0 :-/).  IMHO, those invariant WARN/BUG_ONs
are the only alternative that scales to normal users well enough.  The
checks are simple enough that they can be always on; then we just
happen to print something to their log, and that's offensive enough
for somebody to come up with a report... ;-)

> So I say we pick some state to track (perhaps start with tcp_info)
> and just push that at the end of every packet input run.  Also,
> we add some minimal filtering capability (match on specific IP
> address and/or port, for example).
>
> Maybe if we want to get really fancy we can have some more-expensive
> debug mode where detailed specific events get generated via some
> macros we can scatter all over the place.
>
> This won't be useful for general user problem analysis, but it will be
> excellent for developers.

I would say that for it to be generic enough, most function entries and
exits would have to be covered because the need varies a lot; the
processing in general is so complex that things would too easily get
shadowed otherwise!  In addition we need expensive mode++, which goes
all the way down to the dirty details of the write queue; they're now
dirtier than ever because of the queue split I dared to do.

Some problems are simply such that things cannot be accurately verified
without high processing overhead until it's far too late (e.g. skb bits vs
*_out counters). Maybe we should start to build an expensive state
validator as well which would automatically check invariants of the write
queue and tcp_sock in a straightforward, unoptimized manner? That would
definitely do a lot of work for us, just ask people to turn it on and it
spits out everything that went wrong :-) (unless they really depend on
very high-speed things and are therefore unhappy if we scan thousands of
packets unnecessarily per ACK :-)). ...Early enough! ...That would work
also for distros but there's always human judgement needed to decide
whether the bug reporter will be happy when his TCP processing no
longer scales ;-).

For the simpler thing, why not just take all TCP functions and build
some automated tool using kprobes to collect the information we need
through the sk/tp available on almost every function call?  Some
TCP-specific code could then easily produce what we want from it.  Ah,
this is almost done already, as noted by Stephen; it would just need
some generalization to be pluggable into other functions as well, and
more variables.

> Let me know if you think this is useful enough and I'll work on
> an implementation we can start playing with.

...Hopefully you found some of my comments useful.

--
 i.
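A first cut at the kind of straightforward, unoptimized validator
described above could simply walk the whole write queue and cross-check
the skb SACK bits against the *_out counters.  The sketch below assumes
the 2.6.24-era write-queue helpers; as written, the sacked_out check
only holds for SACK-enabled flows, since for newreno sacked_out counts
duplicate ACKs instead.

/* Expensive invariant check: O(packets in queue) per call. */
static void tcp_validate_write_queue(struct sock *sk)
{
	const struct tcp_sock *tp = tcp_sk(sk);
	struct sk_buff *skb;
	u32 packets = 0, sacked = 0, lost = 0, retrans = 0;

	tcp_for_write_queue(skb, sk) {
		if (skb == tcp_send_head(sk))
			break;

		packets += tcp_skb_pcount(skb);
		if (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED)
			sacked += tcp_skb_pcount(skb);
		if (TCP_SKB_CB(skb)->sacked & TCPCB_LOST)
			lost += tcp_skb_pcount(skb);
		if (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_RETRANS)
			retrans += tcp_skb_pcount(skb);
	}

	/* Catch a miscount immediately, not when the outstanding
	 * window finally hits zero much later. */
	WARN_ON(packets != tp->packets_out);
	WARN_ON(sacked != tp->sacked_out);	/* SACK-enabled flows only */
	WARN_ON(lost != tp->lost_out);
	WARN_ON(retrans != tp->retrans_out);
}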
* Re: TCP event tracking via netlink...
  2007-12-05 23:18 ` Ilpo Järvinen
@ 2007-12-06 10:33 ` David Miller
  2007-12-06 17:23   ` Stephen Hemminger
  2007-12-07 16:43   ` Ilpo Järvinen
  0 siblings, 2 replies; 21+ messages in thread
From: David Miller @ 2007-12-06 10:33 UTC (permalink / raw)
To: ilpo.jarvinen; +Cc: netdev

From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi>
Date: Thu, 6 Dec 2007 01:18:28 +0200 (EET)

> On Wed, 5 Dec 2007, David Miller wrote:
>
> > I assume you're using something like carefully crafted printk's,
> > kprobes, or even ad-hoc statistic counters.  That's what I used to do
> > :-)
>
> No, that's not at all what I do :-). I usually look at time-seq graphs
> except for the cases when I just find things out by reading code (or
> by just thinking of it).

Can you briefly detail what graph tools and command lines
you are using?

The last time I did graphing to analyze things, the tools
were hit-or-miss.

> Much of the info is available in tcpdump already, it's just hard to read
> without graphing it first because there are so many overlapping things
> to track in two-dimensional space.
>
> ...But yes, I have to admit that a couple of problems come to mind
> where having some variable from tcp_sock would have made the problem
> more obvious.

The most important are the cwnd and ssthresh, which you could guess
using graphs but it is important to know on a packet-to-packet
basis why we might have sent a packet or not because this has
rippling effects down the rest of the RTT.

> Not sure what the benefit of having it in distributions is because
> those people hardly ever report problems here, they're just too
> happy with TCP performance unless we print something to their logs,
> which implies that we must set up a *_ON() condition :-(.

That may be true, but if we could integrate the information with
tcpdumps, we could gather internal state using tools the user
already has available.

Imagine if tcpdump printed out:

02:26:14.865805 IP $SRC > $DEST: . 11226:12686(1460) ack 0 win 108
	ss_thresh: 129 cwnd: 133 packets_out: 132

or something like that.

> Some problems are simply such that things cannot be accurately verified
> without high processing overhead until it's far too late (e.g. skb bits vs
> *_out counters). Maybe we should start to build an expensive state
> validator as well which would automatically check invariants of the write
> queue and tcp_sock in a straightforward, unoptimized manner? That would
> definitely do a lot of work for us, just ask people to turn it on and it
> spits out everything that went wrong :-) (unless they really depend on
> very high-speed things and are therefore unhappy if we scan thousands of
> packets unnecessarily per ACK :-)). ...Early enough! ...That would work
> also for distros but there's always human judgement needed to decide
> whether the bug reporter will be happy when his TCP processing no
> longer scales ;-).

I think it's useful as a TCP_DEBUG config option or similar, sure.

But sometimes the algorithms are working as designed; it's just that
they provide poor pipe utilization, and CWND analysis embedded inside
of a tcpdump would be one way to see that as well as determine the
flaw in the algorithm.

> ...Hopefully you found some of my comments useful.

Very much so, thanks.

I put together a sample implementation anyways just to show the idea,
against net-2.6.25 below.

It is untested since I didn't write the userland app yet to see that
proper things get logged.  Basically you could run a daemon that
writes per-connection traces into files based upon the incoming
netlink events.  Later, using the binary pcap file and these traces,
you can piece together traces like the above using the timestamps
etc. to match up pcap packets to ones from the TCP logger.

The userland tools could do analysis and print pre-cooked state diff
logs, like "this ACK raised CWND by one" or whatever else you wanted
to know.

It's nice that an expert like you can look at graphs and understand,
but we'd like to create more experts and besides reading code one
way to become an expert is to be able to extract live real data
from the kernel's working state and try to understand how things
got that way.  This information is permanently lost currently.

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 56342c3..c0e61d0 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -170,6 +170,47 @@ struct tcp_md5sig {
 	__u8	tcpm_key[TCP_MD5SIG_MAXKEYLEN];	/* key (binary) */
 };
 
+/* TCP netlink event logger. */
+struct tcp_log_key {
+	union {
+		__be32	a4;
+		__be32	a6[4];
+	} saddr, daddr;
+	__be16		sport;
+	__be16		dport;
+	unsigned short	family;
+	unsigned short	__pad;
+};
+
+struct tcp_log_stamp {
+	__u32	tv_sec;
+	__u32	tv_usec;
+};
+
+struct tcp_log_payload {
+	struct tcp_log_key	key;
+	struct tcp_log_stamp	stamp;
+	struct tcp_info		info;
+};
+
+enum {
+	TCP_LOG_A_UNSPEC = 0,
+	__TCP_LOG_A_MAX,
+};
+#define TCP_LOG_A_MAX	(__TCP_LOG_A_MAX - 1)
+
+#define TCP_LOG_GENL_NAME	"tcp_log"
+#define TCP_LOG_GENL_VERSION	1
+
+enum {
+	TCP_LOG_CMD_UNSPEC = 0,
+	TCP_LOG_CMD_HELLO,
+	TCP_LOG_CMD_GOODBYE,
+	TCP_LOG_CMD_EVENT,
+	__TCP_LOG_CMD_MAX,
+};
+#define TCP_LOG_CMD_MAX	(__TCP_LOG_CMD_MAX - 1)
+
 #ifdef __KERNEL__
 
 #include <linux/skbuff.h>
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 9dbed0b..5ac82ea 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1730,6 +1730,19 @@ struct tcp_request_sock_ops {
 #endif
 };
 
+#define TCP_LOG_PID_INACTIVE	-1
+extern int tcp_log_pid;
+
+extern void tcp_do_log(struct sock *sk, ktime_t stamp);
+
+static inline void tcp_log(struct sock *sk, ktime_t stamp)
+{
+	if (likely(tcp_log_pid == TCP_LOG_PID_INACTIVE))
+		return;
+
+	tcp_do_log(sk, stamp);
+}
+
 extern void tcp_v4_init(struct net_proto_family *ops);
 extern void tcp_init(void);
diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
index ad40ef3..fa0cc1d 100644
--- a/net/ipv4/Makefile
+++ b/net/ipv4/Makefile
@@ -7,7 +7,7 @@ obj-y := route.o inetpeer.o protocol.o \
 	     ip_output.o ip_sockglue.o inet_hashtables.o \
 	     inet_timewait_sock.o inet_connection_sock.o \
 	     tcp.o tcp_input.o tcp_output.o tcp_timer.o tcp_ipv4.o \
-	     tcp_minisocks.o tcp_cong.o \
+	     tcp_minisocks.o tcp_cong.o tcp_log.o \
 	     datagram.o raw.o udp.o udplite.o \
 	     arp.o icmp.o devinet.o af_inet.o igmp.o \
 	     fib_frontend.o fib_semantics.o \
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index c5fba12..a51cbd2 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4577,6 +4577,7 @@ int tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
 			struct tcphdr *th, unsigned len)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
+	ktime_t stamp = skb->tstamp;
 
 	/*
 	 *	Header prediction.
@@ -4657,6 +4658,7 @@ int tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
 				tcp_ack(sk, skb, 0);
 				__kfree_skb(skb);
 				tcp_data_snd_check(sk);
+				tcp_log(sk, stamp);
 				return 0;
 			} else { /* Header too small */
 				TCP_INC_STATS_BH(TCP_MIB_INERRS);
@@ -4748,6 +4750,7 @@ no_ack:
 			__kfree_skb(skb);
 		else
 			sk->sk_data_ready(sk, 0);
+		tcp_log(sk, stamp);
 		return 0;
 	}
 }
@@ -4800,6 +4803,7 @@ slow_path:
 		TCP_INC_STATS_BH(TCP_MIB_INERRS);
 		NET_INC_STATS_BH(LINUX_MIB_TCPABORTONSYN);
 		tcp_reset(sk);
+		tcp_log(sk, stamp);
 		return 1;
 	}
 
@@ -4817,6 +4821,7 @@ step5:
 	tcp_data_snd_check(sk);
 	tcp_ack_snd_check(sk);
 
+	tcp_log(sk, stamp);
 	return 0;
 
 csum_error:
@@ -4824,6 +4829,7 @@ csum_error:
 
 discard:
 	__kfree_skb(skb);
+	tcp_log(sk, stamp);
 	return 0;
 }
--- a/net/ipv4/tcp_log.c	2007-10-24 01:07:28.000000000 -0700
+++ b/net/ipv4/tcp_log.c	2007-12-06 01:06:26.000000000 -0800
@@ -0,0 +1,150 @@
+/* tcp_log.c: Netlink based TCP state change logger.
+ *
+ * Copyright (C) 2007 David S. Miller <davem@davemloft.net>
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/time.h>
+#include <linux/ipv6.h>
+#include <linux/tcp.h>
+
+#include <net/genetlink.h>
+#include <net/inet_sock.h>
+#include <net/tcp.h>
+
+static struct genl_family tcp_log_family = {
+	.id		= GENL_ID_GENERATE,
+	.name		= TCP_LOG_GENL_NAME,
+	.version	= TCP_LOG_GENL_VERSION,
+	.hdrsize	= sizeof(struct tcp_log_payload),
+	.maxattr	= TCP_LOG_A_MAX,
+};
+
+static unsigned int tcp_log_seqnum;
+
+int tcp_log_pid = TCP_LOG_PID_INACTIVE;
+EXPORT_SYMBOL(tcp_log_pid);
+
+static int tcp_log_hello(struct sk_buff *skb, struct genl_info *info)
+{
+	tcp_log_pid = info->snd_pid;
+	return 0;
+}
+
+static int tcp_log_goodbye(struct sk_buff *skb, struct genl_info *info)
+{
+	tcp_log_pid = TCP_LOG_PID_INACTIVE;
+	return 0;
+}
+
+static struct genl_ops tcp_log_hello_ops = {
+	.cmd	= TCP_LOG_CMD_HELLO,
+	.doit	= tcp_log_hello,
+};
+
+static struct genl_ops tcp_log_goodbye_ops = {
+	.cmd	= TCP_LOG_CMD_GOODBYE,
+	.doit	= tcp_log_goodbye,
+};
+
+static void fill_key(struct tcp_log_key *key, struct sock *sk)
+{
+	struct inet_sock *inet = inet_sk(sk);
+	struct ipv6_pinfo *np = inet6_sk(sk);
+
+	switch (sk->sk_family) {
+	case AF_INET:
+		key->saddr.a4 = inet->saddr;
+		key->daddr.a4 = inet->daddr;
+		break;
+	case AF_INET6:
+		memcpy(&key->saddr.a6, &np->saddr, sizeof(key->saddr.a6));
+		memcpy(&key->daddr.a6, &np->daddr, sizeof(key->daddr.a6));
+		break;
+	default:
+		BUG();
+		break;
+	}
+	key->family = sk->sk_family;
+	key->sport = inet->sport;
+	key->dport = inet->dport;
+}
+
+void tcp_do_log(struct sock *sk, ktime_t stamp)
+{
+	struct tcp_log_payload *p;
+	struct sk_buff *skb;
+	struct timeval tv;
+	void *data;
+	int size;
+
+	size = nla_total_size(sizeof(struct tcp_log_payload));
+	skb = genlmsg_new(size, GFP_ATOMIC);
+	if (!skb)
+		return;
+
+	data = genlmsg_put(skb, 0, tcp_log_seqnum++,
+			   &tcp_log_family, 0, TCP_LOG_CMD_EVENT);
+	if (!data) {
+		nlmsg_free(skb);
+		return;
+	}
+	p = data;
+
+	fill_key(&p->key, sk);
+
+	if (stamp.tv64)
+		tv = ktime_to_timeval(stamp);
+	else
+		do_gettimeofday(&tv);
+
+	p->stamp.tv_sec = tv.tv_sec;
+	p->stamp.tv_usec = tv.tv_usec;
+
+	tcp_get_info(sk, &p->info);
+
+	if (genlmsg_end(skb, data) < 0) {
+		nlmsg_free(skb);
+		return;
+	}
+
+	genlmsg_unicast(skb, tcp_log_pid);
+}
+EXPORT_SYMBOL(tcp_do_log);
+
+static int __init tcp_log_init(void)
+{
+	int err = genl_register_family(&tcp_log_family);
+
+	if (err)
+		return err;
+
+	err = genl_register_ops(&tcp_log_family, &tcp_log_hello_ops);
+	if (err)
+		goto out_unregister_family;
+
+	err = genl_register_ops(&tcp_log_family, &tcp_log_goodbye_ops);
+	if (err)
+		goto out_unregister_hello;
+
+	return 0;
+
+out_unregister_hello:
+	genl_unregister_ops(&tcp_log_family, &tcp_log_hello_ops);
+
+out_unregister_family:
+	genl_unregister_family(&tcp_log_family);
+
+	return err;
+}
+
+static void __exit tcp_log_exit(void)
+{
+	genl_unregister_ops(&tcp_log_family, &tcp_log_goodbye_ops);
+	genl_unregister_ops(&tcp_log_family, &tcp_log_hello_ops);
+	genl_unregister_family(&tcp_log_family);
+}
+
+module_init(tcp_log_init);
+module_exit(tcp_log_exit);
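The missing userland side could start out as small as the sketch below:
say hello, then dump each event.  It is written against the later
libnl-3 API purely for brevity, and it assumes the patched <linux/tcp.h>
above is on the include path so that struct tcp_log_payload and the
TCP_LOG_* constants are visible; none of this is an existing interface.

#include <stdio.h>
#include <arpa/inet.h>
#include <linux/tcp.h>			/* the patched header above */
#include <netlink/netlink.h>
#include <netlink/genl/genl.h>
#include <netlink/genl/ctrl.h>

static int event_cb(struct nl_msg *msg, void *arg)
{
	struct genlmsghdr *ghdr = nlmsg_data(nlmsg_hdr(msg));
	struct tcp_log_payload *p = genlmsg_user_hdr(ghdr);

	if (ghdr->cmd != TCP_LOG_CMD_EVENT)
		return NL_SKIP;

	/* Timestamp plus ports is enough to line an event up with
	 * the matching record in a pcap file. */
	printf("%u.%06u %u -> %u snd_cwnd %u\n",
	       p->stamp.tv_sec, p->stamp.tv_usec,
	       ntohs(p->key.sport), ntohs(p->key.dport),
	       p->info.tcpi_snd_cwnd);
	return NL_OK;
}

int main(void)
{
	struct nl_sock *sk = nl_socket_alloc();
	int family;

	genl_connect(sk);
	family = genl_ctrl_resolve(sk, TCP_LOG_GENL_NAME);

	/* Register this pid as the destination for event unicasts. */
	genl_send_simple(sk, family, TCP_LOG_CMD_HELLO,
			 TCP_LOG_GENL_VERSION, 0);

	/* Events are unsolicited; don't enforce sequence numbers. */
	nl_socket_disable_seq_check(sk);
	nl_socket_modify_cb(sk, NL_CB_VALID, NL_CB_CUSTOM, event_cb, NULL);

	for (;;)
		nl_recvmsgs_default(sk);
}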
* Re: TCP event tracking via netlink...
  2007-12-06 10:33 ` David Miller
@ 2007-12-06 17:23 ` Stephen Hemminger
  2007-12-07  6:51   ` David Miller
  2008-01-02  8:22   ` David Miller
  1 sibling, 2 replies; 21+ messages in thread
From: Stephen Hemminger @ 2007-12-06 17:23 UTC (permalink / raw)
To: David Miller; +Cc: ilpo.jarvinen, netdev

On Thu, 06 Dec 2007 02:33:46 -0800 (PST)
David Miller <davem@davemloft.net> wrote:

> From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi>
> Date: Thu, 6 Dec 2007 01:18:28 +0200 (EET)
>
> > On Wed, 5 Dec 2007, David Miller wrote:
> >
> > > I assume you're using something like carefully crafted printk's,
> > > kprobes, or even ad-hoc statistic counters.  That's what I used to do
> > > :-)
> >
> > No, that's not at all what I do :-). I usually look at time-seq graphs
> > except for the cases when I just find things out by reading code (or
> > by just thinking of it).
>
> Can you briefly detail what graph tools and command lines
> you are using?
>
> The last time I did graphing to analyze things, the tools
> were hit-or-miss.
>
> > Much of the info is available in tcpdump already, it's just hard to read
> > without graphing it first because there are so many overlapping things
> > to track in two-dimensional space.
> >
> > ...But yes, I have to admit that a couple of problems come to mind
> > where having some variable from tcp_sock would have made the problem
> > more obvious.
>
> The most important are the cwnd and ssthresh, which you could guess
> using graphs but it is important to know on a packet-to-packet
> basis why we might have sent a packet or not because this has
> rippling effects down the rest of the RTT.
>
> > Not sure what the benefit of having it in distributions is because
> > those people hardly ever report problems here, they're just too
> > happy with TCP performance unless we print something to their logs,
> > which implies that we must set up a *_ON() condition :-(.
>
> That may be true, but if we could integrate the information with
> tcpdumps, we could gather internal state using tools the user
> already has available.
>
> Imagine if tcpdump printed out:
>
> 02:26:14.865805 IP $SRC > $DEST: . 11226:12686(1460) ack 0 win 108
> 	ss_thresh: 129 cwnd: 133 packets_out: 132
>
> or something like that.
>
> > Some problems are simply such that things cannot be accurately verified
> > without high processing overhead until it's far too late (e.g. skb bits vs
> > *_out counters). Maybe we should start to build an expensive state
> > validator as well which would automatically check invariants of the write
> > queue and tcp_sock in a straightforward, unoptimized manner? That would
> > definitely do a lot of work for us, just ask people to turn it on and it
> > spits out everything that went wrong :-) (unless they really depend on
> > very high-speed things and are therefore unhappy if we scan thousands of
> > packets unnecessarily per ACK :-)). ...Early enough! ...That would work
> > also for distros but there's always human judgement needed to decide
> > whether the bug reporter will be happy when his TCP processing no
> > longer scales ;-).
>
> I think it's useful as a TCP_DEBUG config option or similar, sure.
>
> But sometimes the algorithms are working as designed; it's just that
> they provide poor pipe utilization, and CWND analysis embedded inside
> of a tcpdump would be one way to see that as well as determine the
> flaw in the algorithm.
>
> > ...Hopefully you found some of my comments useful.
>
> Very much so, thanks.
>
> I put together a sample implementation anyways just to show the idea,
> against net-2.6.25 below.
>
> It is untested since I didn't write the userland app yet to see that
> proper things get logged.  Basically you could run a daemon that
> writes per-connection traces into files based upon the incoming
> netlink events.  Later, using the binary pcap file and these traces,
> you can piece together traces like the above using the timestamps
> etc. to match up pcap packets to ones from the TCP logger.
>
> The userland tools could do analysis and print pre-cooked state diff
> logs, like "this ACK raised CWND by one" or whatever else you wanted
> to know.
>
> It's nice that an expert like you can look at graphs and understand,
> but we'd like to create more experts and besides reading code one
> way to become an expert is to be able to extract live real data
> from the kernel's working state and try to understand how things
> got that way.  This information is permanently lost currently.

Tools and scripts for testing that generate graphs are at:
git://git.kernel.org/pub/scm/tcptest/tcptest
* Re: TCP event tracking via netlink...
  2007-12-06 17:23 ` Stephen Hemminger
@ 2007-12-07  6:51 ` David Miller
  2008-01-02  8:22 ` David Miller
  1 sibling, 0 replies; 21+ messages in thread
From: David Miller @ 2007-12-07 6:51 UTC (permalink / raw)
To: shemminger; +Cc: ilpo.jarvinen, netdev

From: Stephen Hemminger <shemminger@linux-foundation.org>
Date: Thu, 6 Dec 2007 09:23:12 -0800

> Tools and scripts for testing that generate graphs are at:
> git://git.kernel.org/pub/scm/tcptest/tcptest

I know about this, I'm just curious what exactly Ilpo is using :-)
* Re: TCP event tracking via netlink...
  2007-12-06 17:23 ` Stephen Hemminger
  2007-12-07  6:51   ` David Miller
@ 2008-01-02  8:22 ` David Miller
  2008-01-02 11:05   ` Ilpo Järvinen
  1 sibling, 1 reply; 21+ messages in thread
From: David Miller @ 2008-01-02 8:22 UTC (permalink / raw)
To: shemminger; +Cc: ilpo.jarvinen, netdev

From: Stephen Hemminger <shemminger@linux-foundation.org>
Date: Thu, 6 Dec 2007 09:23:12 -0800

> Tools and scripts for testing that generate graphs are at:
> git://git.kernel.org/pub/scm/tcptest/tcptest

Did you move it somewhere else?

davem@sunset:~/src/GIT$ git clone git://git.kernel.org/pub/scm/tcptest/tcptest
Initialized empty Git repository in /home/davem/src/GIT/tcptest/.git/
fatal: The remote end hung up unexpectedly
fetch-pack from 'git://git.kernel.org/pub/scm/tcptest/tcptest' failed.
* Re: TCP event tracking via netlink...
  2008-01-02  8:22 ` David Miller
@ 2008-01-02 11:05 ` Ilpo Järvinen
  2008-01-03  9:26   ` David Miller
  0 siblings, 1 reply; 21+ messages in thread
From: Ilpo Järvinen @ 2008-01-02 11:05 UTC (permalink / raw)
To: David Miller; +Cc: Stephen Hemminger, Netdev

On Wed, 2 Jan 2008, David Miller wrote:

> From: Stephen Hemminger <shemminger@linux-foundation.org>
> Date: Thu, 6 Dec 2007 09:23:12 -0800
>
> > Tools and scripts for testing that generate graphs are at:
> > git://git.kernel.org/pub/scm/tcptest/tcptest
>
> Did you move it somewhere else?
>
> davem@sunset:~/src/GIT$ git clone git://git.kernel.org/pub/scm/tcptest/tcptest
> Initialized empty Git repository in /home/davem/src/GIT/tcptest/.git/
> fatal: The remote end hung up unexpectedly
> fetch-pack from 'git://git.kernel.org/pub/scm/tcptest/tcptest' failed.

.../network/ was missing from the path :-).

$ git-remote show origin
* remote origin
  URL: git://git.kernel.org/pub/scm/network/tcptest/tcptest.git
  Remote branch(es) merged with 'git pull' while on branch master
    master
  Tracked remote branches
    master

--
 i.
* Re: TCP event tracking via netlink...
  2008-01-02 11:05 ` Ilpo Järvinen
@ 2008-01-03  9:26 ` David Miller
  0 siblings, 0 replies; 21+ messages in thread
From: David Miller @ 2008-01-03 9:26 UTC (permalink / raw)
To: ilpo.jarvinen; +Cc: shemminger, netdev

From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi>
Date: Wed, 2 Jan 2008 13:05:17 +0200 (EET)

> git://git.kernel.org/pub/scm/network/tcptest/tcptest.git

Thanks a lot Ilpo.
* Re: TCP event tracking via netlink...
  2007-12-06 10:33 ` David Miller
  2007-12-06 17:23   ` Stephen Hemminger
@ 2007-12-07 16:43 ` Ilpo Järvinen
  1 sibling, 0 replies; 21+ messages in thread
From: Ilpo Järvinen @ 2007-12-07 16:43 UTC (permalink / raw)
To: David Miller; +Cc: Netdev

On Thu, 6 Dec 2007, David Miller wrote:

> From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi>
> Date: Thu, 6 Dec 2007 01:18:28 +0200 (EET)
>
> > On Wed, 5 Dec 2007, David Miller wrote:
> >
> > > I assume you're using something like carefully crafted printk's,
> > > kprobes, or even ad-hoc statistic counters.  That's what I used to do
> > > :-)
> >
> > No, that's not at all what I do :-). I usually look at time-seq graphs
> > except for the cases when I just find things out by reading code (or
> > by just thinking of it).
>
> Can you briefly detail what graph tools and command lines
> you are using?

I have a tool called Sealion but it's behind NDA (making it open source
has been talked about for long but I have no idea why it hasn't happened
yet).  It's mostly tcl/tk code, by no means nice or clean in design or
quality (I'll leave the details of why I think it's that way out of this
discussion :-)).  Produces svgs.  Usually I have the things I need in
the standard sent+ACK+SACKs(+win) graph it produces.  The result is
quite similar to what tcptrace+xplot produces, but xplot's UI is really
horrible, IMHO.

If I have to deal with tcpdump output only, it takes a considerable
amount of time to do computations with bc to come up with the same
understanding by just reading tcpdumps.

> The last time I did graphing to analyze things, the tools
> were hit-or-miss.

Yeah, this is definitely true.  The open source graphing tools I know
are really not that astonishing :-(.  I've tried to look for better
tools as well, but with little success.

> > Much of the info is available in tcpdump already, it's just hard to read
> > without graphing it first because there are so many overlapping things
> > to track in two-dimensional space.
> >
> > ...But yes, I have to admit that a couple of problems come to mind
> > where having some variable from tcp_sock would have made the problem
> > more obvious.
>
> The most important are the cwnd and ssthresh, which you could guess
> using graphs but it is important to know on a packet-to-packet
> basis why we might have sent a packet or not because this has
> rippling effects down the rest of the RTT.

A couple of points: in order to evaluate the validity of some action,
one might need more than one packet from the history.  The answer to
why we have sent a packet is rather simple (excluding RTOs):
cwnd > packets_in_flight and data was available (see the sketch after
this message).  No, it's not at all complicated.  Though I might be too
biased toward non-application-limited cases, which make the formula
even simpler because everything is basically ACK clocked.

To really tell what caused changes between cwnd and/or
packets_in_flight, one usually needs some history or a more
fine-grained approach; once per packet is way too wide a gap.  It tells
just what happened, not why, unless you're really familiar with the
state machine and can make the right guess.

> > Not sure what the benefit of having it in distributions is because
> > those people hardly ever report problems here, they're just too
> > happy with TCP performance unless we print something to their logs,
> > which implies that we must set up a *_ON() condition :-(.
>
> That may be true, but if we could integrate the information with
> tcpdumps, we could gather internal state using tools the user
> already has available.

It would definitely help if we could, but that of course depends on
getting the reports in the first place.

> Imagine if tcpdump printed out:
>
> 02:26:14.865805 IP $SRC > $DEST: . 11226:12686(1460) ack 0 win 108
> 	ss_thresh: 129 cwnd: 133 packets_out: 132
>
> or something like that.

How about this:

02:26:14.865805 IP $SRC > $DEST: . ack 11226 win 108 <...sack 1 {15606:18526}
17066:18526 0->S sacktag_one l0 s1 r0 f4 pc1
...
11226:12686 ---- clean_rtx_queue
...
11226:12686 0->L mark_head_lost l1 s1 r0 f4 pc1
...
12686:14146 0->L mark_head_lost l2 s1 r0 f4 pc1
...
11226:12686 L->LRe retransmit_skb l2 s1 r1 f4 pc1
...

...would make the bug in SACK processing relatively obvious (yes, it
has an intentional flaw in it, points for finding it :-))...  That
would be something I'd like to have right now.

> But sometimes the algorithms are working as designed; it's just that
> they provide poor pipe utilization, and CWND analysis embedded inside
> of a tcpdump would be one way to see that as well as determine the
> flaw in the algorithm.

Fair enough.

> It is untested since I didn't write the userland app yet to see that
> proper things get logged.  Basically you could run a daemon that
> writes per-connection traces into files based upon the incoming
> netlink events.  Later, using the binary pcap file and these traces,
> you can piece together traces like the above using the timestamps
> etc. to match up pcap packets to ones from the TCP logger.
>
> The userland tools could do analysis and print pre-cooked state diff
> logs, like "this ACK raised CWND by one" or whatever else you wanted
> to know.

Obviously a collection of useful userland tools seems at least as
important here as the existence of the interface.

> It's nice that an expert like you can look at graphs and understand,
> but we'd like to create more experts and besides reading code one
> way to become an expert is to be able to extract live real data
> from the kernel's working state and try to understand how things
> got that way.  This information is permanently lost currently.

IMHO this problem is of such caliber that no human can efficiently
track more than a couple of packets of a TCP flow from a text-only view
(without headaches, I mean), or are you able to do that with ease?  And
we're talking here about people who have just begun to deal with TCP.
...For me especially, those nearly identical seqnos all around are too
overwhelming to track in any sane way, and I'd expect that most feel
the same way.

I'd state my point the other way around (with the terms you chose): it
is very difficult to become an expert without looking at some graphs;
we may disagree and that's fine :-).  I think it's because one would
then have no idea about the larger picture (and about the very
_relevant_ past/future) when looking at just a single line at a time
from the tcpdump (or equivalent).  The sad thing is that a good tool to
do the visualization might not exist.

--
 i.
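The send decision referred to above is essentially the
congestion-window test TCP output already makes; condensed into a
sketch modeled on the 2.6.24-era tcp_cwnd_test(), ignoring the
TCP_CLOSE and RTO special cases:

/* A new segment may go out iff data is queued and
 * packets_in_flight < snd_cwnd; this returns how many more
 * segments the congestion window currently allows. */
static inline unsigned int cwnd_quota(const struct tcp_sock *tp)
{
	u32 in_flight = tcp_packets_in_flight(tp);

	if (in_flight >= tp->snd_cwnd)
		return 0;

	return tp->snd_cwnd - in_flight;
}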