Re: A top 10 statistics module?

All of lore.kernel.org
 help / color / mirror / Atom feed

* Re: A top 10 statistics module?
       [not found] <200504201804.j3KI4WJ19340@isis.cs3-inc.com>
@ 2005-04-21  5:25 ` Wang Jian
       [not found]   ` <16999.19485.501493.784034@isis.cs3-inc.com>
  0 siblings, 1 reply; 5+ messages in thread
From: Wang Jian @ 2005-04-21  5:25 UTC (permalink / raw)
  To: netfilter-devel

Hi Don Cohen,


On Wed, 20 Apr 2005 11:04:32 -0700, don-nfil1@isis.cs3-inc.com (Don Cohen) wrote:

> [not sent to list - that tends to lead to spam
>  so feel free to post contents but please leave my address out of it]
> 
>   I have a customer who needs functionality that list top 10 hosts 
> Top 10 in what sense?  Those that send the most packets?  Those that
> have the most connections?

For tc class, this usually means bytes sent.

> 
> My first suggestion is that this be generalized to allow specification
> of the key, i.e., you should be able to collect statistics on TOS or
> ip length, etc.
> This begins to lead to the slippery slope of more and more general
> languages, e.g., you might want to classify by ranges or combine
> several fields (classify by TOS + protocol ...).
> 

Yes, usually a solution for a special purpose can be generalized into
common solution lately.

> My second is to forget about the 10 - just classify and count the
> packets (maybe also bytes?) in each class.  Then let the user throw
> out the data he doesn't want.
> 

I know your meaning. But it is not enough in this case.

Actually, Getting this class traffic out can be done in user space by
re-interpreting iptables filters (which set nfmark) in tcpdump filter
syntax. And then, analysis of tcpdump output can give out top 10. But,
this is a little complicated for automation and integration.

And, the kernel has done this once. Do it again in userspace seems to be
a waste?

And, when you want to investigate two or more classes at the same time,
you have more trouble to do it.

>   (listener or talker). It normally done in userspace, but in this case,
> 
> I'm imagining here that you want to classify by source address and
> separately by destination address.  So if you really want to add the
> two together then you'd have two instances of the kernel module, and
> in user space you'd read the results for each, sort them and then add
> them together.
> 

No. A TOPHOST rule can specify how to classify, by source or by
destination address.

>   the 10 hosts is for a tc class. Moreover, it is expected that 2 or more
>   tc classes' top 10 are collected at the same time.
> 
>   So I think this is better handled in kernel space, because the classid
>   and/or nfmark is only seen in kernel space.
> 
> It wasn't obvious that you cared about class or nfmark.
> Of course, these could be recomputed (in most cases) in user space.
> 

It can be done in user space. But not convenient.

>   The idea is that a rule like
> 
>    -m mark --mark 0x1 -j TOPHOST --count 10 --name FILENAME
> 
> It came as a surprise to me that you would want to apply the
> statistics counter to only those packets that match some other
> pattern as above.  I'm curious about why this is required.
> If it's not then it seems to me that it would be easier to
> process tcpdump files or perhaps read raw sockets.
> 

This is definitely the original requirement. It can be done in user
space, but not convenient for automation.

>   will collect top10 IPs (using conntrack flow account) and export the
> I don't see what this has to do with conntrack.
> (Perhaps it has more if you only care about IP addresses.)

You catch me.

Forget this part. I wrote the mail when I was thinking. That was part of
my thought but it is wrong. I didn't delete it before I hit send.

> 
>   information under /proc/net/stat/top10/FILENAME based on the source
>   address. (You may need add -i to indicate the direction)
> 
>   Of course, the top10 can be used to match any other criteria beside the
>   nfmark. It can even collect top10 of all traffic.
> 
>   Top10 is used to monitor a while and then disabled. It could be expensive,
>   but is useful to investigate.
> 
>   I will implement it anyway to complete the task, but before I code, I am
>   willing to listen to any one who has comment and suggestion.
> ====
>   To be generally useful, the table must be bounded.  That will result in
>   inaccurate order statistics, but often that doesn't matter much, if the
>   table is an order of magnitude or two larger than the required order
>   statistic, (i.e., 100-1000 entries for estimating the top 10).
> 
> One question is whether you need the results in real time.
> If not, then the limited space seems not so important.
> 

It should be in realtime.

> For cases where there could be a very large number of entries you
> could do something like this:
> hash values to an index, for each index store the first value that
> hashes to that index, then one counter for that value and one counter
> for all others that hash to the same index.
> This loses some possibly valuable data, of course.  But you can change
> the hash at each time interval so that the same two values are
> unlikely to continually interfere.

This, may be good in overhead, but the counter is meaningless then.
Alghouth the top10 is not accurate, the counter should make sense.


-- 
  lark

^ permalink raw reply	[flat|nested] 5+ messages in thread

[parent not found: <16999.19485.501493.784034@isis.cs3-inc.com>]

* Re: A top 10 statistics module?
       [not found]   ` <16999.19485.501493.784034@isis.cs3-inc.com>
@ 2005-04-21  7:12     ` Wang Jian
  0 siblings, 0 replies; 5+ messages in thread
From: Wang Jian @ 2005-04-21  7:12 UTC (permalink / raw)
  To: netfilter-devel

Hi Don Cohen,


On Wed, 20 Apr 2005 23:45:49 -0700, don-nfil1@isis.cs3-inc.com (Don Cohen) wrote:

> Do you mind if I ask exactly what statistics you really want to
> collect?  I actually have a module that does something very similar.
> If, for instance, you wanted to know which of the IP addresses in 
> your network was used as source or destination address for how many
> packets/bytes over some interval it would tell you.  But it only
> classifies packets into a fixed number of sets, so you could find
> out for IP packets or UDP but not for those to port 80.
> 
>  > And, the kernel has done this once. Do it again in userspace seems to be
>  > a waste?
> First, the kernel need not have done it, second, even if it has,
> this is not so bad if the machine is wasting lots of cycles anyhow.
> 

For my original purpose, the kernel have done it (set nfmark). The
rules that match and set nfmark can be from one to many.

>  > It can be done in user space. But not convenient.
> Lots of things are much more convenient in user space.
> 

I know programming in user space is much simpler. But for this case, it
is not. Kernel space has information that user space should use a lot of
code to get. Then there is a duplication effort here.


>  > This is definitely the original requirement. It can be done in user
>  > space, but not convenient for automation.
> Is this somehow related to the fact that you're specifically trying
> to check things that you already classify in netfilter?
> Otherwise I don't see that automation is easier in the kernel.
> In fact you need something in user space to collect the data from
> the proc file anyhow.

Yes. netfilter rules classify tc class. then, we need to see who is the
top host in this class.

Attach another rule to collect top hosts, and then cat
/proc/net/stat/tophost/classN, we get it.

> 
>  > It should be in realtime.
> The module I have actually does export to /proc where a user space
> program then collects the data.
> 
>  > This, may be good in overhead, but the counter is meaningless then.
>  > Alghouth the top10 is not accurate, the counter should make sense.
> I think the counter does make sense.

the collision may render the counter meaningless when ordering, and it
may be misleading and leads to wrong conclusion.

> Of course, there's no problem at all if you're counting things that
> have relatively small numbers of classes, such as the machines in
> your network.  And my impression is that this is what you want.

Yes, it's what my customer wants to get.

-- 
  lark

^ permalink raw reply	[flat|nested] 5+ messages in thread

* A top 10 statistics module?
@ 2005-04-20 12:40 Wang Jian
  2005-04-20 14:52 ` Bill Rugolsky Jr.
  0 siblings, 1 reply; 5+ messages in thread
From: Wang Jian @ 2005-04-20 12:40 UTC (permalink / raw)
  To: netfilter-devel

Hi,

I have a customer who needs functionality that list top 10 hosts 
(listener or talker). It normally done in userspace, but in this case,
the 10 hosts is for a tc class. Moreover, it is expected that 2 or more
tc classes' top 10 are collected at the same time.

So I think this is better handled in kernel space, because the classid
and/or nfmark is only seen in kernel space.

The idea is that a rule like

 -m mark --mark 0x1 -j TOPHOST --count 10 --name FILENAME

will collect top10 IPs (using conntrack flow account) and export the
information under /proc/net/stat/top10/FILENAME based on the source
address. (You may need add -i to indicate the direction)

Of course, the top10 can be used to match any other criteria beside the
nfmark. It can even collect top10 of all traffic.

Top10 is used to monitor a while and then disabled. It could be expensive,
but is useful to investigate.

I will implement it anyway to complete the task, but before I code, I am
willing to listen to any one who has comment and suggestion.

-- 
  lark

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: A top 10 statistics module?
  2005-04-20 12:40 Wang Jian
@ 2005-04-20 14:52 ` Bill Rugolsky Jr.
  2005-04-20 16:14   ` Wang Jian
  0 siblings, 1 reply; 5+ messages in thread
From: Bill Rugolsky Jr. @ 2005-04-20 14:52 UTC (permalink / raw)
  To: Wang Jian; +Cc: netfilter-devel

On Wed, Apr 20, 2005 at 08:40:20PM +0800, Wang Jian wrote:
> Top10 is used to monitor a while and then disabled. It could be expensive,
> but is useful to investigate.
> 
> I will implement it anyway to complete the task, but before I code, I am
> willing to listen to any one who has comment and suggestion.

To be generally useful, the table must be bounded.  That will result in
inaccurate order statistics, but often that doesn't matter much, if the
table is an order of magnitude or two larger than the required order
statistic, (i.e., 100-1000 entries for estimating the top 10).

Given that the table must be bounded, there needs to be a replacement
policy once it fills up.  The best choice of replacement strategy of
course depends on the distribution of new entries; the problem is similar
to that of the code table management in certain data compression algorithms.
Heuristics and data structures tend to be somewhat intimate, as fast
updates are required.

A common heuristic is to prioritize the entries via a scaled frequency
that decays with a time (or packet count, etc.) scaling constant, and
replace the lowest frequency members in the table with the new entries.
This heuristic lends itself to being implemented with a heap-based
priority queue whose top is the next entry to be expired.
Complexity: O(log N)

Other heuristics derive from various queuing models, the simplest
being "move-to-front" -- every time an entry is referenced, it is moved
to the front of the active list; entries are expired from the
tail of the list. Complexity: O(1)

In the distant past I've used the scaled frequency heuristic to good
effect on long time-series data.

Regards,

	Bill Rugolsky

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: A top 10 statistics module?
  2005-04-20 14:52 ` Bill Rugolsky Jr.
@ 2005-04-20 16:14   ` Wang Jian
  0 siblings, 0 replies; 5+ messages in thread
From: Wang Jian @ 2005-04-20 16:14 UTC (permalink / raw)
  To: Bill Rugolsky Jr.; +Cc: netfilter-devel

Hi Bill Rugolsky Jr.,

Thanks for your input.

On Wed, 20 Apr 2005 10:52:35 -0400, "Bill Rugolsky Jr." <brugolsky@telemetry-investments.com> wrote:

> On Wed, Apr 20, 2005 at 08:40:20PM +0800, Wang Jian wrote:
> > Top10 is used to monitor a while and then disabled. It could be expensive,
> > but is useful to investigate.
> > 
> > I will implement it anyway to complete the task, but before I code, I am
> > willing to listen to any one who has comment and suggestion.
> 
> To be generally useful, the table must be bounded.  That will result in
> inaccurate order statistics, but often that doesn't matter much, if the
> table is an order of magnitude or two larger than the required order
> statistic, (i.e., 100-1000 entries for estimating the top 10).
> 
> Given that the table must be bounded, there needs to be a replacement
> policy once it fills up.  The best choice of replacement strategy of
> course depends on the distribution of new entries; the problem is similar
> to that of the code table management in certain data compression algorithms.
> Heuristics and data structures tend to be somewhat intimate, as fast
> updates are required.
> 
> A common heuristic is to prioritize the entries via a scaled frequency
> that decays with a time (or packet count, etc.) scaling constant, and
> replace the lowest frequency members in the table with the new entries.
> This heuristic lends itself to being implemented with a heap-based
> priority queue whose top is the next entry to be expired.
> Complexity: O(log N)

This is the first thing appears in my mind :)

> 
> Other heuristics derive from various queuing models, the simplest
> being "move-to-front" -- every time an entry is referenced, it is moved
> to the front of the active list; entries are expired from the
> tail of the list. Complexity: O(1)

This is very easy to implement. I will try this first :)

> 
> In the distant past I've used the scaled frequency heuristic to good
> effect on long time-series data.
> 
> Regards,
> 
> 	Bill Rugolsky



-- 
  lark

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2005-04-21  7:12 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <200504201804.j3KI4WJ19340@isis.cs3-inc.com>
2005-04-21  5:25 ` A top 10 statistics module? Wang Jian
     [not found]   ` <16999.19485.501493.784034@isis.cs3-inc.com>
2005-04-21  7:12     ` Wang Jian
2005-04-20 12:40 Wang Jian
2005-04-20 14:52 ` Bill Rugolsky Jr.
2005-04-20 16:14   ` Wang Jian

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.