netdev.vger.kernel.org archive mirror
* occasionally corrupted network stats in /proc/net/dev
@ 2008-01-14 17:20 Mark Seger
  2008-01-14 17:38 ` Eric Dumazet
  2008-01-14 18:08 ` Ben Greear
  0 siblings, 2 replies; 9+ messages in thread
From: Mark Seger @ 2008-01-14 17:20 UTC (permalink / raw)
  To: netdev

I had posted the following on linux-net and haven't seen any responses,
possibly because nobody had any or because that list is obsolete.  I
have been told this is the current list for everything networking on
Linux, so I thought I'd try again...

I suspect the answer will be that it is what it is, but here's the
deal.  I have a tool I use for monitoring network traffic, among other
things - see http://collectl.sourceforge.net/ - and one of its benefits
is that you can run it continuously as a daemon (similar to sar) and
generate data in a format suitable for plotting.  This means that you
can automate your entire network monitoring infrastructure at fairly
fine granularity, down to one second if you like.  Actually, 1-second
monitoring will provide incorrect data on earlier kernels because the
stats aren't updated on 1-second boundaries, and you need to monitor at
an interval of 0.9765 seconds instead, but that's a different story,
which is explained at http://collectl.sourceforge.net/NetworkStats.html

But more importantly, I've found that occasionally (not that often)
there is bogus data reported from /proc/net/dev.  While I don't have a
lot of details on this, it seems to show up only on 64-bit kernels.
Look at the following samples taken at 1-second intervals:

 eth0:135115809 1024897    0    0    0     0          0         9 135458926  910340    0    0    0     0       0          0
 eth0:135118023 1024923    0    0    0     0          0         9 135460952  910363    0    0    0     0       0          0
 eth0:        0  884620    0    0    0     0          0    909397   9687563 1049736    0    0    0     0       0          0
 eth0:135121189 1024957    0    0    0     0          0         9 135464222  910400    0    0    0     0       0          0
 eth0:135129565 1024995    0    0    0     0          0         9 135473687  910435    0    0    0     0       0          0

See the middle sample?  When I look at the change between samples it
generates a really big number, since the difference is assumed to be
caused by a counter wrapping.  The problem is that bad data is not
always this easy to spot.  For example, if the original and bogus
values are close enough, it's not even clear there is a problem.
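
To illustrate the wrap assumption described above, here is a minimal
Python sketch. The helper names (parse_dev_line, wrapped_delta) are
hypothetical, and the 32-bit counter width is an assumption made only
for the example; it is not taken from collectl. It splits a
/proc/net/dev line into its 16 counters and computes a delta that
treats any decrease as a wrap.

```python
def parse_dev_line(line):
    """Split one /proc/net/dev interface line into (name, [counters])."""
    name, _, rest = line.partition(":")
    return name.strip(), [int(field) for field in rest.split()]


def wrapped_delta(prev, curr, bits=32):
    """Counter difference, treating a decrease as a single wrap.

    bits is the assumed counter width (an assumption for this sketch).
    """
    if curr >= prev:
        return curr - prev
    return curr + (1 << bits) - prev


# Second (good) and third (bogus) samples from the message above.
good = (" eth0:135118023 1024923    0    0    0     0          0         9 "
        "135460952  910363    0    0    0     0       0          0")
bad = (" eth0:        0  884620    0    0    0     0          0    909397   "
       "9687563 1049736    0    0    0     0       0          0")

_, prev = parse_dev_line(good)
_, curr = parse_dev_line(bad)

# Field 0 is received bytes; the bogus zero is read as a wrap.
rx_bytes_delta = wrapped_delta(prev[0], curr[0])
print(rx_bytes_delta)  # 4159849273
```

With these assumptions the bogus zero turns into roughly 4 GB received
in one second, which is the "really big number" effect.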

So the obvious question is, is there any way to prevent the bogus data
from getting reported?  If not, is there any way to set the values to
something that indicates the correct values can't be determined?
Clearly this problem would be visible to any tool that looks at /proc,
but since many tools are not automated or don't take it to the level I
do, probably nobody notices.  As for the counter update frequency, even
though the counters now appear to be updated closer to a 1-second
boundary, tools that monitor at sub-second intervals will still report
incorrect data, since the counters only change once a second.
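
One way a monitoring tool could guard against such samples on its own
side is a plausibility check: discard an interval whose byte delta
exceeds what the link could carry. The sketch below is only an
illustration. The function names, the 1 Gbit/s link speed and the
32-bit counter width are all assumptions, and it cannot catch bogus
values that happen to be close to the real ones.

```python
RX_BYTES, TX_BYTES = 0, 8  # positions among the 16 counters after "eth0:"


def counters(line):
    """Return the numeric fields of one /proc/net/dev interface line."""
    return [int(field) for field in line.partition(":")[2].split()]


def wrapped_delta(prev, curr, bits):
    """Counter difference, treating a decrease as a single wrap."""
    return curr - prev if curr >= prev else curr + (1 << bits) - prev


def is_suspicious(prev, curr, interval_s=1.0,
                  link_bits_per_s=1_000_000_000, counter_bits=32):
    """True if either byte counter moved more than the link allows.

    link_bits_per_s and counter_bits are assumed values for this sketch.
    """
    max_bytes = link_bits_per_s / 8 * interval_s
    return any(
        wrapped_delta(prev[i], curr[i], counter_bits) > max_bytes
        for i in (RX_BYTES, TX_BYTES)
    )


s1 = counters(" eth0:135115809 1024897 0 0 0 0 0 9 135458926 910340 0 0 0 0 0 0")
s2 = counters(" eth0:135118023 1024923 0 0 0 0 0 9 135460952 910363 0 0 0 0 0 0")
s3 = counters(" eth0:0 884620 0 0 0 0 0 909397 9687563 1049736 0 0 0 0 0 0")
s4 = counters(" eth0:135121189 1024957 0 0 0 0 0 9 135464222 910400 0 0 0 0 0 0")

print(is_suspicious(s1, s2))  # False
print(is_suspicious(s2, s3))  # True
print(is_suspicious(s3, s4))  # True
```

Note that both intervals touching the bogus sample are flagged, so a
tool using this approach would drop two intervals per bad reading.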

-mark



* Re: occasionally corrupted network stats in /proc/net/dev
  2008-01-14 17:20 occasionally corrupted network stats in /proc/net/dev Mark Seger
@ 2008-01-14 17:38 ` Eric Dumazet
  2008-01-14 18:08 ` Ben Greear
  1 sibling, 0 replies; 9+ messages in thread
From: Eric Dumazet @ 2008-01-14 17:38 UTC (permalink / raw)
  To: Mark Seger; +Cc: netdev

What is the NIC used for eth0 (and the driver name)?

Which version of the Linux kernel do you run?






* Re: occasionally corrupted network stats in /proc/net/dev
  2008-01-14 17:20 occasionally corrupted network stats in /proc/net/dev Mark Seger
  2008-01-14 17:38 ` Eric Dumazet
@ 2008-01-14 18:08 ` Ben Greear
  2008-01-14 18:24   ` Mark Seger
  1 sibling, 1 reply; 9+ messages in thread
From: Ben Greear @ 2008-01-14 18:08 UTC (permalink / raw)
  To: Mark Seger; +Cc: netdev

Mark Seger wrote:
> I had posted the following on linux-net and haven't see any responses 
> possibly because nobody had any or that list is obsolete.  I have been 
> told this is the current list for everything networking on linux so I 
> thought I'd try again...
Do you see this with multiple network drivers, or just with one
particular driver?  If just one, which one?

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com> 
Candela Technologies Inc  http://www.candelatech.com




* Re: occasionally corrupted network stats in /proc/net/dev
  2008-01-14 18:08 ` Ben Greear
@ 2008-01-14 18:24   ` Mark Seger
  2008-01-14 18:51     ` Mark Seger
  0 siblings, 1 reply; 9+ messages in thread
From: Mark Seger @ 2008-01-14 18:24 UTC (permalink / raw)
  To: Ben Greear; +Cc: netdev

I'll try to get data on the other systems reporting it.  As I said, it
does not happen all that often AND you have to be looking for it.  The
system I've personally seen it happen on several times is running
RHEL4/U4, whose kernel Red Hat numbers 2.6.9-42, and from modinfo I see:
version:        7.0.33-k2-NAPI 51E97FEE51D0772AFC89130
description:    Intel(R) PRO/1000 Network Driver

-mark




* Re: occasionally corrupted network stats in /proc/net/dev
  2008-01-14 18:24   ` Mark Seger
@ 2008-01-14 18:51     ` Mark Seger
  2008-01-14 19:01       ` Ben Greear
  2008-01-14 19:12       ` Eric Dumazet
  0 siblings, 2 replies; 9+ messages in thread
From: Mark Seger @ 2008-01-14 18:51 UTC (permalink / raw)
  To: Mark Seger; +Cc: Ben Greear, netdev

Ignore that last one.  It was pointed out to me that we have both NICs
installed on many of our systems, and ethtool told me the driver
associated with the NIC in question is actually the Broadcom one.

version:        1.4.38 E1B1EC867DEEB8027B2DA0F
license:        GPL
description:    Broadcom NetXtreme II BCM5706/5708 Driver
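
For scripted checks of which driver backs an interface, a minimal
Python sketch follows. It assumes the kernel exposes the
/sys/class/net/<iface>/device/driver symlink in sysfs; the function
name driver_for and the sysfs_root parameter are hypothetical, and this
is not how ethtool itself obtains the information.

```python
import os


def driver_for(iface, sysfs_root="/sys/class/net"):
    """Return the driver name bound to iface, or None if not available.

    Assumes <sysfs_root>/<iface>/device/driver is a symlink whose final
    path component is the driver name.
    """
    link = os.path.join(sysfs_root, iface, "device", "driver")
    if not os.path.islink(link):
        return None
    return os.path.basename(os.readlink(link))


# Prints the driver name for eth0, or None if the symlink is absent.
print(driver_for("eth0"))
```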

-mark




* Re: occasionally corrupted network stats in /proc/net/dev
  2008-01-14 18:51     ` Mark Seger
@ 2008-01-14 19:01       ` Ben Greear
  2008-01-14 19:12       ` Eric Dumazet
  1 sibling, 0 replies; 9+ messages in thread
From: Ben Greear @ 2008-01-14 19:01 UTC (permalink / raw)
  To: Mark Seger; +Cc: netdev

Mark Seger wrote:
> Ignore that last one as it was pointed out to me that we have both nic 
> installed on many of our systems and ethtool told me the one 
> associated with the nic is actually the broadcom one.
>
> version:        1.4.38 E1B1EC867DEEB8027B2DA0F
> license:        GPL
> description:    Broadcom NetXtreme II BCM5706/5708 Driver
OK, we do similar stats polling, though through a private ioctl I
hacked into the kernel to get the netdev->stats struct with a memcpy.
I haven't noticed any problems with counters in the e1000 driver.  I
haven't done enough testing on bcm drivers to ascertain whether they
are reliable with regard to stats.

If you can reproduce the problem with e1000, it would be worth looking
at the logic that prints out the proc interface text for problems, and
if you cannot, then maybe it's the bcm driver that is at issue.

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com> 
Candela Technologies Inc  http://www.candelatech.com




* Re: occasionally corrupted network stats in /proc/net/dev
  2008-01-14 18:51     ` Mark Seger
  2008-01-14 19:01       ` Ben Greear
@ 2008-01-14 19:12       ` Eric Dumazet
  2008-01-14 20:41         ` Michael Chan
  1 sibling, 1 reply; 9+ messages in thread
From: Eric Dumazet @ 2008-01-14 19:12 UTC (permalink / raw)
  To: Mark Seger; +Cc: Ben Greear, netdev, mchan

Mark Seger wrote:
> Ignore that last one as it was pointed out to me that we have both nic 
> installed on many of our systems and ethtool told me the one 
> associated with the nic is actually the broadcom one.
>
> version:        1.4.38 E1B1EC867DEEB8027B2DA0F
> license:        GPL
> description:    Broadcom NetXtreme II BCM5706/5708 Driver
>

I remember some tg3 chips actually have bugs when reporting stats
once in a while.

CCed Michael Chan to get some details.








* Re: occasionally corrupted network stats in /proc/net/dev
  2008-01-14 20:41         ` Michael Chan
@ 2008-01-14 20:05           ` Mark Seger
  0 siblings, 0 replies; 9+ messages in thread
From: Mark Seger @ 2008-01-14 20:05 UTC (permalink / raw)
  To: Michael Chan; +Cc: Eric Dumazet, Ben Greear, netdev

Outstanding!  I'm just happy to hear it's not a bug in my monitoring
code...  8-)
-mark

Michael Chan wrote:
> On Mon, 2008-01-14 at 20:12 +0100, Eric Dumazet wrote:
>> Mark Seger wrote:
>>> Ignore that last one as it was pointed out to me that we have both nic
>>> installed on many of our systems and ethtool told me the one
>>> associated with the nic is actually the broadcom one.
>>>
>>> version:        1.4.38 E1B1EC867DEEB8027B2DA0F
>>> license:        GPL
>>> description:    Broadcom NetXtreme II BCM5706/5708 Driver
>>>
>> I remember some tg3 chips actually have bugs when reporting stats....
>> once in a while
>>
>> CCed to Michael Chan to get some details.
>
> Yes, that's right.  Some BNX2 chips have this problem and we have a
> workaround:
>
> http://git.kernel.org/?p=linux/kernel/git/davem/net-2.6.git;a=commit;h=02537b0676930b1bd9aff2139e0e645c79986931
>
> The chip sometimes DMAs wrong counter values if the chip is also
> internally gathering the counters at the time of the DMA.
>
> Driver 1.5.11 and later versions have this workaround.



* Re: occasionally corrupted network stats in /proc/net/dev
  2008-01-14 19:12       ` Eric Dumazet
@ 2008-01-14 20:41         ` Michael Chan
  2008-01-14 20:05           ` Mark Seger
  0 siblings, 1 reply; 9+ messages in thread
From: Michael Chan @ 2008-01-14 20:41 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Mark Seger, Ben Greear, netdev

On Mon, 2008-01-14 at 20:12 +0100, Eric Dumazet wrote:
> Mark Seger wrote:
> > Ignore that last one as it was pointed out to me that we have both nic 
> > installed on many of our systems and ethtool told me the one 
> > associated with the nic is actually the broadcom one.
> >
> > version:        1.4.38 E1B1EC867DEEB8027B2DA0F
> > license:        GPL
> > description:    Broadcom NetXtreme II BCM5706/5708 Driver
> >
> 
> I remember some tg3 chips actually have bugs when reporting stats.... 
> once in a while
> 
> CCed to Michael Chan to get some details.

Yes, that's right.  Some BNX2 chips have this problem and we have a
workaround:

http://git.kernel.org/?p=linux/kernel/git/davem/net-2.6.git;a=commit;h=02537b0676930b1bd9aff2139e0e645c79986931

The chip sometimes DMAs wrong counter values if the chip is also
internally gathering the counters at the time of the DMA.

Driver 1.5.11 and later versions have this workaround.
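
To check a machine against that version threshold, a small Python
sketch can compare the version reported by modinfo with 1.5.11. The
function names are hypothetical; the sketch assumes the version string
is dotted integers, optionally followed by a hash as in the modinfo
output quoted earlier in the thread.

```python
def version_tuple(text):
    """Turn '1.4.38 E1B1EC...' into (1, 4, 38), ignoring the trailing hash."""
    return tuple(int(part) for part in text.split()[0].split("."))


def has_stats_workaround(driver_version, first_fixed="1.5.11"):
    """True if driver_version is at or above the first fixed version."""
    return version_tuple(driver_version) >= version_tuple(first_fixed)


# Version string as reported by modinfo on the affected system.
print(has_stats_workaround("1.4.38 E1B1EC867DEEB8027B2DA0F"))  # False
print(has_stats_workaround("1.5.11"))  # True
```

Comparing integer tuples rather than raw strings matters here: as text,
"1.5.9" would sort after "1.5.11".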



