netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: [PATCH]snmp6 64-bit counter support in proc.c
@ 2004-01-22 21:18 Krishna Kumar
  2004-01-22 22:10 ` David S. Miller
  0 siblings, 1 reply; 24+ messages in thread
From: Krishna Kumar @ 2004-01-22 21:18 UTC (permalink / raw)
  To: kuznet; +Cc: David S. Miller, kuznet, mashirle, netdev, netdev-bounce,
	Shirley Ma


[-- Attachment #1.1: Type: text/plain, Size: 3921 bytes --]





Alexei,

That is a good point you raised, we don't want to read the counter while
the writer
might change and overflow of one word can result in a really corrupt value.

If 64bit counters is a good idea to implement, what I find OK to do is to
penalize the
readers (proc filesystem interface or netlink) but make sure that the
writers don't get
penalized by being forced to serialize, in effect writers must run as fast
as if there were
no other readers. I could think of a hack to make that happen (hopefully
not too ugly :-) :

#if 64_bit_system
the old code is OK here.
#else
__u64 get_sync_data(void *mib[], int nr)
{
      __u64 res1, res2;
      __u64 res3;

      res1 = *((__u64 *) (((void *) per_cpu_ptr(mib[0], i)) + sizeof
(__u64) * nr)));
      synchronize_kernel();
      res2 = *((__u64 *) (((void *) per_cpu_ptr(mib[0], i)) + sizeof
(__u64) * nr)));
      if (res2 < res1) {
            / * Overflow, sync and re-read, the next read is guaranteed to
be greater */
            synchronize_kernel();
            res2 = *((__u64 *) (((void *) per_cpu_ptr(mib[0], i)) + sizeof
(__u64) * nr)));
      }

      /* similar code for mib[1], add both into res3

      return res3;
}
#endif

static __u64
 fold_field(void *mib[], int nr)
{
      ...
      res += get_sync_data(mib, nr);
      ...
}

The value can reduce only once every 4gig increments, which means that
reading res2 after the first
sync_kernel will be less than res1 very rarely (once a few days on a fast
ethernet card). In case res2
is less than res1, doing another sync_kernel and rereading of res2 is
guaranteed to return a value
greater than res1 because another 4Gig iterations of increments couldn't
happen in the time for one
context switch of all cpus. The sync_kernel is needed so that we don't read
the value faster than the
writer is updating the two words.

Does that sound realistic for implementing 64 bit counters ? Or do you have
better or simpler
suggestions ?

Thanks,

- KK



|---------+---------------------------->
|         |           kuznet@ms2.inr.ac|
|         |           .ru              |
|         |           Sent by:         |
|         |           netdev-bounce@oss|
|         |           .sgi.com         |
|         |                            |
|         |                            |
|         |           01/22/2004 10:26 |
|         |           AM               |
|         |                            |
|---------+---------------------------->
  >-----------------------------------------------------------------------------------------------------------------|
  |                                                                                                                 |
  |       To:       Shirley Ma/Beaverton/IBM@IBMUS                                                                  |
  |       cc:       davem@redhat.com (David S. Miller), mashirle@us.ltcfwd.linux.ibm.com, kuznet@ms2.inr.ac.ru,     |
  |        netdev@oss.sgi.com                                                                                       |
  |       Subject:  Re: [PATCH]snmp6 64-bit counter support in proc.c                                               |
  |                                                                                                                 |
  >-----------------------------------------------------------------------------------------------------------------|




Hello!

> Did you hear different voices?

Here is a little warning. It will give corrupt values on 32 bit archs
when update with 32 bit overflow happens while value is folded.

To do 64 bit arithmetics you need either to serialize reader wrt writer
or to do some funny tricks with detecting overflows while reading and
special sequence of operations at update with proper barriers, which
will be reflected in performance anyway. Essentially, this haemorhoids
is the reason why they stayed 32 bit.

Alexey



[-- Attachment #1.2: Type: text/html, Size: 3992 bytes --]

[-- Attachment #2: pic20320.gif --]
[-- Type: image/gif, Size: 1255 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: [PATCH]snmp6 64-bit counter support in proc.c
@ 2004-01-28 20:15 Shirley Ma
  0 siblings, 0 replies; 24+ messages in thread
From: Shirley Ma @ 2004-01-28 20:15 UTC (permalink / raw)
  To: Shirley Ma; +Cc: David S. Miller, Krishna Kumar, kuznet, mashirle, netdev

[-- Attachment #1: Type: text/plain, Size: 249 bytes --]





This solution doesn't favor writers also. How about do some udelay in the
reader to avoid modifing the writers?

Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone: (503) 578-7638
FAX:      (503) 578-3228

[-- Attachment #2: Type: text/html, Size: 308 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: [PATCH]snmp6 64-bit counter support in proc.c
@ 2004-01-28 20:09 Shirley Ma
  0 siblings, 0 replies; 24+ messages in thread
From: Shirley Ma @ 2004-01-28 20:09 UTC (permalink / raw)
  To: David S. Miller; +Cc: Krishna Kumar, kuznet, mashirle, netdev

[-- Attachment #1: Type: text/plain, Size: 523 bytes --]





> Perhaps there are some better ideas?

Yes. Dave, could we create a new interface to avoid the spin_lock?

Example 1:
modify write_seqlock(set_lock *s) to write_seqlock(seq_lock *s, int lock),
it will modify all calling routines.

Example 2:
write a new interface write_seqlock_percpu(), which gets rid of the
spin_lock, we can call by a different name, since there is no lock.

Thanks
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone: (503) 578-7638
FAX:      (503) 578-3228

[-- Attachment #2: Type: text/html, Size: 655 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: [PATCH]snmp6 64-bit counter support in proc.c
@ 2004-01-28 19:19 Krishna Kumar
  2004-01-28 19:33 ` David S. Miller
  0 siblings, 1 reply; 24+ messages in thread
From: Krishna Kumar @ 2004-01-28 19:19 UTC (permalink / raw)
  To: David S. Miller; +Cc: kuznet, mashirle, netdev, Shirley Ma


[-- Attachment #1.1: Type: text/plain, Size: 3011 bytes --]





Hi Dave,

> We can overflow the lower-half TWICE in the res1=...res2=... sequence
> you do, and this res2 reread in the if branch can overflow itself again,

The snippet assumed that the number will not overflow 4G times twice within
two reads, which I thought was quite a reasonable assumption.

I think a fast and clean solution is using Seq locks, which is how
jiffies_64
is implemented on 32 bit systems. The writers always get the spin lock (can
be per cpu with zero contention/thrashing) and increment the 'seq' number
of
seqlock data struct (all inside write_seqlock), then modify the 64 bit
counter,
and release the lock which increases the 'seq' again. The readers don't get
a lock at all, they just read the 'seq' before and after the reading of the
64 bit
counter, and if it changed, it needs to repeat this cycle of reads. In this
case,
no comparison of high/low order words are also needed. Does that sound
better ?

thanks,

- KK



|---------+---------------------------->
|         |           "David S. Miller"|
|         |           <davem@redhat.com|
|         |           >                |
|         |                            |
|         |           01/28/2004 11:09 |
|         |           AM               |
|         |                            |
|---------+---------------------------->
  >-----------------------------------------------------------------------------------------------------------------|
  |                                                                                                                 |
  |       To:       Krishna Kumar/Beaverton/IBM@IBMUS                                                               |
  |       cc:       kuznet@ms2.inr.ac.ru, mashirle@us.ltcfwd.linux.ibm.com, netdev@oss.sgi.com, Shirley             |
  |        Ma/Beaverton/IBM@IBMUS                                                                                   |
  |       Subject:  Re: [PATCH]snmp6 64-bit counter support in proc.c                                               |
  |                                                                                                                 |
  >-----------------------------------------------------------------------------------------------------------------|




On Thu, 22 Jan 2004 18:45:18 -0800
Krishna Kumar <kumarkr@us.ibm.com> wrote:

>             / * Overflow, sync and re-read, the next read is guaranteed
to
> be greater
>              * than res1.
>              */
>             synchronize_kernel();
>             res2 = *((__u64 *) (((void *) per_cpu_ptr(mib[0], i)) +
sizeof
> (__u64) * nr)));

I don't understand how your scheme can work.

We're trying to detect if the upper 32-bit half of the 64-bit value
belongs with the lower-half.  We can overflow the lower-half TWICE
in the res1=...res2=... sequence you do, and this res2 reread in the if
branch can overflow itself again, combining a lower and upper half which
do not belong to each other.

This is not so trivial to fix, see? :)


[-- Attachment #1.2: Type: text/html, Size: 3248 bytes --]

[-- Attachment #2: pic14735.gif --]
[-- Type: image/gif, Size: 1255 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: [PATCH]snmp6 64-bit counter support in proc.c
@ 2004-01-23 18:06 Shirley Ma
  0 siblings, 0 replies; 24+ messages in thread
From: Shirley Ma @ 2004-01-23 18:06 UTC (permalink / raw)
  To: David S. Miller; +Cc: Krishna Kumar, kuznet, mashirle, netdev

[-- Attachment #1: Type: text/plain, Size: 253 bytes --]





> correctness > performance

OK, I am fixing the reader and will resubmit it when next release is
available.

Thanks
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone: (503) 578-7638
FAX:      (503) 578-3228

[-- Attachment #2: Type: text/html, Size: 330 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: [PATCH]snmp6 64-bit counter support in proc.c
@ 2004-01-23  2:57 Krishna Kumar
  0 siblings, 0 replies; 24+ messages in thread
From: Krishna Kumar @ 2004-01-23  2:57 UTC (permalink / raw)
  To: David S. Miller; +Cc: kuznet, mashirle, netdev, Shirley Ma

[-- Attachment #1: Type: text/plain, Size: 259 bytes --]






> correctness > performance

I agree with this statement, if every bug is explained as a "very tiny
race"
leads to one large defective product :-)

and performance also need not be hit much if writers are not penalized as I
had suggested earlier...

- KK

[-- Attachment #2: Type: text/html, Size: 535 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: [PATCH]snmp6 64-bit counter support in proc.c
@ 2004-01-23  2:45 Krishna Kumar
  2004-01-28 19:09 ` David S. Miller
  0 siblings, 1 reply; 24+ messages in thread
From: Krishna Kumar @ 2004-01-23  2:45 UTC (permalink / raw)
  To: David S. Miller; +Cc: kuznet, mashirle, netdev, Shirley Ma

[-- Attachment #1: Type: text/plain, Size: 1600 bytes --]





Hi dave,

> > Do you really care about this situation? It only happens as a race
within
> > one instruction in 4 billion. It will slow everytime if we do this way.

> correctness > performance

> If MRTG hits this case and my net usage graphs have weird spikes
> in them as a result, can I assign the bugzilla entry to you? :-)

If so, do you think the solution that I proposed earlier would work ? No
doubt it is quite ugly to behold :-)
The one issue with it is one extra cache loading in 99.999999999% of cases
(2 instead of 1) and two
extra loading the remaining % of cases, but this is executed rarely and the
user can always wait for
data instead of penalizing the kernel writers. (synchronize_kernel is also
a little heavy, so maybe there
is a more lightweight mechanism to make sure that writer is finished).

thanks,

- KK

__u64 get_sync_data(void *mib[], int nr)
{
      __u64 res1, res2;
      __u64 res3;

      res1 = *((__u64 *) (((void *) per_cpu_ptr(mib[0], i)) + sizeof
(__u64) * nr)));
      synchronize_kernel();
      res2 = *((__u64 *) (((void *) per_cpu_ptr(mib[0], i)) + sizeof
(__u64) * nr)));
      if (res2 < res1) {
            / * Overflow, sync and re-read, the next read is guaranteed to
be greater
             * than res1.
             */
            synchronize_kernel();
            res2 = *((__u64 *) (((void *) per_cpu_ptr(mib[0], i)) + sizeof
(__u64) * nr)));
      }

      /* similar code for mib[1], add both into res3

      return res3;
}
#endif

static __u64
 fold_field(void *mib[], int nr)
{
      ...
      res += get_sync_data(mib, nr);
      ...
}

[-- Attachment #2: Type: text/html, Size: 1867 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: [PATCH]snmp6 64-bit counter support in proc.c
@ 2004-01-23  1:08 Shirley Ma
  2004-01-23  1:43 ` David S. Miller
  0 siblings, 1 reply; 24+ messages in thread
From: Shirley Ma @ 2004-01-23  1:08 UTC (permalink / raw)
  To: David S. Miller; +Cc: Krishna Kumar, kuznet, mashirle, netdev

[-- Attachment #1: Type: text/plain, Size: 291 bytes --]





Do you really care about this situation? It only happens as a race within
one instruction in 4 billion. It will slow everytime if we do this way.

Thanks
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone: (503) 578-7638
FAX:      (503) 578-3228

[-- Attachment #2: Type: text/html, Size: 359 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: [PATCH]snmp6 64-bit counter support in proc.c
@ 2004-01-22 22:50 Krishna Kumar
  2004-01-23  0:35 ` David S. Miller
  0 siblings, 1 reply; 24+ messages in thread
From: Krishna Kumar @ 2004-01-22 22:50 UTC (permalink / raw)
  To: David S. Miller; +Cc: kuznet, mashirle, netdev, Shirley Ma


[-- Attachment #1.1: Type: text/plain, Size: 2814 bytes --]





Dave,

Isn't memory barrier used to stop re-ordering of instructions and needs to
be present
in both reader as well as writer to have synchronization since mb is for
the CPU on
which it is executing ? In this case : suppose stat is getting incremented
from
00000000 FFFFFFFF to 00000001 00000000, and stat was read after the low
word was
incremented to 0 (with overflow set), then low1 = 0 and low2 can get
executed before the
low2 is incremented on the other processor, so low2 is still zero. We
return zero, when
the value should be 4G. That why I felt that we need to read second after
making sure
the writer is over, that doesn't assume that writer is faster than reader
and works in all
cases.

Am I wrong here ?

thanks,

- KK



|---------+---------------------------->
|         |           "David S. Miller"|
|         |           <davem@redhat.com|
|         |           >                |
|         |           Sent by:         |
|         |           netdev-bounce@oss|
|         |           .sgi.com         |
|         |                            |
|         |                            |
|         |           01/22/2004 02:10 |
|         |           PM               |
|         |                            |
|---------+---------------------------->
  >-----------------------------------------------------------------------------------------------------------------|
  |                                                                                                                 |
  |       To:       Krishna Kumar/Beaverton/IBM@IBMUS                                                               |
  |       cc:       kuznet@ms2.inr.ac.ru, mashirle@us.ltcfwd.linux.ibm.com, netdev@oss.sgi.com,                     |
  |        netdev-bounce@oss.sgi.com, Shirley Ma/Beaverton/IBM@IBMUS                                                |
  |       Subject:  Re: [PATCH]snmp6 64-bit counter support in proc.c                                               |
  |                                                                                                                 |
  >-----------------------------------------------------------------------------------------------------------------|





The most portable and simple algorithm to solve this on the reader
side is (and I recommend we don't special case this on 64-bit platforms
just to get wider testing):

             u32 high, low1, low2;

             do {
                         low1 = stat & 0xffffffff;
                         rmb();
                         high = stat >> 32;
                         rmb();
                         low2 = stat & 0xffffffff;
             } while (low2 < low1);

Something like that.  The idea is to sample the lower 32-bit twice
and if it overflows resample both high and low halfs.



[-- Attachment #1.2: Type: text/html, Size: 2808 bytes --]

[-- Attachment #2: pic24167.gif --]
[-- Type: image/gif, Size: 1255 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: [PATCH]snmp6 64-bit counter support in proc.c
@ 2004-01-21 23:44 Shirley Ma
  0 siblings, 0 replies; 24+ messages in thread
From: Shirley Ma @ 2004-01-21 23:44 UTC (permalink / raw)
  To: YOSHIFUJI Hideaki / 吉藤英明
  Cc: davem, mashirle, kuznet, netdev, yoshfuji

[-- Attachment #1: Type: text/plain, Size: 181 bytes --]





Thanks for the input. I will fix it.

Thanks
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone: (503) 578-7638
FAX:      (503) 578-3228

[-- Attachment #2: Type: text/html, Size: 244 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: [PATCH]snmp6 64-bit counter support in proc.c
@ 2004-01-21 19:45 Shirley Ma
  2004-01-21 20:27 ` YOSHIFUJI Hideaki / 吉藤英明
                   ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Shirley Ma @ 2004-01-21 19:45 UTC (permalink / raw)
  To: David S. Miller; +Cc: mashirle, kuznet, netdev

[-- Attachment #1: Type: text/plain, Size: 474 bytes --]





Hi, Dave,

> So I'm going to let this sit for another day or two so people can voice
any
objections they may have.

Did you hear different voices? If not could you please check in this patch?
I am going to submit another patch about new IPv6 MIBs system & interface
statistics counters for your review, which depends on this one.

Thanks
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone: (503) 578-7638
FAX:      (503) 578-3228

[-- Attachment #2: Type: text/html, Size: 597 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: [PATCH]snmp6 64-bit counter support in proc.c
@ 2003-12-05 21:55 Shirley Ma
  0 siblings, 0 replies; 24+ messages in thread
From: Shirley Ma @ 2003-12-05 21:55 UTC (permalink / raw)
  To: David S. Miller; +Cc: mashirle, kuznet, netdev





Thanks, Dave. When I created the patch, I was concern about this.
I will change the MIBs counters to 'u32' or 'u64'.

Thanks
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone: (503) 578-7638
FAX:      (503) 578-3228



"David S. Miller" <davem@redhat.com> on 12/05/2003 12:31:33 PM

To:    mashirle@us.ltcfwd.linux.ibm.com
cc:    kuznet@ms2.inr.ac.ru, netdev@oss.sgi.com, Shirley
       Ma/Beaverton/IBM@IBMUS
Subject:    Re: [PATCH]snmp6 64-bit counter support in proc.c


On Fri, 5 Dec 2003 12:14:47 -0800
Shirley Ma <mashirle@us.ibm.com> wrote:

> I add this because there are some 64-bit counters in the new IPv6 MIBs.
> This patch has been tested agaist linux-2.6.0-test9, and cleanly applied
to
> linux-2.6.0-test11.

"sizeof(unsigned long)" evaluates to 8 on 64-bit systems,
yet you assume it always evaluated to 4 as on 32-bit systems.

Maybe it would be wiser to explicitly use 'u32' and 'u64' for
the types of the snmp counters?

This has always been a sore area.

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: IPv6 MIB:ipv6PrefixTable implementation
@ 2003-12-02 12:40 kuznet
  2003-12-05 20:14 ` [PATCH]snmp6 64-bit counter support in proc.c Shirley Ma
  0 siblings, 1 reply; 24+ messages in thread
From: kuznet @ 2003-12-02 12:40 UTC (permalink / raw)
  To: Shirley Ma; +Cc: netdev, xma

Hello!

> One implementation detail question, do you think I need to save all the o=
> ther=20
> Prefix Objects: Type, Origin(addrconf, manually, dhcp, others),

These two things should be stored right now. Existing implementation is
quite a mess, but we definitely want to remember origin of each route
in "protocol" and another flags, they are of common interest.


> AutonomoueFlag, AdvPreferredLiftTime and ValidLifeTime

ValidLifeTime is "expires" on this route.

What's about AdvPreferredLiftTime I am puzzled a little,
preferred time is not an attribute of a prefix at all,
it is attribute of address, is not it? Unless the prefix is used
to install local address it does not make sense, right?

Alexey

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2004-01-28 20:15 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2004-01-22 21:18 [PATCH]snmp6 64-bit counter support in proc.c Krishna Kumar
2004-01-22 22:10 ` David S. Miller
  -- strict thread matches above, loose matches on Subject: below --
2004-01-28 20:15 Shirley Ma
2004-01-28 20:09 Shirley Ma
2004-01-28 19:19 Krishna Kumar
2004-01-28 19:33 ` David S. Miller
2004-01-23 18:06 Shirley Ma
2004-01-23  2:57 Krishna Kumar
2004-01-23  2:45 Krishna Kumar
2004-01-28 19:09 ` David S. Miller
2004-01-23  1:08 Shirley Ma
2004-01-23  1:43 ` David S. Miller
2004-01-22 22:50 Krishna Kumar
2004-01-23  0:35 ` David S. Miller
2004-01-21 23:44 Shirley Ma
2004-01-21 19:45 Shirley Ma
2004-01-21 20:27 ` YOSHIFUJI Hideaki / 吉藤英明
2004-01-21 22:05 ` David S. Miller
2004-01-22 18:26 ` kuznet
2003-12-05 21:55 Shirley Ma
2003-12-02 12:40 IPv6 MIB:ipv6PrefixTable implementation kuznet
2003-12-05 20:14 ` [PATCH]snmp6 64-bit counter support in proc.c Shirley Ma
2003-12-05 20:31   ` David S. Miller
2004-01-14 22:52     ` Shirley Ma
2004-01-15  8:57       ` David S. Miller

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).