do_gettimeofday

netdev.vger.kernel.org archive mirror

* do_gettimeofday
@ 2003-10-02 18:32 Steve Modica
  2003-10-02 19:29 ` do_gettimeofday Andi Kleen
  2003-10-02 19:56 ` do_gettimeofday Stephen Hemminger
  0 siblings, 2 replies; 12+ messages in thread
From: Steve Modica @ 2003-10-02 18:32 UTC (permalink / raw)
  To: netdev

We've been doing some experiments here with large numbers of adapters on a 64p Linux system. 

When running 8 threads and 8 cpus, the do_gettimeofday code starts to use a lot of time.

It turns out that if a driver does not timestamp an incoming packet, the upper layer will stamp it for you in
PSCHED_GET_TIME(stamp).  What happens then is multiple cpus start fighting over the cacheline for the system clock and things get bad. 

One possible solution to this is to have the driver do the stamp using xtime. A number of ATM drivers do this now. In testing, it helps a lot.

Here's a sample diff for the tg3.c driver:

===========================================================================
Index: linux/linux/drivers/net/tg3.c
===========================================================================

--- /usr/tmp/TmpDir.8948-0/linux/linux/drivers/net/tg3.c_1.23   Thu Oct  2 13:30:21 2003
+++ linux/linux/drivers/net/tg3.c       Wed Oct  1 14:27:54 2003
@@ -2019,6 +2019,7 @@
                        skb->ip_summed = CHECKSUM_NONE;

                skb->protocol = eth_type_trans(skb, tp->dev);
+               skb->stamp = xtime;
 #if TG3_VLAN_TAG_USED
                if (tp->vlgrp != NULL &&
                    desc->type_flags & RXD_FLAG_VLAN) {

It's been suggested that we make this tuneable so it's easy to enable and disable. There was concern as to whether xtime would be accurate enough for all possible uses of ->stamp.  

Does anyone have any comments on this?

Regards!
Steve

-- 
Steve Modica
work: 651-683-3224
mobile: 651-261-3201
MTS-Technical Lead
"Give a man a fish, and he will eat for a day, hit him with a fish and
he leaves you alone" - me


* Re: do_gettimeofday
  2003-10-02 18:32 do_gettimeofday Steve Modica
@ 2003-10-02 19:29 ` Andi Kleen
  2003-10-02 19:56 ` do_gettimeofday Stephen Hemminger
  1 sibling, 0 replies; 12+ messages in thread
From: Andi Kleen @ 2003-10-02 19:29 UTC (permalink / raw)
  To: Steve Modica; +Cc: netdev

On Thu, Oct 02, 2003 at 01:32:27PM -0500, Steve Modica wrote:
> We've been doing some experiments here with large numbers of adapters on a 
> 64p Linux system. 
> When running 8 threads and 8 cpus, the do_gettimeofday code starts to use a 
> lot of time.

That's a known problem. The funny thing is that the only users
of this time stamp are SO_TIMESTAMP, which is rarely used (except by tcpdump),
and something in DECnet. IMHO the right fix is to add a global
counter that counts all sockets that use SO_TIMESTAMP and, when it's
zero, never call do_gettimeofday(). DECnet could probably be fixed to just
use jiffies like TCP does.

Drawback is that when you enable SO_TIMESTAMP there is a small time window
when the packets are not time stamped yet. The socket layer can just fill
in the current time though.
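
The counter idea above can be sketched roughly as follows. This is user-space C rather than kernel code, and every name in it (timestamp_users, struct packet, packet_receive() and so on) is hypothetical; gettimeofday() stands in for do_gettimeofday(), and a zero tv_sec is assumed to mean "not stamped".

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/time.h>

/* Number of sockets that currently have SO_TIMESTAMP enabled (hypothetical). */
static atomic_int timestamp_users;

struct packet {
	struct timeval stamp;	/* tv_sec == 0 is taken to mean "not stamped" */
};

/* Called when a socket turns SO_TIMESTAMP on. */
static void timestamp_enable(void)
{
	atomic_fetch_add(&timestamp_users, 1);
}

/* Called when a socket turns SO_TIMESTAMP off or is closed. */
static void timestamp_disable(void)
{
	int old = atomic_fetch_sub(&timestamp_users, 1);

	assert(old > 0);	/* must pair with an earlier enable */
	(void)old;
}

/* Receive path: only pay for the clock read when somebody wants it. */
static void packet_receive(struct packet *p)
{
	if (atomic_load_explicit(&timestamp_users, memory_order_relaxed) > 0)
		gettimeofday(&p->stamp, NULL);
	else
		p->stamp.tv_sec = p->stamp.tv_usec = 0;
}

/* Socket layer: packets that arrived unstamped in the window right after
 * SO_TIMESTAMP was enabled just get the current time filled in. */
static struct timeval packet_timestamp(struct packet *p)
{
	if (p->stamp.tv_sec == 0)
		gettimeofday(&p->stamp, NULL);
	return p->stamp;
}
```

With no users registered, the receive path touches only the counter, which is read-mostly and so should stay shared in every cpu's cache.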

-Andi


* Re: do_gettimeofday
  2003-10-02 18:32 do_gettimeofday Steve Modica
  2003-10-02 19:29 ` do_gettimeofday Andi Kleen
@ 2003-10-02 19:56 ` Stephen Hemminger
  2003-10-02 20:46   ` do_gettimeofday Mitchell Blank Jr
  2003-10-03  7:41   ` do_gettimeofday David S. Miller
  1 sibling, 2 replies; 12+ messages in thread
From: Stephen Hemminger @ 2003-10-02 19:56 UTC (permalink / raw)
  To: Steve Modica; +Cc: netdev

On Thu, 02 Oct 2003 13:32:27 -0500
Steve Modica <modica@sgi.com> wrote:

> We've been doing some experiments here with large numbers of adapters on a 64p Linux system. 
> 
> When running 8 threads and 8 cpus, the do_gettimeofday code starts to use a lot of time.
> 
> It turns out that if a driver does not timestamp an incoming packet, the upper layer will stamp it for you in
> PSCHED_GET_TIME(stamp).  What happens then is multiple cpus start fighting over the cacheline for the system clock and things get bad. 
> 
> One possible solution to this is to have the driver do the stamp using xtime. A number of ATM drivers do this now. In testing, it helps a lot.
> 
> Here's a sample diff for the tg3.c driver:
> 
> ===========================================================================
> Index: linux/linux/drivers/net/tg3.c
> ===========================================================================
>  
> --- /usr/tmp/TmpDir.8948-0/linux/linux/drivers/net/tg3.c_1.23   Thu Oct  2 13:30:21 2003
> +++ linux/linux/drivers/net/tg3.c       Wed Oct  1 14:27:54 2003
> @@ -2019,6 +2019,7 @@
>                         skb->ip_summed = CHECKSUM_NONE;
>   
>                 skb->protocol = eth_type_trans(skb, tp->dev);
> +               skb->stamp = xtime;
>  #if TG3_VLAN_TAG_USED
>                 if (tp->vlgrp != NULL &&
>                     desc->type_flags & RXD_FLAG_VLAN) {
> 
> 
> It's been suggested that we make this tuneable so it's easy to enable and disable. There was concern as to whether xtime would be accurate enough for all possible uses of ->stamp.  
> 
> Does anyone have any comments on this?

Two problems:
	a. xtime is limited to HZ resolution which is insufficient for more advanced
	   packet schedulers and rtt estimation.
	b. unlocked access to xtime is unsafe because it is not atomic!

ATM is busted if it does this.

gettimeofday on 2.6 should be cheap for many systems because of the lockless
seqlock.  Unfortunately, some architectures (not sure about ia64) have problems
with TSC synchronization which make life messy.   

It might be possible to introduce a per-cpu monotonic clock that is lockless
for use in network code, but that is a moderately painful undertaking which
is beyond the scope of getting 2.6.0 out.
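
For readers unfamiliar with why a seqlock makes the read side cheap, here is a minimal user-space sketch using C11 atomics. It is not the kernel's seqlock_t implementation; struct seq_clock and both functions are made-up names, and a single writer is assumed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical seqlock-protected clock value. */
struct seq_clock {
	atomic_uint seq;	/* odd while a write is in progress */
	_Atomic int64_t sec;
	_Atomic int64_t nsec;
};

/* Writer: assumed to be a single thread, e.g. the timer tick. */
static void seq_clock_write(struct seq_clock *c, int64_t sec, int64_t nsec)
{
	unsigned s = atomic_load_explicit(&c->seq, memory_order_relaxed);

	assert(nsec >= 0 && nsec < 1000000000);
	atomic_store_explicit(&c->seq, s + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&c->sec, sec, memory_order_relaxed);
	atomic_store_explicit(&c->nsec, nsec, memory_order_relaxed);
	atomic_store_explicit(&c->seq, s + 2, memory_order_release);
}

/* Reader: takes no lock and writes no shared state; it only retries if a
 * write was in progress or completed while it was reading. */
static void seq_clock_read(struct seq_clock *c, int64_t *sec, int64_t *nsec)
{
	unsigned before, after;

	do {
		before = atomic_load_explicit(&c->seq, memory_order_acquire);
		*sec = atomic_load_explicit(&c->sec, memory_order_relaxed);
		*nsec = atomic_load_explicit(&c->nsec, memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
		after = atomic_load_explicit(&c->seq, memory_order_relaxed);
	} while ((before & 1) || before != after);
}
```

Readers never dirty the cacheline, so many cpus can read concurrently; the line only bounces when the writer updates the clock.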


* Re: do_gettimeofday
  2003-10-02 19:56 ` do_gettimeofday Stephen Hemminger
@ 2003-10-02 20:46   ` Mitchell Blank Jr
  2003-10-03  7:41   ` do_gettimeofday David S. Miller
  1 sibling, 0 replies; 12+ messages in thread
From: Mitchell Blank Jr @ 2003-10-02 20:46 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Steve Modica, netdev

Stephen Hemminger wrote:
> Two problems:
> 	a. xtime is limited to HZ resolution which is insufficient for more advanced
> 	   packet schedulers and rtt estimation.
> 	b. unlocked access to xtime is unsafe because it is not atomic!
> 
> ATM is busted if it does this.

It got fixed in 2.5 (when skb->stamp got changed to nanosecond resolution,
so doing it the old way broke the compile).  You can use LXR to see all
of the xtime users as of 2.6.0-test2:
  http://lxr.linux.no/ident?v=2.6.0-test2&i=xtime

The reason that ATM _had_ been using xtime was not for performance.  When
the ATM code was originally written (during the 1.X kernels) all network
drivers used xtime directly.  At some point the network drivers were
mass-updated to use do_gettimeofday() but ATM had not been merged into
the main tree yet so it missed the conversion.

-Mitch


* Re: do_gettimeofday
  2003-10-02 19:56 ` do_gettimeofday Stephen Hemminger
  2003-10-02 20:46   ` do_gettimeofday Mitchell Blank Jr
@ 2003-10-03  7:41   ` David S. Miller
  2003-10-03  8:26     ` do_gettimeofday Mitchell Blank Jr
  1 sibling, 1 reply; 12+ messages in thread
From: David S. Miller @ 2003-10-03  7:41 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: modica, netdev

On Thu, 2 Oct 2003 12:56:25 -0700
Stephen Hemminger <shemminger@osdl.org> wrote:

> It might be possible to introduce a per-cpu monotonic clock that is lockless
> for use in network code, but that is a moderately painful undertaking which
> is beyond the scope of getting 2.6.0 out.

Yes, this issue is well known and it gets brought up again from time
to time.

And it's by no means just SO_TIMESTAMP that uses the skb->stamp
values, even IPV4/IPV6 fragmentation uses these things.

The SunRPC and RXRPC layers use it as well.

It really is an arch-specific issue of how to "optimize" this the
best, that's why it's hard to decide what the interface is that an
arch needs to provide.

But at the base I say we need three things:

1) Some kind of fast_timestamp_t, the property is that this stores
   enough information at time "T" such that at time "T + something"
   the fast_timestamp_t can be converted to what the timeval was back at
   time "T".

   For networking, make skb->stamp into this type.

2) store_fast_timestamp(fast_timestamp_t *)

   For networking, change do_gettimeofday(&skb->stamp) into
   store_fast_timestamp(&skb->stamp)

3) fast_timestamp_to_timeval(fast_timestamp_t *, struct timeval *)

   For networking, change things that read the skb->stamp value
   into calls to fast_timestamp_to_timeval().

It is defined that the timeval given by fast_timestamp_to_timeval()
needs to be the same thing that do_gettimeofday() would have recorded
at the time store_fast_timestamp() was called.

Here is the default generic implementation that would go into
asm-generic/faststamp.h:

1) fast_timestamp_t is struct timeval
2) store_fast_timestamp() is do_gettimeofday()
3) fast_timestamp_to_timeval() merely copies the fast_timestamp_t
   into the passed in timeval.
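
A minimal sketch of that generic fallback, written as user-space C so it can be compiled on its own: gettimeofday() stands in for the kernel's do_gettimeofday(), and the const qualifier on the conversion argument is an addition not present in the proposal.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Generic fallback: the "fast" timestamp is simply a struct timeval. */
typedef struct timeval fast_timestamp_t;

/* Record the current time; no cheaper source is assumed to exist. */
static inline void store_fast_timestamp(fast_timestamp_t *fts)
{
	assert(fts != NULL);
	gettimeofday(fts, NULL);
}

/* The stamp is already a timeval, so conversion is a plain copy. */
static inline void fast_timestamp_to_timeval(const fast_timestamp_t *fts,
					     struct timeval *tv)
{
	*tv = *fts;
}
```

Because both helpers compile down to what callers do today, an architecture that never provides its own version should see no change in behavior.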

And here is how an example implementation could work on sparc64:

1) fast_timestamp_t is a u64

2) store_fast_timestamp() reads the cpu cycle counter

3) fast_timestamp_to_timeval() records the difference between the
   current cpu cycle counter and the one recorded, it takes a sample
   of the current xtime value and adjusts it accordingly to account
   for the cpu cycle counter difference.

This only works because sparc64's cpu cycle counters are synchronized
across all cpus, they increase monotonically, and are guaranteed not
to overflow for at least 10 years.
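
The cycle-counter variant could look roughly like the sketch below. It is user-space C with a simulated counter: fake_cycle_counter, read_cycle_counter() and TICKS_PER_USEC are invented for illustration, and the current wall-clock time is passed in as an extra argument (a deviation from the two-argument proposal) where the kernel would sample xtime.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/time.h>

typedef uint64_t fast_timestamp_t;

/* Stand-in for a synchronized, monotonic hardware cycle counter. */
static uint64_t fake_cycle_counter;
#define TICKS_PER_USEC 1000u	/* illustrative tick rate */

static inline uint64_t read_cycle_counter(void)
{
	return fake_cycle_counter;
}

/* Stamping touches no shared clock state, only the local counter. */
static inline void store_fast_timestamp(fast_timestamp_t *fts)
{
	*fts = read_cycle_counter();
}

/* Work out how long ago the stamp was taken and subtract that from the
 * current wall-clock time supplied in *now. */
static void fast_timestamp_to_timeval(const fast_timestamp_t *fts,
				      const struct timeval *now,
				      struct timeval *tv)
{
	uint64_t cycles = read_cycle_counter();

	assert(cycles >= *fts);	/* relies on the counter being monotonic */

	uint64_t ago_usec = (cycles - *fts) / TICKS_PER_USEC;
	int64_t usec = (int64_t)now->tv_sec * 1000000 + now->tv_usec
		       - (int64_t)ago_usec;

	tv->tv_sec = usec / 1000000;
	tv->tv_usec = usec % 1000000;
}
```

All of the arithmetic lands on the rarely taken conversion path, which is the trade the proposal is after.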

Alpha, for example, cannot do it this way because its cpu cycle counter
register overflows too quickly to be useful.

Platforms with inter-cpu TSC synchronization issues will have some
trouble doing the same trick too, because one must properly handle
the case where the fast timestamp is converted to a timeval on a different
cpu from the one on which the fast timestamp was recorded.

Regardless, we could put the infrastructure in there now and arch folks
can work on implementations.  The generic implementation code, which is
what everyone will end up with at first, will cancel out to what we have
currently.

This is a pretty powerful idea that could be applied to other places,
not just the networking.


* Re: do_gettimeofday
  2003-10-03  7:41   ` do_gettimeofday David S. Miller
@ 2003-10-03  8:26     ` Mitchell Blank Jr
  2003-10-03  8:27       ` do_gettimeofday David S. Miller
  0 siblings, 1 reply; 12+ messages in thread
From: Mitchell Blank Jr @ 2003-10-03  8:26 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev

David S. Miller wrote:
> 3) fast_timestamp_to_timeval(fast_timestamp_t *, struct timeval *)
> 
>    For networking, change things that read the skb->stamp value
>    into calls to fast_timestamp_to_timeval().

Are there any common cases where skb->stamp is looked at more than
once?  If so I might recommend changing the API to be more like:

	const struct timeval *skb_timestamp(struct sk_buff *skb);

where the generic form would just be:

	typedef struct {
		struct timeval tv;
	} fast_timestamp_t;

	/* Generic form: the stamp already is a timeval, just hand it back */
	static inline const struct timeval *skb_timestamp(struct sk_buff *skb) {
		return &skb->faststamp.tv;
	}

...but an arch could accelerate it with:

	typedef struct {
		union {
			struct timeval tv;	/* valid once is_converted is set */
			u64 tsc;		/* raw cycle count until then */
		};
		int is_converted;
	} fast_timestamp_t;

	/* Caller must be sure we have exclusive ownership of this sk_buff */
	const struct timeval *skb_timestamp(struct sk_buff *skb) {
		if (!skb->faststamp.is_converted) {
			/* Convert once, in place, and cache the result */
			tsc_to_timeval(&skb->faststamp.tv, skb->faststamp.tsc);
			skb->faststamp.is_converted = 1;
		}
		return &skb->faststamp.tv;
	}

If we could hide "is_converted" as a flag somewhere else this would have zero
storage penalty (since most archs would have a fast-stamp at least
as big as a timeval)

I dunno, just an idea.

> Platforms with inter-cpu TSC synchronization issues will have some
> trouble doing the same trick too, because one must properly handle
> the case where the fast timestamp is converted to a timeval on a different
> cpu from the one on which the fast timestamp was recorded.

Yeah, you'd probably have something like

	typedef struct {
		union {
			struct timeval tv;
			struct {
				u64 tsc;
#ifdef CONFIG_SMP
				/* cpu whose counter produced tsc */
				unsigned int cpu_id;
#endif /* CONFIG_SMP */
			} fast;
		};
		int is_converted;
	} fast_timestamp_t;

And then skb_timestamp() would have to rummage around in the per-cpu timer
state for whatever processor started the packet.  (This is why I thought
it might be good to cache the result - you don't want to thrash those
cachelines more than once if you can help it)

-Mitch


* Re: do_gettimeofday
  2003-10-03  8:26     ` do_gettimeofday Mitchell Blank Jr
@ 2003-10-03  8:27       ` David S. Miller
  2003-10-03  8:48         ` do_gettimeofday Mitchell Blank Jr
  0 siblings, 1 reply; 12+ messages in thread
From: David S. Miller @ 2003-10-03  8:27 UTC (permalink / raw)
  To: Mitchell Blank Jr; +Cc: netdev

On Fri, 3 Oct 2003 01:26:42 -0700
Mitchell Blank Jr <mitch@sfgoth.com> wrote:

> Are there any common cases where skb->stamp is looked at more than
> once?

Yes, the packet scheduler can cause this to happen.

> If so I might recommend changing the API to be more like:
> 
> 	const struct timeval *skb_timestamp(struct sk_buff *skb);

Please no, making this an SKB- or networking-specific interface
makes it nearly valueless and we might as well just stay with the
stuff we have.

> > Platforms with inter-cpu TSC synchronization issues will have some
> > trouble doing the same trick too, because one must properly handle
> > the case where the fast timestamp is converted to a timeval on a different
> > cpu from the one on which the fast timestamp was recorded.
> 
> Yeah, you'd probably have something like

Doesn't work as-is.  You'd have to not only store the timestamp and
the cpu it was stored on, but also cross-call to that cpu to compute
the correct timeval.  That's really expensive and probably
do_gettimeofday() is going to be faster in the long run compared to
such a scheme.


* Re: do_gettimeofday
  2003-10-03  8:27       ` do_gettimeofday David S. Miller
@ 2003-10-03  8:48         ` Mitchell Blank Jr
  2003-10-03  8:52           ` do_gettimeofday David S. Miller
  0 siblings, 1 reply; 12+ messages in thread
From: Mitchell Blank Jr @ 2003-10-03  8:48 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev

David S. Miller wrote:
> Doesn't work as-is.  You'd have to not only store the timestamp and
> the cpu it was stored on, but also cross-call to that cpu to compute
> the correct timeval.

That's definitely the worst case.  You could have each CPU periodically
store its current {tsc,timeval} tuple in a per-cpu location and extrapolate
from that.
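
A rough sketch of that per-cpu snapshot scheme, in user-space C with invented names (struct clock_snapshot, snapshots[], struct fast_stamp, TICKS_PER_USEC). The periodic refresh of each snapshot and any protection against reading a half-updated snapshot are deliberately left out.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/time.h>

#define NR_CPUS 4		/* illustrative */
#define TICKS_PER_USEC 1000	/* illustrative; assumed equal on all cpus */

/* Refreshed periodically by each cpu: "at counter value tsc, it was tv". */
struct clock_snapshot {
	uint64_t tsc;
	struct timeval tv;
};
static struct clock_snapshot snapshots[NR_CPUS];

/* What the receive path records: the raw counter and the cpu it came from. */
struct fast_stamp {
	uint64_t tsc;
	unsigned int cpu_id;
};

/* Any cpu can convert by extrapolating from the stamping cpu's snapshot,
 * so no cross-cpu call is needed. The stamp may be older or newer than
 * the snapshot, hence the signed delta. */
static void fast_stamp_to_timeval(const struct fast_stamp *fs,
				  struct timeval *tv)
{
	assert(fs->cpu_id < NR_CPUS);

	const struct clock_snapshot *snap = &snapshots[fs->cpu_id];
	int64_t delta_usec = (int64_t)(fs->tsc - snap->tsc) / TICKS_PER_USEC;
	int64_t usec = (int64_t)snap->tv.tv_sec * 1000000 + snap->tv.tv_usec
		       + delta_usec;

	tv->tv_sec = usec / 1000000;
	tv->tv_usec = usec % 1000000;
}
```

The converting cpu does read another cpu's snapshot cacheline, so how often snapshots are rewritten decides how much that line bounces.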

> That's really expensive and probably
> do_gettimeofday() is going to be faster in the long run compared to
> such a scheme.

It all depends on what percentage of skb's have ->stamp computed on a
CPU different from the one they came in on.  For the common users of
->stamp, won't they have stayed on the same CPU?  The worst case of
doing a cross-cpu-call should only happen relatively rarely.

-Mitch


* Re: do_gettimeofday
  2003-10-03  8:48         ` do_gettimeofday Mitchell Blank Jr
@ 2003-10-03  8:52           ` David S. Miller
  2003-10-03  9:26             ` do_gettimeofday Mitchell Blank Jr
  0 siblings, 1 reply; 12+ messages in thread
From: David S. Miller @ 2003-10-03  8:52 UTC (permalink / raw)
  To: Mitchell Blank Jr; +Cc: netdev

On Fri, 3 Oct 2003 01:48:47 -0700
Mitchell Blank Jr <mitch@sfgoth.com> wrote:

> David S. Miller wrote:
> > Doesn't work as-is.  You'd have to not only store the timestamp and
> > the cpu it was stored on, but also cross-call to that cpu to compute
> > the correct timeval.
> 
> That's definitely the worst case.  You could have each CPU periodically
> store its current {tsc,timeval} tuple in a per-cpu location and extrapolate
> from that.

Right, that would work.

There is the weird issue (with both the sparc64 example and yours
here) of whether we should care about what happens when settimeofday()
occurs.  We probably shouldn't worry about it... as it gives weird
results even currently.

> It all depends on what percentage of skb's have ->stamp computed on a
> CPU different from the one they came in on.  For the common users of
> ->stamp won't they have stayed on the same CPU?  The worst case of
> doing a cross-cpu-call should only happen relatively rarely.

No, they typically won't.  The packet comes in on cpu X, we stamp
it on X, and we do a wakeup of tcpdump which will typically get
scheduled first onto some other processor before X is done processing
incoming packets.  The higher the packet load the more likely this will
happen.

But forget this, as your dual tsc+timeval recording idea would work
well and doesn't need a cross-cpu call.  Although we'd need to think
about how costly the cacheline activity is going to be with your idea
compared to the seqlocked accesses to xtime.  This is mainly a product
of how often you intend to update the tsc+timeval thingy.


* Re: do_gettimeofday
  2003-10-03  9:26             ` do_gettimeofday Mitchell Blank Jr
@ 2003-10-03  9:23               ` David S. Miller
  2003-10-03 16:42               ` do_gettimeofday Ben Greear
  1 sibling, 0 replies; 12+ messages in thread
From: David S. Miller @ 2003-10-03  9:23 UTC (permalink / raw)
  To: Mitchell Blank Jr; +Cc: netdev

On Fri, 3 Oct 2003 02:26:17 -0700
Mitchell Blank Jr <mitch@sfgoth.com> wrote:

> I was more thinking about the other timestamp users.  I don't consider
> tcpdump something that needs as much optimization.

Well, it's the fact that usage of the timestamp is rare which we're
trying to take advantage of.

The whole idea is that fast_timestamp_to_timeval() can be a bit slow
or suboptimal in order to make store_fast_timestamp() a lot faster
or access less shared or locked state.


* Re: do_gettimeofday
  2003-10-03  8:52           ` do_gettimeofday David S. Miller
@ 2003-10-03  9:26             ` Mitchell Blank Jr
  2003-10-03  9:23               ` do_gettimeofday David S. Miller
  2003-10-03 16:42               ` do_gettimeofday Ben Greear
  0 siblings, 2 replies; 12+ messages in thread
From: Mitchell Blank Jr @ 2003-10-03  9:26 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev

David S. Miller wrote:
> There is the weird issue (with both the sparc64 example and yours
> here) of whether we should care about what happens when settimeofday()
> occurs.  We probably shouldn't worry about it... as it gives weird
> results even currently.

Nah.  If anything you'll get better results since you're computing
the timeval later.

This is another argument for caching the computation though - otherwise
a settimeofday() could cause the packet timestamp to change drastically
from one observation to the next :-)

> > The worst case of
> > doing a cross-cpu-call should only happen relatively rarely.
> 
> No, they typically won't.  The packet comes in on cpu X, we stamp
> it on X, and we do a wakeup of tcpdump which will typically get
> scheduled first onto some other processor before X is done processing
> incoming packets.  The higher the packet load the more likely this will
> happen.

I was more thinking about the other timestamp users.  I don't consider
tcpdump something that needs as much optimization.  If we really wanted
to we could have set a per-interface flag that says "someone will want
the timestamp so compute it in the bh while we're still on the same
processor"  But see below - there probably isn't much cost anyways...

> This is mainly a product
> of how often you intend to update the tsc+timeval thingy.

You could compute it relatively frequently and then only actually
copy it to the hot cacheline if it's diverged significantly from what's
there.  This would make the writes almost never happen (maybe once a
minute).

-Mitch


* Re: do_gettimeofday
  2003-10-03  9:26             ` do_gettimeofday Mitchell Blank Jr
  2003-10-03  9:23               ` do_gettimeofday David S. Miller
@ 2003-10-03 16:42               ` Ben Greear
  1 sibling, 0 replies; 12+ messages in thread
From: Ben Greear @ 2003-10-03 16:42 UTC (permalink / raw)
  To: Mitchell Blank Jr; +Cc: David S. Miller, netdev

Mitchell Blank Jr wrote:
> David S. Miller wrote:
> 
>>There is the weird issue (with both the sparc64 example and yours
>>here) of whether we should care about what happens when settimeofday()
>>occurs.  We probably shouldn't worry about it... as it gives weird
>>results even currently.
> 
> 
> Nah.  If anything you'll get better results since you're computing
> the timeval later.
> 
> This is another argument for caching the computation though - otherwise
> a settimeofday() could cause the packet timestamp to change drastically
> from one observation to the next :-)

It would also be nice to be able to set a flag on raw sockets to just have the 'raw' timestamp passed
up to user-space.  In many cases, the relative timestamp may be all that is needed,
and user-space could optimize the conversion to gettimeofday as needed.

Ben


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com
