From mboxrd@z Thu Jan  1 00:00:00 1970
From: Patrick Ohly <patrick.ohly@gmx.de>
Subject: Re: [RFC] support for IEEE 1588
Date: Fri, 04 Jul 2008 15:37:21 +0200
Message-ID: <1215178641.7277.71.camel@localhost>
References: <200807040147.11148.opurdila@ixiacom.com>
Mime-Version: 1.0
Content-Type: text/plain
Content-Transfer-Encoding: 7bit
To: netdev@vger.kernel.org
Return-path: <netdev-owner@vger.kernel.org>
Received: from main.gmane.org ([80.91.229.2]:42385 "EHLO ciao.gmane.org"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1755004AbYGDOZI (ORCPT <rfc822;netdev@vger.kernel.org>);
	Fri, 4 Jul 2008 10:25:08 -0400
Received: from root by ciao.gmane.org with local (Exim 4.43)
	id 1KEmDe-0000sS-U7
	for netdev@vger.kernel.org; Fri, 04 Jul 2008 14:25:02 +0000
Received: from fce2e.f.ppp-pool.de ([195.4.206.46])
        by main.gmane.org with esmtp (Gmexim 0.1 (Debian))
        id 1AlnuQ-0007hv-00
        for <netdev@vger.kernel.org>; Fri, 04 Jul 2008 14:25:02 +0000
Received: from patrick.ohly by fce2e.f.ppp-pool.de with local (Gmexim 0.1 (Debian))
        id 1AlnuQ-0007hv-00
        for <netdev@vger.kernel.org>; Fri, 04 Jul 2008 14:25:02 +0000
In-Reply-To: <200807040147.11148.opurdila@ixiacom.com>
Sender: netdev-owner@vger.kernel.org
List-ID: <netdev.vger.kernel.org>

Hello Tavi,

Interesting initiative. I'm employed by Intel and had the chance to do
some exploratory work on software PTP support for Intel's new 82576
Gigabit Ethernet Controller [1], which introduces hardware time stamping
for PTP packets. I modified the open source PTPd so that it uses the
more accurate hardware time stamps instead of time stamps generated by
the Linux IP stack. The advantage was 50x higher accuracy under load.
You can read more about that in a paper [2].

[1] http://download.intel.com/design/network/ProdBrf/320025.pdf
[2] http://www.linuxclustersinstitute.org/conferences/archive/2008/PDF/Ohly_92221.pdf

In order to get these time stamps and read the clock inside the NIC
which generates these time stamps, we had to add ioctl() calls to the
igb driver - not nice and certainly not a suitable long-term solution.
If there is a consensus on a better user space API and the Linux IP
stack gets a general framework for PTP, then perhaps it could also be
used with Intel's new NICs. Note that I'm not speaking in any official
capacity for Intel here, just expressing my own opinion (and hope). I'm
not even in the network team.

I cannot release the PTPd and igb patches right now because that would
require legal approval, but if there is interest I can get that process
started. There's no reason not to do that.

So, let's move on to Tavi's proposal:

On Fri, 2008-07-04 at 01:47 +0300, Octavian Purdila wrote:
> 1. RX path
> - add a new field in skb to keep the hardware stamp (hwstamp)
> - add a new socket flag to enable RX stamping 
> - add a new control message to retrieve the hwstamp from the skb to user-space 
> application (for UDP and maybe PF_PACKET)

I agree.

Currently there is something similar with SO_TIMESTAMP and
SCM_TIMESTAMP, but the problem with those is that only a timeval is
returned, i.e., accuracy is limited to microseconds. To make full use of
hardware time stamps we'll want a timespec with nanoseconds.

We also need something more flexible than SO_TIMESTAMP. Depending on
what the user space program wants to measure, it would be useful to time
stamp
      * the various flavors of PTP packets (v1/v2/802.1as,
        SYNC/DELAY_REQUEST) selectively
      * all packets

The hardware might not be capable of supporting all modes, but at least
the API should support them and provide room for future extensions.

It would be possible to fall back to time stamping using system time if
the hardware is incapable of implementing the requested operation.
Depending on how that fallback is implemented, PTPd's accuracy might be
improved even without any hardware support.
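
As a rough sketch of what such a selectable mode plus fallback could
look like: every name below (the filter bits, enum stamp_source,
choose_stamp_source()) is hypothetical and only illustrates the idea of
requesting a set of packet classes and falling back to system time when
the NIC cannot cover all of them. It is not an existing kernel
interface.

```c
#include <stdint.h>

/* Hypothetical filter bits: which packets the application wants stamped. */
enum rx_stamp_filter {
    STAMP_PTP_V1_SYNC      = 1u << 0,
    STAMP_PTP_V1_DELAY_REQ = 1u << 1,
    STAMP_PTP_V2_SYNC      = 1u << 2,
    STAMP_PTP_V2_DELAY_REQ = 1u << 3,
    STAMP_PTP_8021AS       = 1u << 4,
    STAMP_ALL_PACKETS      = 1u << 5,
};

/* Where the time stamp would come from (also hypothetical). */
enum stamp_source {
    STAMP_SRC_HARDWARE,
    STAMP_SRC_SYSTEM,
};

/* Use the hardware only if it can stamp every requested packet class;
 * otherwise fall back to system time for the whole request. */
static enum stamp_source choose_stamp_source(uint32_t requested,
                                             uint32_t hw_caps)
{
    /* Hardware that stamps all packets covers any selective request. */
    if (hw_caps & STAMP_ALL_PACKETS)
        return STAMP_SRC_HARDWARE;

    /* Any requested bit the hardware lacks forces the fallback. */
    if (requested & ~hw_caps)
        return STAMP_SRC_SYSTEM;

    return STAMP_SRC_HARDWARE;
}
```

A NIC that only recognizes PTP v2 SYNC and DELAY_REQUEST would thus
serve a v2 request in hardware, while a request for all packets would
fall back to system time.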

> 2. TX path - this is a bit more complicated since we need a new mechanism to 
> wait for a packet transmission on wire, from users-space.
> - add a new flag for the skb to request TX stamping
> - add a new control message to propagate the TX stamping request from 
> userspace to the skb

Forgive my ignorance, but can you provide more details on how that would
work?

How about adding a new flag for send/sendto/sendmsg() instead of a new
control message?

> - when the driver will send the packet will get the stamp from the TX 
> completion ring; the driver will then propagate the stamp either to
> (a) the skb stamp field, or (b) some special structure - this to avoid keeping 
> the skb around
> - the special structure or the skb will be linked to a special queue in the 
> socket and a POLLPRI event will be generated
> - the application will use recvmsg and will receive a new control message 
> which contains the timestamp from the socket special queue 

Sounds a bit complicated to me. The trick currently used by PTPd might
be more elegant and/or require fewer changes: it enables looping of
outgoing packets with IP_MULTICAST_LOOP. The RX time stamp of the looped
packet is then used as an approximation for the TX time stamp of the
original outgoing packet. Clearly this is inaccurate, in particular
under load, but it is very easy to use.
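
For reference, enabling the loop is a single socket option. This is
only a sketch of that one step, not PTPd's actual code, and the helper
name enable_multicast_loop() is made up here. The looped copy would
then be read with its RX time stamp like any other incoming packet.

```c
#include <netinet/in.h>
#include <sys/socket.h>

/* Have the IP stack loop outgoing multicast datagrams back to local
 * receivers, so that the sender sees its own packets on the RX path. */
static int enable_multicast_loop(int fd)
{
    int on = 1;
    return setsockopt(fd, IPPROTO_IP, IP_MULTICAST_LOOP, &on, sizeof(on));
}
```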

When a driver gets an skb with the request to generate a TX time stamp,
it could send the packet, upon completion obtain the time stamp from the
hardware and feed the packet and the time stamp back to the upper layers
as if it had just been received. Would that work?

The user space then obtains TX time stamps just like RX time stamps and
can use the payload to determine what kind of time stamp it got. That
also avoids the need for special cookies to detect packet loss or
reordering.
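
On the user space side, matching by payload could be as simple as the
following sketch. It assumes the application keeps a small table of
packets it has sent and is still waiting on; the table layout and the
names remember_sent() and match_looped() are hypothetical, and a real
PTP daemon would more likely compare protocol fields than whole
payloads.

```c
#include <stddef.h>
#include <string.h>

#define MAX_PENDING 8
#define MAX_PAYLOAD 128

/* One outgoing packet whose TX time stamp has not come back yet. */
struct pending_tx {
    unsigned char payload[MAX_PAYLOAD];
    size_t len;
    int in_use;
};

/* Record a sent payload. Returns its slot, or -1 if it does not fit. */
static int remember_sent(struct pending_tx *table, const void *payload,
                         size_t len)
{
    if (len > MAX_PAYLOAD)
        return -1;
    for (int i = 0; i < MAX_PENDING; i++) {
        if (!table[i].in_use) {
            memcpy(table[i].payload, payload, len);
            table[i].len = len;
            table[i].in_use = 1;
            return i;
        }
    }
    return -1;
}

/* Look up a packet that came back with a time stamp. A match means the
 * stamp is the TX stamp of that earlier send; the slot is then freed.
 * Returns the slot, or -1 if the payload is not one of ours. */
static int match_looped(struct pending_tx *table, const void *payload,
                        size_t len)
{
    for (int i = 0; i < MAX_PENDING; i++) {
        if (table[i].in_use && table[i].len == len &&
            memcmp(table[i].payload, payload, len) == 0) {
            table[i].in_use = 0;
            return i;
        }
    }
    return -1;
}
```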

So far all that we get out of this is access to the raw time stamps.
There may be some use for that, as Tavi said, but it would be a lot more
interesting if the kernel transformed the raw time stamps into system
time stamps when the user space process wants that. A modified PTPd
could then use them to synchronize the system time inside a cluster a
lot more accurately than is currently possible with NTP (think
sub-microsecond accuracy instead of milliseconds).

On Fri, 2008-07-04 at 03:42 +0300, Octavian Purdila wrote:
> I guess we could try to do a simple sync between the host clock and the hw 
> clock by getting the initial delta between the two. But since the two clocks 
> are not in sync, they will diverge in time.

For the paper I tried out two different ways of synchronizing the system
time with the NIC time. The one called "Assisted System Time" could be
implemented relatively easily inside the IP stack: the driver only has
to provide access to the NIC's hardware clock. Then the layer above it
can sample the system time/NIC time offset at regular intervals; when
they drift apart, that drift rate can be tracked as part of the
measurements and be taken into account when transforming from one time
base into the other. The other method ("Two-Level PTP") is more
complicated and didn't bring much benefit.
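
To make the idea concrete, here is a simplified sketch of such a
transformation. It assumes a plain linear model built from just two
(NIC time, system time) samples; the struct and function names are
invented for this example, and the method described in the paper may
differ in how it filters the measurements.

```c
#include <stdint.h>

/* One simultaneous reading of both clocks, in nanoseconds. */
struct clock_sample {
    int64_t nic_ns;
    int64_t sys_ns;
};

/* Linear mapping from NIC time to system time around a reference point. */
struct clock_map {
    int64_t nic_ref_ns; /* NIC time of the latest sample */
    int64_t offset_ns;  /* system time minus NIC time at that sample */
    double drift;       /* change of the offset per NIC nanosecond */
};

/* Round to nearest without needing libm. */
static int64_t round_to_i64(double x)
{
    return (int64_t)(x >= 0.0 ? x + 0.5 : x - 0.5);
}

/* Derive offset and drift rate from two samples taken some time apart. */
static struct clock_map clock_map_from_samples(struct clock_sample older,
                                               struct clock_sample newer)
{
    struct clock_map m;
    int64_t old_off = older.sys_ns - older.nic_ns;
    int64_t new_off = newer.sys_ns - newer.nic_ns;
    int64_t nic_span = newer.nic_ns - older.nic_ns;

    m.nic_ref_ns = newer.nic_ns;
    m.offset_ns = new_off;
    m.drift = nic_span ? (double)(new_off - old_off) / (double)nic_span : 0.0;
    return m;
}

/* Transform a raw NIC time stamp into system time, extrapolating the
 * offset with the tracked drift rate. */
static int64_t nic_to_system_ns(const struct clock_map *m, int64_t nic_ns)
{
    double elapsed = (double)(nic_ns - m->nic_ref_ns);
    return nic_ns + m->offset_ns + round_to_i64(m->drift * elapsed);
}
```

With samples whose offset grows from 1000 ns to 1010 ns over one
millisecond of NIC time, a stamp taken another millisecond later is
mapped with an extrapolated offset of 1020 ns.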

Bye, Patrick