From mboxrd@z Thu Jan 1 00:00:00 1970
From: Patrick Ohly
Subject: Re: [RFC] support for IEEE 1588
Date: Fri, 04 Jul 2008 15:37:21 +0200
Message-ID: <1215178641.7277.71.camel@localhost>
References: <200807040147.11148.opurdila@ixiacom.com>
Mime-Version: 1.0
Content-Type: text/plain
Content-Transfer-Encoding: 7bit
To: netdev@vger.kernel.org
Return-path:
Received: from main.gmane.org ([80.91.229.2]:42385 "EHLO ciao.gmane.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755004AbYGDOZI (ORCPT ); Fri, 4 Jul 2008 10:25:08 -0400
Received: from root by ciao.gmane.org with local (Exim 4.43) id 1KEmDe-0000sS-U7 for netdev@vger.kernel.org; Fri, 04 Jul 2008 14:25:02 +0000
Received: from fce2e.f.ppp-pool.de ([195.4.206.46]) by main.gmane.org with esmtp (Gmexim 0.1 (Debian)) id 1AlnuQ-0007hv-00 for ; Fri, 04 Jul 2008 14:25:02 +0000
Received: from patrick.ohly by fce2e.f.ppp-pool.de with local (Gmexim 0.1 (Debian)) id 1AlnuQ-0007hv-00 for ; Fri, 04 Jul 2008 14:25:02 +0000
In-Reply-To: <200807040147.11148.opurdila@ixiacom.com>
Sender: netdev-owner@vger.kernel.org
List-ID:

Hallo Tavi,

Interesting initiative. I'm employed by Intel and had the chance to do
some exploratory work on software PTP support for Intel's new 82576
Gigabit Ethernet Controller [1], which introduces hardware time
stamping for PTP packets. I modified the open source PTPd so that it
uses the more accurate hardware time stamps instead of time stamps
generated by the Linux IP stack. The advantage was 50x higher accuracy
under load. You can read more about that in a paper [2].

[1] http://download.intel.com/design/network/ProdBrf/320025.pdf
[2] http://www.linuxclustersinstitute.org/conferences/archive/2008/PDF/Ohly_92221.pdf

In order to get these time stamps and read the clock inside the NIC
which generates these time stamps, we had to add ioctl() calls to the
igb driver - not nice and certainly not a suitable long-term solution.
If there is a consensus on a better user space API and the Linux IP
stack gets a general framework for PTP, then perhaps it could also be
used with Intel's new NICs.

Note that I'm not speaking in any official capacity for Intel here,
just expressing my own opinion (and hope). I'm not even in the network
team. I cannot release the PTPd and igb patches right now because that
would require legal approval, but if there is interest I can get that
process started. There's no reason not to do that.

So, let's move on to Tavi's proposal:

On Fri, 2008-07-04 at 01:47 +0300, Octavian Purdila wrote:
> 1. RX path
> - add a new field in skb to keep the hardware stamp (hwstamp)
> - add a new socket flag to enable RX stamping
> - add a new control message to retrieve the hwstamp from the skb to user-space
> application (for UDP and maybe PF_PACKET)

I agree. Currently there is something similar with SO_TIMESTAMP and
SCM_TIMESTAMP, but the problem with those is that only a timeval is
returned, i.e., accuracy is limited to microseconds. To make full use
of hardware time stamps we'll want a timespec with nanoseconds.

We also need something more flexible than SO_TIMESTAMP. Depending on
what the user space program wants to measure, it would be useful to
time stamp
* the various flavors of PTP packets (v1/v2/802.1as,
  SYNC/DELAY_REQUEST) selectively
* all packets

The hardware might not be capable of supporting all modes, but at least
the API should support them and provide room for future extensions. It
would be possible to fall back to time stamping using system time if
the hardware is incapable of implementing the requested operation.
Depending on how that fallback is implemented, PTPd's accuracy might be
improved even without any hardware support.

> 2. TX path - this is a bit more complicated since we need a new mechanism to
> wait for a packet transmission on wire, from users-space.
> - add a new flag for the skb to request TX stamping
> - add a new control message to propagate the TX stamping request from
> userspace to the skb

Forgive my ignorance, can you provide more details on how that would
work? How about adding a new flag for send/sendto/sendmsg() instead of
a new control message?

> - when the driver will send the packet will get the stamp from the TX
> completion ring; the driver will then propagate the stamp either to
> (a) the skb stamp field, or (b) some special structure - this to avoid keeping
> the skb around
> - the special structure or the skb will be linked to a special queue in the
> socket and a POLLPRI event will be generated
> - the application will use recvmsg and will receive a new control message
> which contains the timestamp from the socket special queue

Sounds a bit complicated to me. The trick currently used by PTPd might
be more elegant and/or require fewer changes: it enables looping of
outgoing packets with IP_MULTICAST_LOOP. The RX time stamp of the
looped packet is then used as an approximation for the TX time stamp of
the original outgoing packet. Clearly this is inaccurate, in particular
under load, but it is very easy to use.

When a driver gets an skb with the request to generate a TX time stamp,
it could send the packet, obtain the time stamp from the hardware upon
completion, and feed the packet and the time stamp back to the upper
layers as if the packet had just been received. Would that work? The
user space then obtains TX time stamps just like RX time stamps and can
use the payload to determine what kind of time stamp it got. That also
avoids the need for special cookies to detect packet loss or
reordering.

So far all that we get out of this is access to the raw time stamps.
There may be some use for that, as Tavi said, but it would be a lot
more interesting if the kernel would transform the raw time stamps into
system time stamps if the user space process wants that.
Then it can be used by a modified PTPd to synchronize the system time
inside a cluster a lot more accurately than is currently possible with
NTP (think sub-microsecond accuracy instead of milliseconds).

On Fri, 2008-07-04 at 03:42 +0300, Octavian Purdila wrote:
> I guess we could try to do a simple sync between the host clock and the hw
> clock by getting the initial delta between the two. But since the two clocks
> are not in sync, they will diverge in time.

For the paper I tried out two different ways of synchronizing the
system time with the NIC time. The one called "Assisted System Time"
could be implemented relatively easily inside the IP stack: the driver
only has to provide access to the NIC's hardware clock. Then the layer
above it can sample the system time/NIC time offset at regular
intervals; when they drift apart, that drift rate can be tracked as
part of the measurements and be taken into account when transforming
from one time base into the other.

The other method ("Two-Level PTP") is more complicated and didn't bring
much benefit.

Bye, Patrick
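
To illustrate the "Assisted System Time" idea described above, here is a minimal sketch of how sampled offsets and a tracked drift rate could be used to transform a raw NIC time stamp into system time. It is only an illustration under stated assumptions, not the code from the paper or from any driver: the class name NicToSystemTime, its methods, and the choice to estimate drift from just the two most recent samples are all hypothetical, and all times are assumed to be integer nanoseconds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    nic_ns: int  # NIC clock reading, in nanoseconds
    sys_ns: int  # system time read at (nearly) the same instant, in nanoseconds


class NicToSystemTime:
    """Hypothetical helper: maps raw NIC time stamps to system time.

    Assumption: drift is estimated from the two most recent samples only.
    A real implementation would probably filter over more samples.
    """

    def __init__(self):
        self.prev = None
        self.last = None

    def add_sample(self, nic_ns, sys_ns):
        # Meant to be called at regular intervals with a fresh pair of readings.
        if self.last is not None and nic_ns <= self.last.nic_ns:
            raise ValueError("NIC time must advance between samples")
        self.prev, self.last = self.last, Sample(nic_ns, sys_ns)

    def to_system(self, nic_ns):
        if self.last is None:
            raise ValueError("no offset sample available yet")
        elapsed_nic = nic_ns - self.last.nic_ns
        if self.prev is None:
            # Only one sample so far: apply the plain offset, assume no drift.
            return self.last.sys_ns + elapsed_nic
        d_nic = self.last.nic_ns - self.prev.nic_ns
        d_sys = self.last.sys_ns - self.prev.sys_ns
        # Scale the NIC time elapsed since the last sample by the observed
        # rate ratio d_sys/d_nic, using integer arithmetic to avoid
        # floating-point rounding at nanosecond magnitudes.
        return self.last.sys_ns + elapsed_nic * d_sys // d_nic
```

With one sample the conversion is a plain offset, which matches the "initial delta" approach in the quoted text; once a second sample arrives, time stamps taken after it are corrected for the drift observed between the two samples.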