From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S261785AbVCCOvY (ORCPT ); Thu, 3 Mar 2005 09:51:24 -0500 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S261786AbVCCOvY (ORCPT ); Thu, 3 Mar 2005 09:51:24 -0500 Received: from wproxy.gmail.com ([64.233.184.200]:62564 "EHLO wproxy.gmail.com") by vger.kernel.org with ESMTP id S261785AbVCCOuz (ORCPT ); Thu, 3 Mar 2005 09:50:55 -0500 DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=beta; d=gmail.com; h=received:message-id:date:from:reply-to:to:subject:mime-version:content-type:content-transfer-encoding; b=VirHJBKjDOwSv1b1C17Dqcc7tDkuWUXCKghvg+2Z/Z1RLA+PFlvFCBqZyPVV7XGe12uk0CfUEJGHR60GI3w+Y8PSH/2vWsBOwLfoE+uVEVu5wZpjh0fqzwTqCiyfzr2zuLNc2GTJrT3Y8Aj2TMB3+egtI5zixwUJehfEF6qS6uU= Message-ID: <6c58e3190503030650595cbd5@mail.gmail.com> Date: Thu, 3 Mar 2005 16:50:52 +0200 From: Dror Cohen Reply-To: Dror Cohen To: linux-kernel@vger.kernel.org Subject: Time Drift Compensation on Linux Clusters Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org X-Mailing-List: linux-kernel@vger.kernel.org Hi all, While working on a Linux cluster with kernel version 2.4.27 we've encountered a consistent clock drift problem. We have devised a fix for this problem which is based on the Pentium's TSC clock. We'd appreciate any comments on the validity of the proposed solution and on whether it might cause unexpected (and undesired) problems in other parts of the kernel. Following is a detailed description of the problem and the fix, Thanks everyone, Dror XIV ltd. ------------------------------------------------------------------------------------------------------------ Time Drift Compensation on Linux Clusters 1. Background During the operation of the 2.4.27 kernel under intense IO conditions a clock drift was detected. The system time, as received by gettimeofday() began lagging behind the wall clock. All PCs have a battery driven hardware clock. This clock is used at boot time to initialize the system time. Linux then uses the Programmable Interval Timer (PIT) to keep an internal accounting of the flow of time. The PIT is set to raise an interrupt every fixed interval. The number of interrupt per second is defined by the kernel constant HZ, which typically equals 100 (10 on Alpha machines). The interrupt handler kernel/timer.c:do_timer() increments a kernel wide variable named jiffies and invokes the bottom half to take care of timers and scheduling. 2. Analysis In IO intensive environments a many CPU cycles are spent handling interrupts. This is done by the device driver code for the different IO devices. Typically, while dealing with the hardware, device drivers block other IRQs, including the PIT's (timer) IRQ. When debugging drivers it is not uncommon to use printk() inside driver code while IRQs are still blocked. This method is by nature blocking and can only be run once even on SMP machines. This is because messages printed by printk() must reach the console before any other operation might cause the kernel to panic or block. In cluster environments a serial console can be used to access machines remotely. One drawback of using such a console is its longer response time which significantly lengthens the blocking time of printk(). Putting all this together results in an interrupt rich environments in which many interrupt handlers block the IRQs for long periods of time. This matters when 'long' becomes longer than 1 second / HZ, which means that the CPU misses timer interrupts. Each time a time interrupt is missed the system looses one jiffy. These lost jiffies accumulate into a consisting clock drift. 3. Fix The fix proposed here relies on a quality of the Pentium processor and thus is irrelevant to other architectures or to older (80386, 80486) CPUs. The Pentium has an internal cycles counter called the Time Stamp Counter (TSC). This counter is a 64 bit register which is incremented internally every CPU cycle and can be read using a special assembly instruction. On a 1GHz machine this register wraps to zero once every 584 years. The suggested fix adds code to the timer interrupt handler which tracks the value of this counter. When an interrupt occurs, the jiffies counter is increased not just by one but by the number of 1 second / HZ intervals that actually passed according to the TSC value change. This involves modifying two kernel files: kernel/timer.c: modify the interrupt handler do_timer() 698a699,703 > #ifdef X86_TSC_DRIFT_COMPENSATION > __u64 xtdc_step; > __u64 xtdc_last = 0; > #endif > 700a706,712 > #ifdef X86_TSC_DRIFT_COMPENSATION > __u64 cycles = get_cycles(); > while (xtdc_last < cycles) { > (*(unsigned long *)&jiffies)++; > xtdc_last += xtdc_step; > } > #else 701a714,715 > #endif > arch/i386/kernel/time.c modify calibrate_tsc() to get the initial TSC count and to calculate the number of TSC cycles per 1 second / HZ interval 768a769,777 > #ifdef X86_TSC_DRIFT_COMPENSATION > > extern __u64 xtdc_step; > extern __u64 xtdc_last; > #define CALIBRATE_LATCH (4 * LATCH) > #define CALIBRATE_TIME (4 * 1000020/HZ) > > #else > 771a781,782 > #endif > 805a817,820 > #ifdef X86_TSC_DRIFT_COMPENSATION > xtdc_last = (((__u64) endhigh) << 32) + (__u64) endlow; > #endif > 811a827,830 > > #ifdef X86_TSC_DRIFT_COMPENSATION > xtdc_step = ((((__u64) endhigh) << 32) + (__u64) endlow) >> 2; > #endif