From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753463Ab0CQP7j (ORCPT ); Wed, 17 Mar 2010 11:59:39 -0400
Received: from e33.co.us.ibm.com ([32.97.110.151]:40978 "EHLO e33.co.us.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753161Ab0CQP7i (ORCPT ); Wed, 17 Mar 2010 11:59:38 -0400
Subject: Re: [PATCH] Timekeeping: Fix dead lock in update_wall_time by correct shift convertion.
From: john stultz
To: Sonic Zhang
Cc: Andrew Morton , Thomas Gleixner , Linux Kernel
In-Reply-To: <4e5ebad51003162214s3430aa39xde29a4ba9c1b7d8c@mail.gmail.com>
References: <1268735629.5075.8.camel@eight.analog.com> <1268763512.1676.7.camel@work-vm> <4e5ebad51003161958n6db293f0y6da99483bf96ada0@mail.gmail.com> <1268797285.3130.90.camel@localhost.localdomain> <4e5ebad51003162214s3430aa39xde29a4ba9c1b7d8c@mail.gmail.com>
Content-Type: text/plain; charset="UTF-8"
Date: Wed, 17 Mar 2010 08:59:22 -0700
Message-ID: <1268841562.1996.16.camel@work-vm>
Mime-Version: 1.0
X-Mailer: Evolution 2.28.1
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, 2010-03-17 at 13:14 +0800, Sonic Zhang wrote:
> With you new workaround, no dead loop. But are you sure this doesn't
> overflow the ntp_error after thousands of loops?
>
>     timekeeper.ntp_error += tick_length << shift;
>     timekeeper.ntp_error -= timekeeper.xtime_interval <<
>                             (timekeeper.ntp_error_shift + shift);

At some point, yes it could overflow, but for that to happen, we'd have
to have accumulated over 4 seconds of time error between calls to
update_wall_time. At a max error rate of 500ppm, that would mean over
two hours of delay between calls.

The time subsystem can try to accommodate reasonable stalls in the
system, but I think there will always be windows in which KGDB could
cause the system to not recover (i.e., I know quite a bit of SCSI
hardware has heartbeat requirements, so I could imagine KGDB causing
those watchdogs to trigger and reset the device).

One approach would be to have KGDB suspend the timekeeping core, much as
is done over suspend/resume. This should be able to protect us from any
overflows, but I suspect it's unlikely that we'd want to go run other
kernel stuff when breaking into KGDB.

Thanks again for the testing. I'll try to send out an improved version
of the fix for testing later today. If you could confirm it works as
well, I'd appreciate it.

thanks
-john