From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wm0-f67.google.com ([74.125.82.67]:36594 "EHLO mail-wm0-f67.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752906AbcBJQ2b (ORCPT ); Wed, 10 Feb 2016 11:28:31 -0500 Received: by mail-wm0-f67.google.com with SMTP id 128so5263884wmz.3 for ; Wed, 10 Feb 2016 08:28:30 -0800 (PST) Subject: Re: [PATCH 5/6] drm: Prevent vblank counter jumps with timestamp based update method. To: Daniel Vetter , =?UTF-8?B?VmlsbGUgU3lyasOkbMOk?= References: <1454894009-15466-1-git-send-email-mario.kleiner.de@gmail.com> <1454894009-15466-6-git-send-email-mario.kleiner.de@gmail.com> <20160209100917.GP11240@phenom.ffwll.local> <56B9EF5A.6050902@gmail.com> <20160209141149.GP23290@intel.com> <20160209150347.GZ11240@phenom.ffwll.local> Cc: dri-devel@lists.freedesktop.org, linux@bernd-steinhauser.de, stable@vger.kernel.org, michel@daenzer.net, vbabka@suse.cz, daniel.vetter@ffwll.ch, alexander.deucher@amd.com, christian.koenig@amd.com From: Mario Kleiner Message-ID: <56BB652A.1010107@gmail.com> Date: Wed, 10 Feb 2016 17:28:26 +0100 MIME-Version: 1.0 In-Reply-To: <20160209150347.GZ11240@phenom.ffwll.local> Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 8bit Sender: stable-owner@vger.kernel.org List-ID: On 02/09/2016 04:03 PM, Daniel Vetter wrote: > On Tue, Feb 09, 2016 at 04:11:49PM +0200, Ville Syrj�l� wrote: >> On Tue, Feb 09, 2016 at 02:53:30PM +0100, Mario Kleiner wrote: >>> On 02/09/2016 11:09 AM, Daniel Vetter wrote: >>>> On Mon, Feb 08, 2016 at 02:13:28AM +0100, Mario Kleiner wrote: >>>>> The changes to drm_update_vblank_count() in Linux 4.4 added a >>>>> method to emulate a hardware vblank counter by use of high >>>>> precision vblank timestamps if a kms driver supports those, >>>>> but doesn't suppport hw vblank counters. >>>>> >>>>> That method assumes that the old timestamp from a previous >>>>> invocation is valid, but that is not always the case. E.g., >>>>> if drm_reset_vblank_timestamp() gets called during drm_vblank_on() >>>>> or drm_update_vblank_count() gets called outside vblank irq and >>>>> the high precision timestamping can't deliver a precise timestamp, >>>>> ie. drm_get_last_vbltimestamp() delivers a return value of false, >>>>> then those functions will initialize the old timestamp to zero >>>>> to mark it as invalid. >>>>> >>>>> A following call to drm_update_vblank_count() would then calculate >>>>> elapsed time with vblank irqs off as current vblank timestamp minus >>>>> the zero old timestamp and compute a software vblank counter increment >>>>> that corresponds to system uptime, causing a large forward jump of the >>>>> software vblank counter. That jump in turn can cause too long waits >>>>> in drmWaitVblank and very long delays in delivery of vblank events, >>>>> resulting in hangs of userspace clients. >>>>> >>>>> This problem can be observed on nouveau-kms during machine >>>>> suspend->resume cycles, where drm_vblank_off is called during >>>>> suspend, drm_vblank_on is called during resume and the first >>>>> queries to drm_get_last_vbltimestamp() don't deliver high >>>>> precision timestamps, resulting in a large harmful counter jump. >>>> >>>> Why does nouveau enable vblank interrupts before it can get valid >>>> timestamps? That sounds like the core bug here, and papering over that in >>>> the vblank code feels very wrong to me. Maybe we should instead just not >>>> sample the vblank at all if the timestamp fails and splat a big WARN_ON >>>> about this, to give driver writers a chance to fix up their mess? >>>> -Daniel >>>> >>> >>> The high precision timestamping is allowed to fail for a kms driver >>> under some conditions which are not implementation errors of the driver, >>> or getting disabled by user override, so we should handle that robustly. >>> We handle it robustly everywhere else in the drm, so we should do it >>> here as well. >>> >>> E.g., nouveau generally supports timestamping/scanout position queries, >>> but can't support it on old pre NV-50 hardware with some output type >>> (either on analog VGA, or DVI-D, can't remember atm. which one is >>> unsupported). >> >> I think the surprising thing here is that it fails first and then >> succeeds on the same crtc/output combo presumably. > > Yeah exactly this. Failing consistently is ok imo (and probably should be > handled). Failing first and then later on working (without changing the > mode config in between) seems suspicous. That shouldn't ever happen really > and seems like a driver bug (like enabling vblanks too early, before the > pipe is fully up&running). > -Daniel > >> >> I guess in theory the driver could fail during random times for whatever >> reason, though I tend to think that the driver should either make it >> robust or not even pretend to support it. >> >> I suppose one failure mode the driver can't really do anything about is >> some random SMI crap or something stalling the timestamp queries enough >> that we fail the precisions tests and exhaust the retry limits. So yeah, >> making it robust against that kind of nastyness sounds line it might be >> a good idea. Though perhaps it should be something a bit more severe >> than a DRM_DEBUG since I think it really shouldn't happen when the >> driver and system are otherwise sane. Of course if it routinely fails >> with some driver then simply making it spew errors into dmesg isn't >> so nice, unless the driver also gets fixed. >> I think i have an idea what might go wrong with nouveau, so i'll see if i can add a fixup patch. There's another scenario where this zero-ts case can be hit. If the driver drm_vblank_init()'s - setting all timestamps to zero - and then code starts using vblanks (drm_vblank_get()) without drm_vblank_on beforehand, which is afaics the case with nouveau. Unless that's considered an error as well, we'd need to init the timestamps to something non-zero but harmless like 1 usecs at drm_vblank_init() time? What makes sense as output here? DRM_WARN_ONCE? -mario >>> >>> There are also new Soc drivers showing up which support those methods, >>> but at least i didn't verify or test if their implementations are good >>> enough for the needs of the new drm_udpate_vblank_count() which is more >>> sensitive to kms drivers being even slightly off. >>> >>> -mario >>> >>>>> >>>>> Fix this by checking if the old timestamp used for this calculations >>>>> is zero == invalid. If so, perform a counter increment of +1 to >>>>> prevent large counter jumps and reinitialize the timestamps to >>>>> sane values. >>>>> >>>>> Signed-off-by: Mario Kleiner >>>>> Cc: # 4.4+ >>>>> Cc: michel@daenzer.net >>>>> Cc: vbabka@suse.cz >>>>> Cc: ville.syrjala@linux.intel.com >>>>> Cc: daniel.vetter@ffwll.ch >>>>> Cc: dri-devel@lists.freedesktop.org >>>>> Cc: alexander.deucher@amd.com >>>>> Cc: christian.koenig@amd.com >>>>> --- >>>>> drivers/gpu/drm/drm_irq.c | 7 +++++++ >>>>> 1 file changed, 7 insertions(+) >>>>> >>>>> diff --git a/drivers/gpu/drm/drm_irq.c b/drivers/gpu/drm/drm_irq.c >>>>> index fb17c45..88bdf19 100644 >>>>> --- a/drivers/gpu/drm/drm_irq.c >>>>> +++ b/drivers/gpu/drm/drm_irq.c >>>>> @@ -216,6 +216,13 @@ static void drm_update_vblank_count(struct drm_device *dev, unsigned int pipe, >>>>> DRM_DEBUG_VBL("crtc %u: Redundant vblirq ignored." >>>>> " diff_ns = %lld, framedur_ns = %d)\n", >>>>> pipe, (long long) diff_ns, framedur_ns); >>>>> + >>>>> + /* No valid t_old to calculate diff? Bump +1 to force reinit. */ >>>>> + if (t_old->tv_sec == 0 && t_old->tv_usec == 0) { >>>>> + DRM_DEBUG_VBL("crtc %u: No baseline ts. Bump +1.\n", >>>>> + pipe); >>>>> + diff = 1; >>>>> + } >>>>> } else { >>>>> /* some kind of default for drivers w/o accurate vbl timestamping */ >>>>> diff = (flags & DRM_CALLED_FROM_VBLIRQ) != 0; >>>>> -- >>>>> 1.9.1 >>>>> >>>> >> >> -- >> Ville Syrj�l� >> Intel OTC >