From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: 
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1756712AbcBDM1I (ORCPT );
	Thu, 4 Feb 2016 07:27:08 -0500
Received: from foss.arm.com ([217.140.101.70]:41244 "EHLO foss.arm.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1756221AbcBDM1F (ORCPT );
	Thu, 4 Feb 2016 07:27:05 -0500
Date: Thu, 4 Feb 2016 12:27:45 +0000
From: Juri Lelli
To: Steven Rostedt
Cc: Peter Zijlstra, Ingo Molnar, LKML, Clark Williams, John Kacur,
	Daniel Bristot de Oliveira, Juri Lelli
Subject: Re: [BUG] Corrupted SCHED_DEADLINE bandwidth with cpusets
Message-ID: <20160204122745.GC29586@e106622-lin>
References: <20160203135550.5f95ecb2@gandalf.local.home>
 <20160204095448.GE12132@e106622-lin>
 <20160204120412.GA29586@e106622-lin>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20160204120412.GA29586@e106622-lin>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: 
X-Mailing-List: linux-kernel@vger.kernel.org

On 04/02/16 12:04, Juri Lelli wrote:
> On 04/02/16 09:54, Juri Lelli wrote:
> > Hi Steve,
> > 
> > first of all, thanks a lot for your detailed report; if only all bug
> > reports were like this.. :)
> > 
> > On 03/02/16 13:55, Steven Rostedt wrote:
> > 
> > [...]
> > 
> > Right. I think this is the same thing that happens after hotplug. IIRC
> > the code paths are actually the same. The problem is that hotplug and
> > cpuset reconfiguration operations are destructive w.r.t. root_domains,
> > so we lose bandwidth information when they happen: we only store
> > cumulative bandwidth information in the root_domain, while information
> > about which task belongs to which cpuset is stored in the cpuset data
> > structures.
> > 
> > I tried to fix this a while back, but my attempt was broken: I failed
> > to get the locking right and, even though it seemed to fix the issue
> > for me, it was prone to race conditions. You might still want to have
> > a look at it for reference: https://lkml.org/lkml/2015/9/2/162
> > 
> > [...]
> > 
> > It's good that we can recover, but yes, that's still a bug :/.
> > 
> > I'll try to see if my broken patch makes what you are seeing
> > disappear, so that we can at least confirm that we are seeing the
> > same problem; you could do the same if you want, I pushed that here
> 
> No, it doesn't solve this :/. I placed the restoring code in the hotplug
> workfn, so updates generated by toggling sched_load_balance don't get
> caught, of course. But this at least tells us that we should solve the
> problem someplace else.
> 

Well, if I call an unlocked version of my cpuset_hotplug_update_rd()
from kernel/cpuset.c:update_flag(), the issue seems to go away. But we
end up overcommitting the default null domain (try toggling
sched_load_balance multiple times). I updated the branch, but I still
think we should solve this differently.

Best,

- Juri
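
P.S.: To make the restoring idea above concrete, this is roughly its
shape (a simplified sketch, not the actual code on my branch:
restore_dl_bw() is a made-up name, the locking here is naive, and
dl_bw_of() is private to kernel/sched/core.c, so the real thing would
have to live there or grow a helper):

	/*
	 * After a destructive op rebuilds a root_domain (which zeroes
	 * its dl_bw accounting), walk the tasks attached to the cpuset
	 * and add each SCHED_DEADLINE task's bandwidth back.
	 */
	static void restore_dl_bw(struct cpuset *cs)
	{
		struct css_task_iter it;
		struct task_struct *task;
		struct dl_bw *dl_b;
		unsigned long flags;
		int cpu;

		/* Any active CPU of the cpuset reaches its root_domain. */
		cpu = cpumask_any_and(cs->effective_cpus, cpu_active_mask);
		if (cpu >= nr_cpu_ids)
			return;

		rcu_read_lock_sched();
		dl_b = dl_bw_of(cpu);

		css_task_iter_start(&cs->css, &it);
		while ((task = css_task_iter_next(&it))) {
			if (!dl_task(task))
				continue;
			raw_spin_lock_irqsave(&dl_b->lock, flags);
			__dl_add(dl_b, task->dl.dl_bw);
			raw_spin_unlock_irqrestore(&dl_b->lock, flags);
		}
		css_task_iter_end(&it);
		rcu_read_unlock_sched();
	}

The races I couldn't close are all around this walk: tasks can change
cpuset, or change policy, while we are re-adding their bandwidth.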