From: Tvrtko Ursulin
Subject: Re: [RFC v3 00/12] DRM scheduling cgroup controller
Date: Thu, 26 Jan 2023 17:57:24 +0000
References: <20230112165609.1083270-1-tvrtko.ursulin@linux.intel.com>
 <20230123154239.GA24348@blackbody.suse.cz>
 <371f3ce5-3468-b91d-d688-7e89499ff347@linux.intel.com>
 <20230126130050.GA22442@blackbody.suse.cz>
To: Tejun Heo, Michal Koutný
Cc: Intel-gfx, dri-devel, cgroups, linux-kernel, Johannes Weiner,
 Zefan Li, Dave Airlie, Daniel Vetter, Rob Clark, Stéphane Marchesin,
 "T. J. Mercier", Kenny Ho, Christian König, Brian Welty,
 Tvrtko Ursulin

Hi,

(Two replies in one, hope you will manage to navigate it.)
On 26/01/2023 17:04, Tejun Heo wrote:
> Hello,
>
> On Thu, Jan 26, 2023 at 02:00:50PM +0100, Michal Koutný wrote:
>> On Wed, Jan 25, 2023 at 06:11:35PM +0000, Tvrtko Ursulin wrote:
>>> I don't immediately see how you envisage the half-userspace
>>> implementation would look like in terms of what functionality/new
>>> APIs would be provided by the kernel?
>>
>> Output:
>>   drm.stat (with consumed time(s))
>>
>> Input:
>>   drm.throttle (alternatives)
>>     - a) writing 0,1 (in rough analogy to your proposed notifications)
>>     - b) writing duration (in loose analogy to memory.reclaim)
>>          - for how long GPU work should be backed off
>>
>> A userspace agent sitting between these two would do the measurement
>> and calculation depending on given policies (weighting, throttling)
>> and apply the respective controls.

Right, I wouldn't recommend drm.throttle as ABI, since my idea is to
enable drivers to do as good a job as they individually can. E.g. some
may be able to be much smarter than simple throttling, or some may
start off simpler and later gain a better implementation. Some may even
have differing capability or granularity depending on the GPU model
they are driving, as in the case of i915.

So even though the RFC shows just a simple i915 implementation, the
controller itself shouldn't prevent a smarter approach (via the exposed
ABI). And not even this simple i915 implementation works equally well
across all supported GPU generations! That will be a theme common to
many DRM drivers.

Secondly, doing this in userspace would require the ability to get some
sort of atomic snapshot of the whole tree hierarchy, to account for
changes in the layout of the tree and for task migrations. Or some
retry logic with added ABI fields to enable it.

Even then, I think the only thing we could move to userspace is the
tree-walking logic, and that does not sound like much kernel code saved
in exchange for the increased inefficiency.
>> (In resemblance of e.g. https://denji.github.io/cpulimit/)
>
> Yeah, things like this can be done from userspace but if we're gonna build
> the infrastructure to allow that in gpu drivers and so on, I don't see why
> we wouldn't add a generic in-kernel control layer if we can implement a
> proper weight based control. We can of course also expose .max style
> interface to allow userspace to do whatever they wanna do with it.

Yes, agreed. And to stress again, the ABI as proposed does not preclude
changing from scanning to charging or whatever. The idea was for it to
be compatible in concept with the CPU controller, and also to avoid
baking the controlling method into individual drivers.

>>> The problem there is to find a suitable point to charge at. If for a
>>> moment we limit the discussion to i915, out of the box we could have
>>> charging happening anywhere from several thousand times per second to
>>> effectively never. This is to illustrate the GPU context execution
>>> dynamics, which range from many small packets of work to multi-minute,
>>> or longer. For the latter to be accounted for we'd still need some
>>> periodic scanning, which would then perhaps go per driver. For the
>>> former we'd have thousands of needless updates per second.
>>>
>>> Hence my thinking was to pay both the cost of accounting and of
>>> collecting the usage data once per actionable event, where the latter
>>> is controlled by some reasonable scanning period/frequency.
>>>
>>> In addition to that, a few DRM drivers already support GPU usage
>>> querying via fdinfo, so with that being externally triggered, it is
>>> next to trivial to wire all those DRM drivers into such a common DRM
>>> cgroup controller framework. All that every driver needs to implement
>>> on top is the "over budget" callback.
>>
>> I'd also like to show a comparison with CPU accounting and the CPU
>> controller. There is tick-based (~sampling) measurement of various
>> components of CPU time (task_group_account_field()).
>> But the actual scheduling (weights) or throttling is based on precise
>> accounting (update_curr()).
>>
>> So, if the goal is to have precise and guaranteed limits, it shouldn't
>> (cannot) be based on sampling. OTOH, if it must be sampling based due
>> to the variability of the device landscape, it could be an advisory
>> mechanism with the userspace component.

I don't think precise and guaranteed limits are feasible given the
heterogeneous nature of DRM driver capabilities, but I also don't think
sticking a userspace component in the middle is the way to go.

> As for the specific control mechanism, yeah, charge based interface would
> be more conventional and my suspicion is that transposing the current
> implementation that way likely isn't too difficult. It just pushes "am I
> over the limit?" decisions to the specific drivers with the core layer
> telling them how much under/over budget they are.

As I have tried to explain in my previous reply, I don't think real-time
charging is feasible, because the frequency of charging events can be
both too high and too low. Too high in that it brings no value beyond
increased processing time, since it is not useful to send out
notifications at the same rate; and too low in the sense that some sort
of periodic query would then be needed in every driver implementation
anyway, to handle all classes of GPU clients properly.

I just don't see any positives to the charging approach in the DRM
landscape, but I do see some negatives. (If we ignore the positive of it
being a more typical approach; but then I don't think that alone is
enough to outweigh the negatives.)

> I'm curious what other gpu driver folks think about the current RFC tho.
> Is at least AMD on board with the approach?

Yes, I am keenly awaiting comments from the DRM colleagues as well.

Regards,

Tvrtko

P.S. Note that Michal's idea to simplify the client tracking is on my
TODO list.
If that works out, some patches (the series itself, actually) would
become even simpler.