From mboxrd@z Thu Jan 1 00:00:00 1970 From: Francisco Jerez Subject: Re: [PATCH] drm/nouveau: fix __nouveau_fence_wait performance regression Date: Wed, 09 Mar 2011 18:34:36 +0100 Message-ID: <87fwqwb3kj.fsf@riseup.net> References: <20110213203804.GA5395@joi.lan> <20110304164905.GA2743@joi.lan> <1299536668.29441.3.camel@nisroch> <20110307232256.GA2680@joi.lan> <87lj0qzaut.fsf@riseup.net> <20110308121628.GA20238@joi.lan> <871v2hzin7.fsf@riseup.net> <20110309001404.GA22679@joi.lan> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="===============0435099001==" Return-path: In-Reply-To: <20110309001404.GA22679-OI9uyE9O0yo@public.gmane.org> (Marcin Slusarz's message of "Wed, 9 Mar 2011 01:14:04 +0100") List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: nouveau-bounces+gcfxn-nouveau=m.gmane.org-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org Errors-To: nouveau-bounces+gcfxn-nouveau=m.gmane.org-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org To: Marcin Slusarz Cc: nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org, Ben Skeggs List-Id: nouveau.vger.kernel.org --===============0435099001== Content-Type: multipart/signed; boundary="==-=-="; micalg=pgp-sha256; protocol="application/pgp-signature" --==-=-= Content-Type: multipart/mixed; boundary="=-=-=" --=-=-= Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: quoted-printable Marcin Slusarz writes: > On Tue, Mar 08, 2011 at 05:22:52PM +0100, Francisco Jerez wrote: >> Marcin Slusarz writes: >>=20 >> > On Tue, Mar 08, 2011 at 01:58:50AM +0100, Francisco Jerez wrote: >> >> Marcin Slusarz writes: >> >>=20 >> >> > On Tue, Mar 08, 2011 at 08:24:26AM +1000, Ben Skeggs wrote: >> >> >> On Mon, 2011-03-07 at 18:18 +0000, Maarten Maathuis wrote: >> >> >> > On Fri, Mar 4, 2011 at 4:49 PM, Marcin Slusarz wrote: >> >> >> > > On Sun, Feb 13, 2011 at 09:38:04PM +0100, Marcin Slusarz wrote: >> >> >> > >> Combination of locking and interchannel synchronization chang= es >> >> >> > >> uncovered poor behaviour of nouveau_fence_wait, which on HZ= =3D100 >> >> >> > >> configuration could waste up to 10 ms per call. >> >> >> > >> Depending on application, it lead to 10-30% FPS regression. >> >> >> > >> To fix it, shorten thread sleep time to 0.1 ms and ensure >> >> >> > >> spinning happens for at least one *full* tick. >> >> >> > >> >> >> >> > >> Signed-off-by: Marcin Slusarz >> >> >> > >> --- >> >> >> > >> drivers/gpu/drm/nouveau/nouveau_fence.c | 10 ++++++++-- >> >> >> > >> 1 files changed, 8 insertions(+), 2 deletions(-) >> >> >> > >> >> >> >> > >> diff --git a/drivers/gpu/drm/nouveau/nouveau_fence.c b/driver= s/gpu/drm/nouveau/nouveau_fence.c >> >> >> > >> index 221b846..75ba5e2 100644 >> >> >> > >> --- a/drivers/gpu/drm/nouveau/nouveau_fence.c >> >> >> > >> +++ b/drivers/gpu/drm/nouveau/nouveau_fence.c >> >> >> > >> @@ -27,6 +27,9 @@ >> >> >> > >> #include "drmP.h" >> >> >> > >> #include "drm.h" >> >> >> > >> >> >> >> > >> +#include >> >> >> > >> +#include >> >> >> > >> + >> >> >> > >> #include "nouveau_drv.h" >> >> >> > >> #include "nouveau_ramht.h" >> >> >> > >> #include "nouveau_dma.h" >> >> >> > >> @@ -230,9 +233,12 @@ int >> >> >> > >> __nouveau_fence_wait(void *sync_obj, void *sync_arg, bool la= zy, bool intr) >> >> >> > >> { >> >> >> > >> unsigned long timeout =3D jiffies + (3 * DRM_HZ); >> >> >> > >> - unsigned long sleep_time =3D jiffies + 1; >> >> >> > >> + unsigned long sleep_time =3D jiffies + 2; >> >> >> > >> + ktime_t t; >> >> >> > >> int ret =3D 0; >> >> >> > >> >> >> >> > >> + t =3D ktime_set(0, NSEC_PER_MSEC / 10); >> >> >> > >> + >> >> >> > >> while (1) { >> >> >> > >> if (__nouveau_fence_signalled(sync_obj, sync_ar= g)) >> >> >> > >> break; >> >> >> > >> @@ -245,7 +251,7 @@ __nouveau_fence_wait(void *sync_obj, void= *sync_arg, bool lazy, bool intr) >> >> >> > >> __set_current_state(intr ? TASK_INTERRUPTIBLE >> >> >> > >> : TASK_UNINTERRUPTIBLE); >> >> >> > >> if (lazy && time_after_eq(jiffies, sleep_time)) >> >> >> > >> - schedule_timeout(1); >> >> >> > >> + schedule_hrtimeout(&t, HRTIMER_MODE_REL= ); >> >> >> > >> >> >> >> > >> if (intr && signal_pending(current)) { >> >> >> > >> ret =3D -ERESTARTSYS; >> >> >> > >> -- >> >> >> > >> 1.7.4.rc3 >> >> >> > >> >> >> >> > > >> >> >> > > ping again >> >> >> > > _______________________________________________ >> >> >> > > Nouveau mailing list >> >> >> > > Nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org >> >> >> > > http://lists.freedesktop.org/mailman/listinfo/nouveau >> >> >> > > >> >> >> >=20 >> >> >> > This looks ok to me, but I would like to get Ben Skeggs ok on th= is one >> >> >> > as well. So i've CC'ed him, hopefully he'll notice :-) >> >> >> Ah sorry, I have actually looked at this quite a while back but ca= me to >> >> >> no solid conclusion. >> >> >>=20 >> >> >> While yes, I did see some minor performance improvement from it, I= also >> >> >> notice that now we once again get 100% CPU usage while an app is w= aiting >> >> >> for the GPU a lot.. >> >> > >> >> > It's not "minor" performance improvement: >> >> > >> >> > without this patch (FPS): >> >> > nexuiz: 53 >> >> > wop: 181 >> >> > tremulous: 157 >> >> > wsw0.5: 89 >> >> > glxgears: 730 >> >> > >> >> > with: >> >> > nexuiz: 63 (+18%) >> >> > wop: 248 (+37%) >> >> > tremulous: 156 (-0.6%) >> >> > wsw0.5: 91 (+2%) >> >> > glxgears: 1054 (+44%) >> >> > >> >> > >> >> > Ok, so you are worried about CPU usage... Let's see what will happe= n if >> >> > I remove spinning added by "drm/nouveau: Spin for a bit in=20 >> >> > nouveau_fence_wait() before yielding the CPU". >> >> > >> >> > reduced version (attached): >> >> > nexuiz: 62 >> >> > wop: 248 >> >> > trem: 157 >> >> > wsw0.5: 90 >> >> > glxgears: 1055 >> >> > >> >> > Good enough? >> >>=20 >> >> Remember to exercise some software fallbacks as well (e.g. something >> >> using core fonts), software fallbacks were the main users of the >> >> spinning you've removed. >> > >> > corefonts are pretty fast (measured "time dmesg"): >> > >> > without (spinning + timeout 10ms): 0.08s >> > with (spinning + hrtimeout 0.1ms): 0.08s >> > reduced (no spinning + hrtimeout 0.1ms): 0.25s >> > old (no spinning + timeout 10ms): 13s >> > >> Ah, so it's still trading one performance regression for another, and >> you could make everyone happy at the same time. >>=20 >> > So I think "no spinning + hrtimeout 0.1ms" is a reasonable compromise.= .. >> > >> What's the CPU usage difference between the spinning and the no-spinning >> cases?=20 > > 1 cpu set to performance mode > > spinning + hrtimeout 0.1ms: > FPS usr sys > nexuiz: 63 46.60 + 52.36 > wop: 248 57.54 + 41.99 > trem: 156 92.40 + 7.30 > wsw0.5: 91 52.91 + 46.37 > glxgears: 1054 10.00 + 90.00 > corefonts: 42.86 + 54.29 0.08s(time) > > So it fills the CPU in almost 100%... > > no spinning + hrtimeout 0.1ms: > FPS usr sys > nexuiz: 62 49.97 + 8.42 > wop: 248 58.04 + 22.04 > trem: 157 92.42 + 6.92 > wsw0.5: 90 51.69 + 4.58 > glxgears: 1055 11.45 + 11.05 > corefonts: 20.52 + 7.82 0.25s > > OK. > So I did some more tests: > > no spinning + hrtimeout 0.01ms: > FPS usr sys > nexuiz: 63 49.50 + 14.01 > wop: 245 57.21 + 24.66 > trem: 148 89.31 + 10.01 > wsw0.5: 91 52.92 + 11.10 > glxgears: 1055 10.61 + 22.34 > corefonts: 38.24 + 27.45 0.09s > > tremulous FPS is down, sys CPU usage is down, but not so good as in > "no spinning, 0.1ms", corefonts are almost as fast > > --- > > no spinning + hrtimeout 0.01ms increasing by factor x2, max at 1ms: > nexuiz: 62 48.38 + 8.00 > wop: 245 56.55 + 19.92 > trem: 149 90.46 + 8.86 > wsw0.5: 92 52.10 + 4.56 > glxgears: 1026 11.68 + 9.87 > corefonts: 46.60 + 25.24 0.09s > > almost like "no spinning + 0.01ms", but glxgears FPS is down > > --- > > no spinning + hrtimeout 0.001ms: > nexuiz: 63 52.04 + 16.13 > wop: 246 58.94 + 22.59 > trem: 155 92.73 + 6.38 > wsw0.5: 91 54.39 + 16.88 > glxgears: 1055 10.62 + 30.55 > corefonts: 53.01 + 28.92 0.07s > > tremulous FPS is back, sys CPU usage is sometimes bigger, sometimes small= er, > corefonts are fast, but take a lot of CPU time > > --- > > no spinning + hrtimeout 0.001ms increasing by factor x2, max at 1ms > nexuiz: 62 49.46 + 6.70 > wop: 247 58.80 + 17.95 > trem: 156 92.16 + 7.28 > wsw0.5: 91 52.79 + 4.03 > glxgears: 1050 11.44 + 12.54 > corefonts: 36.05 + 32.56 0.07s > > glxgears FPS is a bit down, sys CPU times are as good (or better) as in > "no spinning, 0.1ms", corefonts are fast > > this is the best, patch below > >> It's likely to be negligible for most applications aside from the >> ones using queries and fallbacks intensively, and in those two cases I >> agree with you that optimizing for low CPU usage doesn't make a huge lot >> of sense, getting low latency is already hard enough. >>=20 >> If I'm wrong and the initial spinning affects the overall CPU usage >> negatively, then we have two different use cases with different latency >> requirements and the DRM API needs to be fixed (though, there're maybe >> other solutions to explore first, like, start with a really small >> hrtimeout and increase it exponentially up to some cut-off value). >>=20 >> > BTW, old behaviour (no spinning + timeout 10ms) affects other workload= s too >> > nexuiz: 50 >> > wop: 153 >> > tremulous: 155 >> > wsw0.5: 89 >> > glxgears: 100 (!) >> > >> >> Anyway, software fallbacks and occlusion queries are the only two pla= ces >> >> (that I can think of now) where we need the low latency your patch >> >> gives, and, as Ben already pointed out, we probably want to keep CPU >> >> usage at minimum in every other case. As a middle ground, the "lazy" >> >> flag (or rather, a "hog" flag?) could be exposed all the way up to >> >> userspace, and those two cases fixed to set the flag differently. >> >>=20 >> >> What do you think? >> > >> > I'm not sure. I think optimizing for low CPU usage is not the best what >> > we can do right now. 3D performance is still too low behind blob. >> > Let's fix 3D perf first and think about CPU usage later. >> > >>=20 >> IMHO, switching to lazy waits was the right choice at this stage, it >> doesn't make optimizing for "3D performance" any harder, quite the >> opposite, it helps to pinpoint some poorly-pipelining programming >> practices by making the already existing performance problem more >> obvious. >>=20 >> >> > >> >> > --- >> >> > From: Marcin Slusarz >> >> > Subject: [PATCH] drm/nouveau: fix __nouveau_fence_wait performance = regression >> >> > >> >> > Combination of locking and interchannel synchronization changes >> >> > uncovered poor behaviour of nouveau_fence_wait, which on HZ=3D100 >> >> > configuration could waste up to 10 ms per call. >> >> > Depending on application, it lead to 10-30% FPS regression. >> >> > >> >> > To fix it, shorten thread sleep time to 0.1 ms. >> >> > >> >> > Additionally, remove spinning (added by "drm/nouveau: Spin for >> >> > a bit in nouveau_fence_wait() before yielding the CPU"), because >> >> > it's not needed anymore. >> >> > >> >> > Signed-off-by: Marcin Slusarz >> >> > --- >> >> > drivers/gpu/drm/nouveau/nouveau_fence.c | 11 ++++++++--- >> >> > 1 files changed, 8 insertions(+), 3 deletions(-) >> >> > >> >> > diff --git a/drivers/gpu/drm/nouveau/nouveau_fence.c b/drivers/gpu/= drm/nouveau/nouveau_fence.c >> >> > index a244702..010243b 100644 >> >> > --- a/drivers/gpu/drm/nouveau/nouveau_fence.c >> >> > +++ b/drivers/gpu/drm/nouveau/nouveau_fence.c >> >> > @@ -27,6 +27,9 @@ >> >> > #include "drmP.h" >> >> > #include "drm.h" >> >> >=20=20 >> >> > +#include >> >> > +#include >> >> > + >> >> > #include "nouveau_drv.h" >> >> > #include "nouveau_ramht.h" >> >> > #include "nouveau_dma.h" >> >> > @@ -229,9 +232,11 @@ int >> >> > __nouveau_fence_wait(void *sync_obj, void *sync_arg, bool lazy, bo= ol intr) >> >> > { >> >> > unsigned long timeout =3D jiffies + (3 * DRM_HZ); >> >> > - unsigned long sleep_time =3D jiffies + 1; >> >> > + ktime_t t; >> >> > int ret =3D 0; >> >> >=20=20 >> >> > + t =3D ktime_set(0, NSEC_PER_MSEC / 10); >> >> > + >> >> > while (1) { >> >> > if (__nouveau_fence_signalled(sync_obj, sync_arg)) >> >> > break; >> >> > @@ -243,8 +248,8 @@ __nouveau_fence_wait(void *sync_obj, void *sync= _arg, bool lazy, bool intr) >> >> >=20=20 >> >> > __set_current_state(intr ? TASK_INTERRUPTIBLE >> >> > : TASK_UNINTERRUPTIBLE); >> >> > - if (lazy && time_after_eq(jiffies, sleep_time)) >> >> > - schedule_timeout(1); >> >> > + if (lazy) >> >> > + schedule_hrtimeout(&t, HRTIMER_MODE_REL); >> >> >=20=20 >> >> > if (intr && signal_pending(current)) { >> >> > ret =3D -ERESTARTSYS; > > > > --- > From: Marcin Slusarz > Subject: [PATCH] drm/nouveau: fix __nouveau_fence_wait performance > > Commit fcccab2e4eb8d579837481054cc2cb28eea0baef > ("drm/nouveau: Use lazy fence waits when doing software interchannel sync= ") > turned on lazy waits. Unfortunately __nouveau_fence_wait was not optimized > for this case and on HZ=3D100 kernel wasted up to 10 ms per call. > That patch isn't the culprit of your problem, you're unlikely to be hitting the failure path of nouveau_fence_sync(), and that's the only thing I changed with that commit. Lazy waits were enabled earlier by 21e86c1c8a844bf978f8fc431a59c9f5a578812d. That aside it looks good, I've pushed it after fixing the commit description. Thank you. > Depending on application, it lead to 10-30% FPS regression. > > Fix it. > > Signed-off-by: Marcin Slusarz > --- > drivers/gpu/drm/nouveau/nouveau_fence.c | 17 ++++++++++++++--- > 1 files changed, 14 insertions(+), 3 deletions(-) > > diff --git a/drivers/gpu/drm/nouveau/nouveau_fence.c b/drivers/gpu/drm/no= uveau/nouveau_fence.c > index a244702..78d32dd 100644 > --- a/drivers/gpu/drm/nouveau/nouveau_fence.c > +++ b/drivers/gpu/drm/nouveau/nouveau_fence.c > @@ -27,6 +27,9 @@ > #include "drmP.h" > #include "drm.h" >=20=20 > +#include > +#include > + > #include "nouveau_drv.h" > #include "nouveau_ramht.h" > #include "nouveau_dma.h" > @@ -229,9 +232,12 @@ int > __nouveau_fence_wait(void *sync_obj, void *sync_arg, bool lazy, bool int= r) > { > unsigned long timeout =3D jiffies + (3 * DRM_HZ); > - unsigned long sleep_time =3D jiffies + 1; > + unsigned long sleep_time =3D NSEC_PER_MSEC / 1000; > + ktime_t t; > int ret =3D 0; >=20=20 > + t =3D ktime_set(0, sleep_time); > + > while (1) { > if (__nouveau_fence_signalled(sync_obj, sync_arg)) > break; > @@ -243,8 +249,13 @@ __nouveau_fence_wait(void *sync_obj, void *sync_arg,= bool lazy, bool intr) >=20=20 > __set_current_state(intr ? TASK_INTERRUPTIBLE > : TASK_UNINTERRUPTIBLE); > - if (lazy && time_after_eq(jiffies, sleep_time)) > - schedule_timeout(1); > + if (lazy) { > + schedule_hrtimeout(&t, HRTIMER_MODE_REL); > + sleep_time *=3D 2; > + if (sleep_time > NSEC_PER_MSEC) > + sleep_time =3D NSEC_PER_MSEC; > + t =3D ktime_set(0, sleep_time); > + } >=20=20 > if (intr && signal_pending(current)) { > ret =3D -ERESTARTSYS; --=-=-=-- --==-=-= Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (GNU/Linux) iF4EAREIAAYFAk13ui0ACgkQg5k4nX1Sv1t/JQD/YzreI+ir1LIdyq6w6QpC8Pnu ihynDO29FbNGQswBL60BAJN8QvcTFZV7n/8CjtmA7loce5s7lETP8yxYPmWwQP5p =Em/+ -----END PGP SIGNATURE----- --==-=-=-- --===============0435099001== Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Disposition: inline _______________________________________________ Nouveau mailing list Nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org http://lists.freedesktop.org/mailman/listinfo/nouveau --===============0435099001==--