* How to design a DRM KMS driver exposing 2D compositing?
@ 2014-08-11 10:38 Pekka Paalanen
  2014-08-11 10:57 ` Damien Lespiau
                   ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Pekka Paalanen @ 2014-08-11 10:38 UTC (permalink / raw)
  To: dri-devel

Hi,

there is some hardware that can do 2D compositing with an arbitrary
number of planes. I'm not sure what the absolute maximum number of
planes is, but for the discussion, let's say it is 100.

There are many complicated, dynamic constraints on how many, what size,
etc. planes can be used at once. A driver would be able to check those
before kicking the 2D compositing engine.

The 2D compositing engine in the best case (only few planes used) is
able to composite on the fly in scanout, just like the usual overlay
hardware blocks in CRTCs. When the composition complexity goes up, the
driver can fall back to compositing into a buffer rather than on the
fly in scanout. This fallback needs to be completely transparent to the
user space, implying only additional latency if anything.

These 2D compositing features should be exposed to user space through a
standard kernel ABI, hopefully an existing ABI in the very near future
like the KMS atomic.

Assuming the DRM universal planes and atomic mode setting / page flip
infrastructure is in place, could the 2D compositing capabilities be
exposed through universal planes? We can assume that plane properties
are enough to describe all the compositing parameters.

Atomic updates are needed so that the complicated constraints can be
checked, and user space can try to reduce the composition complexity if
the kernel driver sees that it won't work.
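
The check-and-retry idea above can be sketched roughly as follows. This is only an illustration under stated assumptions: `test_commit()` is a hypothetical stand-in for a test-only atomic commit, and `SKETCH_HW_LIMIT` is an invented number, not a property of any real hardware.

```c
#include <stdbool.h>

/* Invented limit, for illustration only. */
#define SKETCH_HW_LIMIT 5

/* Hypothetical stand-in for a test-only atomic commit: returns true if the
 * driver would accept a configuration using n_planes planes. A real driver
 * would evaluate its dynamic constraints here. */
static bool test_commit(int n_planes)
{
    return n_planes <= SKETCH_HW_LIMIT;
}

/* Ask for 'wanted' planes and reduce the composition complexity one plane
 * at a time until the (simulated) driver accepts. Returns the accepted
 * plane count; the remaining surfaces would have to be composited some
 * other way by the compositor. */
static int pick_plane_count(int wanted)
{
    int n = wanted;

    while (n > 0 && !test_commit(n))
        n--;
    return n;
}
```

With the invented limit of 5, a request for 8 planes ends up with 5 accepted, while a request for 3 is accepted unchanged.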

Would it be feasible to generate a hundred identical non-primary planes
to be exposed to user space via DRM?

If that could be done, the kernel driver could just use the existing
kernel/user ABIs without having to invent something new, and programs
like a Wayland compositor would not need to be coded specifically for
this hardware.

What problems do you see with this plan?
Are any of those problems unfixable or simply prohibitive?

I have some concerns, which I am not sure will actually be a problem:
- Does allocating 100 planes eat too much kernel memory?
  I mean just the bookkeeping, properties, etc.
- Would such an amount of planes make some in-kernel algorithms slow
  (particularly in DRM common code)?
- Considering how user space discovers all DRM resources, would this
  make a compositor "slow" to start?

I suppose whether these turn out to be prohibitive or not, one just has
to implement it and see. It should be usable on a slowish CPU with
unimpressive amounts of RAM, because that is where a separate 2D
compositing engine gives the most kick.

FWIW, dynamically created/destroyed planes would probably not be the
answer. The kernel driver cannot decide before-hand how many planes it
can expose. How many planes can be used depends completely on how user
space decides to use them. Therefore I believe it should expose the
maximum number always, whether there is any real use case that could
actually get them all running or not.

What if I cannot even pick a maximum number of planes, but wanted to
(as the hardware allows) let the 2D compositing scale up basically
unlimited while becoming just slower and slower?

I think at that point one would be looking at a rendering API really,
rather than a KMS API, so it's probably out of scope. Where is the line
between KMS 2D compositing with planes vs. 2D composite rendering?

Should I really be designing a driver-specific compositing API instead,
similar to what the Mesa OpenGL implementations use? Then have user
space maybe use the user space driver part via OpenWFC perhaps?
And when I mention OpenWFC, you probably notice, that I am not aware of
any standard user space API I could be implementing here. ;-)


Thanks,
pq

* Re: How to design a DRM KMS driver exposing 2D compositing?
  2014-08-11 10:38 How to design a DRM KMS driver exposing 2D compositing? Pekka Paalanen
@ 2014-08-11 10:57 ` Damien Lespiau
  2014-08-11 12:07   ` Pekka Paalanen
  2014-08-11 12:06 ` Daniel Vetter
  2014-08-11 14:37 ` Matt Roper
  2 siblings, 1 reply; 23+ messages in thread
From: Damien Lespiau @ 2014-08-11 10:57 UTC (permalink / raw)
  To: Pekka Paalanen; +Cc: dri-devel

On Mon, Aug 11, 2014 at 01:38:55PM +0300, Pekka Paalanen wrote:
> Hi,

Hi,

> there is some hardware that can do 2D compositing with an arbitrary
> number of planes. I'm not sure what the absolute maximum number of
> planes is, but for the discussion, let's say it is 100.
> 
> There are many complicated, dynamic constraints on how many, what size,
> etc. planes can be used at once. A driver would be able to check those
> before kicking the 2D compositing engine.
> 
> The 2D compositing engine in the best case (only few planes used) is
> able to composite on the fly in scanout, just like the usual overlay
> hardware blocks in CRTCs. When the composition complexity goes up, the
> driver can fall back to compositing into a buffer rather than on the
> fly in scanout. This fallback needs to be completely transparent to the
> user space, implying only additional latency if anything.

This looks like a fallback that would use GL to compose the intermediate
buffer. Any reason why that fallback can't be kicked from userspace?

-- 
Damien

* Re: How to design a DRM KMS driver exposing 2D compositing?
  2014-08-11 10:38 How to design a DRM KMS driver exposing 2D compositing? Pekka Paalanen
  2014-08-11 10:57 ` Damien Lespiau
@ 2014-08-11 12:06 ` Daniel Vetter
  2014-08-11 12:47   ` Pekka Paalanen
                     ` (2 more replies)
  2014-08-11 14:37 ` Matt Roper
  2 siblings, 3 replies; 23+ messages in thread
From: Daniel Vetter @ 2014-08-11 12:06 UTC (permalink / raw)
  To: Pekka Paalanen; +Cc: dri-devel

On Mon, Aug 11, 2014 at 01:38:55PM +0300, Pekka Paalanen wrote:
> Hi,
> 
> there is some hardware that can do 2D compositing with an arbitrary
> number of planes. I'm not sure what the absolute maximum number of
> planes is, but for the discussion, let's say it is 100.
> 
> There are many complicated, dynamic constraints on how many, what size,
> etc. planes can be used at once. A driver would be able to check those
> before kicking the 2D compositing engine.
> 
> The 2D compositing engine in the best case (only few planes used) is
> able to composite on the fly in scanout, just like the usual overlay
> hardware blocks in CRTCs. When the composition complexity goes up, the
> driver can fall back to compositing into a buffer rather than on the
> fly in scanout. This fallback needs to be completely transparent to the
> user space, implying only additional latency if anything.
> 
> These 2D compositing features should be exposed to user space through a
> standard kernel ABI, hopefully an existing ABI in the very near future
> like the KMS atomic.

I presume we're talking about the video core from raspi? Or at least
something similar?

> Assuming the DRM universal planes and atomic mode setting / page flip
> infrastructure is in place, could the 2D compositing capabilities be
> exposed through universal planes? We can assume that plane properties
> are enough to describe all the compositing parameters.
> 
> Atomic updates are needed so that the complicated constraints can be
> checked, and user space can try to reduce the composition complexity if
> the kernel driver sees that it won't work.
> 
> Would it be feasible to generate a hundred identical non-primary planes
> to be exposed to user space via DRM?
> 
> If that could be done, the kernel driver could just use the existing
> kernel/user ABIs without having to invent something new, and programs
> like a Wayland compositor would not need to be coded specifically for
> this hardware.
> 
> What problems do you see with this plan?
> Are any of those problems unfixable or simply prohibitive?
> 
> I have some concerns, which I am not sure will actually be a problem:
> - Does allocating 100 planes eat too much kernel memory?
>   I mean just the bookkeeping, properties, etc.
> - Would such an amount of planes make some in-kernel algorithms slow
>   (particularly in DRM common code)?
> - Considering how user space discovers all DRM resources, would this
>   make a compositor "slow" to start?

I don't see any problem with that. We have a few plane-loops, but iirc
those can be easily fixed to use indices and similar stuff. The atomic
ioctl itself should scale nicely.

> I suppose whether these turn out to be prohibitive or not, one just has
> to implement it and see. It should be usable on a slowish CPU with
> unimpressive amounts of RAM, because that is where a separate 2D
> compositing engine gives the most kick.
> 
> FWIW, dynamically created/destroyed planes would probably not be the
> answer. The kernel driver cannot decide before-hand how many planes it
> can expose. How many planes can be used depends completely on how user
> space decides to use them. Therefore I believe it should expose the
> maximum number always, whether there is any real use case that could
> actually get them all running or not.

Yeah dynamic planes doesn't sound like a nice solution, least because
you'll get to audit piles of code. Currently really only framebuffers (and
to some extent connectors) can come and go freely in kms-land.

> What if I cannot even pick a maximum number of planes, but wanted to
> (as the hardware allows) let the 2D compositing scale up basically
> unlimited while becoming just slower and slower?
> 
> I think at that point one would be looking at a rendering API really,
> rather than a KMS API, so it's probably out of scope. Where is the line
> between KMS 2D compositing with planes vs. 2D composite rendering?

I think kms should still be real-time compositing - if you have to
internally render to a buffer and then scan that one out due to lack of
memory bandwidth or so that very much sounds like a rendering api. Ofc
stuff like writeback buffers blur that a bit. But hw writeback is still
real-time.

> Should I really be designing a driver-specific compositing API instead,
> similar to what the Mesa OpenGL implementations use? Then have user
> space maybe use the user space driver part via OpenWFC perhaps?
> And when I mention OpenWFC, you probably notice, that I am not aware of
> any standard user space API I could be implementing here. ;-)

Personally I'd expose a bunch of planes with kms (enough so that you can
reap the usual benefits planes bring wrt video-playback and stuff like
that). So perhaps something in line with what current hw does in hw and
then double it a bit or twice - 16 planes or so. Your driver would reject
any requests that need intermediate buffers to store render results. I.e.
everything that can't be scanned out directly in real-time at about 60fps.
The fun with kms planes is also that right now we have 0 standards for
z-ordering and blending. So would need to define that first.
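
The rejection rule suggested above might look roughly like this in a driver's check step. It is a minimal sketch under stated assumptions: `struct sketch_plane`, `sketch_check()` and the per-frame pixel budget are all hypothetical simplifications, not the real DRM structures or the real hardware limits.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_PLANES 16 /* "16 planes or so" from the text above */

/* Hypothetical, simplified plane state; not the real DRM plane state. */
struct sketch_plane {
    uint32_t width, height; /* on-screen size in pixels */
    int enabled;
};

/* Accept a configuration only if every enabled plane could be fetched on
 * the fly within one refresh. pixels_per_frame_budget is an invented
 * stand-in for the hardware's real, more complicated constraints. */
static int sketch_check(const struct sketch_plane *planes, size_t count,
                        uint64_t pixels_per_frame_budget)
{
    uint64_t needed = 0;
    size_t i;

    if (count > SKETCH_MAX_PLANES)
        return -EINVAL;

    for (i = 0; i < count; i++) {
        if (planes[i].enabled)
            needed += (uint64_t)planes[i].width * planes[i].height;
    }

    /* Anything that would need an intermediate buffer is rejected. */
    return needed <= pixels_per_frame_budget ? 0 : -EINVAL;
}
```

User space would see the error and retry with fewer planes, doing the rest of the composition through whatever separate API ends up being available.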

Then expose everything else with a separate api. I guess you'll just end
up with per-compositor userspace drivers due to the lack of a widespread
2d api. OpenVG is kinda dead, and cairo might not fit.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

* Re: How to design a DRM KMS driver exposing 2D compositing?
  2014-08-11 10:57 ` Damien Lespiau
@ 2014-08-11 12:07   ` Pekka Paalanen
  2014-08-11 13:14     ` Damien Lespiau
  0 siblings, 1 reply; 23+ messages in thread
From: Pekka Paalanen @ 2014-08-11 12:07 UTC (permalink / raw)
  To: Damien Lespiau; +Cc: dri-devel

On Mon, 11 Aug 2014 11:57:10 +0100
Damien Lespiau <damien.lespiau@intel.com> wrote:

> On Mon, Aug 11, 2014 at 01:38:55PM +0300, Pekka Paalanen wrote:
> > Hi,
> 
> Hi,
> 
> > there is some hardware that can do 2D compositing with an arbitrary
> > number of planes. I'm not sure what the absolute maximum number of
> > planes is, but for the discussion, let's say it is 100.
> > 
> > There are many complicated, dynamic constraints on how many, what size,
> > etc. planes can be used at once. A driver would be able to check those
> > before kicking the 2D compositing engine.
> > 
> > The 2D compositing engine in the best case (only few planes used) is
> > able to composite on the fly in scanout, just like the usual overlay
> > hardware blocks in CRTCs. When the composition complexity goes up, the
> > driver can fall back to compositing into a buffer rather than on the
> > fly in scanout. This fallback needs to be completely transparent to the
> > user space, implying only additional latency if anything.
> 
> This looks like a fallback that would use GL to compose the intermediate
> buffer. Any reason why that fallback can't be kicked from userspace?

It is not GL, and GL might not be available or desirable. It is still
the same 2D compositing engine in hardware, but now running with
off-screen target buffer, because it cannot anymore keep up with the
continuous pixel rate that the direct scanout would need.

If we were to use the 2D compositing engine from user space, we would
be on the road to OpenWFC. IOW, there is no standard API for the
user space to use yet, as far as I'm aware. ;-)

I'm just trying to avoid having to design a kernel driver ABI for a
user space driver, then design/implement some standard user space
API on top, and then go fix all compositors to actually use it instead
of / with KMS.


Thanks,
pq

* Re: How to design a DRM KMS driver exposing 2D compositing?
  2014-08-11 12:06 ` Daniel Vetter
@ 2014-08-11 12:47   ` Pekka Paalanen
  2014-08-11 15:35     ` Daniel Vetter
  2014-08-11 13:32   ` Rob Clark
  2014-08-11 17:16   ` Eric Anholt
  2 siblings, 1 reply; 23+ messages in thread
From: Pekka Paalanen @ 2014-08-11 12:47 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: dri-devel

Hi Daniel,

you make perfect sense as usual. :-)
Comments below.

On Mon, 11 Aug 2014 14:06:36 +0200
Daniel Vetter <daniel@ffwll.ch> wrote:

> On Mon, Aug 11, 2014 at 01:38:55PM +0300, Pekka Paalanen wrote:
> > Hi,
> > 
> > there is some hardware that can do 2D compositing with an arbitrary
> > number of planes. I'm not sure what the absolute maximum number of
> > planes is, but for the discussion, let's say it is 100.
> > 
> > There are many complicated, dynamic constraints on how many, what size,
> > etc. planes can be used at once. A driver would be able to check those
> > before kicking the 2D compositing engine.
> > 
> > The 2D compositing engine in the best case (only few planes used) is
> > able to composite on the fly in scanout, just like the usual overlay
> > hardware blocks in CRTCs. When the composition complexity goes up, the
> > driver can fall back to compositing into a buffer rather than on the
> > fly in scanout. This fallback needs to be completely transparent to the
> > user space, implying only additional latency if anything.
> > 
> > These 2D compositing features should be exposed to user space through a
> > standard kernel ABI, hopefully an existing ABI in the very near future
> > like the KMS atomic.
> 
> I presume we're talking about the video core from raspi? Or at least
> something similar?

Yes.

> > Assuming the DRM universal planes and atomic mode setting / page flip
> > infrastructure is in place, could the 2D compositing capabilities be
> > exposed through universal planes? We can assume that plane properties
> > are enough to describe all the compositing parameters.
> > 
> > Atomic updates are needed so that the complicated constraints can be
> > checked, and user space can try to reduce the composition complexity if
> > the kernel driver sees that it won't work.
> > 
> > Would it be feasible to generate a hundred identical non-primary planes
> > to be exposed to user space via DRM?
> > 
> > If that could be done, the kernel driver could just use the existing
> > kernel/user ABIs without having to invent something new, and programs
> > like a Wayland compositor would not need to be coded specifically for
> > this hardware.
> > 
> > What problems do you see with this plan?
> > Are any of those problems unfixable or simply prohibitive?
> > 
> > I have some concerns, which I am not sure will actually be a problem:
> > - Does allocating 100 planes eat too much kernel memory?
> >   I mean just the bookkeeping, properties, etc.
> > - Would such an amount of planes make some in-kernel algorithms slow
> >   (particularly in DRM common code)?
> > - Considering how user space discovers all DRM resources, would this
> >   make a compositor "slow" to start?
> 
> I don't see any problem with that. We have a few plane-loops, but iirc
> those can be easily fixed to use indices and similar stuff. The atomic
> ioctl itself should scale nicely.

Very nice.

> > I suppose whether these turn out to be prohibitive or not, one just has
> > to implement it and see. It should be usable on a slowish CPU with
> > unimpressive amounts of RAM, because that is where a separate 2D
> > compositing engine gives the most kick.
> > 
> > FWIW, dynamically created/destroyed planes would probably not be the
> > answer. The kernel driver cannot decide before-hand how many planes it
> > can expose. How many planes can be used depends completely on how user
> > space decides to use them. Therefore I believe it should expose the
> > maximum number always, whether there is any real use case that could
> > actually get them all running or not.
> 
> Yeah dynamic planes doesn't sound like a nice solution, least because
> you'll get to audit piles of code. Currently really only framebuffers (and
> to some extent connectors) can come and go freely in kms-land.

Yup, thought so.

> > What if I cannot even pick a maximum number of planes, but wanted to
> > (as the hardware allows) let the 2D compositing scale up basically
> > unlimited while becoming just slower and slower?
> > 
> > I think at that point one would be looking at a rendering API really,
> > rather than a KMS API, so it's probably out of scope. Where is the line
> > between KMS 2D compositing with planes vs. 2D composite rendering?
> 
> I think kms should still be real-time compositing - if you have to
> internally render to a buffer and then scan that one out due to lack of
> memory bandwidth or so that very much sounds like a rendering api. Ofc
> stuff like writeback buffers blur that a bit. But hw writeback is still
> real-time.

Agreed, that's a good and clear definition, even if it might make my
life harder.

I'm still not completely sure that using an intermediate buffer means
sacrificing real-time performance (i.e. being able to hit the next
vblank the user space is aiming for). Maybe the 2D engine output rate
fluctuates so that the scanout block would have problems, but a buffer
can still be completed in time. Anyway, details.

Would using an intermediate buffer be ok if we can still maintain
real-time? That is, say, if a compositor kicks the atomic update e.g.
7 ms before vblank, we would still hit it even with the intermediate
buffer? If that is actually possible, I don't know yet.

> > Should I really be designing a driver-specific compositing API instead,
> > similar to what the Mesa OpenGL implementations use? Then have user
> > space maybe use the user space driver part via OpenWFC perhaps?
> > And when I mention OpenWFC, you probably notice, that I am not aware of
> > any standard user space API I could be implementing here. ;-)
> 
> Personally I'd expose a bunch of planes with kms (enough so that you can
> reap the usual benefits planes bring wrt video-playback and stuff like
> that). So perhaps something in line with what current hw does in hw and
> then double it a bit or twice - 16 planes or so. Your driver would reject
> any requests that need intermediate buffers to store render results. I.e.
> everything that can't be scanned out directly in real-time at about 60fps.
> The fun with kms planes is also that right now we have 0 standards for
> z-ordering and blending. So would need to define that first.

I do not yet know where that real-time limit is, but I'm guessing it
could be pretty low. If it is, we might start hitting software
compositing (like Pixman) very often, which is too slow to be usable.

Defining z-order and blending sounds like peanuts compared to below.

> Then expose everything else with a separate api. I guess you'll just end
> up with per-compositor userspace drivers due to the lack of a widespread
> 2d api. OpenVG is kinda dead, and cairo might not fit.

Yeah, that is kind of the worst case, which also seems unavoidable.


Thanks,
pq

* Re: How to design a DRM KMS driver exposing 2D compositing?
  2014-08-11 12:07   ` Pekka Paalanen
@ 2014-08-11 13:14     ` Damien Lespiau
  2014-08-11 13:44       ` Pekka Paalanen
  0 siblings, 1 reply; 23+ messages in thread
From: Damien Lespiau @ 2014-08-11 13:14 UTC (permalink / raw)
  To: Pekka Paalanen; +Cc: dri-devel

On Mon, Aug 11, 2014 at 03:07:33PM +0300, Pekka Paalanen wrote:
> > > there is some hardware that can do 2D compositing with an arbitrary
> > > number of planes. I'm not sure what the absolute maximum number of
> > > planes is, but for the discussion, let's say it is 100.
> > > 
> > > There are many complicated, dynamic constraints on how many, what size,
> > > etc. planes can be used at once. A driver would be able to check those
> > > before kicking the 2D compositing engine.
> > > 
> > > The 2D compositing engine in the best case (only few planes used) is
> > > able to composite on the fly in scanout, just like the usual overlay
> > > hardware blocks in CRTCs. When the composition complexity goes up, the
> > > driver can fall back to compositing into a buffer rather than on the
> > > fly in scanout. This fallback needs to be completely transparent to the
> > > user space, implying only additional latency if anything.
> > 
> > This looks like a fallback that would use GL to compose the intermediate
> > buffer. Any reason why that fallback can't be kicked from userspace?
> 
> It is not GL, and GL might not be available or desirable. It is still
> the same 2D compositing engine in hardware, but now running with
> off-screen target buffer, because it cannot anymore keep up with the
> continuous pixel rate that the direct scanout would need.

I didn't mean this was GL, but just making the parallel, ie. we wouldn't
put a GL fallback into the kernel.

> If we were to use the 2D compositing engine from user space, we would
> be on the road to OpenWFC. IOW, there is no standard API for the
> user space to use yet, as far as I'm aware. ;-)
> 
> I'm just trying to avoid having to design a kernel driver ABI for a
> user space driver, then design/implement some standard user space
> API on top, and then go fix all compositors to actually use it instead
> of / with KMS.

It's no easy trade-off. For instance, if the compositor doesn't know
about some of the hw constraints you are talking about, it may ask the
kernel for a configuration that suddenly will only allow 20 fps updates
(because of the bw limitation you're mentioning). And the compositor
just wouldn't know.

I can only speak for the hw I know, if you want to squeeze everything
you can from that simple (compared to the one you're talking about)
display hw, there's no choice, the compositor needs to know about the
constraints to make clever decisions (that's what we do on Android). But
then the appeal of a common interface is understandable.

(An answer that doesn't actually say anything interesting, oh well),

-- 
Damien

* Re: How to design a DRM KMS driver exposing 2D compositing?
  2014-08-11 12:06 ` Daniel Vetter
  2014-08-11 12:47   ` Pekka Paalanen
@ 2014-08-11 13:32   ` Rob Clark
  2014-08-11 15:24     ` Daniel Vetter
  2014-08-12  7:20     ` Pekka Paalanen
  2014-08-11 17:16   ` Eric Anholt
  2 siblings, 2 replies; 23+ messages in thread
From: Rob Clark @ 2014-08-11 13:32 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: dri-devel

On Mon, Aug 11, 2014 at 8:06 AM, Daniel Vetter <daniel@ffwll.ch> wrote:
> On Mon, Aug 11, 2014 at 01:38:55PM +0300, Pekka Paalanen wrote:
>> Hi,
>>
>> there is some hardware that can do 2D compositing with an arbitrary
>> number of planes. I'm not sure what the absolute maximum number of
>> planes is, but for the discussion, let's say it is 100.
>>
>> There are many complicated, dynamic constraints on how many, what size,
>> etc. planes can be used at once. A driver would be able to check those
>> before kicking the 2D compositing engine.
>>
>> The 2D compositing engine in the best case (only few planes used) is
>> able to composite on the fly in scanout, just like the usual overlay
>> hardware blocks in CRTCs. When the composition complexity goes up, the
>> driver can fall back to compositing into a buffer rather than on the
>> fly in scanout. This fallback needs to be completely transparent to the
>> user space, implying only additional latency if anything.
>>
>> These 2D compositing features should be exposed to user space through a
>> standard kernel ABI, hopefully an existing ABI in the very near future
>> like the KMS atomic.
>
> I presume we're talking about the video core from raspi? Or at least
> something similar?
>
>> Assuming the DRM universal planes and atomic mode setting / page flip
>> infrastructure is in place, could the 2D compositing capabilities be
>> exposed through universal planes? We can assume that plane properties
>> are enough to describe all the compositing parameters.
>>
>> Atomic updates are needed so that the complicated constraints can be
>> checked, and user space can try to reduce the composition complexity if
>> the kernel driver sees that it won't work.
>>
>> Would it be feasible to generate a hundred identical non-primary planes
>> to be exposed to user space via DRM?
>>
>> If that could be done, the kernel driver could just use the existing
>> kernel/user ABIs without having to invent something new, and programs
>> like a Wayland compositor would not need to be coded specifically for
>> this hardware.
>>
>> What problems do you see with this plan?
>> Are any of those problems unfixable or simply prohibitive?
>>
>> I have some concerns, which I am not sure will actually be a problem:
>> - Does allocating 100 planes eat too much kernel memory?
>>   I mean just the bookkeeping, properties, etc.
>> - Would such an amount of planes make some in-kernel algorithms slow
>>   (particularly in DRM common code)?
>> - Considering how user space discovers all DRM resources, would this
>>   make a compositor "slow" to start?
>
> I don't see any problem with that. We have a few plane-loops, but iirc
> those can be easily fixed to use indices and similar stuff. The atomic
> ioctl itself should scale nicely.
>
>> I suppose whether these turn out to be prohibitive or not, one just has
>> to implement it and see. It should be usable on a slowish CPU with
>> unimpressive amounts of RAM, because that is where a separate 2D
>> compositing engine gives the most kick.
>>
>> FWIW, dynamically created/destroyed planes would probably not be the
>> answer. The kernel driver cannot decide before-hand how many planes it
>> can expose. How many planes can be used depends completely on how user
>> space decides to use them. Therefore I believe it should expose the
>> maximum number always, whether there is any real use case that could
>> actually get them all running or not.
>
> Yeah dynamic planes doesn't sound like a nice solution, least because
> you'll get to audit piles of code. Currently really only framebuffers (and
> to some extent connectors) can come and go freely in kms-land.
>
>> What if I cannot even pick a maximum number of planes, but wanted to
>> (as the hardware allows) let the 2D compositing scale up basically
>> unlimited while becoming just slower and slower?
>>
>> I think at that point one would be looking at a rendering API really,
>> rather than a KMS API, so it's probably out of scope. Where is the line
>> between KMS 2D compositing with planes vs. 2D composite rendering?
>
> I think kms should still be real-time compositing - if you have to
> internally render to a buffer and then scan that one out due to lack of
> memory bandwidth or so that very much sounds like a rendering api. Ofc
> stuff like writeback buffers blur that a bit. But hw writeback is still
> real-time.

not really sure how much of this is exposed to the cpu side, vs hidden
on coproc..

but I tend to think it would be nice for compositors (userspace) to
know explicitly what is going on..  ie. if some layers are blended via
intermediate buffer, couldn't that intermediate buffer be potentially
re-used on next frame if not damaged?
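
To make the re-use idea concrete, here is a rough sketch of the bookkeeping it would need. Everything in it is an assumption for illustration: `struct sketch_layer`, `struct sketch_cache` and `sketch_prepare_frame()` are hypothetical names, and a real implementation would also have to invalidate the buffer on geometry or z-order changes, which this ignores.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-layer state. */
struct sketch_layer {
    bool damaged; /* content changed since the previous frame */
};

/* Hypothetical state of the intermediate buffer. */
struct sketch_cache {
    bool valid;                /* buffer holds a blended result */
    size_t n_layers;           /* number of layers blended into it */
    unsigned int recomposites; /* how many times blending was redone */
};

/* Returns true if the intermediate buffer from the previous frame can be
 * re-used as is; otherwise records that a new blend was needed. */
static bool sketch_prepare_frame(struct sketch_cache *cache,
                                 const struct sketch_layer *layers, size_t n)
{
    bool reuse = cache->valid && cache->n_layers == n;
    size_t i;

    for (i = 0; reuse && i < n; i++) {
        if (layers[i].damaged)
            reuse = false;
    }

    if (!reuse) {
        /* The actual blend into the intermediate buffer would go here. */
        cache->valid = true;
        cache->n_layers = n;
        cache->recomposites++;
    }
    return reuse;
}
```

The point of the sketch is only that this decision needs damage information, which the compositor already has and the kernel would otherwise have to be told.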


>> Should I really be designing a driver-specific compositing API instead,
>> similar to what the Mesa OpenGL implementations use? Then have user
>> space maybe use the user space driver part via OpenWFC perhaps?
>> And when I mention OpenWFC, you probably notice, that I am not aware of
>> any standard user space API I could be implementing here. ;-)
>
> Personally I'd expose a bunch of planes with kms (enough so that you can
> reap the usual benefits planes bring wrt video-playback and stuff like
> that). So perhaps something in line with what current hw does in hw and
> then double it a bit or twice - 16 planes or so. Your driver would reject
> any requests that need intermediate buffers to store render results. I.e.
> everything that can't be scanned out directly in real-time at about 60fps.
> The fun with kms planes is also that right now we have 0 standards for
> z-ordering and blending. So would need to define that first.
>
> Then expose everything else with a separate api. I guess you'll just end
> up with per-compositor userspace drivers due to the lack of a widespread
> 2d api. OpenVG is kinda dead, and cairo might not fit.

I kind of suspect someone should really just design weston2d, an api
more explicitly for compositing.. model after OpenWFC if that fits
nicely.  Or not if it doesn't.  Or just use the existing weston
front-end/back-end split..

I expect other wayland compositors would want more or less the same
thing as weston (barring pre-existing layer-cake mess..  cough, cough,
cogl/clutter/gnome-shell..)

We could even make a gallium statetracker implementation of weston2d
to get some usage on desktop..

BR,
-R

> -Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> +41 (0) 79 365 57 48 - http://blog.ffwll.ch

* Re: How to design a DRM KMS driver exposing 2D compositing?
  2014-08-11 13:14     ` Damien Lespiau
@ 2014-08-11 13:44       ` Pekka Paalanen
  0 siblings, 0 replies; 23+ messages in thread
From: Pekka Paalanen @ 2014-08-11 13:44 UTC (permalink / raw)
  To: Damien Lespiau; +Cc: dri-devel

On Mon, 11 Aug 2014 14:14:56 +0100
Damien Lespiau <damien.lespiau@intel.com> wrote:

> On Mon, Aug 11, 2014 at 03:07:33PM +0300, Pekka Paalanen wrote:
> > > > there is some hardware that can do 2D compositing with an arbitrary
> > > > number of planes. I'm not sure what the absolute maximum number of
> > > > planes is, but for the discussion, let's say it is 100.
> > > > 
> > > > There are many complicated, dynamic constraints on how many, what size,
> > > > etc. planes can be used at once. A driver would be able to check those
> > > > before kicking the 2D compositing engine.
> > > > 
> > > > The 2D compositing engine in the best case (only few planes used) is
> > > > able to composite on the fly in scanout, just like the usual overlay
> > > > hardware blocks in CRTCs. When the composition complexity goes up, the
> > > > driver can fall back to compositing into a buffer rather than on the
> > > > fly in scanout. This fallback needs to be completely transparent to the
> > > > user space, implying only additional latency if anything.
> > > 
> > > This looks like a fallback that would use GL to compose the intermediate
> > > buffer. Any reason why that fallback can't be kicked from userspace?
> > 
> > It is not GL, and GL might not be available or desirable. It is still
> > the same 2D compositing engine in hardware, but now running with
> > an off-screen target buffer, because it can no longer keep up with the
> > continuous pixel rate that the direct scanout would need.
> 
> I didn't mean this was GL, but just making the parallel, ie. we wouldn't
> put a GL fallback into the kernel.
> 
> > If we were to use the 2D compositing engine from user space, we would
> > be on the road to OpenWFC. IOW, there is no standard API for the
> > user space to use yet, as far as I'm aware. ;-)
> > 
> > I'm just trying to avoid having to design a kernel driver ABI for a
> > user space driver, then design/implement some standard user space
> > API on top, and then go fix all compositors to actually use it instead
> > of / with KMS.
> 
> It's no easy trade-off. For instance, if the compositor doesn't know
> about some of the hw constraints you are talking about, it may ask the
> kernel for a configuration that suddenly will only allow 20 fps updates
> (because of the bw limitation you're mentioning). And the compositor
> just wouldn't know.

Sure, but it would still be much better than the actual fallback in the
compositor in user space, if we cannot drive the 2D engine from user
space.

KMS works the same way already: if you have GL rendering that just
runs for too long, your final pageflip using it will implicitly get
delayed that much. Does it not?

> I can only speak for the hw I know, if you want to squeeze everything
> you can from that simple (compared to the one you're talking about)
> display hw, there's no choice, the compositor needs to know about the
> constraints to make clever decisions (that's what we do on Android). But
> then the appeal of a common interface is understandable.
> 
> (An answer that doesn't actually say anything interesting, oh well),

Yeah... so it comes down to deciding at what point will the kernel
driver say "this won't fly, do something else". And danvet has a pretty
solid answer to that, I think.
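
To illustrate how a compositor might react when the driver says "this
won't fly", here is a minimal sketch of a try-and-reduce loop, written in
Python for brevity. Nothing in it is a real KMS or libdrm call:
`driver_accepts` is a hypothetical stand-in for a test-only atomic check,
and the limit of three planes is invented for the example.

```python
from typing import Callable, List, Tuple

def plan_frame(views: List[str],
               driver_accepts: Callable[[List[str]], bool]
               ) -> Tuple[List[str], List[str]]:
    """Split views (bottom to top) into (composited_in_userspace, on_planes).

    Start by asking for every view on its own plane. Each time the
    (hypothetical) driver check rejects the request, move one more view,
    lowest first, into the buffer the compositor renders itself.
    """
    for n_fallback in range(len(views) + 1):
        fallback = views[:n_fallback]
        on_planes = views[n_fallback:]
        # The fallback buffer occupies one plane when it is non-empty.
        request = (["<composited>"] if fallback else []) + on_planes
        if driver_accepts(request):
            return fallback, on_planes
    # Even a single composited buffer was rejected.
    raise RuntimeError("driver rejected every configuration")


# Invented stand-in for the kernel-side check: at most 3 planes fit.
def at_most_three(request: List[str]) -> bool:
    return len(request) <= 3

fallback, planes = plan_frame(["bg", "win1", "win2", "video", "cursor"],
                              at_most_three)
print(fallback, planes)
```

The lowest views are merged first so that the composited buffer stays at
the bottom of the stack and the z-order of the remaining planes is
unchanged.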


Thanks,
pq

* Re: How to design a DRM KMS driver exposing 2D compositing?
  2014-08-11 10:38 How to design a DRM KMS driver exposing 2D compositing? Pekka Paalanen
  2014-08-11 10:57 ` Damien Lespiau
  2014-08-11 12:06 ` Daniel Vetter
@ 2014-08-11 14:37 ` Matt Roper
  2014-08-12  8:42   ` Pekka Paalanen
  2 siblings, 1 reply; 23+ messages in thread
From: Matt Roper @ 2014-08-11 14:37 UTC (permalink / raw)
  To: Pekka Paalanen; +Cc: dri-devel

On Mon, Aug 11, 2014 at 01:38:55PM +0300, Pekka Paalanen wrote:
> Hi,
> 
> there is some hardware that can do 2D compositing with an arbitrary
> number of planes. I'm not sure what the absolute maximum number of
> planes is, but for the discussion, let's say it is 100.
> 
> There are many complicated, dynamic constraints on how many, what size,
> etc. planes can be used at once. A driver would be able to check those
> before kicking the 2D compositing engine.
> 
> The 2D compositing engine in the best case (only few planes used) is
> able to composite on the fly in scanout, just like the usual overlay
> hardware blocks in CRTCs. When the composition complexity goes up, the
> driver can fall back to compositing into a buffer rather than on the
> fly in scanout. This fallback needs to be completely transparent to the
> user space, implying only additional latency if anything.

Is your requirement that this needs to be transparent to all userspace
or just transparent to your display server (e.g., Weston)?  I'm
wondering whether it might be easier to write a libdrm interposer that
intercepts any libdrm calls dealing with planes and exposes a bunch of
additional "virtual" planes to the display server when queried.  When
you submit an atomic ioctl, your interposer will figure out the best
strategy to make that happen given the real hardware available on your
system and will try to blend some of your excess buffers via whatever
userspace APIs are available (Cairo, GLES, OpenVG, etc.).  This would
keep kernel complexity down and allow easier debugging and tuning.
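
As a rough sketch of the decision such an interposer would have to make,
the Python snippet below picks which virtual planes to pre-blend when
there are more of them than real hardware planes. It is only an
illustration under stated assumptions: the function name, the use of
pixel area as a cost proxy, and the example sizes are all hypothetical,
and the actual blending (Cairo, GLES, ...) is not shown.

```python
from typing import List, Optional, Tuple

def pick_run_to_flatten(areas: List[int], real_planes: int
                        ) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the z-contiguous run to pre-blend, or None.

    areas[i] is the pixel area of virtual plane i, listed bottom to top.
    Flattening a run of k planes into one buffer frees k - 1 real planes.
    Only contiguous runs are considered so the z-order stays intact.
    """
    if real_planes < 1:
        raise ValueError("need at least one real plane")
    n = len(areas)
    if n <= real_planes:
        return None  # everything fits, nothing to blend
    k = n - real_planes + 1  # shortest run that makes the rest fit
    # Slide a window of length k and keep the one with the least area,
    # as a rough proxy for the cheapest blend.
    best_start = min(range(n - k + 1), key=lambda s: sum(areas[s:s + k]))
    return best_start, best_start + k

areas = [2_073_600, 480_000, 120_000, 90_000, 4_096]  # hypothetical sizes
print(pick_run_to_flatten(areas, real_planes=3))
```

Restricting the choice to a contiguous run is what keeps the result
visually identical: blending non-adjacent planes into one buffer would
reorder them relative to whatever sits in between.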


Matt

> These 2D compositing features should be exposed to user space through a
> standard kernel ABI, hopefully an existing ABI in the very near future
> like the KMS atomic.
> 
> Assuming the DRM universal planes and atomic mode setting / page flip
> infrastructure is in place, could the 2D compositing capabilities be
> exposed through universal planes? We can assume that plane properties
> are enough to describe all the compositing parameters.
> 
> Atomic updates are needed so that the complicated constraints can be
> checked, and user space can try to reduce the composition complexity if
> the kernel driver sees that it won't work.
> 
> Would it be feasible to generate a hundred identical non-primary planes
> to be exposed to user space via DRM?
> 
> If that could be done, the kernel driver could just use the existing
> kernel/user ABIs without having to invent something new, and programs
> like a Wayland compositor would not need to be coded specifically for
> this hardware.
> 
> What problems do you see with this plan?
> Are any of those problems unfixable or simply prohibitive?
> 
> I have some concerns, which I am not sure will actually be a problem:
> - Does allocating 100 planes eat too much kernel memory?
>   I mean just the bookkeeping, properties, etc.
> - Would such an amount of planes make some in-kernel algorithms slow
>   (particularly in DRM common code)?
> - Considering how user space discovers all DRM resources, would this
>   make a compositor "slow" to start?
> 
> I suppose whether these turn out to be prohibitive or not, one just has
> to implement it and see. It should be usable on a slowish CPU with
> unimpressive amounts of RAM, because that is where a separate 2D
> compositing engine gives the most kick.
> 
> FWIW, dynamically created/destroyed planes would probably not be the
> answer. The kernel driver cannot decide before-hand how many planes it
> can expose. How many planes can be used depends completely on how user
> space decides to use them. Therefore I believe it should expose the
> maximum number always, whether there is any real use case that could
> actually get them all running or not.
> 
> What if I cannot even pick a maximum number of planes, but wanted to
> (as the hardware allows) let the 2D compositing scale up basically
> unlimited while becoming just slower and slower?
> 
> I think at that point one would be looking at a rendering API really,
> rather than a KMS API, so it's probably out of scope. Where is the line
> between KMS 2D compositing with planes vs. 2D composite rendering?
> 
> Should I really be designing a driver-specific compositing API instead,
> similar to what the Mesa OpenGL implementations use? Then have user
> space maybe use the user space driver part via OpenWFC perhaps?
> And when I mention OpenWFC, you probably notice, that I am not aware of
> any standard user space API I could be implementing here. ;-)
> 
> 
> Thanks,
> pq

-- 
Matt Roper
Graphics Software Engineer
IoTG Platform Enabling & Development
Intel Corporation
(916) 356-2795

* Re: How to design a DRM KMS driver exposing 2D compositing?
  2014-08-11 13:32   ` Rob Clark
@ 2014-08-11 15:24     ` Daniel Vetter
  2014-08-12  7:20     ` Pekka Paalanen
  1 sibling, 0 replies; 23+ messages in thread
From: Daniel Vetter @ 2014-08-11 15:24 UTC (permalink / raw)
  To: Rob Clark; +Cc: dri-devel

On Mon, Aug 11, 2014 at 09:32:32AM -0400, Rob Clark wrote:
> On Mon, Aug 11, 2014 at 8:06 AM, Daniel Vetter <daniel@ffwll.ch> wrote:
> > Personally I'd expose a bunch of planes with kms (enough so that you can
> > reap the usual benefits planes bring wrt video-playback and stuff like
> > that). So perhaps something in line with what current hw does in hw and
> > then double it a bit or twice - 16 planes or so. Your driver would reject
> > any requests that need intermediate buffers to store render results. I.e.
> > everything that can't be scanned out directly in real-time at about 60fps.
> > The fun with kms planes is also that right now we have 0 standards for
> > z-ordering and blending. So would need to define that first.
> >
> > Then expose everything else with a separate api. I guess you'll just end
> > up with per-compositor userspace drivers due to the lack of a widespread
> > 2d api. OpenVG is kinda dead, and cairo might not fit.
> 
> I kind of suspect someone should really just design weston2d, an api
> more explicitly for compositing.. model after OpenWFC if that fits
> nicely.  Or not if it doesn't.  Or just use the existing weston
> front-end/back-end split..
> 
> I expect other wayland compositors would want more or less the same
> thing as weston (barring pre-existing layer-cake mess..  cough, cough,
> cogl/clutter/gnome-shell..)
> 
> We could even make a gallium statetracker implementation of weston2d
> to get some usage on desktop..

There's vega already in mesa .... It just looks terribly unused.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

* Re: How to design a DRM KMS driver exposing 2D compositing?
  2014-08-11 12:47   ` Pekka Paalanen
@ 2014-08-11 15:35     ` Daniel Vetter
  2014-08-11 16:09       ` Ville Syrjälä
  2014-08-12  7:10       ` Pekka Paalanen
  0 siblings, 2 replies; 23+ messages in thread
From: Daniel Vetter @ 2014-08-11 15:35 UTC (permalink / raw)
  To: Pekka Paalanen; +Cc: dri-devel

On Mon, Aug 11, 2014 at 03:47:22PM +0300, Pekka Paalanen wrote:
> > > What if I cannot even pick a maximum number of planes, but wanted to
> > > (as the hardware allows) let the 2D compositing scale up basically
> > > unlimited while becoming just slower and slower?
> > > 
> > > I think at that point one would be looking at a rendering API really,
> > > rather than a KMS API, so it's probably out of scope. Where is the line
> > > between KMS 2D compositing with planes vs. 2D composite rendering?
> > 
> > I think kms should still be real-time compositing - if you have to
> > internally render to a buffer and then scan that one out due to lack of
> > memory bandwidth or so that very much sounds like a rendering api. Ofc
> > > stuff like writeback buffers blur that a bit. But hw writeback is still
> > real-time.
> 
> Agreed, that's a good and clear definition, even if it might make my
> life harder.
> 
> I'm still not completely sure, that using an intermediate buffer means
> sacrificing real-time (i.e. being able to hit the next vblank the user
> space is aiming for) performance, maybe the 2D engine output rate
> fluctuates so that the scanout block would have problems but a buffer
> can still be completed in time. Anyway, details.
> 
> Would using an intermediate buffer be ok if we can still maintain
> real-time? That is, say, if a compositor kicks the atomic update e.g.
> 7 ms before vblank, we would still hit it even with the intermediate
> buffer? If that is actually possible, I don't know yet.

I guess you could hide this in the kernel if you want. After all the
entire point of kms is to shovel the memory management into the kernel
driver's responsibility. But I agree with Rob that if there are
intermediate buffers, it would be fairly neat to let userspace know about
them.

So I don't think the intermediate buffer thing would be a no-go for kms,
but I suspect that will only happen when the videocore can't hit the next
frame reliably. And that kind of stutter is imo not good for a kms driver.
I guess you could forgo vblank timestamp support and just go with
super-variable scanout times, but I guess that will make the video
playback people unhappy - they already bitch about the sub 1% inaccuracy
we have in our hdmi clocks.

> > > Should I really be designing a driver-specific compositing API instead,
> > > similar to what the Mesa OpenGL implementations use? Then have user
> > > space maybe use the user space driver part via OpenWFC perhaps?
> > > And when I mention OpenWFC, you probably notice, that I am not aware of
> > > any standard user space API I could be implementing here. ;-)
> > 
> > Personally I'd expose a bunch of planes with kms (enough so that you can
> > reap the usual benefits planes bring wrt video-playback and stuff like
> > that). So perhaps something in line with what current hw does in hw and
> > then double it a bit or twice - 16 planes or so. Your driver would reject
> > any requests that need intermediate buffers to store render results. I.e.
> > everything that can't be scanned out directly in real-time at about 60fps.
> > The fun with kms planes is also that right now we have 0 standards for
> > z-ordering and blending. So would need to define that first.
> 
> I do not yet know where that real-time limit is, but I'm guessing it
> could be pretty low. If it is, we might start hitting software
> compositing (like Pixman) very often, which is too slow to be usable.

Well for other drivers/stacks we'd fall back to GL compositing. pixman
would obviously be terrible. Curious question: Can you provoke the
hw/firmware to render into arbitrary buffers or does it only work together
with real display outputs?

So I guess the real question is: What kind of interface does videocore
provide? Note that kms framebuffers are super-flexible and you're free to
add your own ioctl for special framebuffers which are rendered live by the
vc. So that might be a possible way to expose this if you can't tell the
vc which buffers to render into explicitly.

> Defining z-order and blending sounds like peanuts compared to below.
> 
> > Then expose everything else with a separate api. I guess you'll just end
> > up with per-compositor userspace drivers due to the lack of a widespread
> > 2d api. OpenVG is kinda dead, and cairo might not fit.
> 
> Yeah, that is kind of the worst case, which also seems unavoidable.

Yeah, there's no universal 2d accel standard at all. Which sucks for hw
that can't do full gl.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

* Re: How to design a DRM KMS driver exposing 2D compositing?
  2014-08-11 15:35     ` Daniel Vetter
@ 2014-08-11 16:09       ` Ville Syrjälä
  2014-08-11 17:21         ` Daniel Vetter
  2014-08-12  7:10       ` Pekka Paalanen
  1 sibling, 1 reply; 23+ messages in thread
From: Ville Syrjälä @ 2014-08-11 16:09 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: dri-devel

On Mon, Aug 11, 2014 at 05:35:31PM +0200, Daniel Vetter wrote:
> On Mon, Aug 11, 2014 at 03:47:22PM +0300, Pekka Paalanen wrote:
> > > > What if I cannot even pick a maximum number of planes, but wanted to
> > > > (as the hardware allows) let the 2D compositing scale up basically
> > > > unlimited while becoming just slower and slower?
> > > > 
> > > > I think at that point one would be looking at a rendering API really,
> > > > rather than a KMS API, so it's probably out of scope. Where is the line
> > > > between KMS 2D compositing with planes vs. 2D composite rendering?
> > > 
> > > I think kms should still be real-time compositing - if you have to
> > > internally render to a buffer and then scan that one out due to lack of
> > > memory bandwidth or so that very much sounds like a rendering api. Ofc
> > > > stuff like writeback buffers blur that a bit. But hw writeback is still
> > > real-time.
> > 
> > Agreed, that's a good and clear definition, even if it might make my
> > life harder.
> > 
> > I'm still not completely sure, that using an intermediate buffer means
> > sacrificing real-time (i.e. being able to hit the next vblank the user
> > space is aiming for) performance, maybe the 2D engine output rate
> > fluctuates so that the scanout block would have problems but a buffer
> > can still be completed in time. Anyway, details.
> > 
> > Would using an intermediate buffer be ok if we can still maintain
> > real-time? That is, say, if a compositor kicks the atomic update e.g.
> > 7 ms before vblank, we would still hit it even with the intermediate
> > buffer? If that is actually possible, I don't know yet.
> 
> I guess you could hide this in the kernel if you want. After all the
> entire point of kms is to shovel the memory management into the kernel
> driver's responsibility. But I agree with Rob that if there are
> intermediate buffers, it would be fairly neat to let userspace know about
> them.
> 
> So I don't think the intermediate buffer thing would be a no-go for kms,
> but I suspect that will only happen when the videocore can't hit the next
> frame reliably. And that kind of stutter is imo not good for a kms driver.
> I guess you could forgo vblank timestamp support and just go with
> super-variable scanout times, but I guess that will make the video
> playback people unhappy - they already bitch about the sub 1% inaccuracy
> we have in our hdmi clocks.
> 
> > > > Should I really be designing a driver-specific compositing API instead,
> > > > similar to what the Mesa OpenGL implementations use? Then have user
> > > > space maybe use the user space driver part via OpenWFC perhaps?
> > > > And when I mention OpenWFC, you probably notice, that I am not aware of
> > > > any standard user space API I could be implementing here. ;-)
> > > 
> > > Personally I'd expose a bunch of planes with kms (enough so that you can
> > > reap the usual benefits planes bring wrt video-playback and stuff like
> > > that). So perhaps something in line with what current hw does in hw and
> > > then double it a bit or twice - 16 planes or so. Your driver would reject
> > > any requests that need intermediate buffers to store render results. I.e.
> > > everything that can't be scanned out directly in real-time at about 60fps.
> > > The fun with kms planes is also that right now we have 0 standards for
> > > z-ordering and blending. So would need to define that first.
> > 
> > I do not yet know where that real-time limit is, but I'm guessing it
> > could be pretty low. If it is, we might start hitting software
> > compositing (like Pixman) very often, which is too slow to be usable.
> 
> Well for other drivers/stacks we'd fall back to GL compositing. pixman
> would obviously be terrible. Curious question: Can you provoke the
> hw/firmware to render into arbitrary buffers or does it only work together
> with real display outputs?
> 
> So I guess the real question is: What kind of interface does videocore
> provide? Note that kms framebuffers are super-flexible and you're free to
> add your own ioctl for special framebuffers which are rendered live by the
> vc. So that might be a possible way to expose this if you can't tell the
> vc which buffers to render into explicitly.

We should maybe think about exposing this display engine writeback
stuff in some decent way. Maybe a property on the crtc (or plane when
doing per-plane writeback) where you attach a target framebuffer for
the write. And some virtual connectors/encoders to satisfy the kms API
requirements.

With DSI command mode I suppose it would be possible to even mix display
and writeback uses of the same hardware pipeline so that the writeback
doesn't disturb the display. But I'm not sure there would be any nice way
to expose that in kms. Maybe just expose two crtcs, one for writeback
and one for display and multiplex in the driver.

-- 
Ville Syrjälä
Intel OTC

* Re: How to design a DRM KMS driver exposing 2D compositing?
  2014-08-11 12:06 ` Daniel Vetter
  2014-08-11 12:47   ` Pekka Paalanen
  2014-08-11 13:32   ` Rob Clark
@ 2014-08-11 17:16   ` Eric Anholt
  2014-08-11 17:27     ` Daniel Vetter
  2 siblings, 1 reply; 23+ messages in thread
From: Eric Anholt @ 2014-08-11 17:16 UTC (permalink / raw)
  To: Daniel Vetter, Pekka Paalanen; +Cc: dri-devel


Daniel Vetter <daniel@ffwll.ch> writes:

> On Mon, Aug 11, 2014 at 01:38:55PM +0300, Pekka Paalanen wrote:
>> Hi,
>> 
>> there is some hardware that can do 2D compositing with an arbitrary
>> number of planes. I'm not sure what the absolute maximum number of
>> planes is, but for the discussion, let's say it is 100.
>> 
>> There are many complicated, dynamic constraints on how many, what size,
>> etc. planes can be used at once. A driver would be able to check those
>> before kicking the 2D compositing engine.
>> 
>> The 2D compositing engine in the best case (only few planes used) is
>> able to composite on the fly in scanout, just like the usual overlay
>> hardware blocks in CRTCs. When the composition complexity goes up, the
>> driver can fall back to compositing into a buffer rather than on the
>> fly in scanout. This fallback needs to be completely transparent to the
>> user space, implying only additional latency if anything.
>> 
>> These 2D compositing features should be exposed to user space through a
>> standard kernel ABI, hopefully an existing ABI in the very near future
>> like the KMS atomic.
>
> I presume we're talking about the video core from raspi? Or at least
> something similar?

Pekka wasn't sure if things were confidential here, but I can say it:
Yeah, it's the RPi.

While I haven't written code using the compositor interface (I just did
enough to shim in a single plane for bringup, and I'm hoping Pekka and
company can handle the rest for me :) ), my understanding is that the
way you make use of it is that you've got your previous frame loaded up
in the HVS (the plane compositor hardware), then when you're asked to
put up a new frame that's going to be too hard, you take some
complicated chunk of your scene and ask the HVS to use any spare
bandwidth it has while it's still scanning out the previous frame in
order to composite that piece of new scene into memory.  Then, when it's
done with the offline composite, you ask the HVS to do the next scanout
frame using the original scene with the pre-composited temporary buffer.
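
A back-of-the-envelope sketch of what "using any spare bandwidth" could
mean for latency, in Python. Every number below is invented for
illustration and is not a measured HVS figure; the model simply assumes
the engine can move a fixed number of pixels per frame period and that
the ongoing scanout takes its share first.

```python
import math

def frames_to_finish(offline_pixels: int,
                     capacity_per_frame: int,
                     scanout_per_frame: int) -> int:
    """Frames of spare capacity needed to finish an offline composite.

    All three quantities are in pixels moved per frame period. The spare
    capacity is whatever the scanout of the previous frame leaves unused.
    """
    spare = capacity_per_frame - scanout_per_frame
    if spare <= 0:
        raise ValueError("no spare bandwidth; the composite never finishes")
    return math.ceil(offline_pixels / spare)

# Invented numbers: a 1920x1080 scanout costing 1.5 reads per pixel on
# average, an engine that can move 4M pixels per frame, and an offline
# job that reads two 1280x720 layers and writes one.
scanout = int(1920 * 1080 * 1.5)
offline = 3 * 1280 * 720
print(frames_to_finish(offline, 4_000_000, scanout))
```

With these made-up figures the offline composite needs several frame
periods, which is the "additional latency" the fallback would show to
user space.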

I'm pretty comfortable with the idea of having some large number of
planes preallocated, and deciding that "nobody could possibly need more
than 16" (or whatever).

My initial reaction to "we should just punt when we run out of bandwidth
and have a special driver interface for offline composite" was "that's
awful, when the kernel could just get the job done immediately, and
easily, and it would know exactly what it needed to composite to get
things to fit (unlike userspace)".  I'm trying to come up with what
benefit there would be to having a separate interface for offline
composite.  I've got 3 things:

- Avoids having a potentially long, interruptible wait in the modeset
  path while the offline composite happens.  But I think we have other
  interruptible waits in that path already.

- Userspace could potentially do something else besides use the HVS to
  get the fallback done.  Video would have to use the HVS, to get the
  same scaling filters applied as the previous frame where things *did*
  fit, but I guess you could composite some 1:1 RGBA overlays in GL,
  which would have more BW available to it than what you're borrowing
  from the previous frame's HVS capacity.

- Userspace could potentially use the offline composite interface for
  things besides just the running-out-of-bandwidth case.  Like, it was
  doing a nicely-filtered downscale of an overlaid video, then the user
  hit pause and walked away: you could have a timeout that noticed that
  the complicated scene hadn't changed in a while, and you'd drop from
  overlays to a HVS-composited single plane to reduce power.

The third one is the one I've actually found kind of compelling, and
might be switching me from wanting no userspace visibility into the
fallback.  But I don't have a good feel for how much complexity there is
to our descriptions of planes, and how much poorly-tested interface we'd
be adding to support this usecase.

(Because, honestly, I don't expect the fallbacks to be hit much -- my
understanding of the bandwidth equation is that you're mostly counting
the number of pixels that have to be read, and clipped-out pixels
because somebody's overlaid on top of you don't count unless they're in
the same burst read.  So unless people are going nuts with blending in
overlays, or downscaled video, it's probably not a problem, and
something that gets your pixels on the screen at all is sufficient)
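
The "count the pixels that have to be read" idea can be sketched roughly
as follows, in Python. This is a simplified model under loud
assumptions: the `Plane` type and its `opaque` flag are hypothetical,
planes are axis-aligned and unscaled, and the burst-read granularity
mentioned above is ignored entirely.

```python
from typing import List, NamedTuple

class Plane(NamedTuple):
    x: int
    y: int
    w: int
    h: int
    opaque: bool  # an opaque plane hides whatever lies beneath it

def pixels_read(planes: List[Plane]) -> int:
    """Count pixels fetched for one frame; planes are listed bottom to top.

    A pixel of a plane counts unless some opaque plane above covers it.
    """
    xs = sorted({p.x for p in planes} | {p.x + p.w for p in planes})
    ys = sorted({p.y for p in planes} | {p.y + p.h for p in planes})
    total = 0
    # Walk the grid cells formed by all plane edges; within one cell the
    # set of covering planes is constant.
    for x0, x1 in zip(xs, xs[1:]):
        for y0, y1 in zip(ys, ys[1:]):
            cell = (x1 - x0) * (y1 - y0)
            # Scan from the top plane downwards and stop at the first
            # opaque one, since nothing below it is fetched.
            for p in reversed(planes):
                inside_x = p.x <= x0 and x1 <= p.x + p.w
                inside_y = p.y <= y0 and y1 <= p.y + p.h
                if inside_x and inside_y:
                    total += cell
                    if p.opaque:
                        break
    return total

scene = [
    Plane(0, 0, 100, 100, opaque=True),    # background
    Plane(10, 10, 50, 50, opaque=True),    # opaque window on top
    Plane(40, 40, 20, 20, opaque=False),   # blended overlay
]
print(pixels_read(scene))
```

In this toy scene the opaque window removes its footprint from the
background's cost, while the blended overlay adds its own pixels on top
of what lies beneath it, which matches the intuition that blending and
not plane count is what drives the read bandwidth up.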

* Re: How to design a DRM KMS driver exposing 2D compositing?
  2014-08-11 16:09       ` Ville Syrjälä
@ 2014-08-11 17:21         ` Daniel Vetter
  0 siblings, 0 replies; 23+ messages in thread
From: Daniel Vetter @ 2014-08-11 17:21 UTC (permalink / raw)
  To: Ville Syrjälä; +Cc: dri-devel

On Mon, Aug 11, 2014 at 07:09:11PM +0300, Ville Syrjälä wrote:
> On Mon, Aug 11, 2014 at 05:35:31PM +0200, Daniel Vetter wrote:
> > On Mon, Aug 11, 2014 at 03:47:22PM +0300, Pekka Paalanen wrote:
> > > > > What if I cannot even pick a maximum number of planes, but wanted to
> > > > > (as the hardware allows) let the 2D compositing scale up basically
> > > > > unlimited while becoming just slower and slower?
> > > > > 
> > > > > I think at that point one would be looking at a rendering API really,
> > > > > rather than a KMS API, so it's probably out of scope. Where is the line
> > > > > between KMS 2D compositing with planes vs. 2D composite rendering?
> > > > 
> > > > I think kms should still be real-time compositing - if you have to
> > > > internally render to a buffer and then scan that one out due to lack of
> > > > memory bandwidth or so that very much sounds like a rendering api. Ofc
> > > > > stuff like writeback buffers blur that a bit. But hw writeback is still
> > > > real-time.
> > > 
> > > Agreed, that's a good and clear definition, even if it might make my
> > > life harder.
> > > 
> > > I'm still not completely sure, that using an intermediate buffer means
> > > sacrificing real-time (i.e. being able to hit the next vblank the user
> > > space is aiming for) performance, maybe the 2D engine output rate
> > > fluctuates so that the scanout block would have problems but a buffer
> > > can still be completed in time. Anyway, details.
> > > 
> > > Would using an intermediate buffer be ok if we can still maintain
> > > real-time? That is, say, if a compositor kicks the atomic update e.g.
> > > 7 ms before vblank, we would still hit it even with the intermediate
> > > buffer? If that is actually possible, I don't know yet.
> > 
> > I guess you could hide this in the kernel if you want. After all the
> > entire point of kms is to shovel the memory management into the kernel
> > driver's responsibility. But I agree with Rob that if there are
> > intermediate buffers, it would be fairly neat to let userspace know about
> > them.
> > 
> > So I don't think the intermediate buffer thing would be a no-go for kms,
> > but I suspect that will only happen when the videocore can't hit the next
> > frame reliably. And that kind of stutter is imo not good for a kms driver.
> > I guess you could forgo vblank timestamp support and just go with
> > super-variable scanout times, but I guess that will make the video
> > playback people unhappy - they already bitch about the sub 1% inaccuracy
> > we have in our hdmi clocks.
> > 
> > > > > Should I really be designing a driver-specific compositing API instead,
> > > > > similar to what the Mesa OpenGL implementations use? Then have user
> > > > > space maybe use the user space driver part via OpenWFC perhaps?
> > > > > And when I mention OpenWFC, you probably notice, that I am not aware of
> > > > > any standard user space API I could be implementing here. ;-)
> > > > 
> > > > Personally I'd expose a bunch of planes with kms (enough so that you can
> > > > reap the usual benefits planes bring wrt video-playback and stuff like
> > > > that). So perhaps something in line with what current hw does in hw and
> > > > then double it a bit or twice - 16 planes or so. Your driver would reject
> > > > any requests that need intermediate buffers to store render results. I.e.
> > > > everything that can't be scanned out directly in real-time at about 60fps.
> > > > The fun with kms planes is also that right now we have 0 standards for
> > > > z-ordering and blending. So would need to define that first.
> > > 
> > > I do not yet know where that real-time limit is, but I'm guessing it
> > > could be pretty low. If it is, we might start hitting software
> > > compositing (like Pixman) very often, which is too slow to be usable.
> > 
> > Well for other drivers/stacks we'd fall back to GL compositing. pixman
> > would obviously be terrible. Curious question: Can you provoke the
> > hw/firmware to render into arbitrary buffers or does it only work together
> > with real display outputs?
> > 
> > So I guess the real question is: What kind of interface does videocore
> > provide? Note that kms framebuffers are super-flexible and you're free to
> > add your own ioctl for special framebuffers which are rendered live by the
> > vc. So that might be a possible way to expose this if you can't tell the
> > vc which buffers to render into explicitly.
> 
> We should maybe think about exposing this display engine writeback
> stuff in some decent way. Maybe a property on the crtc (or plane when
> doing per-plane writeback) where you attach a target framebuffer for
> the write. And some virtual connectors/encoders to satisfy the kms API
> requirements.
> 
> With DSI command mode I suppose it would be possible to even mix display
> and writeback uses of the same hardware pipeline so that the writeback
> doesn't disturb the display. But I'm not sure there would be any nice way
> to expose that in kms. Maybe just expose two crtcs, one for writeback
> and one for display and multiplex in the driver.

Another idea was to punt this to v4l, at least for the fancier hw which
can do a lot of crazy video signal routing ...
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

* Re: How to design a DRM KMS driver exposing 2D compositing?
  2014-08-11 17:16   ` Eric Anholt
@ 2014-08-11 17:27     ` Daniel Vetter
  2014-08-12  8:48       ` Pekka Paalanen
  0 siblings, 1 reply; 23+ messages in thread
From: Daniel Vetter @ 2014-08-11 17:27 UTC (permalink / raw)
  To: Eric Anholt; +Cc: dri-devel

On Mon, Aug 11, 2014 at 10:16:24AM -0700, Eric Anholt wrote:
> Daniel Vetter <daniel@ffwll.ch> writes:
> 
> > On Mon, Aug 11, 2014 at 01:38:55PM +0300, Pekka Paalanen wrote:
> >> Hi,
> >> 
> >> there is some hardware that can do 2D compositing with an arbitrary
> >> number of planes. I'm not sure what the absolute maximum number of
> >> planes is, but for the discussion, let's say it is 100.
> >> 
> >> There are many complicated, dynamic constraints on how many, what size,
> >> etc. planes can be used at once. A driver would be able to check those
> >> before kicking the 2D compositing engine.
> >> 
> >> The 2D compositing engine in the best case (only few planes used) is
> >> able to composite on the fly in scanout, just like the usual overlay
> >> hardware blocks in CRTCs. When the composition complexity goes up, the
> >> driver can fall back to compositing into a buffer rather than on the
> >> fly in scanout. This fallback needs to be completely transparent to the
> >> user space, implying only additional latency if anything.
> >> 
> >> These 2D compositing features should be exposed to user space through a
> >> standard kernel ABI, hopefully an existing ABI in the very near future
> >> like the KMS atomic.
> >
> > I presume we're talking about the video core from raspi? Or at least
> > something similar?
> 
> Pekka wasn't sure if things were confidential here, but I can say it:
> Yeah, it's the RPi.
> 
> While I haven't written code using the compositor interface (I just did
> enough to shim in a single plane for bringup, and I'm hoping Pekka and
> company can handle the rest for me :) ), my understanding is that the
> way you make use of it is that you've got your previous frame loaded up
> in the HVS (the plane compositor hardware), then when you're asked to
> put up a new frame that's going to be too hard, you take some
> complicated chunk of your scene and ask the HVS to use any spare
> bandwidth it has while it's still scanning out the previous frame in
> order to composite that piece of new scene into memory.  Then, when it's
> done with the offline composite, you ask the HVS to do the next scanout
> frame using the original scene with the pre-composited temporary buffer.
> 
> I'm pretty comfortable with the idea of having some large number of
> planes preallocated, and deciding that "nobody could possibly need more
> than 16" (or whatever).
> 
> My initial reaction to "we should just punt when we run out of bandwidth
> and have a special driver interface for offline composite" was "that's
> awful, when the kernel could just get the job done immediately, and
> easily, and it would know exactly what it needed to composite to get
> things to fit (unlike userspace)".  I'm trying to come up with what
> benefit there would be to having a separate interface for offline
> composite.  I've got 3 things:
> 
> - Avoids having a potentially long, interruptible wait in the modeset
>   path while the offline composite happens.  But I think we have other
> >   interruptible waits in that path already.
> 
> - Userspace could potentially do something else besides use the HVS to
>   get the fallback done.  Video would have to use the HVS, to get the
>   same scaling filters applied as the previous frame where things *did*
>   fit, but I guess you could composite some 1:1 RGBA overlays in GL,
>   which would have more BW available to it than what you're borrowing
>   from the previous frame's HVS capacity.
> 
> - Userspace could potentially use the offline composite interface for
>   things besides just the running-out-of-bandwidth case.  Like, it was
>   doing a nicely-filtered downscale of an overlaid video, then the user
>   hit pause and walked away: you could have a timeout that noticed that
>   the complicated scene hadn't changed in a while, and you'd drop from
>   overlays to a HVS-composited single plane to reduce power.
> 
> The third one is the one I've actually found kind of compelling, and
> might be switching me from wanting no userspace visibility into the
> fallback.  But I don't have a good feel for how much complexity there is
> to our descriptions of planes, and how much poorly-tested interface we'd
> be adding to support this usecase.

Compositor should already do a rough bw guesstimate and if stuff doesn't
change any more bake the entire scene into a single framebuffer. The exact
same issue happens on more usual hw with video overlays, too.

Ofc if it turns out that scanning out your yuv planes takes less bw, then
the overlay shouldn't be stopped. But imo there's nothing special here for
the rpi.
 
> (Because, honestly, I don't expect the fallbacks to be hit much -- my
> understanding of the bandwidth equation is that you're mostly counting
> the number of pixels that have to be read, and clipped-out pixels
> because somebody's overlaid on top of you don't count unless they're in
> the same burst read.  So unless people are going nuts with blending in
> overlays, or downscaled video, it's probably not a problem, and
> something that gets your pixels on the screen at all is sufficient)

Yeah I guess we need to check reality here. If the "we've run out of bw"
case just never happens then it's pointless to write special code for it.
And we can always add a limit later for the case where GL is usually
better and tell userspace that we can't do this many planes. Exact same
thing with running out of memory bw can happen anywhere else, too.
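
To make the "count the pixels that have to be read" estimate concrete, here
is a minimal sketch in C. It is only an illustration under stated
assumptions: the sketch_plane structure, the function names, and the idea of
a single per-frame byte budget are all hypothetical and are not taken from
any HVS documentation. It counts source bytes fetched per frame, skips
planes that are completely covered by an opaque plane above them, and
pessimistically counts partially clipped planes in full.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical plane description; not a real DRM or HVS structure. */
struct sketch_plane {
	int x, y, w, h;       /* destination rectangle on the CRTC */
	int src_w, src_h;     /* source size fetched from memory */
	int bytes_per_pixel;
	bool opaque;          /* true if the format carries no alpha */
};

/* True if 'top' is opaque and completely covers 'below' on screen. */
bool sketch_fully_hidden(const struct sketch_plane *below,
			 const struct sketch_plane *top)
{
	return top->opaque &&
	       top->x <= below->x && top->y <= below->y &&
	       top->x + top->w >= below->x + below->w &&
	       top->y + top->h >= below->y + below->h;
}

/*
 * Estimate the bytes fetched per frame. planes[] is ordered bottom to top.
 * A downscaled plane costs its full source size, which is why downscaled
 * video is expensive in this model.
 */
uint64_t sketch_frame_bytes(const struct sketch_plane *planes, size_t n)
{
	uint64_t total = 0;

	assert(planes != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		bool hidden = false;

		for (size_t j = i + 1; j < n && !hidden; j++)
			hidden = sketch_fully_hidden(&planes[i], &planes[j]);

		if (!hidden)
			total += (uint64_t)planes[i].src_w * planes[i].src_h *
				 planes[i].bytes_per_pixel;
	}
	return total;
}

/* Accept the configuration only if it fits the (made-up) budget. */
bool sketch_fits_budget(const struct sketch_plane *planes, size_t n,
			uint64_t budget_bytes)
{
	return sketch_frame_bytes(planes, n) <= budget_bytes;
}
```

A driver-side check along these lines would let an atomic test reject a
configuration that is too expensive, so that userspace can retry with fewer
planes. Whether a flat byte budget models the real hardware well enough is
exactly the "check reality" question above.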
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

* Re: How to design a DRM KMS driver exposing 2D compositing?
  2014-08-11 15:35     ` Daniel Vetter
  2014-08-11 16:09       ` Ville Syrjälä
@ 2014-08-12  7:10       ` Pekka Paalanen
  1 sibling, 0 replies; 23+ messages in thread
From: Pekka Paalanen @ 2014-08-12  7:10 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: dri-devel

On Mon, 11 Aug 2014 17:35:31 +0200
Daniel Vetter <daniel@ffwll.ch> wrote:

> Well for other drivers/stacks we'd fall back to GL compositing. pixman
> would obviously be terrible. Curious question: Can you provoke the
> hw/firmware to render into arbitrary buffers or does it only work together
> with real display outputs?

Since we have been talking about on-line (direct to output) and
off-line (buffer target) use of the HVS (2D compositing engine), it
should be able to do both I think.

> So I guess the real question is: What kind of interface does videocore
> provide? Note that kms framebuffers are super-flexible and you're free to
> add your own ioctl for special framebuffers which are rendered live by the
> vc. So that might be a possible way to expose this if you can't tell the
> vc which buffers to render into explicitly.

Right. I don't know the HVS details yet, but I'm hoping we can tell
it to render into a custom buffer, like the 3D core can.

This discussion is very helpful btw, I'm starting to see some possible
plans.


Thanks,
pq

* Re: How to design a DRM KMS driver exposing 2D compositing?
  2014-08-11 13:32   ` Rob Clark
  2014-08-11 15:24     ` Daniel Vetter
@ 2014-08-12  7:20     ` Pekka Paalanen
  2014-08-12  8:03       ` Daniel Vetter
  1 sibling, 1 reply; 23+ messages in thread
From: Pekka Paalanen @ 2014-08-12  7:20 UTC (permalink / raw)
  To: Rob Clark; +Cc: dri-devel

On Mon, 11 Aug 2014 09:32:32 -0400
Rob Clark <robdclark@gmail.com> wrote:

> On Mon, Aug 11, 2014 at 8:06 AM, Daniel Vetter <daniel@ffwll.ch> wrote:
> > On Mon, Aug 11, 2014 at 01:38:55PM +0300, Pekka Paalanen wrote:
> >> What if I cannot even pick a maximum number of planes, but wanted to
> >> (as the hardware allows) let the 2D compositing scale up basically
> >> unlimited while becoming just slower and slower?
> >>
> >> I think at that point one would be looking at a rendering API really,
> >> rather than a KMS API, so it's probably out of scope. Where is the line
> >> between KMS 2D compositing with planes vs. 2D composite rendering?
> >
> > I think kms should still be real-time compositing - if you have to
> > internally render to a buffer and then scan that one out due to lack of
> > memory bandwidth or so that very much sounds like a rendering api. Ofc
> > stuff like writeback buffers blur that a bit. But hw writeback is still
> > real-time.
> 
> not really sure how much of this is exposed to the cpu side, vs hidden
> on coproc..
> 
> but I tend to think it would be nice for compositors (userspace) to
> know explicitly what is going on..  ie. if some layers are blended via
> intermediate buffer, couldn't that intermediate buffer be potentially
> re-used on next frame if not damaged?

Very true, and I think that speaks for exposing the HVS explicitly to
user space to be directly used. That way I believe the user space could
track damage and composite only the minimum, rather than everything
every time which I suppose the KMS API approach would imply.

We don't have dirty regions in KMS API/props, do we? But yeah, that is
starting to feel like a stretch to push through KMS.

> >> Should I really be designing a driver-specific compositing API instead,
> >> similar to what the Mesa OpenGL implementations use? Then have user
> >> space maybe use the user space driver part via OpenWFC perhaps?
> >> And when I mention OpenWFC, you probably notice, that I am not aware of
> >> any standard user space API I could be implementing here. ;-)
> >
> > Personally I'd expose a bunch of planes with kms (enough so that you can
> > reap the usual benefits planes bring wrt video-playback and stuff like
> > that). So perhaps something in line with what current hw does in hw and
> > then double it a bit or twice - 16 planes or so. Your driver would reject
> > any requests that need intermediate buffers to store render results. I.e.
> > everything that can't be scanned out directly in real-time at about 60fps.
> > The fun with kms planes is also that right now we have 0 standards for
> > z-ordering and blending. So would need to define that first.
> >
> > Then expose everything else with a separate api. I guess you'll just end
> > up with per-compositor userspace drivers due to the lack of a widespread
> > 2d api. OpenVG is kinda dead, and cairo might not fit.
> 
> I kind of suspect someone should really just design weston2d, an api
> more explicitly for compositing.. model after OpenWFC if that fits
> nicely.  Or not if it doesn't.  Or just use the existing weston
> front-end/back-end split..
> 
> I expect other wayland compositors would want more or less the same
> thing as weston (barring pre-existing layer-cake mess..  cough, cough,
> cogl/clutter/gnome-shell..)
> 
> We could even make a gallium statetracker implementation of weston2d
> to get some usage on desktop..

Yeah. I suppose I should aim for whatever driver-specific
interface we need for the HVS to be used from user space, use that in
Weston, and get a feeling of what might be a nice, driver-agnostic 2D
compositing API.


Thanks,
pq

* Re: How to design a DRM KMS driver exposing 2D compositing?
  2014-08-12  7:20     ` Pekka Paalanen
@ 2014-08-12  8:03       ` Daniel Vetter
  2014-08-12 10:04         ` Ville Syrjälä
  0 siblings, 1 reply; 23+ messages in thread
From: Daniel Vetter @ 2014-08-12  8:03 UTC (permalink / raw)
  To: Pekka Paalanen; +Cc: dri-devel

On Tue, Aug 12, 2014 at 9:20 AM, Pekka Paalanen <ppaalanen@gmail.com> wrote:
>> but I tend to think it would be nice for compositors (userspace) to
>> know explicitly what is going on..  ie. if some layers are blended via
>> intermediate buffer, couldn't that intermediate buffer be potentially
>> re-used on next frame if not damaged?
>
> Very true, and I think that speaks for exposing the HVS explicitly to
> user space to be directly used. That way I believe the user space could
> track damage and composite only the minimum, rather than everything
> every time which I suppose the KMS API approach would imply.
>
> We don't have dirty regions in KMS API/props, do we? But yeah, that is
> starting to feel like a stretch to push through KMS.

We have the dirty-ioctl, but imo it's a bit misdesigned: It works at
the framebuffer level (so the driver always has to figure out which
crtc/plane this is about), and it only works for frontbuffer
rendering. It was essentially a single-purpose thing for udl uploads.

But in general I think it would make tons of sense to supply a
per-crtc (or maybe per-plane) damage rect with nuclear flips. Both
mipi dsi and edp have provisions to upload a subrect, so this could be
useful in general. And decent compositors compute this already anyway.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

* Re: How to design a DRM KMS driver exposing 2D compositing?
  2014-08-11 14:37 ` Matt Roper
@ 2014-08-12  8:42   ` Pekka Paalanen
  0 siblings, 0 replies; 23+ messages in thread
From: Pekka Paalanen @ 2014-08-12  8:42 UTC (permalink / raw)
  To: Matt Roper; +Cc: dri-devel

On Mon, 11 Aug 2014 07:37:18 -0700
Matt Roper <matthew.d.roper@intel.com> wrote:

> On Mon, Aug 11, 2014 at 01:38:55PM +0300, Pekka Paalanen wrote:
> > Hi,
> > 
> > there is some hardware that can do 2D compositing with an arbitrary
> > number of planes. I'm not sure what the absolute maximum number of
> > planes is, but for the discussion, let's say it is 100.
> > 
> > There are many complicated, dynamic constraints on how many, what size,
> > etc. planes can be used at once. A driver would be able to check those
> > before kicking the 2D compositing engine.
> > 
> > The 2D compositing engine in the best case (only few planes used) is
> > able to composite on the fly in scanout, just like the usual overlay
> > hardware blocks in CRTCs. When the composition complexity goes up, the
> > driver can fall back to compositing into a buffer rather than on the
> > fly in scanout. This fallback needs to be completely transparent to the
> > user space, implying only additional latency if anything.
> 
> Is your requirement that this needs to be transparent to all userspace
> or just transparent to your display server (e.g., Weston)?  I'm
> wondering whether it might be easier to write a libdrm interposer that
> intercepts any libdrm calls dealing with planes and exposes a bunch of
> additional "virtual" planes to the display server when queried.  When
> you submit an atomic ioctl, your interposer will figure out the best
> strategy to make that happen given the real hardware available on your
> system and will try to blend some of your excess buffers via whatever
> userspace API's are available (Cairo, GLES, OpenVG, etc.).  This would
> keep kernel complexity down and allow easier debugging and tuning.

That's an inventive proposition. ;-)

I would still need to design the kernel/user ABI for the HVS (the 2D
engine). As I am starting to believe that the "non-real-time" use of
the HVS does not belong behind the KMS API, we might as well just do
things more properly, and expose it with a real user space API
eventually.


Thanks,
pq

* Re: How to design a DRM KMS driver exposing 2D compositing?
  2014-08-11 17:27     ` Daniel Vetter
@ 2014-08-12  8:48       ` Pekka Paalanen
  2014-08-12 16:10         ` Eric Anholt
  0 siblings, 1 reply; 23+ messages in thread
From: Pekka Paalanen @ 2014-08-12  8:48 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: dri-devel

On Mon, 11 Aug 2014 19:27:45 +0200
Daniel Vetter <daniel@ffwll.ch> wrote:

> On Mon, Aug 11, 2014 at 10:16:24AM -0700, Eric Anholt wrote:
> > Daniel Vetter <daniel@ffwll.ch> writes:
> > 
> > > On Mon, Aug 11, 2014 at 01:38:55PM +0300, Pekka Paalanen wrote:
> > >> Hi,
> > >> 
> > >> there is some hardware that can do 2D compositing with an arbitrary
> > >> number of planes. I'm not sure what the absolute maximum number of
> > >> planes is, but for the discussion, let's say it is 100.
> > >> 
> > >> There are many complicated, dynamic constraints on how many, what size,
> > >> etc. planes can be used at once. A driver would be able to check those
> > >> before kicking the 2D compositing engine.
> > >> 
> > >> The 2D compositing engine in the best case (only few planes used) is
> > >> able to composite on the fly in scanout, just like the usual overlay
> > >> hardware blocks in CRTCs. When the composition complexity goes up, the
> > >> driver can fall back to compositing into a buffer rather than on the
> > >> fly in scanout. This fallback needs to be completely transparent to the
> > >> user space, implying only additional latency if anything.
> > >> 
> > >> These 2D compositing features should be exposed to user space through a
> > >> standard kernel ABI, hopefully an existing ABI in the very near future
> > >> like the KMS atomic.
> > >
> > > I presume we're talking about the video core from raspi? Or at least
> > > something similar?
> > 
> > Pekka wasn't sure if things were confidential here, but I can say it:
> > Yeah, it's the RPi.
> > 
> > While I haven't written code using the compositor interface (I just did
> > enough to shim in a single plane for bringup, and I'm hoping Pekka and
> > company can handle the rest for me :) ), my understanding is that the
> > way you make use of it is that you've got your previous frame loaded up
> > in the HVS (the plane compositor hardware), then when you're asked to
> > put up a new frame that's going to be too hard, you take some
> > complicated chunk of your scene and ask the HVS to use any spare
> > bandwidth it has while it's still scanning out the previous frame in
> > order to composite that piece of new scene into memory.  Then, when it's
> > done with the offline composite, you ask the HVS to do the next scanout
> > frame using the original scene with the pre-composited temporary buffer.
> > 
> > I'm pretty comfortable with the idea of having some large number of
> > planes preallocated, and deciding that "nobody could possibly need more
> > than 16" (or whatever).
> > 
> > My initial reaction to "we should just punt when we run out of bandwidth
> > and have a special driver interface for offline composite" was "that's
> > awful, when the kernel could just get the job done immediately, and
> > easily, and it would know exactly what it needed to composite to get
> > things to fit (unlike userspace)".  I'm trying to come up with what
> > benefit there would be to having a separate interface for offline
> > composite.  I've got 3 things:
> > 
> > - Avoids having a potentially long, interruptible wait in the modeset
> >   path while the offline composite happens.  But I think we have other
> > >   interruptible waits in that path already.
> > 
> > - Userspace could potentially do something else besides use the HVS to
> >   get the fallback done.  Video would have to use the HVS, to get the
> >   same scaling filters applied as the previous frame where things *did*
> >   fit, but I guess you could composite some 1:1 RGBA overlays in GL,
> >   which would have more BW available to it than what you're borrowing
> >   from the previous frame's HVS capacity.
> > 
> > - Userspace could potentially use the offline composite interface for
> >   things besides just the running-out-of-bandwidth case.  Like, it was
> >   doing a nicely-filtered downscale of an overlaid video, then the user
> >   hit pause and walked away: you could have a timeout that noticed that
> >   the complicated scene hadn't changed in a while, and you'd drop from
> >   overlays to a HVS-composited single plane to reduce power.
> > 
> > The third one is the one I've actually found kind of compelling, and
> > might be switching me from wanting no userspace visibility into the
> > fallback.  But I don't have a good feel for how much complexity there is
> > to our descriptions of planes, and how much poorly-tested interface we'd
> > be adding to support this usecase.
> 
> Compositor should already do a rough bw guesstimate and if stuff doesn't
> change any more bake the entire scene into a single framebuffer. The exact
> same issue happens on more usual hw with video overlays, too.
> 
> Ofc if it turns out that scanning out your yuv planes is less bw then the
> overlay shouldn't be stopped ofc. But imo there's nothing special here for
> the rpi.
>  
> > (Because, honestly, I don't expect the fallbacks to be hit much -- my
> > understanding of the bandwidth equation is that you're mostly counting
> > the number of pixels that have to be read, and clipped-out pixels
> > because somebody's overlaid on top of you don't count unless they're in
> > the same burst read.  So unless people are going nuts with blending in
> > overlays, or downscaled video, it's probably not a problem, and
> > something that gets your pixels on the screen at all is sufficient)
> 
> Yeah I guess we need to check reality here. If the "we've run out of bw"
> case just never happens then it's pointless to write special code for it.
> And we can always add a limit later for the case where GL is usually
> better and tell userspace that we can't do this many planes. Exact same
> thing with running out of memory bw can happen anywhere else, too.

I had a chat with Eric last night, and our different views about the
on-line/real-time performance limits of the HVS seem to be due to alpha
blending.

Eric has not been using alpha blending much or at all, while my
experiments with Weston and DispmanX pretty much always need alpha
blending (e.g. because DispmanX cannot say that only a sub-region of a
buffer needs blending). Eric says alpha blending kills the performance.

This makes me think that maybe I should expose only one or two (cursor?)
planes with alpha blending formats, and all other planes with only
opaque formats. That would naturally limit the compositor's use of
planes to cases where it probably matters most: cursors, and opaque
video and openGL surfaces.
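
As a rough illustration of that split, here is a minimal sketch in C of
per-plane format lists. Everything named sketch_* or SKETCH_* is
hypothetical: the plane count of 16 is borrowed from the "16 or so" figure
mentioned earlier in the thread, the choice of the two top-most plane
indices as the alpha-capable ones is an assumption, and NV12 merely stands
in for "opaque video". The fourcc values are the same as the corresponding
DRM_FORMAT_* codes in drm_fourcc.h.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_FOURCC(a, b, c, d) \
	((uint32_t)(a) | ((uint32_t)(b) << 8) | \
	 ((uint32_t)(c) << 16) | ((uint32_t)(d) << 24))

/* Same values as DRM_FORMAT_XRGB8888, DRM_FORMAT_ARGB8888, DRM_FORMAT_NV12. */
#define SKETCH_FORMAT_XRGB8888 SKETCH_FOURCC('X', 'R', '2', '4')
#define SKETCH_FORMAT_ARGB8888 SKETCH_FOURCC('A', 'R', '2', '4')
#define SKETCH_FORMAT_NV12     SKETCH_FOURCC('N', 'V', '1', '2')

#define SKETCH_NUM_PLANES       16 /* arbitrary, for illustration */
#define SKETCH_NUM_ALPHA_PLANES 2  /* e.g. cursor plus one more */

/* Formats advertised on the ordinary planes: opaque only. */
static const uint32_t sketch_opaque_formats[] = {
	SKETCH_FORMAT_XRGB8888,
	SKETCH_FORMAT_NV12,
};

/* Formats advertised on the few alpha-capable planes. */
static const uint32_t sketch_alpha_formats[] = {
	SKETCH_FORMAT_XRGB8888,
	SKETCH_FORMAT_NV12,
	SKETCH_FORMAT_ARGB8888,
};

/* Return the format list a given plane would advertise to userspace. */
const uint32_t *sketch_plane_formats(unsigned int index, size_t *count)
{
	if (index >= SKETCH_NUM_PLANES) {
		*count = 0;
		return NULL;
	}
	if (index >= SKETCH_NUM_PLANES - SKETCH_NUM_ALPHA_PLANES) {
		*count = sizeof(sketch_alpha_formats) /
			 sizeof(sketch_alpha_formats[0]);
		return sketch_alpha_formats;
	}
	*count = sizeof(sketch_opaque_formats) /
		 sizeof(sketch_opaque_formats[0]);
	return sketch_opaque_formats;
}

/* True if the plane advertises the given format. */
bool sketch_plane_accepts(unsigned int index, uint32_t format)
{
	size_t count;
	const uint32_t *formats = sketch_plane_formats(index, &count);

	for (size_t i = 0; i < count; i++)
		if (formats[i] == format)
			return true;
	return false;
}
```

With lists like these, a compositor that probes plane formats would find
ARGB8888 only on the last two planes, so any other alpha-blended surface
would fail plane assignment and go through the compositor's own fallback
path.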

Then all the alpha-blended stuff will hit the fallback... which is...
Pixman at the moment (thinking about Weston here) until Eric gets
the GLESv2 flying. :-/

That means that doing a driver-specific kernel/user ABI for using the
HVS seems required. Write driver-specific libdrm API for it, use it
directly in Weston, see what falls out later.


Thanks,
pq

* Re: How to design a DRM KMS driver exposing 2D compositing?
  2014-08-12  8:03       ` Daniel Vetter
@ 2014-08-12 10:04         ` Ville Syrjälä
  0 siblings, 0 replies; 23+ messages in thread
From: Ville Syrjälä @ 2014-08-12 10:04 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: dri-devel

On Tue, Aug 12, 2014 at 10:03:26AM +0200, Daniel Vetter wrote:
> On Tue, Aug 12, 2014 at 9:20 AM, Pekka Paalanen <ppaalanen@gmail.com> wrote:
> >> but I tend to think it would be nice for compositors (userspace) to
> >> know explicitly what is going on..  ie. if some layers are blended via
> >> intermediate buffer, couldn't that intermediate buffer be potentially
> >> re-used on next frame if not damaged?
> >
> > Very true, and I think that speaks for exposing the HVS explicitly to
> > user space to be directly used. That way I believe the user space could
> > track damage and composite only the minimum, rather than everything
> > every time which I suppose the KMS API approach would imply.
> >
> > We don't have dirty regions in KMS API/props, do we? But yeah, that is
> > starting to feel like a stretch to push through KMS.
> 
> We have the dirty-ioctl, but imo it's a bit misdesigned: It works at
> the framebuffer level (so the driver always has to figure out which
> crtc/plane this is about), and it only works for frontbuffer
> rendering. It was essentially a single-purpose thing for udl uploads.
> 
> But in general I think it would make tons of sense to supply a
> per-crtc (or maybe per-plane) damage rect with nuclear flips. Both
> mipi dsi and edp have provisions to upload a subrect, so this could be
> useful in general. And decent compositors compute this already anyway.

Agreed, as long as we make it more of a hint so that the driver is
allowed to expand the rect to satisfy hardware specific alignment
requirements and whatnot.

I think a single per-crtc rect should be enough, but in case people
would like to implement a more sophisticated multi-rect update I
suppose we could allow it. And for those that don't want the extra
complexity of trying to deal with multiple rectangles, the driver
could just calculate the bounding rectangle and update that.
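
Both ideas, collapsing several damage rectangles into one bounding
rectangle and letting the driver expand it to meet alignment requirements,
can be sketched in a few lines of C. This is a minimal illustration only:
struct sketch_rect and both functions are hypothetical names, the
single power-agnostic 'align' value for both axes is an assumption, and
real hardware may have different constraints per axis.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical rectangle; x2 and y2 are exclusive. */
struct sketch_rect {
	int x1, y1, x2, y2;
};

/* Bounding rectangle of n damage rects; empty rects are ignored. */
struct sketch_rect sketch_bounding_rect(const struct sketch_rect *r, size_t n)
{
	struct sketch_rect out = {0, 0, 0, 0};
	int have_one = 0;

	for (size_t i = 0; i < n; i++) {
		if (r[i].x2 <= r[i].x1 || r[i].y2 <= r[i].y1)
			continue;
		if (!have_one) {
			out = r[i];
			have_one = 1;
			continue;
		}
		if (r[i].x1 < out.x1) out.x1 = r[i].x1;
		if (r[i].y1 < out.y1) out.y1 = r[i].y1;
		if (r[i].x2 > out.x2) out.x2 = r[i].x2;
		if (r[i].y2 > out.y2) out.y2 = r[i].y2;
	}
	return out;
}

/*
 * Treat the rect as a hint: grow it outward to a multiple of 'align'
 * pixels, then clamp it to the width x height of the CRTC.
 */
struct sketch_rect sketch_align_rect(struct sketch_rect r, int align,
				     int width, int height)
{
	assert(align > 0);

	/* Clamp the origin first so the divisions below see no negatives. */
	if (r.x1 < 0) r.x1 = 0;
	if (r.y1 < 0) r.y1 = 0;
	if (r.x2 < r.x1) r.x2 = r.x1;
	if (r.y2 < r.y1) r.y2 = r.y1;

	r.x1 = r.x1 / align * align;                /* round down */
	r.y1 = r.y1 / align * align;
	r.x2 = (r.x2 + align - 1) / align * align;  /* round up */
	r.y2 = (r.y2 + align - 1) / align * align;

	if (r.x2 > width)  r.x2 = width;
	if (r.y2 > height) r.y2 = height;
	return r;
}
```

A driver that only handles a single rectangle could run every multi-rect
update through sketch_bounding_rect() first and then through
sketch_align_rect(), while userspace would stay free to submit several
rectangles.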

-- 
Ville Syrjälä
Intel OTC

* Re: How to design a DRM KMS driver exposing 2D compositing?
  2014-08-12  8:48       ` Pekka Paalanen
@ 2014-08-12 16:10         ` Eric Anholt
  2014-08-13  7:02           ` Pekka Paalanen
  0 siblings, 1 reply; 23+ messages in thread
From: Eric Anholt @ 2014-08-12 16:10 UTC (permalink / raw)
  To: Pekka Paalanen, Daniel Vetter; +Cc: dri-devel


Pekka Paalanen <ppaalanen@gmail.com> writes:

> On Mon, 11 Aug 2014 19:27:45 +0200
> Daniel Vetter <daniel@ffwll.ch> wrote:
>
>> On Mon, Aug 11, 2014 at 10:16:24AM -0700, Eric Anholt wrote:
>> > Daniel Vetter <daniel@ffwll.ch> writes:
>> > 
>> > > On Mon, Aug 11, 2014 at 01:38:55PM +0300, Pekka Paalanen wrote:
>> > >> Hi,
>> > >> 
>> > >> there is some hardware that can do 2D compositing with an arbitrary
>> > >> number of planes. I'm not sure what the absolute maximum number of
>> > >> planes is, but for the discussion, let's say it is 100.
>> > >> 
>> > >> There are many complicated, dynamic constraints on how many, what size,
>> > >> etc. planes can be used at once. A driver would be able to check those
>> > >> before kicking the 2D compositing engine.
>> > >> 
>> > >> The 2D compositing engine in the best case (only few planes used) is
>> > >> able to composite on the fly in scanout, just like the usual overlay
>> > >> hardware blocks in CRTCs. When the composition complexity goes up, the
>> > >> driver can fall back to compositing into a buffer rather than on the
>> > >> fly in scanout. This fallback needs to be completely transparent to the
>> > >> user space, implying only additional latency if anything.
>> > >> 
>> > >> These 2D compositing features should be exposed to user space through a
>> > >> standard kernel ABI, hopefully an existing ABI in the very near future
>> > >> like the KMS atomic.
>> > >
>> > > I presume we're talking about the video core from raspi? Or at least
>> > > something similar?
>> > 
>> > Pekka wasn't sure if things were confidential here, but I can say it:
>> > Yeah, it's the RPi.
>> > 
>> > While I haven't written code using the compositor interface (I just did
>> > enough to shim in a single plane for bringup, and I'm hoping Pekka and
>> > company can handle the rest for me :) ), my understanding is that the
>> > way you make use of it is that you've got your previous frame loaded up
>> > in the HVS (the plane compositor hardware), then when you're asked to
>> > put up a new frame that's going to be too hard, you take some
>> > complicated chunk of your scene and ask the HVS to use any spare
>> > bandwidth it has while it's still scanning out the previous frame in
>> > order to composite that piece of new scene into memory.  Then, when it's
>> > done with the offline composite, you ask the HVS to do the next scanout
>> > frame using the original scene with the pre-composited temporary buffer.
>> > 
>> > I'm pretty comfortable with the idea of having some large number of
>> > planes preallocated, and deciding that "nobody could possibly need more
>> > than 16" (or whatever).
>> > 
>> > My initial reaction to "we should just punt when we run out of bandwidth
>> > and have a special driver interface for offline composite" was "that's
>> > awful, when the kernel could just get the job done immediately, and
>> > easily, and it would know exactly what it needed to composite to get
>> > things to fit (unlike userspace)".  I'm trying to come up with what
>> > benefit there would be to having a separate interface for offline
>> > composite.  I've got 3 things:
>> > 
>> > - Avoids having a potentially long, interruptible wait in the modeset
>> >   path while the offline composite happens.  But I think we have other
>> >   interruptible waits in that path already.
>> > 
>> > - Userspace could potentially do something else besides use the HVS to
>> >   get the fallback done.  Video would have to use the HVS, to get the
>> >   same scaling filters applied as the previous frame where things *did*
>> >   fit, but I guess you could composite some 1:1 RGBA overlays in GL,
>> >   which would have more BW available to it than what you're borrowing
>> >   from the previous frame's HVS capacity.
>> > 
>> > - Userspace could potentially use the offline composite interface for
>> >   things besides just the running-out-of-bandwidth case.  Like, it was
>> >   doing a nicely-filtered downscale of an overlaid video, then the user
>> >   hit pause and walked away: you could have a timeout that noticed that
>> >   the complicated scene hadn't changed in a while, and you'd drop from
>> >   overlays to a HVS-composited single plane to reduce power.
>> > 
>> > The third one is the one I've actually found kind of compelling, and
>> > might be switching me from wanting no userspace visibility into the
>> > fallback.  But I don't have a good feel for how much complexity there is
>> > to our descriptions of planes, and how much poorly-tested interface we'd
>> > be adding to support this usecase.
>> 
>> Compositor should already do a rough bw guesstimate and if stuff doesn't
>> change any more bake the entire scene into a single framebuffer. The exact
>> same issue happens on more usual hw with video overlays, too.
>> 
>> Ofc if it turns out that scanning out your yuv planes is less bw then the
>> overlay shouldn't be stopped ofc. But imo there's nothing special here for
>> the rpi.
>>  
>> > (Because, honestly, I don't expect the fallbacks to be hit much -- my
>> > understanding of the bandwidth equation is that you're mostly counting
>> > the number of pixels that have to be read, and clipped-out pixels
>> > because somebody's overlaid on top of you don't count unless they're in
>> > the same burst read.  So unless people are going nuts with blending in
>> > overlays, or downscaled video, it's probably not a problem, and
>> > something that gets your pixels on the screen at all is sufficient)
>> 
>> Yeah I guess we need to check reality here. If the "we've run out of bw"
>> case just never happens then it's pointless to write special code for it.
>> And we can always add a limit later for the case where GL is usually
>> better and tell userspace that we can't do this many planes. Exact same
>> thing with running out of memory bw can happen anywhere else, too.
>
> I had a chat with Eric last night, and our different views about the
> on-line/real-time performance limits of the HVS seem to be due to alpha
> blending.
>
> Eric has not been using alpha blending much or at all, while my
> experiments with Weston and DispmanX pretty much always need alpha
> blending (e.g. because DispmanX cannot say that only a sub-region of a
> buffer needs blending). Eric says alpha blending kills the
> performance.

Note, I wasn't saying anything about performance.  I was just talking
about how compositing in X knows that (almost) everything is actually
opaque, so I don't have the worries about alpha blending that you
apparently do in Weston.


* Re: How to design a DRM KMS driver exposing 2D compositing?
  2014-08-12 16:10         ` Eric Anholt
@ 2014-08-13  7:02           ` Pekka Paalanen
  0 siblings, 0 replies; 23+ messages in thread
From: Pekka Paalanen @ 2014-08-13  7:02 UTC (permalink / raw)
  To: Eric Anholt; +Cc: dri-devel

On Tue, 12 Aug 2014 09:10:47 -0700
Eric Anholt <eric@anholt.net> wrote:

> Pekka Paalanen <ppaalanen@gmail.com> writes:
> 
> > On Mon, 11 Aug 2014 19:27:45 +0200
> > Daniel Vetter <daniel@ffwll.ch> wrote:
> >
> >> On Mon, Aug 11, 2014 at 10:16:24AM -0700, Eric Anholt wrote:
> >> > Daniel Vetter <daniel@ffwll.ch> writes:
> >> > 
> >> > > On Mon, Aug 11, 2014 at 01:38:55PM +0300, Pekka Paalanen wrote:
> >> > >> Hi,
> >> > >> 
> >> > >> there is some hardware that can do 2D compositing with an arbitrary
> >> > >> number of planes. I'm not sure what the absolute maximum number of
> >> > >> planes is, but for the discussion, let's say it is 100.
> >> > >> 
> >> > >> There are many complicated, dynamic constraints on how many, what size,
> >> > >> etc. planes can be used at once. A driver would be able to check those
> >> > >> before kicking the 2D compositing engine.
> >> > >> 
> >> > >> The 2D compositing engine in the best case (only few planes used) is
> >> > >> able to composite on the fly in scanout, just like the usual overlay
> >> > >> hardware blocks in CRTCs. When the composition complexity goes up, the
> >> > >> driver can fall back to compositing into a buffer rather than on the
> >> > >> fly in scanout. This fallback needs to be completely transparent to the
> >> > >> user space, implying only additional latency if anything.
> >> > >> 
> >> > >> These 2D compositing features should be exposed to user space through a
> >> > >> standard kernel ABI, hopefully an existing ABI in the very near future
> >> > >> like the KMS atomic.
> >> > >
> >> > > I presume we're talking about the video core from raspi? Or at least
> >> > > something similar?
> >> > 
> >> > Pekka wasn't sure if things were confidential here, but I can say it:
> >> > Yeah, it's the RPi.
> >> > 
> >> > While I haven't written code using the compositor interface (I just did
> >> > enough to shim in a single plane for bringup, and I'm hoping Pekka and
> >> > company can handle the rest for me :) ), my understanding is that the
> >> > way you make use of it is that you've got your previous frame loaded up
> >> > in the HVS (the plane compositor hardware), then when you're asked to
> >> > put up a new frame that's going to be too hard, you take some
> >> > complicated chunk of your scene and ask the HVS to use any spare
> >> > bandwidth it has while it's still scanning out the previous frame in
> >> > order to composite that piece of new scene into memory.  Then, when it's
> >> > done with the offline composite, you ask the HVS to do the next scanout
> >> > frame using the original scene with the pre-composited temporary buffer.
> >> > 
> >> > I'm pretty comfortable with the idea of having some large number of
> >> > planes preallocated, and deciding that "nobody could possibly need more
> >> > than 16" (or whatever).
> >> > 
> >> > My initial reaction to "we should just punt when we run out of bandwidth
> >> > and have a special driver interface for offline composite" was "that's
> >> > awful, when the kernel could just get the job done immediately, and
> >> > easily, and it would know exactly what it needed to composite to get
> >> > things to fit (unlike userspace)".  I'm trying to come up with what
> >> > benefit there would be to having a separate interface for offline
> >> > composite.  I've got 3 things:
> >> > 
> >> > - Avoids having a potentially long, interruptible wait in the modeset
> >> >   path while the offline composite happens.  But I think we have other
> >> >   interruptible waits in that path already.
> >> > 
> >> > - Userspace could potentially do something else besides use the HVS to
> >> >   get the fallback done.  Video would have to use the HVS, to get the
> >> >   same scaling filters applied as the previous frame where things *did*
> >> >   fit, but I guess you could composite some 1:1 RGBA overlays in GL,
> >> >   which would have more BW available to it than what you're borrowing
> >> >   from the previous frame's HVS capacity.
> >> > 
> >> > - Userspace could potentially use the offline composite interface for
> >> >   things besides just the running-out-of-bandwidth case.  Like, it was
> >> >   doing a nicely-filtered downscale of an overlaid video, then the user
> >> >   hit pause and walked away: you could have a timeout that noticed that
> >> >   the complicated scene hadn't changed in a while, and you'd drop from
> >> >   overlays to a HVS-composited single plane to reduce power.
> >> > 
> >> > The third one is the one I've actually found kind of compelling, and
> >> > might be switching me from wanting no userspace visibility into the
> >> > fallback.  But I don't have a good feel for how much complexity there is
> >> > to our descriptions of planes, and how much poorly-tested interface we'd
> >> > be adding to support this usecase.
> >> 
> >> Compositor should already do a rough bw guesstimate and if stuff doesn't
> >> change any more bake the entire scene into a single framebuffer. The exact
> >> same issue happens on more usual hw with video overlays, too.
> >> 
> >> Ofc if it turns out that scanning out your yuv planes is less bw then the
> >> overlay shouldn't be stopped ofc. But imo there's nothing special here for
> >> the rpi.
> >>  
> >> > (Because, honestly, I don't expect the fallbacks to be hit much -- my
> >> > understanding of the bandwidth equation is that you're mostly counting
> >> > the number of pixels that have to be read, and clipped-out pixels
> >> > because somebody's overlaid on top of you don't count unless they're in
> >> > the same burst read.  So unless people are going nuts with blending in
> >> > overlays, or downscaled video, it's probably not a problem, and
> >> > something that gets your pixels on the screen at all is sufficient)
> >> 
> >> Yeah I guess we need to check reality here. If the "we've run out of bw"
> >> case just never happens then it's pointless to write special code for it.
> >> And we can always add a limit later for the case where GL is usually
> >> better and tell userspace that we can't do this many planes. Exact same
> >> thing with running out of memory bw can happen anywhere else, too.
> >
> > I had a chat with Eric last night, and our different views about the
> > on-line/real-time performance limits of the HVS seem to be due to alpha
> > blending.
> >
> > Eric has not been using alpha blending much or at all, while my
> > experiments with Weston and DispmanX pretty much always need alpha
> > blending (e.g. because DispmanX cannot say that only a sub-region of a
> > buffer needs blending). Eric says alpha blending kills the
> > performance.
> 
> Note, I wasn't saying anything about performance.  I was just talking
> about how compositing in X knows that (almost) everything is actually
> opaque, so I don't have the worries about alpha blending that you
> apparently do in Weston.

Ok, I'm confused.

Most surfaces in Weston do have non-opaque parts, usually the window
decorations, depending of course on the desktop visual style in use.
That means almost no surface is completely opaque, the wallpaper being
the obvious exception.

In Weston, we also have the opaque region, set by apps as a hint that
these regions do not need alpha blending. However, with DispmanX there
was no way to make use of the opaque region markup unless it covered
the whole surface.

Well, I could have split every window into 5 DispmanX elements
instead of just one (4 blended, 1 opaque) to approximate the usual
case with decorations, but I never tried that. There was some concern
that the number of elements would become the dominating limit on how
much can be on screen at once, so it didn't feel worth the added
complexity, and enabling the automatic fallback to off-line just worked.
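
For illustration only, the untried 5-element split could be sketched
roughly as below. This is a minimal sketch, assuming the opaque region is
a single rectangle lying fully inside the window. struct rect and
split_window() are hypothetical names, not DispmanX or Weston API.

```c
/* Hypothetical rectangle type; not a DispmanX or Weston structure. */
struct rect {
	int x, y, w, h;
	int needs_blend; /* 1 = alpha-blended, 0 = opaque */
};

/*
 * Split window `win` into one opaque element covering `opaque` plus up
 * to four blended strips around it (top, bottom, left, right).
 * Assumption: `opaque` lies fully inside `win`.
 * Returns the number of elements written to out[] (at most 5).
 */
static int split_window(struct rect win, struct rect opaque,
			struct rect out[5])
{
	int n = 0;
	int top = opaque.y - win.y;
	int bottom = (win.y + win.h) - (opaque.y + opaque.h);
	int left = opaque.x - win.x;
	int right = (win.x + win.w) - (opaque.x + opaque.w);

	/* Full-width strips above and below the opaque area. */
	if (top > 0)
		out[n++] = (struct rect){ win.x, win.y, win.w, top, 1 };
	if (bottom > 0)
		out[n++] = (struct rect){ win.x, opaque.y + opaque.h,
					  win.w, bottom, 1 };

	/* Side strips, limited to the height of the opaque area. */
	if (left > 0)
		out[n++] = (struct rect){ win.x, opaque.y,
					  left, opaque.h, 1 };
	if (right > 0)
		out[n++] = (struct rect){ opaque.x + opaque.w, opaque.y,
					  right, opaque.h, 1 };

	/* The opaque interior needs no blending. */
	if (opaque.w > 0 && opaque.h > 0)
		out[n++] = (struct rect){ opaque.x, opaque.y,
					  opaque.w, opaque.h, 0 };

	return n;
}
```

The elements do not overlap and together cover the whole window, so
only the decoration strips would be blended. The cost is up to five
elements per window, which is exactly the element-count concern
mentioned above.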

Alpha-blending can still be forced to a whole window by desktop
effects, though.

Does this explain why I saw that, with DispmanX, the HVS on-line mode
would fail to reliably drive the output with just one or two basic app
windows open, if even that much? IIRC that was on a 1280x1024 monitor,
not even close to full HD.


Thanks,
pq
