User ptr horror show

All of lore.kernel.org
 help / color / mirror / Atom feed

* User ptr horror show
@ 2014-06-30 18:21 Jerome Glisse
  2014-06-30 18:47 ` David Herrmann
  0 siblings, 1 reply; 5+ messages in thread
From: Jerome Glisse @ 2014-06-30 18:21 UTC (permalink / raw)
  To: dri-devel, airlied, Daniel Vetter

So in light of the radeon patch to add user ptr, i took a look at
intel code and it is time to put an end to this non sense. It
violate so many mm assumptions that it just not a doable options.

So Intel code only register a range_start callback that means that
any gup or other i915 activities that happens just after this call
back returns as no idea what so ever of it might get. It might get
the old pages that are about to change or the new pages.

For the use case where the gpu is only reading thing from those
buffer this is fine. But the use case where gpu gonna writte to
them and then user space expect to be able to access what the gpu
wrote through its userspace mapping it will be like any games in
a casino. You will never ever win in the end. You are bound to
loose. Just no way around that.

So let just stop that non sense. Wake up from our dream and move
on.

I am not even talking about how invalidate_page is ignore and how
well this will play with page write back.

Cheers,
Jérôme
The Grand Inquisitor

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: User ptr horror show
  2014-06-30 18:21 User ptr horror show Jerome Glisse
@ 2014-06-30 18:47 ` David Herrmann
  2014-06-30 19:04   ` Jerome Glisse
  0 siblings, 1 reply; 5+ messages in thread
From: David Herrmann @ 2014-06-30 18:47 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: Daniel Vetter, dri-devel@lists.freedesktop.org

Hi

On Mon, Jun 30, 2014 at 8:21 PM, Jerome Glisse <j.glisse@gmail.com> wrote:
> So in light of the radeon patch to add user ptr, i took a look at
> intel code and it is time to put an end to this non sense. It
> violate so many mm assumptions that it just not a doable options.
>
> So Intel code only register a range_start callback that means that
> any gup or other i915 activities that happens just after this call
> back returns as no idea what so ever of it might get. It might get
> the old pages that are about to change or the new pages.

Can you give a complete example of that race? I cannot follow.

I did have a quite thorough look on intel's userptr implementation and
it does things similar to AIO, Direct-IO and other APIs that pin
user-pages (they also do it for reads or writes):
 - Get pages via GUP
 - don't care whether the user unmaps, truncates, moves, kills them;
they work on pages, not on VM ranges

Additionally to what AIO and Direct-IO do, intel userptr adds the
range_start callback to release pinned pages whenever the pages are
unmapped. However, anyone who truncates inode pages, schedules
writeback, etc., has to lock the page. Thus, any following GUP-fast
from userptr will fail and the slowpath will wait on mmap_sem. So I'd
really prefer if you could elaborate on your race?

Thanks
David

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: User ptr horror show
  2014-06-30 18:47 ` David Herrmann
@ 2014-06-30 19:04   ` Jerome Glisse
  2014-06-30 19:25     ` David Herrmann
  0 siblings, 1 reply; 5+ messages in thread
From: Jerome Glisse @ 2014-06-30 19:04 UTC (permalink / raw)
  To: David Herrmann; +Cc: Daniel Vetter, dri-devel@lists.freedesktop.org

On Mon, Jun 30, 2014 at 08:47:31PM +0200, David Herrmann wrote:
> Hi
> 
> On Mon, Jun 30, 2014 at 8:21 PM, Jerome Glisse <j.glisse@gmail.com> wrote:
> > So in light of the radeon patch to add user ptr, i took a look at
> > intel code and it is time to put an end to this non sense. It
> > violate so many mm assumptions that it just not a doable options.
> >
> > So Intel code only register a range_start callback that means that
> > any gup or other i915 activities that happens just after this call
> > back returns as no idea what so ever of it might get. It might get
> > the old pages that are about to change or the new pages.
> 
> Can you give a complete example of that race? I cannot follow.
> 
> I did have a quite thorough look on intel's userptr implementation and
> it does things similar to AIO, Direct-IO and other APIs that pin
> user-pages (they also do it for reads or writes):
>  - Get pages via GUP
>  - don't care whether the user unmaps, truncates, moves, kills them;
> they work on pages, not on VM ranges

Those other syscall have clear definition, ie they work on page and
do not rely on the vma. But afaict the user ptr gem object do not
enforce nor claim that what gpu will use might not be what user
can see through the user space mapping. Hence it gives user of the
ioctl a false hope.

I am not against an ioctl that would steal the page from under the
vma and thus give a predictable outcome while steal allowing zero
copy which is i believe the only sane use case that can be made.

> 
> Additionally to what AIO and Direct-IO do, intel userptr adds the
> range_start callback to release pinned pages whenever the pages are
> unmapped. However, anyone who truncates inode pages, schedules
> writeback, etc., has to lock the page. Thus, any following GUP-fast
> from userptr will fail and the slowpath will wait on mmap_sem. So I'd
> really prefer if you could elaborate on your race?

Some writback code path (and other cpu page table modificiation) will not
call range_start but only invalidate_page. More over once the range_start
is call a GUP that is done before range_end is call will return what ever
it sees inside the cpu page table at the time which might be new pages or
old pages.

Thus you can imagine i915 trying to use an userptr object right after a
range_start but before a range_end, the i915 will read the page table (GUP
is not serializing anything here) and will assume that whatever it got from
there is current while it might just be soon to be discarded/replaced pages.
Hence you can not garanty that pages you use are the same backing the user
space range. Note that the mmap_sem does not protect page migration or thing
like that. It only protect vma modifications.

As i said for the gpu only accessing those buffer in read mode is fine but
i am sure userspace will start relying on the gpu writting to those page
and being able to read back what the gpu wrote through the user space
mapping. This will work often but this can only work because you are lucky
and there is no single way to make it work reliably.

So instead of giving false hope, just steal the page from the vma and be
explicit about it.

Cheers,
Jérôme Glisse

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: User ptr horror show
  2014-06-30 19:04   ` Jerome Glisse
@ 2014-06-30 19:25     ` David Herrmann
  2014-06-30 20:13       ` Jerome Glisse
  0 siblings, 1 reply; 5+ messages in thread
From: David Herrmann @ 2014-06-30 19:25 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: Daniel Vetter, dri-devel@lists.freedesktop.org

Hi

On Mon, Jun 30, 2014 at 9:04 PM, Jerome Glisse <j.glisse@gmail.com> wrote:
> On Mon, Jun 30, 2014 at 08:47:31PM +0200, David Herrmann wrote:
>> Additionally to what AIO and Direct-IO do, intel userptr adds the
>> range_start callback to release pinned pages whenever the pages are
>> unmapped. However, anyone who truncates inode pages, schedules
>> writeback, etc., has to lock the page. Thus, any following GUP-fast
>> from userptr will fail and the slowpath will wait on mmap_sem. So I'd
>> really prefer if you could elaborate on your race?
>
> Some writback code path (and other cpu page table modificiation) will not
> call range_start but only invalidate_page. More over once the range_start
> is call a GUP that is done before range_end is call will return what ever
> it sees inside the cpu page table at the time which might be new pages or
> old pages.

range_start/end are usually called by unmap_*() functions, which
normally are called under page-lock. Therefore, _no_ new PTEs can be
established as long as the page is locked. This means, once range_end
is called, you're guaranteed any racing GUP-fast will fail and any
racing GUP will wait on the page-lock.
However, I wonder why i915 uses range_start instead of range_end. The
PTEs are still in place when range_start is called, therefore a racing
GUP-fast will still succeed.

Regarding "invalidate_page": This is called _after_ the PTE has been
removed (see rmap and try_unmap_page()), also with the page-locked.
Same scenario as range_end.
However, I also wonder why i915 doesn't have a callback for that? They
definitely need to, afaics.

Writeback code races with parallel GUP writes and keeps pages in
place. They rely on GUP-writers to call SetDirty() once they're done.
I've never liked that, but I don't see any code protecting writeback
against GUP writes, do you?

> Thus you can imagine i915 trying to use an userptr object right after a
> range_start but before a range_end, the i915 will read the page table (GUP
> is not serializing anything here) and will assume that whatever it got from
> there is current while it might just be soon to be discarded/replaced pages.
> Hence you can not garanty that pages you use are the same backing the user
> space range. Note that the mmap_sem does not protect page migration or thing
> like that. It only protect vma modifications.
>
> As i said for the gpu only accessing those buffer in read mode is fine but
> i am sure userspace will start relying on the gpu writting to those page
> and being able to read back what the gpu wrote through the user space
> mapping. This will work often but this can only work because you are lucky
> and there is no single way to make it work reliably.

i915 userptr is a 3.16 feature, right? So we can still revert it if
it's really broken.

However, I'd still like to see an example _call-trace_ with a *real
race*. I cannot see any writeback code that replaces pages despite
elevated page-refs. page-migration is quite strict about page-refs and
fails in those cases. truncate() is the only place I can see, but this
should be fixed by using range_end() instead of range_start(), right?

Thanks
David

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: User ptr horror show
  2014-06-30 19:25     ` David Herrmann
@ 2014-06-30 20:13       ` Jerome Glisse
  0 siblings, 0 replies; 5+ messages in thread
From: Jerome Glisse @ 2014-06-30 20:13 UTC (permalink / raw)
  To: David Herrmann; +Cc: Daniel Vetter, dri-devel@lists.freedesktop.org

On Mon, Jun 30, 2014 at 09:25:10PM +0200, David Herrmann wrote:
> Hi
> 
> On Mon, Jun 30, 2014 at 9:04 PM, Jerome Glisse <j.glisse@gmail.com> wrote:
> > On Mon, Jun 30, 2014 at 08:47:31PM +0200, David Herrmann wrote:
> >> Additionally to what AIO and Direct-IO do, intel userptr adds the
> >> range_start callback to release pinned pages whenever the pages are
> >> unmapped. However, anyone who truncates inode pages, schedules
> >> writeback, etc., has to lock the page. Thus, any following GUP-fast
> >> from userptr will fail and the slowpath will wait on mmap_sem. So I'd
> >> really prefer if you could elaborate on your race?
> >
> > Some writback code path (and other cpu page table modificiation) will not
> > call range_start but only invalidate_page. More over once the range_start
> > is call a GUP that is done before range_end is call will return what ever
> > it sees inside the cpu page table at the time which might be new pages or
> > old pages.
> 
> range_start/end are usually called by unmap_*() functions, which
> normally are called under page-lock. Therefore, _no_ new PTEs can be
> established as long as the page is locked. This means, once range_end
> is called, you're guaranteed any racing GUP-fast will fail and any
> racing GUP will wait on the page-lock.

Page-lock does not protect cpu page table update. It happens that some
code path take it but not all of them. Just grep for all the call site
to mmu_notifier_invalidate_range_start they are quite few of them.

None the less, the slow GUP does not care about page-lock so again the
race i describe does exist:

Time   Function being call
 1  - range_start
 2  - some i915 cs submission making use of user ptr and calling GUP
 3  - cpu page table update
 4  - range_end

What ever happen at 2 return pages that are replaced by others at 3.

> However, I wonder why i915 uses range_start instead of range_end. The
> PTEs are still in place when range_start is called, therefore a racing
> GUP-fast will still succeed.

Yes GUP-fast can succeed right after a range_start ends but before the
cpu page table is updated in anyway.

> Regarding "invalidate_page": This is called _after_ the PTE has been
> removed (see rmap and try_unmap_page()), also with the page-locked.
> Same scenario as range_end.

I am well aware of all that code. Yet nothing forbids the GPU to perform
write access to the page while there is a writeback meaning that the wb
semantic is badly violated by the GPU driver. The whole writeback code is
full of assumption about when and how you can set the dirty bit and when
its clear i am pretty sure that current code can race and violate that too
ie i915 set dirty bit right before core code clears it and you end up with
stall data inside the page  that might never be written back to disk because
kernel lost the fact that the page is now dirty.

> However, I also wonder why i915 doesn't have a callback for that? They
> definitely need to, afaics.

On invalidate_page you must abide by the new access protection for the
page if it becomes read only then secondary pte (in this case gpu page
table entry) must be set to write only too.

> Writeback code races with parallel GUP writes and keeps pages in
> place. They rely on GUP-writers to call SetDirty() once they're done.
> I've never liked that, but I don't see any code protecting writeback
> against GUP writes, do you?

GUP is a bad API, i am pretty sure all mm guys hate it, i think they
were too forgiving to its user. The fact that it does exist does not
means that we can rely on it. Especialy we should not assume things
about it that are not true and the comment above GUP clearly tell you
that whatever GUP returns might be a lie and you have been warn.

So once again the write case is not doable period. This would be a
violation of GUP and several other core kernel mm code. So unlike
trying to provide and API that lies and works only by luck stop
pretending it is sane and ok.

Again i am not opose to an API that would steal page from a vma.
Stealing the car is ok, riding in the trunk hopping the vma will
not change car is not.

> 
> > Thus you can imagine i915 trying to use an userptr object right after a
> > range_start but before a range_end, the i915 will read the page table (GUP
> > is not serializing anything here) and will assume that whatever it got from
> > there is current while it might just be soon to be discarded/replaced pages.
> > Hence you can not garanty that pages you use are the same backing the user
> > space range. Note that the mmap_sem does not protect page migration or thing
> > like that. It only protect vma modifications.
> >
> > As i said for the gpu only accessing those buffer in read mode is fine but
> > i am sure userspace will start relying on the gpu writting to those page
> > and being able to read back what the gpu wrote through the user space
> > mapping. This will work often but this can only work because you are lucky
> > and there is no single way to make it work reliably.
> 
> i915 userptr is a 3.16 feature, right? So we can still revert it if
> it's really broken.
> 

It is broken by all means revert.

> However, I'd still like to see an example _call-trace_ with a *real
> race*. I cannot see any writeback code that replaces pages despite
> elevated page-refs. page-migration is quite strict about page-refs and
> fails in those cases. truncate() is the only place I can see, but this
> should be fixed by using range_end() instead of range_start(), right?

__replace_page in kernel/event/uprobes.c is an example i was looking
at recently. I am sure they are others and there is nothing stoping
core kernel mm guys to add more and i expect a lot more will come soon
for valid core mm reasons (numa is king of the hill when it comes to
mm code and whatever gpu driver rely on especialy if it is through
GUP will be laughted at).

Cheers,
Jérôme Glisse

> 
> Thanks
> David

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2014-06-30 20:13 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-06-30 18:21 User ptr horror show Jerome Glisse
2014-06-30 18:47 ` David Herrmann
2014-06-30 19:04   ` Jerome Glisse
2014-06-30 19:25     ` David Herrmann
2014-06-30 20:13       ` Jerome Glisse

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.