All of lore.kernel.org
 help / color / mirror / Atom feed
* pv_ops & gntdev?
@ 2009-02-24 18:45 Gerd Hoffmann
  2009-02-24 21:18 ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 27+ messages in thread
From: Gerd Hoffmann @ 2009-02-24 18:45 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: Xen Development Mailing List

  Hi,

Short question: Do you have plans to port the gntdev userspace driver
(the one sitting in drivers/xen/gntdev/ in the 2.6.18 tree) to the
pv_ops kernel?

thanks,
  Gerd

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: pv_ops & gntdev?
  2009-02-24 18:45 pv_ops & gntdev? Gerd Hoffmann
@ 2009-02-24 21:18 ` Jeremy Fitzhardinge
  2009-02-24 22:12   ` Gerd Hoffmann
  0 siblings, 1 reply; 27+ messages in thread
From: Jeremy Fitzhardinge @ 2009-02-24 21:18 UTC (permalink / raw)
  To: Gerd Hoffmann; +Cc: Xen Development Mailing List

Gerd Hoffmann wrote:
> Short question: Do you have plans to port the gntdev userspace driver
> (the one sitting in drivers/xen/gntdev/ in the 2.6.18 tree) to the
> pv_ops kernel?
>   

I poked at it a little bit, but didn't get very far because I wasn't 
sure if anyone actually wants it.  But blktap2 essentially uses the same 
mechanism (or does it actually use gnttab?), and we'd like to get that 
working.

What do you want to do with it?   Do you want to bring it over?

    J

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: pv_ops & gntdev?
  2009-02-24 21:18 ` Jeremy Fitzhardinge
@ 2009-02-24 22:12   ` Gerd Hoffmann
  2009-02-25 10:05     ` Gerd Hoffmann
  0 siblings, 1 reply; 27+ messages in thread
From: Gerd Hoffmann @ 2009-02-24 22:12 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: Xen Development Mailing List

Jeremy Fitzhardinge wrote:
> Gerd Hoffmann wrote:
>> Short question: Do you have plans to port the gntdev userspace driver
>> (the one sitting in drivers/xen/gntdev/ in the 2.6.18 tree) to the
>> pv_ops kernel?
> 
> I poked at it a little bit, but didn't get very far because I wasn't
> sure if anyone actually wants it.

As far I know the plan is to get rid of any page references in xen and
replace those by grants (i.e. xen text console for example).  To get
that done gntdev is required.

> But blktap2 essentially uses the same
> mechanism (or does it actually use gnttab?), and we'd like to get that
> working.

Not sure about blktap2.  blktap1 and gntdev are very simliar though when
looked at from a memory management point of view.  Both map grants into
userspace memory.  blktap is tied to the block protocol, gntdev is
generic.  Quite possible that blktap2 simply uses gntdev.

> What do you want to do with it?

Test my userspace backend drivers.

> Do you want to bring it over?

Well.  I hoped someone did that already ...

cheers,
  Gerd

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: pv_ops & gntdev?
  2009-02-24 22:12   ` Gerd Hoffmann
@ 2009-02-25 10:05     ` Gerd Hoffmann
  2009-02-25 17:51       ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 27+ messages in thread
From: Gerd Hoffmann @ 2009-02-25 10:05 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: Xen Development Mailing List

Gerd Hoffmann wrote:
>> Do you want to bring it over?
> 
> Well.  I hoped someone did that already ...

Ideas how to handle the grant unmap case?
I suspect the vm_ops->zap_pte() approach isn't going to fly for mainline
merge ...

cheers,
  Gerd

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: pv_ops & gntdev?
  2009-02-25 10:05     ` Gerd Hoffmann
@ 2009-02-25 17:51       ` Jeremy Fitzhardinge
  2009-02-25 18:37         ` Gerd Hoffmann
  2009-02-25 18:56         ` Gerd Hoffmann
  0 siblings, 2 replies; 27+ messages in thread
From: Jeremy Fitzhardinge @ 2009-02-25 17:51 UTC (permalink / raw)
  To: Gerd Hoffmann; +Cc: Xen Development Mailing List

Gerd Hoffmann wrote:
> Gerd Hoffmann wrote:
>   
>>> Do you want to bring it over?
>>>       
>> Well.  I hoped someone did that already ...
>>     
>
> Ideas how to handle the grant unmap case?
> I suspect the vm_ops->zap_pte() approach isn't going to fly for mainline
> merge ...

Yes, my plan is:

   1. when installing a grant in a pte page, allocate a shadow page and
      hang it off page->private; also set a page flag on the pgd page
      flags (steal something spare)
   2. store the grant handle in the slot corresponding to the pte
   3. when clearing the pte, check to see if there's a grant handle and
      zap it
   4. when unpinning the pagetable, check to see if there's the "grant
      mapping" flag in the pgd, and do a grant-zap pass before unpinning

    J

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: pv_ops & gntdev?
  2009-02-25 17:51       ` Jeremy Fitzhardinge
@ 2009-02-25 18:37         ` Gerd Hoffmann
  2009-02-25 20:06           ` Jeremy Fitzhardinge
  2009-02-25 18:56         ` Gerd Hoffmann
  1 sibling, 1 reply; 27+ messages in thread
From: Gerd Hoffmann @ 2009-02-25 18:37 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: Xen Development Mailing List

Jeremy Fitzhardinge wrote:
> Gerd Hoffmann wrote:
>> Gerd Hoffmann wrote:
>>  
>>>> Do you want to bring it over?
>>>>       
>>> Well.  I hoped someone did that already ...
>>>     
>>
>> Ideas how to handle the grant unmap case?
>> I suspect the vm_ops->zap_pte() approach isn't going to fly for mainline
>> merge ...
> 
> Yes, my plan is:
> 
>   1. when installing a grant in a pte page, allocate a shadow page and
>      hang it off page->private; also set a page flag on the pgd page
>      flags (steal something spare)

pgd or pte page flags?  or both?

>   2. store the grant handle in the slot corresponding to the pte

ok.

>   3. when clearing the pte, check to see if there's a grant handle and
>      zap it

That needs a new paravirt hook I think ...

>   4. when unpinning the pagetable, check to see if there's the "grant
>      mapping" flag in the pgd, and do a grant-zap pass before unpinning

ok.

Prelimary patches by chance?

cheers,
  Gerd

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: pv_ops & gntdev?
  2009-02-25 17:51       ` Jeremy Fitzhardinge
  2009-02-25 18:37         ` Gerd Hoffmann
@ 2009-02-25 18:56         ` Gerd Hoffmann
  2009-02-25 20:10           ` Jeremy Fitzhardinge
  1 sibling, 1 reply; 27+ messages in thread
From: Gerd Hoffmann @ 2009-02-25 18:56 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: Xen Development Mailing List

Jeremy Fitzhardinge wrote:
>> Ideas how to handle the grant unmap case?
>> I suspect the vm_ops->zap_pte() approach isn't going to fly for mainline
>> merge ...
> 
> Yes, my plan is:
> 
>   1. when installing a grant in a pte page, allocate a shadow page and
>      hang it off page->private; also set a page flag on the pgd page
>      flags (steal something spare)
>   2. store the grant handle in the slot corresponding to the pt
>   3. when clearing the pte, check to see if there's a grant handle and
>      zap it
>   4. when unpinning the pagetable, check to see if there's the "grant
>      mapping" flag in the pgd, and do a grant-zap pass before unpinning

Also: 2.6.18 gntdev does some funky interaction with the balloon driver.
 What is the point in doing so?

thanks,
  Gerd

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: pv_ops & gntdev?
  2009-02-25 18:37         ` Gerd Hoffmann
@ 2009-02-25 20:06           ` Jeremy Fitzhardinge
  2009-02-25 21:19             ` Gerd Hoffmann
  0 siblings, 1 reply; 27+ messages in thread
From: Jeremy Fitzhardinge @ 2009-02-25 20:06 UTC (permalink / raw)
  To: Gerd Hoffmann; +Cc: Xen Development Mailing List

Gerd Hoffmann wrote:
>> Yes, my plan is:
>>
>>   1. when installing a grant in a pte page, allocate a shadow page and
>>      hang it off page->private; also set a page flag on the pgd page
>>      flags (steal something spare)
>>     
>
> pgd or pte page flags?  or both?
>   

Just pgd, so that the unpinner can see it needs to make a second pass.  
At the pte level, looking for a non-NULL page->private should be enough.

>>   3. when clearing the pte, check to see if there's a grant handle and
>>      zap it
>>     
>
> That needs a new paravirt hook I think ...
>   

Hm, I hope not.  Doesn't it always end up using set_pte*?

>>   4. when unpinning the pagetable, check to see if there's the "grant
>>      mapping" flag in the pgd, and do a grant-zap pass before unpinning
>>     
>
> ok.
>
> Prelimary patches by chance?
>   

No, I haven't got around to implementing this at all.

Thanks,
    J

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: pv_ops & gntdev?
  2009-02-25 18:56         ` Gerd Hoffmann
@ 2009-02-25 20:10           ` Jeremy Fitzhardinge
  2009-02-25 21:45             ` Gerd Hoffmann
  2009-02-25 22:16             ` Gerd Hoffmann
  0 siblings, 2 replies; 27+ messages in thread
From: Jeremy Fitzhardinge @ 2009-02-25 20:10 UTC (permalink / raw)
  To: Gerd Hoffmann; +Cc: Xen Development Mailing List

Gerd Hoffmann wrote:
> Also: 2.6.18 gntdev does some funky interaction with the balloon driver.
>  What is the point in doing so?
>   

Oh, the alloc_empty_pages_and_pagevec() stuff?  It allocates some pages, 
just for their pfn slots, and then returns the underlying memory back to 
Xen.  It uses the pfns to slot granted pages into.

I think.

    J

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: pv_ops & gntdev?
  2009-02-25 20:06           ` Jeremy Fitzhardinge
@ 2009-02-25 21:19             ` Gerd Hoffmann
  2009-02-25 21:35               ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 27+ messages in thread
From: Gerd Hoffmann @ 2009-02-25 21:19 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: Xen Development Mailing List

Jeremy Fitzhardinge wrote:
> Hm, I hope not.  Doesn't it always end up using set_pte*?

ptep_get_and_clear_full();

See zap_pte_range() in mm/memory.c

cheers,
  Gerd

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: pv_ops & gntdev?
  2009-02-25 21:19             ` Gerd Hoffmann
@ 2009-02-25 21:35               ` Jeremy Fitzhardinge
  2009-02-25 21:43                 ` Gerd Hoffmann
  0 siblings, 1 reply; 27+ messages in thread
From: Jeremy Fitzhardinge @ 2009-02-25 21:35 UTC (permalink / raw)
  To: Gerd Hoffmann; +Cc: Xen Development Mailing List

Gerd Hoffmann wrote:
> Jeremy Fitzhardinge wrote:
>   
>> Hm, I hope not.  Doesn't it always end up using set_pte*?
>>     
>
> ptep_get_and_clear_full();
>
> See zap_pte_range() in mm/memory.c

Ah, right.  It could be changed to use ptep_modify_prot_start/commit.  
That would make it batchable too.  Slightly less good for native.

    J

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: pv_ops & gntdev?
  2009-02-25 21:35               ` Jeremy Fitzhardinge
@ 2009-02-25 21:43                 ` Gerd Hoffmann
  2009-02-25 21:48                   ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 27+ messages in thread
From: Gerd Hoffmann @ 2009-02-25 21:43 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: Xen Development Mailing List

Jeremy Fitzhardinge wrote:
> Gerd Hoffmann wrote:
>> Jeremy Fitzhardinge wrote:
>>  
>>> Hm, I hope not.  Doesn't it always end up using set_pte*?
>>>     
>>
>> ptep_get_and_clear_full();
>>
>> See zap_pte_range() in mm/memory.c
> 
> Ah, right.  It could be changed to use ptep_modify_prot_start/commit. 
> That would make it batchable too.  Slightly less good for native.

Hmm.  One more Q:  How is pinning used?  Is a process pinned all the
time?  Or can it happen that the pages are unpinned in case it does hang
around idle for a while?  Is there some way to prevent a process from
being unpinned?

I somehow feel like checking out mmu notifiers first ...

thanks,
  Gerd

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: pv_ops & gntdev?
  2009-02-25 20:10           ` Jeremy Fitzhardinge
@ 2009-02-25 21:45             ` Gerd Hoffmann
  2009-02-25 22:16             ` Gerd Hoffmann
  1 sibling, 0 replies; 27+ messages in thread
From: Gerd Hoffmann @ 2009-02-25 21:45 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: Xen Development Mailing List

Jeremy Fitzhardinge wrote:
> Gerd Hoffmann wrote:
>> Also: 2.6.18 gntdev does some funky interaction with the balloon driver.
>>  What is the point in doing so?
>>   
> 
> Oh, the alloc_empty_pages_and_pagevec() stuff?  It allocates some pages,
> just for their pfn slots, and then returns the underlying memory back to
> Xen.  It uses the pfns to slot granted pages into.

Ok, and that way you'll also get a struct page for the grant mappings,
right?  2.6.18 gntdev does map grants to both kernel and userspace.  Any
 idea why?  Just userspace should be possible too, right?

thanks,
  Gerd

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: pv_ops & gntdev?
  2009-02-25 21:43                 ` Gerd Hoffmann
@ 2009-02-25 21:48                   ` Jeremy Fitzhardinge
  2009-02-25 22:11                     ` Gerd Hoffmann
  0 siblings, 1 reply; 27+ messages in thread
From: Jeremy Fitzhardinge @ 2009-02-25 21:48 UTC (permalink / raw)
  To: Gerd Hoffmann; +Cc: Xen Development Mailing List

Gerd Hoffmann wrote:
> Hmm.  One more Q:  How is pinning used?  Is a process pinned all the
> time?  Or can it happen that the pages are unpinned in case it does hang
> around idle for a while?  Is there some way to prevent a process from
> being unpinned?
>   

Pinning in the sense of Xen pagetables; a pagetable is pinned for the 
whole time the process exists - it gets pinned at fork/exec time, and 
unpin on exit.  (It has no effect on the residency of the usermode 
pages, though granted pages mapped into usermode would not be swappable 
at all.)

> I somehow feel like checking out mmu notifiers first ...
>   
Why's that?  What would you use them for?

    J

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: pv_ops & gntdev?
  2009-02-25 21:48                   ` Jeremy Fitzhardinge
@ 2009-02-25 22:11                     ` Gerd Hoffmann
  2009-02-25 22:28                       ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 27+ messages in thread
From: Gerd Hoffmann @ 2009-02-25 22:11 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: Xen Development Mailing List

Jeremy Fitzhardinge wrote:
> Gerd Hoffmann wrote:
>> Hmm.  One more Q:  How is pinning used?  Is a process pinned all the
>> time?  Or can it happen that the pages are unpinned in case it does hang
>> around idle for a while?  Is there some way to prevent a process from
>> being unpinned?
>>   
> 
> Pinning in the sense of Xen pagetables; a pagetable is pinned for the
> whole time the process exists - it gets pinned at fork/exec time, and
> unpin on exit.  (It has no effect on the residency of the usermode
> pages, though granted pages mapped into usermode would not be swappable
> at all.)
> 
>> I somehow feel like checking out mmu notifiers first ...
>>   
> Why's that?  What would you use them for?

Unmapping the grants when the kernel is about to zap the page range.
I think it could be easier because I don't have to link the low-level
paravirt code with the gntdev driver then.  Also I can easily batch
unmaps.  And there is a performance hit for the extra checks only for
processes which actually use grants.

The (unclean) exit case might be tricky though.  Could be the kernel
tries to unpin before zapping all mappings, so we don't trap into xen
all the time.

cheers,
  Gerd

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: pv_ops & gntdev?
  2009-02-25 20:10           ` Jeremy Fitzhardinge
  2009-02-25 21:45             ` Gerd Hoffmann
@ 2009-02-25 22:16             ` Gerd Hoffmann
  2009-02-25 22:30               ` Jeremy Fitzhardinge
  1 sibling, 1 reply; 27+ messages in thread
From: Gerd Hoffmann @ 2009-02-25 22:16 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: Xen Development Mailing List

Jeremy Fitzhardinge wrote:
> Gerd Hoffmann wrote:
>> Also: 2.6.18 gntdev does some funky interaction with the balloon driver.
>>  What is the point in doing so?
>>   
> 
> Oh, the alloc_empty_pages_and_pagevec() stuff?  It allocates some pages,
> just for their pfn slots, and then returns the underlying memory back to
> Xen.  It uses the pfns to slot granted pages into.

Yep, gives pfn slot for the grant map.  And a struct page too.

/me wonderes whenever that is required or whenever I could just map the
grants without asking the balloon driver for some slots ...

cheers,
  Gerd

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: pv_ops & gntdev?
  2009-02-25 22:11                     ` Gerd Hoffmann
@ 2009-02-25 22:28                       ` Jeremy Fitzhardinge
  2009-02-25 22:42                         ` Gerd Hoffmann
  0 siblings, 1 reply; 27+ messages in thread
From: Jeremy Fitzhardinge @ 2009-02-25 22:28 UTC (permalink / raw)
  To: Gerd Hoffmann; +Cc: Xen Development Mailing List

Gerd Hoffmann wrote:
> Unmapping the grants when the kernel is about to zap the page range.
> I think it could be easier because I don't have to link the low-level
> paravirt code with the gntdev driver then.  Also I can easily batch
> unmaps.  And there is a performance hit for the extra checks only for
> processes which actually use grants.
>   

Would the unmap operation on the vma be enough to do that?   Hm, I guess 
vm_operations_struct needs something that gets called before the zap 
rather than after (->close).

> The (unclean) exit case might be tricky though.  Could be the kernel
> tries to unpin before zapping all mappings, so we don't trap into xen
> all the time.
>   

Yes, the unpin needs to make sure it clears out all the grant refs, 
because you just can't unpin otherwise.

The alternative is to hook into the fault handler so that a 
fault-n-emulate operation on the pagetable which fails because of the 
grant reference can be fixed up at that point.  But that might be even 
more intrusive.

    J

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: pv_ops & gntdev?
  2009-02-25 22:16             ` Gerd Hoffmann
@ 2009-02-25 22:30               ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 27+ messages in thread
From: Jeremy Fitzhardinge @ 2009-02-25 22:30 UTC (permalink / raw)
  To: Gerd Hoffmann; +Cc: Xen Development Mailing List

Gerd Hoffmann wrote:
> Jeremy Fitzhardinge wrote:
>   
>> Gerd Hoffmann wrote:
>>     
>>> Also: 2.6.18 gntdev does some funky interaction with the balloon driver.
>>>  What is the point in doing so?
>>>   
>>>       
>> Oh, the alloc_empty_pages_and_pagevec() stuff?  It allocates some pages,
>> just for their pfn slots, and then returns the underlying memory back to
>> Xen.  It uses the pfns to slot granted pages into.
>>     
>
> Yep, gives pfn slot for the grant map.  And a struct page too.
>
> /me wonderes whenever that is required or whenever I could just map the
> grants without asking the balloon driver for some slots ...
>   

I think they could be special ptes now, and not need any struct page.

    J

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: pv_ops & gntdev?
  2009-02-25 22:28                       ` Jeremy Fitzhardinge
@ 2009-02-25 22:42                         ` Gerd Hoffmann
  2009-02-25 22:59                           ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 27+ messages in thread
From: Gerd Hoffmann @ 2009-02-25 22:42 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: Xen Development Mailing List

Jeremy Fitzhardinge wrote:
> Gerd Hoffmann wrote:
>> Unmapping the grants when the kernel is about to zap the page range.
>> I think it could be easier because I don't have to link the low-level
>> paravirt code with the gntdev driver then.  Also I can easily batch
>> unmaps.  And there is a performance hit for the extra checks only for
>> processes which actually use grants.
>>   
> 
> Would the unmap operation on the vma be enough to do that?   Hm, I guess
> vm_operations_struct needs something that gets called before the zap
> rather than after (->close).

mmu motifier will fit the bill and additionally work fine on partial unmaps.

>> The (unclean) exit case might be tricky though.  Could be the kernel
>> tries to unpin before zapping all mappings, so we don't trap into xen
>> all the time.
> 
> Yes, the unpin needs to make sure it clears out all the grant refs,
> because you just can't unpin otherwise.
> 
> The alternative is to hook into the fault handler so that a
> fault-n-emulate operation on the pagetable which fails because of the
> grant reference can be fixed up at that point.  But that might be even
> more intrusive.

Or just allow to flag "skip the unpin please for this mm".  It's an
optimization after all, it isn't required, right?

cheers,
  Gerd

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: pv_ops & gntdev?
  2009-02-25 22:42                         ` Gerd Hoffmann
@ 2009-02-25 22:59                           ` Jeremy Fitzhardinge
  2009-03-02 22:53                             ` Gerd Hoffmann
  0 siblings, 1 reply; 27+ messages in thread
From: Jeremy Fitzhardinge @ 2009-02-25 22:59 UTC (permalink / raw)
  To: Gerd Hoffmann; +Cc: Xen Development Mailing List

Gerd Hoffmann wrote:
>> Would the unmap operation on the vma be enough to do that?   Hm, I guess
>> vm_operations_struct needs something that gets called before the zap
>> rather than after (->close).
>>     
>
> mmu motifier will fit the bill and additionally work fine on partial unmaps.
>   

OK, it'll be interesting to see how that turns out.

>>> The (unclean) exit case might be tricky though.  Could be the kernel
>>> tries to unpin before zapping all mappings, so we don't trap into xen
>>> all the time.
>>>       
>> Yes, the unpin needs to make sure it clears out all the grant refs,
>> because you just can't unpin otherwise.
>>
>> The alternative is to hook into the fault handler so that a
>> fault-n-emulate operation on the pagetable which fails because of the
>> grant reference can be fixed up at that point.  But that might be even
>> more intrusive.
>>     
>
> Or just allow to flag "skip the unpin please for this mm".  It's an
> optimization after all, it isn't required, right?
>   

If its pinned you have to unpin; you can't free pagetable pages while 
they're pinned.  If you don't pin the pagetable then context switches 
get very slow (it would need to revalidate the whole pagetable every 
context switch).

We could, however, defer the unpin and unpin late, after all the vmas 
have been unmapped.  But that's a bit ugly (it would require a separate 
hook in exit_mmap).

    J

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: pv_ops & gntdev?
  2009-02-25 22:59                           ` Jeremy Fitzhardinge
@ 2009-03-02 22:53                             ` Gerd Hoffmann
  2009-03-03  1:05                               ` Jeremy Fitzhardinge
  2009-03-03 22:59                               ` Jeremy Fitzhardinge
  0 siblings, 2 replies; 27+ messages in thread
From: Gerd Hoffmann @ 2009-03-02 22:53 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: Xen Development Mailing List

[-- Attachment #1: Type: text/plain, Size: 1133 bytes --]

Jeremy Fitzhardinge wrote:
> Gerd Hoffmann wrote:
>> mmu motifier will fit the bill and additionally work fine on partial
>> unmaps.
> 
> OK, it'll be interesting to see how that turns out.

Prelimary version attached.

Features:
 * compiles ;)
 * survived light testing.
 * mapping via xc_gnttab_map_grant_ref*s* (i.e. multiple grants at once)
   works stable.  Last time I tried that with 2.6.18 I quickly killed
   the host kernel.
 * much smaller than the 2.6.18 version.
 * maps grants directly, doesn't burn pfns and kernel address space.

Issues:
 (1) Grant teardown on unclean exit doesn't work yet.  Doesn't crash,
     but I think I leak the pages because they are not unmapped via
     grant api.
 (2) Also on unclean exit my release callbacks are not called for some
     silly reason, to be investigated.  Memory leak.  Probably related
     to (1).  Maybe I did something wrong with the VM flags ...
 (3) There are probably some nasty corner cases not handled correctly
     yet, the whole thing needs a careful review once the two issues
     listed above are fixed.

Reviews, comments and hints are welcome,

  Gerd


[-- Attachment #2: gntdev.diff --]
[-- Type: text/plain, Size: 23331 bytes --]

diff --git a/drivers/xen/Kconfig b/drivers/xen/Kconfig
index 0138113..555e4b6 100644
--- a/drivers/xen/Kconfig
+++ b/drivers/xen/Kconfig
@@ -74,9 +74,17 @@ config XEN_COMPAT_XENFS
          a xen platform.
          If in doubt, say yes.
 
+config XEN_GNTDEV
+	bool "userspace grant access device driver"
+	depends on XEN
+	default XEN_DOM0
+	select MMU_NOTIFIER
+	help
+	  Allows userspace processes use grants.
+
 config XEN_XENBUS_FRONTEND
        tristate
 
 config XEN_S3
        def_bool y
-       depends on XEN_DOM0 && ACPI
\ No newline at end of file
+       depends on XEN_DOM0 && ACPI
diff --git a/drivers/xen/Makefile b/drivers/xen/Makefile
index 4b01fc8..c2180d9 100644
--- a/drivers/xen/Makefile
+++ b/drivers/xen/Makefile
@@ -5,8 +5,9 @@ obj-$(CONFIG_HOTPLUG_CPU)		+= cpu_hotplug.o
 obj-$(CONFIG_XEN_XENCOMM)		+= xencomm.o
 obj-$(CONFIG_XEN_BALLOON)		+= balloon.o
 obj-$(CONFIG_XEN_DEV_EVTCHN)	+= evtchn.o
+obj-$(CONFIG_XEN_GNTDEV)	+= gntdev.o
 obj-$(CONFIG_XEN_BLKDEV_BACKEND)	+= blkback/
 obj-$(CONFIG_XEN_NETDEV_BACKEND)	+= netback/
 obj-$(CONFIG_XENFS)			+= xenfs/
 
-obj-$(CONFIG_XEN_S3)			+= acpi.o
\ No newline at end of file
+obj-$(CONFIG_XEN_S3)			+= acpi.o
diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
new file mode 100644
index 0000000..3dd0e5f
--- /dev/null
+++ b/drivers/xen/gntdev.c
@@ -0,0 +1,594 @@
+/******************************************************************************
+ * gntdev.c
+ *
+ * Device for accessing (in user-space) pages that have been granted by other
+ * domains.
+ *
+ * Copyright (c) 2006-2007, D G Murray.
+ *           (c) 2009 Gerd Hoffmann <kraxel@redhat.com>
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/miscdevice.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include <linux/mmu_notifier.h>
+#include <linux/types.h>
+#include <linux/uaccess.h>
+#include <linux/sched.h>
+#include <linux/rwsem.h>
+
+#include <xen/grant_table.h>
+#include <xen/gntdev.h>
+#include <asm/xen/hypervisor.h>
+#include <asm/xen/hypercall.h>
+#include <asm/xen/page.h>
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Derek G. Murray <Derek.Murray@cl.cam.ac.uk>, "
+              "Gerd Hoffmann <kraxel@redhat.com>");
+MODULE_DESCRIPTION("User-space granted page access driver");
+
+static int debug = 1;
+module_param(debug, int, 0644);
+static int limit = 1024;
+module_param(limit, int, 0644);
+
+struct gntdev_priv {
+        struct list_head maps;
+        uint32_t used;
+	uint32_t limit;
+	struct rw_semaphore sem;
+        struct mm_struct *mm;
+        struct mmu_notifier mn;
+};
+
+struct grant_info {
+        domid_t domid;
+        grant_ref_t ref;
+        grant_handle_t handle;
+};
+
+struct grant_map {
+        struct list_head next;
+        struct gntdev_priv *priv;
+        struct vm_area_struct *vma;
+        int index;
+        int count;
+	int flags;
+	int is_mapped;
+	struct ioctl_gntdev_grant_ref *grants;
+	struct gnttab_map_grant_ref   *map_ops;
+	struct gnttab_unmap_grant_ref *unmap_ops;
+};
+
+/* ------------------------------------------------------------------ */
+
+static void gntdev_print_maps(struct gntdev_priv *priv,
+                              char *text, int text_index)
+{
+        struct grant_map *map;
+
+        printk("%s: maps list (priv %p, usage %d/%d)\n",
+               __FUNCTION__, priv, priv->used, priv->limit);
+        list_for_each_entry(map, &priv->maps, next)
+                printk("  index %2d, count %2d %s\n",
+                       map->index, map->count,
+                       map->index == text_index && text ? text : "");
+}
+
+static struct grant_map *gntdev_add_map(struct gntdev_priv *priv, int count)
+{
+        struct grant_map *map, *add;
+
+        add = kzalloc(sizeof(struct grant_map), GFP_KERNEL);
+        if (NULL == add)
+                return NULL;
+
+	add->grants    = kzalloc(sizeof(add->grants[0])    * count, GFP_KERNEL);
+	add->map_ops   = kzalloc(sizeof(add->map_ops[0])   * count, GFP_KERNEL);
+	add->unmap_ops = kzalloc(sizeof(add->unmap_ops[0]) * count, GFP_KERNEL);
+	if (NULL == add->grants  ||
+	    NULL == add->map_ops ||
+	    NULL == add->unmap_ops) {
+		kfree(add->grants);
+		kfree(add->map_ops);
+		kfree(add->unmap_ops);
+		kfree(add);
+                return NULL;
+	}
+
+        add->index = 0;
+        add->count = count;
+        add->priv  = priv;
+
+        down_write(&priv->sem);
+        if (add->count + priv->used > priv->limit) {
+		up_write(&priv->sem);
+                kfree(add);
+                return NULL;
+        }
+
+        list_for_each_entry(map, &priv->maps, next) {
+                if (add->index + add->count < map->index) {
+                        list_add_tail(&add->next, &map->next);
+                        goto done;
+                }
+                add->index = map->index + map->count;
+        }
+        list_add_tail(&add->next, &priv->maps);
+
+done:
+        priv->used += add->count;
+        if (debug > 1)
+                gntdev_print_maps(priv, "[new]", add->index);
+	up_write(&priv->sem);
+        return add;
+}
+
+static struct grant_map *gntdev_find_map_index(struct gntdev_priv *priv, int index,
+					       int count)
+{
+        struct grant_map *map;
+
+        list_for_each_entry(map, &priv->maps, next) {
+                if (map->index != index)
+                        continue;
+                if (map->count != count)
+                        continue;
+                return map;
+        }
+        return NULL;
+}
+
+static struct grant_map *gntdev_find_map_vaddr(struct gntdev_priv *priv,
+					       unsigned long vaddr)
+{
+        struct grant_map *map;
+
+        list_for_each_entry(map, &priv->maps, next) {
+		if (!map->vma)
+			continue;
+		if (vaddr < map->vma->vm_start)
+			continue;
+		if (vaddr >= map->vma->vm_end)
+			continue;
+                return map;
+        }
+        return NULL;
+}
+
+static int gntdev_del_map(struct gntdev_priv *priv, int index,
+                          int count)
+{
+        struct grant_map *map;
+        int i, err = -EINVAL;
+
+        down_write(&priv->sem);
+        map = gntdev_find_map_index(priv, index, count);
+        if (NULL == map)
+                goto out;
+
+	err = -EBUSY;
+	if (map->vma)
+		goto out;
+        for (i = 0; i < map->count; i++)
+                if (map->unmap_ops[i].handle)
+                        goto out;
+
+        list_del(&map->next);
+        kfree(map->grants);
+        kfree(map->map_ops);
+        kfree(map->unmap_ops);
+        kfree(map);
+        priv->used -= map->count;
+        err = 0;
+
+out:
+	up_write(&priv->sem);
+        return err;
+}
+
+/* ------------------------------------------------------------------ */
+
+static int find_grant_ptes(pte_t *pte, pgtable_t token, unsigned long addr, void *data)
+{
+        struct grant_map *map = data;
+        unsigned int pgnr = (addr - map->vma->vm_start) >> PAGE_SHIFT;
+	u64 mpte;
+
+	BUG_ON(pgnr >= map->count);
+	mpte  = (u64)pfn_to_mfn(page_to_pfn(token)) << PAGE_SHIFT;
+	mpte |= (unsigned long)pte & ~PAGE_MASK;
+        gnttab_set_map_op(&map->map_ops[pgnr], mpte, map->flags,
+                          map->grants[pgnr].ref,
+                          map->grants[pgnr].domid);
+        gnttab_set_unmap_op(&map->unmap_ops[pgnr], mpte, map->flags,
+			    0 /* handle */);
+        return 0;
+}
+
+static int map_grant_pages(struct grant_map *map)
+{
+	int i, err = 0;
+
+	if (debug)
+		printk("%s: map %d+%d\n", __FUNCTION__, map->index, map->count);
+        BUG_ON(HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref,
+					 map->map_ops, map->count));
+	for (i = 0; i < map->count; i++) {
+		if (map->map_ops[i].status)
+			err = -EINVAL;
+		map->unmap_ops[i].handle = map->map_ops[i].handle;
+	}
+	return err;
+}
+
+static int unmap_grant_pages(struct grant_map *map, int offset, int pages)
+{
+	int i, err = 0;
+
+	if (debug)
+		printk("%s: map %d+%d [%d+%d]\n", __FUNCTION__,
+		       map->index, map->count, offset, pages);
+	BUG_ON(HYPERVISOR_grant_table_op(GNTTABOP_unmap_grant_ref,
+					 map->unmap_ops + offset, pages));
+	for (i = 0; i < pages; i++) {
+		if (map->unmap_ops[offset+i].status)
+			err = -EINVAL;
+		map->unmap_ops[offset+i].handle = 0;
+	}
+	return err;
+}
+
+/* ------------------------------------------------------------------ */
+
+static void gntdev_vma_close(struct vm_area_struct *vma)
+{
+        struct grant_map *map = vma->vm_private_data;
+
+        if (debug > 1)
+                printk("%s\n", __FUNCTION__);
+	map->is_mapped = 0;
+	map->vma = NULL;
+	vma->vm_private_data = NULL;
+}
+
+static int gntdev_vma_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+        if (debug)
+                printk("%s: vaddr %p, pgoff %ld (shouldn't happen)\n",
+		       __FUNCTION__, vmf->virtual_address, vmf->pgoff);
+	vmf->flags = VM_FAULT_ERROR;
+        return 0;
+}
+
+static struct vm_operations_struct gntdev_vmops = {
+	.close = gntdev_vma_close,
+	.fault = gntdev_vma_fault,
+};
+
+/* ------------------------------------------------------------------ */
+
+static void mn_invl_range_start(struct mmu_notifier *mn,
+				struct mm_struct *mm,
+				unsigned long start, unsigned long end)
+{
+        struct gntdev_priv *priv = container_of(mn, struct gntdev_priv, mn);
+        struct grant_map *map;
+	unsigned long mstart, mend;
+	int err;
+
+	down_read(&priv->sem);
+        list_for_each_entry(map, &priv->maps, next) {
+                if (!map->vma)
+                        continue;
+                if (!map->is_mapped)
+                        continue;
+                if (map->vma->vm_start >= end)
+                        continue;
+                if (map->vma->vm_end <= start)
+                        continue;
+		mstart = max(start, map->vma->vm_start);
+		mend   = min(end,   map->vma->vm_end);
+                if (debug > 1)
+                        printk("%s: map %d+%d (%lx %lx), range %lx %lx, mrange %lx %lx\n",
+                               __FUNCTION__, map->index, map->count,
+                               map->vma->vm_start, map->vma->vm_end,
+			       start, end, mstart, mend);
+		err = unmap_grant_pages(map,
+					(mstart - map->vma->vm_start) >> PAGE_SHIFT,
+					(mend - mstart) >> PAGE_SHIFT);
+		BUG_ON(err);
+        }
+	up_read(&priv->sem);
+}
+
+static void mn_invl_page(struct mmu_notifier *mn,
+			 struct mm_struct *mm,
+			 unsigned long address)
+{
+	mn_invl_range_start(mn, mm, address, address + PAGE_SIZE);
+}
+
+static void mn_release(struct mmu_notifier *mn,
+		       struct mm_struct *mm)
+{
+        struct gntdev_priv *priv = container_of(mn, struct gntdev_priv, mn);
+        struct grant_map *map;
+	int err;
+
+	down_read(&priv->sem);
+        list_for_each_entry(map, &priv->maps, next) {
+                if (!map->vma)
+                        continue;
+                if (debug > 1)
+                        printk("%s: map %d+%d (%lx %lx)\n",
+                               __FUNCTION__, map->index, map->count,
+                               map->vma->vm_start, map->vma->vm_end);
+		err = unmap_grant_pages(map, 0, map->count);
+		BUG_ON(err);
+        }
+	up_read(&priv->sem);
+}
+
+struct mmu_notifier_ops gntdev_mmu_ops = {
+        .release                = mn_release,
+        .invalidate_page        = mn_invl_page,
+        .invalidate_range_start = mn_invl_range_start,
+};
+
+/* ------------------------------------------------------------------ */
+
+static int gntdev_open(struct inode *inode, struct file *flip)
+{
+        struct gntdev_priv *priv;
+
+	priv = kzalloc(sizeof(*priv), GFP_KERNEL);
+	if (!priv)
+                return -ENOMEM;
+
+        INIT_LIST_HEAD(&priv->maps);
+	init_rwsem(&priv->sem);
+        priv->limit = limit;
+        priv->mm = get_task_mm(current);
+        priv->mn.ops = &gntdev_mmu_ops;
+        mmu_notifier_register(&priv->mn, priv->mm);
+	flip->private_data = priv;
+
+        if (debug)
+                printk("%s: priv %p\n", __FUNCTION__, priv);
+
+	return 0;
+}
+
+static int gntdev_release(struct inode *inode, struct file *flip)
+{
+        struct gntdev_priv *priv = flip->private_data;
+
+        if (debug)
+                printk("%s: priv %p\n", __FUNCTION__, priv);
+
+        BUG_ON(!list_empty(&priv->maps));
+        mmu_notifier_unregister(&priv->mn, priv->mm);
+        mm_release(current, priv->mm);
+        kfree(priv);
+	return 0;
+}
+
+static long gntdev_ioctl(struct file *flip,
+			 unsigned int cmd, unsigned long arg)
+{
+        struct gntdev_priv *priv = flip->private_data;
+        void __user *ptr = (void __user *)arg;
+
+	switch (cmd) {
+	case IOCTL_GNTDEV_MAP_GRANT_REF:
+	{
+		struct ioctl_gntdev_map_grant_ref op;
+		struct ioctl_gntdev_map_grant_ref __user *u = ptr;
+		struct grant_map *map;
+
+		if (copy_from_user(&op, u, sizeof(op)) != 0)
+                        return -EFAULT;
+		if (debug > 1)
+			printk("%s: priv %p, add %d\n", __FUNCTION__, priv,
+			       op.count);
+		if (unlikely(op.count <= 0))
+                        return -EINVAL;
+		if (unlikely(op.count > priv->limit))
+                        return -EINVAL;
+
+		map = gntdev_add_map(priv, op.count);
+                if (!map)
+                        return -ENOMEM;
+		if (copy_from_user(map->grants, &u->refs,
+				   sizeof(map->grants[0]) * op.count) != 0) {
+			gntdev_del_map(priv, map->index, map->count);
+			return -EFAULT;
+		}
+
+                op.index = map->index << PAGE_SHIFT;
+		if (copy_to_user(u, &op, sizeof(op)) != 0) {
+			gntdev_del_map(priv, map->index, map->count);
+                        return -EFAULT;
+		}
+
+                return 0;
+	}
+	case IOCTL_GNTDEV_UNMAP_GRANT_REF:
+	{
+		struct ioctl_gntdev_unmap_grant_ref op;
+
+		if (copy_from_user(&op, ptr, sizeof(op)) != 0)
+                        return -EFAULT;
+		if (debug > 1)
+			printk("%s: priv %p, del %d+%d\n", __FUNCTION__, priv,
+			       (int)op.index, (int)op.count);
+
+                return gntdev_del_map(priv, op.index >> PAGE_SHIFT, op.count);
+	}
+	case IOCTL_GNTDEV_GET_OFFSET_FOR_VADDR:
+	{
+		struct ioctl_gntdev_get_offset_for_vaddr op;
+		struct grant_map *map;
+
+		if (copy_from_user(&op, ptr, sizeof(op)) != 0)
+			return -EFAULT;
+		if (debug > 1)
+			printk("%s: priv %p, offset for vaddr %lx\n", __FUNCTION__, priv,
+			       (unsigned long)op.vaddr);
+
+		down_read(&priv->sem);
+		map = gntdev_find_map_vaddr(priv, op.vaddr);
+		if (map == NULL ||
+		    map->vma->vm_start != op.vaddr) {
+			up_read(&priv->sem);
+			return -EINVAL;
+		}
+		op.offset = map->index << PAGE_SHIFT;
+		op.count = map->count;
+		up_read(&priv->sem);
+
+		if (copy_to_user(ptr, &op, sizeof(op)) != 0)
+			return -EFAULT;
+		return 0;
+	}
+	case IOCTL_GNTDEV_SET_MAX_GRANTS:
+	{
+		struct ioctl_gntdev_set_max_grants op;
+
+		if (copy_from_user(&op, ptr, sizeof(op)) != 0)
+                        return -EFAULT;
+		if (debug)
+			printk("%s: priv %p, limit %d\n", __FUNCTION__, priv, op.count);
+                if (op.count > limit)
+                        return -EINVAL;
+                down_write(&priv->sem);
+                priv->limit = op.count;
+                up_write(&priv->sem);
+		return 0;
+	}
+	default:
+		if (debug)
+			printk("%s: priv %p, unknown cmd %x\n", __FUNCTION__, priv, cmd);
+		return -ENOIOCTLCMD;
+	}
+
+	return 0;
+}
+
+static int gntdev_mmap(struct file *flip, struct vm_area_struct *vma) 
+{
+        struct gntdev_priv *priv = flip->private_data;
+        int index = vma->vm_pgoff;
+        int count = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
+        struct grant_map *map;
+        int err = -EINVAL;
+
+	if ((vma->vm_flags & VM_WRITE) && !(vma->vm_flags & VM_SHARED))
+		return -EINVAL;
+
+        if (debug > 1)
+                printk("%s: map %d+%d at %lx (pgoff %lx)\n", __FUNCTION__,
+		       index, count, vma->vm_start, vma->vm_pgoff);
+
+	down_read(&priv->sem);
+        map = gntdev_find_map_index(priv, index, count);
+        if (!map)
+                goto unlock_out;
+        if (map->vma)
+                goto unlock_out;
+        if (priv->mm != vma->vm_mm) {
+                printk("%s: Huh? Other mm?\n", __FUNCTION__);
+                goto unlock_out;
+        }
+
+	vma->vm_ops = &gntdev_vmops;
+
+	vma->vm_flags |= VM_RESERVED;
+	vma->vm_flags |= VM_DONTCOPY;
+	vma->vm_flags |= VM_DONTEXPAND;
+
+	vma->vm_private_data = map;
+        map->vma = vma;
+
+        map->flags = GNTMAP_host_map | GNTMAP_application_map | GNTMAP_contains_pte;
+        if (!(vma->vm_flags & VM_WRITE))
+                map->flags |= GNTMAP_readonly;
+
+        err = apply_to_page_range(vma->vm_mm, vma->vm_start,
+				  vma->vm_end - vma->vm_start,
+				  find_grant_ptes, map);
+	if (err) {
+		goto unlock_out;
+		if (debug)
+			printk("%s: find_grant_ptes() failure.\n", __FUNCTION__);
+	}
+
+        err = map_grant_pages(map);
+	if (err) {
+		goto unlock_out;
+		if (debug)
+			printk("%s: map_grant_pages() failure.\n", __FUNCTION__);
+	}
+	map->is_mapped = 1;
+
+unlock_out:
+	up_read(&priv->sem);
+        return err;
+}
+
+static const struct file_operations gntdev_fops = {
+	.owner = THIS_MODULE,
+	.open = gntdev_open,
+	.release = gntdev_release,
+	.mmap = gntdev_mmap,
+	.unlocked_ioctl = gntdev_ioctl
+};
+
+static struct miscdevice gntdev_miscdev = {
+        .minor        = MISC_DYNAMIC_MINOR,
+        .name         = "gntdev",
+        .fops         = &gntdev_fops,
+};
+
+/* ------------------------------------------------------------------ */
+
+static int __init gntdev_init(void)
+{
+        int err;
+
+	if (!xen_domain())
+		return -ENODEV;
+
+        err = misc_register(&gntdev_miscdev);
+        if (err != 0) {
+		printk(KERN_ERR "Could not register gntdev device\n");
+		return err;
+	}
+	return 0;
+}
+
+static void __exit gntdev_exit(void)
+{
+        misc_deregister(&gntdev_miscdev);
+}
+
+module_init(gntdev_init);
+module_exit(gntdev_exit);
+
+/* ------------------------------------------------------------------ */
+
diff --git a/include/xen/gntdev.h b/include/xen/gntdev.h
new file mode 100644
index 0000000..8bd1467
--- /dev/null
+++ b/include/xen/gntdev.h
@@ -0,0 +1,119 @@
+/******************************************************************************
+ * gntdev.h
+ * 
+ * Interface to /dev/xen/gntdev.
+ * 
+ * Copyright (c) 2007, D G Murray
+ * 
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version 2
+ * as published by the Free Software Foundation; or, when distributed
+ * separately from the Linux kernel or incorporated into other
+ * software packages, subject to the following license:
+ * 
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this source file (the "Software"), to deal in the Software without
+ * restriction, including without limitation the rights to use, copy, modify,
+ * merge, publish, distribute, sublicense, and/or sell copies of the Software,
+ * and to permit persons to whom the Software is furnished to do so, subject to
+ * the following conditions:
+ * 
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ * 
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ */
+
+#ifndef __LINUX_PUBLIC_GNTDEV_H__
+#define __LINUX_PUBLIC_GNTDEV_H__
+
+struct ioctl_gntdev_grant_ref {
+	/* The domain ID of the grant to be mapped. */
+	uint32_t domid;
+	/* The grant reference of the grant to be mapped. */
+	uint32_t ref;
+};
+
+/*
+ * Inserts the grant references into the mapping table of an instance
+ * of gntdev. N.B. This does not perform the mapping, which is deferred
+ * until mmap() is called with @index as the offset.
+ */
+#define IOCTL_GNTDEV_MAP_GRANT_REF \
+_IOC(_IOC_NONE, 'G', 0, sizeof(struct ioctl_gntdev_map_grant_ref))
+struct ioctl_gntdev_map_grant_ref {
+	/* IN parameters */
+	/* The number of grants to be mapped. */
+	uint32_t count;
+	uint32_t pad;
+	/* OUT parameters */
+	/* The offset to be used on a subsequent call to mmap(). */
+	uint64_t index;
+	/* Variable IN parameter. */
+	/* Array of grant references, of size @count. */
+	struct ioctl_gntdev_grant_ref refs[1];
+};
+
+/*
+ * Removes the grant references from the mapping table of an instance of
+ * of gntdev. N.B. munmap() must be called on the relevant virtual address(es)
+ * before this ioctl is called, or an error will result.
+ */
+#define IOCTL_GNTDEV_UNMAP_GRANT_REF \
+_IOC(_IOC_NONE, 'G', 1, sizeof(struct ioctl_gntdev_unmap_grant_ref))       
+struct ioctl_gntdev_unmap_grant_ref {
+	/* IN parameters */
+	/* The offset was returned by the corresponding map operation. */
+	uint64_t index;
+	/* The number of pages to be unmapped. */
+	uint32_t count;
+	uint32_t pad;
+};
+
+/*
+ * Returns the offset in the driver's address space that corresponds
+ * to @vaddr. This can be used to perform a munmap(), followed by an
+ * UNMAP_GRANT_REF ioctl, where no state about the offset is retained by
+ * the caller. The number of pages that were allocated at the same time as
+ * @vaddr is returned in @count.
+ *
+ * N.B. Where more than one page has been mapped into a contiguous range, the
+ *      supplied @vaddr must correspond to the start of the range; otherwise
+ *      an error will result. It is only possible to munmap() the entire
+ *      contiguously-allocated range at once, and not any subrange thereof.
+ */
+#define IOCTL_GNTDEV_GET_OFFSET_FOR_VADDR \
+_IOC(_IOC_NONE, 'G', 2, sizeof(struct ioctl_gntdev_get_offset_for_vaddr))
+struct ioctl_gntdev_get_offset_for_vaddr {
+	/* IN parameters */
+	/* The virtual address of the first mapped page in a range. */
+	uint64_t vaddr;
+	/* OUT parameters */
+	/* The offset that was used in the initial mmap() operation. */
+	uint64_t offset;
+	/* The number of pages mapped in the VM area that begins at @vaddr. */
+	uint32_t count;
+	uint32_t pad;
+};
+
+/*
+ * Sets the maximum number of grants that may mapped at once by this gntdev
+ * instance.
+ *
+ * N.B. This must be called before any other ioctl is performed on the device.
+ */
+#define IOCTL_GNTDEV_SET_MAX_GRANTS \
+_IOC(_IOC_NONE, 'G', 3, sizeof(struct ioctl_gntdev_set_max_grants))
+struct ioctl_gntdev_set_max_grants {
+	/* IN parameter */
+	/* The maximum number of grants that may be mapped at once. */
+	uint32_t count;
+};
+
+#endif /* __LINUX_PUBLIC_GNTDEV_H__ */

[-- Attachment #3: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: Re: pv_ops & gntdev?
  2009-03-02 22:53                             ` Gerd Hoffmann
@ 2009-03-03  1:05                               ` Jeremy Fitzhardinge
  2009-03-03 22:59                               ` Jeremy Fitzhardinge
  1 sibling, 0 replies; 27+ messages in thread
From: Jeremy Fitzhardinge @ 2009-03-03  1:05 UTC (permalink / raw)
  To: Gerd Hoffmann; +Cc: Xen Development Mailing List

Gerd Hoffmann wrote:
> Features:
>  * compiles ;)
>  * survived light testing.
>  * mapping via xc_gnttab_map_grant_ref*s* (i.e. multiple grants at once)
>    works stable.  Last time I tried that with 2.6.18 I quickly killed
>    the host kernel.
>  * much smaller than the 2.6.18 version.
>  * maps grants directly, doesn't burn pfns and kernel address space.
>   

Cool, thanks.  I'll stick it in when I've finished firefighting the 
current breakage.

    J

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Re: pv_ops & gntdev?
  2009-03-02 22:53                             ` Gerd Hoffmann
  2009-03-03  1:05                               ` Jeremy Fitzhardinge
@ 2009-03-03 22:59                               ` Jeremy Fitzhardinge
  2009-03-04  8:18                                 ` Gerd Hoffmann
  1 sibling, 1 reply; 27+ messages in thread
From: Jeremy Fitzhardinge @ 2009-03-03 22:59 UTC (permalink / raw)
  To: Gerd Hoffmann; +Cc: Xen Development Mailing List

Gerd Hoffmann wrote:
> Jeremy Fitzhardinge wrote:
>   
>> Gerd Hoffmann wrote:
>>     
>>> mmu motifier will fit the bill and additionally work fine on partial
>>> unmaps.
>>>       
>> OK, it'll be interesting to see how that turns out.
>>     
>
> Prelimary version attached.
>
> Features:
>  * compiles ;)
>  * survived light testing.
>  * mapping via xc_gnttab_map_grant_ref*s* (i.e. multiple grants at once)
>    works stable.  Last time I tried that with 2.6.18 I quickly killed
>    the host kernel.
>  * much smaller than the 2.6.18 version.
>  * maps grants directly, doesn't burn pfns and kernel address space.
>   

Do you have a test program?

> Issues:
>  (1) Grant teardown on unclean exit doesn't work yet.  Doesn't crash,
>      but I think I leak the pages because they are not unmapped via
>      grant api.
>  (2) Also on unclean exit my release callbacks are not called for some
>      silly reason, to be investigated.  Memory leak.  Probably related
>      to (1).  Maybe I did something wrong with the VM flags ...
>  (3) There are probably some nasty corner cases not handled correctly
>      yet, the whole thing needs a careful review once the two issues
>      listed above are fixed.
>
> Reviews, comments and hints are welcome,
>   

Comments below.  There's quite a bit of whitespace damage and formatting 
strangeness (NULL == tests, for example).  I've checked in a couple of 
followup patches to address some of the comments below, but not all.

> diff --git a/drivers/xen/Kconfig b/drivers/xen/Kconfig
> index 0138113..555e4b6 100644
> --- a/drivers/xen/Kconfig
> +++ b/drivers/xen/Kconfig
> @@ -74,9 +74,17 @@ config XEN_COMPAT_XENFS
>           a xen platform.
>           If in doubt, say yes.
>  
> +config XEN_GNTDEV
> +	bool "userspace grant access device driver"
> +	depends on XEN
> +	default XEN_DOM0
> +	select MMU_NOTIFIER
> +	help
> +	  Allows userspace processes use grants.
> +
>  config XEN_XENBUS_FRONTEND
>         tristate
>  
>  config XEN_S3
>         def_bool y
> -       depends on XEN_DOM0 && ACPI
> \ No newline at end of file
> +       depends on XEN_DOM0 && ACPI
> diff --git a/drivers/xen/Makefile b/drivers/xen/Makefile
> index 4b01fc8..c2180d9 100644
> --- a/drivers/xen/Makefile
> +++ b/drivers/xen/Makefile
> @@ -5,8 +5,9 @@ obj-$(CONFIG_HOTPLUG_CPU)		+= cpu_hotplug.o
>  obj-$(CONFIG_XEN_XENCOMM)		+= xencomm.o
>  obj-$(CONFIG_XEN_BALLOON)		+= balloon.o
>  obj-$(CONFIG_XEN_DEV_EVTCHN)	+= evtchn.o
> +obj-$(CONFIG_XEN_GNTDEV)	+= gntdev.o
>  obj-$(CONFIG_XEN_BLKDEV_BACKEND)	+= blkback/
>  obj-$(CONFIG_XEN_NETDEV_BACKEND)	+= netback/
>  obj-$(CONFIG_XENFS)			+= xenfs/
>  
> -obj-$(CONFIG_XEN_S3)			+= acpi.o
> \ No newline at end of file
> +obj-$(CONFIG_XEN_S3)			+= acpi.o
> diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
> new file mode 100644
> index 0000000..3dd0e5f
> --- /dev/null
> +++ b/drivers/xen/gntdev.c
> @@ -0,0 +1,594 @@
> +/******************************************************************************
> + * gntdev.c
> + *
> + * Device for accessing (in user-space) pages that have been granted by other
> + * domains.
> + *
> + * Copyright (c) 2006-2007, D G Murray.
> + *           (c) 2009 Gerd Hoffmann <kraxel@redhat.com>
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
> + */
> +
> +#include <linux/module.h>
> +#include <linux/kernel.h>
> +#include <linux/init.h>
> +#include <linux/miscdevice.h>
> +#include <linux/fs.h>
> +#include <linux/mm.h>
> +#include <linux/mman.h>
> +#include <linux/mmu_notifier.h>
> +#include <linux/types.h>
> +#include <linux/uaccess.h>
> +#include <linux/sched.h>
> +#include <linux/rwsem.h>
> +
> +#include <xen/grant_table.h>
> +#include <xen/gntdev.h>
> +#include <asm/xen/hypervisor.h>
> +#include <asm/xen/hypercall.h>
> +#include <asm/xen/page.h>
> +
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR("Derek G. Murray <Derek.Murray@cl.cam.ac.uk>, "
> +              "Gerd Hoffmann <kraxel@redhat.com>");
> +MODULE_DESCRIPTION("User-space granted page access driver");
> +
> +static int debug = 1;
> +module_param(debug, int, 0644);
> +static int limit = 1024;
> +module_param(limit, int, 0644);
> +
> +struct gntdev_priv {
> +        struct list_head maps;
> +        uint32_t used;
> +	uint32_t limit;
> +	struct rw_semaphore sem;
> +        struct mm_struct *mm;
> +        struct mmu_notifier mn;
> +};
> +
> +struct grant_info {
> +        domid_t domid;
> +        grant_ref_t ref;
> +        grant_handle_t handle;
> +};
> +
> +struct grant_map {
> +        struct list_head next;
> +        struct gntdev_priv *priv;
> +        struct vm_area_struct *vma;
> +        int index;
> +        int count;
> +	int flags;
> +	int is_mapped;
> +	struct ioctl_gntdev_grant_ref *grants;
> +	struct gnttab_map_grant_ref   *map_ops;
> +	struct gnttab_unmap_grant_ref *unmap_ops;
> +};
> +
> +/* ------------------------------------------------------------------ */
> +
> +static void gntdev_print_maps(struct gntdev_priv *priv,
> +                              char *text, int text_index)
> +{
> +        struct grant_map *map;
> +
> +        printk("%s: maps list (priv %p, usage %d/%d)\n",
> +               __FUNCTION__, priv, priv->used, priv->limit);
> +        list_for_each_entry(map, &priv->maps, next)
> +                printk("  index %2d, count %2d %s\n",
> +                       map->index, map->count,
> +                       map->index == text_index && text ? text : "");
> +}
> +
> +static struct grant_map *gntdev_add_map(struct gntdev_priv *priv, int count)
> +{
> +        struct grant_map *map, *add;
> +
> +        add = kzalloc(sizeof(struct grant_map), GFP_KERNEL);
> +        if (NULL == add)
> +                return NULL;
> +
> +	add->grants    = kzalloc(sizeof(add->grants[0])    * count, GFP_KERNEL);
> +	add->map_ops   = kzalloc(sizeof(add->map_ops[0])   * count, GFP_KERNEL);
> +	add->unmap_ops = kzalloc(sizeof(add->unmap_ops[0]) * count, GFP_KERNEL);
> +	if (NULL == add->grants  ||
> +	    NULL == add->map_ops ||
> +	    NULL == add->unmap_ops) {
> +		kfree(add->grants);
> +		kfree(add->map_ops);
> +		kfree(add->unmap_ops);
> +		kfree(add);
> +                return NULL;
> +	}
> +
> +        add->index = 0;
> +        add->count = count;
> +        add->priv  = priv;
> +
> +        down_write(&priv->sem);
> +        if (add->count + priv->used > priv->limit) {
> +		up_write(&priv->sem);
> +                kfree(add);
>   
Leaks ->grants, ->map_ops, ->unmap_ops?

> +                return NULL;
> +        }
> +
> +        list_for_each_entry(map, &priv->maps, next) {
> +                if (add->index + add->count < map->index) {
> +                        list_add_tail(&add->next, &map->next);
> +                        goto done;
> +                }
> +                add->index = map->index + map->count;
> +        }
> +        list_add_tail(&add->next, &priv->maps);
> +
> +done:
> +        priv->used += add->count;
> +        if (debug > 1)
> +                gntdev_print_maps(priv, "[new]", add->index);
> +	up_write(&priv->sem);
> +        return add;
> +}
> +
> +static struct grant_map *gntdev_find_map_index(struct gntdev_priv *priv, int index,
> +					       int count)
> +{
> +        struct grant_map *map;
> +
> +        list_for_each_entry(map, &priv->maps, next) {
> +                if (map->index != index)
> +                        continue;
> +                if (map->count != count)
> +                        continue;
> +                return map;
> +        }
> +        return NULL;
> +}
> +
> +static struct grant_map *gntdev_find_map_vaddr(struct gntdev_priv *priv,
> +					       unsigned long vaddr)
> +{
> +        struct grant_map *map;
> +
> +        list_for_each_entry(map, &priv->maps, next) {
> +		if (!map->vma)
> +			continue;
> +		if (vaddr < map->vma->vm_start)
> +			continue;
> +		if (vaddr >= map->vma->vm_end)
> +			continue;
> +                return map;
> +        }
> +        return NULL;
> +}
> +
> +static int gntdev_del_map(struct gntdev_priv *priv, int index,
> +                          int count)
> +{
> +        struct grant_map *map;
> +        int i, err = -EINVAL;
> +
> +        down_write(&priv->sem);
> +        map = gntdev_find_map_index(priv, index, count);
> +        if (NULL == map)
> +                goto out;
> +
> +	err = -EBUSY;
> +	if (map->vma)
> +		goto out;
> +        for (i = 0; i < map->count; i++)
> +                if (map->unmap_ops[i].handle)
> +                        goto out;
> +
> +        list_del(&map->next);
> +        kfree(map->grants);
> +        kfree(map->map_ops);
> +        kfree(map->unmap_ops);
> +        kfree(map);
> +        priv->used -= map->count;
> +        err = 0;
> +
> +out:
> +	up_write(&priv->sem);
> +        return err;
> +}
> +
> +/* ------------------------------------------------------------------ */
> +
> +static int find_grant_ptes(pte_t *pte, pgtable_t token, unsigned long addr, void *data)
> +{
> +        struct grant_map *map = data;
> +        unsigned int pgnr = (addr - map->vma->vm_start) >> PAGE_SHIFT;
> +	u64 mpte;
>   
pteval_t or pte_t
> +
> +	BUG_ON(pgnr >= map->count);
> +	mpte  = (u64)pfn_to_mfn(page_to_pfn(token)) << PAGE_SHIFT;
> +	mpte |= (unsigned long)pte & ~PAGE_MASK;
>   
(!) You're casting pte_t * to unsigned long, masking off the lower bits 
then oring them into your pte.  That doesn't look like it will mean very 
much...  I guess this explains your not-unmapping bug.

This should probably be using mfn_pte() anyway:

	mfn = pfn_to_mfn(page_to_pfn(token));
	pgprot = __pgprot(pte->pte & PTE_FLAGS_MASK);
 	mpte = mfn_pte(mfn, pgprot);


> +        gnttab_set_map_op(&map->map_ops[pgnr], mpte, map->flags,
> +                          map->grants[pgnr].ref,
> +                          map->grants[pgnr].domid);
> +        gnttab_set_unmap_op(&map->unmap_ops[pgnr], mpte, map->flags,
> +			    0 /* handle */);
> +        return 0;
> +}
> +
> +static int map_grant_pages(struct grant_map *map)
> +{
> +	int i, err = 0;
> +
> +	if (debug)
> +		printk("%s: map %d+%d\n", __FUNCTION__, map->index, map->count);
> +        BUG_ON(HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref,
> +					 map->map_ops, map->count));
>   
WARN_ON() is better here, because we can probably limp on if it fails.
> +	for (i = 0; i < map->count; i++) {
> +		if (map->map_ops[i].status)
> +			err = -EINVAL;
> +		map->unmap_ops[i].handle = map->map_ops[i].handle;
> +	}
> +	return err;
> +}
> +
> +static int unmap_grant_pages(struct grant_map *map, int offset, int pages)
> +{
> +	int i, err = 0;
> +
> +	if (debug)
> +		printk("%s: map %d+%d [%d+%d]\n", __FUNCTION__,
> +		       map->index, map->count, offset, pages);
> +	BUG_ON(HYPERVISOR_grant_table_op(GNTTABOP_unmap_grant_ref,
> +					 map->unmap_ops + offset, pages));
>   
Ditto.
> +	for (i = 0; i < pages; i++) {
> +		if (map->unmap_ops[offset+i].status)
> +			err = -EINVAL;
> +		map->unmap_ops[offset+i].handle = 0;
> +	}
> +	return err;
> +}
> +
> +/* ------------------------------------------------------------------ */
> +
> +static void gntdev_vma_close(struct vm_area_struct *vma)
> +{
> +        struct grant_map *map = vma->vm_private_data;
> +
> +        if (debug > 1)
> +                printk("%s\n", __FUNCTION__);
> +	map->is_mapped = 0;
> +	map->vma = NULL;
> +	vma->vm_private_data = NULL;
> +}
> +
> +static int gntdev_vma_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> +{
> +        if (debug)
> +                printk("%s: vaddr %p, pgoff %ld (shouldn't happen)\n",
> +		       __FUNCTION__, vmf->virtual_address, vmf->pgoff);
> +	vmf->flags = VM_FAULT_ERROR;
> +        return 0;
> +}
> +
> +static struct vm_operations_struct gntdev_vmops = {
> +	.close = gntdev_vma_close,
> +	.fault = gntdev_vma_fault,
> +};
> +
> +/* ------------------------------------------------------------------ */
> +
> +static void mn_invl_range_start(struct mmu_notifier *mn,
> +				struct mm_struct *mm,
> +				unsigned long start, unsigned long end)
> +{
> +        struct gntdev_priv *priv = container_of(mn, struct gntdev_priv, mn);
> +        struct grant_map *map;
> +	unsigned long mstart, mend;
> +	int err;
> +
> +	down_read(&priv->sem);
> +        list_for_each_entry(map, &priv->maps, next) {
> +                if (!map->vma)
> +                        continue;
> +                if (!map->is_mapped)
> +                        continue;
> +                if (map->vma->vm_start >= end)
> +                        continue;
> +                if (map->vma->vm_end <= start)
> +                        continue;
> +		mstart = max(start, map->vma->vm_start);
> +		mend   = min(end,   map->vma->vm_end);
> +                if (debug > 1)
> +                        printk("%s: map %d+%d (%lx %lx), range %lx %lx, mrange %lx %lx\n",
> +                               __FUNCTION__, map->index, map->count,
> +                               map->vma->vm_start, map->vma->vm_end,
> +			       start, end, mstart, mend);
> +		err = unmap_grant_pages(map,
> +					(mstart - map->vma->vm_start) >> PAGE_SHIFT,
> +					(mend - mstart) >> PAGE_SHIFT);
> +		BUG_ON(err);
> +        }
> +	up_read(&priv->sem);
> +}
> +
> +static void mn_invl_page(struct mmu_notifier *mn,
> +			 struct mm_struct *mm,
> +			 unsigned long address)
> +{
> +	mn_invl_range_start(mn, mm, address, address + PAGE_SIZE);
> +}
> +
> +static void mn_release(struct mmu_notifier *mn,
> +		       struct mm_struct *mm)
> +{
> +        struct gntdev_priv *priv = container_of(mn, struct gntdev_priv, mn);
> +        struct grant_map *map;
> +	int err;
> +
> +	down_read(&priv->sem);
> +        list_for_each_entry(map, &priv->maps, next) {
> +                if (!map->vma)
> +                        continue;
> +                if (debug > 1)
> +                        printk("%s: map %d+%d (%lx %lx)\n",
> +                               __FUNCTION__, map->index, map->count,
> +                               map->vma->vm_start, map->vma->vm_end);
> +		err = unmap_grant_pages(map, 0, map->count);
> +		BUG_ON(err);
> +        }
> +	up_read(&priv->sem);
> +}
> +
> +struct mmu_notifier_ops gntdev_mmu_ops = {
> +        .release                = mn_release,
> +        .invalidate_page        = mn_invl_page,
> +        .invalidate_range_start = mn_invl_range_start,
> +};
> +
> +/* ------------------------------------------------------------------ */
> +
> +static int gntdev_open(struct inode *inode, struct file *flip)
> +{
> +        struct gntdev_priv *priv;
> +
> +	priv = kzalloc(sizeof(*priv), GFP_KERNEL);
> +	if (!priv)
> +                return -ENOMEM;
> +
> +        INIT_LIST_HEAD(&priv->maps);
> +	init_rwsem(&priv->sem);
> +        priv->limit = limit;
> +        priv->mm = get_task_mm(current);
>   
Is this to cope with the /dev/gnttab fd being inherited by/passed to 
some other process, even if the mappings don't follow it?
> +        priv->mn.ops = &gntdev_mmu_ops;
> +        mmu_notifier_register(&priv->mn, priv->mm);
> +	flip->private_data = priv;
> +
> +        if (debug)
> +                printk("%s: priv %p\n", __FUNCTION__, priv);
> +
> +	return 0;
> +}
> +
> +static int gntdev_release(struct inode *inode, struct file *flip)
> +{
> +        struct gntdev_priv *priv = flip->private_data;
> +
> +        if (debug)
> +                printk("%s: priv %p\n", __FUNCTION__, priv);
> +
> +        BUG_ON(!list_empty(&priv->maps));
> +        mmu_notifier_unregister(&priv->mn, priv->mm);
> +        mm_release(current, priv->mm);
> +        kfree(priv);
> +	return 0;
> +}
> +
> +static long gntdev_ioctl(struct file *flip,
> +			 unsigned int cmd, unsigned long arg)
> +{
> +        struct gntdev_priv *priv = flip->private_data;
> +        void __user *ptr = (void __user *)arg;
> +
> +	switch (cmd) {
> +	case IOCTL_GNTDEV_MAP_GRANT_REF:
>   
ioctl switches should be split into separate functions.

> +	{
> +		struct ioctl_gntdev_map_grant_ref op;
> +		struct ioctl_gntdev_map_grant_ref __user *u = ptr;
> +		struct grant_map *map;
> +
> +		if (copy_from_user(&op, u, sizeof(op)) != 0)
> +                        return -EFAULT;
> +		if (debug > 1)
> +			printk("%s: priv %p, add %d\n", __FUNCTION__, priv,
> +			       op.count);
> +		if (unlikely(op.count <= 0))
> +                        return -EINVAL;
> +		if (unlikely(op.count > priv->limit))
> +                        return -EINVAL;
> +
> +		map = gntdev_add_map(priv, op.count);
> +                if (!map)
> +                        return -ENOMEM;
> +		if (copy_from_user(map->grants, &u->refs,
> +				   sizeof(map->grants[0]) * op.count) != 0) {
> +			gntdev_del_map(priv, map->index, map->count);
> +			return -EFAULT;
> +		}
> +
> +                op.index = map->index << PAGE_SHIFT;
> +		if (copy_to_user(u, &op, sizeof(op)) != 0) {
> +			gntdev_del_map(priv, map->index, map->count);
> +                        return -EFAULT;
> +		}
> +
> +                return 0;
> +	}
> +	case IOCTL_GNTDEV_UNMAP_GRANT_REF:
> +	{
> +		struct ioctl_gntdev_unmap_grant_ref op;
> +
> +		if (copy_from_user(&op, ptr, sizeof(op)) != 0)
> +                        return -EFAULT;
> +		if (debug > 1)
> +			printk("%s: priv %p, del %d+%d\n", __FUNCTION__, priv,
> +			       (int)op.index, (int)op.count);
> +
> +                return gntdev_del_map(priv, op.index >> PAGE_SHIFT, op.count);
> +	}
> +	case IOCTL_GNTDEV_GET_OFFSET_FOR_VADDR:
> +	{
> +		struct ioctl_gntdev_get_offset_for_vaddr op;
> +		struct grant_map *map;
> +
> +		if (copy_from_user(&op, ptr, sizeof(op)) != 0)
> +			return -EFAULT;
> +		if (debug > 1)
> +			printk("%s: priv %p, offset for vaddr %lx\n", __FUNCTION__, priv,
> +			       (unsigned long)op.vaddr);
> +
> +		down_read(&priv->sem);
> +		map = gntdev_find_map_vaddr(priv, op.vaddr);
> +		if (map == NULL ||
> +		    map->vma->vm_start != op.vaddr) {
> +			up_read(&priv->sem);
> +			return -EINVAL;
> +		}
> +		op.offset = map->index << PAGE_SHIFT;
> +		op.count = map->count;
> +		up_read(&priv->sem);
> +
> +		if (copy_to_user(ptr, &op, sizeof(op)) != 0)
> +			return -EFAULT;
> +		return 0;
> +	}
> +	case IOCTL_GNTDEV_SET_MAX_GRANTS:
> +	{
> +		struct ioctl_gntdev_set_max_grants op;
> +
> +		if (copy_from_user(&op, ptr, sizeof(op)) != 0)
> +                        return -EFAULT;
> +		if (debug)
> +			printk("%s: priv %p, limit %d\n", __FUNCTION__, priv, op.count);
> +                if (op.count > limit)
> +                        return -EINVAL;
> +                down_write(&priv->sem);
> +                priv->limit = op.count;
> +                up_write(&priv->sem);
> +		return 0;
> +	}
> +	default:
> +		if (debug)
> +			printk("%s: priv %p, unknown cmd %x\n", __FUNCTION__, priv, cmd);
> +		return -ENOIOCTLCMD;
> +	}
> +
> +	return 0;
> +}
> +
> +static int gntdev_mmap(struct file *flip, struct vm_area_struct *vma) 
> +{
> +        struct gntdev_priv *priv = flip->private_data;
> +        int index = vma->vm_pgoff;
> +        int count = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
> +        struct grant_map *map;
> +        int err = -EINVAL;
> +
> +	if ((vma->vm_flags & VM_WRITE) && !(vma->vm_flags & VM_SHARED))
> +		return -EINVAL;
> +
> +        if (debug > 1)
> +                printk("%s: map %d+%d at %lx (pgoff %lx)\n", __FUNCTION__,
> +		       index, count, vma->vm_start, vma->vm_pgoff);
> +
> +	down_read(&priv->sem);
> +        map = gntdev_find_map_index(priv, index, count);
> +        if (!map)
> +                goto unlock_out;
> +        if (map->vma)
> +                goto unlock_out;
> +        if (priv->mm != vma->vm_mm) {
> +                printk("%s: Huh? Other mm?\n", __FUNCTION__);
> +                goto unlock_out;
> +        }
> +
> +	vma->vm_ops = &gntdev_vmops;
> +
> +	vma->vm_flags |= VM_RESERVED;
> +	vma->vm_flags |= VM_DONTCOPY;
> +	vma->vm_flags |= VM_DONTEXPAND;
>   
VM_IO?
> +
> +	vma->vm_private_data = map;
> +        map->vma = vma;
> +
> +        map->flags = GNTMAP_host_map | GNTMAP_application_map | GNTMAP_contains_pte;
> +        if (!(vma->vm_flags & VM_WRITE))
> +                map->flags |= GNTMAP_readonly;
> +
> +        err = apply_to_page_range(vma->vm_mm, vma->vm_start,
> +				  vma->vm_end - vma->vm_start,
> +				  find_grant_ptes, map);
> +	if (err) {
> +		goto unlock_out;
> +		if (debug)
> +			printk("%s: find_grant_ptes() failure.\n", __FUNCTION__);
> +	}
> +
> +        err = map_grant_pages(map);
> +	if (err) {
> +		goto unlock_out;
> +		if (debug)
> +			printk("%s: map_grant_pages() failure.\n", __FUNCTION__);
> +	}
> +	map->is_mapped = 1;
> +
> +unlock_out:
> +	up_read(&priv->sem);
> +        return err;
> +}
> +
> +static const struct file_operations gntdev_fops = {
> +	.owner = THIS_MODULE,
> +	.open = gntdev_open,
> +	.release = gntdev_release,
> +	.mmap = gntdev_mmap,
> +	.unlocked_ioctl = gntdev_ioctl
> +};
> +
> +static struct miscdevice gntdev_miscdev = {
> +        .minor        = MISC_DYNAMIC_MINOR,
> +        .name         = "gntdev",
> +        .fops         = &gntdev_fops,
> +};
> +
> +/* ------------------------------------------------------------------ */
> +
> +static int __init gntdev_init(void)
> +{
> +        int err;
> +
> +	if (!xen_domain())
> +		return -ENODEV;
> +
> +        err = misc_register(&gntdev_miscdev);
> +        if (err != 0) {
> +		printk(KERN_ERR "Could not register gntdev device\n");
> +		return err;
> +	}
> +	return 0;
> +}
> +
> +static void __exit gntdev_exit(void)
> +{
> +        misc_deregister(&gntdev_miscdev);
> +}
> +
> +module_init(gntdev_init);
> +module_exit(gntdev_exit);
> +
> +/* ------------------------------------------------------------------ */
> +
> diff --git a/include/xen/gntdev.h b/include/xen/gntdev.h
> new file mode 100644
> index 0000000..8bd1467
> --- /dev/null
> +++ b/include/xen/gntdev.h
> @@ -0,0 +1,119 @@
> +/******************************************************************************
> + * gntdev.h
> + * 
> + * Interface to /dev/xen/gntdev.
> + * 
> + * Copyright (c) 2007, D G Murray
> + * 
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License version 2
> + * as published by the Free Software Foundation; or, when distributed
> + * separately from the Linux kernel or incorporated into other
> + * software packages, subject to the following license:
> + * 
> + * Permission is hereby granted, free of charge, to any person obtaining a copy
> + * of this source file (the "Software"), to deal in the Software without
> + * restriction, including without limitation the rights to use, copy, modify,
> + * merge, publish, distribute, sublicense, and/or sell copies of the Software,
> + * and to permit persons to whom the Software is furnished to do so, subject to
> + * the following conditions:
> + * 
> + * The above copyright notice and this permission notice shall be included in
> + * all copies or substantial portions of the Software.
> + * 
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
> + * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
> + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
> + * IN THE SOFTWARE.
> + */
> +
> +#ifndef __LINUX_PUBLIC_GNTDEV_H__
> +#define __LINUX_PUBLIC_GNTDEV_H__
> +
> +struct ioctl_gntdev_grant_ref {
> +	/* The domain ID of the grant to be mapped. */
> +	uint32_t domid;
> +	/* The grant reference of the grant to be mapped. */
> +	uint32_t ref;
> +};
> +
> +/*
> + * Inserts the grant references into the mapping table of an instance
> + * of gntdev. N.B. This does not perform the mapping, which is deferred
> + * until mmap() is called with @index as the offset.
> + */
> +#define IOCTL_GNTDEV_MAP_GRANT_REF \
> +_IOC(_IOC_NONE, 'G', 0, sizeof(struct ioctl_gntdev_map_grant_ref))
> +struct ioctl_gntdev_map_grant_ref {
> +	/* IN parameters */
> +	/* The number of grants to be mapped. */
> +	uint32_t count;
> +	uint32_t pad;
> +	/* OUT parameters */
> +	/* The offset to be used on a subsequent call to mmap(). */
> +	uint64_t index;
> +	/* Variable IN parameter. */
> +	/* Array of grant references, of size @count. */
> +	struct ioctl_gntdev_grant_ref refs[1];
> +};
> +
> +/*
> + * Removes the grant references from the mapping table of an instance of
> + * of gntdev. N.B. munmap() must be called on the relevant virtual address(es)
> + * before this ioctl is called, or an error will result.
> + */
> +#define IOCTL_GNTDEV_UNMAP_GRANT_REF \
> +_IOC(_IOC_NONE, 'G', 1, sizeof(struct ioctl_gntdev_unmap_grant_ref))       
> +struct ioctl_gntdev_unmap_grant_ref {
> +	/* IN parameters */
> +	/* The offset was returned by the corresponding map operation. */
> +	uint64_t index;
> +	/* The number of pages to be unmapped. */
> +	uint32_t count;
> +	uint32_t pad;
> +};
> +
> +/*
> + * Returns the offset in the driver's address space that corresponds
> + * to @vaddr. This can be used to perform a munmap(), followed by an
> + * UNMAP_GRANT_REF ioctl, where no state about the offset is retained by
> + * the caller. The number of pages that were allocated at the same time as
> + * @vaddr is returned in @count.
> + *
> + * N.B. Where more than one page has been mapped into a contiguous range, the
> + *      supplied @vaddr must correspond to the start of the range; otherwise
> + *      an error will result. It is only possible to munmap() the entire
> + *      contiguously-allocated range at once, and not any subrange thereof.
> + */
> +#define IOCTL_GNTDEV_GET_OFFSET_FOR_VADDR \
> +_IOC(_IOC_NONE, 'G', 2, sizeof(struct ioctl_gntdev_get_offset_for_vaddr))
> +struct ioctl_gntdev_get_offset_for_vaddr {
> +	/* IN parameters */
> +	/* The virtual address of the first mapped page in a range. */
> +	uint64_t vaddr;
> +	/* OUT parameters */
> +	/* The offset that was used in the initial mmap() operation. */
> +	uint64_t offset;
> +	/* The number of pages mapped in the VM area that begins at @vaddr. */
> +	uint32_t count;
> +	uint32_t pad;
> +};
> +
> +/*
> + * Sets the maximum number of grants that may mapped at once by this gntdev
> + * instance.
> + *
> + * N.B. This must be called before any other ioctl is performed on the device.
> + */
> +#define IOCTL_GNTDEV_SET_MAX_GRANTS \
> +_IOC(_IOC_NONE, 'G', 3, sizeof(struct ioctl_gntdev_set_max_grants))
> +struct ioctl_gntdev_set_max_grants {
> +	/* IN parameter */
> +	/* The maximum number of grants that may be mapped at once. */
> +	uint32_t count;
> +};
> +
> +#endif /* __LINUX_PUBLIC_GNTDEV_H__ */
>   

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Re: pv_ops & gntdev?
  2009-03-03 22:59                               ` Jeremy Fitzhardinge
@ 2009-03-04  8:18                                 ` Gerd Hoffmann
  2009-03-04 14:52                                   ` Gerd Hoffmann
  0 siblings, 1 reply; 27+ messages in thread
From: Gerd Hoffmann @ 2009-03-04  8:18 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: Xen Development Mailing List

Jeremy Fitzhardinge wrote:
> Do you have a test program?

Qemu with xen support, including userspace backends for disk & nic which
use gntdev.

Grab it here:
http://git.et.redhat.com/?p=qemu-kraxel.git;a=shortlog;h=refs/heads/xenbits.v0

usage:
qemu -M xenpv -xen-create -uuid $(uuidgen) \
  -kernel <bzImage> -initrd <initrd> \
  -drive media=disk,if=xen,file=<diskimage> \
  -net nic,model=xen,macaddr=<addr> \
  -serial <yourconsolehere>

> Comments below.  There's quite a bit of whitespace damage and formatting
> strangeness (NULL == tests, for example).  I've checked in a couple of
> followup patches to address some of the comments below, but not all.

Ok, I'll grab a fresh checkout and go over it.

>> +        up_write(&priv->sem);
>> +                kfree(add);
>>   
> Leaks ->grants, ->map_ops, ->unmap_ops?

Yep.

>> +    u64 mpte;
>>   
> pteval_t or pte_t

Well, gnttab_set_map_op() wants u64 there, thats why I did it that way.

>> +    BUG_ON(pgnr >= map->count);
>> +    mpte  = (u64)pfn_to_mfn(page_to_pfn(token)) << PAGE_SHIFT;
>> +    mpte |= (unsigned long)pte & ~PAGE_MASK;
>>   
> (!) You're casting pte_t * to unsigned long, masking off the lower bits
> then oring them into your pte.  That doesn't look like it will mean very
> much...  I guess this explains your not-unmapping bug.
> 
> This should probably be using mfn_pte() anyway:
> 
>     mfn = pfn_to_mfn(page_to_pfn(token));
>     pgprot = __pgprot(pte->pte & PTE_FLAGS_MASK);
>     mpte = mfn_pte(mfn, pgprot);

Looks better indeed.  Unclean stuff carried over from 2.6.18 ...

>> +        BUG_ON(HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref,
>> +                     map->map_ops, map->count));
>>   
> WARN_ON() is better here, because we can probably limp on if it fails.

Agree.

>> +static long gntdev_ioctl(struct file *flip,
>> +             unsigned int cmd, unsigned long arg)
>> +{
>> +        struct gntdev_priv *priv = flip->private_data;
>> +        void __user *ptr = (void __user *)arg;
>> +
>> +    switch (cmd) {
>> +    case IOCTL_GNTDEV_MAP_GRANT_REF:
>>   
> ioctl switches should be split into separate functions.

All of them or just the larger ones?

>> +    vma->vm_flags |= VM_RESERVED;
>> +    vma->vm_flags |= VM_DONTCOPY;
>> +    vma->vm_flags |= VM_DONTEXPAND;
>>   
> VM_IO?

Dunno, maybe.  2.6.18 used VM_RESERVED, thats why I sticked to that.
According to mm.h there isn't a big difference between VM_IO and
VM_RESERVED anyway ...

>> +        priv->mm = get_task_mm(current);
>>
>Is this to cope with the /dev/gnttab fd being inherited by/passed to
>some other process, even if the mappings don't follow it?

Well, sort of.  Main intention is that I want to keep track of the mm I
registered the mmu notifier for.  But, yes, it is also used to cope with
fd's inherited / passed on.  It will not work (mmap -> EINVAL), but at
least the driver shouldn't trip over it.

cheers,
  Gerd

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Re: pv_ops & gntdev?
  2009-03-04  8:18                                 ` Gerd Hoffmann
@ 2009-03-04 14:52                                   ` Gerd Hoffmann
  2009-03-04 15:16                                     ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 27+ messages in thread
From: Gerd Hoffmann @ 2009-03-04 14:52 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: Xen Development Mailing List

Gerd Hoffmann wrote:
> Jeremy Fitzhardinge wrote:
>> Do you have a test program?

> usage:
> qemu -M xenpv -xen-create -uuid $(uuidgen) \
>   -kernel <bzImage> -initrd <initrd> \
>   -drive media=disk,if=xen,file=<diskimage> \
>   -net nic,model=xen,macaddr=<addr> \
>   -serial <yourconsolehere>

"-m 256" is needed too.

>>> +    BUG_ON(pgnr >= map->count);
>>> +    mpte  = (u64)pfn_to_mfn(page_to_pfn(token)) << PAGE_SHIFT;
>>> +    mpte |= (unsigned long)pte & ~PAGE_MASK;
>>>   
>> (!) You're casting pte_t * to unsigned long, masking off the lower bits
>> then oring them into your pte.  That doesn't look like it will mean very
>> much...  I guess this explains your not-unmapping bug.
>>
>> This should probably be using mfn_pte() anyway:
>>
>>     mfn = pfn_to_mfn(page_to_pfn(token));
>>     pgprot = __pgprot(pte->pte & PTE_FLAGS_MASK);
>>     mpte = mfn_pte(mfn, pgprot);
> 
> Looks better indeed.  Unclean stuff carried over from 2.6.18 ...

Well, no.  mpte (bad name indeed) isn't a pte, but the *pointer*
(machine address) to the pte.

I've fixed that and did a few more cleanups, patch comes later today.

cheers,
  Gerd

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Re: pv_ops & gntdev?
  2009-03-04 14:52                                   ` Gerd Hoffmann
@ 2009-03-04 15:16                                     ` Jeremy Fitzhardinge
  2009-03-04 15:32                                       ` Gerd Hoffmann
  0 siblings, 1 reply; 27+ messages in thread
From: Jeremy Fitzhardinge @ 2009-03-04 15:16 UTC (permalink / raw)
  To: Gerd Hoffmann; +Cc: Xen Development Mailing List

Gerd Hoffmann wrote:
> Gerd Hoffmann wrote:
>   
>> Jeremy Fitzhardinge wrote:
>>     
>>> Do you have a test program?
>>>       
>
>   
>> usage:
>> qemu -M xenpv -xen-create -uuid $(uuidgen) \
>>   -kernel <bzImage> -initrd <initrd> \
>>   -drive media=disk,if=xen,file=<diskimage> \
>>   -net nic,model=xen,macaddr=<addr> \
>>   -serial <yourconsolehere>
>>     
>
> "-m 256" is needed too.
>
>   
>>>> +    BUG_ON(pgnr >= map->count);
>>>> +    mpte  = (u64)pfn_to_mfn(page_to_pfn(token)) << PAGE_SHIFT;
>>>> +    mpte |= (unsigned long)pte & ~PAGE_MASK;
>>>>   
>>>>         
>>> (!) You're casting pte_t * to unsigned long, masking off the lower bits
>>> then oring them into your pte.  That doesn't look like it will mean very
>>> much...  I guess this explains your not-unmapping bug.
>>>
>>> This should probably be using mfn_pte() anyway:
>>>
>>>     mfn = pfn_to_mfn(page_to_pfn(token));
>>>     pgprot = __pgprot(pte->pte & PTE_FLAGS_MASK);
>>>     mpte = mfn_pte(mfn, pgprot);
>>>       
>> Looks better indeed.  Unclean stuff carried over from 2.6.18 ...
>>     
>
> Well, no.  mpte (bad name indeed) isn't a pte, but the *pointer*
> (machine address) to the pte.
>   

Yes, but it *looks* better ;) Yes, my "cleanup" will have screwed up the 
actual functional parts in that case.   Hiding the PAGE_MASK in a 
tactical application of PAGE_OFFSET will probably help make it clearer.

    J

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Re: pv_ops & gntdev?
  2009-03-04 15:16                                     ` Jeremy Fitzhardinge
@ 2009-03-04 15:32                                       ` Gerd Hoffmann
  0 siblings, 0 replies; 27+ messages in thread
From: Gerd Hoffmann @ 2009-03-04 15:32 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: Xen Development Mailing List

> Yes, but it *looks* better ;)

indeed ;)

> Yes, my "cleanup" will have screwed up the
> actual functional parts in that case.   Hiding the PAGE_MASK in a
> tactical application of PAGE_OFFSET will probably help make it clearer.

Well, maybe there is a macro for that, the name is *not* PAGE_OFFSET though.

cheers,
  Gerd

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2009-03-04 15:32 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-02-24 18:45 pv_ops & gntdev? Gerd Hoffmann
2009-02-24 21:18 ` Jeremy Fitzhardinge
2009-02-24 22:12   ` Gerd Hoffmann
2009-02-25 10:05     ` Gerd Hoffmann
2009-02-25 17:51       ` Jeremy Fitzhardinge
2009-02-25 18:37         ` Gerd Hoffmann
2009-02-25 20:06           ` Jeremy Fitzhardinge
2009-02-25 21:19             ` Gerd Hoffmann
2009-02-25 21:35               ` Jeremy Fitzhardinge
2009-02-25 21:43                 ` Gerd Hoffmann
2009-02-25 21:48                   ` Jeremy Fitzhardinge
2009-02-25 22:11                     ` Gerd Hoffmann
2009-02-25 22:28                       ` Jeremy Fitzhardinge
2009-02-25 22:42                         ` Gerd Hoffmann
2009-02-25 22:59                           ` Jeremy Fitzhardinge
2009-03-02 22:53                             ` Gerd Hoffmann
2009-03-03  1:05                               ` Jeremy Fitzhardinge
2009-03-03 22:59                               ` Jeremy Fitzhardinge
2009-03-04  8:18                                 ` Gerd Hoffmann
2009-03-04 14:52                                   ` Gerd Hoffmann
2009-03-04 15:16                                     ` Jeremy Fitzhardinge
2009-03-04 15:32                                       ` Gerd Hoffmann
2009-02-25 18:56         ` Gerd Hoffmann
2009-02-25 20:10           ` Jeremy Fitzhardinge
2009-02-25 21:45             ` Gerd Hoffmann
2009-02-25 22:16             ` Gerd Hoffmann
2009-02-25 22:30               ` Jeremy Fitzhardinge

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.