* [RFC] PAGE_RW Should be added to PAGE_COPY ?
@ 2006-09-13 14:02 Yingchao Zhou
2006-09-14 14:48 ` Hugh Dickins
0 siblings, 1 reply; 8+ messages in thread
From: Yingchao Zhou @ 2006-09-13 14:02 UTC (permalink / raw)
To: linux-kernel; +Cc: akpm, alan, zxc
The current kernel set PAGE_COPY without write bit. This will cause intermittent non-cosistent data for user-level network drivers such as Infiniband, Quadrics and Myrinet. Which has also be mentioned by Costin Iancu in the paper "HUNTing the Overlap " (PACT'05).
An example of such phenomena is the following sequences:
register a memory space BUFF for receive message,
receive message,
call mprotect(...PROT_NONE...) and mprotect(...PROT_READ|PROT_WRITE) one by one,
write into BUFF,
then receive again.
The second time received data will perhaps not be the data sent by the peer machine but the data written by itself in the 4th step.
The reson is that :
1) User-level network driver locks phy pages when memory space is registered;
2) 2 calls to mprotect change ptes in the space to PAGE_COPY, so write any page in the space will cause a page fault;
3) In the page fault handler, it goes to do_wp_page, and in it if Page Is Locked, a new page is generated and filled into the pte. So the physical page seen by the host is not the same one by the NIC.
Adding PAGE_RW to PAGE_COPY will resolve this problem.
In my option, the reason for absense of RW is to save memory by mapping all those only read pages into ZERO_PAGE. But is there really programs which make many read-ops in memory space without even initialize them?
___________________________________________________
_ Yingchao Zhou _
_ ICT, CAS _
_ (86)010-62601009 _
___________________________________________________
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [RFC] PAGE_RW Should be added to PAGE_COPY ?
2006-09-13 14:02 Yingchao Zhou
@ 2006-09-14 14:48 ` Hugh Dickins
0 siblings, 0 replies; 8+ messages in thread
From: Hugh Dickins @ 2006-09-14 14:48 UTC (permalink / raw)
To: Yingchao Zhou; +Cc: linux-kernel, akpm, alan, zxc
On Wed, 13 Sep 2006, Yingchao Zhou wrote:
>
> The current kernel set PAGE_COPY without write bit. This will cause intermittent non-cosistent data for user-level network drivers such as Infiniband, Quadrics and Myrinet. Which has also be mentioned by Costin Iancu in the paper "HUNTing the Overlap " (PACT'05).
> An example of such phenomena is the following sequences:
> register a memory space BUFF for receive message,
> receive message,
> call mprotect(...PROT_NONE...) and mprotect(...PROT_READ|PROT_WRITE) one by one,
> write into BUFF,
> then receive again.
> The second time received data will perhaps not be the data sent by the peer machine but the data written by itself in the 4th step.
PAGE_COPY (without the write bit) is used when the area was mmap'ed
MAP_PRIVATE: which indeed is asking for private copies of pages to
be made - which will be left containing the data written there by the
application, rather than shared data received later by the driver.
You want to mmap MAP_SHARED, which will use PAGE_SHARED instead,
including the write bit, both before and after the mprotects.
There should be no problem then: do you actually see a problem
when MAP_SHARED is used?
(You don't mention which release you're describing, and some of
the details may vary: the not-yet-started 2.6.19 is likely to use
PAGE_COPY even when MAP_SHARED, to help it keep track of the number
of dirty pages; but in that case, do_wp_page() won't make a copy.)
>
> The reson is that :
> 1) User-level network driver locks phy pages when memory space is registered;
> 2) 2 calls to mprotect change ptes in the space to PAGE_COPY, so write any page in the space will cause a page fault;
Not if PROT_WRITE, MAP_SHARED I think.
> 3) In the page fault handler, it goes to do_wp_page, and in it if Page Is Locked, a new page is generated and filled into the pte. So the physical page seen by the host is not the same one by the NIC.
When MAP_PRIVATE, it's not the page being locked that causes the copy
(it's not normally locked there, is it?), it's that it's not PageAnon;
or if you're looking at 2.6.12 or older, that page_count is raised.
>
> Adding PAGE_RW to PAGE_COPY will resolve this problem.
No! That would give every user write access to shared files they
should have no write access to.
> In my option, the reason for absense of RW is to save memory by mapping all those only read pages into ZERO_PAGE. But is there really programs which make many read-ops in memory space without even initialize them?
Not just the ZERO_PAGE: initial program data is another common example.
Hugh
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: Re: Re: [RFC] PAGE_RW Should be added to PAGE_COPY ?
@ 2006-09-15 3:38 Yingchao Zhou
2006-09-15 4:30 ` Hugh Dickins
0 siblings, 1 reply; 8+ messages in thread
From: Yingchao Zhou @ 2006-09-15 3:38 UTC (permalink / raw)
To: Hugh Dickins; +Cc: linux-kernel, akpm, alan, zxc
>On Fri, 15 Sep 2006, Yingchao Zhou wrote:
>> >
>> >You want to mmap MAP_SHARED, which will use PAGE_SHARED instead,
>> >including the write bit, both before and after the mprotects.
>> >There should be no problem then: do you actually see a problem
>> >when MAP_SHARED is used?
>> It's ok to mmap MAP_SHARED. But is it not a normal way to malloc() a space and
>> then registered to NIC ?
>
>Not that I know of. How would one register malloc()ed space to a NIC?
>Sorry, I may well be misunderstanding you.
The user-level NIC does this, eg. Myrinet...
>
>> >> Adding PAGE_RW to PAGE_COPY will resolve this problem.
>> >
>> >No! That would give every user write access to shared files they
>> >should have no write access to.
>> I guess you refer to mmap a file MAP_READ|MAP_WRITE in MAP_PRIVATE way.
>> I think it is probably more logical to read the file data into an anoymous page and filled the pte with _PAGE_RW in the first time page-fault. It will probably reduce numbers of page fault interrupt.
>
>do_no_page() does just that when its fault demands write access; but
>doesn't waste memory and time on copying when it's only a read access.
>
Yeah, it is. But this is the source of the problem described above.
>Hugh
>
Yingchao Zhou
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: Re: Re: [RFC] PAGE_RW Should be added to PAGE_COPY ?
2006-09-15 3:38 Re: Re: [RFC] PAGE_RW Should be added to PAGE_COPY ? Yingchao Zhou
@ 2006-09-15 4:30 ` Hugh Dickins
2006-09-15 14:19 ` Hugh Dickins
0 siblings, 1 reply; 8+ messages in thread
From: Hugh Dickins @ 2006-09-15 4:30 UTC (permalink / raw)
To: Yingchao Zhou; +Cc: linux-kernel, akpm, alan, zxc
On Fri, 15 Sep 2006, Yingchao Zhou wrote:
> >On Fri, 15 Sep 2006, Yingchao Zhou wrote:
> >> It's ok to mmap MAP_SHARED. But is it not a normal way to malloc() a space and
> >> then registered to NIC ?
> >
> >Not that I know of. How would one register malloc()ed space to a NIC?
> >Sorry, I may well be misunderstanding you.
> The user-level NIC does this, eg. Myrinet...
Okay, thanks, I know nothing of that, and must defer to those who do.
But it sounds broken to me, in the way that you have described.
And the fix would not be to change the kernel's semantics for private
mappings, but for the app to use a shared mapping instead of a private.
Which would indeed be more work for the app than just using malloc,
since it needs some memory object (e.g. tmpfs file?) to map shared.
Hugh
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: Re: Re: [RFC] PAGE_RW Should be added to PAGE_COPY ?
2006-09-15 4:30 ` Hugh Dickins
@ 2006-09-15 14:19 ` Hugh Dickins
2006-09-16 7:42 ` Nick Piggin
0 siblings, 1 reply; 8+ messages in thread
From: Hugh Dickins @ 2006-09-15 14:19 UTC (permalink / raw)
To: Yingchao Zhou; +Cc: linux-kernel, akpm, alan, zxc
On Fri, 15 Sep 2006, Hugh Dickins wrote:
> On Fri, 15 Sep 2006, Yingchao Zhou wrote:
> > >On Fri, 15 Sep 2006, Yingchao Zhou wrote:
> > >> It's ok to mmap MAP_SHARED. But is it not a normal way to malloc()
> > >> a space and then registered to NIC ?
> > >
> > >Not that I know of. How would one register malloc()ed space to a NIC?
> > >Sorry, I may well be misunderstanding you.
> > The user-level NIC does this, eg. Myrinet...
>
> Okay, thanks, I know nothing of that, and must defer to those who do.
>
> But it sounds broken to me, in the way that you have described.
I replied too hastily. I've since looked back to your original mail
(in which you do mention Infiniband as an example, which has now
helped to jog my brain), and realized that I was misinterpreting
what you meant by "user-level network driver".
The model I had in my head was a driver exporting, say, a ringbuffer
for userspace to mmap via device node, the pages being allocated by
the driver. But that's not what you meant at all, and many of my
comments may therefore have been hard to understand, because they
were quite wrong in your scenario - for example, when I said the
page coming to do_wp_page is not PageAnon: in my model it wasn't,
but in your model it is, or may well be, PageAnon.
Sorry for that.
This mprotect'ing back and forth: that's something you'd like your
userspace to be able to do safely on these areas? Infiniband had a
similar problem when the process fork()s, which write-protects private
ptes in both parent and child, danger of incoherency if either then
writes to the area. It's an awkward problem, and in the end we
side-stepped it by adding madvise(,,MADV_DONTFORK) to help exempt
a particular area from propagation into the child.
>
> And the fix would not be to change the kernel's semantics for private
> mappings,
That I do stand by: your idea to include PAGE_RW in PAGE_COPY
would be very wrong.
> but for the app to use a shared mapping instead of a private.
But that suggestion wasn't helpful: you'd much prefer not to
restrict what areas of userspace are used in this way.
The problem, as I now see it, is precisely with do_wp_page()'s
TestSetPageLocked, as you first said. There is indeed a small
but real chance that will fail. At some time in the past I did
realize that, but pushed it to the back of my mind, waiting for
someone actually to complain: now you have.
Yes, it would be good if we could do that check in some other,
reliable way. The problem is that can_share_swap_page has to
check page_mapcount (and PageSwapCache) and page_swapcount in
an atomic way: the page lock is what we have used to guard the
movement between mapcount and swapcount.
I'll try to think whether we can do that better,
but not until next week.
Hugh
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [RFC] PAGE_RW Should be added to PAGE_COPY ?
2006-09-15 14:19 ` Hugh Dickins
@ 2006-09-16 7:42 ` Nick Piggin
2006-09-23 18:51 ` Hugh Dickins
0 siblings, 1 reply; 8+ messages in thread
From: Nick Piggin @ 2006-09-16 7:42 UTC (permalink / raw)
To: Hugh Dickins
Cc: Yingchao Zhou, linux-kernel, akpm, alan, zxc,
Linux Memory Management
(adding linux-mm)
Hugh Dickins wrote:
>>but for the app to use a shared mapping instead of a private.
>
>
> But that suggestion wasn't helpful: you'd much prefer not to
> restrict what areas of userspace are used in this way.
>
> The problem, as I now see it, is precisely with do_wp_page()'s
> TestSetPageLocked, as you first said. There is indeed a small
> but real chance that will fail. At some time in the past I did
> realize that, but pushed it to the back of my mind, waiting for
> someone actually to complain: now you have.
>
> Yes, it would be good if we could do that check in some other,
> reliable way. The problem is that can_share_swap_page has to
> check page_mapcount (and PageSwapCache) and page_swapcount in
> an atomic way: the page lock is what we have used to guard the
> movement between mapcount and swapcount.
>
> I'll try to think whether we can do that better,
> but not until next week.
I don't think TestSetPageLocked is the problem. Indeed you may be
able to get around a few specific cases say, by turning that into
a plain lock_page()... but the problem is still fundamentally COW.
In other words, one should always be able to return 0 from that
can_share_swap_page and have the system continue to work... right?
Because even if you hadn't done that mprotect trick, you may still
have a problem because the page may *have* to be copied on write
if it is shared over fork.
So if we filled in the missing mm/ implementation of VM_DONTCOPY
(and call it MAP_DONTCOPY rather than the confusing MAP_DONTFORK)
such that it withstands such an mprotect sequence, we can then ask
that all userspace drivers do their get_user_pages memory on these
types of vmas.
Would that work?
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [RFC] PAGE_RW Should be added to PAGE_COPY ?
2006-09-16 7:42 ` Nick Piggin
@ 2006-09-23 18:51 ` Hugh Dickins
2006-09-25 2:53 ` Nick Piggin
0 siblings, 1 reply; 8+ messages in thread
From: Hugh Dickins @ 2006-09-23 18:51 UTC (permalink / raw)
To: Nick Piggin
Cc: Yingchao Zhou, linux-kernel, akpm, alan, zxc,
Linux Memory Management
On Sat, 16 Sep 2006, Nick Piggin wrote:
> Hugh Dickins wrote:
>
> > Yes, it would be good if we could do that check in some other,
> > reliable way. The problem is that can_share_swap_page has to
> > check page_mapcount (and PageSwapCache) and page_swapcount in
> > an atomic way: the page lock is what we have used to guard the
> > movement between mapcount and swapcount.
> >
> > I'll try to think whether we can do that better,
> > but not until next week.
I currently believe we can do it without TestSetPageLocked in
do_wp_page(): just with a few memory barriers added. But that
belief may evaporate once I actually focus on each site: it
wouldn't be the first time I've fooled myself like that, so
please keep on the alert, Nick.
>
> I don't think TestSetPageLocked is the problem. Indeed you may be
> able to get around a few specific cases say, by turning that into
> a plain lock_page()...
Hmm, an actual lock_page, that wasn't what I was intending.
Would be simple, I'm trying to remember why I ruled it out earlier.
Oh, yes, we're holding the pte lock there, that's why.
> ... but the problem is still fundamentally COW.
Well, yes, we wouldn't have all these problems if we didn't have
to respect COW. But generally a process can, one way or another,
make sure it won't get into those problems: Yingchao is concerned
with the way the TestSetPageLocked unpredictably upsets correctness.
I'd say it's a more serious error than the general problems with COW.
>
> In other words, one should always be able to return 0 from that
> can_share_swap_page and have the system continue to work... right?
> Because even if you hadn't done that mprotect trick, you may still
> have a problem because the page may *have* to be copied on write
> if it is shared over fork.
Most processes won't fork, and exec has freed them from sharing
their parents pages, and their private file mappings aren't being
used as buffers. Maybe Yingchao will later have to worry about
those cases, but for now it seems not.
>
> So if we filled in the missing mm/ implementation of VM_DONTCOPY
> (and call it MAP_DONTCOPY rather than the confusing MAP_DONTFORK)
> such that it withstands such an mprotect sequence, we can then ask
> that all userspace drivers do their get_user_pages memory on these
> types of vmas.
(madvise MADV_DONTFORK)
For the longest time I couldn't understand you there at all, perhaps
distracted by your parenthetical line: at last I think you're proposing
we tweak mprotect to behave differently on a VM_DONTCOPY area.
But differently in what way? Allow it to ignore Copy-On-Write?
No, of course not. Go down to the struct page level (one of the
nice things about mprotect is that it doesn't need to look at struct
pages at present, except perhaps inside the ia64 lazy_mmu_prot_update),
and decide if it has to do the copying itself instead? Doesn't sound
a good place to do it; and would involve just the same problematic
TestSetPageLocked within pte lock that do_wp_page is doing,
so wouldn't solve anything.
Or I'm still misunderstanding you.
Hugh
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [RFC] PAGE_RW Should be added to PAGE_COPY ?
2006-09-23 18:51 ` Hugh Dickins
@ 2006-09-25 2:53 ` Nick Piggin
0 siblings, 0 replies; 8+ messages in thread
From: Nick Piggin @ 2006-09-25 2:53 UTC (permalink / raw)
To: Hugh Dickins
Cc: Yingchao Zhou, linux-kernel, akpm, alan, zxc,
Linux Memory Management
Hugh Dickins wrote:
>On Sat, 16 Sep 2006, Nick Piggin wrote:
>
>>... but the problem is still fundamentally COW.
>>
>
>Well, yes, we wouldn't have all these problems if we didn't have
>to respect COW. But generally a process can, one way or another,
>make sure it won't get into those problems: Yingchao is concerned
>with the way the TestSetPageLocked unpredictably upsets correctness.
>I'd say it's a more serious error than the general problems with COW.
>
But correctness is no more upset here than with any other reason that
the page gets COWed.
>>In other words, one should always be able to return 0 from that
>>can_share_swap_page and have the system continue to work... right?
>>Because even if you hadn't done that mprotect trick, you may still
>>have a problem because the page may *have* to be copied on write
>>if it is shared over fork.
>>
>
>Most processes won't fork, and exec has freed them from sharing
>their parents pages, and their private file mappings aren't being
>used as buffers. Maybe Yingchao will later have to worry about
>those cases, but for now it seems not.
>
So we should still solve it for once and for all just by turning off
COW completely.
>>So if we filled in the missing mm/ implementation of VM_DONTCOPY
>>(and call it MAP_DONTCOPY rather than the confusing MAP_DONTFORK)
>>such that it withstands such an mprotect sequence, we can then ask
>>that all userspace drivers do their get_user_pages memory on these
>>types of vmas.
>>
>
>(madvise MADV_DONTFORK)
>
>For the longest time I couldn't understand you there at all, perhaps
>distracted by your parenthetical line: at last I think you're proposing
>we tweak mprotect to behave differently on a VM_DONTCOPY area.
>
>But differently in what way? Allow it to ignore Copy-On-Write?
>
Well I think that we should have a flag that just prevents copy
on write from ever happening. Maybe that would mean it be easiest
to implement in mmap rather than as madvise, but that should be
OK.
--
Send instant messages to your online friends http://au.messenger.yahoo.com
^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2006-09-25 9:31 UTC | newest]
Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-09-15 3:38 Re: Re: [RFC] PAGE_RW Should be added to PAGE_COPY ? Yingchao Zhou
2006-09-15 4:30 ` Hugh Dickins
2006-09-15 14:19 ` Hugh Dickins
2006-09-16 7:42 ` Nick Piggin
2006-09-23 18:51 ` Hugh Dickins
2006-09-25 2:53 ` Nick Piggin
-- strict thread matches above, loose matches on Subject: below --
2006-09-13 14:02 Yingchao Zhou
2006-09-14 14:48 ` Hugh Dickins
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox