* pre12 VM doubts and patch
@ 2001-09-19 17:57 Hugh Dickins
2001-09-19 19:42 ` Hugh Dickins
2001-09-20 5:12 ` Andrea Arcangeli
0 siblings, 2 replies; 19+ messages in thread
From: Hugh Dickins @ 2001-09-19 17:57 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Linus Torvalds, Marcelo Tosatti, linux-kernel
Andrea,
I was unable to do much testing with -pre11 because of the spurious
allocation failures, but -pre12 fixes those for my particular testing:
thank you. A few doubts: in the first case I've appended a patch; for
the rest you may prefer to make your own patch, or ask me for patch,
or reject my doubts.
1. Your changes don't quite fit with my -pre4 swapoff changes. In
free_pages_ok you want to treat PageDirty as a BUG, so you removed
my SetPageDirty from try_to_unuse, and inserted a ClearPageDirty
there. Not good: that gives me hangs, or else data loss, after
swapoff. It is a cruel and unusual use of PageDirty (we could use a
new flag for this purpose if it deserved it), but it fixes the race
where try_to_swapout might clear a pte in between the page being
removed from swap cache, and the pte being marked dirty (as it used
to be) in unuse_pte; and allowed me to speed up swapoff, by not
bothering to mark present ptes dirty at all. See brief comment on
PageDirty which remains in try_to_swap_out. So patch below undoes
your changes there (as before, clears PG_referenced too: I didn't
look very hard, but wondered if that bit became undefined when a
page was allocated in -pre11 and -pre12). I also had more safety
where vmscan.c goes from PageDirty to writepage, could a swapoff'ed
page get there with NULL mapping? I've not seen it, just a doubt.
2. swap_out_mm now takes mmlist_lock while holding a page_table_lock,
whereas try_to_unuse takes page_table_lock while holding mmlist_lock.
Yes, try_to_unuse holds mmlist_lock for a horrible length of time (I
just changed it over from tasklist to mmlist), and now I believe the
SWAP_MAP_MAX case to be imaginary, that could probably be changed.
But it does look rather odd what swap_out_mm is doing, and I wonder
if it wouldn't be better if you changed that? You've clearly thought
about the issues at that end, which I have not, so I'm not proposing
a patch, but we do need to resolve the deadlock. As I understand it,
you don't like the way the old code darted away from a large and
fruitful mm, now it scans the whole mm before going on to the next:
sounds plausible, though I wonder, aren't other swapping out cpus
liable to spin on its page_table_lock when they could be attacking
other mms? At a guess, your way works better when a few large mms
(probably the common case?), worse if many.
3. If exclusive_swap_page, do_swap_page unconditionally makes the pte
writable. I think that's wrong, I think that's the same error Linus
corrected in Rik's version there: if it's come in from swap, we know
that the page has been dirtied in the past, but that does not imply
that writing to it is now permitted. I think you need something like
if (write_access)
pte = pte_mkwrite(pte);
pte = pte_mkdirty(pte);
delete_from_swap_cache_nolock(page);
(and please remove that hesitant "#if 1" and "#if 0" from memory.c).
4. In 2.4.10-pre10, Linus removed the SetPageReferenced(page) from
__find_page_nolock, and was adamant on lkml that it's inappropriate
at that level. Later in the day, Linus produced 2.4.10-pre11 from
your patches, and that SetPageReferenced(page) reappeared: oversight
or intentional? Linus? more a question for you than Andrea.
5. With -pre12 I'm not getting the 0-order allocation failures which
interfered with my -pre11 testing, but I did spend a while yesterday
looking into that, and the patch I found successful was to move the
"int nr_pages = SWAP_CLUSTER_MAX;" in try_to_free_pages from within
the loop to the outer level: try_to_free_pages had freed 114 pages
of the zone, but never as many as 32 in any one go round the loop.
You'll have your own ideas of what's right and wrong here, and I'd
rather stay clear of this area, but thought the observation worth
passing on; and note that try_to_free_pages does look like work in
progress - unused order argument, shrink_caches(,,, nr_pages)
instead of explicit shrink_caches(,,, SWAP_CLUSTER_MAX).
Hugh
--- 2.4.10-pre12/mm/page_alloc.c Wed Sep 19 14:08:14 2001
+++ linux/mm/page_alloc.c Wed Sep 19 16:21:46 2001
@@ -86,8 +86,7 @@
BUG();
if (PageInactive(page))
BUG();
- if (PageDirty(page))
- BUG();
+ page->flags &= ~((1<<PG_referenced) | (1<<PG_dirty));
if (current->flags & PF_FREE_PAGES)
goto local_freelist;
--- 2.4.10-pre12/mm/swapfile.c Wed Sep 19 14:08:14 2001
+++ linux/mm/swapfile.c Wed Sep 19 16:08:08 2001
@@ -452,6 +452,7 @@
lock_page(page);
if (PageSwapCache(page))
delete_from_swap_cache_nolock(page);
+ SetPageDirty(page);
UnlockPage(page);
flush_page_to_ram(page);
@@ -492,7 +493,6 @@
mmput(start_mm);
start_mm = new_start_mm;
}
- ClearPageDirty(page);
page_cache_release(page);
/*
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: pre12 VM doubts and patch
2001-09-19 17:57 pre12 VM doubts and patch Hugh Dickins
@ 2001-09-19 19:42 ` Hugh Dickins
2001-09-19 21:28 ` Andrea Arcangeli
2001-09-20 5:12 ` Andrea Arcangeli
1 sibling, 1 reply; 19+ messages in thread
From: Hugh Dickins @ 2001-09-19 19:42 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Linus Torvalds, Marcelo Tosatti, linux-kernel
On Wed, 19 Sep 2001, Hugh Dickins wrote:
>
> thank you. A few doubts: in the first case I've appended a patch; for
> the rest you may prefer to make your own patch, or ask me for patch,
> or reject my doubts.
Please add another:
6. Why has swap_writepage lost its check for stale entries? If your
other changes have somehow made that too rare a case to bother
about, please remove the comment above swap_writepage instead.
Hugh
--- 2.4.10-pre12/mm/swap_state.c Wed Sep 19 14:05:54 2001
+++ linux/mm/swap_state.c Mon Sep 17 06:30:26 2001
@@ -23,6 +23,17 @@
*/
static int swap_writepage(struct page *page)
{
+ /* One for the page cache, one for this user, one for page->buffers */
+ if (page_count(page) > 2 + !!page->buffers)
+ goto in_use;
+ if (swap_count(page) > 1)
+ goto in_use;
+
+ delete_from_swap_cache_nolock(page);
+ UnlockPage(page);
+ return 0;
+
+in_use:
rw_swap_page(WRITE, page);
return 0;
}
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: pre12 VM doubts and patch
2001-09-19 19:42 ` Hugh Dickins
@ 2001-09-19 21:28 ` Andrea Arcangeli
2001-09-19 23:16 ` Linus Torvalds
2001-09-19 23:51 ` Hugh Dickins
0 siblings, 2 replies; 19+ messages in thread
From: Andrea Arcangeli @ 2001-09-19 21:28 UTC (permalink / raw)
To: Hugh Dickins; +Cc: Linus Torvalds, Marcelo Tosatti, linux-kernel
On Wed, Sep 19, 2001 at 08:42:39PM +0100, Hugh Dickins wrote:
> --- 2.4.10-pre12/mm/swap_state.c Wed Sep 19 14:05:54 2001
> +++ linux/mm/swap_state.c Mon Sep 17 06:30:26 2001
> @@ -23,6 +23,17 @@
> */
> static int swap_writepage(struct page *page)
> {
> + /* One for the page cache, one for this user, one for page->buffers */
> + if (page_count(page) > 2 + !!page->buffers)
this is racy, you have to spin_lock(&pagecache_lock) before you can
expect the page_count() stays constant. then after you checked the page
has count == 1, you must atomically drop it from the pagecache so it's
not visible anymore to the swapin lookups.
Another way to fix the race is to change lookup_swap_cache to do
find_lock_page instead of find_get_page, and then check the page is
still a swapcachepage after you got it locked (that was the old way,
somebody changed it and introduced the race, I like lookup_swap_cache to
use find_get_page so I dropped such check to fix it, it was a minor
optimization but yes probably worthwhile to reintroduce after addressing
this race in one of the two ways described).
It is also buggy, if something it should be "page_count(page) != 1" (not
!= 2).
> + goto in_use;
> + if (swap_count(page) > 1)
> + goto in_use;
> +
> + delete_from_swap_cache_nolock(page);
> + UnlockPage(page);
> + return 0;
> +
> +in_use:
> rw_swap_page(WRITE, page);
> return 0;
> }
Andrea
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: pre12 VM doubts and patch
2001-09-19 21:28 ` Andrea Arcangeli
@ 2001-09-19 23:16 ` Linus Torvalds
2001-09-19 23:31 ` Andrea Arcangeli
2001-09-19 23:51 ` Hugh Dickins
1 sibling, 1 reply; 19+ messages in thread
From: Linus Torvalds @ 2001-09-19 23:16 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Hugh Dickins, Marcelo Tosatti, linux-kernel
On Wed, 19 Sep 2001, Andrea Arcangeli wrote:
>
> On Wed, Sep 19, 2001 at 08:42:39PM +0100, Hugh Dickins wrote:
> > --- 2.4.10-pre12/mm/swap_state.c Wed Sep 19 14:05:54 2001
> > +++ linux/mm/swap_state.c Mon Sep 17 06:30:26 2001
> > @@ -23,6 +23,17 @@
> > */
> > static int swap_writepage(struct page *page)
> > {
> > + /* One for the page cache, one for this user, one for page->buffers */
> > + if (page_count(page) > 2 + !!page->buffers)
>
> this is racy, you have to spin_lock(&pagecache_lock) before you can
> expect the page_count() stays constant. then after you checked the page
> has count == 1, you must atomically drop it from the pagecache so it's
> not visible anymore to the swapin lookups.
No.
Note how it is a _heuristic_ only. The "safe" answer is always to say "the
page is in use", and note that once the page_count has dropped to 2 or
less, it won't increase unless somebody else has a swap count..
And we check for the "somebody else has a swap count" two lines lower.
(And the swap count is stable at 1 while the page is locked, because
nobody can increase the swap count without locking the page and dropping
it from their VM).
Do you see anything wrong with that logic?
Linus
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: pre12 VM doubts and patch
2001-09-19 23:16 ` Linus Torvalds
@ 2001-09-19 23:31 ` Andrea Arcangeli
2001-09-19 23:49 ` Andrea Arcangeli
2001-09-20 1:04 ` Jeff Chua
0 siblings, 2 replies; 19+ messages in thread
From: Andrea Arcangeli @ 2001-09-19 23:31 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Hugh Dickins, Marcelo Tosatti, linux-kernel
On Wed, Sep 19, 2001 at 04:16:13PM -0700, Linus Torvalds wrote:
>
> On Wed, 19 Sep 2001, Andrea Arcangeli wrote:
> >
> > On Wed, Sep 19, 2001 at 08:42:39PM +0100, Hugh Dickins wrote:
> > > --- 2.4.10-pre12/mm/swap_state.c Wed Sep 19 14:05:54 2001
> > > +++ linux/mm/swap_state.c Mon Sep 17 06:30:26 2001
> > > @@ -23,6 +23,17 @@
> > > */
> > > static int swap_writepage(struct page *page)
> > > {
> > > + /* One for the page cache, one for this user, one for page->buffers */
> > > + if (page_count(page) > 2 + !!page->buffers)
> >
> > this is racy, you have to spin_lock(&pagecache_lock) before you can
> > expect the page_count() stays constant. then after you checked the page
> > has count == 1, you must atomically drop it from the pagecache so it's
> > not visible anymore to the swapin lookups.
>
> No.
>
> Note how it is a _heuristic_ only. The "safe" answer is always to say "the
> page is in use", and note that once the page_count has dropped to 2 or
> less, it won't increase unless somebody else has a swap count..
the "somebody else has a swap count" is interesting. so we rely on the
fact any swap_duplicate is always run before the swapcache is unlocked.
Also the "> 2" should be "> 1", but really I noticed I cannot avoid
getting a reference so please apply also this patch instead of replacing
"> 2" with "> 1" (it was a race condition):
--- 2.4.10pre11aa1/mm/vmscan.c.~1~ Tue Sep 18 21:23:49 2001
+++ 2.4.10pre11aa1/mm/vmscan.c Thu Sep 20 01:29:58 2001
@@ -415,7 +415,10 @@
spin_unlock(&pagemap_lru_lock);
ClearPageDirty(page);
+
+ page_cache_get(page);
writepage(page);
+ page_cache_release(page);
spin_lock(&pagemap_lru_lock);
continue;
> And we check for the "somebody else has a swap count" two lines lower.
I see.
> Do you see anything wrong with that logic?
Looks ok now :), I guess it would be good to write a comment now, so
maybe other people won't share my worry, the swap_count thing wasn't
very obvious. thanks,
Andrea
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: pre12 VM doubts and patch
2001-09-19 23:31 ` Andrea Arcangeli
@ 2001-09-19 23:49 ` Andrea Arcangeli
2001-09-20 1:04 ` Jeff Chua
1 sibling, 0 replies; 19+ messages in thread
From: Andrea Arcangeli @ 2001-09-19 23:49 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Hugh Dickins, Marcelo Tosatti, linux-kernel
On Thu, Sep 20, 2001 at 01:31:53AM +0200, Andrea Arcangeli wrote:
> --- 2.4.10pre11aa1/mm/vmscan.c.~1~ Tue Sep 18 21:23:49 2001
> +++ 2.4.10pre11aa1/mm/vmscan.c Thu Sep 20 01:29:58 2001
> @@ -415,7 +415,10 @@
> spin_unlock(&pagemap_lru_lock);
>
> ClearPageDirty(page);
> +
> + page_cache_get(page);
> writepage(page);
> + page_cache_release(page);
>
> spin_lock(&pagemap_lru_lock);
> continue;
then in turn we can backout this bit (patch -R)
diff -urN 2.4.10pre11/mm/page_io.c 2.4.10pre12/mm/page_io.c
--- 2.4.10pre11/mm/page_io.c Tue May 1 19:35:33 2001
+++ 2.4.10pre12/mm/page_io.c Thu Sep 20 01:44:20 2001
@@ -78,7 +78,15 @@
if (!wait) {
SetPageDecrAfter(page);
atomic_inc(&nr_async_pages);
- }
+ } else
+ /*
+ * Must hold a reference until after wait_on_page()
+ * returned or it could be freed by the VM after
+ * I/O is completed and the page is been unlocked.
+ * The asynchronous path is fine since it never
+ * references the page after brw_page().
+ */
+ page_cache_get(page);
/* block_size == PAGE_SIZE/zones_used */
brw_page(rw, page, dev, zones, block_size);
@@ -94,6 +102,7 @@
/* This shouldn't happen, but check to be sure. */
if (page_count(page) == 0)
printk(KERN_ERR "rw_swap_page: page unused while waiting!\n");
+ page_cache_release(page);
return 1;
}
Andrea
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: pre12 VM doubts and patch
2001-09-19 21:28 ` Andrea Arcangeli
2001-09-19 23:16 ` Linus Torvalds
@ 2001-09-19 23:51 ` Hugh Dickins
2001-09-20 0:01 ` Andrea Arcangeli
1 sibling, 1 reply; 19+ messages in thread
From: Hugh Dickins @ 2001-09-19 23:51 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Linus Torvalds, Marcelo Tosatti, linux-kernel
On Wed, 19 Sep 2001, Andrea Arcangeli wrote:
> On Wed, Sep 19, 2001 at 08:42:39PM +0100, Hugh Dickins wrote:
> > --- 2.4.10-pre12/mm/swap_state.c Wed Sep 19 14:05:54 2001
> > +++ linux/mm/swap_state.c Mon Sep 17 06:30:26 2001
> > @@ -23,6 +23,17 @@
> > */
> > static int swap_writepage(struct page *page)
> > {
> > + /* One for the page cache, one for this user, one for page->buffers */
> > + if (page_count(page) > 2 + !!page->buffers)
>
> this is racy, you have to spin_lock(&pagecache_lock) before you can
> expect the page_count() stays constant. then after you checked the page
> has count == 1, you must atomically drop it from the pagecache so it's
> not visible anymore to the swapin lookups.
Locking on pagecache_lock is no way to stabilize page count, but this
doesn't need it stabilized: it's just checking there's nothing but us
which can be interested in the page. Okay, we now know that
read_swap_cache_async can get "interested" in surprising pages (when
the swap entry it asks for has meanwhile been replaced), but I don't
see how that endangers the logic here - it could briefly bump count
too high to pass the "don't write" test, but that errs on the safe side.
If you think this racy, how come you still allow "exclusive_swap_page"
use elsewhere? Actually, I think these tests would be better replaced
by use of "exclusive_swap_page", wouldn't they? But perhaps I'll find
I'm off-by-one when I try it tomorrow.
> Another way to fix the race is to change lookup_swap_cache to do
> find_lock_page instead of find_get_page, and then check the page is
> still a swapcachepage after you got it locked (that was the old way,
> somebody changed it and introduced the race, I like lookup_swap_cache to
> use find_get_page so I dropped such check to fix it, it was a minor
> optimization but yes probably worthwhile to reintroduce after addressing
> this race in one of the two ways described).
Well, I certainly agree the system can survive without this optimization,
and I wouldn't want to restore it if it were buggy. I just don't see
the scenario you're afraid of, and nobody had questioned it before.
> It is also buggy, if something it should be "page_count(page) != 1" (not
> != 2).
I don't think so: as the comment says, one for the page cache,
one for the caller of writepage, one (perhaps) for page->buffers.
Hugh
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: pre12 VM doubts and patch
2001-09-19 23:51 ` Hugh Dickins
@ 2001-09-20 0:01 ` Andrea Arcangeli
0 siblings, 0 replies; 19+ messages in thread
From: Andrea Arcangeli @ 2001-09-20 0:01 UTC (permalink / raw)
To: Hugh Dickins; +Cc: Linus Torvalds, Marcelo Tosatti, linux-kernel
On Thu, Sep 20, 2001 at 12:51:06AM +0100, Hugh Dickins wrote:
> I'm off-by-one when I try it tomorrow.
yes you're off by one but that was because of something I wasn't allowed
to do safely ;) sorry (see my last email to Linus).
> I don't think so: as the comment says, one for the page cache,
> one for the caller of writepage, one (perhaps) for page->buffers.
btw, we should be even smarter, even if there are buffers we should
still drop both the buffers and then the page from the pagecaceh. It's
just an orphan, it must be collected away cleanly without any I/O even
if there are buffers. The other option would be to cleanup the orphans
from free_page_and_swap_cache but that would not be optimal as we try to
be lazy and to avoid the swap_count checks in the exit(2)/munmap(2) fast
paths.
Andrea
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: pre12 VM doubts and patch
2001-09-19 23:31 ` Andrea Arcangeli
2001-09-19 23:49 ` Andrea Arcangeli
@ 2001-09-20 1:04 ` Jeff Chua
2001-09-20 1:05 ` Andrea Arcangeli
1 sibling, 1 reply; 19+ messages in thread
From: Jeff Chua @ 2001-09-20 1:04 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Linux Kernel, Jeff Chua
I still can't find pre12. The latest I've seen is pre11.
Where can I get pre12?
Thanks,
Jeff
[ jchua@fedex.com ]
On Thu, 20 Sep 2001, Andrea Arcangeli wrote:
> On Wed, Sep 19, 2001 at 04:16:13PM -0700, Linus Torvalds wrote:
> >
> > On Wed, 19 Sep 2001, Andrea Arcangeli wrote:
> > >
> > > On Wed, Sep 19, 2001 at 08:42:39PM +0100, Hugh Dickins wrote:
> > > > --- 2.4.10-pre12/mm/swap_state.c Wed Sep 19 14:05:54 2001
> > > > +++ linux/mm/swap_state.c Mon Sep 17 06:30:26 2001
> > > > @@ -23,6 +23,17 @@
> > > > */
> > > > static int swap_writepage(struct page *page)
> > > > {
> > > > + /* One for the page cache, one for this user, one for page->buffers */
> > > > + if (page_count(page) > 2 + !!page->buffers)
> > >
> > > this is racy, you have to spin_lock(&pagecache_lock) before you can
> > > expect the page_count() stays constant. then after you checked the page
> > > has count == 1, you must atomically drop it from the pagecache so it's
> > > not visible anymore to the swapin lookups.
> >
> > No.
> >
> > Note how it is a _heuristic_ only. The "safe" answer is always to say "the
> > page is in use", and note that once the page_count has dropped to 2 or
> > less, it won't increase unless somebody else has a swap count..
>
> the "somebody else has a swap count" is interesting. so we rely on the
> fact any swap_duplicate is always run before the swapcache is unlocked.
>
> Also the "> 2" should be "> 1", but really I noticed I cannot avoid
> getting a reference so please apply also this patch instead of replacing
> "> 2" with "> 1" (it was a race condition):
>
> --- 2.4.10pre11aa1/mm/vmscan.c.~1~ Tue Sep 18 21:23:49 2001
> +++ 2.4.10pre11aa1/mm/vmscan.c Thu Sep 20 01:29:58 2001
> @@ -415,7 +415,10 @@
> spin_unlock(&pagemap_lru_lock);
>
> ClearPageDirty(page);
> +
> + page_cache_get(page);
> writepage(page);
> + page_cache_release(page);
>
> spin_lock(&pagemap_lru_lock);
> continue;
>
>
> > And we check for the "somebody else has a swap count" two lines lower.
>
> I see.
>
> > Do you see anything wrong with that logic?
>
> Looks ok now :), I guess it would be good to write a comment now, so
> maybe other people won't share my worry, the swap_count thing wasn't
> very obvious. thanks,
>
> Andrea
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: pre12 VM doubts and patch
2001-09-20 1:04 ` Jeff Chua
@ 2001-09-20 1:05 ` Andrea Arcangeli
2001-09-20 1:12 ` Jeff Chua
0 siblings, 1 reply; 19+ messages in thread
From: Andrea Arcangeli @ 2001-09-20 1:05 UTC (permalink / raw)
To: Jeff Chua; +Cc: Linux Kernel, Jeff Chua
On Thu, Sep 20, 2001 at 09:04:40AM +0800, Jeff Chua wrote:
>
> I still can't find pre12. The latest I've seen is pre11.
>
> Where can I get pre12?
http://kernel.dk/testing
Andrea
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: pre12 VM doubts and patch
2001-09-20 1:05 ` Andrea Arcangeli
@ 2001-09-20 1:12 ` Jeff Chua
0 siblings, 0 replies; 19+ messages in thread
From: Jeff Chua @ 2001-09-20 1:12 UTC (permalink / raw)
To: Andrea Arcangeli, Daniel T. Chen; +Cc: Linux Kernel
On Thu, 20 Sep 2001, Andrea Arcangeli wrote:
> On Thu, Sep 20, 2001 at 09:04:40AM +0800, Jeff Chua wrote:
> >
> > I still can't find pre12. The latest I've seen is pre11.
> >
> > Where can I get pre12?
>
> http://kernel.dk/testing
>
> Andrea
Thanks for the quick response. I'll test it right away.
Jeff
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: pre12 VM doubts and patch
2001-09-19 17:57 pre12 VM doubts and patch Hugh Dickins
2001-09-19 19:42 ` Hugh Dickins
@ 2001-09-20 5:12 ` Andrea Arcangeli
2001-09-20 5:58 ` Linus Torvalds
1 sibling, 1 reply; 19+ messages in thread
From: Andrea Arcangeli @ 2001-09-20 5:12 UTC (permalink / raw)
To: Hugh Dickins; +Cc: Linus Torvalds, Marcelo Tosatti, linux-kernel
On Wed, Sep 19, 2001 at 06:57:20PM +0100, Hugh Dickins wrote:
> 3. If exclusive_swap_page, do_swap_page unconditionally makes the pte
> writable. I think that's wrong, I think that's the same error Linus
> corrected in Rik's version there: if it's come in from swap, we know
> that the page has been dirtied in the past, but that does not imply
> that writing to it is now permitted. I think you need something like
> if (write_access)
> pte = pte_mkwrite(pte);
> pte = pte_mkdirty(pte);
> delete_from_swap_cache_nolock(page);
when the page is exclusive we definitely can write to it, dropping the
page from the swapcache and lefting the pte wrprotected just asks for a
cow page fault that will simply alloc another page, copy the old one
into the new one and finally free (really free) the old one. so I think
that part is correct.
> (and please remove that hesitant "#if 1" and "#if 0" from memory.c).
The #if 1 is for the persistence option "the swap waste thing". If you
can afford to waste swap space you can possibly swapout anonymous pages at
no I/Ocost, and also get virtually consecutive addresses more likely to
be physically consecutive on disk too, and I cannot exclude somebody
with very huge swapspace can afford to keep all its anonymous pages in
swapcache as well. This is why I didn't dropped it (yet). But I don't
think it makes an huge difference (the #if is just to make easy to
switch behaviour to give the other one a spin)
> 4. In 2.4.10-pre10, Linus removed the SetPageReferenced(page) from
> __find_page_nolock, and was adamant on lkml that it's inappropriate
> at that level. Later in the day, Linus produced 2.4.10-pre11 from
> your patches, and that SetPageReferenced(page) reappeared: oversight
> or intentional? Linus? more a question for you than Andrea.
Intentional. Really I moved it away even before Linus, I even did a patch
where readahead isn't marked referenced at all with perfect accounting
but from the numbers it seems we don't want to shrink readahead until we
do the cache hit, so I just moved things back waiting for new
experiments on that area :)
> 5. With -pre12 I'm not getting the 0-order allocation failures which
> interfered with my -pre11 testing, but I did spend a while yesterday
> looking into that, and the patch I found successful was to move the
> "int nr_pages = SWAP_CLUSTER_MAX;" in try_to_free_pages from within
> the loop to the outer level: try_to_free_pages had freed 114 pages
> of the zone, but never as many as 32 in any one go round the loop.
I see, infact it was originally written that way :). But did you also
checked OOM was still handled gracefully after that?
> You'll have your own ideas of what's right and wrong here, and I'd
Such change isn't bad, you may want to give it a spin again and check
how oom reacts and how swapout behaviour reacts. I'm not changing
anything in that area at the moment unless(/until? :) somebody complains
about performance.
> --- 2.4.10-pre12/mm/page_alloc.c Wed Sep 19 14:08:14 2001
> +++ linux/mm/page_alloc.c Wed Sep 19 16:21:46 2001
> @@ -86,8 +86,7 @@
> BUG();
> if (PageInactive(page))
> BUG();
> - if (PageDirty(page))
> - BUG();
> + page->flags &= ~((1<<PG_referenced) | (1<<PG_dirty));
>
> if (current->flags & PF_FREE_PAGES)
> goto local_freelist;
> --- 2.4.10-pre12/mm/swapfile.c Wed Sep 19 14:08:14 2001
> +++ linux/mm/swapfile.c Wed Sep 19 16:08:08 2001
> @@ -452,6 +452,7 @@
> lock_page(page);
> if (PageSwapCache(page))
> delete_from_swap_cache_nolock(page);
> + SetPageDirty(page);
> UnlockPage(page);
> flush_page_to_ram(page);
>
> @@ -492,7 +493,6 @@
> mmput(start_mm);
> start_mm = new_start_mm;
> }
> - ClearPageDirty(page);
> page_cache_release(page);
I dislike it but fine with me for now. BTW, I was aware I wasn't really
correct in such change, see the first description of the vm patch:
I probably have a bug in swapoff but let's ignore it for now, just
try to run swapoff only before shutting down the machine. The fact
is that the 2.4 VM is broken freeing physically dirty pages.
The last owner of the page (usually the VM except in swapoff) has to
clear the dirty flag before freeing the page, in swapoff it may
be a little more complicate (we may need to grab the pagecache_lock
to ensure nobody start using the page while we clear it). And swapoff
is probably racy anyways as usual (swapoff in 2.2 is racy too). In
short I didn't focused on swapoff yet, I just made an hack to make it
to work while shutting down the machine so far.
:)
Andrea
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: pre12 VM doubts and patch
2001-09-20 5:12 ` Andrea Arcangeli
@ 2001-09-20 5:58 ` Linus Torvalds
2001-09-20 6:08 ` Andrea Arcangeli
0 siblings, 1 reply; 19+ messages in thread
From: Linus Torvalds @ 2001-09-20 5:58 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Hugh Dickins, Marcelo Tosatti, linux-kernel
On Thu, 20 Sep 2001, Andrea Arcangeli wrote:
>
> when the page is exclusive we definitely can write to it
NO!
If it is a read-only mapping, we must NOT mark it writable.
The fact is, the page may have been written to earlier, marked read-only
with mprotect(), and the page is dirty but read-only, and swapping it in
MUST NOT markt it writable even if it is our last exclusive copy.
Which we've gotten wrong for a long time, actually. But you #if 0'ed the
fix that happened fairly recently.
Linus
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: pre12 VM doubts and patch
2001-09-20 5:58 ` Linus Torvalds
@ 2001-09-20 6:08 ` Andrea Arcangeli
2001-09-20 6:15 ` Linus Torvalds
2001-09-20 6:26 ` Andrea Arcangeli
0 siblings, 2 replies; 19+ messages in thread
From: Andrea Arcangeli @ 2001-09-20 6:08 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Hugh Dickins, Marcelo Tosatti, linux-kernel
On Wed, Sep 19, 2001 at 10:58:43PM -0700, Linus Torvalds wrote:
>
> On Thu, 20 Sep 2001, Andrea Arcangeli wrote:
> >
> > when the page is exclusive we definitely can write to it
>
> NO!
>
> If it is a read-only mapping, we must NOT mark it writable.
>
> The fact is, the page may have been written to earlier, marked read-only
> with mprotect(), and the page is dirty but read-only, and swapping it in
ah, I didn't thought at mprotect.
> MUST NOT markt it writable even if it is our last exclusive copy.
>
> Which we've gotten wrong for a long time, actually. But you #if 0'ed the
> fix that happened fairly recently.
hmm, the stuff inside #if 0 doesn't seem to be correct either there,
write_access doesn't mean we have the right to write to it, it just mean
we're trying to.
anyways here it is the fix:
--- 2.4.10pre12aa1/mm/memory.c.~1~ Thu Sep 20 07:20:03 2001
+++ 2.4.10pre12aa1/mm/memory.c Thu Sep 20 08:06:29 2001
@@ -1155,15 +1155,8 @@
pte = mk_pte(page, vma->vm_page_prot);
swap_free(entry);
- if (exclusive_swap_page(page)) {
-#if 0
- if (write_access)
- pte = pte_mkwrite(pte_mkdirty(pte));
-#else
+ if (exclusive_swap_page(page))
delete_from_swap_cache_nolock(page);
- pte = pte_mkwrite(pte_mkdirty(pte));
-#endif
- }
UnlockPage(page);
flush_page_to_ram(page);
thanks,
Andrea
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: pre12 VM doubts and patch
2001-09-20 6:08 ` Andrea Arcangeli
@ 2001-09-20 6:15 ` Linus Torvalds
2001-09-20 6:34 ` Andrea Arcangeli
2001-09-20 6:26 ` Andrea Arcangeli
1 sibling, 1 reply; 19+ messages in thread
From: Linus Torvalds @ 2001-09-20 6:15 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Hugh Dickins, Marcelo Tosatti, linux-kernel
On Thu, 20 Sep 2001, Andrea Arcangeli wrote:
>
> hmm, the stuff inside #if 0 doesn't seem to be correct either there,
> write_access doesn't mean we have the right to write to it, it just mean
> we're trying to.
No, write_access means not only that we are trying to write to it, but it
(by implication) means that we have the right too - otherwise we would
have SIGSEGV'd.
Anyway, I'm not at all sure that the write_access test is worth it, we
could just never do it (like your patch), or we could test if we allow COW
and do early COW (which the write-access kind of means).
However, your patch isn't right for another reason: if we do delete it
from the swap cache, we'd better mark it dirty so that it gets
re-allocated a swap entry if it later on needs it.
That's why the old code went to such extremes: it marked it dirty and
writable if it was a write access (and exclusive), and it marked it _just_
dirty and removed it from the swap cache if it went over the swap limit.
Whether that complexity is worth it, I don't know.
Linus
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: pre12 VM doubts and patch
2001-09-20 6:08 ` Andrea Arcangeli
2001-09-20 6:15 ` Linus Torvalds
@ 2001-09-20 6:26 ` Andrea Arcangeli
2001-09-20 6:31 ` Linus Torvalds
1 sibling, 1 reply; 19+ messages in thread
From: Andrea Arcangeli @ 2001-09-20 6:26 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Hugh Dickins, Marcelo Tosatti, linux-kernel
On Thu, Sep 20, 2001 at 08:08:37AM +0200, Andrea Arcangeli wrote:
> --- 2.4.10pre12aa1/mm/memory.c.~1~ Thu Sep 20 07:20:03 2001
> +++ 2.4.10pre12aa1/mm/memory.c Thu Sep 20 08:06:29 2001
> @@ -1155,15 +1155,8 @@
> pte = mk_pte(page, vma->vm_page_prot);
>
> swap_free(entry);
> - if (exclusive_swap_page(page)) {
> -#if 0
> - if (write_access)
> - pte = pte_mkwrite(pte_mkdirty(pte));
> -#else
> + if (exclusive_swap_page(page))
> delete_from_swap_cache_nolock(page);
> - pte = pte_mkwrite(pte_mkdirty(pte));
> -#endif
> - }
forget that, of course if write_access is set it means we can write to
the page or the page fault would be finished much earlier. This looks
much better:
--- 2.4.10pre12aa1/mm/memory.c.~1~ Thu Sep 20 07:20:03 2001
+++ 2.4.10pre12aa1/mm/memory.c Thu Sep 20 08:24:03 2001
@@ -1156,13 +1156,11 @@
swap_free(entry);
if (exclusive_swap_page(page)) {
-#if 0
if (write_access)
pte = pte_mkwrite(pte_mkdirty(pte));
-#else
+ else if (vma->vm_flags & VM_WRITE)
+ pte = pte_mkwrite(pte);
delete_from_swap_cache_nolock(page);
- pte = pte_mkwrite(pte_mkdirty(pte));
-#endif
}
UnlockPage(page);
(we'll avoid the page fault this way in the "read" swapin too, by
checking if we're allowed to write to it or not, looks worthwhile check,
however it matters only for the archs where the dirty bit is set in
hardware)
Andrea
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: pre12 VM doubts and patch
2001-09-20 6:26 ` Andrea Arcangeli
@ 2001-09-20 6:31 ` Linus Torvalds
2001-09-20 6:38 ` Andrea Arcangeli
0 siblings, 1 reply; 19+ messages in thread
From: Linus Torvalds @ 2001-09-20 6:31 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Hugh Dickins, Marcelo Tosatti, linux-kernel
On Thu, 20 Sep 2001, Andrea Arcangeli wrote:
>
> forget that, of course if write_access is set it means we can write to
> the page or the page fault would be finished much earlier. This looks
> much better:
You still need to mark the pte dirty. Becuase you know that it _is_ dirty,
or it wouldn't have had a swap cache entry allocated to it.
Or you need to not drop the swap cache.
Failing to mark it dirty will cause all kind sof interesting problems when
pages suddenly become zero-filled when the VM scanning just drops the old
contents (which sure makes for good performance, but not for nice
behaviour ;)
So I would suggest:
if (exclusive_swap_page(page)) {
if (vma->vm_flags & VM_WRITE)
pte = pte_mkwrite(pte);
pte = pte_mkdirty(pte);
delete_from_swap_cache_nolock(page);
}
or similar.
Linus
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: pre12 VM doubts and patch
2001-09-20 6:15 ` Linus Torvalds
@ 2001-09-20 6:34 ` Andrea Arcangeli
0 siblings, 0 replies; 19+ messages in thread
From: Andrea Arcangeli @ 2001-09-20 6:34 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Hugh Dickins, Marcelo Tosatti, linux-kernel
On Wed, Sep 19, 2001 at 11:15:47PM -0700, Linus Torvalds wrote:
> and do early COW (which the write-access kind of means).
my new patch does the early COW.
> However, your patch isn't right for another reason: if we do delete it
> from the swap cache, we'd better mark it dirty so that it gets
> re-allocated a swap entry if it later on needs it.
>
> That's why the old code went to such extremes: it marked it dirty and
> writable if it was a write access (and exclusive), and it marked it _just_
> dirty and removed it from the swap cache if it went over the swap limit.
agreed.
Andrea
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: pre12 VM doubts and patch
2001-09-20 6:31 ` Linus Torvalds
@ 2001-09-20 6:38 ` Andrea Arcangeli
0 siblings, 0 replies; 19+ messages in thread
From: Andrea Arcangeli @ 2001-09-20 6:38 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Hugh Dickins, Marcelo Tosatti, linux-kernel
On Wed, Sep 19, 2001 at 11:31:45PM -0700, Linus Torvalds wrote:
> So I would suggest:
>
> if (exclusive_swap_page(page)) {
> if (vma->vm_flags & VM_WRITE)
> pte = pte_mkwrite(pte);
> pte = pte_mkdirty(pte);
> delete_from_swap_cache_nolock(page);
> }
>
> or similar.
indeed, agreed.
Andrea
^ permalink raw reply [flat|nested] 19+ messages in thread
end of thread, other threads:[~2001-09-20 6:38 UTC | newest]
Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2001-09-19 17:57 pre12 VM doubts and patch Hugh Dickins
2001-09-19 19:42 ` Hugh Dickins
2001-09-19 21:28 ` Andrea Arcangeli
2001-09-19 23:16 ` Linus Torvalds
2001-09-19 23:31 ` Andrea Arcangeli
2001-09-19 23:49 ` Andrea Arcangeli
2001-09-20 1:04 ` Jeff Chua
2001-09-20 1:05 ` Andrea Arcangeli
2001-09-20 1:12 ` Jeff Chua
2001-09-19 23:51 ` Hugh Dickins
2001-09-20 0:01 ` Andrea Arcangeli
2001-09-20 5:12 ` Andrea Arcangeli
2001-09-20 5:58 ` Linus Torvalds
2001-09-20 6:08 ` Andrea Arcangeli
2001-09-20 6:15 ` Linus Torvalds
2001-09-20 6:34 ` Andrea Arcangeli
2001-09-20 6:26 ` Andrea Arcangeli
2001-09-20 6:31 ` Linus Torvalds
2001-09-20 6:38 ` Andrea Arcangeli
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox