Re: [PATCH] mm: Implement Swap Prefetching v23

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Con Kolivas <kernel@kolivas.org>
To: Andrew Morton <akpm@osdl.org>
Cc: nickpiggin@yahoo.com.au, linux-mm@kvack.org, ck@vds.kolivas.org,
	pj@sgi.com, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] mm: Implement Swap Prefetching v23
Date: Fri, 10 Feb 2006 14:49:58 +1100	[thread overview]
Message-ID: <200602101449.59486.kernel@kolivas.org> (raw)
In-Reply-To: <20060209192556.2629e36b.akpm@osdl.org>

On Friday 10 February 2006 14:25, Andrew Morton wrote:
> Con Kolivas <kernel@kolivas.org> wrote:
> > Here's a respin with Nick's suggestions and a modification to not cost us
> > extra slab on non-numa.
>
> v23?  I'm sure we can do better than that.

:D

> > This patch implements swap prefetching when the vm is relatively idle and
> > there is free ram available.
>
> I think "free ram available" is the critical thing here.  If it doesn't
> evict anyhing else then OK, it basically uses unutilised disk bandwidth for
> free.
>
> But where does it put the pages?  If it was really "free", they'd go onto
> the tail of the inactive list.

It puts them in swapcache. This seems to work nicely as a nowhere-land place 
where they don't have much affect on anything until we need them or need more 
ram. This has worked well, but I'm open to other suggestions.

> And what about all those unpaged text pages which the app will need to
> fault back in?

Tracking these would make the perceived time to wakeup much much faster but 
also make the list a heck of a lot larger.. but more to the point I have no 
idea how to do it and get what we really want on the list.

> > Once pages have been added to the swapped list, a timer is started,
> > testing for conditions suitable to prefetch swap pages every 5 seconds.
> > Suitable conditions are defined as lack of swapping out or in any pages,
> > and no watermark tests failing. Significant amounts of dirtied ram and
> > changes in free ram representing disk writes or reads also prevent
> > prefetching.
>
> OK.   The has-the-disk-been-used-recently test still isn't there?

No, but you'd have to free up +/- SWAP_SCAN_MAX as many pages as you fill when 
reading from disk for this indirect method to not pick it up. It's easy 
enough to see that it's effective: You can sit and watch vmstat for many 
minutes after say an oom-kill in the hope you see it quiet enough before 
prefetch does anything. After allowing 'tail -f /dev/zero' to be oomkilled in 
a full gui environment it usually takes >5 mins before swap prefetch starts 
prefetching.

> > @@ -844,6 +844,7 @@ int zone_watermark_ok(struct zone *z, in
> >  		if (free_pages <= min)
> >  			return 0;
> >  	}
> > +
> >  	return 1;
> >  }
>
> ?

Legacy of v1->v22..

>
> > +	read_lock(&swapper_space.tree_lock);
>
> That's interesting.  We conventionally do read_lock_irq() on an
> address_space.tree_lock.  Because
> end_page_writeback()->rotate_reclaimable_page()->test_clear_page_writeback(
>) needs to take a write_lock from interrupt context.  end_swap_bio_write()
> calls end_page_writeback() so I don't immediately see why we don't have a
> deadlock here.

Wasn't sure of the semantics. I was lucky I guess.

> Ordinarily we'd just use find_get_page(), but
>
> > +	/* Entry may already exist */
> > +	page = radix_tree_lookup(&swapper_space.page_tree, entry.val);
> > +	read_unlock(&swapper_space.tree_lock);
> > +	if (page) {
> > +		remove_from_swapped_list(entry.val);
> > +		goto out;
> > +	}
>
> you don't take a ref on the page.
>
> Which makes one wonder what happens here if `page' got whipped out of
> swapcache between the lookup and the remove_from_swapped_list().  Probably
> nothing much.
>
> Did you consider borrowing swapper_space.tree_lock to provide the list and
> radix-tree locking throughout here?

I did. I was concerned that since most of the functions occur as we swap in or 
out that we'd end up with more contention of that lock.

> > +out:
> > +	if (mru) {
> > +		spin_lock(&swapped.lock);
> > +		swapped.list.next = mru;
> > +		spin_unlock(&swapped.lock);
> > +	}
>
> That looks strange.  What happens to swapped.list.prev?

Left dangling for future null dereferencing... Only would have been hit on 
numa.

> > +			schedule_timeout_interruptible(MAX_SCHEDULE_TIMEOUT);
>
> 	set_current_state(TASK_INTERRUPTIBLE);
> 	schedule();

Ack.

Thanks! I guess I will be working on v24 in the near future...

Cheers,
Con

WARNING: multiple messages have this Message-ID (diff)

From: Con Kolivas <kernel@kolivas.org>
To: Andrew Morton <akpm@osdl.org>
Cc: nickpiggin@yahoo.com.au, linux-mm@kvack.org, ck@vds.kolivas.org,
	pj@sgi.com, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] mm: Implement Swap Prefetching v23
Date: Fri, 10 Feb 2006 14:49:58 +1100	[thread overview]
Message-ID: <200602101449.59486.kernel@kolivas.org> (raw)
In-Reply-To: <20060209192556.2629e36b.akpm@osdl.org>

On Friday 10 February 2006 14:25, Andrew Morton wrote:
> Con Kolivas <kernel@kolivas.org> wrote:
> > Here's a respin with Nick's suggestions and a modification to not cost us
> > extra slab on non-numa.
>
> v23?  I'm sure we can do better than that.

:D

> > This patch implements swap prefetching when the vm is relatively idle and
> > there is free ram available.
>
> I think "free ram available" is the critical thing here.  If it doesn't
> evict anyhing else then OK, it basically uses unutilised disk bandwidth for
> free.
>
> But where does it put the pages?  If it was really "free", they'd go onto
> the tail of the inactive list.

It puts them in swapcache. This seems to work nicely as a nowhere-land place 
where they don't have much affect on anything until we need them or need more 
ram. This has worked well, but I'm open to other suggestions.

> And what about all those unpaged text pages which the app will need to
> fault back in?

Tracking these would make the perceived time to wakeup much much faster but 
also make the list a heck of a lot larger.. but more to the point I have no 
idea how to do it and get what we really want on the list.

> > Once pages have been added to the swapped list, a timer is started,
> > testing for conditions suitable to prefetch swap pages every 5 seconds.
> > Suitable conditions are defined as lack of swapping out or in any pages,
> > and no watermark tests failing. Significant amounts of dirtied ram and
> > changes in free ram representing disk writes or reads also prevent
> > prefetching.
>
> OK.   The has-the-disk-been-used-recently test still isn't there?

No, but you'd have to free up +/- SWAP_SCAN_MAX as many pages as you fill when 
reading from disk for this indirect method to not pick it up. It's easy 
enough to see that it's effective: You can sit and watch vmstat for many 
minutes after say an oom-kill in the hope you see it quiet enough before 
prefetch does anything. After allowing 'tail -f /dev/zero' to be oomkilled in 
a full gui environment it usually takes >5 mins before swap prefetch starts 
prefetching.

> > @@ -844,6 +844,7 @@ int zone_watermark_ok(struct zone *z, in
> >  		if (free_pages <= min)
> >  			return 0;
> >  	}
> > +
> >  	return 1;
> >  }
>
> ?

Legacy of v1->v22..

>
> > +	read_lock(&swapper_space.tree_lock);
>
> That's interesting.  We conventionally do read_lock_irq() on an
> address_space.tree_lock.  Because
> end_page_writeback()->rotate_reclaimable_page()->test_clear_page_writeback(
>) needs to take a write_lock from interrupt context.  end_swap_bio_write()
> calls end_page_writeback() so I don't immediately see why we don't have a
> deadlock here.

Wasn't sure of the semantics. I was lucky I guess.

> Ordinarily we'd just use find_get_page(), but
>
> > +	/* Entry may already exist */
> > +	page = radix_tree_lookup(&swapper_space.page_tree, entry.val);
> > +	read_unlock(&swapper_space.tree_lock);
> > +	if (page) {
> > +		remove_from_swapped_list(entry.val);
> > +		goto out;
> > +	}
>
> you don't take a ref on the page.
>
> Which makes one wonder what happens here if `page' got whipped out of
> swapcache between the lookup and the remove_from_swapped_list().  Probably
> nothing much.
>
> Did you consider borrowing swapper_space.tree_lock to provide the list and
> radix-tree locking throughout here?

I did. I was concerned that since most of the functions occur as we swap in or 
out that we'd end up with more contention of that lock.

> > +out:
> > +	if (mru) {
> > +		spin_lock(&swapped.lock);
> > +		swapped.list.next = mru;
> > +		spin_unlock(&swapped.lock);
> > +	}
>
> That looks strange.  What happens to swapped.list.prev?

Left dangling for future null dereferencing... Only would have been hit on 
numa.

> > +			schedule_timeout_interruptible(MAX_SCHEDULE_TIMEOUT);
>
> 	set_current_state(TASK_INTERRUPTIBLE);
> 	schedule();

Ack.

Thanks! I guess I will be working on v24 in the near future...

Cheers,
Con

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

next prev parent reply	other threads:[~2006-02-10  3:50 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2006-02-10  2:55 [PATCH] mm: Implement Swap Prefetching v23 Con Kolivas
2006-02-10  2:55 ` Con Kolivas
2006-02-10  3:25 ` Andrew Morton
2006-02-10  3:25   ` Andrew Morton
2006-02-10  3:49   ` Con Kolivas [this message]
2006-02-10  3:49     ` Con Kolivas
2006-02-10  4:07     ` Nick Piggin
2006-02-10  4:07       ` Nick Piggin
2006-02-10  4:14       ` Con Kolivas
2006-02-10  4:14         ` Con Kolivas
2006-02-10  4:17         ` Nick Piggin
2006-02-10  4:17           ` Nick Piggin
2006-02-10  4:25         ` Andrew Morton
2006-02-10  4:25           ` Andrew Morton
2006-02-10  4:45           ` Nick Piggin
2006-02-10  4:45             ` Nick Piggin
2006-02-10  4:55             ` Andrew Morton
2006-02-10  4:55               ` Andrew Morton
2006-02-10  5:01               ` Nick Piggin
2006-02-10  5:01                 ` Nick Piggin
2006-02-10  5:26                 ` Con Kolivas
2006-02-10  5:26                   ` Con Kolivas
2006-02-10  5:32                   ` Nick Piggin
2006-02-10  5:32                     ` Nick Piggin
2006-02-10  5:37                     ` Con Kolivas
2006-02-10  5:37                       ` Con Kolivas
2006-02-10  5:43                       ` Nick Piggin
2006-02-10  5:43                         ` Nick Piggin
2006-02-10  5:48                         ` Con Kolivas
2006-02-10  5:48                           ` Con Kolivas

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=200602101449.59486.kernel@kolivas.org \
    --to=kernel@kolivas.org \
    --cc=akpm@osdl.org \
    --cc=ck@vds.kolivas.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=nickpiggin@yahoo.com.au \
    --cc=pj@sgi.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.