* [PATCH] swap: add a simple detector for inappropriate swapin readahead
@ 2013-04-15 4:01 Shaohua Li
2013-04-16 3:24 ` Simon Jeons
2013-05-10 19:59 ` Andrew Morton
0 siblings, 2 replies; 4+ messages in thread
From: Shaohua Li @ 2013-04-15 4:01 UTC
To: linux-mm; +Cc: akpm, hughd, khlebnikov, riel, fengguang.wu, minchan
[-- Attachment #1: Type: text/plain, Size: 8475 bytes --]
This patch improves the swap readahead algorithm. It is Hugh's patch, which I
have changed slightly.
Hugh's original changelog:
swapin readahead does a blind readahead, whether or not the swapin
is sequential. This may be ok on harddisk, because large reads have
relatively small costs, and if the readahead pages are unneeded they
can be reclaimed easily - though, what if their allocation forced
reclaim of useful pages? But on SSD devices large reads are more
expensive than small ones: if the readahead pages are unneeded,
reading them in caused significant overhead.
This patch adds very simplistic random read detection. Stealing
the PageReadahead technique from Konstantin Khlebnikov's patch,
avoiding the vma/anon_vma sophistications of Shaohua Li's patch,
swapin_nr_pages() simply looks at readahead's current success
rate, and narrows or widens its readahead window accordingly.
There is little science to its heuristic: it's about as stupid
as can be whilst remaining effective.
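To make the "success rate" idea concrete, here is a small userspace sketch of the hit counting. It is an illustration under assumptions, not the kernel code: `struct model_page`, `model_hits`, `test_clear_readahead()` and `model_lookup()` are hypothetical names, and a plain bool and counter stand in for the page flag and the atomic counter used in the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a swap cache page carrying a readahead mark. */
struct model_page {
	bool readahead;		/* set when the page was brought in by readahead */
};

/* Hypothetical stand-in for swapin_readahead_hits (non-atomic in this model). */
static unsigned int model_hits;

/* Return the old value of the mark and clear it, like a test-and-clear op. */
static bool test_clear_readahead(struct model_page *p)
{
	bool was_set;

	assert(p);
	was_set = p->readahead;
	p->readahead = false;
	return was_set;
}

/* Model of a swap cache lookup that found page p. */
static void model_lookup(struct model_page *p)
{
	/* Only the first lookup of a readahead page counts as a hit. */
	if (test_clear_readahead(p))
		model_hits++;
}
```

Because the mark is cleared on the first lookup, looking up the same readahead page twice counts one hit, and a page that was faulted in directly never counts.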
The table below shows elapsed times (in centiseconds) when running
a single repetitive swapping load across a 1000MB mapping in 900MB
ram with 1GB swap (the harddisk tests had taken painfully too long
when I used mem=500M, but SSD shows similar results for that).
Vanilla is the 3.6-rc7 kernel on which I started; Shaohua denotes
his Sep 3 patch in mmotm and linux-next; HughOld denotes my Oct 1
patch which Shaohua showed to be defective; HughNew this Nov 14
patch, with page_cluster as usual at default of 3 (8-page reads);
HughPC4 this same patch with page_cluster 4 (16-page reads);
HughPC0 with page_cluster 0 (1-page reads: no readahead).
HDD for swapping to harddisk, SSD for swapping to VertexII SSD.
Seq for sequential access to the mapping, cycling five times around;
Rand for the same number of random touches. Anon for a MAP_PRIVATE
anon mapping; Shmem for a MAP_SHARED anon mapping, equivalent to tmpfs.
One weakness of Shaohua's vma/anon_vma approach was that it did
not optimize Shmem: seen below. Konstantin's approach was perhaps
mistuned, 50% slower on Seq: did not compete and is not shown below.
HDD         Vanilla  Shaohua  HughOld  HughNew  HughPC4  HughPC0
Seq Anon      73921    76210    75611    76904    78191   121542
Seq Shmem     73601    73176    73855    72947    74543   118322
Rand Anon    895392   831243   871569   845197   846496   841680
Rand Shmem  1058375  1053486   827935   764955   764376   756489

SSD         Vanilla  Shaohua  HughOld  HughNew  HughPC4  HughPC0
Seq Anon      24634    24198    24673    25107    21614    70018
Seq Shmem     24959    24932    25052    25703    22030    69678
Rand Anon     43014    26146    28075    25989    26935    25901
Rand Shmem    45349    45215    28249    24268    24138    24332
These tests are, of course, two extremes of a very simple case:
under heavier mixed loads I've not yet observed any consistent
improvement or degradation, and wider testing would be welcome.
Shaohua Li:
Tests show that Vanilla is slightly better than Hugh's patch on the sequential
workload. I observed that with Hugh's patch the readahead size is sometimes
shrunk too fast (from 8 to 1 immediately) in the sequential workload if there
is no hit, and in that case continuing to read ahead is actually good.
I have not prepared a sophisticated algorithm for the sequential workload
because so far we can't guarantee that sequentially accessed pages are swapped
out sequentially. So I slightly changed Hugh's heuristic: don't shrink the
readahead size too fast.
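As a rough illustration of the damping described above, here is a userspace model of the window-size calculation. It is only a sketch: `struct ra_state` and `model_nr_pages()` are hypothetical names, plain integers stand in for the kernel's atomics, and the logic mirrors the swapin_nr_pages() hunk in this patch.

```c
#include <assert.h>

/* Hypothetical userspace model of the readahead state; not kernel code. */
struct ra_state {
	unsigned int hits;		/* models swapin_readahead_hits */
	unsigned long prev_offset;	/* last offset seen with no hits */
	unsigned int last_pages;	/* models last_readahead_pages */
};

/* Return the modelled readahead window (in pages) for a fault at offset. */
static unsigned int model_nr_pages(struct ra_state *s, unsigned long offset,
				   unsigned int page_cluster)
{
	unsigned int pages, max_pages, last_ra;

	assert(page_cluster < 31);
	max_pages = 1u << page_cluster;
	if (max_pages <= 1)
		return 1;

	/* Consume the hits accumulated since the previous call. */
	pages = s->hits + 2;
	s->hits = 0;
	if (pages == 2) {
		/* No hits: keep a 2-page window only for adjacent offsets. */
		if (offset != s->prev_offset + 1 && offset != s->prev_offset - 1)
			pages = 1;
		s->prev_offset = offset;
	} else {
		/* Some hits: round up to a power of two, at least 4. */
		unsigned int roundup = 4;

		while (roundup < pages)
			roundup <<= 1;
		pages = roundup;
	}

	if (pages > max_pages)
		pages = max_pages;

	/* Don't shrink below half of the previous window. */
	last_ra = s->last_pages / 2;
	if (pages < last_ra)
		pages = last_ra;
	s->last_pages = pages;

	return pages;
}
```

Starting from the initial 4 hits with page_cluster 3, a run of non-adjacent faults with no further hits yields windows of 8, 4, 2, 1 in this model, rather than dropping from 8 to 1 at once.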
Here is my test result (in seconds, average of 3 runs):
          Vanilla   Hugh    New
Seq           356    370    360
Random       4525   2447   2444
The attached graph shows the swapin/swapout throughput I collected with
'vmstat 2'. The first part is a random workload (until around 1200 on the
x-axis) and the second part is a sequential workload. With the patch, swapin
and swapout throughput are almost identical in steady state in both workloads,
which is the expected behavior. In Vanilla, swapin is much higher than swapout,
especially in the random workload (because of wrong readahead).
Original-patch-by: Shaohua Li <shli@fusionio.com>
Original-patch-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Shaohua Li <shli@fusionio.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Minchan Kim <minchan@kernel.org>
---
include/linux/page-flags.h | 4 +-
mm/swap_state.c | 63 ++++++++++++++++++++++++++++++++++++++++++---
2 files changed, 62 insertions(+), 5 deletions(-)
Index: linux/include/linux/page-flags.h
===================================================================
--- linux.orig/include/linux/page-flags.h 2013-04-12 15:07:05.011112763 +0800
+++ linux/include/linux/page-flags.h 2013-04-15 11:48:12.161080804 +0800
@@ -228,9 +228,9 @@ PAGEFLAG(OwnerPriv1, owner_priv_1) TESTC
TESTPAGEFLAG(Writeback, writeback) TESTSCFLAG(Writeback, writeback)
PAGEFLAG(MappedToDisk, mappedtodisk)
-/* PG_readahead is only used for file reads; PG_reclaim is only for writes */
+/* PG_readahead is only used for reads; PG_reclaim is only for writes */
PAGEFLAG(Reclaim, reclaim) TESTCLEARFLAG(Reclaim, reclaim)
-PAGEFLAG(Readahead, reclaim) /* Reminder to do async read-ahead */
+PAGEFLAG(Readahead, reclaim) TESTCLEARFLAG(Readahead, reclaim)
#ifdef CONFIG_HIGHMEM
/*
Index: linux/mm/swap_state.c
===================================================================
--- linux.orig/mm/swap_state.c 2013-04-12 15:07:05.003112912 +0800
+++ linux/mm/swap_state.c 2013-04-15 11:48:12.165078764 +0800
@@ -63,6 +63,8 @@ unsigned long total_swapcache_pages(void
return ret;
}
+static atomic_t swapin_readahead_hits = ATOMIC_INIT(4);
+
void show_swap_cache_info(void)
{
printk("%lu pages in swap cache\n", total_swapcache_pages());
@@ -286,8 +288,11 @@ struct page * lookup_swap_cache(swp_entr
page = find_get_page(swap_address_space(entry), entry.val);
- if (page)
+ if (page) {
INC_CACHE_INFO(find_success);
+ if (TestClearPageReadahead(page))
+ atomic_inc(&swapin_readahead_hits);
+ }
INC_CACHE_INFO(find_total);
return page;
@@ -373,6 +378,50 @@ struct page *read_swap_cache_async(swp_e
return found_page;
}
+unsigned long swapin_nr_pages(unsigned long offset)
+{
+ static unsigned long prev_offset;
+ unsigned int pages, max_pages, last_ra;
+ static atomic_t last_readahead_pages;
+
+ max_pages = 1 << ACCESS_ONCE(page_cluster);
+ if (max_pages <= 1)
+ return 1;
+
+ /*
+ * This heuristic has been found to work well on both sequential and
+ * random loads, swapping to hard disk or to SSD: please don't ask
+ * what the "+ 2" means, it just happens to work well, that's all.
+ */
+ pages = atomic_xchg(&swapin_readahead_hits, 0) + 2;
+ if (pages == 2) {
+ /*
+ * We can have no readahead hits to judge by: but must not get
+ * stuck here forever, so check for an adjacent offset instead
+ * (and don't even bother to check whether swap type is same).
+ */
+ if (offset != prev_offset + 1 && offset != prev_offset - 1)
+ pages = 1;
+ prev_offset = offset;
+ } else {
+ unsigned int roundup = 4;
+ while (roundup < pages)
+ roundup <<= 1;
+ pages = roundup;
+ }
+
+ if (pages > max_pages)
+ pages = max_pages;
+
+ /* Don't shrink readahead too fast */
+ last_ra = atomic_read(&last_readahead_pages) / 2;
+ if (pages < last_ra)
+ pages = last_ra;
+ atomic_set(&last_readahead_pages, pages);
+
+ return pages;
+}
+
/**
* swapin_readahead - swap in pages in hope we need them soon
* @entry: swap entry of this memory
@@ -396,11 +445,16 @@ struct page *swapin_readahead(swp_entry_
struct vm_area_struct *vma, unsigned long addr)
{
struct page *page;
- unsigned long offset = swp_offset(entry);
+ unsigned long entry_offset = swp_offset(entry);
+ unsigned long offset = entry_offset;
unsigned long start_offset, end_offset;
- unsigned long mask = (1UL << page_cluster) - 1;
+ unsigned long mask;
struct blk_plug plug;
+ mask = swapin_nr_pages(offset) - 1;
+ if (!mask)
+ goto skip;
+
/* Read a page_cluster sized and aligned cluster around offset. */
start_offset = offset & ~mask;
end_offset = offset | mask;
@@ -414,10 +468,13 @@ struct page *swapin_readahead(swp_entry_
gfp_mask, vma, addr);
if (!page)
continue;
+ if (offset != entry_offset)
+ SetPageReadahead(page);
page_cache_release(page);
}
blk_finish_plug(&plug);
lru_add_drain(); /* Push any new pages onto the LRU now */
+skip:
return read_swap_cache_async(entry, gfp_mask, vma, addr);
}
[-- Attachment #2: swapra.png --]
[-- Type: image/png, Size: 52071 bytes --]
* Re: [PATCH] swap: add a simple detector for inappropriate swapin readahead
2013-04-15 4:01 [PATCH] swap: add a simple detector for inappropriate swapin readahead Shaohua Li
@ 2013-04-16 3:24 ` Simon Jeons
2013-05-10 19:59 ` Andrew Morton
1 sibling, 0 replies; 4+ messages in thread
From: Simon Jeons @ 2013-04-16 3:24 UTC
To: Shaohua Li; +Cc: linux-mm, akpm, hughd, khlebnikov, riel, fengguang.wu, minchan
Hi Shaohua,
On 04/15/2013 12:01 PM, Shaohua Li wrote:
> This is a patch to improve swap readahead algorithm. It's from Hugh and I
> slightly changed it.
>
> Hugh's original changelog:
>
> swapin readahead does a blind readahead, whether or not the swapin
> is sequential. This may be ok on harddisk, because large reads have
> relatively small costs, and if the readahead pages are unneeded they
> can be reclaimed easily - though, what if their allocation forced
> reclaim of useful pages? But on SSD devices large reads are more
> expensive than small ones: if the readahead pages are unneeded,
> reading them in caused significant overhead.
>
> This patch adds very simplistic random read detection. Stealing
> the PageReadahead technique from Konstantin Khlebnikov's patch,
> avoiding the vma/anon_vma sophistications of Shaohua Li's patch,
> swapin_nr_pages() simply looks at readahead's current success
> rate, and narrows or widens its readahead window accordingly.
> There is little science to its heuristic: it's about as stupid
> as can be whilst remaining effective.
>
> The table below shows elapsed times (in centiseconds) when running
> a single repetitive swapping load across a 1000MB mapping in 900MB
> ram with 1GB swap (the harddisk tests had taken painfully too long
> when I used mem=500M, but SSD shows similar results for that).
>
> Vanilla is the 3.6-rc7 kernel on which I started; Shaohua denotes
> his Sep 3 patch in mmotm and linux-next; HughOld denotes my Oct 1
> patch which Shaohua showed to be defective; HughNew this Nov 14
> patch, with page_cluster as usual at default of 3 (8-page reads);
> HughPC4 this same patch with page_cluster 4 (16-page reads);
> HughPC0 with page_cluster 0 (1-page reads: no readahead).
>
> HDD for swapping to harddisk, SSD for swapping to VertexII SSD.
> Seq for sequential access to the mapping, cycling five times around;
> Rand for the same number of random touches. Anon for a MAP_PRIVATE
> anon mapping; Shmem for a MAP_SHARED anon mapping, equivalent to tmpfs.
>
> One weakness of Shaohua's vma/anon_vma approach was that it did
> not optimize Shmem: seen below. Konstantin's approach was perhaps
> mistuned, 50% slower on Seq: did not compete and is not shown below.
>
> HDD         Vanilla  Shaohua  HughOld  HughNew  HughPC4  HughPC0
> Seq Anon      73921    76210    75611    76904    78191   121542
> Seq Shmem     73601    73176    73855    72947    74543   118322
> Rand Anon    895392   831243   871569   845197   846496   841680
> Rand Shmem  1058375  1053486   827935   764955   764376   756489
>
> SSD         Vanilla  Shaohua  HughOld  HughNew  HughPC4  HughPC0
> Seq Anon      24634    24198    24673    25107    21614    70018
> Seq Shmem     24959    24932    25052    25703    22030    69678
> Rand Anon     43014    26146    28075    25989    26935    25901
> Rand Shmem    45349    45215    28249    24268    24138    24332
>
> These tests are, of course, two extremes of a very simple case:
> under heavier mixed loads I've not yet observed any consistent
> improvement or degradation, and wider testing would be welcome.
>
> Shaohua Li:
>
> Test shows Vanilla is slightly better in sequential workload than Hugh's patch.
> I observed with Hugh's patch sometimes the readahead size is shrinked too fast
> (from 8 to 1 immediately) in sequential workload if there is no hit. And in
> such case, continuing doing readahead is good actually.
>
> I don't prepare a sophisticated algorithm for the sequential workload because
> so far we can't guarantee sequential accessed pages are swap out sequentially.
> So I slightly change Hugh's heuristic - don't shrink readahead size too fast.
>
> Here is my test result (unit second, 3 runs average):
>           Vanilla   Hugh    New
> Seq           356    370    360
> Random       4525   2447   2444
>
> Attached graph is the swapin/swapout throughput I collected with 'vmstat 2'.
Could you tell me how you drew this graph?
> The first part is running a random workload (till around 1200 of the x-axis)
> and the second part is running a sequential workload. swapin and swapout
> throughput are almost identical in steady state in both workloads. These are
> expected behavior. while in Vanilla, swapin is much bigger than swapout
> especially in random workload (because wrong readahead).
>
> Original-patch-by: Shaohua Li <shli@fusionio.com>
> Original-patch-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Signed-off-by: Shaohua Li <shli@fusionio.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Wu Fengguang <fengguang.wu@intel.com>
> Cc: Minchan Kim <minchan@kernel.org>
> ---
>
> include/linux/page-flags.h | 4 +-
> mm/swap_state.c | 63 ++++++++++++++++++++++++++++++++++++++++++---
> 2 files changed, 62 insertions(+), 5 deletions(-)
>
> Index: linux/include/linux/page-flags.h
> ===================================================================
> --- linux.orig/include/linux/page-flags.h 2013-04-12 15:07:05.011112763 +0800
> +++ linux/include/linux/page-flags.h 2013-04-15 11:48:12.161080804 +0800
> @@ -228,9 +228,9 @@ PAGEFLAG(OwnerPriv1, owner_priv_1) TESTC
> TESTPAGEFLAG(Writeback, writeback) TESTSCFLAG(Writeback, writeback)
> PAGEFLAG(MappedToDisk, mappedtodisk)
>
> -/* PG_readahead is only used for file reads; PG_reclaim is only for writes */
> +/* PG_readahead is only used for reads; PG_reclaim is only for writes */
> PAGEFLAG(Reclaim, reclaim) TESTCLEARFLAG(Reclaim, reclaim)
> -PAGEFLAG(Readahead, reclaim) /* Reminder to do async read-ahead */
> +PAGEFLAG(Readahead, reclaim) TESTCLEARFLAG(Readahead, reclaim)
>
> #ifdef CONFIG_HIGHMEM
> /*
> Index: linux/mm/swap_state.c
> ===================================================================
> --- linux.orig/mm/swap_state.c 2013-04-12 15:07:05.003112912 +0800
> +++ linux/mm/swap_state.c 2013-04-15 11:48:12.165078764 +0800
> @@ -63,6 +63,8 @@ unsigned long total_swapcache_pages(void
> return ret;
> }
>
> +static atomic_t swapin_readahead_hits = ATOMIC_INIT(4);
> +
> void show_swap_cache_info(void)
> {
> printk("%lu pages in swap cache\n", total_swapcache_pages());
> @@ -286,8 +288,11 @@ struct page * lookup_swap_cache(swp_entr
>
> page = find_get_page(swap_address_space(entry), entry.val);
>
> - if (page)
> + if (page) {
> INC_CACHE_INFO(find_success);
> + if (TestClearPageReadahead(page))
> + atomic_inc(&swapin_readahead_hits);
> + }
>
> INC_CACHE_INFO(find_total);
> return page;
> @@ -373,6 +378,50 @@ struct page *read_swap_cache_async(swp_e
> return found_page;
> }
>
> +unsigned long swapin_nr_pages(unsigned long offset)
> +{
> + static unsigned long prev_offset;
> + unsigned int pages, max_pages, last_ra;
> + static atomic_t last_readahead_pages;
> +
> + max_pages = 1 << ACCESS_ONCE(page_cluster);
> + if (max_pages <= 1)
> + return 1;
> +
> + /*
> + * This heuristic has been found to work well on both sequential and
> + * random loads, swapping to hard disk or to SSD: please don't ask
> + * what the "+ 2" means, it just happens to work well, that's all.
> + */
> + pages = atomic_xchg(&swapin_readahead_hits, 0) + 2;
> + if (pages == 2) {
> + /*
> + * We can have no readahead hits to judge by: but must not get
> + * stuck here forever, so check for an adjacent offset instead
> + * (and don't even bother to check whether swap type is same).
> + */
> + if (offset != prev_offset + 1 && offset != prev_offset - 1)
> + pages = 1;
> + prev_offset = offset;
> + } else {
> + unsigned int roundup = 4;
> + while (roundup < pages)
> + roundup <<= 1;
> + pages = roundup;
> + }
> +
> + if (pages > max_pages)
> + pages = max_pages;
> +
> + /* Don't shrink readahead too fast */
> + last_ra = atomic_read(&last_readahead_pages) / 2;
> + if (pages < last_ra)
> + pages = last_ra;
> + atomic_set(&last_readahead_pages, pages);
> +
> + return pages;
> +}
> +
> /**
> * swapin_readahead - swap in pages in hope we need them soon
> * @entry: swap entry of this memory
> @@ -396,11 +445,16 @@ struct page *swapin_readahead(swp_entry_
> struct vm_area_struct *vma, unsigned long addr)
> {
> struct page *page;
> - unsigned long offset = swp_offset(entry);
> + unsigned long entry_offset = swp_offset(entry);
> + unsigned long offset = entry_offset;
> unsigned long start_offset, end_offset;
> - unsigned long mask = (1UL << page_cluster) - 1;
> + unsigned long mask;
> struct blk_plug plug;
>
> + mask = swapin_nr_pages(offset) - 1;
> + if (!mask)
> + goto skip;
> +
> /* Read a page_cluster sized and aligned cluster around offset. */
> start_offset = offset & ~mask;
> end_offset = offset | mask;
> @@ -414,10 +468,13 @@ struct page *swapin_readahead(swp_entry_
> gfp_mask, vma, addr);
> if (!page)
> continue;
> + if (offset != entry_offset)
> + SetPageReadahead(page);
> page_cache_release(page);
> }
> blk_finish_plug(&plug);
>
> lru_add_drain(); /* Push any new pages onto the LRU now */
> +skip:
> return read_swap_cache_async(entry, gfp_mask, vma, addr);
> }
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* Re: [PATCH] swap: add a simple detector for inappropriate swapin readahead
2013-04-15 4:01 [PATCH] swap: add a simple detector for inappropriate swapin readahead Shaohua Li
2013-04-16 3:24 ` Simon Jeons
@ 2013-05-10 19:59 ` Andrew Morton
2013-05-13 5:22 ` Shaohua Li
1 sibling, 1 reply; 4+ messages in thread
From: Andrew Morton @ 2013-05-10 19:59 UTC
To: Shaohua Li; +Cc: linux-mm, hughd, khlebnikov, riel, fengguang.wu, minchan
On Mon, 15 Apr 2013 12:01:16 +0800 Shaohua Li <shli@kernel.org> wrote:
> This is a patch to improve swap readahead algorithm. It's from Hugh and I
> slightly changed it.
>
> ...
>
I find the new code a bit harder to follow than it needs to be.
> --- linux.orig/include/linux/page-flags.h 2013-04-12 15:07:05.011112763 +0800
> +++ linux/include/linux/page-flags.h 2013-04-15 11:48:12.161080804 +0800
> @@ -228,9 +228,9 @@ PAGEFLAG(OwnerPriv1, owner_priv_1) TESTC
> TESTPAGEFLAG(Writeback, writeback) TESTSCFLAG(Writeback, writeback)
> PAGEFLAG(MappedToDisk, mappedtodisk)
>
> -/* PG_readahead is only used for file reads; PG_reclaim is only for writes */
> +/* PG_readahead is only used for reads; PG_reclaim is only for writes */
> PAGEFLAG(Reclaim, reclaim) TESTCLEARFLAG(Reclaim, reclaim)
> -PAGEFLAG(Readahead, reclaim) /* Reminder to do async read-ahead */
> +PAGEFLAG(Readahead, reclaim) TESTCLEARFLAG(Readahead, reclaim)
>
> #ifdef CONFIG_HIGHMEM
> /*
> Index: linux/mm/swap_state.c
> ===================================================================
> --- linux.orig/mm/swap_state.c 2013-04-12 15:07:05.003112912 +0800
> +++ linux/mm/swap_state.c 2013-04-15 11:48:12.165078764 +0800
> @@ -63,6 +63,8 @@ unsigned long total_swapcache_pages(void
> return ret;
> }
>
> +static atomic_t swapin_readahead_hits = ATOMIC_INIT(4);
Some documentation is needed here explaining this variable's role. If
that is understood then perhaps the reader will be able to work out why
it was initialised to "4". Or perhaps not.
> void show_swap_cache_info(void)
> {
> printk("%lu pages in swap cache\n", total_swapcache_pages());
> @@ -286,8 +288,11 @@ struct page * lookup_swap_cache(swp_entr
>
> page = find_get_page(swap_address_space(entry), entry.val);
>
> - if (page)
> + if (page) {
> INC_CACHE_INFO(find_success);
> + if (TestClearPageReadahead(page))
> + atomic_inc(&swapin_readahead_hits);
> + }
>
> INC_CACHE_INFO(find_total);
> return page;
> @@ -373,6 +378,50 @@ struct page *read_swap_cache_async(swp_e
> return found_page;
> }
>
> +unsigned long swapin_nr_pages(unsigned long offset)
Should be static.
Needs documentation explaining what it does and why.
It would probably be clearer to make `offset' have type pgoff_t.
That's what swp_offset() returned. Ditto `entry_offset' in
swapin_readahead(). It's not *really* a pgoff_t, but that's what we
have and it's more informative than a bare ulong.
The documentation should describe the meaning of this function's return
value.
> +{
> + static unsigned long prev_offset;
> + unsigned int pages, max_pages, last_ra;
> + static atomic_t last_readahead_pages;
> +
> + max_pages = 1 << ACCESS_ONCE(page_cluster);
> + if (max_pages <= 1)
> + return 1;
> +
> + /*
> + * This heuristic has been found to work well on both sequential and
> + * random loads, swapping to hard disk or to SSD: please don't ask
> + * what the "+ 2" means, it just happens to work well, that's all.
OK, I won't.
> + */
> + pages = atomic_xchg(&swapin_readahead_hits, 0) + 2;
> + if (pages == 2) {
> + /*
> + * We can have no readahead hits to judge by: but must not get
> + * stuck here forever, so check for an adjacent offset instead
> + * (and don't even bother to check whether swap type is same).
> + */
> + if (offset != prev_offset + 1 && offset != prev_offset - 1)
> + pages = 1;
> + prev_offset = offset;
> + } else {
> + unsigned int roundup = 4;
What does the "4" mean?
> + while (roundup < pages)
> + roundup <<= 1;
Can use something like
roundup = ilog2(pages) + 2;
And what does the "2" mean?
> + pages = roundup;
> + }
> +
> + if (pages > max_pages)
> + pages = max_pages;
min()
> + /* Don't shrink readahead too fast */
> + last_ra = atomic_read(&last_readahead_pages) / 2;
Why not "3"?
> + if (pages < last_ra)
> + pages = last_ra;
> + atomic_set(&last_readahead_pages, pages);
> +
> + return pages;
> +}
> +
> /**
> * swapin_readahead - swap in pages in hope we need them soon
> * @entry: swap entry of this memory
> @@ -396,11 +445,16 @@ struct page *swapin_readahead(swp_entry_
> struct vm_area_struct *vma, unsigned long addr)
> {
> struct page *page;
> - unsigned long offset = swp_offset(entry);
> + unsigned long entry_offset = swp_offset(entry);
> + unsigned long offset = entry_offset;
> unsigned long start_offset, end_offset;
> - unsigned long mask = (1UL << page_cluster) - 1;
> + unsigned long mask;
> struct blk_plug plug;
>
> + mask = swapin_nr_pages(offset) - 1;
This I in fact found to be the most obscure part of the patch.
swapin_nr_pages() returns a count, but here we're copying it into a
variable which appears to hold a bitmask. That's a weird thing to do
and only makes sense if it is assured (and designed) that
swapin_nr_pages() returns a power of 2.
Wanna see if we can clear all these things up please?
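The power-of-2 requirement behind the mask arithmetic can be sketched in userspace as follows. This is an illustration only: `is_power_of_two()`, `struct cluster` and `cluster_around()` are hypothetical names, and the two bit operations are the ones applied to `mask` in swapin_readahead().

```c
#include <assert.h>
#include <stdbool.h>

/* True if n has exactly one bit set. */
static bool is_power_of_two(unsigned long n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/* Hypothetical holder for an inclusive [start, end] range of swap offsets. */
struct cluster {
	unsigned long start;
	unsigned long end;
};

/*
 * Compute the naturally aligned cluster of nr_pages entries that contains
 * offset. The count is turned into a mask, which yields an aligned,
 * non-overlapping cluster only when nr_pages is a power of two.
 */
static struct cluster cluster_around(unsigned long offset, unsigned long nr_pages)
{
	struct cluster c;
	unsigned long mask;

	assert(is_power_of_two(nr_pages));
	mask = nr_pages - 1;
	c.start = offset & ~mask;	/* round down to the cluster boundary */
	c.end = offset | mask;		/* last entry of the same cluster */
	return c;
}
```

For example, with a count of 8 a fault at offset 13 maps to the cluster 8..15, and with a count of 4 it maps to 12..15. The assert makes the assumption explicit that the patch relies on implicitly.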
> + if (!mask)
> + goto skip;
> +
> /* Read a page_cluster sized and aligned cluster around offset. */
> start_offset = offset & ~mask;
> end_offset = offset | mask;
> ...
* Re: [PATCH] swap: add a simple detector for inappropriate swapin readahead
2013-05-10 19:59 ` Andrew Morton
@ 2013-05-13 5:22 ` Shaohua Li
0 siblings, 0 replies; 4+ messages in thread
From: Shaohua Li @ 2013-05-13 5:22 UTC
To: Andrew Morton; +Cc: linux-mm, hughd, khlebnikov, riel, fengguang.wu, minchan
On Fri, May 10, 2013 at 12:59:06PM -0700, Andrew Morton wrote:
> On Mon, 15 Apr 2013 12:01:16 +0800 Shaohua Li <shli@kernel.org> wrote:
>
> > This is a patch to improve swap readahead algorithm. It's from Hugh and I
> > slightly changed it.
> >
> > ...
> >
>
> I find the new code a bit harder to follow that it needs to be.
The patch detects random workloads to avoid false readahead. For sequential
workloads, readahead is known to be hard to get right, because we can't
guarantee that the memory of a sequential workload sits together on disk. The
original blind readahead doesn't work very well for sequential workloads
either. So the goal is not to regress on sequential workloads. There is some
magic here for this. I can't prove the magic numbers are right, but they just
happen to work for simple workloads, sorry :)!
> > --- linux.orig/include/linux/page-flags.h 2013-04-12 15:07:05.011112763 +0800
> > +++ linux/include/linux/page-flags.h 2013-04-15 11:48:12.161080804 +0800
> > @@ -228,9 +228,9 @@ PAGEFLAG(OwnerPriv1, owner_priv_1) TESTC
> > TESTPAGEFLAG(Writeback, writeback) TESTSCFLAG(Writeback, writeback)
> > PAGEFLAG(MappedToDisk, mappedtodisk)
> >
> > -/* PG_readahead is only used for file reads; PG_reclaim is only for writes */
> > +/* PG_readahead is only used for reads; PG_reclaim is only for writes */
> > PAGEFLAG(Reclaim, reclaim) TESTCLEARFLAG(Reclaim, reclaim)
> > -PAGEFLAG(Readahead, reclaim) /* Reminder to do async read-ahead */
> > +PAGEFLAG(Readahead, reclaim) TESTCLEARFLAG(Readahead, reclaim)
> >
> > #ifdef CONFIG_HIGHMEM
> > /*
> > Index: linux/mm/swap_state.c
> > ===================================================================
> > --- linux.orig/mm/swap_state.c 2013-04-12 15:07:05.003112912 +0800
> > +++ linux/mm/swap_state.c 2013-04-15 11:48:12.165078764 +0800
> > @@ -63,6 +63,8 @@ unsigned long total_swapcache_pages(void
> > return ret;
> > }
> >
> > +static atomic_t swapin_readahead_hits = ATOMIC_INIT(4);
>
> Some documentation is needed here explaining this variable's role. If
> that is understood then perhaps the reader will be able to work out why
> it was initialised to "4". Or perhaps not.
OK, I've explained it, but the '4' is still magic.
> > void show_swap_cache_info(void)
> > {
> > printk("%lu pages in swap cache\n", total_swapcache_pages());
> > @@ -286,8 +288,11 @@ struct page * lookup_swap_cache(swp_entr
> >
> > page = find_get_page(swap_address_space(entry), entry.val);
> >
> > - if (page)
> > + if (page) {
> > INC_CACHE_INFO(find_success);
> > + if (TestClearPageReadahead(page))
> > + atomic_inc(&swapin_readahead_hits);
> > + }
> >
> > INC_CACHE_INFO(find_total);
> > return page;
> > @@ -373,6 +378,50 @@ struct page *read_swap_cache_async(swp_e
> > return found_page;
> > }
> >
> > +unsigned long swapin_nr_pages(unsigned long offset)
>
> Should be static.
>
> Needs documentation explaining what it does and why.
>
> It would probably be clearer to make `offset' have type pgoff_t.
> What's what swp_offset() returned. Ditto `entry_offset' in
> swapin_readahead(). It's not *really* a pgoff_t, but that's what we
> have and it's more informative than a bare ulong.
>
> The documentation should describe the meaning of this function's return
> value.
done.
> > +{
> > + static unsigned long prev_offset;
> > + unsigned int pages, max_pages, last_ra;
> > + static atomic_t last_readahead_pages;
> > +
> > + max_pages = 1 << ACCESS_ONCE(page_cluster);
> > + if (max_pages <= 1)
> > + return 1;
> > +
> > + /*
> > + * This heuristic has been found to work well on both sequential and
> > + * random loads, swapping to hard disk or to SSD: please don't ask
> > + * what the "+ 2" means, it just happens to work well, that's all.
>
> OK, I won't.
Thanks :)
> > + */
> > + pages = atomic_xchg(&swapin_readahead_hits, 0) + 2;
> > + if (pages == 2) {
> > + /*
> > + * We can have no readahead hits to judge by: but must not get
> > + * stuck here forever, so check for an adjacent offset instead
> > + * (and don't even bother to check whether swap type is same).
> > + */
> > + if (offset != prev_offset + 1 && offset != prev_offset - 1)
> > + pages = 1;
> > + prev_offset = offset;
> > + } else {
> > + unsigned int roundup = 4;
>
> What does the "4" mean?
The same magic.
> > + while (roundup < pages)
> > + roundup <<= 1;
>
> Can use something like
>
> roundup = ilog2(pages) + 2;
>
> And what does the "2" mean?
ilog2 doesn't work here.
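The reply does not spell out why; one likely reading is that the suggested expression produces a small number (a logarithm plus 2) rather than a power-of-2 page count. A userspace comparison, with hypothetical helper names `ilog2_u32()` (a stand-in for the kernel's ilog2()) and `roundup_loop()` (the loop from the patch):

```c
#include <assert.h>

/* Userspace stand-in for ilog2(): floor(log2(n)) for n > 0. */
static unsigned int ilog2_u32(unsigned int n)
{
	unsigned int log = 0;

	assert(n > 0);
	while (n >>= 1)
		log++;
	return log;
}

/* The loop from the patch: smallest power of two >= pages, but at least 4. */
static unsigned int roundup_loop(unsigned int pages)
{
	unsigned int roundup = 4;

	while (roundup < pages)
		roundup <<= 1;
	return roundup;
}
```

For pages = 3, 6 and 9 the loop gives 4, 8 and 16, while `ilog2_u32(pages) + 2` gives 3, 4 and 5, so the two are not interchangeable and the latter is not always a power of 2.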
> > + pages = roundup;
> > + }
> > +
> > + if (pages > max_pages)
> > + pages = max_pages;
>
> min()
ok
> > + /* Don't shrink readahead too fast */
> > + last_ra = atomic_read(&last_readahead_pages) / 2;
>
> Why not "3"?
Magic again.
> > + if (pages < last_ra)
> > + pages = last_ra;
> > + atomic_set(&last_readahead_pages, pages);
> > +
> > + return pages;
> > +}
> > +
> > /**
> > * swapin_readahead - swap in pages in hope we need them soon
> > * @entry: swap entry of this memory
> > @@ -396,11 +445,16 @@ struct page *swapin_readahead(swp_entry_
> > struct vm_area_struct *vma, unsigned long addr)
> > {
> > struct page *page;
> > - unsigned long offset = swp_offset(entry);
> > + unsigned long entry_offset = swp_offset(entry);
> > + unsigned long offset = entry_offset;
> > unsigned long start_offset, end_offset;
> > - unsigned long mask = (1UL << page_cluster) - 1;
> > + unsigned long mask;
> > struct blk_plug plug;
> >
> > + mask = swapin_nr_pages(offset) - 1;
>
> This I in fact found to be the most obscure part of the patch.
> swapin_nr_pages() returns a count, but here we're copying it into a
> variable which appears to hold a bitmask. That's a weird thing to do
> and only makes sense if it is assured (and designed) that
> swapin_nr_pages() returns a power of 2.
Yes, it's guaranteed to return a power of 2; this is now commented in the patch.
---
mm/swap_state.c | 33 ++++++++++++++++++++++++---------
1 file changed, 24 insertions(+), 9 deletions(-)
Index: linux/mm/swap_state.c
===================================================================
--- linux.orig/mm/swap_state.c 2013-05-13 11:43:00.137490065 +0800
+++ linux/mm/swap_state.c 2013-05-13 13:18:44.573272832 +0800
@@ -63,6 +63,11 @@ unsigned long total_swapcache_pages(void
return ret;
}
+/*
+ * Track how many swap readahead pages are truly hit. We readahead at least
+ * swapin_readahead_hits pages. The "4" is arbitrary, if there are hits, at
+ * least readahead 4 pages.
+ */
static atomic_t swapin_readahead_hits = ATOMIC_INIT(4);
void show_swap_cache_info(void)
@@ -378,9 +383,20 @@ struct page *read_swap_cache_async(swp_e
return found_page;
}
-unsigned long swapin_nr_pages(unsigned long offset)
+/*
+ * Return how many swap pages should be readahead. This detects random workload
+ * to avoid false readahead. It's hard to correctly do readahead for sequential
+ * workload, as we can't guarantee memory of sequential workload live in disk
+ * sequentially. This still tries to readahead as many pages as possible if
+ * swapin readahead hits (for example, any hit causes at least 4 pages
+ * readahead; shrinking only allows shrink to half of last readahead pages).
+ *
+ * This is guaranteed to return power of 2 pages, as swapin_readahead reads
+ * ahead an aligned cluster.
+ */
+static unsigned long swapin_nr_pages(pgoff_t offset)
{
- static unsigned long prev_offset;
+ static pgoff_t prev_offset;
unsigned int pages, max_pages, last_ra;
static atomic_t last_readahead_pages;
@@ -410,13 +426,12 @@ unsigned long swapin_nr_pages(unsigned l
pages = roundup;
}
- if (pages > max_pages)
- pages = max_pages;
+ pages = min(pages, max_pages);
/* Don't shrink readahead too fast */
last_ra = atomic_read(&last_readahead_pages) / 2;
- if (pages < last_ra)
- pages = last_ra;
+ pages = max(pages, last_ra);
+
atomic_set(&last_readahead_pages, pages);
return pages;
@@ -445,9 +460,9 @@ struct page *swapin_readahead(swp_entry_
struct vm_area_struct *vma, unsigned long addr)
{
struct page *page;
- unsigned long entry_offset = swp_offset(entry);
- unsigned long offset = entry_offset;
- unsigned long start_offset, end_offset;
+ pgoff_t entry_offset = swp_offset(entry);
+ pgoff_t offset = entry_offset;
+ pgoff_t start_offset, end_offset;
unsigned long mask;
struct blk_plug plug;