From mboxrd@z Thu Jan 1 00:00:00 1970
From: Wu Fengguang
Subject: [PATCH 0/9] readahead stats/tracing, backwards prefetching and more (v4)
Date: Fri, 27 Jan 2012 11:05:24 +0800
Message-ID: <20120127030524.854259561@intel.com>
Cc: Andi Kleen
To: Andrew Morton
Return-path:
cc: Linux Memory Management List ,
Cc: Wu Fengguang , LKML
Sender: owner-linux-mm@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

Andrew,

Will you include it into the -mm tree?

This introduces per-cpu readahead stats, tracing and backwards prefetching,
fixes context readahead for SSD random reads, and makes some other minor
changes.

Changes since v3:
- default to CONFIG_READAHEAD_STATS=n
- drop "block: limit default readahead size for small devices"
  (and expect some distro udev rules to do the job)
- use percpu_counter for the readahead stats

Changes since v2:
- use per-cpu counters for readahead stats
- make context readahead more conservative
- simplify readahead tracing format and use __print_symbolic()
- backwards prefetching and snap to EOF fixes and cleanups

Changes since v1:
- use bit fields: pattern, for_mmap, for_metadata, lseek
- comment the various readahead patterns
- drop boot options "readahead=" and "readahead_stats="
- add for_metadata
- add snapping to EOF

 [PATCH 1/9] readahead: make context readahead more conservative
 [PATCH 2/9] readahead: record readahead patterns
 [PATCH 3/9] readahead: tag mmap page fault call sites
 [PATCH 4/9] readahead: tag metadata call sites
 [PATCH 5/9] readahead: add vfs/readahead tracing event
 [PATCH 6/9] readahead: add /debug/readahead/stats
 [PATCH 7/9] readahead: basic support for backwards prefetching
 [PATCH 8/9] readahead: dont do start-of-file readahead after lseek()
 [PATCH 9/9] readahead: snap readahead request to EOF

 fs/Makefile                |    1
 fs/ext3/dir.c              |    1
 fs/ext4/dir.c              |    1
 fs/read_write.c            |    3
 fs/trace.c                 |    2
 include/linux/fs.h         |   41 ++++
 include/linux/mm.h         |    4
 include/trace/events/vfs.h |   78 ++++++++
 mm/Kconfig                 |   15 +
 mm/filemap.c               |    9 -
 mm/readahead.c             |  310 +++++++++++++++++++++++++++++++++--
 11 files changed, 450 insertions(+), 15 deletions(-)

Thanks,
Fengguang

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Wu Fengguang
Subject: [PATCH 1/9] readahead: make context readahead more conservative
Date: Fri, 27 Jan 2012 11:05:25 +0800
Message-ID: <20120127031326.469063803@intel.com>
References: <20120127030524.854259561@intel.com>
Cc: Andi Kleen , Wu Fengguang
To: Andrew Morton
Return-path:
cc: Linux Memory Management List ,
Cc: LKML
Content-Disposition: inline; filename=readahead-context-tt
Sender: owner-linux-mm@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

Try to avoid negatively impacting moderately dense random reads on SSD.

Transactions-per-second numbers provided by Taobao:

                QPS     case
                -------------------------------------------------------
                7536    disable context readahead totally
w/ patch:       7129    slower size rampup and start RA on the 3rd read
                6717    slower size rampup
w/o patch:      5581    unmodified context readahead

Before, readahead was started whenever page N+1 was read and page N
happened to have been read recently. After the patch, we'll only start
readahead when *three* random reads happen to access pages N, N+1, N+2.
The probability of this happening is extremely low for pure random
reads, unless they are very dense, which actually deserves some
readahead.

Also start with a smaller readahead window. The impact on interleaved
sequential reads should be small, because for a long-running stream the
small readahead window rampup phase is negligible.

The context readahead actually benefits clustered random reads on HDD,
whose seek cost is pretty high.
However, as SSDs are increasingly used for random read workloads, it's
better for the context readahead to concentrate on interleaved
sequential reads.

Tested-by: Tao Ma
Signed-off-by: Wu Fengguang
---
 mm/readahead.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

--- linux-next.orig/mm/readahead.c	2012-01-25 15:57:47.000000000 +0800
+++ linux-next/mm/readahead.c	2012-01-25 15:57:49.000000000 +0800
@@ -369,10 +369,10 @@ static int try_context_readahead(struct
 	size = count_history_pages(mapping, ra, offset, max);
 
 	/*
-	 * no history pages:
+	 * not enough history pages:
 	 * it could be a random read
 	 */
-	if (!size)
+	if (size <= req_size)
 		return 0;
 
 	/*
@@ -383,8 +383,8 @@ static int try_context_readahead(struct
 		size *= 2;
 
 	ra->start = offset;
-	ra->size = get_init_ra_size(size + req_size, max);
-	ra->async_size = ra->size;
+	ra->size = min(size + req_size, max);
+	ra->async_size = 1;
 
 	return 1;
 }

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Wu Fengguang
Subject: [PATCH 9/9] readahead: snap readahead request to EOF
Date: Fri, 27 Jan 2012 11:05:33 +0800
Message-ID: <20120127031327.567686120@intel.com>
References: <20120127030524.854259561@intel.com>
Cc: Andi Kleen , Jan Kara , Wu Fengguang
To: Andrew Morton
Return-path:
cc: Linux Memory Management List ,
Cc: LKML
Content-Disposition: inline; filename=readahead-eof
Sender: owner-linux-mm@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

If the file size is 20KB and the readahead request is [0, 16KB), it's
better to expand the readahead request to [0, 20KB), which will likely
save one followup I/O for the ending [16KB, 20KB).

If the readahead request already covers EOF, trim it down to EOF.
Also don't set the PG_readahead mark, to avoid an unnecessary future
invocation of the readahead code.

This special handling looks worthwhile because small to medium sized
files are pretty common.

Acked-by: Jan Kara
Signed-off-by: Wu Fengguang
---
 mm/readahead.c |   21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

--- linux-next.orig/mm/readahead.c	2012-01-25 15:57:58.000000000 +0800
+++ linux-next/mm/readahead.c	2012-01-25 15:57:59.000000000 +0800
@@ -466,6 +466,25 @@ unsigned long max_sane_readahead(unsigne
 		+ node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
 }
 
+static void snap_to_eof(struct file_ra_state *ra, struct address_space *mapping)
+{
+	pgoff_t eof = ((i_size_read(mapping->host)-1) >> PAGE_CACHE_SHIFT) + 1;
+	pgoff_t start = ra->start;
+	unsigned int size = ra->size;
+
+	/*
+	 * skip backwards and random reads
+	 */
+	if (ra->pattern > RA_PATTERN_MMAP_AROUND)
+		return;
+
+	size += min(size / 2, ra->ra_pages / 4);
+	if (start + size > eof) {
+		ra->size = eof - start;
+		ra->async_size = 0;
+	}
+}
+
 /*
  * Submit IO for the read-ahead request in file_ra_state.
  */
@@ -477,6 +496,8 @@ unsigned long ra_submit(struct file_ra_s
 {
 	int actual;
 
+	snap_to_eof(ra, mapping);
+
 	actual = __do_page_cache_readahead(mapping, filp,
 					ra->start, ra->size, ra->async_size);
 

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Wu Fengguang
Subject: [PATCH 3/9] readahead: tag mmap page fault call sites
Date: Fri, 27 Jan 2012 11:05:27 +0800
Message-ID: <20120127031326.752229208@intel.com>
References: <20120127030524.854259561@intel.com>
Cc: Andi Kleen , Jan Kara , Wu Fengguang
To: Andrew Morton
Return-path:
cc: Linux Memory Management List ,
Cc: LKML
Content-Disposition: inline; filename=readahead-for-mmap
Sender: owner-linux-mm@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

Introduce a bit field ra->for_mmap for tagging mmap reads.
The tag will be cleared immediately after submitting the IO.

Acked-by: Jan Kara
Signed-off-by: Wu Fengguang
---
 include/linux/fs.h |    1 +
 mm/filemap.c       |    6 +++++-
 mm/readahead.c     |    1 +
 3 files changed, 7 insertions(+), 1 deletion(-)

--- linux-next.orig/include/linux/fs.h	2012-01-25 15:57:50.000000000 +0800
+++ linux-next/include/linux/fs.h	2012-01-25 15:57:51.000000000 +0800
@@ -954,6 +954,7 @@ struct file_ra_state {
 	unsigned int ra_pages;		/* Maximum readahead window */
 	u16 mmap_miss;			/* Cache miss stat for mmap accesses */
 	u8 pattern;			/* one of RA_PATTERN_* */
+	unsigned int for_mmap:1;	/* readahead for mmap accesses */
 
 	loff_t prev_pos;		/* Cache last read() position */
 };
--- linux-next.orig/mm/filemap.c	2012-01-25 15:57:50.000000000 +0800
+++ linux-next/mm/filemap.c	2012-01-25 15:57:51.000000000 +0800
@@ -1578,6 +1578,7 @@ static void do_sync_mmap_readahead(struc
 		return;
 
 	if (VM_SequentialReadHint(vma)) {
+		ra->for_mmap = 1;
 		page_cache_sync_readahead(mapping, ra, file, offset,
 					  ra->ra_pages);
 		return;
@@ -1597,6 +1598,7 @@ static void do_sync_mmap_readahead(struc
 	/*
 	 * mmap read-around
 	 */
+	ra->for_mmap = 1;
 	ra->pattern = RA_PATTERN_MMAP_AROUND;
 	ra_pages = max_sane_readahead(ra->ra_pages);
 	ra->start = max_t(long, 0, offset - ra_pages / 2);
@@ -1622,9 +1624,11 @@ static void do_async_mmap_readahead(stru
 		return;
 	if (ra->mmap_miss > 0)
 		ra->mmap_miss--;
-	if (PageReadahead(page))
+	if (PageReadahead(page)) {
+		ra->for_mmap = 1;
 		page_cache_async_readahead(mapping, ra, file,
 					   page, offset, ra->ra_pages);
+	}
 }
 
 /**
--- linux-next.orig/mm/readahead.c	2012-01-25 15:57:50.000000000 +0800
+++ linux-next/mm/readahead.c	2012-01-25 15:57:51.000000000 +0800
@@ -259,6 +259,7 @@ unsigned long ra_submit(struct file_ra_s
 	actual = __do_page_cache_readahead(mapping, filp,
 					ra->start, ra->size, ra->async_size);
 
+	ra->for_mmap = 0;
 	return actual;
 }
 

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Wu Fengguang
Subject: [PATCH 8/9] readahead: dont do start-of-file readahead after lseek()
Date: Fri, 27 Jan 2012 11:05:32 +0800
Message-ID: <20120127031327.430238053@intel.com>
References: <20120127030524.854259561@intel.com>
Cc: Andi Kleen , Rik van Riel , Linus Torvalds , Wu Fengguang
To: Andrew Morton
Return-path:
cc: Linux Memory Management List ,
Cc: LKML
Content-Disposition: inline; filename=readahead-lseek.patch
Sender: owner-linux-mm@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

Some applications (e.g. blkid, id3tool) seek around the file
to get information. For example, blkid does

	seek to	0
	read	1024
	seek to	1536
	read	16384

The start-of-file readahead heuristic is wrong for them. Their
access pattern can be identified by the lseek() calls.

So set the ra->lseek bit on lseek() and don't do start-of-file
readahead when it is set. Proposed by Linus.
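
The decision added at the start of the file can be sketched in user
space as below. This is only an illustration under stated assumptions:
struct ra_state, on_lseek(), start_of_file_read() and enum ra_choice are
hypothetical stand-ins for file_ra_state, lseek_execute() and the
"if (!offset)" branch of ondemand_readahead(), and clearing the bit
inside the helper stands in for the clearing done in ra_submit().

```c
/* Hypothetical user-space model; names are illustrative, not kernel API. */
struct ra_state {
	unsigned int lseek:1;	/* this read has a leading lseek */
};

enum ra_choice { RA_CHOICE_INITIAL, RA_CHOICE_RANDOM };

/* Models the tagging done in lseek_execute(). */
static void on_lseek(struct ra_state *ra)
{
	ra->lseek = 1;
}

/*
 * Models a read at page offset 0. Sizes are in pages and max is the
 * readahead limit.
 */
static enum ra_choice start_of_file_read(struct ra_state *ra,
					 unsigned long req_size,
					 unsigned long max)
{
	enum ra_choice choice = RA_CHOICE_INITIAL;

	if (ra->lseek && req_size < max)
		choice = RA_CHOICE_RANDOM;

	ra->lseek = 0;	/* stands in for the clearing in ra_submit() */
	return choice;
}
```

With a 32-page limit, blkid's "seek to 0, read 1024" is treated as a
one-page random read, while a read from offset 0 without a preceding
lseek() still opens the initial readahead window.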

Acked-by: Rik van Riel
Acked-by: Linus Torvalds
Signed-off-by: Wu Fengguang
---
 fs/read_write.c    |    3 +++
 include/linux/fs.h |    1 +
 mm/readahead.c     |    4 ++++
 3 files changed, 8 insertions(+)

--- linux-next.orig/mm/readahead.c	2012-01-25 15:57:57.000000000 +0800
+++ linux-next/mm/readahead.c	2012-01-25 15:57:58.000000000 +0800
@@ -485,6 +485,7 @@ unsigned long ra_submit(struct file_ra_s
 			ra->pattern, ra->start, ra->size, ra->async_size,
 			actual);
 
+	ra->lseek = 0;
 	ra->for_mmap = 0;
 	ra->for_metadata = 0;
 	return actual;
@@ -636,6 +637,8 @@ ondemand_readahead(struct address_space
 	 * start of file
 	 */
 	if (!offset) {
+		if (ra->lseek && req_size < max)
+			goto random_read;
 		ra->pattern = RA_PATTERN_INITIAL;
 		goto initial_readahead;
 	}
@@ -721,6 +724,7 @@ ondemand_readahead(struct address_space
 	if (try_context_readahead(mapping, ra, offset, req_size, max))
 		goto readit;
 
+random_read:
 	/*
 	 * standalone, small random read
 	 */
--- linux-next.orig/fs/read_write.c	2012-01-25 15:57:46.000000000 +0800
+++ linux-next/fs/read_write.c	2012-01-25 15:57:58.000000000 +0800
@@ -47,6 +47,9 @@ static loff_t lseek_execute(struct file
 		file->f_pos = offset;
 		file->f_version = 0;
 	}
+
+	file->f_ra.lseek = 1;
+
 	return offset;
 }
 
--- linux-next.orig/include/linux/fs.h	2012-01-25 15:57:57.000000000 +0800
+++ linux-next/include/linux/fs.h	2012-01-25 15:57:58.000000000 +0800
@@ -956,6 +956,7 @@ struct file_ra_state {
 	u8 pattern;			/* one of RA_PATTERN_* */
 	unsigned int for_mmap:1;	/* readahead for mmap accesses */
 	unsigned int for_metadata:1;	/* readahead for meta data */
+	unsigned int lseek:1;		/* this read has a leading lseek */
 
 	loff_t prev_pos;		/* Cache last read() position */
 };

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Wu Fengguang
Subject: [PATCH 4/9] readahead: tag metadata call sites
Date: Fri, 27 Jan 2012 11:05:28 +0800
Message-ID: <20120127031326.881533433@intel.com>
References: <20120127030524.854259561@intel.com>
Cc: Andi Kleen , Jan Kara , Wu Fengguang
To: Andrew Morton
Return-path:
cc: Linux Memory Management List ,
Cc: LKML
Content-Disposition: inline; filename=readahead-for-metadata
Sender: owner-linux-mm@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

We may be doing more metadata readahead in future.

Acked-by: Jan Kara
Signed-off-by: Wu Fengguang
---
 fs/ext3/dir.c      |    1 +
 fs/ext4/dir.c      |    1 +
 include/linux/fs.h |    1 +
 mm/readahead.c     |    1 +
 4 files changed, 4 insertions(+)

--- linux-next.orig/fs/ext3/dir.c	2012-01-25 15:57:46.000000000 +0800
+++ linux-next/fs/ext3/dir.c	2012-01-25 15:57:52.000000000 +0800
@@ -136,6 +136,7 @@ static int ext3_readdir(struct file * fi
 			pgoff_t index = map_bh.b_blocknr >>
 					(PAGE_CACHE_SHIFT - inode->i_blkbits);
 			if (!ra_has_index(&filp->f_ra, index))
+				filp->f_ra.for_metadata = 1;
 				page_cache_sync_readahead(
 					sb->s_bdev->bd_inode->i_mapping,
 					&filp->f_ra, filp,
--- linux-next.orig/fs/ext4/dir.c	2012-01-25 15:57:46.000000000 +0800
+++ linux-next/fs/ext4/dir.c	2012-01-25 15:57:52.000000000 +0800
@@ -153,6 +153,7 @@ static int ext4_readdir(struct file *fil
 			pgoff_t index = map.m_pblk >>
 					(PAGE_CACHE_SHIFT - inode->i_blkbits);
 			if (!ra_has_index(&filp->f_ra, index))
+				filp->f_ra.for_metadata = 1;
 				page_cache_sync_readahead(
 					sb->s_bdev->bd_inode->i_mapping,
 					&filp->f_ra, filp,
--- linux-next.orig/include/linux/fs.h	2012-01-25 15:57:51.000000000 +0800
+++ linux-next/include/linux/fs.h	2012-01-25 15:57:52.000000000 +0800
@@ -955,6 +955,7 @@ struct file_ra_state {
 	u16 mmap_miss;			/* Cache miss stat for mmap accesses */
 	u8 pattern;			/* one of RA_PATTERN_* */
 	unsigned int for_mmap:1;	/* readahead for mmap accesses */
+	unsigned int for_metadata:1;	/* readahead for meta data */
 
 	loff_t prev_pos;		/* Cache last read() position */
 };
--- linux-next.orig/mm/readahead.c	2012-01-25 15:57:51.000000000 +0800
+++ linux-next/mm/readahead.c	2012-01-25 15:57:52.000000000 +0800
@@ -260,6 +260,7 @@ unsigned long ra_submit(struct file_ra_s
 					ra->start, ra->size, ra->async_size);
 
 	ra->for_mmap = 0;
+	ra->for_metadata = 0;
 	return actual;
 }
 

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Wu Fengguang
Subject: [PATCH 7/9] readahead: basic support for backwards prefetching
Date: Fri, 27 Jan 2012 11:05:31 +0800
Message-ID: <20120127031327.293145482@intel.com>
References: <20120127030524.854259561@intel.com>
Cc: Andi Kleen , Li Shaohua , Jan Kara , Wu Fengguang
To: Andrew Morton
Return-path:
cc: Linux Memory Management List ,
Cc: LKML
Content-Disposition: inline; filename=readahead-backwards.patch
Sender: owner-linux-mm@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

Add the backwards prefetching feature. It's pretty simple if we don't
support async prefetching and interleaved reads.
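
The window arithmetic can be tried out in user space with the sketch
below. It is not the kernel code: struct ra_window and backwards_step()
are made-up names, and next_ra_size() assumes the usual
get_next_ra_size() ramp-up (quadruple while the window is below max/16,
otherwise double, capped at max), which this patch does not show.

```c
/* User-space sketch; struct and function names are hypothetical. */
struct ra_window {
	unsigned long start;	/* first page of the last window */
	unsigned long size;	/* number of pages in the last window */
};

/* Assumed ramp-up rule: x4 below max/16, else x2, capped at max. */
static unsigned long next_ra_size(unsigned long cur, unsigned long max)
{
	unsigned long newsize = (cur < max / 16) ? 4 * cur : 2 * cur;

	return newsize < max ? newsize : max;
}

/*
 * If the request [offset, offset + req_size) begins before the window
 * start and reaches it, grow the window and move it towards the start
 * of the file, then return 1. Otherwise leave it alone and return 0.
 */
static int backwards_step(struct ra_window *ra, unsigned long offset,
			  unsigned long req_size, unsigned long max)
{
	if (!(offset < ra->start && offset + req_size >= ra->start))
		return 0;

	ra->size = next_ra_size(ra->size, max);
	if (ra->size > ra->start) {
		/* the window would cross page 0: clamp it */
		ra->size = ra->start < max ? ra->start : max;
		ra->start = 0;
	} else {
		ra->start -= ra->size;
	}
	return 1;
}
```

Starting from a one-page window at page 10000 with max = 1920,
one-page reads at 9999 and then 9995 give the windows 9996+4 and
9980+16, the same values as in the 1-page trace in this changelog.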

tail and tac are observed to have the reverse read pattern:

tail-3501 [006] 111.881191: readahead: readahead-random(bdi=0:16, ino=1548450, req=750+1, ra=750+1-0, async=0) = 1
tail-3501 [006] 111.881506: readahead: readahead-backwards(bdi=0:16, ino=1548450, req=748+2, ra=746+5-0, async=0) = 4
tail-3501 [006] 111.882021: readahead: readahead-backwards(bdi=0:16, ino=1548450, req=744+2, ra=726+25-0, async=0) = 20
tail-3501 [006] 111.883713: readahead: readahead-backwards(bdi=0:16, ino=1548450, req=724+2, ra=626+125-0, async=0) = 100

tac-3528 [001] 118.671924: readahead: readahead-random(bdi=0:16, ino=1548445, req=750+1, ra=750+1-0, async=0) = 1
tac-3528 [001] 118.672371: readahead: readahead-backwards(bdi=0:16, ino=1548445, req=748+2, ra=746+5-0, async=0) = 4
tac-3528 [001] 118.673039: readahead: readahead-backwards(bdi=0:16, ino=1548445, req=744+2, ra=726+25-0, async=0) = 20

Here is the behavior with an 8-page read sequence from 10000 down to 0.
(The readahead size is a bit large since it's an NFS mount.)

readahead-random(dev=0:16, ino=3948605, req=10000+8, ra=10000+8-0, async=0) = 8
readahead-backwards(dev=0:16, ino=3948605, req=9992+8, ra=9968+32-0, async=0) = 32
readahead-backwards(dev=0:16, ino=3948605, req=9960+8, ra=9840+128-0, async=0) = 128
readahead-backwards(dev=0:16, ino=3948605, req=9832+8, ra=9584+256-0, async=0) = 256
readahead-backwards(dev=0:16, ino=3948605, req=9576+8, ra=9072+512-0, async=0) = 512
readahead-backwards(dev=0:16, ino=3948605, req=9064+8, ra=8048+1024-0, async=0) = 1024
readahead-backwards(dev=0:16, ino=3948605, req=8040+8, ra=6128+1920-0, async=0) = 1920
readahead-backwards(dev=0:16, ino=3948605, req=6120+8, ra=4208+1920-0, async=0) = 1920
readahead-backwards(dev=0:16, ino=3948605, req=4200+8, ra=2288+1920-0, async=0) = 1920
readahead-backwards(dev=0:16, ino=3948605, req=2280+8, ra=368+1920-0, async=0) = 1920
readahead-backwards(dev=0:16, ino=3948605, req=360+8, ra=0+368-0, async=0) = 368

And a simple 1-page read sequence from 10000 down to 0.

readahead-random(dev=0:16, ino=3948605, req=10000+1, ra=10000+1-0, async=0) = 1
readahead-backwards(dev=0:16, ino=3948605, req=9999+1, ra=9996+4-0, async=0) = 4
readahead-backwards(dev=0:16, ino=3948605, req=9995+1, ra=9980+16-0, async=0) = 16
readahead-backwards(dev=0:16, ino=3948605, req=9979+1, ra=9916+64-0, async=0) = 64
readahead-backwards(dev=0:16, ino=3948605, req=9915+1, ra=9660+256-0, async=0) = 256
readahead-backwards(dev=0:16, ino=3948605, req=9659+1, ra=9148+512-0, async=0) = 512
readahead-backwards(dev=0:16, ino=3948605, req=9147+1, ra=8124+1024-0, async=0) = 1024
readahead-backwards(dev=0:16, ino=3948605, req=8123+1, ra=6204+1920-0, async=0) = 1920
readahead-backwards(dev=0:16, ino=3948605, req=6203+1, ra=4284+1920-0, async=0) = 1920
readahead-backwards(dev=0:16, ino=3948605, req=4283+1, ra=2364+1920-0, async=0) = 1920
readahead-backwards(dev=0:16, ino=3948605, req=2363+1, ra=444+1920-0, async=0) = 1920
readahead-backwards(dev=0:16, ino=3948605, req=443+1, ra=0+444-0, async=0) = 444

CC: Andi Kleen
CC: Li Shaohua
Acked-by: Jan Kara
Signed-off-by: Wu Fengguang
---
 include/linux/fs.h         |    2 ++
 include/trace/events/vfs.h |    1 +
 mm/readahead.c             |   20 ++++++++++++++++++++
 3 files changed, 23 insertions(+)

--- linux-next.orig/include/linux/fs.h	2012-01-25 15:57:52.000000000 +0800
+++ linux-next/include/linux/fs.h	2012-01-25 15:57:57.000000000 +0800
@@ -975,6 +975,7 @@ struct file_ra_state {
  *				streams.
  * RA_PATTERN_MMAP_AROUND	read-around on mmap page faults
  *				(w/o any sequential/random hints)
+ * RA_PATTERN_BACKWARDS		reverse reading detected
  * RA_PATTERN_FADVISE		triggered by POSIX_FADV_WILLNEED or FMODE_RANDOM
  * RA_PATTERN_OVERSIZE		a random read larger than max readahead size,
  *				do max readahead to break down the read size
@@ -985,6 +986,7 @@ enum readahead_pattern {
 	RA_PATTERN_SUBSEQUENT,
 	RA_PATTERN_CONTEXT,
 	RA_PATTERN_MMAP_AROUND,
+	RA_PATTERN_BACKWARDS,
 	RA_PATTERN_FADVISE,
 	RA_PATTERN_OVERSIZE,
 	RA_PATTERN_RANDOM,
--- linux-next.orig/mm/readahead.c	2012-01-25 15:57:53.000000000 +0800
+++ linux-next/mm/readahead.c	2012-01-25 15:57:57.000000000 +0800
@@ -695,6 +695,26 @@ ondemand_readahead(struct address_space
 	}
 
 	/*
+	 * backwards reading
+	 */
+	if (offset < ra->start && offset + req_size >= ra->start) {
+		ra->pattern = RA_PATTERN_BACKWARDS;
+		ra->size = get_next_ra_size(ra, max);
+		if (ra->size > ra->start) {
+			/*
+			 * ra->start may be concurrently set to some huge
+			 * value, the min() at least avoids submitting huge IO
+			 * in this race condition
+			 */
+			ra->size = min(ra->start, max);
+			ra->start = 0;
+		} else
+			ra->start -= ra->size;
+		ra->async_size = 0;
+		goto readit;
+	}
+
+	/*
 	 * Query the page cache and look for the traces(cached history pages)
 	 * that a sequential stream would leave behind.
 	 */
--- linux-next.orig/include/trace/events/vfs.h	2012-01-25 15:57:52.000000000 +0800
+++ linux-next/include/trace/events/vfs.h	2012-01-25 15:57:57.000000000 +0800
@@ -14,6 +14,7 @@
 			{ RA_PATTERN_SUBSEQUENT,	"subsequent"	}, \
 			{ RA_PATTERN_CONTEXT,		"context"	}, \
 			{ RA_PATTERN_MMAP_AROUND,	"around"	}, \
+			{ RA_PATTERN_BACKWARDS,		"backwards"	}, \
 			{ RA_PATTERN_FADVISE,		"fadvise"	}, \
 			{ RA_PATTERN_OVERSIZE,		"oversize"	}, \
 			{ RA_PATTERN_RANDOM,		"random"	}, \

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Wu Fengguang
Subject: [PATCH 6/9] readahead: add /debug/readahead/stats
Date: Fri, 27 Jan 2012 11:05:30 +0800
Message-ID: <20120127031327.159293683@intel.com>
References: <20120127030524.854259561@intel.com>
Cc: Andi Kleen , Ingo Molnar , Jens Axboe , Peter Zijlstra , Rik van Riel , Wu Fengguang
To: Andrew Morton
Return-path:
cc: Linux Memory Management List ,
Cc: LKML
Content-Disposition: inline; filename=readahead-stats.patch
Sender: owner-linux-mm@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

The accounting code is compiled in only with CONFIG_READAHEAD_STATS=y
(the Kconfig default is n), and even then it remains inactive by default.

It can be enabled/disabled at runtime through the debugfs interface

	echo 1 > /debug/readahead/stats_enable
	echo 0 > /debug/readahead/stats_enable

Example output:
(taken from a freshly booted NFS-ROOT console box with rsize=524288)

$ cat /debug/readahead/stats
pattern     readahead    eof_hit  cache_hit         io    sync_io    mmap_io    meta_io       size async_size    io_size
initial           702        511          0        692        692          0          0          2          0          2
subsequent          7          0          1          7          1          1          0         23         22         23
context           160        161          0          2          0          1          0          0          0         16
around            184        184        177        184        184        184          0         58          0         53
backwards           2          0          2          2          2          0          0          4          0          3
fadvise          2593         47          8       2588       2588          0          0          1          0          1
oversize            0          0          0          0          0          0          0          0          0          0
random             45         20          0         44         44          0          0          1          0          1
all              3697        923        188       3519       3511        186          0          4          0          4

The two most important columns are
- io		number of readahead IO
- io_size	average readahead IO size

CC: Ingo Molnar
CC: Jens Axboe
CC: Peter Zijlstra
Acked-by: Rik van Riel
Signed-off-by: Wu Fengguang
---
 mm/Kconfig     |   15 +++
 mm/readahead.c |  202 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 217 insertions(+)

--- linux-next.orig/mm/readahead.c	2012-01-25 15:57:52.000000000 +0800
+++ linux-next/mm/readahead.c	2012-01-25 15:57:53.000000000 +0800
@@ -33,6 +33,202 @@ EXPORT_SYMBOL_GPL(file_ra_state_init);
 
 #define list_to_page(head) (list_entry((head)->prev, struct page, lru))
 
+#ifdef CONFIG_READAHEAD_STATS
+#include
+#include
+#include
+
+static u32 readahead_stats_enable __read_mostly;
+
+static const struct trace_print_flags ra_pattern_names[] = {
+	READAHEAD_PATTERNS
+};
+
+enum ra_account {
+	/* number of readaheads */
+	RA_ACCOUNT_COUNT,	/* readahead request */
+	RA_ACCOUNT_EOF,		/* readahead request covers EOF */
+	RA_ACCOUNT_CACHE_HIT,	/* readahead request covers some cached pages */
+	RA_ACCOUNT_IOCOUNT,	/* readahead IO */
+	RA_ACCOUNT_SYNC,	/* readahead IO that is synchronous */
+	RA_ACCOUNT_MMAP,	/* readahead IO by mmap page faults */
+	RA_ACCOUNT_METADATA,	/* readahead IO on metadata */
+	/* number of readahead pages */
+	RA_ACCOUNT_SIZE,	/* readahead size */
+	RA_ACCOUNT_ASYNC_SIZE,	/* readahead async size */
+	RA_ACCOUNT_ACTUAL,	/* readahead actual IO size */
+	/* end mark */
+	RA_ACCOUNT_MAX,
+};
+
+#define RA_STAT_BATCH	(INT_MAX / 2)
+static struct percpu_counter ra_stat[RA_PATTERN_ALL][RA_ACCOUNT_MAX];
+
+static inline void add_ra_stat(int i, int j, s64 amount)
+{
+	__percpu_counter_add(&ra_stat[i][j], amount, RA_STAT_BATCH);
+}
+
+static inline void inc_ra_stat(int i, int j)
+{
+	add_ra_stat(i, j, 1);
+}
+
+static void readahead_stats(struct address_space *mapping,
+			    pgoff_t offset,
+			    unsigned long req_size,
+			    bool for_mmap,
+			    bool for_metadata,
+			    enum readahead_pattern pattern,
+			    pgoff_t start,
+			    unsigned long size,
+			    unsigned long async_size,
+			    int actual)
+{
+	pgoff_t eof = ((i_size_read(mapping->host)-1) >> PAGE_CACHE_SHIFT) + 1;
+
+	inc_ra_stat(pattern, RA_ACCOUNT_COUNT);
+	add_ra_stat(pattern, RA_ACCOUNT_SIZE, size);
+	add_ra_stat(pattern, RA_ACCOUNT_ASYNC_SIZE, async_size);
+	add_ra_stat(pattern, RA_ACCOUNT_ACTUAL, actual);
+
+	if (start + size >= eof)
+		inc_ra_stat(pattern, RA_ACCOUNT_EOF);
+	if (actual < size)
+		inc_ra_stat(pattern, RA_ACCOUNT_CACHE_HIT);
+
+	if (actual) {
+		inc_ra_stat(pattern, RA_ACCOUNT_IOCOUNT);
+
+		if (start <= offset && offset < start + size)
+			inc_ra_stat(pattern, RA_ACCOUNT_SYNC);
+
+		if (for_mmap)
+			inc_ra_stat(pattern, RA_ACCOUNT_MMAP);
+		if (for_metadata)
+			inc_ra_stat(pattern, RA_ACCOUNT_METADATA);
+	}
+}
+
+static void readahead_stats_reset(void)
+{
+	int i, j;
+
+	for (i = 0; i < RA_PATTERN_ALL; i++)
+		for (j = 0; j < RA_ACCOUNT_MAX; j++)
+			percpu_counter_set(&ra_stat[i][j], 0);
+}
+
+static void
+readahead_stats_sum(long long ra_stats[RA_PATTERN_MAX][RA_ACCOUNT_MAX])
+{
+	int i, j;
+
+	for (i = 0; i < RA_PATTERN_ALL; i++)
+		for (j = 0; j < RA_ACCOUNT_MAX; j++) {
+			s64 n = percpu_counter_sum(&ra_stat[i][j]);
+			ra_stats[i][j] += n;
+			ra_stats[RA_PATTERN_ALL][j] += n;
+		}
+}
+
+static int readahead_stats_show(struct seq_file *s, void *_)
+{
+	long long ra_stats[RA_PATTERN_MAX][RA_ACCOUNT_MAX];
+	int i;
+
+	seq_printf(s,
+		   "%-10s %10s %10s %10s %10s %10s %10s %10s %10s %10s %10s\n",
+		   "pattern", "readahead", "eof_hit", "cache_hit",
+		   "io", "sync_io", "mmap_io", "meta_io",
+		   "size", "async_size", "io_size");
+
+	memset(ra_stats, 0, sizeof(ra_stats));
+	readahead_stats_sum(ra_stats);
+
+	for (i = 0; i < RA_PATTERN_MAX; i++) {
+		unsigned long count = ra_stats[i][RA_ACCOUNT_COUNT];
+		unsigned long iocount = ra_stats[i][RA_ACCOUNT_IOCOUNT];
+		/*
+		 * avoid division-by-zero
+		 */
+		if (count == 0)
+			count = 1;
+		if (iocount == 0)
+			iocount = 1;
+
+		seq_printf(s, "%-10s %10lld %10lld %10lld %10lld %10lld "
+			   "%10lld %10lld %10lld %10lld %10lld\n",
+				ra_pattern_names[i].name,
+				ra_stats[i][RA_ACCOUNT_COUNT],
+				ra_stats[i][RA_ACCOUNT_EOF],
+				ra_stats[i][RA_ACCOUNT_CACHE_HIT],
+				ra_stats[i][RA_ACCOUNT_IOCOUNT],
+				ra_stats[i][RA_ACCOUNT_SYNC],
+				ra_stats[i][RA_ACCOUNT_MMAP],
+				ra_stats[i][RA_ACCOUNT_METADATA],
+				ra_stats[i][RA_ACCOUNT_SIZE] / count,
+				ra_stats[i][RA_ACCOUNT_ASYNC_SIZE] / count,
+				ra_stats[i][RA_ACCOUNT_ACTUAL] / iocount);
+	}
+
+	return 0;
+}
+
+static int readahead_stats_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, readahead_stats_show, NULL);
+}
+
+static ssize_t readahead_stats_write(struct file *file, const char __user *buf,
+				     size_t size, loff_t *offset)
+{
+	readahead_stats_reset();
+	return size;
+}
+
+static const struct file_operations readahead_stats_fops = {
+	.owner		= THIS_MODULE,
+	.open		= readahead_stats_open,
+	.write		= readahead_stats_write,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+static int __init readahead_create_debugfs(void)
+{
+	struct dentry *root;
+	struct dentry *entry;
+	int i, j;
+
+	root = debugfs_create_dir("readahead", NULL);
+	if (!root)
+		goto out;
+
+	entry = debugfs_create_file("stats", 0644, root,
+				    NULL, &readahead_stats_fops);
+	if (!entry)
+		goto out;
+
+	entry = debugfs_create_bool("stats_enable", 0644, root,
+				    &readahead_stats_enable);
+	if (!entry)
+		goto out;
+
+	for (i = 0; i < RA_PATTERN_ALL; i++)
+		for (j = 0; j < RA_ACCOUNT_MAX; j++)
+			percpu_counter_init(&ra_stat[i][j], 0);
+
+	return 0;
+out:
+	printk(KERN_ERR "readahead: failed to create debugfs entries\n");
+	return -ENOMEM;
+}
+
+late_initcall(readahead_create_debugfs);
+#endif
+
 static inline void readahead_event(struct address_space *mapping,
 				   pgoff_t offset,
 				   unsigned long req_size,
@@ -44,6 +240,12 @@ static inline void readahead_event(struc
 				   unsigned long async_size,
 				   int actual)
 {
+#ifdef CONFIG_READAHEAD_STATS
+	if (readahead_stats_enable)
+		readahead_stats(mapping, offset, req_size,
+				for_mmap, for_metadata,
+				pattern, start, size, async_size, actual);
+#endif
 	trace_readahead(mapping, offset, req_size,
 			pattern, start, size, async_size, actual);
 }
--- linux-next.orig/mm/Kconfig	2012-01-25 15:57:46.000000000 +0800
+++ linux-next/mm/Kconfig	2012-01-25 15:57:53.000000000 +0800
@@ -379,3 +379,18 @@ config CLEANCACHE
 	  in a negligible performance hit.
 
 	  If unsure, say Y to enable cleancache
+
+config READAHEAD_STATS
+	bool "Collect page cache readahead stats"
+	depends on DEBUG_FS
+	default n
+	help
+	  This provides the readahead events accounting facilities.
+
+	  To do readahead accounting for a workload:
+
+	  echo 1 > /sys/kernel/debug/readahead/stats_enable
+	  echo 0 > /sys/kernel/debug/readahead/stats  # reset counters
+	  # run the workload
+	  cat /sys/kernel/debug/readahead/stats       # check counters
+	  echo 0 > /sys/kernel/debug/readahead/stats_enable

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Wu Fengguang
Subject: [PATCH 2/9] readahead: record readahead patterns
Date: Fri, 27 Jan 2012 11:05:26 +0800
Message-ID: <20120127031326.619964905@intel.com>
References: <20120127030524.854259561@intel.com>
Cc: Andi Kleen , Ingo Molnar , Jens Axboe , Peter Zijlstra , Jan Kara , Rik van Riel , Wu Fengguang
To: Andrew Morton
Return-path:
cc: Linux Memory Management List ,
Cc: LKML
Content-Disposition: inline; filename=readahead-tracepoints.patch
Sender: owner-linux-mm@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

Record the readahead pattern in ra->pattern and extend the ra_submit()
parameters, to be used by the next readahead tracing/stats patches.

7 patterns are defined:

	pattern			readahead for
	-----------------------------------------------------------
	RA_PATTERN_INITIAL	start-of-file read
	RA_PATTERN_SUBSEQUENT	trivial sequential read
	RA_PATTERN_CONTEXT	interleaved sequential read
	RA_PATTERN_OVERSIZE	oversize read
	RA_PATTERN_MMAP_AROUND	mmap fault
	RA_PATTERN_FADVISE	posix_fadvise()
	RA_PATTERN_RANDOM	random read

Note that random reads will be recorded in file_ra_state now.
This won't make cache line bouncing worse, because the ra->prev_pos
update in do_generic_file_read() already pollutes the data cache, and
filemap_fault() will stop calling into us after MMAP_LOTSAMISS.
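
As a rough illustration of where ra->pattern gets set, the user-space
sketch below walks the same checks, in the same order, as the
ondemand_readahead() hunks of this patch. It is a simplification under
stated assumptions: struct ra_model, enum model_pattern and classify()
are hypothetical, the two RA_PATTERN_CONTEXT cases are left out because
they need the page cache, and prev_page stands for
ra->prev_pos >> PAGE_CACHE_SHIFT.

```c
/* Hypothetical model of the pattern tagging; not the kernel function. */
enum model_pattern {
	MODEL_INITIAL,
	MODEL_SUBSEQUENT,
	MODEL_OVERSIZE,
	MODEL_RANDOM,
};

struct ra_model {
	unsigned long start;		/* current window start, in pages */
	unsigned long size;		/* current window size */
	unsigned long async_size;	/* async part of the window */
	unsigned long prev_page;	/* page index of the last read */
};

static enum model_pattern classify(const struct ra_model *ra,
				   unsigned long offset,
				   unsigned long req_size,
				   unsigned long max)
{
	/* start of file */
	if (!offset)
		return MODEL_INITIAL;

	/* the expected callback offset: assume sequential access */
	if (offset == ra->start + ra->size - ra->async_size ||
	    offset == ra->start + ra->size)
		return MODEL_SUBSEQUENT;

	/* oversize read */
	if (req_size > max)
		return MODEL_OVERSIZE;

	/*
	 * sequential cache miss; the unsigned subtraction wraps to a
	 * huge value when offset < prev_page, so that case falls through
	 */
	if (offset - ra->prev_page <= 1UL)
		return MODEL_INITIAL;

	/* standalone, small random read */
	return MODEL_RANDOM;
}
```

For a window of 0+4 with async_size 2 and the last read at page 10,
offsets 2 and 4 classify as subsequent, offset 11 as a sequential cache
miss (initial), and offset 500 as random.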

CC: Ingo Molnar
CC: Jens Axboe
CC: Peter Zijlstra
Acked-by: Jan Kara
Acked-by: Rik van Riel
Signed-off-by: Wu Fengguang
---
 include/linux/fs.h |   36 +++++++++++++++++++++++++++++++++++-
 include/linux/mm.h |    4 +++-
 mm/filemap.c       |    3 ++-
 mm/readahead.c     |   29 ++++++++++++++++++++++-------
 4 files changed, 62 insertions(+), 10 deletions(-)

--- linux-next.orig/include/linux/fs.h	2012-01-25 15:57:47.000000000 +0800
+++ linux-next/include/linux/fs.h	2012-01-25 15:57:50.000000000 +0800
@@ -952,11 +952,45 @@ struct file_ra_state {
 					   there are only # of pages ahead */
 
 	unsigned int ra_pages;		/* Maximum readahead window */
-	unsigned int mmap_miss;		/* Cache miss stat for mmap accesses */
+	u16 mmap_miss;			/* Cache miss stat for mmap accesses */
+	u8 pattern;			/* one of RA_PATTERN_* */
+
 	loff_t prev_pos;		/* Cache last read() position */
 };
 
 /*
+ * Which policy makes decision to do the current read-ahead IO?
+ *
+ * RA_PATTERN_INITIAL		readahead window is initially opened,
+ *				normally when reading from start of file
+ * RA_PATTERN_SUBSEQUENT	readahead window is pushed forward
+ * RA_PATTERN_CONTEXT		no readahead window available, querying the
+ *				page cache to decide readahead start/size.
+ *				This typically happens on interleaved reads (eg.
+ *				reading pages 0, 1000, 1, 1001, 2, 1002, ...)
+ *				where one file_ra_state struct is not enough
+ *				for recording 2+ interleaved sequential read
+ *				streams.
+ * RA_PATTERN_MMAP_AROUND read-around on mmap page faults + * (w/o any sequential/random hints) + * RA_PATTERN_FADVISE triggered by POSIX_FADV_WILLNEED or FMODE_RANDOM + * RA_PATTERN_OVERSIZE a random read larger than max readahead size, + * do max readahead to break down the read size + * RA_PATTERN_RANDOM a small random read + */ +enum readahead_pattern { + RA_PATTERN_INITIAL, + RA_PATTERN_SUBSEQUENT, + RA_PATTERN_CONTEXT, + RA_PATTERN_MMAP_AROUND, + RA_PATTERN_FADVISE, + RA_PATTERN_OVERSIZE, + RA_PATTERN_RANDOM, + RA_PATTERN_ALL, /* for summary stats */ + RA_PATTERN_MAX +}; + +/* * Check if @index falls in the readahead windows. */ static inline int ra_has_index(struct file_ra_state *ra, pgoff_t index) --- linux-next.orig/mm/readahead.c 2012-01-25 15:57:49.000000000 +0800 +++ linux-next/mm/readahead.c 2012-01-25 15:57:50.000000000 +0800 @@ -249,7 +249,10 @@ unsigned long max_sane_readahead(unsigne * Submit IO for the read-ahead request in file_ra_state. */ unsigned long ra_submit(struct file_ra_state *ra, - struct address_space *mapping, struct file *filp) + struct address_space *mapping, + struct file *filp, + pgoff_t offset, + unsigned long req_size) { int actual; @@ -382,6 +385,7 @@ static int try_context_readahead(struct if (size >= offset) size *= 2; + ra->pattern = RA_PATTERN_CONTEXT; ra->start = offset; ra->size = min(size + req_size, max); ra->async_size = 1; @@ -403,8 +407,10 @@ ondemand_readahead(struct address_space /* * start of file */ - if (!offset) + if (!offset) { + ra->pattern = RA_PATTERN_INITIAL; goto initial_readahead; + } /* * It's the expected callback offset, assume sequential access. 
@@ -412,6 +418,7 @@ ondemand_readahead(struct address_space */ if ((offset == (ra->start + ra->size - ra->async_size) || offset == (ra->start + ra->size))) { + ra->pattern = RA_PATTERN_SUBSEQUENT; ra->start += ra->size; ra->size = get_next_ra_size(ra, max); ra->async_size = ra->size; @@ -434,6 +441,7 @@ ondemand_readahead(struct address_space if (!start || start - offset > max) return 0; + ra->pattern = RA_PATTERN_CONTEXT; ra->start = start; ra->size = start - offset; /* old async_size */ ra->size += req_size; @@ -445,14 +453,18 @@ ondemand_readahead(struct address_space /* * oversize read */ - if (req_size > max) + if (req_size > max) { + ra->pattern = RA_PATTERN_OVERSIZE; goto initial_readahead; + } /* * sequential cache miss */ - if (offset - (ra->prev_pos >> PAGE_CACHE_SHIFT) <= 1UL) + if (offset - (ra->prev_pos >> PAGE_CACHE_SHIFT) <= 1UL) { + ra->pattern = RA_PATTERN_INITIAL; goto initial_readahead; + } /* * Query the page cache and look for the traces(cached history pages) @@ -463,9 +475,12 @@ ondemand_readahead(struct address_space /* * standalone, small random read - * Read as is, and do not pollute the readahead state. 
*/ - return __do_page_cache_readahead(mapping, filp, offset, req_size, 0); + ra->pattern = RA_PATTERN_RANDOM; + ra->start = offset; + ra->size = req_size; + ra->async_size = 0; + goto readit; initial_readahead: ra->start = offset; @@ -483,7 +498,7 @@ readit: ra->size += ra->async_size; } - return ra_submit(ra, mapping, filp); + return ra_submit(ra, mapping, filp, offset, req_size); } /** --- linux-next.orig/include/linux/mm.h 2012-01-25 15:57:47.000000000 +0800 +++ linux-next/include/linux/mm.h 2012-01-25 15:57:50.000000000 +0800 @@ -1448,7 +1448,9 @@ void page_cache_async_readahead(struct a unsigned long max_sane_readahead(unsigned long nr); unsigned long ra_submit(struct file_ra_state *ra, struct address_space *mapping, - struct file *filp); + struct file *filp, + pgoff_t offset, + unsigned long req_size); /* Generic expand stack which grows the stack according to GROWS{UP,DOWN} */ extern int expand_stack(struct vm_area_struct *vma, unsigned long address); --- linux-next.orig/mm/filemap.c 2012-01-25 15:57:47.000000000 +0800 +++ linux-next/mm/filemap.c 2012-01-25 15:57:50.000000000 +0800 @@ -1597,11 +1597,12 @@ static void do_sync_mmap_readahead(struc /* * mmap read-around */ + ra->pattern = RA_PATTERN_MMAP_AROUND; ra_pages = max_sane_readahead(ra->ra_pages); ra->start = max_t(long, 0, offset - ra_pages / 2); ra->size = ra_pages; ra->async_size = ra_pages / 4; - ra_submit(ra, mapping, file); + ra_submit(ra, mapping, file, offset, 1); } /* -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
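For illustration only (not part of the patch): the order in which ondemand_readahead() assigns ra->pattern can be modelled in userspace roughly as below. This is a sketch under stated assumptions. The struct, enum and function names are hypothetical, and the two RA_PATTERN_CONTEXT checks are left out because they need a real page cache to query.

```c
/* Hypothetical userspace model; names do not exist in the kernel. */
enum ra_pattern_model {
	RA_INITIAL,
	RA_SUBSEQUENT,
	RA_OVERSIZE,
	RA_RANDOM
};

struct ra_state_model {
	unsigned long start;       /* first page of the current window */
	unsigned long size;        /* window size in pages */
	unsigned long async_size;  /* pages read ahead of the reader */
	unsigned long prev_page;   /* page index of the previous read */
};

/*
 * Classify a read of req_size pages at page index offset, following the
 * order of the checks in the patch (context readahead checks omitted).
 */
static enum ra_pattern_model
classify_read(const struct ra_state_model *ra, unsigned long offset,
	      unsigned long req_size, unsigned long max)
{
	/* start of file */
	if (offset == 0)
		return RA_INITIAL;

	/* the expected callback offset: assume sequential access */
	if (offset == ra->start + ra->size - ra->async_size ||
	    offset == ra->start + ra->size)
		return RA_SUBSEQUENT;

	/* oversize read */
	if (req_size > max)
		return RA_OVERSIZE;

	/* sequential cache miss: same or next page as the previous read */
	if (offset - ra->prev_page <= 1UL)
		return RA_INITIAL;

	/* standalone, small random read */
	return RA_RANDOM;
}
```

The unsigned subtraction in the cache-miss check wraps for a backward seek, so such a read falls through to the random case, as in the kernel code.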
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Wu Fengguang Subject: [PATCH 5/9] readahead: add vfs/readahead tracing event Date: Fri, 27 Jan 2012 11:05:29 +0800 Message-ID: <20120127031327.020100004@intel.com> References: <20120127030524.854259561@intel.com> Cc: Andi Kleen , Ingo Molnar , Jens Axboe , Peter Zijlstra , Jan Kara , Rik van Riel , Steven Rostedt , Wu Fengguang To: Andrew Morton Return-path: cc: Linux Memory Management List , Cc: LKML Content-Disposition: inline; filename=readahead-tracer.patch Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org This is very useful for verifying whether the readahead algorithms are working to the expectation. Example output: # echo 1 > /debug/tracing/events/vfs/readahead/enable # cp test-file /dev/null # cat /debug/tracing/trace # trimmed output pattern=initial bdi=0:16 ino=100177 req=0+2 ra=0+4-2 async=0 actual=4 pattern=subsequent bdi=0:16 ino=100177 req=2+2 ra=4+8-8 async=1 actual=8 pattern=subsequent bdi=0:16 ino=100177 req=4+2 ra=12+16-16 async=1 actual=16 pattern=subsequent bdi=0:16 ino=100177 req=12+2 ra=28+32-32 async=1 actual=32 pattern=subsequent bdi=0:16 ino=100177 req=28+2 ra=60+60-60 async=1 actual=24 pattern=subsequent bdi=0:16 ino=100177 req=60+2 ra=120+60-60 async=1 actual=0 CC: Ingo Molnar CC: Jens Axboe CC: Peter Zijlstra Acked-by: Jan Kara Acked-by: Rik van Riel Acked-by: Steven Rostedt Signed-off-by: Wu Fengguang --- fs/Makefile | 1 fs/trace.c | 2 include/trace/events/vfs.h | 77 +++++++++++++++++++++++++++++++++++ mm/readahead.c | 24 ++++++++++ 4 files changed, 104 insertions(+) --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-next/include/trace/events/vfs.h 2012-01-25 15:57:52.000000000 +0800 @@ -0,0 +1,77 @@ +#undef TRACE_SYSTEM +#define TRACE_SYSTEM vfs + +#if !defined(_TRACE_VFS_H) || defined(TRACE_HEADER_MULTI_READ) +#define _TRACE_VFS_H + +#include +#include 
+#include +#include + +#define READAHEAD_PATTERNS \ + { RA_PATTERN_INITIAL, "initial" }, \ + { RA_PATTERN_SUBSEQUENT, "subsequent" }, \ + { RA_PATTERN_CONTEXT, "context" }, \ + { RA_PATTERN_MMAP_AROUND, "around" }, \ + { RA_PATTERN_FADVISE, "fadvise" }, \ + { RA_PATTERN_OVERSIZE, "oversize" }, \ + { RA_PATTERN_RANDOM, "random" }, \ + { RA_PATTERN_ALL, "all" } + +TRACE_EVENT(readahead, + TP_PROTO(struct address_space *mapping, + pgoff_t offset, + unsigned long req_size, + enum readahead_pattern pattern, + pgoff_t start, + unsigned long size, + unsigned long async_size, + unsigned int actual), + + TP_ARGS(mapping, offset, req_size, pattern, start, size, async_size, + actual), + + TP_STRUCT__entry( + __array(char, bdi, 32) + __field(ino_t, ino) + __field(pgoff_t, offset) + __field(unsigned long, req_size) + __field(unsigned int, pattern) + __field(pgoff_t, start) + __field(unsigned int, size) + __field(unsigned int, async_size) + __field(unsigned int, actual) + ), + + TP_fast_assign( + strncpy(__entry->bdi, + dev_name(mapping->backing_dev_info->dev), 32); + __entry->ino = mapping->host->i_ino; + __entry->offset = offset; + __entry->req_size = req_size; + __entry->pattern = pattern; + __entry->start = start; + __entry->size = size; + __entry->async_size = async_size; + __entry->actual = actual; + ), + + TP_printk("pattern=%s bdi=%s ino=%lu " + "req=%lu+%lu ra=%lu+%d-%d async=%d actual=%d", + __print_symbolic(__entry->pattern, READAHEAD_PATTERNS), + __entry->bdi, + __entry->ino, + __entry->offset, + __entry->req_size, + __entry->start, + __entry->size, + __entry->async_size, + __entry->start > __entry->offset, + __entry->actual) +); + +#endif /* _TRACE_VFS_H */ + +/* This part must be outside protection */ +#include --- linux-next.orig/mm/readahead.c 2012-01-25 15:57:52.000000000 +0800 +++ linux-next/mm/readahead.c 2012-01-25 15:57:52.000000000 +0800 @@ -17,6 +17,7 @@ #include #include #include +#include /* * Initialise a struct file's readahead state. 
Assumes that the caller has @@ -32,6 +33,21 @@ EXPORT_SYMBOL_GPL(file_ra_state_init); #define list_to_page(head) (list_entry((head)->prev, struct page, lru)) +static inline void readahead_event(struct address_space *mapping, + pgoff_t offset, + unsigned long req_size, + bool for_mmap, + bool for_metadata, + enum readahead_pattern pattern, + pgoff_t start, + unsigned long size, + unsigned long async_size, + int actual) +{ + trace_readahead(mapping, offset, req_size, + pattern, start, size, async_size, actual); +} + /* * see if a page needs releasing upon read_cache_pages() failure * - the caller of read_cache_pages() may have set PG_private or PG_fscache @@ -228,6 +244,9 @@ int force_page_cache_readahead(struct ad ret = err; break; } + readahead_event(mapping, offset, nr_to_read, 0, 0, + RA_PATTERN_FADVISE, offset, this_chunk, 0, + err); ret += err; offset += this_chunk; nr_to_read -= this_chunk; @@ -259,6 +278,11 @@ unsigned long ra_submit(struct file_ra_s actual = __do_page_cache_readahead(mapping, filp, ra->start, ra->size, ra->async_size); + readahead_event(mapping, offset, req_size, + ra->for_mmap, ra->for_metadata, + ra->pattern, ra->start, ra->size, ra->async_size, + actual); + ra->for_mmap = 0; ra->for_metadata = 0; return actual; --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-next/fs/trace.c 2012-01-25 15:57:52.000000000 +0800 @@ -0,0 +1,2 @@ +#define CREATE_TRACE_POINTS +#include --- linux-next.orig/fs/Makefile 2012-01-25 15:57:46.000000000 +0800 +++ linux-next/fs/Makefile 2012-01-25 15:57:52.000000000 +0800 @@ -50,6 +50,7 @@ obj-$(CONFIG_NFS_COMMON) += nfs_common/ obj-$(CONFIG_GENERIC_ACL) += generic_acl.o obj-$(CONFIG_FHANDLE) += fhandle.o +obj-$(CONFIG_TRACEPOINTS) += trace.o obj-y += quota/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
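For illustration only (not part of the patch): a userspace sketch of how one trace line is assembled from the event fields, mirroring the TP_printk() format above. The struct and function names are hypothetical, and %u is used for the unsigned fields where the patch uses %d.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical userspace stand-in for the fields recorded by the event. */
struct ra_event {
	const char *pattern;      /* symbolic name, e.g. "initial" */
	const char *bdi;          /* backing device name, e.g. "0:16" */
	unsigned long ino;
	unsigned long offset;     /* req: first page asked for by the reader */
	unsigned long req_size;   /* req: number of pages asked for */
	unsigned long start;      /* ra: first page of the readahead window */
	unsigned int size;        /* ra: window size in pages */
	unsigned int async_size;  /* ra: pages read ahead of the reader */
	unsigned int actual;      /* pages actually submitted for IO */
};

/* Format one event as "req=offset+req_size ra=start+size-async_size ...". */
static int format_ra_event(char *buf, size_t len, const struct ra_event *e)
{
	return snprintf(buf, len,
			"pattern=%s bdi=%s ino=%lu "
			"req=%lu+%lu ra=%lu+%u-%u async=%d actual=%u",
			e->pattern, e->bdi, e->ino,
			e->offset, e->req_size,
			e->start, e->size, e->async_size,
			e->start > e->offset, /* window begins past the request */
			e->actual);
}
```

Note that "async" is not stored in the event; it is derived at print time from whether the window starts beyond the requested offset.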
From mboxrd@z Thu Jan 1 00:00:00 1970
From: Christoph Lameter
Subject: Re: [PATCH 6/9] readahead: add /debug/readahead/stats
Date: Fri, 27 Jan 2012 10:21:36 -0600 (CST)
References: <20120127030524.854259561@intel.com> <20120127031327.159293683@intel.com>
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Cc: Andrew Morton, Andi Kleen, Ingo Molnar, Jens Axboe, Peter Zijlstra,
	Rik van Riel, Linux Memory Management List,
	linux-fsdevel@vger.kernel.org, LKML
To: Wu Fengguang
In-Reply-To: <20120127031327.159293683@intel.com>
Sender: owner-linux-mm@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

On Fri, 27 Jan 2012, Wu Fengguang wrote:

> +
> +#define RA_STAT_BATCH	(INT_MAX / 2)
> +static struct percpu_counter ra_stat[RA_PATTERN_ALL][RA_ACCOUNT_MAX];

Why use percpu counter here? The stats structures are not dynamically
allocated so you can just use a DECLARE_PER_CPU statement. That way you do
not have the overhead of percpu counter calls. Instead simple instructions
are generated to deal with the counter.

There are also no calls to any of the fast access functions for
percpu_counter, so it always has to loop over all counters anyway to get
the results. The batching of the percpu_counters is therefore not used.

It's simpler to just do a loop that sums over all counters when displaying
the results.

> +static inline void add_ra_stat(int i, int j, s64 amount)
> +{
> +	__percpu_counter_add(&ra_stat[i][j], amount, RA_STAT_BATCH);

	__this_cpu_add(ra_stat[i][j], amount);

> +}
> +
> +static void readahead_stats_reset(void)
> +{
> +	int i, j;
> +
> +	for (i = 0; i < RA_PATTERN_ALL; i++)
> +		for (j = 0; j < RA_ACCOUNT_MAX; j++)
> +			percpu_counter_set(&ra_stat[i][j], 0);

for_each_online(cpu)
	memset(per_cpu_ptr(&ra_stat, cpu), 0, sizeof(ra_stat));

> +}
> +
> +static void
> +readahead_stats_sum(long long ra_stats[RA_PATTERN_MAX][RA_ACCOUNT_MAX])
> +{
> +	int i, j;
> +
> +	for (i = 0; i < RA_PATTERN_ALL; i++)
> +		for (j = 0; j < RA_ACCOUNT_MAX; j++) {
> +			s64 n = percpu_counter_sum(&ra_stat[i][j]);
> +			ra_stats[i][j] += n;
> +			ra_stats[RA_PATTERN_ALL][j] += n;
> +		}
> +}

Define a function stats instead?

static long get_stat_sum(long __percpu *x)
{
	int cpu;
	long sum = 0;

	for_each_online(cpu)
		sum += *per_cpu_ptr(x, cpu);

	return sum;
}

> +
> +static int readahead_stats_show(struct seq_file *s, void *_)
> +{
> +	readahead_stats_sum(ra_stats);
> +
> +	for (i = 0; i < RA_PATTERN_MAX; i++) {
> +		unsigned long count = ra_stats[i][RA_ACCOUNT_COUNT];

			= get_stats(&ra_stats[i][RA_ACCOUNT]);

...

?
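For illustration only (not from the thread): a userspace model of the scheme suggested here, where each CPU updates a private row of counters and the rows are only folded together when the stats are read. This is a sketch, not kernel code. NR_CPUS_MODEL, NR_STATS_MODEL and all function names are hypothetical, and the explicit cpu argument stands in for the implicit "current CPU" of __this_cpu_add().

```c
#include <string.h>

#define NR_CPUS_MODEL  4   /* hypothetical number of CPUs */
#define NR_STATS_MODEL 3   /* hypothetical number of counters */

/* One private row of counters per CPU. */
static long ra_stat_model[NR_CPUS_MODEL][NR_STATS_MODEL];

/* Fast path: touch only the calling CPU's row. */
static void stat_add(int cpu, int idx, long amount)
{
	ra_stat_model[cpu][idx] += amount;
}

/* Slow path: fold all rows together when the stats are displayed. */
static long stat_sum(int idx)
{
	long sum = 0;	/* the accumulator must start at zero */
	int cpu;

	for (cpu = 0; cpu < NR_CPUS_MODEL; cpu++)
		sum += ra_stat_model[cpu][idx];
	return sum;
}

/* Reset: clear every CPU's row. */
static void stat_reset(void)
{
	memset(ra_stat_model, 0, sizeof(ra_stat_model));
}
```

The trade-off is that updates stay cheap and contention-free while reads cost one pass over all CPUs, which suits counters that are written often and displayed rarely.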
From mboxrd@z Thu Jan 1 00:00:00 1970
From: Andrew Morton
Subject: Re: [PATCH 6/9] readahead: add /debug/readahead/stats
Date: Fri, 27 Jan 2012 12:15:51 -0800
Message-ID: <20120127121551.acd256aa.akpm@linux-foundation.org>
References: <20120127030524.854259561@intel.com> <20120127031327.159293683@intel.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Cc: Wu Fengguang, Andi Kleen, Ingo Molnar, Jens Axboe, Peter Zijlstra,
	Rik van Riel, Linux Memory Management List,
	linux-fsdevel@vger.kernel.org, LKML
To: Christoph Lameter
Sender: owner-linux-mm@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

On Fri, 27 Jan 2012 10:21:36 -0600 (CST)
Christoph Lameter wrote:

> > +
> > +static void readahead_stats_reset(void)
> > +{
> > +	int i, j;
> > +
> > +	for (i = 0; i < RA_PATTERN_ALL; i++)
> > +		for (j = 0; j < RA_ACCOUNT_MAX; j++)
> > +			percpu_counter_set(&ra_stat[i][j], 0);
>
> for_each_online(cpu)
> 	memset(per_cpu_ptr(&ra_stat, cpu), 0, sizeof(ra_stat));

for_each_possible_cpu(). And that's one reason to not open-code the
operation. Another is so we don't have tiresome open-coded loops all
over the place.

But before doing either of those things we should choose boring old
atomic_inc(). Has it been shown that the cost of doing so is
unacceptable? Bearing this in mind:

> The accounting code will be compiled in by default
> (CONFIG_READAHEAD_STATS=y), and will remain inactive by default.

I agree with those choices. They effectively mean that the stats will
be a developer-only/debugger-only thing. So even if the atomic_inc()
costs are measurable during these develop/debug sessions, is anyone
likely to care?
From mboxrd@z Thu Jan 1 00:00:00 1970
From: Wu Fengguang
Subject: Re: [PATCH 6/9] readahead: add /debug/readahead/stats
Date: Sun, 29 Jan 2012 13:07:22 +0800
Message-ID: <20120129050722.GC26244@localhost>
References: <20120127030524.854259561@intel.com> <20120127031327.159293683@intel.com> <20120127121551.acd256aa.akpm@linux-foundation.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: Christoph Lameter, Andi Kleen, Ingo Molnar, Jens Axboe, Peter Zijlstra,
	Rik van Riel, Linux Memory Management List,
	linux-fsdevel@vger.kernel.org, LKML
To: Andrew Morton
Content-Disposition: inline
In-Reply-To: <20120127121551.acd256aa.akpm@linux-foundation.org>
Sender: owner-linux-mm@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

On Fri, Jan 27, 2012 at 12:15:51PM -0800, Andrew Morton wrote:

> > The accounting code will be compiled in by default
> > (CONFIG_READAHEAD_STATS=y), and will remain inactive by default.
>
> I agree with those choices. They effectively mean that the stats will
> be a developer-only/debugger-only thing. So even if the atomic_inc()
> costs are measurable during these develop/debug sessions, is anyone
> likely to care?

Sorry I have changed the default to CONFIG_READAHEAD_STATS=n to avoid
bloating the kernel (and forgot to edit the changelog accordingly).
I'm not sure how many people are going to check the readahead stats.

Thanks,
Fengguang
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Dave Chinner Subject: Re: [PATCH 6/9] readahead: add /debug/readahead/stats Date: Mon, 30 Jan 2012 15:02:39 +1100 Message-ID: <20120130040239.GB9090@dastard> References: <20120127030524.854259561@intel.com> <20120127031327.159293683@intel.com> <20120127121551.acd256aa.akpm@linux-foundation.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Christoph Lameter , Wu Fengguang , Andi Kleen , Ingo Molnar , Jens Axboe , Peter Zijlstra , Rik van Riel , Linux Memory Management List , linux-fsdevel@vger.kernel.org, LKML To: Andrew Morton Return-path: Received: from ipmail04.adl6.internode.on.net ([150.101.137.141]:15730 "EHLO ipmail04.adl6.internode.on.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753557Ab2A3ECm (ORCPT ); Sun, 29 Jan 2012 23:02:42 -0500 Content-Disposition: inline In-Reply-To: <20120127121551.acd256aa.akpm@linux-foundation.org> Sender: linux-fsdevel-owner@vger.kernel.org List-ID: On Fri, Jan 27, 2012 at 12:15:51PM -0800, Andrew Morton wrote: > On Fri, 27 Jan 2012 10:21:36 -0600 (CST) > Christoph Lameter wrote: > > > > + > > > +static void readahead_stats_reset(void) > > > +{ > > > + int i, j; > > > + > > > + for (i = 0; i < RA_PATTERN_ALL; i++) > > > + for (j = 0; j < RA_ACCOUNT_MAX; j++) > > > + percpu_counter_set(&ra_stat[i][j], 0); > > > > for_each_online(cpu) > > memset(per_cpu_ptr(&ra_stat, cpu), 0, sizeof(ra_stat)); > > for_each_possible_cpu(). And that's one reason to not open-code the > operation. Another is so we don't have tiresome open-coded loops all > over the place. Amen, brother! > But before doing either of those things we should choose boring old > atomic_inc(). Has it been shown that the cost of doing so is > unacceptable? 
Bearing this in mind: atomics for stats in the IO path have long been known not to scale well enough - especially now we have PCIe SSDs that can do hundreds of thousands of reads per second if you have enough CPU concurrency to drive them that hard. Under that sort of workload, atomics won't scale. > > > The accounting code will be compiled in by default > > (CONFIG_READAHEAD_STATS=y), and will remain inactive by default. > > I agree with those choices. They effectively mean that the stats will > be a developer-only/debugger-only thing. So even if the atomic_inc() > costs are measurable during these develop/debug sessions, is anyone > likely to care? I do. If I need the debugging stats, the overhead must not perturb the behaviour I'm trying to understand/debug for them to be useful.... Cheers, Dave. -- Dave Chinner david@fromorbit.com From mboxrd@z Thu Jan 1 00:00:00 1970 From: Wu Fengguang Subject: [PATCH 6/9 update changelog] readahead: add /debug/readahead/stats Date: Thu, 9 Feb 2012 11:22:55 +0800 Message-ID: <20120209032255.GA27396@localhost> References: <20120127030524.854259561@intel.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Andi Kleen , Ingo Molnar , Jens Axboe , Peter Zijlstra , Rik van Riel , Wu Fengguang , Linux Memory Management List , linux-fsdevel@vger.kernel.org, LKML To: Andrew Morton Return-path: Content-Disposition: inline; filename="readahead-stats.patch" Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org This accounting code is effectively a no-op by default (CONFIG_READAHEAD_STATS=n). 
It's expected to be runtime reset and enabled before using: echo 0 > /debug/readahead/stats # reset counters echo 1 > /debug/readahead/stats_enable # run test workload echo 0 > /debug/readahead/stats_enable Example output: (taken from a fresh booted NFS-ROOT console box with rsize=524288) $ cat /debug/readahead/stats pattern readahead eof_hit cache_hit io sync_io mmap_io meta_io size async_size io_size initial 702 511 0 692 692 0 0 2 0 2 subsequent 7 0 1 7 1 1 0 23 22 23 context 160 161 0 2 0 1 0 0 0 16 around 184 184 177 184 184 184 0 58 0 53 backwards 2 0 2 2 2 0 0 4 0 3 fadvise 2593 47 8 2588 2588 0 0 1 0 1 oversize 0 0 0 0 0 0 0 0 0 0 random 45 20 0 44 44 0 0 1 0 1 all 3697 923 188 3519 3511 186 0 4 0 4 The two most important columns are - io number of readahead IO - io_size average readahead IO size CC: Ingo Molnar CC: Jens Axboe CC: Peter Zijlstra Acked-by: Rik van Riel Signed-off-by: Wu Fengguang --- mm/Kconfig | 15 +++ mm/readahead.c | 202 +++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 217 insertions(+) --- linux-next.orig/mm/readahead.c 2012-01-25 15:57:52.000000000 +0800 +++ linux-next/mm/readahead.c 2012-01-25 15:57:53.000000000 +0800 @@ -33,6 +33,202 @@ EXPORT_SYMBOL_GPL(file_ra_state_init); #define list_to_page(head) (list_entry((head)->prev, struct page, lru)) +#ifdef CONFIG_READAHEAD_STATS +#include +#include +#include + +static u32 readahead_stats_enable __read_mostly; + +static const struct trace_print_flags ra_pattern_names[] = { + READAHEAD_PATTERNS +}; + +enum ra_account { + /* number of readaheads */ + RA_ACCOUNT_COUNT, /* readahead request */ + RA_ACCOUNT_EOF, /* readahead request covers EOF */ + RA_ACCOUNT_CACHE_HIT, /* readahead request covers some cached pages */ + RA_ACCOUNT_IOCOUNT, /* readahead IO */ + RA_ACCOUNT_SYNC, /* readahead IO that is synchronous */ + RA_ACCOUNT_MMAP, /* readahead IO by mmap page faults */ + RA_ACCOUNT_METADATA, /* readahead IO on metadata */ + /* number of readahead pages */ + RA_ACCOUNT_SIZE, 
/* readahead size */ + RA_ACCOUNT_ASYNC_SIZE, /* readahead async size */ + RA_ACCOUNT_ACTUAL, /* readahead actual IO size */ + /* end mark */ + RA_ACCOUNT_MAX, +}; + +#define RA_STAT_BATCH (INT_MAX / 2) +static struct percpu_counter ra_stat[RA_PATTERN_ALL][RA_ACCOUNT_MAX]; + +static inline void add_ra_stat(int i, int j, s64 amount) +{ + __percpu_counter_add(&ra_stat[i][j], amount, RA_STAT_BATCH); +} + +static inline void inc_ra_stat(int i, int j) +{ + add_ra_stat(i, j, 1); +} + +static void readahead_stats(struct address_space *mapping, + pgoff_t offset, + unsigned long req_size, + bool for_mmap, + bool for_metadata, + enum readahead_pattern pattern, + pgoff_t start, + unsigned long size, + unsigned long async_size, + int actual) +{ + pgoff_t eof = ((i_size_read(mapping->host)-1) >> PAGE_CACHE_SHIFT) + 1; + + inc_ra_stat(pattern, RA_ACCOUNT_COUNT); + add_ra_stat(pattern, RA_ACCOUNT_SIZE, size); + add_ra_stat(pattern, RA_ACCOUNT_ASYNC_SIZE, async_size); + add_ra_stat(pattern, RA_ACCOUNT_ACTUAL, actual); + + if (start + size >= eof) + inc_ra_stat(pattern, RA_ACCOUNT_EOF); + if (actual < size) + inc_ra_stat(pattern, RA_ACCOUNT_CACHE_HIT); + + if (actual) { + inc_ra_stat(pattern, RA_ACCOUNT_IOCOUNT); + + if (start <= offset && offset < start + size) + inc_ra_stat(pattern, RA_ACCOUNT_SYNC); + + if (for_mmap) + inc_ra_stat(pattern, RA_ACCOUNT_MMAP); + if (for_metadata) + inc_ra_stat(pattern, RA_ACCOUNT_METADATA); + } +} + +static void readahead_stats_reset(void) +{ + int i, j; + + for (i = 0; i < RA_PATTERN_ALL; i++) + for (j = 0; j < RA_ACCOUNT_MAX; j++) + percpu_counter_set(&ra_stat[i][j], 0); +} + +static void +readahead_stats_sum(long long ra_stats[RA_PATTERN_MAX][RA_ACCOUNT_MAX]) +{ + int i, j; + + for (i = 0; i < RA_PATTERN_ALL; i++) + for (j = 0; j < RA_ACCOUNT_MAX; j++) { + s64 n = percpu_counter_sum(&ra_stat[i][j]); + ra_stats[i][j] += n; + ra_stats[RA_PATTERN_ALL][j] += n; + } +} + +static int readahead_stats_show(struct seq_file *s, void *_) +{ + long long 
ra_stats[RA_PATTERN_MAX][RA_ACCOUNT_MAX]; + int i; + + seq_printf(s, + "%-10s %10s %10s %10s %10s %10s %10s %10s %10s %10s %10s\n", + "pattern", "readahead", "eof_hit", "cache_hit", + "io", "sync_io", "mmap_io", "meta_io", + "size", "async_size", "io_size"); + + memset(ra_stats, 0, sizeof(ra_stats)); + readahead_stats_sum(ra_stats); + + for (i = 0; i < RA_PATTERN_MAX; i++) { + unsigned long count = ra_stats[i][RA_ACCOUNT_COUNT]; + unsigned long iocount = ra_stats[i][RA_ACCOUNT_IOCOUNT]; + /* + * avoid division-by-zero + */ + if (count == 0) + count = 1; + if (iocount == 0) + iocount = 1; + + seq_printf(s, "%-10s %10lld %10lld %10lld %10lld %10lld " + "%10lld %10lld %10lld %10lld %10lld\n", + ra_pattern_names[i].name, + ra_stats[i][RA_ACCOUNT_COUNT], + ra_stats[i][RA_ACCOUNT_EOF], + ra_stats[i][RA_ACCOUNT_CACHE_HIT], + ra_stats[i][RA_ACCOUNT_IOCOUNT], + ra_stats[i][RA_ACCOUNT_SYNC], + ra_stats[i][RA_ACCOUNT_MMAP], + ra_stats[i][RA_ACCOUNT_METADATA], + ra_stats[i][RA_ACCOUNT_SIZE] / count, + ra_stats[i][RA_ACCOUNT_ASYNC_SIZE] / count, + ra_stats[i][RA_ACCOUNT_ACTUAL] / iocount); + } + + return 0; +} + +static int readahead_stats_open(struct inode *inode, struct file *file) +{ + return single_open(file, readahead_stats_show, NULL); +} + +static ssize_t readahead_stats_write(struct file *file, const char __user *buf, + size_t size, loff_t *offset) +{ + readahead_stats_reset(); + return size; +} + +static const struct file_operations readahead_stats_fops = { + .owner = THIS_MODULE, + .open = readahead_stats_open, + .write = readahead_stats_write, + .read = seq_read, + .llseek = seq_lseek, + .release = single_release, +}; + +static int __init readahead_create_debugfs(void) +{ + struct dentry *root; + struct dentry *entry; + int i, j; + + root = debugfs_create_dir("readahead", NULL); + if (!root) + goto out; + + entry = debugfs_create_file("stats", 0644, root, + NULL, &readahead_stats_fops); + if (!entry) + goto out; + + entry = debugfs_create_bool("stats_enable", 0644, 
root, + &readahead_stats_enable); + if (!entry) + goto out; + + for (i = 0; i < RA_PATTERN_ALL; i++) + for (j = 0; j < RA_ACCOUNT_MAX; j++) + percpu_counter_init(&ra_stat[i][j], 0); + + return 0; +out: + printk(KERN_ERR "readahead: failed to create debugfs entries\n"); + return -ENOMEM; +} + +late_initcall(readahead_create_debugfs); +#endif + static inline void readahead_event(struct address_space *mapping, pgoff_t offset, unsigned long req_size, @@ -44,6 +240,12 @@ static inline void readahead_event(struc unsigned long async_size, int actual) { +#ifdef CONFIG_READAHEAD_STATS + if (readahead_stats_enable) + readahead_stats(mapping, offset, req_size, + for_mmap, for_metadata, + pattern, start, size, async_size, actual); +#endif trace_readahead(mapping, offset, req_size, pattern, start, size, async_size, actual); } --- linux-next.orig/mm/Kconfig 2012-01-25 15:57:46.000000000 +0800 +++ linux-next/mm/Kconfig 2012-01-25 15:57:53.000000000 +0800 @@ -379,3 +379,18 @@ config CLEANCACHE in a negligible performance hit. If unsure, say Y to enable cleancache + +config READAHEAD_STATS + bool "Collect page cache readahead stats" + depends on DEBUG_FS + default n + help + This provides the readahead events accounting facilities. + + To do readahead accounting for a workload: + + echo 0 > /sys/kernel/debug/readahead/stats # reset counters + echo 1 > /sys/kernel/debug/readahead/stats_enable + # run the workload + cat /sys/kernel/debug/readahead/stats # check counters + echo 0 > /sys/kernel/debug/readahead/stats_enable -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
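For illustration only (not part of the patch): the "size", "async_size" and "io_size" columns printed by readahead_stats_show() are averages, and they use different divisors. A minimal userspace sketch, with hypothetical struct and function names:

```c
/* Per-pattern totals, named after the RA_ACCOUNT_* items in the patch. */
struct ra_totals {
	long long count;       /* readahead requests */
	long long iocount;     /* requests that actually issued IO */
	long long size;        /* sum of window sizes, in pages */
	long long async_size;  /* sum of async sizes, in pages */
	long long actual;      /* sum of pages actually read */
};

/*
 * Integer average. A zero divisor is replaced by 1, as in the patch,
 * so a row whose counters are all zero prints 0.
 */
static long long ra_avg(long long total, long long n)
{
	return total / (n ? n : 1);
}

/* "size" column: averaged over all readahead requests. */
static long long avg_size(const struct ra_totals *t)
{
	return ra_avg(t->size, t->count);
}

/* "io_size" column: averaged only over requests that issued IO. */
static long long avg_io_size(const struct ra_totals *t)
{
	return ra_avg(t->actual, t->iocount);
}
```

Because the divisors differ, a pattern with many requests but few IOs can show a small "size" next to a larger "io_size".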
From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from psmtp.com (na3sys010amx157.postini.com [74.125.245.157])
	by kanga.kvack.org (Postfix) with SMTP id 2DD696B0085;
	Thu, 26 Jan 2012 22:40:38 -0500 (EST)
Message-Id: <20120127031326.469063803@intel.com>
Date: Fri, 27 Jan 2012 11:05:25 +0800
From: Wu Fengguang
Subject: [PATCH 1/9] readahead: make context readahead more conservative
References: <20120127030524.854259561@intel.com>
Content-Disposition: inline; filename=readahead-context-tt
Sender: owner-linux-mm@kvack.org
To: Andrew Morton
Cc: Andi Kleen, Wu Fengguang, Linux Memory Management List,
	linux-fsdevel@vger.kernel.org, LKML

Try to avoid negatively impacting moderately dense random reads on SSD.

Transaction-Per-Second numbers provided by Taobao:

		QPS	case
		-------------------------------------------------------
		7536	disable context readahead totally
	w/ patch:
		7129	slower size rampup and start RA on the 3rd read
		6717	slower size rampup
	w/o patch:
		5581	unmodified context readahead

Before the patch, readahead was started whenever page N+1 was read and
page N happened to have been read recently. After the patch, readahead
only starts when *three* random reads happen to access pages N, N+1, N+2.
The probability of this happening is extremely low for pure random reads,
unless they are very dense, in which case some readahead is actually
deserved.

Also start with a smaller readahead window. The impact on interleaved
sequential reads should be small, because for a long-running stream the
small readahead window rampup phase is negligible.

Context readahead actually benefits clustered random reads on HDD, whose
seek cost is pretty high. However, as SSD is increasingly used for random
read workloads, it's better for context readahead to concentrate on
interleaved sequential reads.

Tested-by: Tao Ma
Signed-off-by: Wu Fengguang
---
 mm/readahead.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

--- linux-next.orig/mm/readahead.c	2012-01-25 15:57:47.000000000 +0800
+++ linux-next/mm/readahead.c	2012-01-25 15:57:49.000000000 +0800
@@ -369,10 +369,10 @@ static int try_context_readahead(struct
 	size = count_history_pages(mapping, ra, offset, max);
 
 	/*
-	 * no history pages:
+	 * not enough history pages:
 	 * it could be a random read
 	 */
-	if (!size)
+	if (size <= req_size)
 		return 0;
 
 	/*
@@ -383,8 +383,8 @@ static int try_context_readahead(struct
 		size *= 2;
 
 	ra->start = offset;
-	ra->size = get_init_ra_size(size + req_size, max);
-	ra->async_size = ra->size;
+	ra->size = min(size + req_size, max);
+	ra->async_size = 1;
 
 	return 1;
 }
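For illustration only (not part of the patch): the decision made by try_context_readahead() after this change can be modelled in userspace as below. This is a sketch under stated assumptions. The struct and function names are hypothetical, and the "history" argument stands for the value that count_history_pages() returns in the kernel.

```c
/* Resulting readahead window; all values are in pages. */
struct ctx_ra {
	unsigned long start;
	unsigned long size;
	unsigned long async_size;
};

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/*
 * history: number of cached pages found immediately before offset.
 * Returns 1 and fills *ra when readahead should start, 0 otherwise.
 */
static int context_readahead_model(unsigned long history,
				   unsigned long offset,
				   unsigned long req_size,
				   unsigned long max,
				   struct ctx_ra *ra)
{
	/* not enough history pages: it could be a random read */
	if (history <= req_size)
		return 0;

	/* history reaches back to the start of the file */
	if (history >= offset)
		history *= 2;

	ra->start = offset;
	ra->size = min_ul(history + req_size, max);
	ra->async_size = 1;
	return 1;
}
```

With single-page reads (req_size of 1), one cached history page is no longer enough; readahead starts only once two history pages exist, which is the "start RA on the 3rd read" case in the table above.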
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx194.postini.com [74.125.245.194]) by kanga.kvack.org (Postfix) with SMTP id 4B7326B0087 for ; Thu, 26 Jan 2012 22:40:38 -0500 (EST)
Message-Id: <20120127031327.567686120@intel.com>
Date: Fri, 27 Jan 2012 11:05:33 +0800
From: Wu Fengguang
Subject: [PATCH 9/9] readahead: snap readahead request to EOF
References: <20120127030524.854259561@intel.com>
Content-Disposition: inline; filename=readahead-eof
Sender: owner-linux-mm@kvack.org
List-ID:
To: Andrew Morton
Cc: Andi Kleen , Jan Kara , Wu Fengguang , Linux Memory Management List , linux-fsdevel@vger.kernel.org, LKML

If the file size is 20kb and the readahead request is [0, 16kb), it's
better to expand the readahead request to [0, 20kb), which will likely
save one followup I/O for the ending [16kb, 20kb).

If the readahead request already covers EOF, trim it down to EOF.
Also don't set the PG_readahead mark, to avoid an unnecessary future
invocation of the readahead code.

This special handling looks worthwhile because small to medium sized
files are pretty common.

Acked-by: Jan Kara
Signed-off-by: Wu Fengguang
---
 mm/readahead.c |   21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

--- linux-next.orig/mm/readahead.c	2012-01-25 15:57:58.000000000 +0800
+++ linux-next/mm/readahead.c	2012-01-25 15:57:59.000000000 +0800
@@ -466,6 +466,25 @@ unsigned long max_sane_readahead(unsigne
 		+ node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
 }
 
+static void snap_to_eof(struct file_ra_state *ra, struct address_space *mapping)
+{
+	pgoff_t eof = ((i_size_read(mapping->host)-1) >> PAGE_CACHE_SHIFT) + 1;
+	pgoff_t start = ra->start;
+	unsigned int size = ra->size;
+
+	/*
+	 * skip backwards and random reads
+	 */
+	if (ra->pattern > RA_PATTERN_MMAP_AROUND)
+		return;
+
+	size += min(size / 2, ra->ra_pages / 4);
+	if (start + size > eof) {
+		ra->size = eof - start;
+		ra->async_size = 0;
+	}
+}
+
 /*
  * Submit IO for the read-ahead request in file_ra_state.
  */
@@ -477,6 +496,8 @@ unsigned long ra_submit(struct file_ra_s
 {
 	int actual;
 
+	snap_to_eof(ra, mapping);
+
 	actual = __do_page_cache_readahead(mapping, filp,
 					ra->start, ra->size, ra->async_size);
 
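
To make the arithmetic of the 20kb example concrete, here is a small
user-space sketch of the snapping rule. It is an illustration under stated
assumptions, not the kernel code: struct ra_window, eof_page() and
snap_window_to_eof() are made-up names, 4KB pages are assumed, the
ra->pattern check is left out, and the start < eof guard and the empty-file
case are additions that exist only in this sketch.

```c
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4KB pages */

/* Hypothetical stand-in for the file_ra_state fields used here. */
struct ra_window {
	unsigned long start;		/* first page of the window */
	unsigned int size;		/* window length in pages */
	unsigned int async_size;	/* pages left when async readahead fires */
	unsigned int ra_pages;		/* maximum readahead window */
};

/* Index of the first page past EOF; an empty file maps to 0 (sketch only). */
static unsigned long eof_page(uint64_t file_size)
{
	if (file_size == 0)
		return 0;
	return (unsigned long)((file_size - 1) >> SKETCH_PAGE_SHIFT) + 1;
}

static void snap_window_to_eof(struct ra_window *w, uint64_t file_size)
{
	unsigned long eof = eof_page(file_size);
	unsigned int size = w->size;
	unsigned int half = size / 2;
	unsigned int quarter = w->ra_pages / 4;

	/* look a little past the window: min(size / 2, ra_pages / 4) */
	size += (half < quarter) ? half : quarter;

	/* sketch-only guard: the patch itself does not test start < eof */
	if (w->start < eof && w->start + size > eof) {
		w->size = (unsigned int)(eof - w->start);
		w->async_size = 0;	/* no PG_readahead mark near EOF */
	}
}
```

For a 20kb file and a 4-page window at page 0, the probe size is 4 + 2 = 6
pages, which crosses EOF at page 5, so the window becomes 5 pages, that is
[0, 20kb). A window that already overshoots EOF is trimmed by the same test.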
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx157.postini.com [74.125.245.157]) by kanga.kvack.org (Postfix) with SMTP id 359A16B0088 for ; Thu, 26 Jan 2012 22:40:39 -0500 (EST)
Message-Id: <20120127031326.752229208@intel.com>
Date: Fri, 27 Jan 2012 11:05:27 +0800
From: Wu Fengguang
Subject: [PATCH 3/9] readahead: tag mmap page fault call sites
References: <20120127030524.854259561@intel.com>
Content-Disposition: inline; filename=readahead-for-mmap
Sender: owner-linux-mm@kvack.org
List-ID:
To: Andrew Morton
Cc: Andi Kleen , Jan Kara , Wu Fengguang , Linux Memory Management List , linux-fsdevel@vger.kernel.org, LKML

Introduce a bit field ra->for_mmap for tagging mmap reads.
The tag will be cleared immediately after submitting the IO.

Acked-by: Jan Kara
Signed-off-by: Wu Fengguang
---
 include/linux/fs.h |    1 +
 mm/filemap.c       |    6 +++++-
 mm/readahead.c     |    1 +
 3 files changed, 7 insertions(+), 1 deletion(-)

--- linux-next.orig/include/linux/fs.h	2012-01-25 15:57:50.000000000 +0800
+++ linux-next/include/linux/fs.h	2012-01-25 15:57:51.000000000 +0800
@@ -954,6 +954,7 @@ struct file_ra_state {
 	unsigned int ra_pages;		/* Maximum readahead window */
 	u16 mmap_miss;			/* Cache miss stat for mmap accesses */
 	u8 pattern;			/* one of RA_PATTERN_* */
+	unsigned int for_mmap:1;	/* readahead for mmap accesses */
 
 	loff_t prev_pos;		/* Cache last read() position */
 };
--- linux-next.orig/mm/filemap.c	2012-01-25 15:57:50.000000000 +0800
+++ linux-next/mm/filemap.c	2012-01-25 15:57:51.000000000 +0800
@@ -1578,6 +1578,7 @@ static void do_sync_mmap_readahead(struc
 		return;
 
 	if (VM_SequentialReadHint(vma)) {
+		ra->for_mmap = 1;
 		page_cache_sync_readahead(mapping, ra, file, offset,
 					  ra->ra_pages);
 		return;
@@ -1597,6 +1598,7 @@ static void do_sync_mmap_readahead(struc
 	/*
 	 * mmap read-around
 	 */
+	ra->for_mmap = 1;
 	ra->pattern = RA_PATTERN_MMAP_AROUND;
 	ra_pages =
max_sane_readahead(ra->ra_pages); ra->start = max_t(long, 0, offset - ra_pages / 2); @@ -1622,9 +1624,11 @@ static void do_async_mmap_readahead(stru return; if (ra->mmap_miss > 0) ra->mmap_miss--; - if (PageReadahead(page)) + if (PageReadahead(page)) { + ra->for_mmap = 1; page_cache_async_readahead(mapping, ra, file, page, offset, ra->ra_pages); + } } /** --- linux-next.orig/mm/readahead.c 2012-01-25 15:57:50.000000000 +0800 +++ linux-next/mm/readahead.c 2012-01-25 15:57:51.000000000 +0800 @@ -259,6 +259,7 @@ unsigned long ra_submit(struct file_ra_s actual = __do_page_cache_readahead(mapping, filp, ra->start, ra->size, ra->async_size); + ra->for_mmap = 0; return actual; } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx194.postini.com [74.125.245.194]) by kanga.kvack.org (Postfix) with SMTP id C28936B0087 for ; Thu, 26 Jan 2012 22:40:39 -0500 (EST) Message-Id: <20120127031327.430238053@intel.com> Date: Fri, 27 Jan 2012 11:05:32 +0800 From: Wu Fengguang Subject: [PATCH 8/9] readahead: dont do start-of-file readahead after lseek() References: <20120127030524.854259561@intel.com> Content-Disposition: inline; filename=readahead-lseek.patch Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Andi Kleen , Rik van Riel , Linus Torvalds , Wu Fengguang , Linux Memory Management List , linux-fsdevel@vger.kernel.org, LKML Some applications (eg. blkid, id3tool etc.) seek around the file to get information. For example, blkid does seek to 0 read 1024 seek to 1536 read 16384 The start-of-file readahead heuristic is wrong for them, whose access pattern can be identified by lseek() calls. 
So test-and-set a READAHEAD_LSEEK flag on lseek() and don't do start-of-file readahead on seeing it. Proposed by Linus. Acked-by: Rik van Riel Acked-by: Linus Torvalds Signed-off-by: Wu Fengguang --- fs/read_write.c | 3 +++ include/linux/fs.h | 1 + mm/readahead.c | 4 ++++ 3 files changed, 8 insertions(+) --- linux-next.orig/mm/readahead.c 2012-01-25 15:57:57.000000000 +0800 +++ linux-next/mm/readahead.c 2012-01-25 15:57:58.000000000 +0800 @@ -485,6 +485,7 @@ unsigned long ra_submit(struct file_ra_s ra->pattern, ra->start, ra->size, ra->async_size, actual); + ra->lseek = 0; ra->for_mmap = 0; ra->for_metadata = 0; return actual; @@ -636,6 +637,8 @@ ondemand_readahead(struct address_space * start of file */ if (!offset) { + if (ra->lseek && req_size < max) + goto random_read; ra->pattern = RA_PATTERN_INITIAL; goto initial_readahead; } @@ -721,6 +724,7 @@ ondemand_readahead(struct address_space if (try_context_readahead(mapping, ra, offset, req_size, max)) goto readit; +random_read: /* * standalone, small random read */ --- linux-next.orig/fs/read_write.c 2012-01-25 15:57:46.000000000 +0800 +++ linux-next/fs/read_write.c 2012-01-25 15:57:58.000000000 +0800 @@ -47,6 +47,9 @@ static loff_t lseek_execute(struct file file->f_pos = offset; file->f_version = 0; } + + file->f_ra.lseek = 1; + return offset; } --- linux-next.orig/include/linux/fs.h 2012-01-25 15:57:57.000000000 +0800 +++ linux-next/include/linux/fs.h 2012-01-25 15:57:58.000000000 +0800 @@ -956,6 +956,7 @@ struct file_ra_state { u8 pattern; /* one of RA_PATTERN_* */ unsigned int for_mmap:1; /* readahead for mmap accesses */ unsigned int for_metadata:1; /* readahead for meta data */ + unsigned int lseek:1; /* this read has a leading lseek */ loff_t prev_pos; /* Cache last read() position */ }; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
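
As an illustration of the start-of-file check this patch adds to
ondemand_readahead(), the branch can be modeled stand-alone as below. The
enum and the function name are invented for this sketch, and only the
start-of-file branch is covered; everything else in ondemand_readahead() is
out of scope here.

```c
/* Hypothetical outcome labels for this sketch. */
enum sof_decision {
	SOF_NOT_START,		/* offset != 0: other heuristics apply */
	SOF_INITIAL_READAHEAD,	/* open the initial readahead window */
	SOF_RANDOM_READ,	/* read as is, no start-of-file readahead */
};

/*
 * Model of the patched start-of-file test: a read at offset 0 that
 * follows an lseek() and is smaller than the maximum readahead size
 * is treated as a small random read.
 */
static enum sof_decision start_of_file_decision(unsigned long offset,
						unsigned long req_size,
						unsigned long max,
						int lseek_flag)
{
	if (offset != 0)
		return SOF_NOT_START;
	if (lseek_flag && req_size < max)
		return SOF_RANDOM_READ;
	return SOF_INITIAL_READAHEAD;
}
```

In the blkid case above, "seek to 0, read 1024" is a one-page request with
the lseek flag set, so it falls into the random-read path instead of opening
an initial readahead window.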
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx175.postini.com [74.125.245.175]) by kanga.kvack.org (Postfix) with SMTP id 7CFAE6B008A for ; Thu, 26 Jan 2012 22:40:39 -0500 (EST) Message-Id: <20120127031326.881533433@intel.com> Date: Fri, 27 Jan 2012 11:05:28 +0800 From: Wu Fengguang Subject: [PATCH 4/9] readahead: tag metadata call sites References: <20120127030524.854259561@intel.com> Content-Disposition: inline; filename=readahead-for-metadata Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Andi Kleen , Jan Kara , Wu Fengguang , Linux Memory Management List , linux-fsdevel@vger.kernel.org, LKML We may be doing more metadata readahead in future. Acked-by: Jan Kara Signed-off-by: Wu Fengguang --- fs/ext3/dir.c | 1 + fs/ext4/dir.c | 1 + include/linux/fs.h | 1 + mm/readahead.c | 1 + 4 files changed, 4 insertions(+) --- linux-next.orig/fs/ext3/dir.c 2012-01-25 15:57:46.000000000 +0800 +++ linux-next/fs/ext3/dir.c 2012-01-25 15:57:52.000000000 +0800 @@ -136,6 +136,7 @@ static int ext3_readdir(struct file * fi pgoff_t index = map_bh.b_blocknr >> (PAGE_CACHE_SHIFT - inode->i_blkbits); if (!ra_has_index(&filp->f_ra, index)) + filp->f_ra.for_metadata = 1; page_cache_sync_readahead( sb->s_bdev->bd_inode->i_mapping, &filp->f_ra, filp, --- linux-next.orig/fs/ext4/dir.c 2012-01-25 15:57:46.000000000 +0800 +++ linux-next/fs/ext4/dir.c 2012-01-25 15:57:52.000000000 +0800 @@ -153,6 +153,7 @@ static int ext4_readdir(struct file *fil pgoff_t index = map.m_pblk >> (PAGE_CACHE_SHIFT - inode->i_blkbits); if (!ra_has_index(&filp->f_ra, index)) + filp->f_ra.for_metadata = 1; page_cache_sync_readahead( sb->s_bdev->bd_inode->i_mapping, &filp->f_ra, filp, --- linux-next.orig/include/linux/fs.h 2012-01-25 15:57:51.000000000 +0800 +++ linux-next/include/linux/fs.h 2012-01-25 15:57:52.000000000 +0800 @@ -955,6 +955,7 @@ struct 
file_ra_state { u16 mmap_miss; /* Cache miss stat for mmap accesses */ u8 pattern; /* one of RA_PATTERN_* */ unsigned int for_mmap:1; /* readahead for mmap accesses */ + unsigned int for_metadata:1; /* readahead for meta data */ loff_t prev_pos; /* Cache last read() position */ }; --- linux-next.orig/mm/readahead.c 2012-01-25 15:57:51.000000000 +0800 +++ linux-next/mm/readahead.c 2012-01-25 15:57:52.000000000 +0800 @@ -260,6 +260,7 @@ unsigned long ra_submit(struct file_ra_s ra->start, ra->size, ra->async_size); ra->for_mmap = 0; + ra->for_metadata = 0; return actual; } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx157.postini.com [74.125.245.157]) by kanga.kvack.org (Postfix) with SMTP id 650B16B0088 for ; Thu, 26 Jan 2012 22:40:40 -0500 (EST) Message-Id: <20120127031327.293145482@intel.com> Date: Fri, 27 Jan 2012 11:05:31 +0800 From: Wu Fengguang Subject: [PATCH 7/9] readahead: basic support for backwards prefetching References: <20120127030524.854259561@intel.com> Content-Disposition: inline; filename=readahead-backwards.patch Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Andi Kleen , Li Shaohua , Jan Kara , Wu Fengguang , Linux Memory Management List , linux-fsdevel@vger.kernel.org, LKML Add the backwards prefetching feature. It's pretty simple if we don't support async prefetching and interleaved reads. 
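
A user-space sketch of the window update may help when reading the traces
in this message. It is only a model: struct back_window and both function
names are made up, and next_ra_size() is an assumed stand-in for
get_next_ra_size() (quadruple windows smaller than max/16, double larger
ones, cap at max), chosen because it matches the window sizes seen in the
traces.

```c
/* Hypothetical stand-in for the file_ra_state fields used here. */
struct back_window {
	unsigned long start;	/* first page of the current window */
	unsigned long size;	/* window length in pages */
};

/* Assumed ramp-up rule, modeled on get_next_ra_size(). */
static unsigned long next_ra_size(unsigned long cur, unsigned long max)
{
	unsigned long next = (cur < max / 16) ? 4 * cur : 2 * cur;

	return next < max ? next : max;
}

/*
 * Returns 1 and moves the window backwards when the request
 * [offset, offset + req_size) starts before the current window and
 * reaches its start; returns 0 and leaves the window alone otherwise.
 */
static int backwards_readahead(struct back_window *w, unsigned long offset,
			       unsigned long req_size, unsigned long max)
{
	if (!(offset < w->start && offset + req_size >= w->start))
		return 0;

	w->size = next_ra_size(w->size, max);
	if (w->size > w->start) {
		/* window would run past the start of the file: clamp it */
		w->size = w->start < max ? w->start : max;
		w->start = 0;
	} else {
		w->start -= w->size;
	}
	return 1;
}
```

Starting from a one-page window at page 10000 with max = 1920, successive
one-page reads at 9999, 9995 and 9979 move the window to 9996+4, 9980+16
and 9916+64.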
tail and tac are observed to have the reverse read pattern: tail-3501 [006] 111.881191: readahead: readahead-random(bdi=0:16, ino=1548450, req=750+1, ra=750+1-0, async=0) = 1 tail-3501 [006] 111.881506: readahead: readahead-backwards(bdi=0:16, ino=1548450, req=748+2, ra=746+5-0, async=0) = 4 tail-3501 [006] 111.882021: readahead: readahead-backwards(bdi=0:16, ino=1548450, req=744+2, ra=726+25-0, async=0) = 20 tail-3501 [006] 111.883713: readahead: readahead-backwards(bdi=0:16, ino=1548450, req=724+2, ra=626+125-0, async=0) = 100 tac-3528 [001] 118.671924: readahead: readahead-random(bdi=0:16, ino=1548445, req=750+1, ra=750+1-0, async=0) = 1 tac-3528 [001] 118.672371: readahead: readahead-backwards(bdi=0:16, ino=1548445, req=748+2, ra=746+5-0, async=0) = 4 tac-3528 [001] 118.673039: readahead: readahead-backwards(bdi=0:16, ino=1548445, req=744+2, ra=726+25-0, async=0) = 20 Here is the behavior with an 8-page read sequence from 10000 down to 0. (The readahead size is a bit large since it's an NFS mount.) 
readahead-random(dev=0:16, ino=3948605, req=10000+8, ra=10000+8-0, async=0) = 8 readahead-backwards(dev=0:16, ino=3948605, req=9992+8, ra=9968+32-0, async=0) = 32 readahead-backwards(dev=0:16, ino=3948605, req=9960+8, ra=9840+128-0, async=0) = 128 readahead-backwards(dev=0:16, ino=3948605, req=9832+8, ra=9584+256-0, async=0) = 256 readahead-backwards(dev=0:16, ino=3948605, req=9576+8, ra=9072+512-0, async=0) = 512 readahead-backwards(dev=0:16, ino=3948605, req=9064+8, ra=8048+1024-0, async=0) = 1024 readahead-backwards(dev=0:16, ino=3948605, req=8040+8, ra=6128+1920-0, async=0) = 1920 readahead-backwards(dev=0:16, ino=3948605, req=6120+8, ra=4208+1920-0, async=0) = 1920 readahead-backwards(dev=0:16, ino=3948605, req=4200+8, ra=2288+1920-0, async=0) = 1920 readahead-backwards(dev=0:16, ino=3948605, req=2280+8, ra=368+1920-0, async=0) = 1920 readahead-backwards(dev=0:16, ino=3948605, req=360+8, ra=0+368-0, async=0) = 368 And a simple 1-page read sequence from 10000 down to 0. readahead-random(dev=0:16, ino=3948605, req=10000+1, ra=10000+1-0, async=0) = 1 readahead-backwards(dev=0:16, ino=3948605, req=9999+1, ra=9996+4-0, async=0) = 4 readahead-backwards(dev=0:16, ino=3948605, req=9995+1, ra=9980+16-0, async=0) = 16 readahead-backwards(dev=0:16, ino=3948605, req=9979+1, ra=9916+64-0, async=0) = 64 readahead-backwards(dev=0:16, ino=3948605, req=9915+1, ra=9660+256-0, async=0) = 256 readahead-backwards(dev=0:16, ino=3948605, req=9659+1, ra=9148+512-0, async=0) = 512 readahead-backwards(dev=0:16, ino=3948605, req=9147+1, ra=8124+1024-0, async=0) = 1024 readahead-backwards(dev=0:16, ino=3948605, req=8123+1, ra=6204+1920-0, async=0) = 1920 readahead-backwards(dev=0:16, ino=3948605, req=6203+1, ra=4284+1920-0, async=0) = 1920 readahead-backwards(dev=0:16, ino=3948605, req=4283+1, ra=2364+1920-0, async=0) = 1920 readahead-backwards(dev=0:16, ino=3948605, req=2363+1, ra=444+1920-0, async=0) = 1920 readahead-backwards(dev=0:16, ino=3948605, req=443+1, ra=0+444-0, async=0) = 
444 CC: Andi Kleen CC: Li Shaohua Acked-by: Jan Kara Signed-off-by: Wu Fengguang --- include/linux/fs.h | 2 ++ include/trace/events/vfs.h | 1 + mm/readahead.c | 20 ++++++++++++++++++++ 3 files changed, 23 insertions(+) --- linux-next.orig/include/linux/fs.h 2012-01-25 15:57:52.000000000 +0800 +++ linux-next/include/linux/fs.h 2012-01-25 15:57:57.000000000 +0800 @@ -975,6 +975,7 @@ struct file_ra_state { * streams. * RA_PATTERN_MMAP_AROUND read-around on mmap page faults * (w/o any sequential/random hints) + * RA_PATTERN_BACKWARDS reverse reading detected * RA_PATTERN_FADVISE triggered by POSIX_FADV_WILLNEED or FMODE_RANDOM * RA_PATTERN_OVERSIZE a random read larger than max readahead size, * do max readahead to break down the read size @@ -985,6 +986,7 @@ enum readahead_pattern { RA_PATTERN_SUBSEQUENT, RA_PATTERN_CONTEXT, RA_PATTERN_MMAP_AROUND, + RA_PATTERN_BACKWARDS, RA_PATTERN_FADVISE, RA_PATTERN_OVERSIZE, RA_PATTERN_RANDOM, --- linux-next.orig/mm/readahead.c 2012-01-25 15:57:53.000000000 +0800 +++ linux-next/mm/readahead.c 2012-01-25 15:57:57.000000000 +0800 @@ -695,6 +695,26 @@ ondemand_readahead(struct address_space } /* + * backwards reading + */ + if (offset < ra->start && offset + req_size >= ra->start) { + ra->pattern = RA_PATTERN_BACKWARDS; + ra->size = get_next_ra_size(ra, max); + if (ra->size > ra->start) { + /* + * ra->start may be concurrently set to some huge + * value, the min() at least avoids submitting huge IO + * in this race condition + */ + ra->size = min(ra->start, max); + ra->start = 0; + } else + ra->start -= ra->size; + ra->async_size = 0; + goto readit; + } + + /* * Query the page cache and look for the traces(cached history pages) * that a sequential stream would leave behind. 
 	 */
--- linux-next.orig/include/trace/events/vfs.h	2012-01-25 15:57:52.000000000 +0800
+++ linux-next/include/trace/events/vfs.h	2012-01-25 15:57:57.000000000 +0800
@@ -14,6 +14,7 @@
 			{ RA_PATTERN_SUBSEQUENT,	"subsequent"	},	\
 			{ RA_PATTERN_CONTEXT,		"context"	},	\
 			{ RA_PATTERN_MMAP_AROUND,	"around"	},	\
+			{ RA_PATTERN_BACKWARDS,		"backwards"	},	\
 			{ RA_PATTERN_FADVISE,		"fadvise"	},	\
 			{ RA_PATTERN_OVERSIZE,		"oversize"	},	\
 			{ RA_PATTERN_RANDOM,		"random"	},	\

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx175.postini.com [74.125.245.175]) by kanga.kvack.org (Postfix) with SMTP id 015BB6B0098 for ; Thu, 26 Jan 2012 22:40:40 -0500 (EST)
Message-Id: <20120127031327.159293683@intel.com>
Date: Fri, 27 Jan 2012 11:05:30 +0800
From: Wu Fengguang
Subject: [PATCH 6/9] readahead: add /debug/readahead/stats
References: <20120127030524.854259561@intel.com>
Content-Disposition: inline; filename=readahead-stats.patch
Sender: owner-linux-mm@kvack.org
List-ID:
To: Andrew Morton
Cc: Andi Kleen , Ingo Molnar , Jens Axboe , Peter Zijlstra , Rik van Riel , Wu Fengguang , Linux Memory Management List , linux-fsdevel@vger.kernel.org, LKML

The accounting code is not compiled in by default
(CONFIG_READAHEAD_STATS=n). When compiled in, it remains inactive by
default.
It can be runtime enabled/disabled through the debugfs interface echo 1 > /debug/readahead/stats_enable echo 0 > /debug/readahead/stats_enable Example output: (taken from a fresh booted NFS-ROOT console box with rsize=524288) $ cat /debug/readahead/stats pattern readahead eof_hit cache_hit io sync_io mmap_io meta_io size async_size io_size initial 702 511 0 692 692 0 0 2 0 2 subsequent 7 0 1 7 1 1 0 23 22 23 context 160 161 0 2 0 1 0 0 0 16 around 184 184 177 184 184 184 0 58 0 53 backwards 2 0 2 2 2 0 0 4 0 3 fadvise 2593 47 8 2588 2588 0 0 1 0 1 oversize 0 0 0 0 0 0 0 0 0 0 random 45 20 0 44 44 0 0 1 0 1 all 3697 923 188 3519 3511 186 0 4 0 4 The two most important columns are - io number of readahead IO - io_size average readahead IO size CC: Ingo Molnar CC: Jens Axboe CC: Peter Zijlstra Acked-by: Rik van Riel Signed-off-by: Wu Fengguang --- mm/Kconfig | 15 +++ mm/readahead.c | 202 +++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 217 insertions(+) --- linux-next.orig/mm/readahead.c 2012-01-25 15:57:52.000000000 +0800 +++ linux-next/mm/readahead.c 2012-01-25 15:57:53.000000000 +0800 @@ -33,6 +33,202 @@ EXPORT_SYMBOL_GPL(file_ra_state_init); #define list_to_page(head) (list_entry((head)->prev, struct page, lru)) +#ifdef CONFIG_READAHEAD_STATS +#include +#include +#include + +static u32 readahead_stats_enable __read_mostly; + +static const struct trace_print_flags ra_pattern_names[] = { + READAHEAD_PATTERNS +}; + +enum ra_account { + /* number of readaheads */ + RA_ACCOUNT_COUNT, /* readahead request */ + RA_ACCOUNT_EOF, /* readahead request covers EOF */ + RA_ACCOUNT_CACHE_HIT, /* readahead request covers some cached pages */ + RA_ACCOUNT_IOCOUNT, /* readahead IO */ + RA_ACCOUNT_SYNC, /* readahead IO that is synchronous */ + RA_ACCOUNT_MMAP, /* readahead IO by mmap page faults */ + RA_ACCOUNT_METADATA, /* readahead IO on metadata */ + /* number of readahead pages */ + RA_ACCOUNT_SIZE, /* readahead size */ + RA_ACCOUNT_ASYNC_SIZE, /* readahead async 
size */ + RA_ACCOUNT_ACTUAL, /* readahead actual IO size */ + /* end mark */ + RA_ACCOUNT_MAX, +}; + +#define RA_STAT_BATCH (INT_MAX / 2) +static struct percpu_counter ra_stat[RA_PATTERN_ALL][RA_ACCOUNT_MAX]; + +static inline void add_ra_stat(int i, int j, s64 amount) +{ + __percpu_counter_add(&ra_stat[i][j], amount, RA_STAT_BATCH); +} + +static inline void inc_ra_stat(int i, int j) +{ + add_ra_stat(i, j, 1); +} + +static void readahead_stats(struct address_space *mapping, + pgoff_t offset, + unsigned long req_size, + bool for_mmap, + bool for_metadata, + enum readahead_pattern pattern, + pgoff_t start, + unsigned long size, + unsigned long async_size, + int actual) +{ + pgoff_t eof = ((i_size_read(mapping->host)-1) >> PAGE_CACHE_SHIFT) + 1; + + inc_ra_stat(pattern, RA_ACCOUNT_COUNT); + add_ra_stat(pattern, RA_ACCOUNT_SIZE, size); + add_ra_stat(pattern, RA_ACCOUNT_ASYNC_SIZE, async_size); + add_ra_stat(pattern, RA_ACCOUNT_ACTUAL, actual); + + if (start + size >= eof) + inc_ra_stat(pattern, RA_ACCOUNT_EOF); + if (actual < size) + inc_ra_stat(pattern, RA_ACCOUNT_CACHE_HIT); + + if (actual) { + inc_ra_stat(pattern, RA_ACCOUNT_IOCOUNT); + + if (start <= offset && offset < start + size) + inc_ra_stat(pattern, RA_ACCOUNT_SYNC); + + if (for_mmap) + inc_ra_stat(pattern, RA_ACCOUNT_MMAP); + if (for_metadata) + inc_ra_stat(pattern, RA_ACCOUNT_METADATA); + } +} + +static void readahead_stats_reset(void) +{ + int i, j; + + for (i = 0; i < RA_PATTERN_ALL; i++) + for (j = 0; j < RA_ACCOUNT_MAX; j++) + percpu_counter_set(&ra_stat[i][j], 0); +} + +static void +readahead_stats_sum(long long ra_stats[RA_PATTERN_MAX][RA_ACCOUNT_MAX]) +{ + int i, j; + + for (i = 0; i < RA_PATTERN_ALL; i++) + for (j = 0; j < RA_ACCOUNT_MAX; j++) { + s64 n = percpu_counter_sum(&ra_stat[i][j]); + ra_stats[i][j] += n; + ra_stats[RA_PATTERN_ALL][j] += n; + } +} + +static int readahead_stats_show(struct seq_file *s, void *_) +{ + long long ra_stats[RA_PATTERN_MAX][RA_ACCOUNT_MAX]; + int i; + + seq_printf(s, 
+ "%-10s %10s %10s %10s %10s %10s %10s %10s %10s %10s %10s\n", + "pattern", "readahead", "eof_hit", "cache_hit", + "io", "sync_io", "mmap_io", "meta_io", + "size", "async_size", "io_size"); + + memset(ra_stats, 0, sizeof(ra_stats)); + readahead_stats_sum(ra_stats); + + for (i = 0; i < RA_PATTERN_MAX; i++) { + unsigned long count = ra_stats[i][RA_ACCOUNT_COUNT]; + unsigned long iocount = ra_stats[i][RA_ACCOUNT_IOCOUNT]; + /* + * avoid division-by-zero + */ + if (count == 0) + count = 1; + if (iocount == 0) + iocount = 1; + + seq_printf(s, "%-10s %10lld %10lld %10lld %10lld %10lld " + "%10lld %10lld %10lld %10lld %10lld\n", + ra_pattern_names[i].name, + ra_stats[i][RA_ACCOUNT_COUNT], + ra_stats[i][RA_ACCOUNT_EOF], + ra_stats[i][RA_ACCOUNT_CACHE_HIT], + ra_stats[i][RA_ACCOUNT_IOCOUNT], + ra_stats[i][RA_ACCOUNT_SYNC], + ra_stats[i][RA_ACCOUNT_MMAP], + ra_stats[i][RA_ACCOUNT_METADATA], + ra_stats[i][RA_ACCOUNT_SIZE] / count, + ra_stats[i][RA_ACCOUNT_ASYNC_SIZE] / count, + ra_stats[i][RA_ACCOUNT_ACTUAL] / iocount); + } + + return 0; +} + +static int readahead_stats_open(struct inode *inode, struct file *file) +{ + return single_open(file, readahead_stats_show, NULL); +} + +static ssize_t readahead_stats_write(struct file *file, const char __user *buf, + size_t size, loff_t *offset) +{ + readahead_stats_reset(); + return size; +} + +static const struct file_operations readahead_stats_fops = { + .owner = THIS_MODULE, + .open = readahead_stats_open, + .write = readahead_stats_write, + .read = seq_read, + .llseek = seq_lseek, + .release = single_release, +}; + +static int __init readahead_create_debugfs(void) +{ + struct dentry *root; + struct dentry *entry; + int i, j; + + root = debugfs_create_dir("readahead", NULL); + if (!root) + goto out; + + entry = debugfs_create_file("stats", 0644, root, + NULL, &readahead_stats_fops); + if (!entry) + goto out; + + entry = debugfs_create_bool("stats_enable", 0644, root, + &readahead_stats_enable); + if (!entry) + goto out; + + for (i 
= 0; i < RA_PATTERN_ALL; i++) + for (j = 0; j < RA_ACCOUNT_MAX; j++) + percpu_counter_init(&ra_stat[i][j], 0); + + return 0; +out: + printk(KERN_ERR "readahead: failed to create debugfs entries\n"); + return -ENOMEM; +} + +late_initcall(readahead_create_debugfs); +#endif + static inline void readahead_event(struct address_space *mapping, pgoff_t offset, unsigned long req_size, @@ -44,6 +240,12 @@ static inline void readahead_event(struc unsigned long async_size, int actual) { +#ifdef CONFIG_READAHEAD_STATS + if (readahead_stats_enable) + readahead_stats(mapping, offset, req_size, + for_mmap, for_metadata, + pattern, start, size, async_size, actual); +#endif trace_readahead(mapping, offset, req_size, pattern, start, size, async_size, actual); } --- linux-next.orig/mm/Kconfig 2012-01-25 15:57:46.000000000 +0800 +++ linux-next/mm/Kconfig 2012-01-25 15:57:53.000000000 +0800 @@ -379,3 +379,18 @@ config CLEANCACHE in a negligible performance hit. If unsure, say Y to enable cleancache + +config READAHEAD_STATS + bool "Collect page cache readahead stats" + depends on DEBUG_FS + default n + help + This provides the readahead events accounting facilities. + + To do readahead accounting for a workload: + + echo 1 > /sys/kernel/debug/readahead/stats_enable + echo 0 > /sys/kernel/debug/readahead/stats # reset counters + # run the workload + cat /sys/kernel/debug/readahead/stats # check counters + echo 0 > /sys/kernel/debug/readahead/stats_enable -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
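
To show how one readahead event maps onto the eof_hit, cache_hit, io and
sync_io columns, here is a stand-alone model of the tests made in
readahead_stats(). The struct and function names are invented for this
sketch, and the EOF page index is passed in directly instead of being
derived from the inode size.

```c
/* Hypothetical per-event flags mirroring the counters bumped per event. */
struct ra_event_flags {
	int eof_hit;	/* readahead window reaches EOF */
	int cache_hit;	/* fewer pages read than requested */
	int io;		/* at least one page was actually read */
	int sync_io;	/* the IO covers the page being requested */
};

static struct ra_event_flags classify_ra_event(unsigned long offset,
					       unsigned long start,
					       unsigned long size,
					       unsigned long actual,
					       unsigned long eof)
{
	struct ra_event_flags f = { 0, 0, 0, 0 };

	f.eof_hit = start + size >= eof;
	f.cache_hit = actual < size;
	if (actual) {
		/* only events that submit pages count as readahead IO */
		f.io = 1;
		f.sync_io = start <= offset && offset < start + size;
	}
	return f;
}
```

Note that an event whose window is cut short at EOF also has actual < size,
so under these tests it is counted in both the eof_hit and cache_hit
columns.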
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx187.postini.com [74.125.245.187]) by kanga.kvack.org (Postfix) with SMTP id 1BCB76B0088 for ; Thu, 26 Jan 2012 22:40:42 -0500 (EST) Message-Id: <20120127031326.619964905@intel.com> Date: Fri, 27 Jan 2012 11:05:26 +0800 From: Wu Fengguang Subject: [PATCH 2/9] readahead: record readahead patterns References: <20120127030524.854259561@intel.com> Content-Disposition: inline; filename=readahead-tracepoints.patch Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Andi Kleen , Ingo Molnar , Jens Axboe , Peter Zijlstra , Jan Kara , Rik van Riel , Wu Fengguang , Linux Memory Management List , linux-fsdevel@vger.kernel.org, LKML Record the readahead pattern in ra->pattern and extend ra_submit() parameters, to be used by the next readahead tracing/stats patches. 7 patterns are defined: pattern readahead for ----------------------------------------------------------- RA_PATTERN_INITIAL start-of-file read RA_PATTERN_SUBSEQUENT trivial sequential read RA_PATTERN_CONTEXT interleaved sequential read RA_PATTERN_OVERSIZE oversize read RA_PATTERN_MMAP_AROUND mmap fault RA_PATTERN_FADVISE posix_fadvise() RA_PATTERN_RANDOM random read Note that random reads will be recorded in file_ra_state now. This won't deteriorate cache bouncing because the ra->prev_pos update in do_generic_file_read() already pollutes the data cache, and filemap_fault() will stop calling into us after MMAP_LOTSAMISS. 
CC: Ingo Molnar CC: Jens Axboe CC: Peter Zijlstra Acked-by: Jan Kara Acked-by: Rik van Riel Signed-off-by: Wu Fengguang --- include/linux/fs.h | 36 +++++++++++++++++++++++++++++++++++- include/linux/mm.h | 4 +++- mm/filemap.c | 3 ++- mm/readahead.c | 29 ++++++++++++++++++++++------- 4 files changed, 62 insertions(+), 10 deletions(-) --- linux-next.orig/include/linux/fs.h 2012-01-25 15:57:47.000000000 +0800 +++ linux-next/include/linux/fs.h 2012-01-25 15:57:50.000000000 +0800 @@ -952,11 +952,45 @@ struct file_ra_state { there are only # of pages ahead */ unsigned int ra_pages; /* Maximum readahead window */ - unsigned int mmap_miss; /* Cache miss stat for mmap accesses */ + u16 mmap_miss; /* Cache miss stat for mmap accesses */ + u8 pattern; /* one of RA_PATTERN_* */ + loff_t prev_pos; /* Cache last read() position */ }; /* + * Which policy makes decision to do the current read-ahead IO? + * + * RA_PATTERN_INITIAL readahead window is initially opened, + * normally when reading from start of file + * RA_PATTERN_SUBSEQUENT readahead window is pushed forward + * RA_PATTERN_CONTEXT no readahead window available, querying the + * page cache to decide readahead start/size. + * This typically happens on interleaved reads (eg. + * reading pages 0, 1000, 1, 1001, 2, 1002, ...) + * where one file_ra_state struct is not enough + * for recording 2+ interleaved sequential read + * streams. 
+ * RA_PATTERN_MMAP_AROUND read-around on mmap page faults + * (w/o any sequential/random hints) + * RA_PATTERN_FADVISE triggered by POSIX_FADV_WILLNEED or FMODE_RANDOM + * RA_PATTERN_OVERSIZE a random read larger than max readahead size, + * do max readahead to break down the read size + * RA_PATTERN_RANDOM a small random read + */ +enum readahead_pattern { + RA_PATTERN_INITIAL, + RA_PATTERN_SUBSEQUENT, + RA_PATTERN_CONTEXT, + RA_PATTERN_MMAP_AROUND, + RA_PATTERN_FADVISE, + RA_PATTERN_OVERSIZE, + RA_PATTERN_RANDOM, + RA_PATTERN_ALL, /* for summary stats */ + RA_PATTERN_MAX +}; + +/* * Check if @index falls in the readahead windows. */ static inline int ra_has_index(struct file_ra_state *ra, pgoff_t index) --- linux-next.orig/mm/readahead.c 2012-01-25 15:57:49.000000000 +0800 +++ linux-next/mm/readahead.c 2012-01-25 15:57:50.000000000 +0800 @@ -249,7 +249,10 @@ unsigned long max_sane_readahead(unsigne * Submit IO for the read-ahead request in file_ra_state. */ unsigned long ra_submit(struct file_ra_state *ra, - struct address_space *mapping, struct file *filp) + struct address_space *mapping, + struct file *filp, + pgoff_t offset, + unsigned long req_size) { int actual; @@ -382,6 +385,7 @@ static int try_context_readahead(struct if (size >= offset) size *= 2; + ra->pattern = RA_PATTERN_CONTEXT; ra->start = offset; ra->size = min(size + req_size, max); ra->async_size = 1; @@ -403,8 +407,10 @@ ondemand_readahead(struct address_space /* * start of file */ - if (!offset) + if (!offset) { + ra->pattern = RA_PATTERN_INITIAL; goto initial_readahead; + } /* * It's the expected callback offset, assume sequential access. 
@@ -412,6 +418,7 @@ ondemand_readahead(struct address_space */ if ((offset == (ra->start + ra->size - ra->async_size) || offset == (ra->start + ra->size))) { + ra->pattern = RA_PATTERN_SUBSEQUENT; ra->start += ra->size; ra->size = get_next_ra_size(ra, max); ra->async_size = ra->size; @@ -434,6 +441,7 @@ ondemand_readahead(struct address_space if (!start || start - offset > max) return 0; + ra->pattern = RA_PATTERN_CONTEXT; ra->start = start; ra->size = start - offset; /* old async_size */ ra->size += req_size; @@ -445,14 +453,18 @@ ondemand_readahead(struct address_space /* * oversize read */ - if (req_size > max) + if (req_size > max) { + ra->pattern = RA_PATTERN_OVERSIZE; goto initial_readahead; + } /* * sequential cache miss */ - if (offset - (ra->prev_pos >> PAGE_CACHE_SHIFT) <= 1UL) + if (offset - (ra->prev_pos >> PAGE_CACHE_SHIFT) <= 1UL) { + ra->pattern = RA_PATTERN_INITIAL; goto initial_readahead; + } /* * Query the page cache and look for the traces(cached history pages) @@ -463,9 +475,12 @@ ondemand_readahead(struct address_space /* * standalone, small random read - * Read as is, and do not pollute the readahead state. 
*/ - return __do_page_cache_readahead(mapping, filp, offset, req_size, 0); + ra->pattern = RA_PATTERN_RANDOM; + ra->start = offset; + ra->size = req_size; + ra->async_size = 0; + goto readit; initial_readahead: ra->start = offset; @@ -483,7 +498,7 @@ readit: ra->size += ra->async_size; } - return ra_submit(ra, mapping, filp); + return ra_submit(ra, mapping, filp, offset, req_size); } /** --- linux-next.orig/include/linux/mm.h 2012-01-25 15:57:47.000000000 +0800 +++ linux-next/include/linux/mm.h 2012-01-25 15:57:50.000000000 +0800 @@ -1448,7 +1448,9 @@ void page_cache_async_readahead(struct a unsigned long max_sane_readahead(unsigned long nr); unsigned long ra_submit(struct file_ra_state *ra, struct address_space *mapping, - struct file *filp); + struct file *filp, + pgoff_t offset, + unsigned long req_size); /* Generic expand stack which grows the stack according to GROWS{UP,DOWN} */ extern int expand_stack(struct vm_area_struct *vma, unsigned long address); --- linux-next.orig/mm/filemap.c 2012-01-25 15:57:47.000000000 +0800 +++ linux-next/mm/filemap.c 2012-01-25 15:57:50.000000000 +0800 @@ -1597,11 +1597,12 @@ static void do_sync_mmap_readahead(struc /* * mmap read-around */ + ra->pattern = RA_PATTERN_MMAP_AROUND; ra_pages = max_sane_readahead(ra->ra_pages); ra->start = max_t(long, 0, offset - ra_pages / 2); ra->size = ra_pages; ra->async_size = ra_pages / 4; - ra_submit(ra, mapping, file); + ra_submit(ra, mapping, file, offset, 1); } /*
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx175.postini.com [74.125.245.175]) by kanga.kvack.org (Postfix) with SMTP id 0945E6B0099 for ; Thu, 26 Jan 2012 22:40:42 -0500 (EST) Message-Id: <20120127031327.020100004@intel.com> Date: Fri, 27 Jan 2012 11:05:29 +0800 From: Wu Fengguang Subject: [PATCH 5/9] readahead: add vfs/readahead tracing event References: <20120127030524.854259561@intel.com> Content-Disposition: inline; filename=readahead-tracer.patch Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Andi Kleen , Ingo Molnar , Jens Axboe , Peter Zijlstra , Jan Kara , Rik van Riel , Steven Rostedt , Wu Fengguang , Linux Memory Management List , linux-fsdevel@vger.kernel.org, LKML This is very useful for verifying whether the readahead algorithms are working to the expectation. Example output: # echo 1 > /debug/tracing/events/vfs/readahead/enable # cp test-file /dev/null # cat /debug/tracing/trace # trimmed output pattern=initial bdi=0:16 ino=100177 req=0+2 ra=0+4-2 async=0 actual=4 pattern=subsequent bdi=0:16 ino=100177 req=2+2 ra=4+8-8 async=1 actual=8 pattern=subsequent bdi=0:16 ino=100177 req=4+2 ra=12+16-16 async=1 actual=16 pattern=subsequent bdi=0:16 ino=100177 req=12+2 ra=28+32-32 async=1 actual=32 pattern=subsequent bdi=0:16 ino=100177 req=28+2 ra=60+60-60 async=1 actual=24 pattern=subsequent bdi=0:16 ino=100177 req=60+2 ra=120+60-60 async=1 actual=0 CC: Ingo Molnar CC: Jens Axboe CC: Peter Zijlstra Acked-by: Jan Kara Acked-by: Rik van Riel Acked-by: Steven Rostedt Signed-off-by: Wu Fengguang --- fs/Makefile | 1 fs/trace.c | 2 include/trace/events/vfs.h | 77 +++++++++++++++++++++++++++++++++++ mm/readahead.c | 24 ++++++++++ 4 files changed, 104 insertions(+) --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-next/include/trace/events/vfs.h 2012-01-25 15:57:52.000000000 +0800 @@ -0,0
+1,77 @@ +#undef TRACE_SYSTEM +#define TRACE_SYSTEM vfs + +#if !defined(_TRACE_VFS_H) || defined(TRACE_HEADER_MULTI_READ) +#define _TRACE_VFS_H + +#include +#include +#include +#include + +#define READAHEAD_PATTERNS \ + { RA_PATTERN_INITIAL, "initial" }, \ + { RA_PATTERN_SUBSEQUENT, "subsequent" }, \ + { RA_PATTERN_CONTEXT, "context" }, \ + { RA_PATTERN_MMAP_AROUND, "around" }, \ + { RA_PATTERN_FADVISE, "fadvise" }, \ + { RA_PATTERN_OVERSIZE, "oversize" }, \ + { RA_PATTERN_RANDOM, "random" }, \ + { RA_PATTERN_ALL, "all" } + +TRACE_EVENT(readahead, + TP_PROTO(struct address_space *mapping, + pgoff_t offset, + unsigned long req_size, + enum readahead_pattern pattern, + pgoff_t start, + unsigned long size, + unsigned long async_size, + unsigned int actual), + + TP_ARGS(mapping, offset, req_size, pattern, start, size, async_size, + actual), + + TP_STRUCT__entry( + __array(char, bdi, 32) + __field(ino_t, ino) + __field(pgoff_t, offset) + __field(unsigned long, req_size) + __field(unsigned int, pattern) + __field(pgoff_t, start) + __field(unsigned int, size) + __field(unsigned int, async_size) + __field(unsigned int, actual) + ), + + TP_fast_assign( + strncpy(__entry->bdi, + dev_name(mapping->backing_dev_info->dev), 32); + __entry->ino = mapping->host->i_ino; + __entry->offset = offset; + __entry->req_size = req_size; + __entry->pattern = pattern; + __entry->start = start; + __entry->size = size; + __entry->async_size = async_size; + __entry->actual = actual; + ), + + TP_printk("pattern=%s bdi=%s ino=%lu " + "req=%lu+%lu ra=%lu+%d-%d async=%d actual=%d", + __print_symbolic(__entry->pattern, READAHEAD_PATTERNS), + __entry->bdi, + __entry->ino, + __entry->offset, + __entry->req_size, + __entry->start, + __entry->size, + __entry->async_size, + __entry->start > __entry->offset, + __entry->actual) +); + +#endif /* _TRACE_VFS_H */ + +/* This part must be outside protection */ +#include --- linux-next.orig/mm/readahead.c 2012-01-25 15:57:52.000000000 +0800 +++ 
linux-next/mm/readahead.c 2012-01-25 15:57:52.000000000 +0800 @@ -17,6 +17,7 @@ #include #include #include +#include /* * Initialise a struct file's readahead state. Assumes that the caller has @@ -32,6 +33,21 @@ EXPORT_SYMBOL_GPL(file_ra_state_init); #define list_to_page(head) (list_entry((head)->prev, struct page, lru)) +static inline void readahead_event(struct address_space *mapping, + pgoff_t offset, + unsigned long req_size, + bool for_mmap, + bool for_metadata, + enum readahead_pattern pattern, + pgoff_t start, + unsigned long size, + unsigned long async_size, + int actual) +{ + trace_readahead(mapping, offset, req_size, + pattern, start, size, async_size, actual); +} + /* * see if a page needs releasing upon read_cache_pages() failure * - the caller of read_cache_pages() may have set PG_private or PG_fscache @@ -228,6 +244,9 @@ int force_page_cache_readahead(struct ad ret = err; break; } + readahead_event(mapping, offset, nr_to_read, 0, 0, + RA_PATTERN_FADVISE, offset, this_chunk, 0, + err); ret += err; offset += this_chunk; nr_to_read -= this_chunk; @@ -259,6 +278,11 @@ unsigned long ra_submit(struct file_ra_s actual = __do_page_cache_readahead(mapping, filp, ra->start, ra->size, ra->async_size); + readahead_event(mapping, offset, req_size, + ra->for_mmap, ra->for_metadata, + ra->pattern, ra->start, ra->size, ra->async_size, + actual); + ra->for_mmap = 0; ra->for_metadata = 0; return actual; --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-next/fs/trace.c 2012-01-25 15:57:52.000000000 +0800 @@ -0,0 +1,2 @@ +#define CREATE_TRACE_POINTS +#include --- linux-next.orig/fs/Makefile 2012-01-25 15:57:46.000000000 +0800 +++ linux-next/fs/Makefile 2012-01-25 15:57:52.000000000 +0800 @@ -50,6 +50,7 @@ obj-$(CONFIG_NFS_COMMON) += nfs_common/ obj-$(CONFIG_GENERIC_ACL) += generic_acl.o obj-$(CONFIG_FHANDLE) += fhandle.o +obj-$(CONFIG_TRACEPOINTS) += trace.o obj-y += quota/
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx158.postini.com [74.125.245.158]) by kanga.kvack.org (Postfix) with SMTP id E3E186B004D for ; Sun, 29 Jan 2012 23:02:43 -0500 (EST) Date: Mon, 30 Jan 2012 15:02:39 +1100 From: Dave Chinner Subject: Re: [PATCH 6/9] readahead: add /debug/readahead/stats Message-ID: <20120130040239.GB9090@dastard> References: <20120127030524.854259561@intel.com> <20120127031327.159293683@intel.com> <20120127121551.acd256aa.akpm@linux-foundation.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20120127121551.acd256aa.akpm@linux-foundation.org> Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Christoph Lameter , Wu Fengguang , Andi Kleen , Ingo Molnar , Jens Axboe , Peter Zijlstra , Rik van Riel , Linux Memory Management List , linux-fsdevel@vger.kernel.org, LKML On Fri, Jan 27, 2012 at 12:15:51PM -0800, Andrew Morton wrote: > On Fri, 27 Jan 2012 10:21:36 -0600 (CST) > Christoph Lameter wrote: > > > > + > > > +static void readahead_stats_reset(void) > > > +{ > > > + int i, j; > > > + > > > + for (i = 0; i < RA_PATTERN_ALL; i++) > > > + for (j = 0; j < RA_ACCOUNT_MAX; j++) > > > + percpu_counter_set(&ra_stat[i][j], 0); > > > > for_each_online(cpu) > > memset(per_cpu_ptr(&ra_stat, cpu), 0, sizeof(ra_stat)); > > for_each_possible_cpu(). And that's one reason to not open-code the > operation. Another is so we don't have tiresome open-coded loops all > over the place. Amen, brother! > But before doing either of those things we should choose boring old > atomic_inc(). Has it been shown that the cost of doing so is > unacceptable?
Bearing this in mind: atomics for stats in the IO path have long been known not to scale well enough - especially now we have PCIe SSDs that can do hundreds of thousands of reads per second if you have enough CPU concurrency to drive them that hard. Under that sort of workload, atomics won't scale. > > > The accounting code will be compiled in by default > > (CONFIG_READAHEAD_STATS=y), and will remain inactive by default. > > I agree with those choices. They effectively mean that the stats will > be a developer-only/debugger-only thing. So even if the atomic_inc() > costs are measurable during these develop/debug sessions, is anyone > likely to care? I do. If I need the debugging stats, the overhead must not perturb the behaviour I'm trying to understand/debug for them to be useful.... Cheers, Dave. -- Dave Chinner david@fromorbit.com
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754512Ab2A0Dkk (ORCPT ); Thu, 26 Jan 2012 22:40:40 -0500 Received: from mga14.intel.com ([143.182.124.37]:41153 "EHLO mga14.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751875Ab2A0Dkh (ORCPT ); Thu, 26 Jan 2012 22:40:37 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.71,315,1320652800"; d="scan'208";a="100232689" Message-Id:
<20120127031326.469063803@intel.com> User-Agent: quilt/0.48-1 Date: Fri, 27 Jan 2012 11:05:25 +0800 From: Wu Fengguang To: Andrew Morton cc: Andi Kleen , Wu Fengguang cc: Linux Memory Management List , Cc: LKML Subject: [PATCH 1/9] readahead: make context readahead more conservative References: <20120127030524.854259561@intel.com> Content-Disposition: inline; filename=readahead-context-tt Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Try to avoid negatively impacting moderately dense random reads on SSD. Transaction-Per-Second numbers provided by Taobao: QPS case ------------------------------------------------------- 7536 disable context readahead totally w/ patch: 7129 slower size rampup and start RA on the 3rd read 6717 slower size rampup w/o patch: 5581 unmodified context readahead Before, readahead would be started whenever page N+1 was read and page N happened to have been read recently. After the patch, we'll only start readahead when *three* random reads happen to access pages N, N+1, N+2. The probability of this happening is extremely low for pure random reads, unless they are very dense, in which case some readahead is actually deserved. Also start with a smaller readahead window. The impact on interleaved sequential reads should be small, because for a long-running stream the small readahead window rampup phase is negligible. The context readahead actually benefits clustered random reads on HDD whose seek cost is pretty high. However, as SSDs are increasingly used for random read workloads, it's better for the context readahead to concentrate on interleaved sequential reads.
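The trigger condition described above can be sketched in userspace as follows. This is only an illustrative model, not the kernel code: context_readahead_model(), struct ctx_window and min_ul() are hypothetical names, and the history page count is passed in by the caller instead of being looked up in the page cache the way count_history_pages() does. The arithmetic follows the hunk in this patch.

```c
#include <assert.h>

/* Hypothetical stand-in for the start/size/async_size fields of file_ra_state. */
struct ctx_window {
	unsigned long start;
	unsigned long size;
	unsigned long async_size;
};

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/*
 * Model of the patched decision (all sizes in pages).
 * history:  cached pages found immediately before offset (assumed input)
 * Returns 1 and fills *w if context readahead would start, 0 otherwise.
 */
int context_readahead_model(unsigned long history, unsigned long offset,
			    unsigned long req_size, unsigned long max,
			    struct ctx_window *w)
{
	assert(req_size > 0 && max > 0);

	/* not enough history pages: it could be a random read */
	if (history <= req_size)
		return 0;

	/* history reaches back to the start of file: likely a long stream */
	if (history >= offset)
		history *= 2;

	w->start = offset;
	w->size = min_ul(history + req_size, max);
	w->async_size = 1;
	return 1;
}
```

With one-page reads, a read of page N+1 that finds only page N cached has history equal to req_size and is treated as random. A read of page N+2 that finds both N and N+1 cached passes the check and opens a 3-page window.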
Tested-by: Tao Ma Signed-off-by: Wu Fengguang --- mm/readahead.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) --- linux-next.orig/mm/readahead.c 2012-01-25 15:57:47.000000000 +0800 +++ linux-next/mm/readahead.c 2012-01-25 15:57:49.000000000 +0800 @@ -369,10 +369,10 @@ static int try_context_readahead(struct size = count_history_pages(mapping, ra, offset, max); /* - * no history pages: + * not enough history pages: * it could be a random read */ - if (!size) + if (size <= req_size) return 0; /* @@ -383,8 +383,8 @@ static int try_context_readahead(struct size *= 2; ra->start = offset; - ra->size = get_init_ra_size(size + req_size, max); - ra->async_size = ra->size; + ra->size = min(size + req_size, max); + ra->async_size = 1; return 1; } From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754961Ab2A0Dl3 (ORCPT ); Thu, 26 Jan 2012 22:41:29 -0500 Received: from mga14.intel.com ([143.182.124.37]:48949 "EHLO mga14.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752218Ab2A0Dki (ORCPT ); Thu, 26 Jan 2012 22:40:38 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.71,315,1320652800"; d="scan'208";a="100232694" Message-Id: <20120127031326.752229208@intel.com> User-Agent: quilt/0.48-1 Date: Fri, 27 Jan 2012 11:05:27 +0800 From: Wu Fengguang To: Andrew Morton cc: Andi Kleen , Jan Kara , Wu Fengguang cc: Linux Memory Management List , Cc: LKML Subject: [PATCH 3/9] readahead: tag mmap page fault call sites References: <20120127030524.854259561@intel.com> Content-Disposition: inline; filename=readahead-for-mmap Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Introduce a bit field ra->for_mmap for tagging mmap reads. The tag will be cleared immediately after submitting the IO.
Acked-by: Jan Kara Signed-off-by: Wu Fengguang --- include/linux/fs.h | 1 + mm/filemap.c | 6 +++++- mm/readahead.c | 1 + 3 files changed, 7 insertions(+), 1 deletion(-) --- linux-next.orig/include/linux/fs.h 2012-01-25 15:57:50.000000000 +0800 +++ linux-next/include/linux/fs.h 2012-01-25 15:57:51.000000000 +0800 @@ -954,6 +954,7 @@ struct file_ra_state { unsigned int ra_pages; /* Maximum readahead window */ u16 mmap_miss; /* Cache miss stat for mmap accesses */ u8 pattern; /* one of RA_PATTERN_* */ + unsigned int for_mmap:1; /* readahead for mmap accesses */ loff_t prev_pos; /* Cache last read() position */ }; --- linux-next.orig/mm/filemap.c 2012-01-25 15:57:50.000000000 +0800 +++ linux-next/mm/filemap.c 2012-01-25 15:57:51.000000000 +0800 @@ -1578,6 +1578,7 @@ static void do_sync_mmap_readahead(struc return; if (VM_SequentialReadHint(vma)) { + ra->for_mmap = 1; page_cache_sync_readahead(mapping, ra, file, offset, ra->ra_pages); return; @@ -1597,6 +1598,7 @@ static void do_sync_mmap_readahead(struc /* * mmap read-around */ + ra->for_mmap = 1; ra->pattern = RA_PATTERN_MMAP_AROUND; ra_pages = max_sane_readahead(ra->ra_pages); ra->start = max_t(long, 0, offset - ra_pages / 2); @@ -1622,9 +1624,11 @@ static void do_async_mmap_readahead(stru return; if (ra->mmap_miss > 0) ra->mmap_miss--; - if (PageReadahead(page)) + if (PageReadahead(page)) { + ra->for_mmap = 1; page_cache_async_readahead(mapping, ra, file, page, offset, ra->ra_pages); + } } /** --- linux-next.orig/mm/readahead.c 2012-01-25 15:57:50.000000000 +0800 +++ linux-next/mm/readahead.c 2012-01-25 15:57:51.000000000 +0800 @@ -259,6 +259,7 @@ unsigned long ra_submit(struct file_ra_s actual = __do_page_cache_readahead(mapping, filp, ra->start, ra->size, ra->async_size); + ra->for_mmap = 0; return actual; } From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754981Ab2A0Dlb (ORCPT ); Thu, 26 Jan 2012 22:41:31 -0500 Received: from 
mga14.intel.com ([143.182.124.37]:41153 "EHLO mga14.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752178Ab2A0Dki (ORCPT ); Thu, 26 Jan 2012 22:40:38 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.71,315,1320652800"; d="scan'208";a="100232691" Message-Id: <20120127031327.567686120@intel.com> User-Agent: quilt/0.48-1 Date: Fri, 27 Jan 2012 11:05:33 +0800 From: Wu Fengguang To: Andrew Morton cc: Andi Kleen , Jan Kara , Wu Fengguang cc: Linux Memory Management List , Cc: LKML Subject: [PATCH 9/9] readahead: snap readahead request to EOF References: <20120127030524.854259561@intel.com> Content-Disposition: inline; filename=readahead-eof Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org If the file size is 20kb and the readahead request is [0, 16kb), it's better to expand the readahead request to [0, 20kb), which will likely save one followup I/O for the ending [16kb, 20kb). If the readahead request already covers EOF, trim it down to EOF. Also don't set the PG_readahead mark, to avoid an unnecessary future invocation of the readahead code. This special handling looks worthwhile because small to medium sized files are pretty common.
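The 20kb example above can be worked through with a small userspace model of the snapping arithmetic. This is a sketch under stated assumptions, not the kernel function: snap_to_eof_model(), struct ra_model, min_pages() and MODEL_PAGE_SHIFT are hypothetical names, 4KB pages are assumed, the file size is passed in directly instead of being read with i_size_read(), and the pattern check that skips backwards and random reads is left out.

```c
#include <assert.h>

#define MODEL_PAGE_SHIFT 12	/* assumption: 4KB pages */

/* Hypothetical stand-in for the fields of file_ra_state used here (in pages). */
struct ra_model {
	unsigned long start;
	unsigned long size;
	unsigned long async_size;
	unsigned long ra_pages;	/* maximum readahead window */
};

static unsigned long min_pages(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/* Expand or trim the request so that it ends exactly at EOF when it is close. */
void snap_to_eof_model(struct ra_model *ra, unsigned long long file_size)
{
	unsigned long eof;
	unsigned long size;

	assert(file_size > 0);
	/* index of the first page past EOF */
	eof = (unsigned long)((file_size - 1) >> MODEL_PAGE_SHIFT) + 1;
	/* model-only precondition: the request starts inside the file */
	assert(ra->start < eof);

	size = ra->size;
	size += min_pages(size / 2, ra->ra_pages / 4);
	if (ra->start + size > eof) {
		ra->size = eof - ra->start;
		ra->async_size = 0;	/* no PG_readahead mark is needed */
	}
}
```

For a 20kb file, eof is page 5. A request of 4 pages at page 0 with ra_pages = 32 gets a slack of min(4 / 2, 32 / 4) = 2 pages, so 0 + 6 > 5 and the request becomes [0, 5 pages), that is [0, 20kb). A request that already covers EOF, such as 8 pages, is trimmed to the same 5 pages by the same branch.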
Acked-by: Jan Kara Signed-off-by: Wu Fengguang --- mm/readahead.c | 21 +++++++++++++++++++++ 1 file changed, 21 insertions(+) --- linux-next.orig/mm/readahead.c 2012-01-25 15:57:58.000000000 +0800 +++ linux-next/mm/readahead.c 2012-01-25 15:57:59.000000000 +0800 @@ -466,6 +466,25 @@ unsigned long max_sane_readahead(unsigne + node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2); } +static void snap_to_eof(struct file_ra_state *ra, struct address_space *mapping) +{ + pgoff_t eof = ((i_size_read(mapping->host)-1) >> PAGE_CACHE_SHIFT) + 1; + pgoff_t start = ra->start; + unsigned int size = ra->size; + + /* + * skip backwards and random reads + */ + if (ra->pattern > RA_PATTERN_MMAP_AROUND) + return; + + size += min(size / 2, ra->ra_pages / 4); + if (start + size > eof) { + ra->size = eof - start; + ra->async_size = 0; + } +} + /* * Submit IO for the read-ahead request in file_ra_state. */ @@ -477,6 +496,8 @@ unsigned long ra_submit(struct file_ra_s { int actual; + snap_to_eof(ra, mapping); + actual = __do_page_cache_readahead(mapping, filp, ra->start, ra->size, ra->async_size); From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754942Ab2A0Dl2 (ORCPT ); Thu, 26 Jan 2012 22:41:28 -0500 Received: from mga14.intel.com ([143.182.124.37]:41153 "EHLO mga14.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751728Ab2A0Dkj (ORCPT ); Thu, 26 Jan 2012 22:40:39 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.71,315,1320652800"; d="scan'208";a="100232701" Message-Id: <20120127031327.293145482@intel.com> User-Agent: quilt/0.48-1 Date: Fri, 27 Jan 2012 11:05:31 +0800 From: Wu Fengguang To: Andrew Morton cc: Andi Kleen , Li Shaohua , Jan Kara , Wu Fengguang cc: Linux Memory Management List , Cc: LKML Subject: [PATCH 7/9] readahead: basic support for backwards prefetching References: <20120127030524.854259561@intel.com> Content-Disposition: inline; filename=readahead-backwards.patch 
Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Add the backwards prefetching feature. It's pretty simple if we don't support async prefetching and interleaved reads. tail and tac are observed to have the reverse read pattern: tail-3501 [006] 111.881191: readahead: readahead-random(bdi=0:16, ino=1548450, req=750+1, ra=750+1-0, async=0) = 1 tail-3501 [006] 111.881506: readahead: readahead-backwards(bdi=0:16, ino=1548450, req=748+2, ra=746+5-0, async=0) = 4 tail-3501 [006] 111.882021: readahead: readahead-backwards(bdi=0:16, ino=1548450, req=744+2, ra=726+25-0, async=0) = 20 tail-3501 [006] 111.883713: readahead: readahead-backwards(bdi=0:16, ino=1548450, req=724+2, ra=626+125-0, async=0) = 100 tac-3528 [001] 118.671924: readahead: readahead-random(bdi=0:16, ino=1548445, req=750+1, ra=750+1-0, async=0) = 1 tac-3528 [001] 118.672371: readahead: readahead-backwards(bdi=0:16, ino=1548445, req=748+2, ra=746+5-0, async=0) = 4 tac-3528 [001] 118.673039: readahead: readahead-backwards(bdi=0:16, ino=1548445, req=744+2, ra=726+25-0, async=0) = 20 Here is the behavior with an 8-page read sequence from 10000 down to 0. (The readahead size is a bit large since it's an NFS mount.) 
readahead-random(dev=0:16, ino=3948605, req=10000+8, ra=10000+8-0, async=0) = 8 readahead-backwards(dev=0:16, ino=3948605, req=9992+8, ra=9968+32-0, async=0) = 32 readahead-backwards(dev=0:16, ino=3948605, req=9960+8, ra=9840+128-0, async=0) = 128 readahead-backwards(dev=0:16, ino=3948605, req=9832+8, ra=9584+256-0, async=0) = 256 readahead-backwards(dev=0:16, ino=3948605, req=9576+8, ra=9072+512-0, async=0) = 512 readahead-backwards(dev=0:16, ino=3948605, req=9064+8, ra=8048+1024-0, async=0) = 1024 readahead-backwards(dev=0:16, ino=3948605, req=8040+8, ra=6128+1920-0, async=0) = 1920 readahead-backwards(dev=0:16, ino=3948605, req=6120+8, ra=4208+1920-0, async=0) = 1920 readahead-backwards(dev=0:16, ino=3948605, req=4200+8, ra=2288+1920-0, async=0) = 1920 readahead-backwards(dev=0:16, ino=3948605, req=2280+8, ra=368+1920-0, async=0) = 1920 readahead-backwards(dev=0:16, ino=3948605, req=360+8, ra=0+368-0, async=0) = 368 And a simple 1-page read sequence from 10000 down to 0. readahead-random(dev=0:16, ino=3948605, req=10000+1, ra=10000+1-0, async=0) = 1 readahead-backwards(dev=0:16, ino=3948605, req=9999+1, ra=9996+4-0, async=0) = 4 readahead-backwards(dev=0:16, ino=3948605, req=9995+1, ra=9980+16-0, async=0) = 16 readahead-backwards(dev=0:16, ino=3948605, req=9979+1, ra=9916+64-0, async=0) = 64 readahead-backwards(dev=0:16, ino=3948605, req=9915+1, ra=9660+256-0, async=0) = 256 readahead-backwards(dev=0:16, ino=3948605, req=9659+1, ra=9148+512-0, async=0) = 512 readahead-backwards(dev=0:16, ino=3948605, req=9147+1, ra=8124+1024-0, async=0) = 1024 readahead-backwards(dev=0:16, ino=3948605, req=8123+1, ra=6204+1920-0, async=0) = 1920 readahead-backwards(dev=0:16, ino=3948605, req=6203+1, ra=4284+1920-0, async=0) = 1920 readahead-backwards(dev=0:16, ino=3948605, req=4283+1, ra=2364+1920-0, async=0) = 1920 readahead-backwards(dev=0:16, ino=3948605, req=2363+1, ra=444+1920-0, async=0) = 1920 readahead-backwards(dev=0:16, ino=3948605, req=443+1, ra=0+444-0, async=0) = 
444 CC: Andi Kleen CC: Li Shaohua Acked-by: Jan Kara Signed-off-by: Wu Fengguang --- include/linux/fs.h | 2 ++ include/trace/events/vfs.h | 1 + mm/readahead.c | 20 ++++++++++++++++++++ 3 files changed, 23 insertions(+) --- linux-next.orig/include/linux/fs.h 2012-01-25 15:57:52.000000000 +0800 +++ linux-next/include/linux/fs.h 2012-01-25 15:57:57.000000000 +0800 @@ -975,6 +975,7 @@ struct file_ra_state { * streams. * RA_PATTERN_MMAP_AROUND read-around on mmap page faults * (w/o any sequential/random hints) + * RA_PATTERN_BACKWARDS reverse reading detected * RA_PATTERN_FADVISE triggered by POSIX_FADV_WILLNEED or FMODE_RANDOM * RA_PATTERN_OVERSIZE a random read larger than max readahead size, * do max readahead to break down the read size @@ -985,6 +986,7 @@ enum readahead_pattern { RA_PATTERN_SUBSEQUENT, RA_PATTERN_CONTEXT, RA_PATTERN_MMAP_AROUND, + RA_PATTERN_BACKWARDS, RA_PATTERN_FADVISE, RA_PATTERN_OVERSIZE, RA_PATTERN_RANDOM, --- linux-next.orig/mm/readahead.c 2012-01-25 15:57:53.000000000 +0800 +++ linux-next/mm/readahead.c 2012-01-25 15:57:57.000000000 +0800 @@ -695,6 +695,26 @@ ondemand_readahead(struct address_space } /* + * backwards reading + */ + if (offset < ra->start && offset + req_size >= ra->start) { + ra->pattern = RA_PATTERN_BACKWARDS; + ra->size = get_next_ra_size(ra, max); + if (ra->size > ra->start) { + /* + * ra->start may be concurrently set to some huge + * value, the min() at least avoids submitting huge IO + * in this race condition + */ + ra->size = min(ra->start, max); + ra->start = 0; + } else + ra->start -= ra->size; + ra->async_size = 0; + goto readit; + } + + /* * Query the page cache and look for the traces(cached history pages) * that a sequential stream would leave behind. 
*/ --- linux-next.orig/include/trace/events/vfs.h 2012-01-25 15:57:52.000000000 +0800 +++ linux-next/include/trace/events/vfs.h 2012-01-25 15:57:57.000000000 +0800 @@ -14,6 +14,7 @@ { RA_PATTERN_SUBSEQUENT, "subsequent" }, \ { RA_PATTERN_CONTEXT, "context" }, \ { RA_PATTERN_MMAP_AROUND, "around" }, \ + { RA_PATTERN_BACKWARDS, "backwards" }, \ { RA_PATTERN_FADVISE, "fadvise" }, \ { RA_PATTERN_OVERSIZE, "oversize" }, \ { RA_PATTERN_RANDOM, "random" }, \ From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754929Ab2A0DlZ (ORCPT ); Thu, 26 Jan 2012 22:41:25 -0500 Received: from mga14.intel.com ([143.182.124.37]:48949 "EHLO mga14.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754155Ab2A0Dki (ORCPT ); Thu, 26 Jan 2012 22:40:38 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.71,315,1320652800"; d="scan'208";a="100232698" Message-Id: <20120127031327.430238053@intel.com> User-Agent: quilt/0.48-1 Date: Fri, 27 Jan 2012 11:05:32 +0800 From: Wu Fengguang To: Andrew Morton cc: Andi Kleen , Rik van Riel , Linus Torvalds , Wu Fengguang cc: Linux Memory Management List , Cc: LKML Subject: [PATCH 8/9] readahead: dont do start-of-file readahead after lseek() References: <20120127030524.854259561@intel.com> Content-Disposition: inline; filename=readahead-lseek.patch Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Some applications (eg. blkid, id3tool etc.) seek around the file to get information. For example, blkid does seek to 0 read 1024 seek to 1536 read 16384 The start-of-file readahead heuristic is wrong for them, whose access pattern can be identified by lseek() calls. So test-and-set a READAHEAD_LSEEK flag on lseek() and don't do start-of-file readahead on seeing it. Proposed by Linus. 
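The start-of-file decision described above can be sketched as a small userspace function. This is only a model of the check added to ondemand_readahead() in this patch: start_of_file_decision() and enum sof_decision are hypothetical names, and the lseek flag, request size and maximum window are plain parameters rather than fields of file_ra_state.

```c
#include <assert.h>

/* Hypothetical result type for the model below. */
enum sof_decision {
	SOF_INITIAL_READAHEAD,	/* open an initial readahead window */
	SOF_RANDOM_READ		/* read only what was asked for */
};

/*
 * Model of the check for a read at offset 0 (sizes in pages).
 * lseek_flag: nonzero if the read was preceded by an lseek()
 */
enum sof_decision start_of_file_decision(int lseek_flag,
					 unsigned long req_size,
					 unsigned long max)
{
	assert(max > 0);

	/* a small read right after lseek() is treated as a random read */
	if (lseek_flag && req_size < max)
		return SOF_RANDOM_READ;

	return SOF_INITIAL_READAHEAD;
}
```

In the blkid case, "seek to 0, read 1024" is a one-page request with the flag set, so it is handled as a random read. A plain read from offset 0 without a preceding lseek(), or a request of at least max pages, still takes the initial readahead path.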
Acked-by: Rik van Riel Acked-by: Linus Torvalds Signed-off-by: Wu Fengguang --- fs/read_write.c | 3 +++ include/linux/fs.h | 1 + mm/readahead.c | 4 ++++ 3 files changed, 8 insertions(+) --- linux-next.orig/mm/readahead.c 2012-01-25 15:57:57.000000000 +0800 +++ linux-next/mm/readahead.c 2012-01-25 15:57:58.000000000 +0800 @@ -485,6 +485,7 @@ unsigned long ra_submit(struct file_ra_s ra->pattern, ra->start, ra->size, ra->async_size, actual); + ra->lseek = 0; ra->for_mmap = 0; ra->for_metadata = 0; return actual; @@ -636,6 +637,8 @@ ondemand_readahead(struct address_space * start of file */ if (!offset) { + if (ra->lseek && req_size < max) + goto random_read; ra->pattern = RA_PATTERN_INITIAL; goto initial_readahead; } @@ -721,6 +724,7 @@ ondemand_readahead(struct address_space if (try_context_readahead(mapping, ra, offset, req_size, max)) goto readit; +random_read: /* * standalone, small random read */ --- linux-next.orig/fs/read_write.c 2012-01-25 15:57:46.000000000 +0800 +++ linux-next/fs/read_write.c 2012-01-25 15:57:58.000000000 +0800 @@ -47,6 +47,9 @@ static loff_t lseek_execute(struct file file->f_pos = offset; file->f_version = 0; } + + file->f_ra.lseek = 1; + return offset; } --- linux-next.orig/include/linux/fs.h 2012-01-25 15:57:57.000000000 +0800 +++ linux-next/include/linux/fs.h 2012-01-25 15:57:58.000000000 +0800 @@ -956,6 +956,7 @@ struct file_ra_state { u8 pattern; /* one of RA_PATTERN_* */ unsigned int for_mmap:1; /* readahead for mmap accesses */ unsigned int for_metadata:1; /* readahead for meta data */ + unsigned int lseek:1; /* this read has a leading lseek */ loff_t prev_pos; /* Cache last read() position */ }; From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754914Ab2A0DlY (ORCPT ); Thu, 26 Jan 2012 22:41:24 -0500 Received: from mga14.intel.com ([143.182.124.37]:41153 "EHLO mga14.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753619Ab2A0Dki 
(ORCPT ); Thu, 26 Jan 2012 22:40:38 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.71,315,1320652800"; d="scan'208";a="100232696" Message-Id: <20120127031326.881533433@intel.com> User-Agent: quilt/0.48-1 Date: Fri, 27 Jan 2012 11:05:28 +0800 From: Wu Fengguang To: Andrew Morton cc: Andi Kleen , Jan Kara , Wu Fengguang cc: Linux Memory Management List , Cc: LKML Subject: [PATCH 4/9] readahead: tag metadata call sites References: <20120127030524.854259561@intel.com> Content-Disposition: inline; filename=readahead-for-metadata Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org We may be doing more metadata readahead in future. Acked-by: Jan Kara Signed-off-by: Wu Fengguang --- fs/ext3/dir.c | 1 + fs/ext4/dir.c | 1 + include/linux/fs.h | 1 + mm/readahead.c | 1 + 4 files changed, 4 insertions(+) --- linux-next.orig/fs/ext3/dir.c 2012-01-25 15:57:46.000000000 +0800 +++ linux-next/fs/ext3/dir.c 2012-01-25 15:57:52.000000000 +0800 @@ -136,6 +136,7 @@ static int ext3_readdir(struct file * fi pgoff_t index = map_bh.b_blocknr >> (PAGE_CACHE_SHIFT - inode->i_blkbits); if (!ra_has_index(&filp->f_ra, index)) + filp->f_ra.for_metadata = 1; page_cache_sync_readahead( sb->s_bdev->bd_inode->i_mapping, &filp->f_ra, filp, --- linux-next.orig/fs/ext4/dir.c 2012-01-25 15:57:46.000000000 +0800 +++ linux-next/fs/ext4/dir.c 2012-01-25 15:57:52.000000000 +0800 @@ -153,6 +153,7 @@ static int ext4_readdir(struct file *fil pgoff_t index = map.m_pblk >> (PAGE_CACHE_SHIFT - inode->i_blkbits); if (!ra_has_index(&filp->f_ra, index)) + filp->f_ra.for_metadata = 1; page_cache_sync_readahead( sb->s_bdev->bd_inode->i_mapping, &filp->f_ra, filp, --- linux-next.orig/include/linux/fs.h 2012-01-25 15:57:51.000000000 +0800 +++ linux-next/include/linux/fs.h 2012-01-25 15:57:52.000000000 +0800 @@ -955,6 +955,7 @@ struct file_ra_state { u16 mmap_miss; /* Cache miss stat for mmap accesses */ u8 pattern; /* one of RA_PATTERN_* */ unsigned int for_mmap:1; /* 
readahead for mmap accesses */ + unsigned int for_metadata:1; /* readahead for meta data */ loff_t prev_pos; /* Cache last read() position */ }; --- linux-next.orig/mm/readahead.c 2012-01-25 15:57:51.000000000 +0800 +++ linux-next/mm/readahead.c 2012-01-25 15:57:52.000000000 +0800 @@ -260,6 +260,7 @@ unsigned long ra_submit(struct file_ra_s ra->start, ra->size, ra->async_size); ra->for_mmap = 0; + ra->for_metadata = 0; return actual; } From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753408Ab2A0DlW (ORCPT ); Thu, 26 Jan 2012 22:41:22 -0500 Received: from mga14.intel.com ([143.182.124.37]:45958 "EHLO mga14.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754738Ab2A0Dkl (ORCPT ); Thu, 26 Jan 2012 22:40:41 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.71,315,1320652800"; d="scan'208";a="100232713" Message-Id: <20120127031326.619964905@intel.com> User-Agent: quilt/0.48-1 Date: Fri, 27 Jan 2012 11:05:26 +0800 From: Wu Fengguang To: Andrew Morton cc: Andi Kleen , Ingo Molnar , Jens Axboe , Peter Zijlstra , Jan Kara , Rik van Riel , Wu Fengguang cc: Linux Memory Management List , Cc: LKML Subject: [PATCH 2/9] readahead: record readahead patterns References: <20120127030524.854259561@intel.com> Content-Disposition: inline; filename=readahead-tracepoints.patch Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Record the readahead pattern in ra->pattern and extend ra_submit() parameters, to be used by the next readahead tracing/stats patches. 

7 patterns are defined:

	pattern                 readahead for
	-----------------------------------------------------------
	RA_PATTERN_INITIAL      start-of-file read
	RA_PATTERN_SUBSEQUENT   trivial sequential read
	RA_PATTERN_CONTEXT      interleaved sequential read
	RA_PATTERN_OVERSIZE     oversize read
	RA_PATTERN_MMAP_AROUND  mmap fault
	RA_PATTERN_FADVISE      posix_fadvise()
	RA_PATTERN_RANDOM       random read

Note that random reads will be recorded in file_ra_state now.
This won't deteriorate cache bouncing because the ra->prev_pos update
in do_generic_file_read() already pollutes the data cache, and
filemap_fault() will stop calling into us after MMAP_LOTSAMISS.

CC: Ingo Molnar
CC: Jens Axboe
CC: Peter Zijlstra
Acked-by: Jan Kara
Acked-by: Rik van Riel
Signed-off-by: Wu Fengguang
---
 include/linux/fs.h |   36 +++++++++++++++++++++++++++++++++++-
 include/linux/mm.h |    4 +++-
 mm/filemap.c       |    3 ++-
 mm/readahead.c     |   29 ++++++++++++++++++++++-------
 4 files changed, 62 insertions(+), 10 deletions(-)

--- linux-next.orig/include/linux/fs.h	2012-01-25 15:57:47.000000000 +0800
+++ linux-next/include/linux/fs.h	2012-01-25 15:57:50.000000000 +0800
@@ -952,11 +952,45 @@ struct file_ra_state {
 					   there are only # of pages ahead */

 	unsigned int ra_pages;		/* Maximum readahead window */
-	unsigned int mmap_miss;		/* Cache miss stat for mmap accesses */
+	u16 mmap_miss;			/* Cache miss stat for mmap accesses */
+	u8 pattern;			/* one of RA_PATTERN_* */
+
 	loff_t prev_pos;		/* Cache last read() position */
 };

 /*
+ * Which policy makes decision to do the current read-ahead IO?
+ *
+ * RA_PATTERN_INITIAL		readahead window is initially opened,
+ *				normally when reading from start of file
+ * RA_PATTERN_SUBSEQUENT	readahead window is pushed forward
+ * RA_PATTERN_CONTEXT		no readahead window available, querying the
+ *				page cache to decide readahead start/size.
+ *				This typically happens on interleaved reads (eg.
+ *				reading pages 0, 1000, 1, 1001, 2, 1002, ...)
+ *				where one file_ra_state struct is not enough
+ *				for recording 2+ interleaved sequential read
+ *				streams.
+ * RA_PATTERN_MMAP_AROUND	read-around on mmap page faults
+ *				(w/o any sequential/random hints)
+ * RA_PATTERN_FADVISE		triggered by POSIX_FADV_WILLNEED or FMODE_RANDOM
+ * RA_PATTERN_OVERSIZE		a random read larger than max readahead size,
+ *				do max readahead to break down the read size
+ * RA_PATTERN_RANDOM		a small random read
+ */
+enum readahead_pattern {
+	RA_PATTERN_INITIAL,
+	RA_PATTERN_SUBSEQUENT,
+	RA_PATTERN_CONTEXT,
+	RA_PATTERN_MMAP_AROUND,
+	RA_PATTERN_FADVISE,
+	RA_PATTERN_OVERSIZE,
+	RA_PATTERN_RANDOM,
+	RA_PATTERN_ALL,		/* for summary stats */
+	RA_PATTERN_MAX
+};
+
+/*
  * Check if @index falls in the readahead windows.
  */
 static inline int ra_has_index(struct file_ra_state *ra, pgoff_t index)
--- linux-next.orig/mm/readahead.c	2012-01-25 15:57:49.000000000 +0800
+++ linux-next/mm/readahead.c	2012-01-25 15:57:50.000000000 +0800
@@ -249,7 +249,10 @@ unsigned long max_sane_readahead(unsigne
  * Submit IO for the read-ahead request in file_ra_state.
  */
 unsigned long ra_submit(struct file_ra_state *ra,
-		       struct address_space *mapping, struct file *filp)
+			struct address_space *mapping,
+			struct file *filp,
+			pgoff_t offset,
+			unsigned long req_size)
 {
 	int actual;

@@ -382,6 +385,7 @@ static int try_context_readahead(struct
 	if (size >= offset)
 		size *= 2;

+	ra->pattern = RA_PATTERN_CONTEXT;
 	ra->start = offset;
 	ra->size = min(size + req_size, max);
 	ra->async_size = 1;
@@ -403,8 +407,10 @@ ondemand_readahead(struct address_space
 	/*
 	 * start of file
 	 */
-	if (!offset)
+	if (!offset) {
+		ra->pattern = RA_PATTERN_INITIAL;
 		goto initial_readahead;
+	}

 	/*
 	 * It's the expected callback offset, assume sequential access.
@@ -412,6 +418,7 @@ ondemand_readahead(struct address_space
 	 */
 	if ((offset == (ra->start + ra->size - ra->async_size) ||
 	     offset == (ra->start + ra->size))) {
+		ra->pattern = RA_PATTERN_SUBSEQUENT;
 		ra->start += ra->size;
 		ra->size = get_next_ra_size(ra, max);
 		ra->async_size = ra->size;
@@ -434,6 +441,7 @@ ondemand_readahead(struct address_space
 		if (!start || start - offset > max)
 			return 0;

+		ra->pattern = RA_PATTERN_CONTEXT;
 		ra->start = start;
 		ra->size = start - offset;	/* old async_size */
 		ra->size += req_size;
@@ -445,14 +453,18 @@ ondemand_readahead(struct address_space
 	/*
 	 * oversize read
 	 */
-	if (req_size > max)
+	if (req_size > max) {
+		ra->pattern = RA_PATTERN_OVERSIZE;
 		goto initial_readahead;
+	}

 	/*
 	 * sequential cache miss
 	 */
-	if (offset - (ra->prev_pos >> PAGE_CACHE_SHIFT) <= 1UL)
+	if (offset - (ra->prev_pos >> PAGE_CACHE_SHIFT) <= 1UL) {
+		ra->pattern = RA_PATTERN_INITIAL;
 		goto initial_readahead;
+	}

 	/*
 	 * Query the page cache and look for the traces(cached history pages)
@@ -463,9 +475,12 @@ ondemand_readahead(struct address_space

 	/*
 	 * standalone, small random read
-	 * Read as is, and do not pollute the readahead state.
 	 */
-	return __do_page_cache_readahead(mapping, filp, offset, req_size, 0);
+	ra->pattern = RA_PATTERN_RANDOM;
+	ra->start = offset;
+	ra->size = req_size;
+	ra->async_size = 0;
+	goto readit;

 initial_readahead:
 	ra->start = offset;
@@ -483,7 +498,7 @@ readit:
 		ra->size += ra->async_size;
 	}

-	return ra_submit(ra, mapping, filp);
+	return ra_submit(ra, mapping, filp, offset, req_size);
 }

 /**
--- linux-next.orig/include/linux/mm.h	2012-01-25 15:57:47.000000000 +0800
+++ linux-next/include/linux/mm.h	2012-01-25 15:57:50.000000000 +0800
@@ -1448,7 +1448,9 @@ void page_cache_async_readahead(struct a
 unsigned long max_sane_readahead(unsigned long nr);
 unsigned long ra_submit(struct file_ra_state *ra,
 			struct address_space *mapping,
-			struct file *filp);
+			struct file *filp,
+			pgoff_t offset,
+			unsigned long req_size);

 /* Generic expand stack which grows the stack according to GROWS{UP,DOWN} */
 extern int expand_stack(struct vm_area_struct *vma, unsigned long address);
--- linux-next.orig/mm/filemap.c	2012-01-25 15:57:47.000000000 +0800
+++ linux-next/mm/filemap.c	2012-01-25 15:57:50.000000000 +0800
@@ -1597,11 +1597,12 @@ static void do_sync_mmap_readahead(struc
 	/*
 	 * mmap read-around
 	 */
+	ra->pattern = RA_PATTERN_MMAP_AROUND;
 	ra_pages = max_sane_readahead(ra->ra_pages);
 	ra->start = max_t(long, 0, offset - ra_pages / 2);
 	ra->size = ra_pages;
 	ra->async_size = ra_pages / 4;
-	ra_submit(ra, mapping, file);
+	ra_submit(ra, mapping, file, offset, 1);
 }

 /*

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1754895Ab2A0DlT (ORCPT ); Thu, 26 Jan 2012 22:41:19 -0500
Received: from mga14.intel.com ([143.182.124.37]:30029 "EHLO
	mga14.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1754741Ab2A0Dkm (ORCPT );
	Thu, 26 Jan 2012 22:40:42 -0500
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="4.71,315,1320652800"; d="scan'208";a="100232717"
Message-Id: <20120127031327.020100004@intel.com>
User-Agent:
	quilt/0.48-1
Date: Fri, 27 Jan 2012 11:05:29 +0800
From: Wu Fengguang
To: Andrew Morton
cc: Andi Kleen , Ingo Molnar , Jens Axboe , Peter Zijlstra , Jan Kara , Rik van Riel , Steven Rostedt , Wu Fengguang
cc: Linux Memory Management List ,
Cc: LKML
Subject: [PATCH 5/9] readahead: add vfs/readahead tracing event
References: <20120127030524.854259561@intel.com>
Content-Disposition: inline; filename=readahead-tracer.patch
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

This is very useful for verifying whether the readahead algorithms are
working as expected.

Example output:

# echo 1 > /debug/tracing/events/vfs/readahead/enable
# cp test-file /dev/null
# cat /debug/tracing/trace  # trimmed output
pattern=initial bdi=0:16 ino=100177 req=0+2 ra=0+4-2 async=0 actual=4
pattern=subsequent bdi=0:16 ino=100177 req=2+2 ra=4+8-8 async=1 actual=8
pattern=subsequent bdi=0:16 ino=100177 req=4+2 ra=12+16-16 async=1 actual=16
pattern=subsequent bdi=0:16 ino=100177 req=12+2 ra=28+32-32 async=1 actual=32
pattern=subsequent bdi=0:16 ino=100177 req=28+2 ra=60+60-60 async=1 actual=24
pattern=subsequent bdi=0:16 ino=100177 req=60+2 ra=120+60-60 async=1 actual=0

CC: Ingo Molnar
CC: Jens Axboe
CC: Peter Zijlstra
Acked-by: Jan Kara
Acked-by: Rik van Riel
Acked-by: Steven Rostedt
Signed-off-by: Wu Fengguang
---
 fs/Makefile                |    1
 fs/trace.c                 |    2
 include/trace/events/vfs.h |   77 +++++++++++++++++++++++++++++++++++
 mm/readahead.c             |   24 ++++++++++
 4 files changed, 104 insertions(+)

--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-next/include/trace/events/vfs.h	2012-01-25 15:57:52.000000000 +0800
@@ -0,0 +1,77 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM vfs
+
+#if !defined(_TRACE_VFS_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_VFS_H
+
+#include
+#include
+#include
+#include
+
+#define READAHEAD_PATTERNS						   \
+			{ RA_PATTERN_INITIAL,		"initial"	}, \
+			{ RA_PATTERN_SUBSEQUENT,	"subsequent"	}, \
+			{ RA_PATTERN_CONTEXT,		"context"	}, \
+			{ RA_PATTERN_MMAP_AROUND,	"around"	}, \
+			{ RA_PATTERN_FADVISE,		"fadvise"	}, \
+			{ RA_PATTERN_OVERSIZE,		"oversize"	}, \
+			{ RA_PATTERN_RANDOM,		"random"	}, \
+			{ RA_PATTERN_ALL,		"all"		}
+
+TRACE_EVENT(readahead,
+	TP_PROTO(struct address_space *mapping,
+		 pgoff_t offset,
+		 unsigned long req_size,
+		 enum readahead_pattern pattern,
+		 pgoff_t start,
+		 unsigned long size,
+		 unsigned long async_size,
+		 unsigned int actual),
+
+	TP_ARGS(mapping, offset, req_size, pattern, start, size, async_size,
+		actual),
+
+	TP_STRUCT__entry(
+		__array(char,		bdi, 32)
+		__field(ino_t,		ino)
+		__field(pgoff_t,	offset)
+		__field(unsigned long,	req_size)
+		__field(unsigned int,	pattern)
+		__field(pgoff_t,	start)
+		__field(unsigned int,	size)
+		__field(unsigned int,	async_size)
+		__field(unsigned int,	actual)
+	),
+
+	TP_fast_assign(
+		strncpy(__entry->bdi,
+			dev_name(mapping->backing_dev_info->dev), 32);
+		__entry->ino		= mapping->host->i_ino;
+		__entry->offset		= offset;
+		__entry->req_size	= req_size;
+		__entry->pattern	= pattern;
+		__entry->start		= start;
+		__entry->size		= size;
+		__entry->async_size	= async_size;
+		__entry->actual		= actual;
+	),
+
+	TP_printk("pattern=%s bdi=%s ino=%lu "
+		  "req=%lu+%lu ra=%lu+%d-%d async=%d actual=%d",
+			__print_symbolic(__entry->pattern, READAHEAD_PATTERNS),
+			__entry->bdi,
+			__entry->ino,
+			__entry->offset,
+			__entry->req_size,
+			__entry->start,
+			__entry->size,
+			__entry->async_size,
+			__entry->start > __entry->offset,
+			__entry->actual)
+);
+
+#endif /* _TRACE_VFS_H */
+
+/* This part must be outside protection */
+#include
--- linux-next.orig/mm/readahead.c	2012-01-25 15:57:52.000000000 +0800
+++ linux-next/mm/readahead.c	2012-01-25 15:57:52.000000000 +0800
@@ -17,6 +17,7 @@
 #include
 #include
 #include
+#include

 /*
  * Initialise a struct file's readahead state.  Assumes that the caller has
@@ -32,6 +33,21 @@ EXPORT_SYMBOL_GPL(file_ra_state_init);

 #define list_to_page(head) (list_entry((head)->prev, struct page, lru))

+static inline void readahead_event(struct address_space *mapping,
+				   pgoff_t offset,
+				   unsigned long req_size,
+				   bool for_mmap,
+				   bool for_metadata,
+				   enum readahead_pattern pattern,
+				   pgoff_t start,
+				   unsigned long size,
+				   unsigned long async_size,
+				   int actual)
+{
+	trace_readahead(mapping, offset, req_size,
+			pattern, start, size, async_size, actual);
+}
+
 /*
  * see if a page needs releasing upon read_cache_pages() failure
  * - the caller of read_cache_pages() may have set PG_private or PG_fscache
@@ -228,6 +244,9 @@ int force_page_cache_readahead(struct ad
 			ret = err;
 			break;
 		}
+		readahead_event(mapping, offset, nr_to_read, 0, 0,
+				RA_PATTERN_FADVISE, offset, this_chunk, 0,
+				err);
 		ret += err;
 		offset += this_chunk;
 		nr_to_read -= this_chunk;
@@ -259,6 +278,11 @@ unsigned long ra_submit(struct file_ra_s
 	actual = __do_page_cache_readahead(mapping, filp,
 					ra->start, ra->size, ra->async_size);

+	readahead_event(mapping, offset, req_size,
+			ra->for_mmap, ra->for_metadata,
+			ra->pattern, ra->start, ra->size, ra->async_size,
+			actual);
+
 	ra->for_mmap = 0;
 	ra->for_metadata = 0;
 	return actual;
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-next/fs/trace.c	2012-01-25 15:57:52.000000000 +0800
@@ -0,0 +1,2 @@
+#define CREATE_TRACE_POINTS
+#include
--- linux-next.orig/fs/Makefile	2012-01-25 15:57:46.000000000 +0800
+++ linux-next/fs/Makefile	2012-01-25 15:57:52.000000000 +0800
@@ -50,6 +50,7 @@ obj-$(CONFIG_NFS_COMMON)	+= nfs_common/
 obj-$(CONFIG_GENERIC_ACL)	+= generic_acl.o

 obj-$(CONFIG_FHANDLE)		+= fhandle.o
+obj-$(CONFIG_TRACEPOINTS)	+= trace.o

 obj-y				+= quota/

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1754871Ab2A0DlQ (ORCPT ); Thu, 26 Jan 2012 22:41:16 -0500
Received: from mga14.intel.com ([143.182.124.37]:45958 "EHLO
	mga14.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1754730Ab2A0Dkk (ORCPT );
	Thu, 26 Jan 2012 22:40:40 -0500
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="4.71,315,1320652800"; d="scan'208";a="100232708"
Message-Id: <20120127031327.159293683@intel.com>
User-Agent: quilt/0.48-1
Date: Fri, 27 Jan 2012 11:05:30 +0800
From: Wu Fengguang
To: Andrew Morton
cc: Andi Kleen , Ingo Molnar , Jens Axboe , Peter Zijlstra , Rik van Riel , Wu Fengguang
cc: Linux Memory Management List ,
Cc: LKML
Subject: [PATCH 6/9] readahead: add /debug/readahead/stats
References: <20120127030524.854259561@intel.com>
Content-Disposition: inline; filename=readahead-stats.patch
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

The accounting code will be compiled in by default
(CONFIG_READAHEAD_STATS=y), and will remain inactive by default.

It can be runtime enabled/disabled through the debugfs interface

	echo 1 > /debug/readahead/stats_enable
	echo 0 > /debug/readahead/stats_enable

Example output:
(taken from a fresh booted NFS-ROOT console box with rsize=524288)

$ cat /debug/readahead/stats
pattern     readahead    eof_hit  cache_hit         io    sync_io    mmap_io    meta_io       size async_size    io_size
initial           702        511          0        692        692          0          0          2          0          2
subsequent          7          0          1          7          1          1          0         23         22         23
context           160        161          0          2          0          1          0          0          0         16
around            184        184        177        184        184        184          0         58          0         53
backwards           2          0          2          2          2          0          0          4          0          3
fadvise          2593         47          8       2588       2588          0          0          1          0          1
oversize            0          0          0          0          0          0          0          0          0          0
random             45         20          0         44         44          0          0          1          0          1
all              3697        923        188       3519       3511        186          0          4          0          4

The two most important columns are
- io		number of readahead IO
- io_size	average readahead IO size

CC: Ingo Molnar
CC: Jens Axboe
CC: Peter Zijlstra
Acked-by: Rik van Riel
Signed-off-by: Wu Fengguang
---
 mm/Kconfig     |   15 +++
 mm/readahead.c |  202 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 217 insertions(+)

--- linux-next.orig/mm/readahead.c	2012-01-25 15:57:52.000000000 +0800
+++ linux-next/mm/readahead.c	2012-01-25 15:57:53.000000000 +0800
@@ -33,6 +33,202 @@ EXPORT_SYMBOL_GPL(file_ra_state_init);

 #define list_to_page(head) (list_entry((head)->prev, struct page, lru))

+#ifdef CONFIG_READAHEAD_STATS
+#include
+#include
+#include
+
+static u32 readahead_stats_enable __read_mostly;
+
+static const struct trace_print_flags ra_pattern_names[] = {
+	READAHEAD_PATTERNS
+};
+
+enum ra_account {
+	/* number of readaheads */
+	RA_ACCOUNT_COUNT,	/* readahead request */
+	RA_ACCOUNT_EOF,		/* readahead request covers EOF */
+	RA_ACCOUNT_CACHE_HIT,	/* readahead request covers some cached pages */
+	RA_ACCOUNT_IOCOUNT,	/* readahead IO */
+	RA_ACCOUNT_SYNC,	/* readahead IO that is synchronous */
+	RA_ACCOUNT_MMAP,	/* readahead IO by mmap page faults */
+	RA_ACCOUNT_METADATA,	/* readahead IO on metadata */
+	/* number of readahead pages */
+	RA_ACCOUNT_SIZE,	/* readahead size */
+	RA_ACCOUNT_ASYNC_SIZE,	/* readahead async size */
+	RA_ACCOUNT_ACTUAL,	/* readahead actual IO size */
+	/* end mark */
+	RA_ACCOUNT_MAX,
+};
+
+#define RA_STAT_BATCH	(INT_MAX / 2)
+static struct percpu_counter ra_stat[RA_PATTERN_ALL][RA_ACCOUNT_MAX];
+
+static inline void add_ra_stat(int i, int j, s64 amount)
+{
+	__percpu_counter_add(&ra_stat[i][j], amount, RA_STAT_BATCH);
+}
+
+static inline void inc_ra_stat(int i, int j)
+{
+	add_ra_stat(i, j, 1);
+}
+
+static void readahead_stats(struct address_space *mapping,
+			    pgoff_t offset,
+			    unsigned long req_size,
+			    bool for_mmap,
+			    bool for_metadata,
+			    enum readahead_pattern pattern,
+			    pgoff_t start,
+			    unsigned long size,
+			    unsigned long async_size,
+			    int actual)
+{
+	pgoff_t eof = ((i_size_read(mapping->host)-1) >> PAGE_CACHE_SHIFT) + 1;
+
+	inc_ra_stat(pattern, RA_ACCOUNT_COUNT);
+	add_ra_stat(pattern, RA_ACCOUNT_SIZE, size);
+	add_ra_stat(pattern, RA_ACCOUNT_ASYNC_SIZE, async_size);
+	add_ra_stat(pattern, RA_ACCOUNT_ACTUAL, actual);
+
+	if (start + size >= eof)
+		inc_ra_stat(pattern, RA_ACCOUNT_EOF);
+	if (actual < size)
+		inc_ra_stat(pattern, RA_ACCOUNT_CACHE_HIT);
+
+	if (actual) {
+		inc_ra_stat(pattern, RA_ACCOUNT_IOCOUNT);
+
+		if (start <= offset && offset < start + size)
+			inc_ra_stat(pattern, RA_ACCOUNT_SYNC);
+
+		if (for_mmap)
+			inc_ra_stat(pattern, RA_ACCOUNT_MMAP);
+		if (for_metadata)
+			inc_ra_stat(pattern, RA_ACCOUNT_METADATA);
+	}
+}
+
+static void readahead_stats_reset(void)
+{
+	int i, j;
+
+	for (i = 0; i < RA_PATTERN_ALL; i++)
+		for (j = 0; j < RA_ACCOUNT_MAX; j++)
+			percpu_counter_set(&ra_stat[i][j], 0);
+}
+
+static void
+readahead_stats_sum(long long ra_stats[RA_PATTERN_MAX][RA_ACCOUNT_MAX])
+{
+	int i, j;
+
+	for (i = 0; i < RA_PATTERN_ALL; i++)
+		for (j = 0; j < RA_ACCOUNT_MAX; j++) {
+			s64 n = percpu_counter_sum(&ra_stat[i][j]);
+			ra_stats[i][j] += n;
+			ra_stats[RA_PATTERN_ALL][j] += n;
+		}
+}
+
+static int readahead_stats_show(struct seq_file *s, void *_)
+{
+	long long ra_stats[RA_PATTERN_MAX][RA_ACCOUNT_MAX];
+	int i;
+
+	seq_printf(s,
+		   "%-10s %10s %10s %10s %10s %10s %10s %10s %10s %10s %10s\n",
+		   "pattern", "readahead", "eof_hit", "cache_hit",
+		   "io", "sync_io", "mmap_io", "meta_io",
+		   "size", "async_size", "io_size");
+
+	memset(ra_stats, 0, sizeof(ra_stats));
+	readahead_stats_sum(ra_stats);
+
+	for (i = 0; i < RA_PATTERN_MAX; i++) {
+		unsigned long count = ra_stats[i][RA_ACCOUNT_COUNT];
+		unsigned long iocount = ra_stats[i][RA_ACCOUNT_IOCOUNT];
+		/*
+		 * avoid division-by-zero
+		 */
+		if (count == 0)
+			count = 1;
+		if (iocount == 0)
+			iocount = 1;
+
+		seq_printf(s, "%-10s %10lld %10lld %10lld %10lld %10lld "
+			   "%10lld %10lld %10lld %10lld %10lld\n",
+				ra_pattern_names[i].name,
+				ra_stats[i][RA_ACCOUNT_COUNT],
+				ra_stats[i][RA_ACCOUNT_EOF],
+				ra_stats[i][RA_ACCOUNT_CACHE_HIT],
+				ra_stats[i][RA_ACCOUNT_IOCOUNT],
+				ra_stats[i][RA_ACCOUNT_SYNC],
+				ra_stats[i][RA_ACCOUNT_MMAP],
+				ra_stats[i][RA_ACCOUNT_METADATA],
+				ra_stats[i][RA_ACCOUNT_SIZE] / count,
+				ra_stats[i][RA_ACCOUNT_ASYNC_SIZE] / count,
+				ra_stats[i][RA_ACCOUNT_ACTUAL] / iocount);
+	}
+
+	return 0;
+}
+
+static int readahead_stats_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, readahead_stats_show, NULL);
+}
+
+static ssize_t readahead_stats_write(struct file *file, const char __user *buf,
+				     size_t size, loff_t *offset)
+{
+	readahead_stats_reset();
+	return size;
+}
+
+static const struct file_operations readahead_stats_fops = {
+	.owner		= THIS_MODULE,
+	.open		= readahead_stats_open,
+	.write		= readahead_stats_write,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+static int __init readahead_create_debugfs(void)
+{
+	struct dentry *root;
+	struct dentry *entry;
+	int i, j;
+
+	root = debugfs_create_dir("readahead", NULL);
+	if (!root)
+		goto out;
+
+	entry = debugfs_create_file("stats", 0644, root,
+				    NULL, &readahead_stats_fops);
+	if (!entry)
+		goto out;
+
+	entry = debugfs_create_bool("stats_enable", 0644, root,
+				    &readahead_stats_enable);
+	if (!entry)
+		goto out;
+
+	for (i = 0; i < RA_PATTERN_ALL; i++)
+		for (j = 0; j < RA_ACCOUNT_MAX; j++)
+			percpu_counter_init(&ra_stat[i][j], 0);
+
+	return 0;
+out:
+	printk(KERN_ERR "readahead: failed to create debugfs entries\n");
+	return -ENOMEM;
+}
+
+late_initcall(readahead_create_debugfs);
+#endif
+
 static inline void readahead_event(struct address_space *mapping,
 				   pgoff_t offset,
 				   unsigned long req_size,
@@ -44,6 +240,12 @@ static inline void readahead_event(struc
 				   unsigned long async_size,
 				   int actual)
 {
+#ifdef CONFIG_READAHEAD_STATS
+	if (readahead_stats_enable)
+		readahead_stats(mapping, offset, req_size,
+				for_mmap, for_metadata,
+				pattern, start, size, async_size, actual);
+#endif
 	trace_readahead(mapping, offset, req_size,
 			pattern, start, size, async_size, actual);
 }
--- linux-next.orig/mm/Kconfig	2012-01-25 15:57:46.000000000 +0800
+++ linux-next/mm/Kconfig	2012-01-25 15:57:53.000000000 +0800
@@ -379,3 +379,18 @@ config CLEANCACHE
 	  in a negligible performance hit.

 	  If unsure, say Y to enable cleancache
+
+config READAHEAD_STATS
+	bool "Collect page cache readahead stats"
+	depends on DEBUG_FS
+	default n
+	help
+	  This provides the readahead events accounting facilities.
+
+	  To do readahead accounting for a workload:
+
+	  echo 1 > /sys/kernel/debug/readahead/stats_enable
+	  echo 0 > /sys/kernel/debug/readahead/stats  # reset counters
+	  # run the workload
+	  cat /sys/kernel/debug/readahead/stats       # check counters
+	  echo 0 > /sys/kernel/debug/readahead/stats_enable

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1756101Ab2A0QVm (ORCPT ); Fri, 27 Jan 2012 11:21:42 -0500
Received: from smtp104.prem.mail.ac4.yahoo.com ([76.13.13.43]:25232 "HELO
	smtp104.prem.mail.ac4.yahoo.com" rhost-flags-OK-OK-OK-OK)
	by vger.kernel.org with SMTP id S1752688Ab2A0QVl (ORCPT );
	Fri, 27 Jan 2012 11:21:41 -0500
X-Yahoo-Newman-Property: ymail-3
X-YMail-OSG: jtB3BbYVM1lMlaiP8qy4XeWSoWnTeghY14y9IYEvrPn601d
 EhBmg5rYL.hipG9mvqH.YX_IwoO9uIkh74XjUGjr8Y2Chzaiod9G0t1r_AVK
 GafxPogzxyo3GXOTGefFVcg5jhm0jYLuZNrzEap9H72y.QUQKVEtr.WHNBZv
 hU.qT91T58.a_3FDTaZ_AQZO98hNt5bCXiQbySCUTJkfPrwPgduFcOw0gsen
 fkypzzaealUfQIAI2TequrPITXMi4jBgiwG1rTON4GzmTy4tUHxJuyktvyWw
 ibH8QHTsiongE0BF.ONo.ey885qLXMruMSxz5KV7csK28uQGcV5Ds6tpSvIA
 tj0HrRzBe2stJts9fp6Uye3i2Xk2lbtak7MnKPCSECw5jKYZxJoDux37nWNp
 E
X-Yahoo-SMTP: _Dag8S.swBC1p4FJKLCXbs8NQzyse1SYSgnAbY0-
Date: Fri, 27 Jan 2012 10:21:36 -0600 (CST)
From: Christoph Lameter
X-X-Sender: cl@router.home
To: Wu Fengguang
cc: Andrew Morton , Andi Kleen , Ingo Molnar , Jens Axboe , Peter Zijlstra , Rik van Riel , Linux Memory Management List , linux-fsdevel@vger.kernel.org, LKML
Subject: Re: [PATCH 6/9] readahead: add /debug/readahead/stats
In-Reply-To: <20120127031327.159293683@intel.com>
Message-ID:
References: <20120127030524.854259561@intel.com> <20120127031327.159293683@intel.com>
User-Agent: Alpine 2.00 (DEB 1167 2008-08-23)
MIME-Version: 1.0
Content-Type: TEXT/PLAIN;
	charset=US-ASCII
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, 27 Jan 2012, Wu Fengguang wrote:

> +
> +#define RA_STAT_BATCH	(INT_MAX / 2)
> +static struct percpu_counter ra_stat[RA_PATTERN_ALL][RA_ACCOUNT_MAX];

Why use percpu counter here? The stats structures are not dynamically
allocated so you can just use a DECLARE_PER_CPU statement. That way you do
not have the overhead of percpu counter calls. Instead simple instructions
are generated to deal with the counter.

There are also no calls to any of the fast access functions for percpu
counter, so percpu_counter always has to loop over all counters anyway to
get the results. The batching of the percpu_counters is therefore not
used.

It's simpler to just do a loop that sums over all counters when displaying
the results.

> +static inline void add_ra_stat(int i, int j, s64 amount)
> +{
> +	__percpu_counter_add(&ra_stat[i][j], amount, RA_STAT_BATCH);

	__this_cpu_add(ra_stat[i][j], amount);

> +}
> +
> +static void readahead_stats_reset(void)
> +{
> +	int i, j;
> +
> +	for (i = 0; i < RA_PATTERN_ALL; i++)
> +		for (j = 0; j < RA_ACCOUNT_MAX; j++)
> +			percpu_counter_set(&ra_stat[i][j], 0);

for_each_online(cpu)
	memset(per_cpu_ptr(&ra_stat, cpu), 0, sizeof(ra_stat));

> +}
> +
> +static void
> +readahead_stats_sum(long long ra_stats[RA_PATTERN_MAX][RA_ACCOUNT_MAX])
> +{
> +	int i, j;
> +
> +	for (i = 0; i < RA_PATTERN_ALL; i++)
> +		for (j = 0; j < RA_ACCOUNT_MAX; j++) {
> +			s64 n = percpu_counter_sum(&ra_stat[i][j]);
> +			ra_stats[i][j] += n;
> +			ra_stats[RA_PATTERN_ALL][j] += n;
> +		}
> +}

Define a function stats instead?

static long get_stat_sum(long __per_cpu *x)
{
	int cpu;
	long sum = 0;

	for_each_online(cpu)
		sum += *per_cpu_ptr(x, cpu);

	return sum;
}

> +
> +static int readahead_stats_show(struct seq_file *s, void *_)
> +{
> +	readahead_stats_sum(ra_stats);
> +
> +	for (i = 0; i < RA_PATTERN_MAX; i++) {
> +		unsigned long count = ra_stats[i][RA_ACCOUNT_COUNT];

				= get_stats(&ra_stats[i][RA_ACCOUNT]);

...
?

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1753515Ab2A0UPy (ORCPT ); Fri, 27 Jan 2012 15:15:54 -0500
Received: from mail.linuxfoundation.org ([140.211.169.12]:54110 "EHLO
	mail.linuxfoundation.org" rhost-flags-OK-OK-OK-OK)
	by vger.kernel.org with ESMTP id S1752107Ab2A0UPw (ORCPT );
	Fri, 27 Jan 2012 15:15:52 -0500
Date: Fri, 27 Jan 2012 12:15:51 -0800
From: Andrew Morton
To: Christoph Lameter
Cc: Wu Fengguang , Andi Kleen , Ingo Molnar , Jens Axboe , Peter Zijlstra , Rik van Riel , Linux Memory Management List , linux-fsdevel@vger.kernel.org, LKML
Subject: Re: [PATCH 6/9] readahead: add /debug/readahead/stats
Message-Id: <20120127121551.acd256aa.akpm@linux-foundation.org>
In-Reply-To:
References: <20120127030524.854259561@intel.com> <20120127031327.159293683@intel.com>
X-Mailer: Sylpheed 3.0.2 (GTK+ 2.20.1; x86_64-pc-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, 27 Jan 2012 10:21:36 -0600 (CST) Christoph Lameter wrote:

> > +
> > +static void readahead_stats_reset(void)
> > +{
> > +	int i, j;
> > +
> > +	for (i = 0; i < RA_PATTERN_ALL; i++)
> > +		for (j = 0; j < RA_ACCOUNT_MAX; j++)
> > +			percpu_counter_set(&ra_stat[i][j], 0);
>
> for_each_online(cpu)
> 	memset(per_cpu_ptr(&ra_stat, cpu), 0, sizeof(ra_stat));

for_each_possible_cpu(). And that's one reason to not open-code the
operation.
Another is so we don't have tiresome open-coded loops all over the
place.

But before doing either of those things we should choose boring old
atomic_inc(). Has it been shown that the cost of doing so is
unacceptable? Bearing this in mind:

> The accounting code will be compiled in by default
> (CONFIG_READAHEAD_STATS=y), and will remain inactive by default.

I agree with those choices. They effectively mean that the stats will
be a developer-only/debugger-only thing. So even if the atomic_inc()
costs are measurable during these develop/debug sessions, is anyone
likely to care?

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1751146Ab2A2FRc (ORCPT ); Sun, 29 Jan 2012 00:17:32 -0500
Received: from mga14.intel.com ([143.182.124.37]:65108 "EHLO
	mga14.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1749667Ab2A2FRa (ORCPT );
	Sun, 29 Jan 2012 00:17:30 -0500
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="4.71,315,1320652800"; d="scan'208";a="100765246"
Date: Sun, 29 Jan 2012 13:07:22 +0800
From: Wu Fengguang
To: Andrew Morton
Cc: Christoph Lameter , Andi Kleen , Ingo Molnar , Jens Axboe , Peter Zijlstra , Rik van Riel , Linux Memory Management List , linux-fsdevel@vger.kernel.org, LKML
Subject: Re: [PATCH 6/9] readahead: add /debug/readahead/stats
Message-ID: <20120129050722.GC26244@localhost>
References: <20120127030524.854259561@intel.com> <20120127031327.159293683@intel.com> <20120127121551.acd256aa.akpm@linux-foundation.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20120127121551.acd256aa.akpm@linux-foundation.org>
User-Agent: Mutt/1.5.20 (2009-06-14)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Jan 27, 2012 at 12:15:51PM -0800, Andrew Morton wrote:

> > The accounting code will be compiled in by default
> > (CONFIG_READAHEAD_STATS=y), and will remain inactive by default.
>
> I agree with those choices. They effectively mean that the stats will
> be a developer-only/debugger-only thing. So even if the atomic_inc()
> costs are measurable during these develop/debug sessions, is anyone
> likely to care?

Sorry I have changed the default to CONFIG_READAHEAD_STATS=n to avoid
bloating the kernel (and forgot to edit the changelog accordingly).
I'm not sure how many people are going to check the readahead stats.

Thanks,
Fengguang

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1757982Ab2BIDdI (ORCPT ); Wed, 8 Feb 2012 22:33:08 -0500
Received: from mga14.intel.com ([143.182.124.37]:9344 "EHLO
	mga14.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1757964Ab2BIDdE (ORCPT );
	Wed, 8 Feb 2012 22:33:04 -0500
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="4.71,315,1320652800"; d="scan'208";a="104762213"
Date: Thu, 9 Feb 2012 11:22:55 +0800
From: Wu Fengguang
To: Andrew Morton
Cc: Andi Kleen , Ingo Molnar , Jens Axboe , Peter Zijlstra , Rik van Riel , Wu Fengguang , Linux Memory Management List , linux-fsdevel@vger.kernel.org, LKML
Subject: [PATCH 6/9 update changelog] readahead: add /debug/readahead/stats
Message-ID: <20120209032255.GA27396@localhost>
References: <20120127030524.854259561@intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline; filename="readahead-stats.patch"
User-Agent: quilt/0.48-1
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

This accounting code is effectively a no-op by default
(CONFIG_READAHEAD_STATS=n).
It's expected to be runtime reset and enabled before using:

	echo 0 > /debug/readahead/stats		# reset counters
	echo 1 > /debug/readahead/stats_enable
	# run test workload
	echo 0 > /debug/readahead/stats_enable

Example output:
(taken from a fresh booted NFS-ROOT console box with rsize=524288)

$ cat /debug/readahead/stats
pattern     readahead    eof_hit  cache_hit         io    sync_io    mmap_io    meta_io       size async_size    io_size
initial           702        511          0        692        692          0          0          2          0          2
subsequent          7          0          1          7          1          1          0         23         22         23
context           160        161          0          2          0          1          0          0          0         16
around            184        184        177        184        184        184          0         58          0         53
backwards           2          0          2          2          2          0          0          4          0          3
fadvise          2593         47          8       2588       2588          0          0          1          0          1
oversize            0          0          0          0          0          0          0          0          0          0
random             45         20          0         44         44          0          0          1          0          1
all              3697        923        188       3519       3511        186          0          4          0          4

The two most important columns are
- io		number of readahead IO
- io_size	average readahead IO size

CC: Ingo Molnar
CC: Jens Axboe
CC: Peter Zijlstra
Acked-by: Rik van Riel
Signed-off-by: Wu Fengguang
---
 mm/Kconfig     |   15 +++
 mm/readahead.c |  202 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 217 insertions(+)

--- linux-next.orig/mm/readahead.c	2012-01-25 15:57:52.000000000 +0800
+++ linux-next/mm/readahead.c	2012-01-25 15:57:53.000000000 +0800
@@ -33,6 +33,202 @@ EXPORT_SYMBOL_GPL(file_ra_state_init);

 #define list_to_page(head) (list_entry((head)->prev, struct page, lru))

+#ifdef CONFIG_READAHEAD_STATS
+#include
+#include
+#include
+
+static u32 readahead_stats_enable __read_mostly;
+
+static const struct trace_print_flags ra_pattern_names[] = {
+	READAHEAD_PATTERNS
+};
+
+enum ra_account {
+	/* number of readaheads */
+	RA_ACCOUNT_COUNT,	/* readahead request */
+	RA_ACCOUNT_EOF,		/* readahead request covers EOF */
+	RA_ACCOUNT_CACHE_HIT,	/* readahead request covers some cached pages */
+	RA_ACCOUNT_IOCOUNT,	/* readahead IO */
+	RA_ACCOUNT_SYNC,	/* readahead IO that is synchronous */
+	RA_ACCOUNT_MMAP,	/* readahead IO by mmap page faults */
+	RA_ACCOUNT_METADATA,	/* readahead IO on metadata */
+	/* number of readahead pages */
+	RA_ACCOUNT_SIZE,	/* readahead size */
+	RA_ACCOUNT_ASYNC_SIZE,	/* readahead async size */
+	RA_ACCOUNT_ACTUAL,	/* readahead actual IO size */
+	/* end mark */
+	RA_ACCOUNT_MAX,
+};
+
+#define RA_STAT_BATCH	(INT_MAX / 2)
+static struct percpu_counter ra_stat[RA_PATTERN_ALL][RA_ACCOUNT_MAX];
+
+static inline void add_ra_stat(int i, int j, s64 amount)
+{
+	__percpu_counter_add(&ra_stat[i][j], amount, RA_STAT_BATCH);
+}
+
+static inline void inc_ra_stat(int i, int j)
+{
+	add_ra_stat(i, j, 1);
+}
+
+static void readahead_stats(struct address_space *mapping,
+			    pgoff_t offset,
+			    unsigned long req_size,
+			    bool for_mmap,
+			    bool for_metadata,
+			    enum readahead_pattern pattern,
+			    pgoff_t start,
+			    unsigned long size,
+			    unsigned long async_size,
+			    int actual)
+{
+	pgoff_t eof = ((i_size_read(mapping->host)-1) >> PAGE_CACHE_SHIFT) + 1;
+
+	inc_ra_stat(pattern, RA_ACCOUNT_COUNT);
+	add_ra_stat(pattern, RA_ACCOUNT_SIZE, size);
+	add_ra_stat(pattern, RA_ACCOUNT_ASYNC_SIZE, async_size);
+	add_ra_stat(pattern, RA_ACCOUNT_ACTUAL, actual);
+
+	if (start + size >= eof)
+		inc_ra_stat(pattern, RA_ACCOUNT_EOF);
+	if (actual < size)
+		inc_ra_stat(pattern, RA_ACCOUNT_CACHE_HIT);
+
+	if (actual) {
+		inc_ra_stat(pattern, RA_ACCOUNT_IOCOUNT);
+
+		if (start <= offset && offset < start + size)
+			inc_ra_stat(pattern, RA_ACCOUNT_SYNC);
+
+		if (for_mmap)
+			inc_ra_stat(pattern, RA_ACCOUNT_MMAP);
+		if (for_metadata)
+			inc_ra_stat(pattern, RA_ACCOUNT_METADATA);
+	}
+}
+
+static void readahead_stats_reset(void)
+{
+	int i, j;
+
+	for (i = 0; i < RA_PATTERN_ALL; i++)
+		for (j = 0; j < RA_ACCOUNT_MAX; j++)
+			percpu_counter_set(&ra_stat[i][j], 0);
+}
+
+static void
+readahead_stats_sum(long long ra_stats[RA_PATTERN_MAX][RA_ACCOUNT_MAX])
+{
+	int i, j;
+
+	for (i = 0; i < RA_PATTERN_ALL; i++)
+		for (j = 0; j < RA_ACCOUNT_MAX; j++) {
+			s64 n = percpu_counter_sum(&ra_stat[i][j]);
+			ra_stats[i][j] += n;
+			ra_stats[RA_PATTERN_ALL][j] += n;
+		}
+}
+
+static int readahead_stats_show(struct seq_file *s, void *_)
+{
+	long long ra_stats[RA_PATTERN_MAX][RA_ACCOUNT_MAX];
+	int i;
+
+	seq_printf(s,
+		   "%-10s %10s %10s %10s %10s %10s %10s %10s %10s %10s %10s\n",
+		   "pattern", "readahead", "eof_hit", "cache_hit",
+		   "io", "sync_io", "mmap_io", "meta_io",
+		   "size", "async_size", "io_size");
+
+	memset(ra_stats, 0, sizeof(ra_stats));
+	readahead_stats_sum(ra_stats);
+
+	for (i = 0; i < RA_PATTERN_MAX; i++) {
+		unsigned long count = ra_stats[i][RA_ACCOUNT_COUNT];
+		unsigned long iocount = ra_stats[i][RA_ACCOUNT_IOCOUNT];
+		/*
+		 * avoid division-by-zero
+		 */
+		if (count == 0)
+			count = 1;
+		if (iocount == 0)
+			iocount = 1;
+
+		seq_printf(s, "%-10s %10lld %10lld %10lld %10lld %10lld "
+			   "%10lld %10lld %10lld %10lld %10lld\n",
+				ra_pattern_names[i].name,
+				ra_stats[i][RA_ACCOUNT_COUNT],
+				ra_stats[i][RA_ACCOUNT_EOF],
+				ra_stats[i][RA_ACCOUNT_CACHE_HIT],
+				ra_stats[i][RA_ACCOUNT_IOCOUNT],
+				ra_stats[i][RA_ACCOUNT_SYNC],
+				ra_stats[i][RA_ACCOUNT_MMAP],
+				ra_stats[i][RA_ACCOUNT_METADATA],
+				ra_stats[i][RA_ACCOUNT_SIZE] / count,
+				ra_stats[i][RA_ACCOUNT_ASYNC_SIZE] / count,
+				ra_stats[i][RA_ACCOUNT_ACTUAL] / iocount);
+	}
+
+	return 0;
+}
+
+static int readahead_stats_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, readahead_stats_show, NULL);
+}
+
+static ssize_t readahead_stats_write(struct file *file, const char __user *buf,
+				     size_t size, loff_t *offset)
+{
+	readahead_stats_reset();
+	return size;
+}
+
+static const struct file_operations readahead_stats_fops = {
+	.owner		= THIS_MODULE,
+	.open		= readahead_stats_open,
+	.write		= readahead_stats_write,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+static int __init readahead_create_debugfs(void)
+{
+	struct dentry *root;
+	struct dentry *entry;
+	int i, j;
+
+	root = debugfs_create_dir("readahead", NULL);
+	if (!root)
+		goto out;
+
+	entry = debugfs_create_file("stats", 0644, root,
+				    NULL, &readahead_stats_fops);
+	if (!entry)
+		goto out;
+
+	entry = debugfs_create_bool("stats_enable", 0644, root,
+				    &readahead_stats_enable);
+	if (!entry)
+		goto out;
+
+	for (i = 0; i < RA_PATTERN_ALL; i++)
+		for (j = 0; j < RA_ACCOUNT_MAX; j++)
+			percpu_counter_init(&ra_stat[i][j], 0);
+
+	return 0;
+out:
+	printk(KERN_ERR "readahead: failed to create debugfs entries\n");
+	return -ENOMEM;
+}
+
+late_initcall(readahead_create_debugfs);
+#endif
+
 static inline void readahead_event(struct address_space *mapping,
 				   pgoff_t offset,
 				   unsigned long req_size,
@@ -44,6 +240,12 @@ static inline void readahead_event(struc
 				   unsigned long async_size,
 				   int actual)
 {
+#ifdef CONFIG_READAHEAD_STATS
+	if (readahead_stats_enable)
+		readahead_stats(mapping, offset, req_size,
+				for_mmap, for_metadata,
+				pattern, start, size, async_size, actual);
+#endif
 	trace_readahead(mapping, offset, req_size,
 			pattern, start, size, async_size, actual);
 }
--- linux-next.orig/mm/Kconfig	2012-01-25 15:57:46.000000000 +0800
+++ linux-next/mm/Kconfig	2012-01-25 15:57:53.000000000 +0800
@@ -379,3 +379,18 @@ config CLEANCACHE
 	  in a negligible performance hit.

 	  If unsure, say Y to enable cleancache
+
+config READAHEAD_STATS
+	bool "Collect page cache readahead stats"
+	depends on DEBUG_FS
+	default n
+	help
+	  This provides the readahead events accounting facilities.
+
+	  To do readahead accounting for a workload:
+
+	  echo 0 > /sys/kernel/debug/readahead/stats  # reset counters
+	  echo 1 > /sys/kernel/debug/readahead/stats_enable
+	  # run the workload
+	  cat /sys/kernel/debug/readahead/stats       # check counters
+	  echo 0 > /sys/kernel/debug/readahead/stats_enable