* [RFC PATCH 1/5] x86/Kconfig: Add Page Cache Accounting entry [not found] <no> @ 2011-03-02 8:38 ` Liu Yuan 2011-03-02 16:24 ` Randy Dunlap 2011-03-02 8:38 ` [RFC PATCH 2/5] block: Add functions and data types for Page Cache Accounting Liu Yuan ` (3 subsequent siblings) 4 siblings, 1 reply; 15+ messages in thread From: Liu Yuan @ 2011-03-02 8:38 UTC (permalink / raw) To: linux-kernel, linux-mm, jaxboe, akpm, fengguang.wu From: Liu Yuan <tailai.ly@taobao.com> Signed-off-by: Liu Yuan <tailai.ly@taobao.com> --- arch/x86/Kconfig.debug | 9 +++++++++ 1 files changed, 9 insertions(+), 0 deletions(-) diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug index 615e188..f29e32d 100644 --- a/arch/x86/Kconfig.debug +++ b/arch/x86/Kconfig.debug @@ -304,4 +304,13 @@ config DEBUG_STRICT_USER_COPY_CHECKS If unsure, or if you run an older (pre 4.4) gcc, say N. +config PAGE_CACHE_ACCT + bool "Page cache accounting" + ---help--- + Enabling this options to account for page cache hit/missed number of + times. This would allow user space applications get better knowledge + of underlying page cache system by reading virtual file. The statitics + per partition are collected. + + If unsure, say N. endmenu -- 1.7.0.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [RFC PATCH 1/5] x86/Kconfig: Add Page Cache Accounting entry 2011-03-02 8:38 ` [RFC PATCH 1/5] x86/Kconfig: Add Page Cache Accounting entry Liu Yuan @ 2011-03-02 16:24 ` Randy Dunlap 2011-03-03 3:11 ` Liu Yuan 0 siblings, 1 reply; 15+ messages in thread From: Randy Dunlap @ 2011-03-02 16:24 UTC (permalink / raw) To: Liu Yuan; +Cc: linux-kernel, linux-mm, jaxboe, akpm, fengguang.wu On Wed, 2 Mar 2011 16:38:06 +0800 Liu Yuan wrote: > From: Liu Yuan <tailai.ly@taobao.com> > > Signed-off-by: Liu Yuan <tailai.ly@taobao.com> > --- > arch/x86/Kconfig.debug | 9 +++++++++ > 1 files changed, 9 insertions(+), 0 deletions(-) > > diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug > index 615e188..f29e32d 100644 > --- a/arch/x86/Kconfig.debug > +++ b/arch/x86/Kconfig.debug > @@ -304,4 +304,13 @@ config DEBUG_STRICT_USER_COPY_CHECKS > > If unsure, or if you run an older (pre 4.4) gcc, say N. > > +config PAGE_CACHE_ACCT > + bool "Page cache accounting" > + ---help--- > + Enabling this options to account for page cache hit/missed number of > + times. This would allow user space applications get better knowledge > + of underlying page cache system by reading virtual file. The statitics > + per partition are collected. > + > + If unsure, say N. > endmenu > -- rewrite: Enable this option to provide for page cache hit/miss counters. This allows userspace applications to obtain better knowledge of the underlying page cache subsystem by reading a virtual file. Statistics are collected per partition. questions: what virtual file? what kind of partition? --- ~Randy *** Remember to use Documentation/SubmitChecklist when testing your code ***
* Re: [RFC PATCH 1/5] x86/Kconfig: Add Page Cache Accounting entry 2011-03-02 16:24 ` Randy Dunlap @ 2011-03-03 3:11 ` Liu Yuan 0 siblings, 0 replies; 15+ messages in thread From: Liu Yuan @ 2011-03-03 3:11 UTC (permalink / raw) To: Randy Dunlap; +Cc: linux-kernel, linux-mm, jaxboe, akpm, fengguang.wu [-- Attachment #1: Type: text/plain, Size: 1801 bytes --] On Thu, Mar 3, 2011 at 12:24 AM, Randy Dunlap <rdunlap@xenotime.net> wrote: > On Wed, 2 Mar 2011 16:38:06 +0800 Liu Yuan wrote: > > > From: Liu Yuan <tailai.ly@taobao.com> > > > > Signed-off-by: Liu Yuan <tailai.ly@taobao.com> > > --- > > arch/x86/Kconfig.debug | 9 +++++++++ > > 1 files changed, 9 insertions(+), 0 deletions(-) > > > > diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug > > index 615e188..f29e32d 100644 > > --- a/arch/x86/Kconfig.debug > > +++ b/arch/x86/Kconfig.debug > > @@ -304,4 +304,13 @@ config DEBUG_STRICT_USER_COPY_CHECKS > > > > If unsure, or if you run an older (pre 4.4) gcc, say N. > > > > +config PAGE_CACHE_ACCT > > + bool "Page cache accounting" > > + ---help--- > > + Enabling this options to account for page cache hit/missed number > of > > + times. This would allow user space applications get better > knowledge > > + of underlying page cache system by reading virtual file. The > statitics > > + per partition are collected. > > + > > + If unsure, say N. > > endmenu > > -- > > rewrite: > > Enable this option to provide for page cache hit/miss counters. > This allows userspace applications to obtain better knowledge of > the > underlying page cache subsystem by reading a virtual file. > Statistics are collect per partition. > > questions: > what virtual file? > what kind of partition? > > Hi Randy, Thanks for your correction. 'virtual file' in this patch context means files in sysfs mounted at /sys. 'partition' indicates that every disk partition (like /dev/sda/{sda1,sda2...} has its own accountings for page cache information. I am not confident about phrasing. 
Please correct it in whatever way you think is better. Thanks, Liu Yuan
* [RFC PATCH 2/5] block: Add functions and data types for Page Cache Accounting [not found] <no> 2011-03-02 8:38 ` [RFC PATCH 1/5] x86/Kconfig: Add Page Cache Accounting entry Liu Yuan @ 2011-03-02 8:38 ` Liu Yuan 2011-03-02 8:38 ` [RFC PATCH 3/5] block: Make Page Cache counters work with sysfs Liu Yuan ` (2 subsequent siblings) 4 siblings, 0 replies; 15+ messages in thread From: Liu Yuan @ 2011-03-02 8:38 UTC (permalink / raw) To: linux-kernel, linux-mm, jaxboe, akpm, fengguang.wu From: Liu Yuan <tailai.ly@taobao.com> These functions and data types are based on the percpu disk stats infrastructure. Signed-off-by: Liu Yuan <tailai.ly@taobao.com> --- include/linux/genhd.h | 56 +++++++++++++++++++++++++++++++++++++++++++++++++ 1 files changed, 56 insertions(+), 0 deletions(-) diff --git a/include/linux/genhd.h b/include/linux/genhd.h index c0d5f69..4f0257c 100644 --- a/include/linux/genhd.h +++ b/include/linux/genhd.h @@ -86,6 +86,11 @@ struct disk_stats { unsigned long ticks[2]; unsigned long io_ticks; unsigned long time_in_queue; +#ifdef CONFIG_PAGE_CACHE_ACCT + unsigned long page_cache_readpages; + unsigned long page_cache_hit[2]; + unsigned long page_cache_missed[2]; +#endif }; #define PARTITION_META_INFO_VOLNAMELTH 64 @@ -400,6 +405,57 @@ static inline void free_part_info(struct hd_struct *part) kfree(part->info); } +#ifdef CONFIG_PAGE_CACHE_ACCT +static inline void page_cache_acct_readpages(struct super_block *sb, int nr_pages) +{ + struct block_device *bdev = sb->s_bdev; + struct hd_struct *part; + int cpu; + if (likely(bdev) && likely(part = bdev->bd_part)) { + cpu = part_stat_lock(); + part_stat_add(cpu, part, page_cache_readpages, nr_pages); + part_stat_unlock(); + } +} +static inline void page_cache_acct_hit(struct super_block *sb, int rw) +{ + struct block_device *bdev = sb->s_bdev; + struct hd_struct *part; + int cpu; + if (likely(bdev) && likely(part = bdev->bd_part)) { + cpu = part_stat_lock(); + part_stat_inc(cpu, part, page_cache_hit[rw]); + 
part_stat_unlock(); + } +} + +static inline void page_cache_acct_missed(struct super_block *sb, int rw) +{ + struct block_device *bdev = sb->s_bdev; + struct hd_struct *part; + int cpu; + if (likely(bdev) && likely(part = bdev->bd_part)) { + cpu = part_stat_lock(); + part_stat_inc(cpu, part, page_cache_missed[rw]); + part_stat_unlock(); + } +} + +#else /* !CONFIG_PAGE_CACHE_ACCT */ +static inline void page_cache_acct_readpages(struct super_block *sb, int nr_pages) +{ +} + +static inline void page_cache_acct_hit(struct super_block *sb, int rw) +{ +} + +static inline void page_cache_acct_missed(struct super_block *sb, int rw) +{ +} + +#endif /* CONFIG_PAGE_CACHE_ACCT */ + /* block/blk-core.c */ extern void part_round_stats(int cpu, struct hd_struct *part); -- 1.7.0.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 15+ messages in thread
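The helpers in patch 2/5 lean on the per-CPU disk stats pattern: each CPU bumps its own copy of the counters, and a reader folds all copies into one total. A minimal userspace sketch of that idea follows. It only models the pattern; the names (pc_part, pc_acct_*, pc_read, NR_CPUS_SKETCH) are hypothetical and the caller passes the CPU index explicitly, whereas the kernel obtains it from part_stat_lock().

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS_SKETCH 4	/* hypothetical CPU count, for illustration only */

enum { PC_READ = 0, PC_WRITE = 1 };

/* Mirrors the three fields the patch adds to struct disk_stats. */
struct pc_stats {
	unsigned long readpages;
	unsigned long hit[2];
	unsigned long missed[2];
};

/* One slot per CPU: a CPU only ever updates its own slot. */
struct pc_part {
	struct pc_stats percpu[NR_CPUS_SKETCH];
};

static void pc_acct_hit(struct pc_part *part, int cpu, int rw)
{
	assert(cpu >= 0 && cpu < NR_CPUS_SKETCH);
	part->percpu[cpu].hit[rw]++;
}

static void pc_acct_missed(struct pc_part *part, int cpu, int rw)
{
	assert(cpu >= 0 && cpu < NR_CPUS_SKETCH);
	part->percpu[cpu].missed[rw]++;
}

static void pc_acct_readpages(struct pc_part *part, int cpu, int nr_pages)
{
	assert(cpu >= 0 && cpu < NR_CPUS_SKETCH);
	part->percpu[cpu].readpages += (unsigned long)nr_pages;
}

/* Reader side: sum every per-CPU slot into a single snapshot. */
static struct pc_stats pc_read(const struct pc_part *part)
{
	struct pc_stats sum = {0};
	size_t cpu;

	for (cpu = 0; cpu < NR_CPUS_SKETCH; cpu++) {
		sum.readpages += part->percpu[cpu].readpages;
		sum.hit[PC_READ] += part->percpu[cpu].hit[PC_READ];
		sum.hit[PC_WRITE] += part->percpu[cpu].hit[PC_WRITE];
		sum.missed[PC_READ] += part->percpu[cpu].missed[PC_READ];
		sum.missed[PC_WRITE] += part->percpu[cpu].missed[PC_WRITE];
	}
	return sum;
}
```

The update path stays cheap because no slot is shared between CPUs; the cost moves to the (rare) reader, which has to walk all slots.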
* [RFC PATCH 3/5] block: Make Page Cache counters work with sysfs [not found] <no> 2011-03-02 8:38 ` [RFC PATCH 1/5] x86/Kconfig: Add Page Cache Accounting entry Liu Yuan 2011-03-02 8:38 ` [RFC PATCH 2/5] block: Add functions and data types for Page Cache Accounting Liu Yuan @ 2011-03-02 8:38 ` Liu Yuan 2011-03-02 8:38 ` [RFC PATCH 4/5] mm: Add hit/miss accounting for Page Cache Liu Yuan 2011-03-02 8:38 ` [RFC PATCH 5/5] mm: Add readpages accounting Liu Yuan 4 siblings, 0 replies; 15+ messages in thread From: Liu Yuan @ 2011-03-02 8:38 UTC (permalink / raw) To: linux-kernel, linux-mm, jaxboe, akpm, fengguang.wu From: Liu Yuan <tailai.ly@taobao.com> Three counters are exported to the userspace from /sys/block/sdx/{,sdxx}/page_cache_stats. Signed-off-by: Liu Yuan <tailai.ly@taobao.com> --- block/genhd.c | 6 ++++++ fs/partitions/check.c | 23 +++++++++++++++++++++++ include/linux/genhd.h | 4 ++++ 3 files changed, 33 insertions(+), 0 deletions(-) diff --git a/block/genhd.c b/block/genhd.c index cbf1112..af9e7f8 100644 --- a/block/genhd.c +++ b/block/genhd.c @@ -988,6 +988,9 @@ static struct device_attribute dev_attr_fail_timeout = __ATTR(io-timeout-fail, S_IRUGO|S_IWUSR, part_timeout_show, part_timeout_store); #endif +#ifdef CONFIG_PAGE_CACHE_ACCT +static DEVICE_ATTR(page_cache_stats, S_IRUGO, part_page_cache_stats_show, NULL); +#endif static struct attribute *disk_attrs[] = { &dev_attr_range.attr, @@ -1006,6 +1009,9 @@ static struct attribute *disk_attrs[] = { #ifdef CONFIG_FAIL_IO_TIMEOUT &dev_attr_fail_timeout.attr, #endif +#ifdef CONFIG_PAGE_CACHE_ACCT + &dev_attr_page_cache_stats.attr, +#endif NULL }; diff --git a/fs/partitions/check.c b/fs/partitions/check.c index 9c21119..e882e95 100644 --- a/fs/partitions/check.c +++ b/fs/partitions/check.c @@ -316,6 +316,23 @@ ssize_t part_fail_store(struct device *dev, } #endif +#ifdef CONFIG_PAGE_CACHE_ACCT +ssize_t part_page_cache_stats_show(struct device *dev, + struct device_attribute * attr, + char *buf) +{ + struct 
hd_struct *p = dev_to_part(dev); + + return sprintf(buf, + "%8lu %8lu %8lu %8lu %8lu\n ", + part_stat_read(p, page_cache_readpages), + part_stat_read(p, page_cache_missed[READ]), + part_stat_read(p, page_cache_hit[READ]), + part_stat_read(p, page_cache_missed[WRITE]), + part_stat_read(p, page_cache_hit[WRITE])); +} +#endif + static DEVICE_ATTR(partition, S_IRUGO, part_partition_show, NULL); static DEVICE_ATTR(start, S_IRUGO, part_start_show, NULL); static DEVICE_ATTR(size, S_IRUGO, part_size_show, NULL); @@ -329,6 +346,9 @@ static DEVICE_ATTR(inflight, S_IRUGO, part_inflight_show, NULL); static struct device_attribute dev_attr_fail = __ATTR(make-it-fail, S_IRUGO|S_IWUSR, part_fail_show, part_fail_store); #endif +#ifdef CONFIG_PAGE_CACHE_ACCT +static DEVICE_ATTR(page_cache_stats, S_IRUGO, part_page_cache_stats_show, NULL); +#endif static struct attribute *part_attrs[] = { &dev_attr_partition.attr, @@ -342,6 +362,9 @@ static struct attribute *part_attrs[] = { #ifdef CONFIG_FAIL_MAKE_REQUEST &dev_attr_fail.attr, #endif +#ifdef CONFIG_PAGE_CACHE_ACCT + &dev_attr_page_cache_stats.attr, +#endif NULL }; diff --git a/include/linux/genhd.h b/include/linux/genhd.h index 4f0257c..0ecd165 100644 --- a/include/linux/genhd.h +++ b/include/linux/genhd.h @@ -682,6 +682,10 @@ extern ssize_t part_fail_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count); #endif /* CONFIG_FAIL_MAKE_REQUEST */ +#ifdef CONFIG_PAGE_CACHE_ACCT +extern ssize_t part_page_cache_stats_show(struct device *dev, + struct device_attribute *attr, char *buf); +#endif /* CONFIG_PAGE_CACHE_ACCT */ static inline void hd_ref_init(struct hd_struct *part) { -- 1.7.0.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
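For anyone wanting to consume the attribute from patch 3/5, here is a small userspace sketch that parses one line in the order part_page_cache_stats_show() prints it: readpages, read misses, read hits, write misses, write hits. The struct and function names are hypothetical, and the file only exists on a kernel carrying this RFC series; a path such as /sys/block/sda/sda1/page_cache_stats is an assumption based on the patch description.

```c
#include <assert.h>
#include <stdio.h>

/* Field order follows the sprintf() in part_page_cache_stats_show(). */
struct pc_sample {
	unsigned long readpages;
	unsigned long read_missed;
	unsigned long read_hit;
	unsigned long write_missed;
	unsigned long write_hit;
};

/* Parse one line of page_cache_stats; returns 0 on success, -1 otherwise. */
static int pc_parse_line(const char *line, struct pc_sample *out)
{
	int n;

	assert(line && out);
	n = sscanf(line, "%lu %lu %lu %lu %lu",
		   &out->readpages,
		   &out->read_missed, &out->read_hit,
		   &out->write_missed, &out->write_hit);
	return n == 5 ? 0 : -1;
}

/* Read and parse the attribute at 'path'; returns 0 on success, -1 otherwise. */
static int pc_read_file(const char *path, struct pc_sample *out)
{
	char buf[128];
	int ret = -1;
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (fgets(buf, sizeof(buf), f))
		ret = pc_parse_line(buf, out);
	fclose(f);
	return ret;
}

/* Read hit ratio in [0, 1], or -1.0 when no read lookups were recorded. */
static double pc_read_hit_ratio(const struct pc_sample *s)
{
	unsigned long total = s->read_hit + s->read_missed;

	return total ? (double)s->read_hit / (double)total : -1.0;
}
```

Since the counters are cumulative, a monitoring tool would normally sample the file twice and compute the ratio over the difference rather than over the raw totals.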
* [RFC PATCH 4/5] mm: Add hit/miss accounting for Page Cache [not found] <no> ` (2 preceding siblings ...) 2011-03-02 8:38 ` [RFC PATCH 3/5] block: Make Page Cache counters work with sysfs Liu Yuan @ 2011-03-02 8:38 ` Liu Yuan 2011-03-02 8:45 ` Ingo Molnar 2011-03-02 8:38 ` [RFC PATCH 5/5] mm: Add readpages accounting Liu Yuan 4 siblings, 1 reply; 15+ messages in thread From: Liu Yuan @ 2011-03-02 8:38 UTC (permalink / raw) To: linux-kernel, linux-mm, jaxboe, akpm, fengguang.wu From: Liu Yuan <tailai.ly@taobao.com> Hit/miss accounting is request-centric: a single request causes either one hit or one miss to be accounted, at the very first time the kernel queries the page cache. In some rare error conditions the kernel re-queries the page cache, but we do not account for those re-queries, for simplicity. Signed-off-by: Liu Yuan <tailai.ly@taobao.com> --- mm/filemap.c | 26 ++++++++++++++++++++++---- 1 files changed, 22 insertions(+), 4 deletions(-) diff --git a/mm/filemap.c b/mm/filemap.c index 83a45d3..5388b2a 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1009,10 +1009,15 @@ static void do_generic_file_read(struct file *filp, loff_t *ppos, pgoff_t end_index; loff_t isize; unsigned long nr, ret; + int retry_find = 0; cond_resched(); find_page: page = find_get_page(mapping, index); + if (likely(!retry_find) && page && PageUptodate(page)) + page_cache_acct_hit(inode->i_sb, READ); + else + page_cache_acct_missed(inode->i_sb, READ); if (!page) { page_cache_sync_readahead(mapping, ra, filp, @@ -1137,6 +1142,7 @@ readpage: if (unlikely(error)) { if (error == AOP_TRUNCATED_PAGE) { page_cache_release(page); + retry_find = 1; goto find_page; } goto readpage_error; @@ -1153,6 +1159,7 @@ readpage: */ unlock_page(page); page_cache_release(page); + retry_find = 1; goto find_page; } unlock_page(page); @@ -1185,8 +1192,10 @@ no_cached_page: index, GFP_KERNEL); if (error) { page_cache_release(page); - if (error == -EEXIST) + if (error == -EEXIST) { + retry_find = 1; 
goto find_page; + } desc->error = error; goto out; } @@ -1543,6 +1552,7 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf) struct page *page; pgoff_t size; int ret = 0; + int rw = !!(vmf->flags & FAULT_FLAG_WRITE); size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT; if (offset >= size) @@ -1552,6 +1562,10 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf) * Do we have something in the page cache already? */ page = find_get_page(mapping, offset); + if (page && page->mapping && PageUptodate(page)) + page_cache_acct_hit(inode->i_sb, rw); + else + page_cache_acct_missed(inode->i_sb, rw); if (likely(page)) { /* * We found the page, so try async readahead before @@ -2227,20 +2241,24 @@ struct page *grab_cache_page_write_begin(struct address_space *mapping, gfp_notmask = __GFP_FS; repeat: page = find_lock_page(mapping, index); - if (page) + if (page) { + page_cache_acct_hit(mapping->host->i_sb, WRITE); return page; + } page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~gfp_notmask); if (!page) - return NULL; + goto out; status = add_to_page_cache_lru(page, mapping, index, GFP_KERNEL & ~gfp_notmask); if (unlikely(status)) { page_cache_release(page); if (status == -EEXIST) goto repeat; - return NULL; + page = NULL; } +out: + page_cache_acct_missed(mapping->host->i_sb, WRITE); return page; } EXPORT_SYMBOL(grab_cache_page_write_begin); -- 1.7.0.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 15+ messages in thread
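The changelog of patch 4/5 says a request is accounted only on its first page cache lookup. As posted, the do_generic_file_read() hunk tests retry_find inside the hit condition, so a retried lookup appears to fall through to the else branch and be counted as a miss. The userspace sketch below shows one way to make the "first lookup only" rule explicit by checking the retry flag before touching either counter. All names here are hypothetical; this is not the code from the patch.

```c
#include <assert.h>
#include <stdbool.h>

struct pc_counters {
	unsigned long hit;
	unsigned long missed;
};

/*
 * Account a lookup only when it is the first one for this request.
 * 'retry' plays the role of retry_find in the patch.
 */
static void account_first_lookup(struct pc_counters *c, bool retry,
				 bool page_uptodate)
{
	assert(c);
	if (retry)
		return;		/* re-query of the same request: ignored */
	if (page_uptodate)
		c->hit++;
	else
		c->missed++;
}

/*
 * Drive one request through nr_lookups page cache lookups;
 * uptodate[i] says whether lookup i found an up-to-date page.
 */
static void simulate_request(struct pc_counters *c, const bool *uptodate,
			     int nr_lookups)
{
	bool retry = false;
	int i;

	for (i = 0; i < nr_lookups; i++) {
		account_first_lookup(c, retry, uptodate[i]);
		retry = true;	/* like "retry_find = 1; goto find_page;" */
	}
}
```

With this shape, a request that misses, reads the page in, and then re-queries successfully contributes exactly one miss and nothing else.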
* Re: [RFC PATCH 4/5] mm: Add hit/miss accounting for Page Cache 2011-03-02 8:38 ` [RFC PATCH 4/5] mm: Add hit/miss accounting for Page Cache Liu Yuan @ 2011-03-02 8:45 ` Ingo Molnar 2011-03-02 17:02 ` Dave Hansen ` (2 more replies) 0 siblings, 3 replies; 15+ messages in thread From: Ingo Molnar @ 2011-03-02 8:45 UTC (permalink / raw) To: Liu Yuan Cc: linux-kernel, linux-mm, jaxboe, akpm, fengguang.wu, Peter Zijlstra, Frédéric Weisbecker, Steven Rostedt, Thomas Gleixner, Arnaldo Carvalho de Melo * Liu Yuan <namei.unix@gmail.com> wrote: > + if (likely(!retry_find) && page && PageUptodate(page)) > + page_cache_acct_hit(inode->i_sb, READ); > + else > + page_cache_acct_missed(inode->i_sb, READ); Sigh. This would make such a nice tracepoint or sw perf event. It could be collected in a 'count' form, equivalent to the stats you are aiming for here, or it could even be traced, if someone is interested in such details. It could be mixed with other events, enriching multiple apps at once. But, instead of trying to improve those aspects of our existing instrumentation frameworks, mm/* is gradually growing its own special instrumentation hacks, missing the big picture and fragmenting the instrumentation space some more. That trend is somewhat sad. Thanks, Ingo -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC PATCH 4/5] mm: Add hit/miss accounting for Page Cache 2011-03-02 8:45 ` Ingo Molnar @ 2011-03-02 17:02 ` Dave Hansen 2011-03-02 18:49 ` Ingo Molnar 2011-03-03 2:01 ` KOSAKI Motohiro 2011-03-03 3:14 ` Tao Ma 2 siblings, 1 reply; 15+ messages in thread From: Dave Hansen @ 2011-03-02 17:02 UTC (permalink / raw) To: Ingo Molnar Cc: Liu Yuan, linux-kernel, linux-mm, jaxboe, akpm, fengguang.wu, Peter Zijlstra, Frédéric Weisbecker, Steven Rostedt, Thomas Gleixner, Arnaldo Carvalho de Melo On Wed, 2011-03-02 at 09:45 +0100, Ingo Molnar wrote: > But, instead of trying to improve those aspects of our existing instrumentation > frameworks, mm/* is gradually growing its own special instrumentation hacks, missing > the big picture and fragmenting the instrumentation space some more. > > That trend is somewhat sad. Got any handy examples of how you'd like to see these done? We're trying to add a batch of these for transparent huge pages, and there was a similar set for KSM, so there's certainly no shortage of potential sites. -- Dave
* Re: [RFC PATCH 4/5] mm: Add hit/miss accounting for Page Cache 2011-03-02 17:02 ` Dave Hansen @ 2011-03-02 18:49 ` Ingo Molnar 2011-03-03 0:33 ` Wu Fengguang 0 siblings, 1 reply; 15+ messages in thread From: Ingo Molnar @ 2011-03-02 18:49 UTC (permalink / raw) To: Dave Hansen Cc: Liu Yuan, linux-kernel, linux-mm, jaxboe, akpm, fengguang.wu, Peter Zijlstra, Frédéric Weisbecker, Steven Rostedt, Thomas Gleixner, Arnaldo Carvalho de Melo * Dave Hansen <dave@linux.vnet.ibm.com> wrote: > On Wed, 2011-03-02 at 09:45 +0100, Ingo Molnar wrote: > > But, instead of trying to improve those aspects of our existing instrumentation > > frameworks, mm/* is gradually growing its own special instrumentation hacks, missing > > the big picture and fragmenting the instrumentation space some more. > > > > That trend is somewhat sad. > > Go any handy examples of how you'd like to see these done? There's a very, very old branch in tip:tracing/mm (by Steve) that shows off some of the concepts that could be introduced, to 'dump' current MM state via an extension to the tracepoints APIs: 3383e37ea796: tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events c33b3596bc38: tracing, page-allocator: Add trace event for page traffic related to the buddy lists 0d524fb734bc: tracing, mm: Add trace events for anti-fragmentation falling back to other migratetypes b9a28177eedf: tracing, page-allocator: Add trace events for page allocation and page freeing 807243eb20b2: Merge branch 'perfcounters/urgent' into tracing/mm 08b6cb88eeb5: perf_counter tools: Provide default bfd_demangle() function in case it's not around eb4671011887: tracing/mm: rename 'trigger' file to 'dump_range' 1487a7a1ff99: tracing/mm: fix mapcount trace record field dcac8cdac1d4: tracing/mm: add page frame snapshot trace That's just a demo in essence - showing what things could be done in this area. 
You can pick those commits up via running: http://people.redhat.com/mingo/tip.git/README Thanks, Ingo
* Re: [RFC PATCH 4/5] mm: Add hit/miss accounting for Page Cache 2011-03-02 18:49 ` Ingo Molnar @ 2011-03-03 0:33 ` Wu Fengguang 0 siblings, 0 replies; 15+ messages in thread From: Wu Fengguang @ 2011-03-03 0:33 UTC (permalink / raw) To: Ingo Molnar Cc: Dave Hansen, Liu Yuan, linux-kernel@vger.kernel.org, linux-mm@kvack.org, jaxboe@fusionio.com, akpm@linux-foundation.org, Peter Zijlstra, Frédéric Weisbecker, Steven Rostedt, Thomas Gleixner, Arnaldo Carvalho de Melo On Thu, Mar 03, 2011 at 02:49:53AM +0800, Ingo Molnar wrote: > > * Dave Hansen <dave@linux.vnet.ibm.com> wrote: > > > On Wed, 2011-03-02 at 09:45 +0100, Ingo Molnar wrote: > > > But, instead of trying to improve those aspects of our existing instrumentation > > > frameworks, mm/* is gradually growing its own special instrumentation hacks, missing > > > the big picture and fragmenting the instrumentation space some more. > > > > > > That trend is somewhat sad. > > > > Go any handy examples of how you'd like to see these done? > > There's a very, very old branch in tip:tracing/mm (by Steve) that shows off some of > the concepts that could be introduced, to 'dump' current MM state via an extension > to the tracepoints APIs: > > 3383e37ea796: tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events > c33b3596bc38: tracing, page-allocator: Add trace event for page traffic related to the buddy lists > 0d524fb734bc: tracing, mm: Add trace events for anti-fragmentation falling back to other migratetypes > b9a28177eedf: tracing, page-allocator: Add trace events for page allocation and page freeing > 807243eb20b2: Merge branch 'perfcounters/urgent' into tracing/mm > 08b6cb88eeb5: perf_counter tools: Provide default bfd_demangle() function in case it's not around > eb4671011887: tracing/mm: rename 'trigger' file to 'dump_range' > 1487a7a1ff99: tracing/mm: fix mapcount trace record field > dcac8cdac1d4: tracing/mm: add page frame snapshot trace > > That's just a demo in essence - 
showing what things could be done in this area.
>
> You can pick those commits up via running:
>
> http://people.redhat.com/mingo/tip.git/README

Ingo, sorry for remaining silent on this topic. I'm actually updating the code. Below is the most up-to-date version on top of 2.6.37. I can post a patchset for review after I have finished with the writeback patches.

Thanks,
Fengguang
---
Usage:

root@bay /home/wfg# echo / > /debug/tracing/objects/mm/pages/dump-fs
root@bay /home/wfg# cat /debug/tracing/trace
# The output has intermixed lines for inode and page
#      ino       size     cached   age(ms) dirty type first-opened-by file-name
   1507329       4096       8192    309042  ____  DIR         swapper /
#        index    len page-flags count mapcount
             0      2 ____RU_____     1        0
   1786836      12288      40960    309026  ____  DIR         swapper /sbin
             0     10 ___ARU_____     1        0
   1786946      37312      40960    309024  ____  REG         swapper /sbin/init
             0      6 M__ARU_____     2        1
             6      1 M__A_U_____     2        1
             7      1 M__ARU_____     2        1
             8      2 _____U_____     1        0
   1507464          4       4096    309022  ____  LNK         swapper /lib64
             0      1 ___ARU_____     1        0
   1590173      12288          0    309021  ____  DIR         swapper /lib
   4563326         12       4096    309020  ____  LNK         swapper /lib/ld-linux-x86-64.so.2
             0      1 ___ARU_____     1        0
   4563295     128744     131072    309019  ____  REG         swapper /lib/ld-2.11.2.so
             0      1 M__ARU_____    21       20
             1      3 M__ARU_____    17       16
             4      4 M__ARU_____    20       19
             8      2 M__ARU_____    27       26
            10      3 M__ARU_____    20       19
            13      1 M__ARU_____    27       26
            14      1 M__ARU_____    26       25
            15      1 M__ARU_____    20       19
            16      1 M__ARU_____    18       17
            17      1 M__ARU_____     9        8
            18      1 M__A_U_____     4        3
            19      1 M__ARU_____    27       26
            20      1 M__ARU_____    17       16
            21      1 M__ARU_____    20       19
            22      1 M__ARU_____    27       26
            23      1 M__ARU_____    20       19
            24      1 M__ARU_____    26       25
            25      1 _____U_____     1        0
            26      1 M__A_U_____     4        3
            27      1 M__ARU_____    20       19
            28      4 _____U_____     1        0
   1525477      12288          0    309011  ____  DIR            init /etc
   1526463      64634      65536    309009  ____  REG            init /etc/ld.so.cache
             0      1 ___ARU_____     1        0
             1      1 _____U_____     1        0

--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ mmotm/include/trace/events/mm.h 2010-12-26 20:59:48.000000000 +0800
@@ -0,0 +1,164 @@
+#if !defined(_TRACE_MM_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_MM_H + +#include <linux/tracepoint.h> +#include <linux/page-flags.h> +#include <linux/memcontrol.h> +#include <linux/pagemap.h> +#include <linux/mm.h> +#include <linux/kernel-page-flags.h> + +#undef TRACE_SYSTEM +#define TRACE_SYSTEM mm + +extern struct trace_print_flags pageflag_names[]; + +/** + * dump_page_frame - called by the trace page dump trigger + * @pfn: page frame number + * @page: pointer to the page frame + * + * This is a helper trace point into the dumping of the page frames. + * It will record various infromation about a page frame. + */ +TRACE_EVENT(dump_page_frame, + + TP_PROTO(unsigned long pfn, struct page *page), + + TP_ARGS(pfn, page), + + TP_STRUCT__entry( + __field( unsigned long, pfn ) + __field( struct page *, page ) + __field( u64, stable_flags ) + __field( unsigned long, flags ) + __field( unsigned int, count ) + __field( unsigned int, mapcount ) + __field( unsigned long, private ) + __field( unsigned long, mapping ) + __field( unsigned long, index ) + ), + + TP_fast_assign( + __entry->pfn = pfn; + __entry->page = page; + __entry->stable_flags = stable_page_flags(page); + __entry->flags = page->flags; + __entry->count = atomic_read(&page->_count); + __entry->mapcount = page_mapcount(page); + __entry->private = page->private; + __entry->mapping = (unsigned long)page->mapping; + __entry->index = page->index; + ), + + TP_printk("%12lx %16p %8x %8x %16lx %16lx %16lx %s", + __entry->pfn, + __entry->page, + __entry->count, + __entry->mapcount, + __entry->private, + __entry->mapping, + __entry->index, + ftrace_print_flags_seq(p, "|", + __entry->flags & PAGE_FLAGS_MASK, + pageflag_names) + ) +); + +TRACE_EVENT(dump_page_cache, + + TP_PROTO(struct page *page, unsigned long len), + + TP_ARGS(page, len), + + TP_STRUCT__entry( + __field( unsigned long, index ) + __field( unsigned long, len ) + __field( u64, flags ) + __field( unsigned int, count ) + __field( unsigned int, mapcount ) + ), + + TP_fast_assign( + __entry->index = 
page->index; + __entry->len = len; + __entry->flags = stable_page_flags(page); + __entry->count = atomic_read(&page->_count); + __entry->mapcount = page_mapcount(page); + ), + + TP_printk("%12lu %6lu %c%c%c%c%c%c%c%c%c%c%c %4u %4u", + __entry->index, + __entry->len, + __entry->flags & (1ULL << KPF_MMAP) ? 'M' : '_', + __entry->flags & (1ULL << KPF_MLOCKED) ? 'm' : '_', + __entry->flags & (1ULL << KPF_UNEVICTABLE) ? 'u' : '_', + __entry->flags & (1ULL << KPF_ACTIVE) ? 'A' : '_', + __entry->flags & (1ULL << KPF_REFERENCED) ? 'R' : '_', + __entry->flags & (1ULL << KPF_UPTODATE) ? 'U' : '_', + __entry->flags & (1ULL << KPF_DIRTY) ? 'D' : '_', + __entry->flags & (1ULL << KPF_WRITEBACK) ? 'W' : '_', + __entry->flags & (1ULL << KPF_RECLAIM) ? 'I' : '_', + __entry->flags & (1ULL << KPF_MAPPEDTODISK) ? 'd' : '_', + __entry->flags & (1ULL << KPF_PRIVATE) ? 'P' : '_', + __entry->count, + __entry->mapcount) +); + + +#define show_inode_type(val) __print_symbolic(val, \ + { S_IFREG, "REG" }, \ + { S_IFDIR, "DIR" }, \ + { S_IFLNK, "LNK" }, \ + { S_IFBLK, "BLK" }, \ + { S_IFCHR, "CHR" }, \ + { S_IFIFO, "FIFO" }, \ + { S_IFSOCK, "SOCK" }) + +TRACE_EVENT(dump_inode_cache, + + TP_PROTO(struct inode *inode, char *name, int len), + + TP_ARGS(inode, name, len), + + TP_STRUCT__entry( + __field( unsigned long, ino ) + __field( loff_t, size ) /* bytes */ + __field( loff_t, cached ) /* bytes */ + __field( unsigned long, age ) /* ms */ + __field( unsigned long, state ) + __field( umode_t, mode ) + __array( char, comm, TASK_COMM_LEN) + __dynamic_array(char, file, len ) + ), + + TP_fast_assign( + __entry->ino = inode->i_ino; + __entry->size = i_size_read(inode); + __entry->cached = inode->i_mapping->nrpages; + __entry->cached <<= PAGE_CACHE_SHIFT; + __entry->age = (jiffies - inode->dirtied_when) * 1000 / HZ; + __entry->state = inode->i_state; + __entry->mode = inode->i_mode; + memcpy(__entry->comm, inode->i_comm, TASK_COMM_LEN); + memcpy(__get_str(file), name, len); + ), + + TP_printk("%12lu 
%12llu %12llu %12lu %c%c%c%c %4s %16s %s",
+		  __entry->ino,
+		  __entry->size,
+		  __entry->cached,
+		  __entry->age,
+		  __entry->state & I_DIRTY_PAGES ? 'D' : '_',
+		  __entry->state & I_DIRTY_DATASYNC ? 'd' : '_',
+		  __entry->state & I_DIRTY_SYNC ? 'm' : '_',
+		  __entry->state & I_SYNC ? 'S' : '_',
+		  show_inode_type(__entry->mode & S_IFMT),
+		  __entry->comm,
+		  __get_str(file))
+);
+
+#endif /* _TRACE_MM_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
--- mmotm.orig/kernel/trace/Makefile	2010-12-26 20:58:46.000000000 +0800
+++ mmotm/kernel/trace/Makefile	2010-12-26 20:59:41.000000000 +0800
@@ -26,6 +26,7 @@ obj-$(CONFIG_RING_BUFFER) += ring_buffer
 obj-$(CONFIG_RING_BUFFER_BENCHMARK) += ring_buffer_benchmark.o
 
 obj-$(CONFIG_TRACING) += trace.o
+obj-$(CONFIG_TRACING) += trace_objects.o
 obj-$(CONFIG_TRACING) += trace_output.o
 obj-$(CONFIG_TRACING) += trace_stat.o
 obj-$(CONFIG_TRACING) += trace_printk.o
@@ -53,6 +54,7 @@ endif
 obj-$(CONFIG_EVENT_TRACING) += trace_events_filter.o
 obj-$(CONFIG_KPROBE_EVENT) += trace_kprobe.o
 obj-$(CONFIG_EVENT_TRACING) += power-traces.o
+obj-$(CONFIG_EVENT_TRACING) += trace_mm.o
 ifeq ($(CONFIG_TRACING),y)
 obj-$(CONFIG_KGDB_KDB) += trace_kdb.o
 endif
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ mmotm/kernel/trace/trace_mm.c	2010-12-26 20:59:41.000000000 +0800
@@ -0,0 +1,367 @@
+/*
+ * Trace mm pages
+ *
+ * Copyright (C) 2009 Red Hat Inc, Steven Rostedt <srostedt@redhat.com>
+ *
+ * Code based on Matt Mackall's /proc/[kpagecount|kpageflags] code.
+ */
+#include <linux/module.h>
+#include <linux/bootmem.h>
+#include <linux/debugfs.h>
+#include <linux/uaccess.h>
+#include <linux/ctype.h>
+#include <linux/pagevec.h>
+#include <linux/writeback.h>
+#include <linux/file.h>
+#include <linux/slab.h>
+
+#include "trace_output.h"
+
+#define CREATE_TRACE_POINTS
+#include <trace/events/mm.h>
+
+void trace_mm_page_frames(unsigned long start, unsigned long end,
+		  void (*trace)(unsigned long pfn, struct page *page))
+{
+	unsigned long pfn = start;
+	struct page *page;
+
+	if (start > max_pfn - 1)
+		return;
+
+	if (end > max_pfn)
+		end = max_pfn;
+
+	while (pfn < end) {
+		page = NULL;
+		if (pfn_valid(pfn))
+			page = pfn_to_page(pfn);
+		pfn++;
+		if (page)
+			trace(pfn, page);
+	}
+}
+
+static void trace_mm_page_frame(unsigned long pfn, struct page *page)
+{
+	trace_dump_page_frame(pfn, page);
+}
+
+static ssize_t
+trace_mm_pfn_range_read(struct file *filp, char __user *ubuf, size_t cnt,
+			loff_t *ppos)
+{
+	return simple_read_from_buffer(ubuf, cnt, ppos, "0\n", 2);
+}
+
+
+/*
+ * recognized formats:
+ * 		"M N"	start=M, end=N
+ * 		"M"	start=M, end=M+1
+ * 		"M +N"	start=M, end=M+N-1
+ */
+static ssize_t
+trace_mm_pfn_range_write(struct file *filp, const char __user *ubuf, size_t cnt,
+			 loff_t *ppos)
+{
+	unsigned long start;
+	unsigned long end = 0;
+	char buf[64];
+	char *ptr;
+
+	if (cnt >= sizeof(buf))
+		return -EINVAL;
+
+	if (copy_from_user(&buf, ubuf, cnt))
+		return -EFAULT;
+
+	if (tracing_update_buffers() < 0)
+		return -ENOMEM;
+
+	if (trace_set_clr_event("mm", "dump_page_frame", 1))
+		return -EINVAL;
+
+	buf[cnt] = 0;
+
+	start = simple_strtoul(buf, &ptr, 0);
+
+	for (; *ptr; ptr++) {
+		if (isdigit(*ptr)) {
+			if (*(ptr - 1) == '+')
+				end = start;
+			end += simple_strtoul(ptr, NULL, 0);
+			break;
+		}
+	}
+	if (!*ptr)
+		end = start + 1;
+
+	trace_mm_page_frames(start, end, trace_mm_page_frame);
+
+	return cnt;
+}
+
+static const struct file_operations trace_mm_fops = {
+	.open		= tracing_open_generic,
+	.read		= trace_mm_pfn_range_read,
+	.write		= trace_mm_pfn_range_write,
+};
+
+static struct dentry *trace_objects_mm_dir(void)
+{
+	static struct dentry *d_mm;
+	struct dentry *d_objects;
+
+	if (d_mm)
+		return d_mm;
+
+	d_objects = trace_objects_dir();
+	if (!d_objects)
+		return NULL;
+
+	d_mm = debugfs_create_dir("mm", d_objects);
+	if (!d_mm)
+		pr_warning("Could not create 'objects/mm' directory\n");
+
+	return d_mm;
+}
+
+static unsigned long page_flags(struct page *page)
+{
+	return page->flags & ((1 << NR_PAGEFLAGS) - 1);
+}
+
+static int pages_similar(struct page *page0, struct page *page)
+{
+	if (page_flags(page0) != page_flags(page))
+		return 0;
+
+	if (page_count(page0) != page_count(page))
+		return 0;
+
+	if (page_mapcount(page0) != page_mapcount(page))
+		return 0;
+
+	return 1;
+}
+
+static void dump_pagecache(struct address_space *mapping)
+{
+	unsigned long nr_pages;
+	struct page *pages[PAGEVEC_SIZE];
+	struct page *uninitialized_var(page0);
+	struct page *page;
+	unsigned long start = 0;
+	unsigned long len = 0;
+	int i;
+
+	for (;;) {
+		rcu_read_lock();
+		nr_pages = radix_tree_gang_lookup(&mapping->page_tree,
+				(void **)pages, start + len, PAGEVEC_SIZE);
+		rcu_read_unlock();
+
+		if (nr_pages == 0) {
+			if (len)
+				trace_dump_page_cache(page0, len);
+			return;
+		}
+
+		for (i = 0; i < nr_pages; i++) {
+			page = pages[i];
+
+			if (len &&
+			    page->index == start + len &&
+			    pages_similar(page0, page))
+				len++;
+			else {
+				if (len)
+					trace_dump_page_cache(page0, len);
+				page0 = page;
+				start = page->index;
+				len = 1;
+			}
+		}
+		cond_resched();
+	}
+}
+
+static void dump_inode_cache(struct inode *inode,
+			     char *name_buf,
+			     struct vfsmount *mnt)
+{
+	struct path path = {
+		.mnt = mnt,
+		.dentry = d_find_alias(inode)
+	};
+	char *name;
+	int len;
+
+	if (!mnt) {
+		trace_dump_inode_cache(inode, name_buf, strlen(name_buf));
+		return;
+	}
+
+	if (!path.dentry) {
+		trace_dump_inode_cache(inode, "", 1);
+		return;
+	}
+
+	name = d_path(&path, name_buf, PAGE_SIZE);
+	if (IS_ERR(name)) {
+		name = "";
+		len = 1;
+	} else
+		len = PAGE_SIZE + name_buf - name;
+
+	trace_dump_inode_cache(inode, name, len);
+
+	if (path.dentry)
+		dput(path.dentry);
+}
+
+static void dump_fs_pagecache(struct super_block *sb, struct vfsmount *mnt)
+{
+	struct inode *inode;
+	struct inode *prev_inode = NULL;
+	char *name_buf;
+
+	name_buf = (char *)__get_free_page(GFP_TEMPORARY);
+	if (!name_buf)
+		return;
+
+	down_read(&sb->s_umount);
+	if (!sb->s_root)
+		goto out;
+
+	spin_lock(&inode_lock);
+	list_for_each_entry_reverse(inode, &sb->s_inodes, i_sb_list) {
+		if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW))
+			continue;
+		__iget(inode);
+		spin_unlock(&inode_lock);
+		dump_inode_cache(inode, name_buf, mnt);
+		if (inode->i_mapping->nrpages)
+			dump_pagecache(inode->i_mapping);
+		iput(prev_inode);
+		prev_inode = inode;
+		cond_resched();
+		spin_lock(&inode_lock);
+	}
+	spin_unlock(&inode_lock);
+	iput(prev_inode);
+out:
+	up_read(&sb->s_umount);
+	free_page((unsigned long)name_buf);
+}
+
+static ssize_t
+trace_pagecache_write(struct file *filp, const char __user *ubuf, size_t count,
+		      loff_t *ppos)
+{
+	struct file *file = NULL;
+	char *name;
+	int err = 0;
+
+	if (count <= 1)
+		return -EINVAL;
+	if (count >= PAGE_SIZE)
+		return -ENAMETOOLONG;
+
+	name = kmalloc(PAGE_SIZE, GFP_KERNEL);
+	if (!name)
+		return -ENOMEM;
+
+	if (copy_from_user(name, ubuf, count)) {
+		err = -EFAULT;
+		goto out;
+	}
+
+	/* strip the newline added by `echo` */
+	if (name[count-1] == '\n')
+		name[count-1] = '\0';
+	else
+		name[count] = '\0';
+
+	file = filp_open(name, O_RDONLY|O_LARGEFILE, 0);
+	if (IS_ERR(file)) {
+		err = PTR_ERR(file);
+		file = NULL;
+		goto out;
+	}
+
+	if (tracing_update_buffers() < 0) {
+		err = -ENOMEM;
+		goto out;
+	}
+	if (trace_set_clr_event("mm", "dump_page_cache", 1)) {
+		err = -EINVAL;
+		goto out;
+	}
+	if (trace_set_clr_event("mm", "dump_inode_cache", 1)) {
+		err = -EINVAL;
+		goto out;
+	}
+
+	if (filp->f_path.dentry->d_inode->i_private) {
+		dump_fs_pagecache(file->f_path.dentry->d_sb, file->f_path.mnt);
+	} else {
+		dump_inode_cache(file->f_mapping->host, name, NULL);
+		dump_pagecache(file->f_mapping);
+	}
+
+out:
+	if (file)
+		fput(file);
+	kfree(name);
+
+	return err ? err : count;
+}
+
+static const struct file_operations trace_pagecache_fops = {
+	.open		= tracing_open_generic,
+	.read		= trace_mm_pfn_range_read,
+	.write		= trace_pagecache_write,
+};
+
+static struct dentry *trace_objects_mm_pages_dir(void)
+{
+	static struct dentry *d_pages;
+	struct dentry *d_mm;
+
+	if (d_pages)
+		return d_pages;
+
+	d_mm = trace_objects_mm_dir();
+	if (!d_mm)
+		return NULL;
+
+	d_pages = debugfs_create_dir("pages", d_mm);
+	if (!d_pages)
+		pr_warning("Could not create debugfs "
+			   "'objects/mm/pages' directory\n");
+
+	return d_pages;
+}
+
+static __init int trace_objects_mm_init(void)
+{
+	struct dentry *d_pages;
+
+	d_pages = trace_objects_mm_pages_dir();
+	if (!d_pages)
+		return 0;
+
+	trace_create_file("dump-pfn", 0600, d_pages, NULL,
+			  &trace_mm_fops);
+
+	trace_create_file("dump-file", 0600, d_pages, NULL,
+			  &trace_pagecache_fops);
+
+	trace_create_file("dump-fs", 0600, d_pages, (void *)1,
+			  &trace_pagecache_fops);
+
+	return 0;
+}
+fs_initcall(trace_objects_mm_init);
--- mmotm.orig/kernel/trace/trace.h	2010-12-26 20:58:46.000000000 +0800
+++ mmotm/kernel/trace/trace.h	2010-12-26 20:59:41.000000000 +0800
@@ -295,6 +295,7 @@ struct dentry *trace_create_file(const c
 				 const struct file_operations *fops);
 
 struct dentry *tracing_init_dentry(void);
+struct dentry *trace_objects_dir(void);
 
 struct ring_buffer_event;
 
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ mmotm/kernel/trace/trace_objects.c	2010-12-26 20:59:41.000000000 +0800
@@ -0,0 +1,26 @@
+#include <linux/debugfs.h>
+
+#include "trace.h"
+#include "trace_output.h"
+
+struct dentry *trace_objects_dir(void)
+{
+	static struct dentry *d_objects;
+	struct dentry *d_tracer;
+
+	if (d_objects)
+		return d_objects;
+
+	d_tracer = tracing_init_dentry();
+	if (!d_tracer)
+		return NULL;
+
+	d_objects = debugfs_create_dir("objects", d_tracer);
+	if (!d_objects)
+		pr_warning("Could not create debugfs "
+			   "'objects' directory\n");
+
+	return d_objects;
+}
+
+
--- mmotm.orig/mm/page_alloc.c	2010-12-26 20:58:46.000000000 +0800
+++ mmotm/mm/page_alloc.c	2010-12-26 20:59:41.000000000 +0800
@@ -5493,7 +5493,7 @@ bool is_free_buddy_page(struct page *pag
 }
 #endif
 
-static struct trace_print_flags pageflag_names[] = {
+struct trace_print_flags pageflag_names[] = {
 	{1UL << PG_locked,		"locked"	},
 	{1UL << PG_error,		"error"		},
 	{1UL << PG_referenced,		"referenced"	},
@@ -5541,7 +5541,7 @@ static void dump_page_flags(unsigned lon
 	printk(KERN_ALERT "page flags: %#lx(", flags);
 
 	/* remove zone id */
-	flags &= (1UL << NR_PAGEFLAGS) - 1;
+	flags &= PAGE_FLAGS_MASK;
 
 	for (i = 0; pageflag_names[i].name && flags; i++) {
 
--- mmotm.orig/include/linux/page-flags.h	2010-12-26 20:58:46.000000000 +0800
+++ mmotm/include/linux/page-flags.h	2010-12-26 20:59:41.000000000 +0800
@@ -414,6 +414,7 @@ static inline void __ClearPageTail(struc
  * there has been a kernel bug or struct page corruption.
  */
 #define PAGE_FLAGS_CHECK_AT_PREP	((1 << NR_PAGEFLAGS) - 1)
+#define PAGE_FLAGS_MASK			((1 << NR_PAGEFLAGS) - 1)
 
 #define PAGE_FLAGS_PRIVATE				\
 	(1 << PG_private | 1 << PG_private_2)
--- mmotm.orig/fs/inode.c	2010-12-26 20:58:45.000000000 +0800
+++ mmotm/fs/inode.c	2010-12-26 21:00:09.000000000 +0800
@@ -182,7 +182,13 @@ int inode_init_always(struct super_block
 	inode->i_bdev = NULL;
 	inode->i_cdev = NULL;
 	inode->i_rdev = 0;
-	inode->dirtied_when = 0;
+
+	/*
+	 * This records inode load time. It will be invalidated once inode is
+	 * dirtied, or jiffies wraps around. Despite the pitfalls it still
+	 * provides useful information for some use cases like fastboot.
+	 */
+	inode->dirtied_when = jiffies;
 
 	if (security_inode_alloc(inode))
 		goto out;
@@ -226,6 +232,9 @@ int inode_init_always(struct super_block
 
 	percpu_counter_inc(&nr_inodes);
 
+	BUILD_BUG_ON(sizeof(inode->i_comm) != TASK_COMM_LEN);
+	memcpy(inode->i_comm, current->comm, TASK_COMM_LEN);
+
 	return 0;
 out:
 	return -ENOMEM;
--- mmotm.orig/include/linux/fs.h	2010-12-26 20:59:50.000000000 +0800
+++ mmotm/include/linux/fs.h	2010-12-26 21:00:09.000000000 +0800
@@ -800,6 +800,8 @@ struct inode {
 	struct posix_acl	*i_default_acl;
 #endif
 	void			*i_private; /* fs or device private pointer */
+
+	char			i_comm[16]; /* first opened by */
 };
 
 static inline int inode_unhashed(struct inode *inode)

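The range syntax accepted by the dump-pfn file in the patch above can be tried out in userspace. What follows is only a sketch: `parse_pfn_range` and `struct pfn_range` are hypothetical names that do not appear in the patch, and `strtoul` stands in for the kernel's `simple_strtoul`. The loop mirrors the parsing in `trace_mm_pfn_range_write()` as posted. Note that for "M +N" the posted code sets end to M+N and `trace_mm_page_frames()` loops while pfn < end, so the end is exclusive and the last frame visited is M+N-1.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* Hypothetical userspace type, not part of the patch: [start, end) in pfns. */
struct pfn_range {
	unsigned long start;
	unsigned long end;	/* exclusive, as in trace_mm_page_frames() */
};

/*
 * Mirror of the parsing loop in trace_mm_pfn_range_write():
 *   "M N"   start=M, end=N
 *   "M"     start=M, end=M+1
 *   "M +N"  start=M, end=M+N
 */
struct pfn_range parse_pfn_range(const char *buf)
{
	struct pfn_range r;
	char *ptr;

	assert(buf != NULL);

	r.start = strtoul(buf, &ptr, 0);
	r.end = 0;

	/* Look for the first digit after the start value. */
	for (; *ptr; ptr++) {
		if (isdigit((unsigned char)*ptr)) {
			/* A leading '+' makes the second number a length. */
			if (*(ptr - 1) == '+')
				r.end = r.start;
			r.end += strtoul(ptr, NULL, 0);
			break;
		}
	}

	/* No second number: dump exactly one page frame. */
	if (!*ptr)
		r.end = r.start + 1;

	return r;
}
```

On a kernel carrying this patch, the same strings would presumably be written to the dump-pfn file under the tracing debugfs directory (objects/mm/pages/), with the resulting dump_page_frame events read back from the trace buffer.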
* Re: [RFC PATCH 4/5] mm: Add hit/miss accounting for Page Cache
  2011-03-02  8:45 ` Ingo Molnar
  2011-03-02 17:02   ` Dave Hansen
@ 2011-03-03  2:01   ` KOSAKI Motohiro
  2011-03-03  3:14   ` Tao Ma
  2 siblings, 0 replies; 15+ messages in thread
From: KOSAKI Motohiro @ 2011-03-03 2:01 UTC
To: Ingo Molnar
Cc: kosaki.motohiro, Liu Yuan, linux-kernel, linux-mm, jaxboe, akpm,
	fengguang.wu, Peter Zijlstra, Frédéric Weisbecker, Steven Rostedt,
	Thomas Gleixner, Arnaldo Carvalho de Melo

> 
> * Liu Yuan <namei.unix@gmail.com> wrote:
> 
> > +		if (likely(!retry_find) && page && PageUptodate(page))
> > +			page_cache_acct_hit(inode->i_sb, READ);
> > +		else
> > +			page_cache_acct_missed(inode->i_sb, READ);
> 
> Sigh.
> 
> This would make such a nice tracepoint or sw perf event. It could be collected in a 
> 'count' form, equivalent to the stats you are aiming for here, or it could even be 
> traced, if someone is interested in such details.
> 
> It could be mixed with other events, enriching multiple apps at once.

Totally agreed.

> But, instead of trying to improve those aspects of our existing instrumentation 
> frameworks, mm/* is gradually growing its own special instrumentation hacks, missing 
> the big picture and fragmenting the instrumentation space some more.
> 
> That trend is somewhat sad.

So I think the question is how many people use the stat and how frequently.
If it is really, really common, /proc/meminfo or a similar special place is
a good idea. As another example, if the stat helps our MM debugging when
handling LKML bug reports, it is worth special care.

But almost all other cases are better served by the generic framework.

* Re: [RFC PATCH 4/5] mm: Add hit/miss accounting for Page Cache
  2011-03-02  8:45 ` Ingo Molnar
  2011-03-02 17:02   ` Dave Hansen
  2011-03-03  2:01   ` KOSAKI Motohiro
@ 2011-03-03  3:14   ` Tao Ma
  2011-03-03  9:34     ` Ingo Molnar
  2 siblings, 1 reply; 15+ messages in thread
From: Tao Ma @ 2011-03-03 3:14 UTC
To: Ingo Molnar
Cc: Liu Yuan, linux-kernel, linux-mm, jaxboe, akpm, fengguang.wu,
	Peter Zijlstra, Frédéric Weisbecker, Steven Rostedt,
	Thomas Gleixner, Arnaldo Carvalho de Melo

On 03/02/2011 04:45 PM, Ingo Molnar wrote:
> * Liu Yuan <namei.unix@gmail.com> wrote:
>
>> +		if (likely(!retry_find) && page && PageUptodate(page))
>> +			page_cache_acct_hit(inode->i_sb, READ);
>> +		else
>> +			page_cache_acct_missed(inode->i_sb, READ);
>>
> Sigh.
>
> This would make such a nice tracepoint or sw perf event. It could be collected in a
> 'count' form, equivalent to the stats you are aiming for here, or it could even be
> traced, if someone is interested in such details.
>
> It could be mixed with other events, enriching multiple apps at once.
>
> But, instead of trying to improve those aspects of our existing instrumentation
> frameworks, mm/* is gradually growing its own special instrumentation hacks, missing
> the big picture and fragmenting the instrumentation space some more.
>
Thanks for the quick response. Actually our team (including Liu) here is
planning to add some debug info to the mm parts for analyzing application
behavior, and we hope to find some way to improve our application's
performance.
We have searched the trace points in mm, but it seems to us that trace
points aren't quite welcome there. Only vmscan and writeback have some
limited trace points added. That's the reason we first tried to add some
debug info like this patch. You do shed some light on our direction. Thanks.

btw, which parts do you think need trace points added? We volunteer to add
more if you like.

Regards,
Tao

* Re: [RFC PATCH 4/5] mm: Add hit/miss accounting for Page Cache
  2011-03-03  3:14 ` Tao Ma
@ 2011-03-03  9:34   ` Ingo Molnar
  2011-03-03 15:08     ` Tao Ma
  0 siblings, 1 reply; 15+ messages in thread
From: Ingo Molnar @ 2011-03-03 9:34 UTC
To: Tao Ma
Cc: Liu Yuan, linux-kernel, linux-mm, jaxboe, akpm, fengguang.wu,
	Peter Zijlstra, Frédéric Weisbecker, Steven Rostedt,
	Thomas Gleixner, Arnaldo Carvalho de Melo, Tom Zanussi

* Tao Ma <tm@tao.ma> wrote:

> On 03/02/2011 04:45 PM, Ingo Molnar wrote:
> >* Liu Yuan <namei.unix@gmail.com> wrote:
> >
> >>+		if (likely(!retry_find) && page && PageUptodate(page))
> >>+			page_cache_acct_hit(inode->i_sb, READ);
> >>+		else
> >>+			page_cache_acct_missed(inode->i_sb, READ);
> >Sigh.
> >
> >This would make such a nice tracepoint or sw perf event. It could be collected in a
> >'count' form, equivalent to the stats you are aiming for here, or it could even be
> >traced, if someone is interested in such details.
> >
> >It could be mixed with other events, enriching multiple apps at once.
> >
> >But, instead of trying to improve those aspects of our existing instrumentation
> >frameworks, mm/* is gradually growing its own special instrumentation hacks, missing
> >the big picture and fragmenting the instrumentation space some more.
> Thanks for the quick response. Actually our team(including Liu) here are planing 
> to add some debug info to the mm parts for analyzing the application behavior and 
> hope to find some way to improve our application's performance. We have searched 
> the trace points in mm, but it seems to us that the trace points isn't quite 
> welcomed there. Only vmscan and writeback have some limited trace points added. 
> That's the reason we first tried to add some debug info like this patch. You does 
> shed some light on our direction. Thanks.

Yes, it's very much a 'critical mass' phenomenon: the moment there's enough 
tracepoints, above some magic limit, things happen quickly and everyone finds the 
stuff obviously useful.

Before that limit it's all pretty painful.

> btw, what part do you think is needed to add some trace point? We
> volunteer to add more if you like.

Whatever part you find useful in your daily development work!

Tracepoints are pretty flexible. The bit that is missing and which is very important 
for the MM is the collapse into 'summaries' and the avoidance of tracing overhead 
when only a summary is wanted. Please see Wu Fengguang's reply in this thread about 
the 'dump state' facility he and Steve added to recover large statistics.

I suspect the hit/miss histogram you are building in this patch could be recovered 
via that facility initially?

The next step would generalize that approach - it is non-trivial but powerful :-)

The idea is to allow non-trivial histograms and summaries to be built out of simple 
events, via the filter engine.

It would require an extension of tracing to really allow a filter expression to be 
defined over existing events, which would allow the maintenance of a persistent 
'sum' variable - probably within the perf ring-buffer. We already have filter 
support, that would have to be extended with a notion of 'persistent variables'.

So right now, if you define a tracepoint in that spot, we already support such 
filter expressions:

	'bdev == sda1 && page_state == PageUptodate'

You can inject such filter expressions into /debug/tracing/events/*/*/filter today, 
and you can use filters in perf record --filter '...' as well.

To implement 'fast statistics', the filter engine would have to be extended to 
support (simple) statements like:

	if (bdev == sda1 && page_state == PageUptodate)
		var0++;

And:

	if (bdev == sda1 && page_state != PageUptodate)
		var1++;

Only a very minimal type of C syntax would be supported - not a full C parser.

That way the 'var0' portion of the perf ring-buffer (which would not be part of the 
regular, overwritten ring-buffer) would act as a 'hits' variable that you could 
recover. The 'var1' portion would be the 'misses' counter.

Individual trace events would only twiddle var0 and var1 - they would not inject a 
full-blown event into the ring-buffer, so statistics would be very fast.

This method is very extensible and could be used for far more things than just MM 
statistics. In theory all of /proc statistics collection could be replaced and made 
optional that way, just by adding the right events to the right spots in the kernel. 
That is obviously a very long-term project.

Thanks,

	Ingo

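The 'persistent variable' idea in the message above can be modelled in plain userspace C. This is only a sketch of the proposal's semantics, not an existing kernel or perf interface: `struct pc_event`, `struct pc_counters`, and `pc_account` are hypothetical names, and the two fields stand in for the `bdev` and `page_state` fields the message assumes a tracepoint at that spot would export. Each event bumps at most one counter and nothing is written to a ring-buffer, which is the property that would make such statistics cheap.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical event payload: fields a page cache tracepoint might export. */
struct pc_event {
	const char *bdev;	/* e.g. "sda1" */
	bool uptodate;		/* models page_state == PageUptodate */
};

/* Hypothetical persistent variables: var0 counts hits, var1 counts misses. */
struct pc_counters {
	unsigned long var0;
	unsigned long var1;
};

/*
 * Models the two proposed filter statements:
 *   if (bdev == <bdev> && page_state == PageUptodate) var0++;
 *   if (bdev == <bdev> && page_state != PageUptodate) var1++;
 * Events for other devices are ignored and no trace record is produced.
 */
void pc_account(struct pc_counters *c, const struct pc_event *ev,
		const char *bdev)
{
	assert(c != NULL && ev != NULL && ev->bdev != NULL && bdev != NULL);

	if (strcmp(ev->bdev, bdev) != 0)
		return;

	if (ev->uptodate)
		c->var0++;
	else
		c->var1++;
}
```

Feeding a stream of events through `pc_account()` leaves the hit count in `var0` and the miss count in `var1`, the same pair of numbers that the page_cache_acct_hit()/page_cache_acct_missed() calls in patch 4/5 would otherwise maintain.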
* Re: [RFC PATCH 4/5] mm: Add hit/miss accounting for Page Cache
  2011-03-03  9:34 ` Ingo Molnar
@ 2011-03-03 15:08   ` Tao Ma
  0 siblings, 0 replies; 15+ messages in thread
From: Tao Ma @ 2011-03-03 15:08 UTC
To: Ingo Molnar
Cc: Liu Yuan, linux-kernel, linux-mm, jaxboe, akpm, fengguang.wu,
	Peter Zijlstra, Frédéric Weisbecker, Steven Rostedt,
	Thomas Gleixner, Arnaldo Carvalho de Melo, Tom Zanussi

On 03/03/2011 05:34 PM, Ingo Molnar wrote:
> * Tao Ma <tm@tao.ma> wrote:
>
>> On 03/02/2011 04:45 PM, Ingo Molnar wrote:
>>> * Liu Yuan <namei.unix@gmail.com> wrote:
>>>
>>>> +		if (likely(!retry_find) && page && PageUptodate(page))
>>>> +			page_cache_acct_hit(inode->i_sb, READ);
>>>> +		else
>>>> +			page_cache_acct_missed(inode->i_sb, READ);
>>> Sigh.
>>>
>>> This would make such a nice tracepoint or sw perf event. It could be collected in a
>>> 'count' form, equivalent to the stats you are aiming for here, or it could even be
>>> traced, if someone is interested in such details.
>>>
>>> It could be mixed with other events, enriching multiple apps at once.
>>>
>>> But, instead of trying to improve those aspects of our existing instrumentation
>>> frameworks, mm/* is gradually growing its own special instrumentation hacks, missing
>>> the big picture and fragmenting the instrumentation space some more.
>> Thanks for the quick response. Actually our team(including Liu) here are planing
>> to add some debug info to the mm parts for analyzing the application behavior and
>> hope to find some way to improve our application's performance. We have searched
>> the trace points in mm, but it seems to us that the trace points isn't quite
>> welcomed there. Only vmscan and writeback have some limited trace points added.
>> That's the reason we first tried to add some debug info like this patch. You does
>> shed some light on our direction. Thanks.
> Yes, it's very much a 'critical mass' phenomenon: the moment there's enough
> tracepoints, above some magic limit, things happen quickly and everyone finds the
> stuff obviously useful.
>
> Before that limit it's all pretty painful.
yeah.
>> btw, what part do you think is needed to add some trace point? We
>> volunteer to add more if you like.
> Whatever part you find useful in your daily development work!
>
> Tracepoints are pretty flexible. The bit that is missing and which is very important
> for the MM is the collapse into 'summaries' and the avoidance of tracing overhead
> when only a summary is wanted. Please see Wu Fengguang's reply in this thread about
> the 'dump state' facility he and Steve added to recover large statistics.
We are looking into it now. Thanks for the hint.
> I suspect the hit/miss histogram you are building in this patch could be recovered
> via that facility initially?
>
> The next step would generalize that approach - it is non-trivial but powerful :-)
>
> The idea is to allow non-trivial histograms and summaries to be built out of simple
> events, via the filter engine.
>
> It would require an extension of tracing to really allow a filter expression to be
> defined over existing events, which would allow the maintenance of a persistent
> 'sum' variable - probably within the perf ring-buffer. We already have filter
> support, that would have to be extended with a notion of 'persistent variables'.
>
> So right now, if you define a tracepoint in that spot, we already support such
> filter expressions:
>
> 	'bdev == sda1 && page_state == PageUptodate'
>
> You can inject such filter expressions into /debug/tracing/events/*/*/filter today,
> and you can use filters in perf record --filter '...' as well.
>
> To implement 'fast statistics', the filter engine would have to be extended to
> support (simple) statements like:
>
> 	if (bdev == sda1 && page_state == PageUptodate)
> 		var0++;
>
> And:
>
> 	if (bdev == sda1 && page_state != PageUptodate)
> 		var1++;
>
> Only a very minimal type of C syntax would be supported - not a full C parser.
>
> That way the 'var0' portion of the perf ring-buffer (which would not be part of the
> regular, overwritten ring-buffer) would act as a 'hits' variable that you could
> recover. The 'var1' portion would be the 'misses' counter.
>
> Individual trace events would only twiddle var0 and var1 - they would not inject a
> full-blown event into the ring-buffer, so statistics would be very fast.
>
> This method is very extensible and could be used for far more things than just MM
> statistics. In theory all of /proc statistics collection could be replaced and made
> optional that way, just by adding the right events to the right spots in the kernel.
> That is obviously a very long-term project.
It looks really fantastic to us. OK, we will try to figure out when and
how we can work on this issue. Many thanks.

Regards,
Tao

* [RFC PATCH 5/5] mm: Add readpages accounting
  [not found] <no>
  ` (3 preceding siblings ...)
  2011-03-02  8:38 ` [RFC PATCH 4/5] mm: Add hit/miss accounting for Page Cache Liu Yuan
@ 2011-03-02  8:38 ` Liu Yuan
  4 siblings, 0 replies; 15+ messages in thread
From: Liu Yuan @ 2011-03-02 8:38 UTC
To: linux-kernel, linux-mm, jaxboe, akpm, fengguang.wu

From: Liu Yuan <tailai.ly@taobao.com>

The _readpages_ counter simply counts how many pages the kernel
really requests from the disk, either by the readahead module
or by aop->readpage() when the readahead window equals 0.

This counter is request-centric and does not check read errors,
since the read requests are already issued to the block layer.

Signed-off-by: Liu Yuan <tailai.ly@taobao.com>
---
 mm/filemap.c   |    1 +
 mm/readahead.c |    2 ++
 2 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 5388b2a..d638391 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1137,6 +1137,7 @@ readpage:
 		 */
 		ClearPageError(page);
 		/* Start the actual read. The read will unlock the page. */
+		page_cache_acct_readpages(mapping->host->i_sb, 1);
 		error = mapping->a_ops->readpage(filp, page);
 
 		if (unlikely(error)) {
diff --git a/mm/readahead.c b/mm/readahead.c
index 77506a2..483acb8 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -112,6 +112,8 @@ static int read_pages(struct address_space *mapping, struct file *filp,
 	unsigned page_idx;
 	int ret;
 
+	page_cache_acct_readpages(mapping->host->i_sb, nr_pages);
+
 	if (mapping->a_ops->readpages) {
 		ret = mapping->a_ops->readpages(filp, mapping, pages, nr_pages);
 		/* Clean up the remaining pages */
-- 
1.7.0.4

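The 'request-centric' behaviour described in the changelog above can be shown with a small userspace model. Everything here is hypothetical (`struct readpages_stat`, `issue_read`, and the two stub readers are not from the patch series); the only point carried over from the patch is the ordering, in which the counter is bumped before the read is issued and is therefore not rolled back when the read fails.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-superblock readpages counter. */
struct readpages_stat {
	unsigned long readpages;
};

/* Hypothetical read callback: returns 0 on success, negative on error. */
typedef int (*read_fn)(unsigned long nr_pages);

/*
 * Account first, then issue the read, as both hunks in the patch do.
 * A failing read still leaves its pages counted.
 */
int issue_read(struct readpages_stat *st, unsigned long nr_pages,
	       read_fn do_read)
{
	assert(st != NULL && do_read != NULL);

	st->readpages += nr_pages;	/* counted at request time */
	return do_read(nr_pages);	/* error is returned, not un-counted */
}

/* Stub readers for illustration only. */
int read_ok(unsigned long nr_pages)
{
	(void)nr_pages;
	return 0;
}

int read_fail(unsigned long nr_pages)
{
	(void)nr_pages;
	return -5;	/* models an I/O error such as -EIO */
}
```

In this model a single-page read through the readpage path adds 1 and a readahead batch adds nr_pages, matching the two call sites in mm/filemap.c and mm/readahead.c.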
end of thread, other threads:[~2011-03-03 15:08 UTC | newest]

Thread overview: 15+ messages
[not found] <no>
2011-03-02  8:38 ` [RFC PATCH 1/5] x86/Kconfig: Add Page Cache Accounting entry Liu Yuan
2011-03-02 16:24 ` Randy Dunlap
2011-03-03  3:11 ` Liu Yuan
2011-03-02  8:38 ` [RFC PATCH 2/5] block: Add functions and data types for Page Cache Accounting Liu Yuan
2011-03-02  8:38 ` [RFC PATCH 3/5] block: Make Page Cache counters work with sysfs Liu Yuan
2011-03-02  8:38 ` [RFC PATCH 4/5] mm: Add hit/miss accounting for Page Cache Liu Yuan
2011-03-02  8:45 ` Ingo Molnar
2011-03-02 17:02 ` Dave Hansen
2011-03-02 18:49 ` Ingo Molnar
2011-03-03  0:33 ` Wu Fengguang
2011-03-03  2:01 ` KOSAKI Motohiro
2011-03-03  3:14 ` Tao Ma
2011-03-03  9:34 ` Ingo Molnar
2011-03-03 15:08 ` Tao Ma
2011-03-02  8:38 ` [RFC PATCH 5/5] mm: Add readpages accounting Liu Yuan