From: Wu Fengguang <fengguang.wu-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
To: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
Cc: "Li,
Shaohua" <shaohua.li-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>,
"linux-btrfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org"
<linux-btrfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
"linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org"
<linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
Chris Mason <chris.mason-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>,
Christoph Hellwig <hch-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>,
Arjan van de Ven <arjan-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>,
"Yan,
Zheng" <zheng.z.yan-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>,
linux-api <linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
manpages <mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Subject: Re: [PATCH v3 1/5] add metadata_incore ioctl in vfs
Date: Thu, 20 Jan 2011 13:46:45 +0800 [thread overview]
Message-ID: <20110120054645.GA18642@localhost> (raw)
In-Reply-To: <20110119201014.adf02a78.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
[-- Attachment #1: Type: text/plain, Size: 14452 bytes --]
On Thu, Jan 20, 2011 at 12:10:14PM +0800, Andrew Morton wrote:
> On Thu, 20 Jan 2011 11:21:49 +0800 Shaohua Li <shaohua.li-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
>
> > > It seems to return a single offset/length tuple which refers to the
> > > btrfs metadata "file", with the intent that this tuple later be fed
> > > into a btrfs-specific readahead ioctl.
> > >
> > > I can see how this might be used with say fatfs or ext3 where all
> > > metadata resides within the blockdev address_space. But how is a
> > > filesytem which keeps its metadata in multiple address_spaces supposed
> > > to use this interface?
> > Oh, this looks like a big problem, thanks for letting me know such
> > filesystems. is it possible specific filesystem mapping multiple
> > address_space ranges to a virtual big ranges? the new ioctls handle the
> > mapping.
>
> I'm not sure what you mean by that.
>
> ext2, minix and probably others create an address_space for each
> directory. Heaven knows what xfs does (for example).
>
> > If the issue can't be solved, we can only add the metadata readahead for
> > specific implementation like my initial post instead of a generic
> > interface.
>
> Well. One approach would be for the kernel to report the names of all
> presently-cached files. And for each file, report the offsets of all
> the pages which are presently in pagecache. This all gets put into a
> database.
>
> At cold-boot time we open all those files and read the relevant files.
>
> To optimise that further, userspace would need to use fibmap to work
> out the LBA(s) of each page, and then read the pages in an optimised order.
>
> To optimise that even further, userspace would need to find the on-disk
> locations all the metadata for each file, generate the metadata->data
> dependencies and then incorporate that into the reading order.
>
> I actually wrote code to do all this. Gad, it was ten years ago. I
> forget how it works, but I do recall that it pioneered the technology
> of doing (effecticely) a sys_write(1, ...) from a kernel module, so the
> module's output appears on modprobe's stdout and can be redirected to
> another file or a pipe. So sue me! It's in
> http://userweb.kernel.org/~akpm/stuff/fboot.tar.gz. Good luck with
> that ;)
>
> <looks>
>
> It walked mem_map[], indentifying pagecache pages, walking back from
> the page* all the way to the filename then logging the pathname and the
> file's pagecache indexes. It also handled the blockdev superblock,
> where all the ext3 metadata resides.
> There are much smarter ways of doing this of course, especially with
> the vfs data structures which we later added.
Yup :) The attached patch walks sb->s_inodes and dumps a ordered view
of all cached file pages. It will list each cached files and pages in
the order of the struct inode create time.
The patch will record and show the command name that first opened the
file. (At the time we dump the page cache, the task may no longer
exists.) Although the field is very useful in some cases, it does add
runtime overheads. I'm not sure how to balance this situation. Adding
a compile time option? But then the trace output becomes dependent on
kernel configuration, which may confuse user space tools (at least the
dumb ones).
Otherwise the patch is good enough for wider review.
Here is a trimmed example output.
root@bay /home/wfg# echo / > /debug/tracing/objects/mm/pages/dump-fs
root@bay /home/wfg# cat /debug/tracing/trace
The output are made of intermixed lines for inode and page. The
corresponding field names are:
file lines:
ino size cached age(ms) dirty type first-opened-by file-name
page lines:
index len page-flags count mapcount
1507329 4096 8192 309042 ____ DIR swapper /
0 2 ____RU_____ 1 0
1786836 12288 40960 309026 ____ DIR swapper /sbin
0 10 ___ARU_____ 1 0
1786946 37312 40960 309024 ____ REG swapper /sbin/init
0 6 M__ARU_____ 2 1
6 1 M__A_U_____ 2 1
7 1 M__ARU_____ 2 1
8 2 _____U_____ 1 0
1507464 4 4096 309022 ____ LNK swapper /lib64
0 1 ___ARU_____ 1 0
1590173 12288 0 309021 ____ DIR swapper /lib
4563326 12 4096 309020 ____ LNK swapper /lib/ld-linux-x86-64.so.2
0 1 ___ARU_____ 1 0
4563295 128744 131072 309019 ____ REG swapper /lib/ld-2.11.2.so
0 1 M__ARU_____ 21 20
1 3 M__ARU_____ 17 16
4 4 M__ARU_____ 20 19
8 2 M__ARU_____ 27 26
10 3 M__ARU_____ 20 19
13 1 M__ARU_____ 27 26
14 1 M__ARU_____ 26 25
15 1 M__ARU_____ 20 19
16 1 M__ARU_____ 18 17
17 1 M__ARU_____ 9 8
18 1 M__A_U_____ 4 3
19 1 M__ARU_____ 27 26
20 1 M__ARU_____ 17 16
21 1 M__ARU_____ 20 19
22 1 M__ARU_____ 27 26
23 1 M__ARU_____ 20 19
24 1 M__ARU_____ 26 25
25 1 _____U_____ 1 0
26 1 M__A_U_____ 4 3
27 1 M__ARU_____ 20 19
28 4 _____U_____ 1 0
1525477 12288 0 309011 ____ DIR init /etc
1526463 64634 65536 309009 ____ REG init /etc/ld.so.cache
0 1 ___ARU_____ 1 0
1 1 _____U_____ 1 0
2 13 ___ARU_____ 1 0
15 1 ____RU_____ 1 0
1590258 241632 241664 309005 ____ REG init /lib/libsepol.so.1
0 5 M__ARU_____ 2 1
5 42 _____U_____ 1 0
47 1 M__ARU_____ 2 1
48 11 _____U_____ 1 0
1590330 117848 118784 308989 ____ REG init /lib/libselinux.so.1
0 1 M__ARU_____ 7 6
1 4 M__ARU_____ 4 3
5 1 M__ARU_____ 5 4
6 5 _____U_____ 1 0
11 2 M__ARU_____ 4 3
13 5 _____U_____ 1 0
18 1 ___ARU_____ 1 0
19 2 _____U_____ 1 0
21 1 M__ARU_____ 5 4
22 7 _____U_____ 1 0
4563314 14 4096 308982 ____ LNK init /lib/libc.so.6
0 1 ___ARU_____ 1 0
4563283 1432968 1433600 308981 ____ REG init /lib/libc-2.11.2.so
0 3 M__ARU_____ 27 26
3 1 M__ARU_____ 25 24
4 2 M__ARU_____ 23 22
6 1 M__ARU_____ 26 25
7 1 M__ARU_____ 22 21
8 1 M__ARU_____ 27 26
9 2 M__ARU_____ 25 24
11 1 M__ARU_____ 23 22
12 1 M__ARU_____ 25 24
13 1 M__ARU_____ 24 23
14 1 M__ARU_____ 25 24
15 3 M__ARU_____ 24 23
18 3 M__ARU_____ 26 25
21 2 M__ARU_____ 27 26
23 7 M__ARU_____ 17 16
30 1 M__ARU_____ 29 28
31 1 M__ARU_____ 25 24
32 2 M__ARU_____ 4 3
34 1 M__ARU_____ 3 2
35 2 M__ARU_____ 4 3
37 1 M__ARU_____ 2 1
38 1 _____U_____ 1 0
39 1 M__ARU_____ 4 3
40 1 M__ARU_____ 13 12
41 1 M__ARU_____ 12 11
42 1 M__ARU_____ 5 4
43 1 M__ARU_____ 23 22
44 2 M__ARU_____ 6 5
46 1 ___ARU_____ 1 0
47 1 M__ARU_____ 12 11
48 1 M__ARU_____ 4 3
49 1 M__ARU_____ 18 17
50 1 M__ARU_____ 29 28
51 2 M__ARU_____ 2 1
53 1 M__ARU_____ 27 26
54 1 M__ARU_____ 19 18
55 1 M__ARU_____ 25 24
56 2 _____U_____ 1 0
58 2 M__ARU_____ 2 1
60 2 _____U_____ 1 0
62 1 M__A_U_____ 2 1
63 1 _____U_____ 1 0
64 1 ___ARU_____ 1 0
65 3 M__ARU_____ 29 28
68 1 M__ARU_____ 21 20
69 1 M__ARU_____ 26 25
70 1 M__ARU_____ 9 8
71 1 M__ARU_____ 3 2
72 2 ___ARU_____ 1 0
74 2 _____U_____ 1 0
76 1 M__ARU_____ 27 26
77 2 M__ARU_____ 13 12
79 1 M__ARU_____ 9 8
80 1 M__ARU_____ 10 9
81 1 M__A_U_____ 2 1
82 1 M___RU_____ 4 3
83 1 M__ARU_____ 3 2
84 1 M__ARU_____ 16 15
85 1 M__ARU_____ 3 2
86 12 _____U_____ 1 0
98 1 M__ARU_____ 26 25
99 1 M__ARU_____ 25 24
100 2 M__ARU_____ 17 16
102 1 M__ARU_____ 25 24
103 1 M__ARU_____ 18 17
104 1 M__ARU_____ 14 13
105 3 _____U_____ 1 0
108 1 M__ARU_____ 12 11
109 2 M__ARU_____ 26 25
111 6 M__ARU_____ 30 29
117 1 M__ARU_____ 29 28
118 1 M__ARU_____ 30 29
119 1 M__ARU_____ 19 18
120 1 M__ARU_____ 22 21
121 1 M__ARU_____ 3 2
122 1 M__ARU_____ 28 27
123 1 M__ARU_____ 30 29
124 1 M__ARU_____ 11 10
125 1 M__ARU_____ 26 25
126 1 M__ARU_____ 22 21
127 2 M__ARU_____ 29 28
129 2 M__ARU_____ 5 4
131 1 M__ARU_____ 10 9
132 1 M__ARU_____ 25 24
133 2 M__ARU_____ 17 16
135 1 M__ARU_____ 3 2
136 6 _____U_____ 1 0
142 2 M__ARU_____ 3 2
144 1 M__ARU_____ 8 7
145 1 M__ARU_____ 22 21
146 3 M__ARU_____ 8 7
149 2 _____U_____ 1 0
151 3 M__ARU_____ 6 5
154 2 _____U_____ 1 0
156 1 M__ARU_____ 8 7
157 1 M__ARU_____ 10 9
158 1 M__ARU_____ 9 8
159 1 M__ARU_____ 8 7
160 1 M__ARU_____ 28 27
161 1 M__ARU_____ 30 29
162 1 M__ARU_____ 14 13
163 1 M____U_____ 2 1
164 2 _____U_____ 1 0
166 2 M__ARU_____ 4 3
168 1 M__ARU_____ 12 11
169 1 M__ARU_____ 10 9
170 1 M__ARU_____ 4 3
171 3 M__ARU_____ 3 2
174 6 ___ARU_____ 1 0
180 1 _____U_____ 1 0
181 9 ___ARU_____ 1 0
190 1 M__ARU_____ 4 3
191 1 ___A_U_____ 1 0
192 1 _____U_____ 1 0
193 1 ___A_U_____ 1 0
194 1 M__ARU_____ 30 29
195 1 M__ARU_____ 27 26
196 1 M__ARU_____ 17 16
197 2 _____U_____ 1 0
199 1 M__ARU_____ 27 26
200 1 M__ARU_____ 25 24
201 1 M__ARU_____ 2 1
202 1 M__ARU_____ 9 8
203 1 M__ARU_____ 26 25
204 1 M__ARU_____ 14 13
205 1 M__ARU_____ 4 3
206 1 M__ARU_____ 18 17
207 1 M__ARU_____ 26 25
208 1 M__ARU_____ 22 21
209 1 M__ARU_____ 2 1
210 1 M__ARU_____ 3 2
211 2 M____U_____ 2 1
213 5 _____U_____ 1 0
218 1 ___A_U_____ 1 0
> <googles>
>
> According to http://kerneltrap.org/node/2157 it sped up cold boot by
> "10%", whatever that means. Seems that I wasn't sufficiently impressed
> by that and got distracted.
>
> I'm not sure any of that was very useful, really. A full-on coldboot
> optimiser really wants visibility into every disk block which need to
> be read, and then mechanisms to tell the kernel to load those blocks
> into the correct address_spaces. That's hard, because file data
> depends on file metadata. A vast simplification would be to do it in
> two disk passes: read all the metadata on pass 1 then all the data on
> pass 2.
Yes, that is what this patchset tries to do.
> A totally different approach is to reorder all the data and metadata
> on-disk, so no special cold-boot processing is needed at all.
The boot time speedup mentioned in the changelog won't be possible
without the physical data/metadata reordering. Fortunately btrfs makes
it a trivial task.
> And a third approach is to save all the cache into a special
> file/partition/etc and to preload all that into kernel data structures
> at boot. Obviously this one is ricky/tricky because the on-disk
> replica of the real data can get out of sync with the real data.
Hah! We are thinking much alike :)
It's a very good optimization for LiveCDs and readonly mounted NFS /usr.
For a typical desktop, the solution in my mind is to install some
initscript to run at halt/reboot time, after all other tasks have been
killed and filesystems remounted readonly. At the time it may dump
whatever in the page cache to the swap partition. At the next boot,
the data/metadata can then be read back _perfectly sequentially_ for
populating the page cache.
For kexec based reboot, the data can even be passed to next kernel
directly, saving the disk IO totally.
Thanks,
Fengguang
[-- Attachment #2: dump-page-cache-v2.6.37.patch --]
[-- Type: text/x-diff, Size: 16714 bytes --]
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ mmotm/include/trace/events/mm.h 2010-12-26 20:59:48.000000000 +0800
@@ -0,0 +1,164 @@
+#if !defined(_TRACE_MM_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_MM_H
+
+#include <linux/tracepoint.h>
+#include <linux/page-flags.h>
+#include <linux/memcontrol.h>
+#include <linux/pagemap.h>
+#include <linux/mm.h>
+#include <linux/kernel-page-flags.h>
+
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM mm
+
+extern struct trace_print_flags pageflag_names[];
+
+/**
+ * dump_page_frame - called by the trace page dump trigger
+ * @pfn: page frame number
+ * @page: pointer to the page frame
+ *
+ * This is a helper trace point into the dumping of the page frames.
+ * It will record various infromation about a page frame.
+ */
+TRACE_EVENT(dump_page_frame,
+
+ TP_PROTO(unsigned long pfn, struct page *page),
+
+ TP_ARGS(pfn, page),
+
+ TP_STRUCT__entry(
+ __field( unsigned long, pfn )
+ __field( struct page *, page )
+ __field( u64, stable_flags )
+ __field( unsigned long, flags )
+ __field( unsigned int, count )
+ __field( unsigned int, mapcount )
+ __field( unsigned long, private )
+ __field( unsigned long, mapping )
+ __field( unsigned long, index )
+ ),
+
+ TP_fast_assign(
+ __entry->pfn = pfn;
+ __entry->page = page;
+ __entry->stable_flags = stable_page_flags(page);
+ __entry->flags = page->flags;
+ __entry->count = atomic_read(&page->_count);
+ __entry->mapcount = page_mapcount(page);
+ __entry->private = page->private;
+ __entry->mapping = (unsigned long)page->mapping;
+ __entry->index = page->index;
+ ),
+
+ TP_printk("%12lx %16p %8x %8x %16lx %16lx %16lx %s",
+ __entry->pfn,
+ __entry->page,
+ __entry->count,
+ __entry->mapcount,
+ __entry->private,
+ __entry->mapping,
+ __entry->index,
+ ftrace_print_flags_seq(p, "|",
+ __entry->flags & PAGE_FLAGS_MASK,
+ pageflag_names)
+ )
+);
+
+TRACE_EVENT(dump_page_cache,
+
+ TP_PROTO(struct page *page, unsigned long len),
+
+ TP_ARGS(page, len),
+
+ TP_STRUCT__entry(
+ __field( unsigned long, index )
+ __field( unsigned long, len )
+ __field( u64, flags )
+ __field( unsigned int, count )
+ __field( unsigned int, mapcount )
+ ),
+
+ TP_fast_assign(
+ __entry->index = page->index;
+ __entry->len = len;
+ __entry->flags = stable_page_flags(page);
+ __entry->count = atomic_read(&page->_count);
+ __entry->mapcount = page_mapcount(page);
+ ),
+
+ TP_printk("%12lu %6lu %c%c%c%c%c%c%c%c%c%c%c %4u %4u",
+ __entry->index,
+ __entry->len,
+ __entry->flags & (1ULL << KPF_MMAP) ? 'M' : '_',
+ __entry->flags & (1ULL << KPF_MLOCKED) ? 'm' : '_',
+ __entry->flags & (1ULL << KPF_UNEVICTABLE) ? 'u' : '_',
+ __entry->flags & (1ULL << KPF_ACTIVE) ? 'A' : '_',
+ __entry->flags & (1ULL << KPF_REFERENCED) ? 'R' : '_',
+ __entry->flags & (1ULL << KPF_UPTODATE) ? 'U' : '_',
+ __entry->flags & (1ULL << KPF_DIRTY) ? 'D' : '_',
+ __entry->flags & (1ULL << KPF_WRITEBACK) ? 'W' : '_',
+ __entry->flags & (1ULL << KPF_RECLAIM) ? 'I' : '_',
+ __entry->flags & (1ULL << KPF_MAPPEDTODISK) ? 'd' : '_',
+ __entry->flags & (1ULL << KPF_PRIVATE) ? 'P' : '_',
+ __entry->count,
+ __entry->mapcount)
+);
+
+
+#define show_inode_type(val) __print_symbolic(val, \
+ { S_IFREG, "REG" }, \
+ { S_IFDIR, "DIR" }, \
+ { S_IFLNK, "LNK" }, \
+ { S_IFBLK, "BLK" }, \
+ { S_IFCHR, "CHR" }, \
+ { S_IFIFO, "FIFO" }, \
+ { S_IFSOCK, "SOCK" })
+
+TRACE_EVENT(dump_inode_cache,
+
+ TP_PROTO(struct inode *inode, char *name, int len),
+
+ TP_ARGS(inode, name, len),
+
+ TP_STRUCT__entry(
+ __field( unsigned long, ino )
+ __field( loff_t, size ) /* bytes */
+ __field( loff_t, cached ) /* bytes */
+ __field( unsigned long, age ) /* ms */
+ __field( unsigned long, state )
+ __field( umode_t, mode )
+ __array( char, comm, TASK_COMM_LEN)
+ __dynamic_array(char, file, len )
+ ),
+
+ TP_fast_assign(
+ __entry->ino = inode->i_ino;
+ __entry->size = i_size_read(inode);
+ __entry->cached = inode->i_mapping->nrpages;
+ __entry->cached <<= PAGE_CACHE_SHIFT;
+ __entry->age = (jiffies - inode->dirtied_when) * 1000 / HZ;
+ __entry->state = inode->i_state;
+ __entry->mode = inode->i_mode;
+ memcpy(__entry->comm, inode->i_comm, TASK_COMM_LEN);
+ memcpy(__get_str(file), name, len);
+ ),
+
+ TP_printk("%12lu %12llu %12llu %12lu %c%c%c%c %4s %16s %s",
+ __entry->ino,
+ __entry->size,
+ __entry->cached,
+ __entry->age,
+ __entry->state & I_DIRTY_PAGES ? 'D' : '_',
+ __entry->state & I_DIRTY_DATASYNC ? 'd' : '_',
+ __entry->state & I_DIRTY_SYNC ? 'm' : '_',
+ __entry->state & I_SYNC ? 'S' : '_',
+ show_inode_type(__entry->mode & S_IFMT),
+ __entry->comm,
+ __get_str(file))
+);
+
+#endif /* _TRACE_MM_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
--- mmotm.orig/kernel/trace/Makefile 2010-12-26 20:58:46.000000000 +0800
+++ mmotm/kernel/trace/Makefile 2010-12-26 20:59:41.000000000 +0800
@@ -26,6 +26,7 @@ obj-$(CONFIG_RING_BUFFER) += ring_buffer
obj-$(CONFIG_RING_BUFFER_BENCHMARK) += ring_buffer_benchmark.o
obj-$(CONFIG_TRACING) += trace.o
+obj-$(CONFIG_TRACING) += trace_objects.o
obj-$(CONFIG_TRACING) += trace_output.o
obj-$(CONFIG_TRACING) += trace_stat.o
obj-$(CONFIG_TRACING) += trace_printk.o
@@ -53,6 +54,7 @@ endif
obj-$(CONFIG_EVENT_TRACING) += trace_events_filter.o
obj-$(CONFIG_KPROBE_EVENT) += trace_kprobe.o
obj-$(CONFIG_EVENT_TRACING) += power-traces.o
+obj-$(CONFIG_EVENT_TRACING) += trace_mm.o
ifeq ($(CONFIG_TRACING),y)
obj-$(CONFIG_KGDB_KDB) += trace_kdb.o
endif
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ mmotm/kernel/trace/trace_mm.c 2010-12-26 20:59:41.000000000 +0800
@@ -0,0 +1,367 @@
+/*
+ * Trace mm pages
+ *
+ * Copyright (C) 2009 Red Hat Inc, Steven Rostedt <srostedt-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
+ *
+ * Code based on Matt Mackall's /proc/[kpagecount|kpageflags] code.
+ */
+#include <linux/module.h>
+#include <linux/bootmem.h>
+#include <linux/debugfs.h>
+#include <linux/uaccess.h>
+#include <linux/ctype.h>
+#include <linux/pagevec.h>
+#include <linux/writeback.h>
+#include <linux/file.h>
+#include <linux/slab.h>
+
+#include "trace_output.h"
+
+#define CREATE_TRACE_POINTS
+#include <trace/events/mm.h>
+
+void trace_mm_page_frames(unsigned long start, unsigned long end,
+ void (*trace)(unsigned long pfn, struct page *page))
+{
+ unsigned long pfn = start;
+ struct page *page;
+
+ if (start > max_pfn - 1)
+ return;
+
+ if (end > max_pfn)
+ end = max_pfn;
+
+ while (pfn < end) {
+ page = NULL;
+ if (pfn_valid(pfn))
+ page = pfn_to_page(pfn);
+ pfn++;
+ if (page)
+ trace(pfn, page);
+ }
+}
+
+static void trace_mm_page_frame(unsigned long pfn, struct page *page)
+{
+ trace_dump_page_frame(pfn, page);
+}
+
+static ssize_t
+trace_mm_pfn_range_read(struct file *filp, char __user *ubuf, size_t cnt,
+ loff_t *ppos)
+{
+ return simple_read_from_buffer(ubuf, cnt, ppos, "0\n", 2);
+}
+
+
+/*
+ * recognized formats:
+ * "M N" start=M, end=N
+ * "M" start=M, end=M+1
+ * "M +N" start=M, end=M+N-1
+ */
+static ssize_t
+trace_mm_pfn_range_write(struct file *filp, const char __user *ubuf, size_t cnt,
+ loff_t *ppos)
+{
+ unsigned long start;
+ unsigned long end = 0;
+ char buf[64];
+ char *ptr;
+
+ if (cnt >= sizeof(buf))
+ return -EINVAL;
+
+ if (copy_from_user(&buf, ubuf, cnt))
+ return -EFAULT;
+
+ if (tracing_update_buffers() < 0)
+ return -ENOMEM;
+
+ if (trace_set_clr_event("mm", "dump_page_frame", 1))
+ return -EINVAL;
+
+ buf[cnt] = 0;
+
+ start = simple_strtoul(buf, &ptr, 0);
+
+ for (; *ptr; ptr++) {
+ if (isdigit(*ptr)) {
+ if (*(ptr - 1) == '+')
+ end = start;
+ end += simple_strtoul(ptr, NULL, 0);
+ break;
+ }
+ }
+ if (!*ptr)
+ end = start + 1;
+
+ trace_mm_page_frames(start, end, trace_mm_page_frame);
+
+ return cnt;
+}
+
+static const struct file_operations trace_mm_fops = {
+ .open = tracing_open_generic,
+ .read = trace_mm_pfn_range_read,
+ .write = trace_mm_pfn_range_write,
+};
+
+static struct dentry *trace_objects_mm_dir(void)
+{
+ static struct dentry *d_mm;
+ struct dentry *d_objects;
+
+ if (d_mm)
+ return d_mm;
+
+ d_objects = trace_objects_dir();
+ if (!d_objects)
+ return NULL;
+
+ d_mm = debugfs_create_dir("mm", d_objects);
+ if (!d_mm)
+ pr_warning("Could not create 'objects/mm' directory\n");
+
+ return d_mm;
+}
+
+static unsigned long page_flags(struct page *page)
+{
+ return page->flags & ((1 << NR_PAGEFLAGS) - 1);
+}
+
+static int pages_similar(struct page *page0, struct page *page)
+{
+ if (page_flags(page0) != page_flags(page))
+ return 0;
+
+ if (page_count(page0) != page_count(page))
+ return 0;
+
+ if (page_mapcount(page0) != page_mapcount(page))
+ return 0;
+
+ return 1;
+}
+
+static void dump_pagecache(struct address_space *mapping)
+{
+ unsigned long nr_pages;
+ struct page *pages[PAGEVEC_SIZE];
+ struct page *uninitialized_var(page0);
+ struct page *page;
+ unsigned long start = 0;
+ unsigned long len = 0;
+ int i;
+
+ for (;;) {
+ rcu_read_lock();
+ nr_pages = radix_tree_gang_lookup(&mapping->page_tree,
+ (void **)pages, start + len, PAGEVEC_SIZE);
+ rcu_read_unlock();
+
+ if (nr_pages == 0) {
+ if (len)
+ trace_dump_page_cache(page0, len);
+ return;
+ }
+
+ for (i = 0; i < nr_pages; i++) {
+ page = pages[i];
+
+ if (len &&
+ page->index == start + len &&
+ pages_similar(page0, page))
+ len++;
+ else {
+ if (len)
+ trace_dump_page_cache(page0, len);
+ page0 = page;
+ start = page->index;
+ len = 1;
+ }
+ }
+ cond_resched();
+ }
+}
+
+static void dump_inode_cache(struct inode *inode,
+ char *name_buf,
+ struct vfsmount *mnt)
+{
+ struct path path = {
+ .mnt = mnt,
+ .dentry = d_find_alias(inode)
+ };
+ char *name;
+ int len;
+
+ if (!mnt) {
+ trace_dump_inode_cache(inode, name_buf, strlen(name_buf));
+ return;
+ }
+
+ if (!path.dentry) {
+ trace_dump_inode_cache(inode, "", 1);
+ return;
+ }
+
+ name = d_path(&path, name_buf, PAGE_SIZE);
+ if (IS_ERR(name)) {
+ name = "";
+ len = 1;
+ } else
+ len = PAGE_SIZE + name_buf - name;
+
+ trace_dump_inode_cache(inode, name, len);
+
+ if (path.dentry)
+ dput(path.dentry);
+}
+
+static void dump_fs_pagecache(struct super_block *sb, struct vfsmount *mnt)
+{
+ struct inode *inode;
+ struct inode *prev_inode = NULL;
+ char *name_buf;
+
+ name_buf = (char *)__get_free_page(GFP_TEMPORARY);
+ if (!name_buf)
+ return;
+
+ down_read(&sb->s_umount);
+ if (!sb->s_root)
+ goto out;
+
+ spin_lock(&inode_lock);
+ list_for_each_entry_reverse(inode, &sb->s_inodes, i_sb_list) {
+ if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW))
+ continue;
+ __iget(inode);
+ spin_unlock(&inode_lock);
+ dump_inode_cache(inode, name_buf, mnt);
+ if (inode->i_mapping->nrpages)
+ dump_pagecache(inode->i_mapping);
+ iput(prev_inode);
+ prev_inode = inode;
+ cond_resched();
+ spin_lock(&inode_lock);
+ }
+ spin_unlock(&inode_lock);
+ iput(prev_inode);
+out:
+ up_read(&sb->s_umount);
+ free_page((unsigned long)name_buf);
+}
+
+static ssize_t
+trace_pagecache_write(struct file *filp, const char __user *ubuf, size_t count,
+ loff_t *ppos)
+{
+ struct file *file = NULL;
+ char *name;
+ int err = 0;
+
+ if (count <= 1)
+ return -EINVAL;
+ if (count >= PAGE_SIZE)
+ return -ENAMETOOLONG;
+
+ name = kmalloc(PAGE_SIZE, GFP_KERNEL);
+ if (!name)
+ return -ENOMEM;
+
+ if (copy_from_user(name, ubuf, count)) {
+ err = -EFAULT;
+ goto out;
+ }
+
+ /* strip the newline added by `echo` */
+ if (name[count-1] == '\n')
+ name[count-1] = '\0';
+ else
+ name[count] = '\0';
+
+ file = filp_open(name, O_RDONLY|O_LARGEFILE, 0);
+ if (IS_ERR(file)) {
+ err = PTR_ERR(file);
+ file = NULL;
+ goto out;
+ }
+
+ if (tracing_update_buffers() < 0) {
+ err = -ENOMEM;
+ goto out;
+ }
+ if (trace_set_clr_event("mm", "dump_page_cache", 1)) {
+ err = -EINVAL;
+ goto out;
+ }
+ if (trace_set_clr_event("mm", "dump_inode_cache", 1)) {
+ err = -EINVAL;
+ goto out;
+ }
+
+ if (filp->f_path.dentry->d_inode->i_private) {
+ dump_fs_pagecache(file->f_path.dentry->d_sb, file->f_path.mnt);
+ } else {
+ dump_inode_cache(file->f_mapping->host, name, NULL);
+ dump_pagecache(file->f_mapping);
+ }
+
+out:
+ if (file)
+ fput(file);
+ kfree(name);
+
+ return err ? err : count;
+}
+
+static const struct file_operations trace_pagecache_fops = {
+ .open = tracing_open_generic,
+ .read = trace_mm_pfn_range_read,
+ .write = trace_pagecache_write,
+};
+
+static struct dentry *trace_objects_mm_pages_dir(void)
+{
+ static struct dentry *d_pages;
+ struct dentry *d_mm;
+
+ if (d_pages)
+ return d_pages;
+
+ d_mm = trace_objects_mm_dir();
+ if (!d_mm)
+ return NULL;
+
+ d_pages = debugfs_create_dir("pages", d_mm);
+ if (!d_pages)
+ pr_warning("Could not create debugfs "
+ "'objects/mm/pages' directory\n");
+
+ return d_pages;
+}
+
+static __init int trace_objects_mm_init(void)
+{
+ struct dentry *d_pages;
+
+ d_pages = trace_objects_mm_pages_dir();
+ if (!d_pages)
+ return 0;
+
+ trace_create_file("dump-pfn", 0600, d_pages, NULL,
+ &trace_mm_fops);
+
+ trace_create_file("dump-file", 0600, d_pages, NULL,
+ &trace_pagecache_fops);
+
+ trace_create_file("dump-fs", 0600, d_pages, (void *)1,
+ &trace_pagecache_fops);
+
+ return 0;
+}
+fs_initcall(trace_objects_mm_init);
--- mmotm.orig/kernel/trace/trace.h 2010-12-26 20:58:46.000000000 +0800
+++ mmotm/kernel/trace/trace.h 2010-12-26 20:59:41.000000000 +0800
@@ -295,6 +295,7 @@ struct dentry *trace_create_file(const c
const struct file_operations *fops);
struct dentry *tracing_init_dentry(void);
+struct dentry *trace_objects_dir(void);
struct ring_buffer_event;
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ mmotm/kernel/trace/trace_objects.c 2010-12-26 20:59:41.000000000 +0800
@@ -0,0 +1,26 @@
+#include <linux/debugfs.h>
+
+#include "trace.h"
+#include "trace_output.h"
+
+struct dentry *trace_objects_dir(void)
+{
+ static struct dentry *d_objects;
+ struct dentry *d_tracer;
+
+ if (d_objects)
+ return d_objects;
+
+ d_tracer = tracing_init_dentry();
+ if (!d_tracer)
+ return NULL;
+
+ d_objects = debugfs_create_dir("objects", d_tracer);
+ if (!d_objects)
+ pr_warning("Could not create debugfs "
+ "'objects' directory\n");
+
+ return d_objects;
+}
+
+
--- mmotm.orig/mm/page_alloc.c 2010-12-26 20:58:46.000000000 +0800
+++ mmotm/mm/page_alloc.c 2010-12-26 20:59:41.000000000 +0800
@@ -5493,7 +5493,7 @@ bool is_free_buddy_page(struct page *pag
}
#endif
-static struct trace_print_flags pageflag_names[] = {
+struct trace_print_flags pageflag_names[] = {
{1UL << PG_locked, "locked" },
{1UL << PG_error, "error" },
{1UL << PG_referenced, "referenced" },
@@ -5541,7 +5541,7 @@ static void dump_page_flags(unsigned lon
printk(KERN_ALERT "page flags: %#lx(", flags);
/* remove zone id */
- flags &= (1UL << NR_PAGEFLAGS) - 1;
+ flags &= PAGE_FLAGS_MASK;
for (i = 0; pageflag_names[i].name && flags; i++) {
--- mmotm.orig/include/linux/page-flags.h 2010-12-26 20:58:46.000000000 +0800
+++ mmotm/include/linux/page-flags.h 2010-12-26 20:59:41.000000000 +0800
@@ -414,6 +414,7 @@ static inline void __ClearPageTail(struc
* there has been a kernel bug or struct page corruption.
*/
#define PAGE_FLAGS_CHECK_AT_PREP ((1 << NR_PAGEFLAGS) - 1)
+#define PAGE_FLAGS_MASK ((1 << NR_PAGEFLAGS) - 1)
#define PAGE_FLAGS_PRIVATE \
(1 << PG_private | 1 << PG_private_2)
--- mmotm.orig/fs/inode.c 2010-12-26 20:58:45.000000000 +0800
+++ mmotm/fs/inode.c 2010-12-26 21:00:09.000000000 +0800
@@ -182,7 +182,13 @@ int inode_init_always(struct super_block
inode->i_bdev = NULL;
inode->i_cdev = NULL;
inode->i_rdev = 0;
- inode->dirtied_when = 0;
+
+ /*
+ * This records inode load time. It will be invalidated once inode is
+ * dirtied, or jiffies wraps around. Despite the pitfalls it still
+ * provides useful information for some use cases like fastboot.
+ */
+ inode->dirtied_when = jiffies;
if (security_inode_alloc(inode))
goto out;
@@ -226,6 +232,9 @@ int inode_init_always(struct super_block
percpu_counter_inc(&nr_inodes);
+ BUILD_BUG_ON(sizeof(inode->i_comm) != TASK_COMM_LEN);
+ memcpy(inode->i_comm, current->comm, TASK_COMM_LEN);
+
return 0;
out:
return -ENOMEM;
--- mmotm.orig/include/linux/fs.h 2010-12-26 20:59:50.000000000 +0800
+++ mmotm/include/linux/fs.h 2010-12-26 21:00:09.000000000 +0800
@@ -800,6 +800,8 @@ struct inode {
struct posix_acl *i_default_acl;
#endif
void *i_private; /* fs or device private pointer */
+
+ char i_comm[16]; /* first opened by */
};
static inline int inode_unhashed(struct inode *inode)
next prev parent reply other threads:[~2011-01-20 5:46 UTC|newest]
Thread overview: 23+ messages / expand[flat|nested] mbox.gz Atom feed top
2011-01-19 1:15 [PATCH v3 1/5] add metadata_incore ioctl in vfs Shaohua Li
2011-01-19 20:41 ` Andrew Morton
[not found] ` <20110119124158.b0348c44.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
2011-01-20 2:30 ` Shaohua Li
2011-01-20 2:42 ` Andrew Morton
2011-01-20 2:48 ` Shaohua Li
2011-01-20 3:05 ` Andrew Morton
[not found] ` <20110119190548.e1f7f01f.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
2011-01-20 3:21 ` Shaohua Li
2011-01-20 4:10 ` Andrew Morton
2011-01-20 4:41 ` Dave Chinner
2011-01-20 5:44 ` Shaohua Li
2011-01-20 6:06 ` Wu Fengguang
2011-01-24 4:29 ` Dave Chinner
[not found] ` <20110119201014.adf02a78.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
2011-01-20 5:38 ` Shaohua Li
2011-01-20 5:55 ` Andrew Morton
[not found] ` <20110119215510.0882db92.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
2011-01-20 6:12 ` Shaohua Li
2011-01-20 6:19 ` Wu Fengguang
2011-01-20 6:29 ` Andrew Morton
2011-01-20 6:37 ` Shaohua Li
2011-01-20 6:45 ` Wu Fengguang
2011-01-20 6:27 ` Andrew Morton
[not found] ` <20110119222740.fb1b5229.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
2011-01-24 10:06 ` Boaz Harrosh
2011-01-20 5:46 ` Wu Fengguang [this message]
2011-01-20 5:55 ` Wu Fengguang
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20110120054645.GA18642@localhost \
--to=fengguang.wu-ral2jqcrhueavxtiumwx3w@public.gmane.org \
--cc=akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org \
--cc=arjan-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org \
--cc=chris.mason-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org \
--cc=hch-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org \
--cc=linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org \
--cc=linux-btrfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org \
--cc=linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org \
--cc=mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org \
--cc=shaohua.li-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org \
--cc=zheng.z.yan-VuQAYsv1563Yd54FQh9/CA@public.gmane.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).