On Thu, Jan 20, 2011 at 12:10:14PM +0800, Andrew Morton wrote:
> On Thu, 20 Jan 2011 11:21:49 +0800 Shaohua Li wrote:
>
> > > It seems to return a single offset/length tuple which refers to the
> > > btrfs metadata "file", with the intent that this tuple later be fed
> > > into a btrfs-specific readahead ioctl.
> > >
> > > I can see how this might be used with say fatfs or ext3 where all
> > > metadata resides within the blockdev address_space.  But how is a
> > > filesystem which keeps its metadata in multiple address_spaces supposed
> > > to use this interface?
> > Oh, this looks like a big problem, thanks for letting me know such
> > filesystems. is it possible specific filesystem mapping multiple
> > address_space ranges to a virtual big ranges? the new ioctls handle the
> > mapping.
>
> I'm not sure what you mean by that.
>
> ext2, minix and probably others create an address_space for each
> directory.  Heaven knows what xfs does (for example).
>
> > If the issue can't be solved, we can only add the metadata readahead for
> > specific implementation like my initial post instead of a generic
> > interface.
>
> Well.  One approach would be for the kernel to report the names of all
> presently-cached files.  And for each file, report the offsets of all
> the pages which are presently in pagecache.  This all gets put into a
> database.
>
> At cold-boot time we open all those files and read the relevant files.
>
> To optimise that further, userspace would need to use fibmap to work
> out the LBA(s) of each page, and then read the pages in an optimised order.
>
> To optimise that even further, userspace would need to find the on-disk
> locations of all the metadata for each file, generate the metadata->data
> dependencies and then incorporate that into the reading order.
>
> I actually wrote code to do all this.  Gad, it was ten years ago.  I
> forget how it works, but I do recall that it pioneered the technology
> of doing (effectively) a sys_write(1, ...) from a kernel module, so the
> module's output appears on modprobe's stdout and can be redirected to
> another file or a pipe.  So sue me!  It's in
> http://userweb.kernel.org/~akpm/stuff/fboot.tar.gz.  Good luck with
> that ;)
>
> It walked mem_map[], identifying pagecache pages, walking back from
> the page* all the way to the filename then logging the pathname and the
> file's pagecache indexes.  It also handled the blockdev superblock,
> where all the ext3 metadata resides.
> There are much smarter ways of doing this of course, especially with
> the vfs data structures which we later added.

Yup :)

The attached patch walks sb->s_inodes and dumps an ordered view of all
cached file pages. It lists each cached file and its pages, ordered by
the creation time of the struct inode.

The patch also records and shows the name of the command that first
opened the file. (By the time we dump the page cache, that task may no
longer exist.) Although the field is very useful in some cases, it does
add runtime overhead, and I'm not sure how to balance this. Add a
compile time option? But then the trace output becomes dependent on the
kernel configuration, which may confuse user space tools (at least the
dumb ones).

Otherwise the patch is good enough for wider review.

Here is a trimmed example output.

	root@bay /home/wfg# echo / > /debug/tracing/objects/mm/pages/dump-fs
	root@bay /home/wfg# cat /debug/tracing/trace

The output consists of intermixed lines for inodes and pages.
The corresponding field names are:

file lines:
	ino size cached age(ms) dirty type first-opened-by file-name

page lines:
	index len page-flags count mapcount

1507329     4096     8192 309042 ____ DIR swapper /
           0   2 ____RU_____   1   0
1786836    12288    40960 309026 ____ DIR swapper /sbin
           0  10 ___ARU_____   1   0
1786946    37312    40960 309024 ____ REG swapper /sbin/init
           0   6 M__ARU_____   2   1
           6   1 M__A_U_____   2   1
           7   1 M__ARU_____   2   1
           8   2 _____U_____   1   0
1507464        4     4096 309022 ____ LNK swapper /lib64
           0   1 ___ARU_____   1   0
1590173    12288        0 309021 ____ DIR swapper /lib
4563326       12     4096 309020 ____ LNK swapper /lib/ld-linux-x86-64.so.2
           0   1 ___ARU_____   1   0
4563295   128744   131072 309019 ____ REG swapper /lib/ld-2.11.2.so
           0   1 M__ARU_____  21  20
           1   3 M__ARU_____  17  16
           4   4 M__ARU_____  20  19
           8   2 M__ARU_____  27  26
          10   3 M__ARU_____  20  19
          13   1 M__ARU_____  27  26
          14   1 M__ARU_____  26  25
          15   1 M__ARU_____  20  19
          16   1 M__ARU_____  18  17
          17   1 M__ARU_____   9   8
          18   1 M__A_U_____   4   3
          19   1 M__ARU_____  27  26
          20   1 M__ARU_____  17  16
          21   1 M__ARU_____  20  19
          22   1 M__ARU_____  27  26
          23   1 M__ARU_____  20  19
          24   1 M__ARU_____  26  25
          25   1 _____U_____   1   0
          26   1 M__A_U_____   4   3
          27   1 M__ARU_____  20  19
          28   4 _____U_____   1   0
1525477    12288        0 309011 ____ DIR init    /etc
1526463    64634    65536 309009 ____ REG init    /etc/ld.so.cache
           0   1 ___ARU_____   1   0
           1   1 _____U_____   1   0
           2  13 ___ARU_____   1   0
          15   1 ____RU_____   1   0
1590258   241632   241664 309005 ____ REG init    /lib/libsepol.so.1
           0   5 M__ARU_____   2   1
           5  42 _____U_____   1   0
          47   1 M__ARU_____   2   1
          48  11 _____U_____   1   0
1590330   117848   118784 308989 ____ REG init    /lib/libselinux.so.1
           0   1 M__ARU_____   7   6
           1   4 M__ARU_____   4   3
           5   1 M__ARU_____   5   4
           6   5 _____U_____   1   0
          11   2 M__ARU_____   4   3
          13   5 _____U_____   1   0
          18   1 ___ARU_____   1   0
          19   2 _____U_____   1   0
          21   1 M__ARU_____   5   4
          22   7 _____U_____   1   0
4563314       14     4096 308982 ____ LNK init    /lib/libc.so.6
           0   1 ___ARU_____   1   0
4563283  1432968  1433600 308981 ____ REG init    /lib/libc-2.11.2.so
           0   3 M__ARU_____  27  26
           3   1 M__ARU_____  25  24
           4   2 M__ARU_____  23  22
           6   1 M__ARU_____  26  25
           7   1 M__ARU_____  22  21
           8   1 M__ARU_____  27  26
           9   2 M__ARU_____  25  24
          11   1 M__ARU_____  23  22
          12   1 M__ARU_____  25  24
          13   1 M__ARU_____  24  23
          14   1 M__ARU_____  25  24
          15   3 M__ARU_____  24  23
          18   3 M__ARU_____  26  25
          21   2 M__ARU_____  27  26
          23   7 M__ARU_____  17  16
          30   1 M__ARU_____  29  28
          31   1 M__ARU_____  25  24
          32   2 M__ARU_____   4   3
          34   1 M__ARU_____   3   2
          35   2 M__ARU_____   4   3
          37   1 M__ARU_____   2   1
          38   1 _____U_____   1   0
          39   1 M__ARU_____   4   3
          40   1 M__ARU_____  13  12
          41   1 M__ARU_____  12  11
          42   1 M__ARU_____   5   4
          43   1 M__ARU_____  23  22
          44   2 M__ARU_____   6   5
          46   1 ___ARU_____   1   0
          47   1 M__ARU_____  12  11
          48   1 M__ARU_____   4   3
          49   1 M__ARU_____  18  17
          50   1 M__ARU_____  29  28
          51   2 M__ARU_____   2   1
          53   1 M__ARU_____  27  26
          54   1 M__ARU_____  19  18
          55   1 M__ARU_____  25  24
          56   2 _____U_____   1   0
          58   2 M__ARU_____   2   1
          60   2 _____U_____   1   0
          62   1 M__A_U_____   2   1
          63   1 _____U_____   1   0
          64   1 ___ARU_____   1   0
          65   3 M__ARU_____  29  28
          68   1 M__ARU_____  21  20
          69   1 M__ARU_____  26  25
          70   1 M__ARU_____   9   8
          71   1 M__ARU_____   3   2
          72   2 ___ARU_____   1   0
          74   2 _____U_____   1   0
          76   1 M__ARU_____  27  26
          77   2 M__ARU_____  13  12
          79   1 M__ARU_____   9   8
          80   1 M__ARU_____  10   9
          81   1 M__A_U_____   2   1
          82   1 M___RU_____   4   3
          83   1 M__ARU_____   3   2
          84   1 M__ARU_____  16  15
          85   1 M__ARU_____   3   2
          86  12 _____U_____   1   0
          98   1 M__ARU_____  26  25
          99   1 M__ARU_____  25  24
         100   2 M__ARU_____  17  16
         102   1 M__ARU_____  25  24
         103   1 M__ARU_____  18  17
         104   1 M__ARU_____  14  13
         105   3 _____U_____   1   0
         108   1 M__ARU_____  12  11
         109   2 M__ARU_____  26  25
         111   6 M__ARU_____  30  29
         117   1 M__ARU_____  29  28
         118   1 M__ARU_____  30  29
         119   1 M__ARU_____  19  18
         120   1 M__ARU_____  22  21
         121   1 M__ARU_____   3   2
         122   1 M__ARU_____  28  27
         123   1 M__ARU_____  30  29
         124   1 M__ARU_____  11  10
         125   1 M__ARU_____  26  25
         126   1 M__ARU_____  22  21
         127   2 M__ARU_____  29  28
         129   2 M__ARU_____   5   4
         131   1 M__ARU_____  10   9
         132   1 M__ARU_____  25  24
         133   2 M__ARU_____  17  16
         135   1 M__ARU_____   3   2
         136   6 _____U_____   1   0
         142   2 M__ARU_____   3   2
         144   1 M__ARU_____   8   7
         145   1 M__ARU_____  22  21
         146   3 M__ARU_____   8   7
         149   2 _____U_____   1   0
         151   3 M__ARU_____   6   5
         154   2 _____U_____   1   0
         156   1 M__ARU_____   8   7
         157   1 M__ARU_____  10   9
         158   1 M__ARU_____   9   8
         159   1 M__ARU_____   8   7
         160   1 M__ARU_____  28  27
         161   1 M__ARU_____  30  29
         162   1 M__ARU_____  14  13
         163   1 M____U_____   2   1
         164   2 _____U_____   1   0
         166   2 M__ARU_____   4   3
         168   1 M__ARU_____  12  11
         169   1 M__ARU_____  10   9
         170   1 M__ARU_____   4   3
         171   3 M__ARU_____   3   2
         174   6 ___ARU_____   1   0
         180   1 _____U_____   1   0
         181   9 ___ARU_____   1   0
         190   1 M__ARU_____   4   3
         191   1 ___A_U_____   1   0
         192   1 _____U_____   1   0
         193   1 ___A_U_____   1   0
         194   1 M__ARU_____  30  29
         195   1 M__ARU_____  27  26
         196   1 M__ARU_____  17  16
         197   2 _____U_____   1   0
         199   1 M__ARU_____  27  26
         200   1 M__ARU_____  25  24
         201   1 M__ARU_____   2   1
         202   1 M__ARU_____   9   8
         203   1 M__ARU_____  26  25
         204   1 M__ARU_____  14  13
         205   1 M__ARU_____   4   3
         206   1 M__ARU_____  18  17
         207   1 M__ARU_____  26  25
         208   1 M__ARU_____  22  21
         209   1 M__ARU_____   2   1
         210   1 M__ARU_____   3   2
         211   2 M____U_____   2   1
         213   5 _____U_____   1   0
         218   1 ___A_U_____   1   0

> According to http://kerneltrap.org/node/2157 it sped up cold boot by
> "10%", whatever that means.  Seems that I wasn't sufficiently impressed
> by that and got distracted.
>
> I'm not sure any of that was very useful, really.  A full-on coldboot
> optimiser really wants visibility into every disk block which need to
> be read, and then mechanisms to tell the kernel to load those blocks
> into the correct address_spaces.  That's hard, because file data
> depends on file metadata.  A vast simplification would be to do it in
> two disk passes: read all the metadata on pass 1 then all the data on
> pass 2.

Yes, that is what this patchset tries to do.

> A totally different approach is to reorder all the data and metadata
> on-disk, so no special cold-boot processing is needed at all.

The boot time speedup mentioned in the changelog won't be possible
without the physical data/metadata reordering. Fortunately btrfs makes
it a trivial task.
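
As an aside on the dump format shown earlier: a user space tool could
split the output into file records and page records roughly as below.
This is only a sketch and not part of the patch. It assumes the trimmed
layout from the example (8 whitespace-separated fields per file line, 5
per page line, no ftrace header or per-line prefix) and a command name
without spaces. The names parse_dump and CachedFile are made up for
illustration.

```python
import re
from dataclasses import dataclass, field

# Page-flags column as it appears in the example, e.g. "M__ARU_____".
FLAGS_RE = re.compile(r"[A-Za-z_]+")


@dataclass
class CachedFile:
    # Field names follow the "file lines" legend of the example output.
    ino: int
    size: int
    cached: int
    age_ms: int
    dirty: str
    ftype: str
    opened_by: str
    name: str
    # Each entry: (index, len, page_flags, count, mapcount)
    pages: list = field(default_factory=list)


def parse_dump(lines):
    """Group page lines under the file line that precedes them."""
    files = []
    for raw in lines:
        line = raw.strip()
        parts = line.split()
        # A page line has exactly 5 fields and a non-numeric flags column.
        if len(parts) == 5 and FLAGS_RE.fullmatch(parts[2]):
            if files:  # ignore page lines that precede any file line
                index, length, flags, count, mapcount = parts
                files[-1].pages.append(
                    (int(index), int(length), flags, int(count), int(mapcount))
                )
            continue
        # A file line has 8 fields; the file name comes last and may
        # contain spaces, so split at most 7 times.
        fields = line.split(None, 7)
        if len(fields) == 8 and fields[0].isdigit():
            ino, size, cached, age = (int(x) for x in fields[:4])
            files.append(
                CachedFile(ino, size, cached, age,
                           fields[4], fields[5], fields[6], fields[7])
            )
    return files
```

For the /sbin/init record in the example, the page lengths add up to
6 + 1 + 1 + 2 = 10 pages, which agrees with the cached column (40960
bytes) if pages are 4096 bytes.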
> And a third approach is to save all the cache into a special
> file/partition/etc and to preload all that into kernel data structures
> at boot.  Obviously this one is ricky/tricky because the on-disk
> replica of the real data can get out of sync with the real data.

Hah! We are thinking much alike :)

It's a very good optimization for LiveCDs and readonly mounted NFS /usr.

For a typical desktop, the solution I have in mind is to install an
initscript that runs at halt/reboot time, after all other tasks have
been killed and the filesystems remounted readonly. At that point it can
dump whatever is in the page cache to the swap partition. At the next
boot, the data/metadata can then be read back _perfectly sequentially_
to populate the page cache.

For a kexec based reboot, the data can even be passed to the next kernel
directly, avoiding the disk IO entirely.

Thanks,
Fengguang