From: Wu Fengguang <fengguang.wu-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
To: "Li, Shaohua" <shaohua.li-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Cc: "linux-btrfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org"
<linux-btrfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
"linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org"
<linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
Chris Mason <chris.mason-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>,
Christoph Hellwig <hch-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>,
Andrew Morton
<akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>,
Arjan van de Ven <arjan-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>,
"Yan, Zheng" <zheng.z.yan-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>,
"linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org"
<linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
"mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org"
<mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Subject: Re: [PATCH v2 0/5] add new ioctls to do metadata readahead in btrfs
Date: Tue, 11 Jan 2011 17:13:53 +0800
Message-ID: <20110111091353.GA13753@localhost>
In-Reply-To: <1294716453.1949.625.camel@sli10-conroe>
On Tue, Jan 11, 2011 at 11:27:33AM +0800, Li, Shaohua wrote:
> On Tue, 2011-01-11 at 11:07 +0800, Wu, Fengguang wrote:
> > On Tue, Jan 11, 2011 at 10:03:16AM +0800, Li, Shaohua wrote:
> > > On Tue, 2011-01-11 at 09:38 +0800, Wu, Fengguang wrote:
> > > > On Tue, Jan 11, 2011 at 08:15:19AM +0800, Li, Shaohua wrote:
> > > > > On Mon, 2011-01-10 at 22:26 +0800, Wu, Fengguang wrote:
> > > > > > Shaohua,
> > > > > >
> > > > > > On Tue, Jan 04, 2011 at 01:40:30PM +0800, Li, Shaohua wrote:
> > > > > > > Hi,
> > > > > > > > We have file readahead to do async file reads, but have no metadata
> > > > > > > > readahead. For a list of files, their metadata is stored in fragmented
> > > > > > > > disk space and metadata read is a sync operation, which hurts the
> > > > > > > > efficiency of readahead a lot. The patches try to add metadata readahead
> > > > > > > > for btrfs.
> > > > > > > In btrfs, metadata is stored in btree_inode. Ideally, if we could hook
> > > > > > > the inode to a fd so we could use existing syscalls (readahead, mincore
> > > > > > > or upcoming fincore) to do readahead, but the inode is hidden, there is
> > > > > > > no easy way for this from my understanding. So we add two ioctls for
> > > > > >
> > > > > > If that is the main obstacle, why not do straightforward fincore()/
> > > > > > fadvise(), and add ioctls to btrfs to export/grab the hidden
> > > > > > btree_inode in any form? This will address btrfs' specific issue, and
> > > > > > have the benefit of making the VFS part general enough. You know
> > > > > > ext2/3/4 already have block_dev ready for metadata readahead.
> > > > > > I forgot to update this comment. Please see patch 2 and patch 4, both
> > > > > > incore and readahead need btrfs specific stuff involved, so we can't use
> > > > > > generic fincore or something.
> > > >
> > > > You can if you like :)
> > > >
> > > > - fincore() can return the referenced bit, which is generally
> > > > useful information
> > > metadata page in ext2/3 doesn't have reference bit set, while btrfs has.
> > > we can't blindly filter out such pages with the bit.
> >
> > block_dev inodes have the accessed bits. Look at the below output.
> >
> > /dev/sda5 is a mounted ext4 partition. The 'A'/'R' in the
> > dump_page_cache lines stand for Active/Referenced.
> ext4 already does readahead? please check other filesystems.
ext3/4 do readahead when accessing large directories. However, that is
orthogonal to user space metadata readahead, which is still important
for fast boot on ext3/4.
> filesystems use a bread-like API to read metadata, which definitely
> doesn't set the referenced bit.
__find_get_block() will call touch_buffer(), which is a synonym for
mark_page_accessed().
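
To make the effect of repeated mark_page_accessed() calls concrete, here
is a toy userspace model of the transitions described in the comment above
that function in mm/swap.c. It is only a sketch: the Page class, its field
names and the flags() helper are made up for illustration, and the real
function also deals with LRU and unevictable state that is ignored here.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Toy stand-in for the two LRU-related page flags (hypothetical names)."""
    active: bool = False
    referenced: bool = False


def mark_page_accessed(page: Page) -> None:
    """Simplified model of the transitions documented in mm/swap.c:

    inactive,unreferenced -> inactive,referenced
    inactive,referenced   -> active,unreferenced
    active,unreferenced   -> active,referenced
    """
    if not page.active and page.referenced:
        # Second touch on an inactive page promotes it to the active list.
        page.active = True
        page.referenced = False
    elif not page.referenced:
        # First touch (on either list) only sets the referenced bit.
        page.referenced = True
    # An active,referenced page is left unchanged.


def flags(page: Page) -> str:
    """Render the state with the 'A'/'R' letters used by dump_page_cache."""
    return ("A" if page.active else "_") + ("R" if page.referenced else "_")
```

Touching a fresh Page three times walks it through "_R", "A_" and "AR",
which is why a metadata page read through __find_get_block() more than once
can show up as Active and/or Referenced.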
> > root@bay /home/wfg# echo /dev/sda5 > /debug/tracing/objects/mm/pages/dump-file
> > root@bay /home/wfg# cat /debug/tracing/trace
> > # tracer: nop
> > #
> > # TASK-PID CPU# TIMESTAMP FUNCTION
> > # | | | | |
> > zsh-2950 [003] 879.500764: dump_inode_cache: 0 55643986944 1703936 21879 D___ BLK mount /dev/sda5
> > zsh-2950 [003] 879.500774: dump_page_cache: 0 2 ___AR_____P 2 0
> > zsh-2950 [003] 879.500776: dump_page_cache: 2 3 ____R_____P 2 0
> > zsh-2950 [003] 879.500777: dump_page_cache: 1026 5 ___AR_____P 2 0
> > zsh-2950 [003] 879.500778: dump_page_cache: 1031 3 ___A______P 2 0
> > zsh-2950 [003] 879.500779: dump_page_cache: 1034 1 ___AR_____P 2 0
> > zsh-2950 [003] 879.500780: dump_page_cache: 1035 2 ___A______P 2 0
> > zsh-2950 [003] 879.500781: dump_page_cache: 1037 1 ___AR_____P 2 0
> > zsh-2950 [003] 879.500782: dump_page_cache: 1038 3 ____R_____P 2 0
> > zsh-2950 [003] 879.500782: dump_page_cache: 1041 1 ___A______P 2 0
> > zsh-2950 [003] 879.500783: dump_page_cache: 1057 1 ___AR_D___P 2 0
> > zsh-2950 [003] 879.500788: dump_page_cache: 1058 6 ___A______P 2 0
> > zsh-2950 [003] 879.500788: dump_page_cache: 9249 1 ___AR_____P 2 0
> > zsh-2950 [003] 879.500789: dump_page_cache: 524289 1 ____R_____P 2 0
> > zsh-2950 [003] 879.500790: dump_page_cache: 524290 2 ___A______P 2 0
> > zsh-2950 [003] 879.500790: dump_page_cache: 524292 1 ___AR_____P 2 0
> > zsh-2950 [003] 879.500791: dump_page_cache: 524293 1 ___A______P 2 0
> > zsh-2950 [003] 879.500796: dump_page_cache: 524294 9 ____R_____P 2 0
> > zsh-2950 [003] 879.500797: dump_page_cache: 524303 1 ___A______P 2 0
> > zsh-2950 [003] 879.500798: dump_page_cache: 987136 1 ___AR_____P 2 0
> > zsh-2950 [003] 879.500798: dump_page_cache: 1048576 1 ____R_____P 2 0
> > zsh-2950 [003] 879.500799: dump_page_cache: 1048577 2 ___A______P 2 0
> > zsh-2950 [003] 879.500800: dump_page_cache: 1048579 1 ___AR_____P 2 0
> > zsh-2950 [003] 879.500801: dump_page_cache: 1048580 5 ___A______P 2 0
> > zsh-2950 [003] 879.500802: dump_page_cache: 1048585 1 ___AR_____P 2 0
> > zsh-2950 [003] 879.500805: dump_page_cache: 1048586 5 ___A______P 2 0
> > zsh-2950 [003] 879.500805: dump_page_cache: 1048591 1 ___AR_____P 2 0
> > zsh-2950 [003] 879.500806: dump_page_cache: 1572864 1 ____R_____P 2 0
> > zsh-2950 [003] 879.500807: dump_page_cache: 1572865 5 ___A______P 2 0
> > zsh-2950 [003] 879.500808: dump_page_cache: 1572870 1 ___AR_____P 2 0
> > zsh-2950 [003] 879.500811: dump_page_cache: 1572871 6 ___A______P 2 0
> > zsh-2950 [003] 879.500812: dump_page_cache: 1572877 3 ____R_____P 2 0
> > zsh-2950 [003] 879.500816: dump_page_cache: 2097153 8 ____R_____P 2 0
> > zsh-2950 [003] 879.500817: dump_page_cache: 2097161 1 ___A______P 2 0
> > zsh-2950 [003] 879.500818: dump_page_cache: 2097162 4 ____R_____P 2 0
> > zsh-2950 [003] 879.500819: dump_page_cache: 6324224 1 ____R_D___P 2 0
> > zsh-2950 [003] 879.500820: dump_page_cache: 6324225 3 ___AR_____P 2 0
> > zsh-2950 [003] 879.500825: dump_page_cache: 6324228 29 ___A______P 2 0
> > zsh-2950 [003] 879.500826: dump_page_cache: 6324257 1 ____R_____P 2 0
> > zsh-2950 [003] 879.500828: dump_page_cache: 6324258 4 ___A______P 2 0
> > zsh-2950 [003] 879.500830: dump_page_cache: 6324262 11 ____R_____P 2 0
> > zsh-2950 [003] 879.500833: dump_page_cache: 6324273 16 ___AR_____P 2 0
> > zsh-2950 [003] 879.500833: dump_page_cache: 6324289 1 ___A______P 2 0
> > zsh-2950 [003] 879.500834: dump_page_cache: 6324290 2 ___AR_____P 2 0
> > zsh-2950 [003] 879.500835: dump_page_cache: 6324292 8 ___A______P 2 0
> > zsh-2950 [003] 879.500836: dump_page_cache: 6324300 2 ___AR_____P 2 0
> > zsh-2950 [003] 879.500837: dump_page_cache: 6324302 3 ___A______P 2 0
> > zsh-2950 [003] 879.500838: dump_page_cache: 6324305 4 ____R_____P 2 0
> > zsh-2950 [003] 879.500843: dump_page_cache: 6324309 28 ___AR_____P 2 0
> > zsh-2950 [003] 879.500844: dump_page_cache: 6324337 4 ___A______P 2 0
> > zsh-2950 [003] 879.500845: dump_page_cache: 6324341 2 ____R_____P 2 0
> > zsh-2950 [003] 879.500850: dump_page_cache: 6324343 30 ___AR_____P 2 0
> > zsh-2950 [003] 879.500851: dump_page_cache: 6324373 2 ___A______P 2 0
> > zsh-2950 [003] 879.500852: dump_page_cache: 6324375 2 ___AR_____P 2 0
> > zsh-2950 [003] 879.500853: dump_page_cache: 6324377 9 ___A______P 2 0
> > zsh-2950 [003] 879.500854: dump_page_cache: 6324386 2 ___AR_____P 2 0
> > zsh-2950 [003] 879.500855: dump_page_cache: 6324388 5 ___A______P 2 0
> > zsh-2950 [003] 879.500856: dump_page_cache: 6324393 3 ___AR_____P 2 0
> > zsh-2950 [003] 879.500858: dump_page_cache: 6324396 11 ___A______P 2 0
> > zsh-2950 [003] 879.500859: dump_page_cache: 6324407 1 ____R_____P 2 0
> > zsh-2950 [003] 879.500864: dump_page_cache: 6324408 31 ___AR_____P 2 0
> > zsh-2950 [003] 879.500864: dump_page_cache: 6324439 1 ___A______P 2 0
> > zsh-2950 [003] 879.500865: dump_page_cache: 6324440 1 ____R_____P 2 0
> > zsh-2950 [003] 879.500866: dump_page_cache: 6324441 2 ___A______P 2 0
> > zsh-2950 [003] 879.500867: dump_page_cache: 6324443 5 ____R_____P 2 0
> > zsh-2950 [003] 879.500872: dump_page_cache: 6324448 26 ___AR_____P 2 0
> > zsh-2950 [003] 879.500873: dump_page_cache: 6324474 6 ___A______P 2 0
> > zsh-2950 [003] 879.500874: dump_page_cache: 6324480 4 ____R_____P 2 0
> > zsh-2950 [003] 879.500879: dump_page_cache: 6324484 28 ___AR_____P 2 0
> > zsh-2950 [003] 879.500880: dump_page_cache: 6324512 4 ___A______P 2 0
> > zsh-2950 [003] 879.500881: dump_page_cache: 6324516 1 ____R_____P 2 0
> > zsh-2950 [003] 879.500881: dump_page_cache: 6324517 1 ___A______P 2 0
> > zsh-2950 [003] 879.500882: dump_page_cache: 6324518 2 ___AR_____P 2 0
> > zsh-2950 [003] 879.500888: dump_page_cache: 6324520 28 ___A______P 2 0
> > zsh-2950 [003] 879.500890: dump_page_cache: 6324548 2 ____R_____P 2 0
> >
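
A dump like the one quoted above can be summarized with a small script.
The sketch below rests on assumptions: the first two numbers on each
dump_page_cache line are read as start page index and run length (inferred
from the runs being contiguous, e.g. 1026+5 = 1031), and the letters 'A'
and 'R' are assumed to appear in the flags field only for Active and
Referenced. The meaning of the two trailing numbers is not stated in this
thread, so they are ignored. summarize() is a hypothetical helper name.

```python
import re
from collections import Counter

# Assumed layout: "dump_page_cache: <start_page> <npages> <flags> ..."
LINE_RE = re.compile(r"dump_page_cache:\s+(\d+)\s+(\d+)\s+(\S+)")


def summarize(trace_text: str) -> Counter:
    """Count cached pages per Active/Referenced combination."""
    totals = Counter()
    for line in trace_text.splitlines():
        m = LINE_RE.search(line)
        if m is None:
            continue  # header, dump_inode_cache, or unrelated line
        npages = int(m.group(2))
        flag_str = m.group(3)
        key = ("A" if "A" in flag_str else "_") + ("R" if "R" in flag_str else "_")
        totals[key] += npages
    return totals
```

Feeding it the contents of /debug/tracing/trace gives one page count per
"AR", "A_", "_R" and "__" bucket.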
> > > fincore can take a parameter or return a bit to distinguish
> > > referenced pages, but I don't think it's a good API. This should be
> > > transparent to userspace.
> >
> > Users who care about the "cached" status may well be interested in the
> > "active/referenced" status. They are correlated information. fincore()
> > won't be a simple replication of mincore() anyway. fincore() has to
> > deal with huge sparsely accessed files. The accessed bits of a file
> > page are normally more meaningful than the accessed bits of mapped
> > (anonymous) pages.
> if all filesystems have the bit set, I'll buy-in. Otherwise, this isn't generic enough.
It's a reasonable thing to set the accessed bits. So I believe the
various filesystems are calling mark_page_accessed() on their metadata
inodes, or can be changed to do so.
> > Another option may be to use the above
> > /debug/tracing/objects/mm/pages/dump-file interface.
> >
> > > > - btrfs_metadata_readahead() can be passed to some (faked)
> > > > ->readpages() for use with fadvise.
> > > this needs a filesystem specific hook too, the difference is your proposal
> > > uses fadvise but I'm using ioctl. There isn't a big difference.
> >
> > True for btrfs. However they make big differences for other file systems.
> why?
The block_dev of ext2/3/4 can do metadata query/readahead directly
with fincore()+fadvise(), with no need for any additional ioctls.
Given that the vast majority of desktops are running ext2/3/4, it seems
worthwhile to have a straightforward solution for them.
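
As a rough userspace sketch of the fadvise() half (fincore() is still only
a proposal, so the query half is left out), the snippet below issues
POSIX_FADV_WILLNEED over a list of page ranges. readahead_ranges() and the
(start_page, npages) list format are hypothetical; for ext2/3/4 metadata
the path would be the block device, e.g. /dev/sda5, which normally needs
root to open.

```python
import os

PAGE_SIZE = os.sysconf("SC_PAGE_SIZE")


def readahead_ranges(path, ranges):
    """Ask the kernel to prefetch page ranges of `path`.

    `ranges` is an iterable of (start_page, npages) pairs, for example a
    list recorded during a previous boot. Returns the number of bytes hinted.
    """
    hinted = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        for start_page, npages in ranges:
            offset = start_page * PAGE_SIZE
            length = npages * PAGE_SIZE
            # POSIX_FADV_WILLNEED initiates a non-blocking read of the
            # range into the page cache.
            os.posix_fadvise(fd, offset, length, os.POSIX_FADV_WILLNEED)
            hinted += length
    finally:
        os.close(fd)
    return hinted
```

For instance, readahead_ranges("/dev/sda5", [(0, 2), (1026, 5)]) would hint
the first two runs of a dump_page_cache listing, assuming those numbers are
page indexes and run lengths.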
> > > BTW, it's hard to hook btrfs_inode to a fd even with an ioctl, at least I
> > > didn't find an easy way to do this. It might be possible to do this, for
> > > example by adding a fake device or fake fs (anon_inode doesn't work here,
> > > IIRC), which is a bit ugly. Before it's proved that a generic API can handle
> > > metadata readahead, I don't want to do it.
> >
> > Right, it could be hard to export btrfs_inode. I'm glad you spoke
> > out. If we cannot make it, it's valuable to point out the problem and
> > let everyone know the root cause for turning to an ioctl based workaround.
> > Then others will understand the design choices, and if lucky, join us
> > and help export the btrfs_inode.
> I didn't hide anything. I actually said this in the comments. This
> is what I said:
Ah, sorry for overlooking this message!
Thanks,
Fengguang
> In btrfs, metadata is stored in btree_inode. Ideally, if we could hook
> the inode to a fd so we could use existing syscalls (readahead, mincore
> or upcoming fincore) to do readahead, but the inode is hidden, there is
> no easy way for this from my understanding.
>
>
> Thanks,
> Shaohua
> > > > > > > > this. One is like readahead syscall, the other is like mincore/fincore
> > > > > > > syscall.
> > > > > > > > On a hard disk based netbook running MeeGo, metadata readahead
> > > > > > > > cut boot time by about 3.5s on average, out of a total of 16s.
> > > > > > > > Last time I posted similar patches to the btrfs mailing list, which added
> > > > > > > > the new ioctls in btrfs specific ioctl code. But Christoph Hellwig asked
> > > > > > > > that we have a generic interface to do this so other filesystems can share
> > > > > > > > some code, so I came up with the new one. Comments and suggestions are
> > > > > > > welcome!
> > > > > > >
> > > > > > > v1->v2:
> > > > > > > > 1. Added more comments and fixed return values as suggested by Andrew Morton
> > > > > > > 2. fix a race condition pointed out by Yan Zheng
> > > > > > >
> > > > > > > initial post:
> > > > > > > http://marc.info/?l=linux-fsdevel&m=129222493406353&w=2
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Shaohua
> > > > > > >
> > > > > > > --
> > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> > > > > > > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> > > > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html
> > > > >
> > > > >
> > >
> > >
>
>