* [PATCH v2 0/5] add new ioctls to do metadata readahead in btrfs
@ 2011-01-04 5:40 Shaohua Li
2011-01-04 16:14 ` Jeff Moyer
2011-01-10 14:26 ` Wu Fengguang
0 siblings, 2 replies; 17+ messages in thread
From: Shaohua Li @ 2011-01-04 5:40 UTC (permalink / raw)
To: linux-btrfs@vger.kernel.org, linux-fsdevel@vger.kernel.org
Cc: Chris Mason, Christoph Hellwig, Andrew Morton, Arjan van de Ven,
Yan, Zheng, linux-api, mtk.manpages
Hi,
We have file readahead to do async file reads, but we have no metadata
readahead. For a list of files, their metadata is stored in fragmented
disk space and a metadata read is a sync operation, which hurts the
efficiency of readahead a lot. These patches try to add metadata
readahead for btrfs.
In btrfs, metadata is stored in btree_inode. Ideally, we could hook the
inode to a fd so that we could use existing syscalls (readahead, mincore
or the upcoming fincore) to do readahead, but the inode is hidden and,
as far as I understand, there is no easy way to do this. So we add two
ioctls for this. One is like the readahead syscall, the other is like
the mincore/fincore syscall.
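For comparison, the existing data-side interface mentioned above can be used from userspace roughly as follows. This is only a sketch of readahead(2) on an ordinary file; the helper name prefetch_file is hypothetical, and the proposed metadata ioctls are not shown because this cover letter does not give their names or numbers.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: ask the kernel to start reading a whole regular
 * file into the page cache. Returns 0 on success, -1 on error.
 */
int prefetch_file(const char *path)
{
    struct stat st;
    ssize_t ret;
    int fd;

    assert(path != NULL);

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }

    /*
     * readahead(2) tries to schedule the data reads in the background,
     * but it may still block while reading the metadata needed to
     * locate the blocks.
     */
    ret = readahead(fd, 0, (size_t)st.st_size);
    close(fd);
    return ret < 0 ? -1 : 0;
}
```

The call needs an open fd for the object being read, which is exactly what is missing for the hidden btree_inode.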
On a hard-disk-based netbook running MeeGo, metadata readahead reduced
boot time by about 3.5s on average, out of a total of 16s.
Last time I posted similar patches to the btrfs mailing list, which
added the new ioctls in btrfs-specific ioctl code. But Christoph Hellwig
asked for a generic interface so that other filesystems can share some
code, so I came up with this new version. Comments and suggestions are
welcome!
v1->v2:
1. Added more comments and fixed return values as suggested by Andrew Morton
2. Fixed a race condition pointed out by Yan Zheng
initial post:
http://marc.info/?l=linux-fsdevel&m=129222493406353&w=2
Thanks,
Shaohua
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2 0/5] add new ioctls to do metadata readahead in btrfs
2011-01-04 5:40 [PATCH v2 0/5] add new ioctls to do metadata readahead in btrfs Shaohua Li
@ 2011-01-04 16:14 ` Jeff Moyer
[not found] ` <x498vz0abov.fsf-RRHT56Q3PSP4kTEheFKJxxDDeQx5vsVwAInAS/Ez/D0@public.gmane.org>
2011-01-10 14:26 ` Wu Fengguang
1 sibling, 1 reply; 17+ messages in thread
From: Jeff Moyer @ 2011-01-04 16:14 UTC (permalink / raw)
To: Shaohua Li
Cc: linux-btrfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
Chris Mason, Christoph Hellwig, Andrew Morton, Arjan van de Ven,
Yan, Zheng, linux-api, mtk.manpages
Shaohua Li <shaohua.li-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> writes:
> Hi,
> We have file readahead to do async file reads, but we have no metadata
> readahead. For a list of files, their metadata is stored in fragmented
> disk space and a metadata read is a sync operation, which hurts the
> efficiency of readahead a lot. These patches try to add metadata
> readahead for btrfs.
> In btrfs, metadata is stored in btree_inode. Ideally, we could hook the
> inode to a fd so that we could use existing syscalls (readahead, mincore
> or the upcoming fincore) to do readahead, but the inode is hidden and,
> as far as I understand, there is no easy way to do this. So we add two
> ioctls for this. One is like the readahead syscall, the other is like
> the mincore/fincore syscall.
> On a hard-disk-based netbook running MeeGo, metadata readahead reduced
> boot time by about 3.5s on average, out of a total of 16s.
> Last time I posted similar patches to the btrfs mailing list, which
> added the new ioctls in btrfs-specific ioctl code. But Christoph Hellwig
> asked for a generic interface so that other filesystems can share some
> code, so I came up with this new version. Comments and suggestions are
> welcome!
Is it not possible to enhance the existing readahead mechanisms to work
on metadata as well? Is there some reason why metadata should be
fetched separately from the data it references?
Cheers,
Jeff
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2 0/5] add new ioctls to do metadata readahead in btrfs
[not found] ` <x498vz0abov.fsf-RRHT56Q3PSP4kTEheFKJxxDDeQx5vsVwAInAS/Ez/D0@public.gmane.org>
@ 2011-01-05 2:10 ` Shaohua Li
0 siblings, 0 replies; 17+ messages in thread
From: Shaohua Li @ 2011-01-05 2:10 UTC (permalink / raw)
To: Jeff Moyer
Cc: linux-btrfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
Chris Mason, Christoph Hellwig, Andrew Morton, Arjan van de Ven,
Yan, Zheng, linux-api@vger.kernel.org, mtk.manpages@gmail.com
On Wed, 2011-01-05 at 00:14 +0800, Jeff Moyer wrote:
> Shaohua Li <shaohua.li-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> writes:
>
> > Hi,
> > We have file readahead to do async file reads, but we have no metadata
> > readahead. For a list of files, their metadata is stored in fragmented
> > disk space and a metadata read is a sync operation, which hurts the
> > efficiency of readahead a lot. These patches try to add metadata
> > readahead for btrfs.
> > In btrfs, metadata is stored in btree_inode. Ideally, we could hook the
> > inode to a fd so that we could use existing syscalls (readahead, mincore
> > or the upcoming fincore) to do readahead, but the inode is hidden and,
> > as far as I understand, there is no easy way to do this. So we add two
> > ioctls for this. One is like the readahead syscall, the other is like
> > the mincore/fincore syscall.
> > On a hard-disk-based netbook running MeeGo, metadata readahead reduced
> > boot time by about 3.5s on average, out of a total of 16s.
> > Last time I posted similar patches to the btrfs mailing list, which
> > added the new ioctls in btrfs-specific ioctl code. But Christoph Hellwig
> > asked for a generic interface so that other filesystems can share some
> > code, so I came up with this new version. Comments and suggestions are
> > welcome!
>
> Is it not possible to enhance the existing readahead mechanisms to work
> on metadata as well?
Using the existing sys_readahead to do metadata readahead? The problem
is that I can't hook a fd to the metadata inode, as explained above.
Or letting the kernel do metadata readahead automatically? The kernel
can't be that intelligent, because it doesn't even know which files
should be read ahead until userspace tells it.
> Is there some reason why metadata should be
> fetched separately from the data it references?
A metadata read is a sync operation, which breaks the data readahead
pipeline. Also, metadata and data usually live in non-adjacent disk
blocks, which introduces a lot of disk seeks. Reading the metadata first
and then doing data readahead reduces the number of seeks, and the data
readahead pipeline can be kept full.
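The "metadata first, then data" ordering described above can be sketched from userspace with existing calls. This is only an illustration of the ordering, not the proposed ioctl: stat(2) stands in for the metadata pass and is still synchronous, which is the cost the patches aim to remove. The helper name warm_files is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical two-pass warm-up over a list of paths.
 * Pass 1 touches metadata only, pass 2 queues data readahead.
 * Returns the number of files for which data readahead was queued.
 */
size_t warm_files(const char *const *paths, size_t n)
{
    struct stat st;
    size_t i, queued = 0;

    /* Pass 1: metadata. Each stat() may block on a metadata read. */
    for (i = 0; i < n; i++)
        (void)stat(paths[i], &st);

    /* Pass 2: data. POSIX_FADV_WILLNEED starts a non-blocking read. */
    for (i = 0; i < n; i++) {
        int fd = open(paths[i], O_RDONLY);

        if (fd < 0)
            continue;
        /* A length of 0 means "until the end of the file". */
        if (posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED) == 0)
            queued++;
        close(fd);
    }
    return queued;
}
```

With the metadata already cached by the first pass, the second pass should not need to stall on metadata lookups between data requests.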
Thanks,
Shaohua
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2 0/5] add new ioctls to do metadata readahead in btrfs
2011-01-04 5:40 [PATCH v2 0/5] add new ioctls to do metadata readahead in btrfs Shaohua Li
2011-01-04 16:14 ` Jeff Moyer
@ 2011-01-10 14:26 ` Wu Fengguang
2011-01-11 0:15 ` Shaohua Li
1 sibling, 1 reply; 17+ messages in thread
From: Wu Fengguang @ 2011-01-10 14:26 UTC (permalink / raw)
To: Shaohua Li
Cc: linux-btrfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
Chris Mason, Christoph Hellwig, Andrew Morton, Arjan van de Ven,
Yan, Zheng, linux-api, mtk.manpages
Shaohua,
On Tue, Jan 04, 2011 at 01:40:30PM +0800, Li, Shaohua wrote:
> Hi,
> We have file readahead to do async file reads, but we have no metadata
> readahead. For a list of files, their metadata is stored in fragmented
> disk space and a metadata read is a sync operation, which hurts the
> efficiency of readahead a lot. These patches try to add metadata
> readahead for btrfs.
> In btrfs, metadata is stored in btree_inode. Ideally, we could hook the
> inode to a fd so that we could use existing syscalls (readahead, mincore
> or the upcoming fincore) to do readahead, but the inode is hidden and,
> as far as I understand, there is no easy way to do this. So we add two
> ioctls for
If that is the main obstacle, why not do straightforward fincore()/
fadvise(), and add ioctls to btrfs to export/grab the hidden
btree_inode in any form? This will address btrfs' specific issue, and
have the benefit of making the VFS part general enough. You know
ext2/3/4 already have block_dev ready for metadata readahead.
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2 0/5] add new ioctls to do metadata readahead in btrfs
2011-01-10 14:26 ` Wu Fengguang
@ 2011-01-11 0:15 ` Shaohua Li
2011-01-11 1:38 ` Wu Fengguang
0 siblings, 1 reply; 17+ messages in thread
From: Shaohua Li @ 2011-01-11 0:15 UTC (permalink / raw)
To: Wu, Fengguang
Cc: linux-btrfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
Chris Mason, Christoph Hellwig, Andrew Morton, Arjan van de Ven,
Yan, Zheng, linux-api@vger.kernel.org, mtk.manpages@gmail.com
On Mon, 2011-01-10 at 22:26 +0800, Wu, Fengguang wrote:
> Shaohua,
>
> On Tue, Jan 04, 2011 at 01:40:30PM +0800, Li, Shaohua wrote:
> > Hi,
> > We have file readahead to do async file reads, but we have no metadata
> > readahead. For a list of files, their metadata is stored in fragmented
> > disk space and a metadata read is a sync operation, which hurts the
> > efficiency of readahead a lot. These patches try to add metadata
> > readahead for btrfs.
> > In btrfs, metadata is stored in btree_inode. Ideally, we could hook the
> > inode to a fd so that we could use existing syscalls (readahead, mincore
> > or the upcoming fincore) to do readahead, but the inode is hidden and,
> > as far as I understand, there is no easy way to do this. So we add two
> > ioctls for
>
> If that is the main obstacle, why not do straightforward fincore()/
> fadvise(), and add ioctls to btrfs to export/grab the hidden
> btree_inode in any form? This will address btrfs' specific issue, and
> have the benefit of making the VFS part general enough. You know
> ext2/3/4 already have block_dev ready for metadata readahead.
I forgot to update this comment. Please see patch 2 and patch 4: both
incore and readahead need btrfs-specific stuff involved, so we can't use
generic fincore or something similar.
Thanks,
Shaohua
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2 0/5] add new ioctls to do metadata readahead in btrfs
2011-01-11 0:15 ` Shaohua Li
@ 2011-01-11 1:38 ` Wu Fengguang
2011-01-11 2:03 ` Shaohua Li
0 siblings, 1 reply; 17+ messages in thread
From: Wu Fengguang @ 2011-01-11 1:38 UTC (permalink / raw)
To: Li, Shaohua
Cc: linux-btrfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
Chris Mason, Christoph Hellwig, Andrew Morton, Arjan van de Ven,
Yan, Zheng, linux-api@vger.kernel.org, mtk.manpages@gmail.com
On Tue, Jan 11, 2011 at 08:15:19AM +0800, Li, Shaohua wrote:
> On Mon, 2011-01-10 at 22:26 +0800, Wu, Fengguang wrote:
> > Shaohua,
> >
> > On Tue, Jan 04, 2011 at 01:40:30PM +0800, Li, Shaohua wrote:
> > > Hi,
> > > We have file readahead to do async file reads, but we have no metadata
> > > readahead. For a list of files, their metadata is stored in fragmented
> > > disk space and a metadata read is a sync operation, which hurts the
> > > efficiency of readahead a lot. These patches try to add metadata
> > > readahead for btrfs.
> > > In btrfs, metadata is stored in btree_inode. Ideally, we could hook the
> > > inode to a fd so that we could use existing syscalls (readahead, mincore
> > > or the upcoming fincore) to do readahead, but the inode is hidden and,
> > > as far as I understand, there is no easy way to do this. So we add two
> > > ioctls for
> >
> > If that is the main obstacle, why not do straightforward fincore()/
> > fadvise(), and add ioctls to btrfs to export/grab the hidden
> > btree_inode in any form? This will address btrfs' specific issue, and
> > have the benefit of making the VFS part general enough. You know
> > ext2/3/4 already have block_dev ready for metadata readahead.
> I forgot to update this comment. Please see patch 2 and patch 4: both
> incore and readahead need btrfs-specific stuff involved, so we can't use
> generic fincore or something similar.
You can if you like :)
- fincore() can return the referenced bit, which is generally
useful information
- btrfs_metadata_readahead() can be passed to some (faked)
->readpages() for use with fadvise.
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2 0/5] add new ioctls to do metadata readahead in btrfs
2011-01-11 1:38 ` Wu Fengguang
@ 2011-01-11 2:03 ` Shaohua Li
2011-01-11 3:07 ` Wu Fengguang
0 siblings, 1 reply; 17+ messages in thread
From: Shaohua Li @ 2011-01-11 2:03 UTC (permalink / raw)
To: Wu, Fengguang
Cc: linux-btrfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
Chris Mason, Christoph Hellwig, Andrew Morton, Arjan van de Ven,
Yan, Zheng, linux-api@vger.kernel.org, mtk.manpages@gmail.com
On Tue, 2011-01-11 at 09:38 +0800, Wu, Fengguang wrote:
> On Tue, Jan 11, 2011 at 08:15:19AM +0800, Li, Shaohua wrote:
> > On Mon, 2011-01-10 at 22:26 +0800, Wu, Fengguang wrote:
> > > Shaohua,
> > >
> > > On Tue, Jan 04, 2011 at 01:40:30PM +0800, Li, Shaohua wrote:
> > > > Hi,
> > > > We have file readahead to do async file reads, but we have no metadata
> > > > readahead. For a list of files, their metadata is stored in fragmented
> > > > disk space and a metadata read is a sync operation, which hurts the
> > > > efficiency of readahead a lot. These patches try to add metadata
> > > > readahead for btrfs.
> > > > In btrfs, metadata is stored in btree_inode. Ideally, we could hook the
> > > > inode to a fd so that we could use existing syscalls (readahead, mincore
> > > > or the upcoming fincore) to do readahead, but the inode is hidden and,
> > > > as far as I understand, there is no easy way to do this. So we add two
> > > > ioctls for
> > >
> > > If that is the main obstacle, why not do straightforward fincore()/
> > > fadvise(), and add ioctls to btrfs to export/grab the hidden
> > > btree_inode in any form? This will address btrfs' specific issue, and
> > > have the benefit of making the VFS part general enough. You know
> > > ext2/3/4 already have block_dev ready for metadata readahead.
> > I forgot to update this comment. Please see patch 2 and patch 4: both
> > incore and readahead need btrfs-specific stuff involved, so we can't use
> > generic fincore or something similar.
>
> You can if you like :)
>
> - fincore() can return the referenced bit, which is generally
> useful information
Metadata pages in ext2/3 don't have the referenced bit set, while those
in btrfs do, so we can't blindly filter out such pages using the bit.
fincore could take a parameter, or return a bit, to distinguish
referenced pages, but I don't think that's a good API. This should be
transparent to userspace.
> - btrfs_metadata_readahead() can be passed to some (faked)
> ->readpages() for use with fadvise.
This needs a filesystem-specific hook too. The difference is that your
proposal uses fadvise while I'm using an ioctl, which isn't a big
difference.
BTW, it's hard to hook btrfs_inode to a fd even with an ioctl; at least
I didn't find an easy way to do it. It might be possible by, for
example, adding a fake device or a fake fs (anon_inode doesn't work
here, IIRC), which is a bit ugly. Until it's proven that a generic API
can handle metadata readahead, I don't want to do it.
Thanks,
Shaohua
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2 0/5] add new ioctls to do metadata readahead in btrfs
2011-01-11 2:03 ` Shaohua Li
@ 2011-01-11 3:07 ` Wu Fengguang
2011-01-11 3:27 ` Shaohua Li
0 siblings, 1 reply; 17+ messages in thread
From: Wu Fengguang @ 2011-01-11 3:07 UTC (permalink / raw)
To: Li, Shaohua
Cc: linux-btrfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
Chris Mason, Christoph Hellwig, Andrew Morton, Arjan van de Ven,
Yan, Zheng, linux-api@vger.kernel.org, mtk.manpages@gmail.com
On Tue, Jan 11, 2011 at 10:03:16AM +0800, Li, Shaohua wrote:
> On Tue, 2011-01-11 at 09:38 +0800, Wu, Fengguang wrote:
> > On Tue, Jan 11, 2011 at 08:15:19AM +0800, Li, Shaohua wrote:
> > > On Mon, 2011-01-10 at 22:26 +0800, Wu, Fengguang wrote:
> > > > Shaohua,
> > > >
> > > > On Tue, Jan 04, 2011 at 01:40:30PM +0800, Li, Shaohua wrote:
> > > > > Hi,
> > > > > We have file readahead to do async file reads, but we have no metadata
> > > > > readahead. For a list of files, their metadata is stored in fragmented
> > > > > disk space and a metadata read is a sync operation, which hurts the
> > > > > efficiency of readahead a lot. These patches try to add metadata
> > > > > readahead for btrfs.
> > > > > In btrfs, metadata is stored in btree_inode. Ideally, we could hook the
> > > > > inode to a fd so that we could use existing syscalls (readahead, mincore
> > > > > or the upcoming fincore) to do readahead, but the inode is hidden and,
> > > > > as far as I understand, there is no easy way to do this. So we add two
> > > > > ioctls for
> > > >
> > > > If that is the main obstacle, why not do straightforward fincore()/
> > > > fadvise(), and add ioctls to btrfs to export/grab the hidden
> > > > btree_inode in any form? This will address btrfs' specific issue, and
> > > > have the benefit of making the VFS part general enough. You know
> > > > ext2/3/4 already have block_dev ready for metadata readahead.
> > > I forgot to update this comment. Please see patch 2 and patch 4: both
> > > incore and readahead need btrfs-specific stuff involved, so we can't use
> > > generic fincore or something similar.
> >
> > You can if you like :)
> >
> > - fincore() can return the referenced bit, which is generally
> > useful information
> Metadata pages in ext2/3 don't have the referenced bit set, while those
> in btrfs do, so we can't blindly filter out such pages using the bit.
block_dev inodes have the accessed bits. Look at the below output.
/dev/sda5 is a mounted ext4 partition. The 'A'/'R' in the
dump_page_cache lines stand for Active/Referenced.
root@bay /home/wfg# echo /dev/sda5 > /debug/tracing/objects/mm/pages/dump-file
root@bay /home/wfg# cat /debug/tracing/trace
# tracer: nop
#
# TASK-PID CPU# TIMESTAMP FUNCTION
# | | | | |
zsh-2950 [003] 879.500764: dump_inode_cache: 0 55643986944 1703936 21879 D___ BLK mount /dev/sda5
zsh-2950 [003] 879.500774: dump_page_cache: 0 2 ___AR_____P 2 0
zsh-2950 [003] 879.500776: dump_page_cache: 2 3 ____R_____P 2 0
zsh-2950 [003] 879.500777: dump_page_cache: 1026 5 ___AR_____P 2 0
zsh-2950 [003] 879.500778: dump_page_cache: 1031 3 ___A______P 2 0
zsh-2950 [003] 879.500779: dump_page_cache: 1034 1 ___AR_____P 2 0
zsh-2950 [003] 879.500780: dump_page_cache: 1035 2 ___A______P 2 0
zsh-2950 [003] 879.500781: dump_page_cache: 1037 1 ___AR_____P 2 0
zsh-2950 [003] 879.500782: dump_page_cache: 1038 3 ____R_____P 2 0
zsh-2950 [003] 879.500782: dump_page_cache: 1041 1 ___A______P 2 0
zsh-2950 [003] 879.500783: dump_page_cache: 1057 1 ___AR_D___P 2 0
zsh-2950 [003] 879.500788: dump_page_cache: 1058 6 ___A______P 2 0
zsh-2950 [003] 879.500788: dump_page_cache: 9249 1 ___AR_____P 2 0
zsh-2950 [003] 879.500789: dump_page_cache: 524289 1 ____R_____P 2 0
zsh-2950 [003] 879.500790: dump_page_cache: 524290 2 ___A______P 2 0
zsh-2950 [003] 879.500790: dump_page_cache: 524292 1 ___AR_____P 2 0
zsh-2950 [003] 879.500791: dump_page_cache: 524293 1 ___A______P 2 0
zsh-2950 [003] 879.500796: dump_page_cache: 524294 9 ____R_____P 2 0
zsh-2950 [003] 879.500797: dump_page_cache: 524303 1 ___A______P 2 0
zsh-2950 [003] 879.500798: dump_page_cache: 987136 1 ___AR_____P 2 0
zsh-2950 [003] 879.500798: dump_page_cache: 1048576 1 ____R_____P 2 0
zsh-2950 [003] 879.500799: dump_page_cache: 1048577 2 ___A______P 2 0
zsh-2950 [003] 879.500800: dump_page_cache: 1048579 1 ___AR_____P 2 0
zsh-2950 [003] 879.500801: dump_page_cache: 1048580 5 ___A______P 2 0
zsh-2950 [003] 879.500802: dump_page_cache: 1048585 1 ___AR_____P 2 0
zsh-2950 [003] 879.500805: dump_page_cache: 1048586 5 ___A______P 2 0
zsh-2950 [003] 879.500805: dump_page_cache: 1048591 1 ___AR_____P 2 0
zsh-2950 [003] 879.500806: dump_page_cache: 1572864 1 ____R_____P 2 0
zsh-2950 [003] 879.500807: dump_page_cache: 1572865 5 ___A______P 2 0
zsh-2950 [003] 879.500808: dump_page_cache: 1572870 1 ___AR_____P 2 0
zsh-2950 [003] 879.500811: dump_page_cache: 1572871 6 ___A______P 2 0
zsh-2950 [003] 879.500812: dump_page_cache: 1572877 3 ____R_____P 2 0
zsh-2950 [003] 879.500816: dump_page_cache: 2097153 8 ____R_____P 2 0
zsh-2950 [003] 879.500817: dump_page_cache: 2097161 1 ___A______P 2 0
zsh-2950 [003] 879.500818: dump_page_cache: 2097162 4 ____R_____P 2 0
zsh-2950 [003] 879.500819: dump_page_cache: 6324224 1 ____R_D___P 2 0
zsh-2950 [003] 879.500820: dump_page_cache: 6324225 3 ___AR_____P 2 0
zsh-2950 [003] 879.500825: dump_page_cache: 6324228 29 ___A______P 2 0
zsh-2950 [003] 879.500826: dump_page_cache: 6324257 1 ____R_____P 2 0
zsh-2950 [003] 879.500828: dump_page_cache: 6324258 4 ___A______P 2 0
zsh-2950 [003] 879.500830: dump_page_cache: 6324262 11 ____R_____P 2 0
zsh-2950 [003] 879.500833: dump_page_cache: 6324273 16 ___AR_____P 2 0
zsh-2950 [003] 879.500833: dump_page_cache: 6324289 1 ___A______P 2 0
zsh-2950 [003] 879.500834: dump_page_cache: 6324290 2 ___AR_____P 2 0
zsh-2950 [003] 879.500835: dump_page_cache: 6324292 8 ___A______P 2 0
zsh-2950 [003] 879.500836: dump_page_cache: 6324300 2 ___AR_____P 2 0
zsh-2950 [003] 879.500837: dump_page_cache: 6324302 3 ___A______P 2 0
zsh-2950 [003] 879.500838: dump_page_cache: 6324305 4 ____R_____P 2 0
zsh-2950 [003] 879.500843: dump_page_cache: 6324309 28 ___AR_____P 2 0
zsh-2950 [003] 879.500844: dump_page_cache: 6324337 4 ___A______P 2 0
zsh-2950 [003] 879.500845: dump_page_cache: 6324341 2 ____R_____P 2 0
zsh-2950 [003] 879.500850: dump_page_cache: 6324343 30 ___AR_____P 2 0
zsh-2950 [003] 879.500851: dump_page_cache: 6324373 2 ___A______P 2 0
zsh-2950 [003] 879.500852: dump_page_cache: 6324375 2 ___AR_____P 2 0
zsh-2950 [003] 879.500853: dump_page_cache: 6324377 9 ___A______P 2 0
zsh-2950 [003] 879.500854: dump_page_cache: 6324386 2 ___AR_____P 2 0
zsh-2950 [003] 879.500855: dump_page_cache: 6324388 5 ___A______P 2 0
zsh-2950 [003] 879.500856: dump_page_cache: 6324393 3 ___AR_____P 2 0
zsh-2950 [003] 879.500858: dump_page_cache: 6324396 11 ___A______P 2 0
zsh-2950 [003] 879.500859: dump_page_cache: 6324407 1 ____R_____P 2 0
zsh-2950 [003] 879.500864: dump_page_cache: 6324408 31 ___AR_____P 2 0
zsh-2950 [003] 879.500864: dump_page_cache: 6324439 1 ___A______P 2 0
zsh-2950 [003] 879.500865: dump_page_cache: 6324440 1 ____R_____P 2 0
zsh-2950 [003] 879.500866: dump_page_cache: 6324441 2 ___A______P 2 0
zsh-2950 [003] 879.500867: dump_page_cache: 6324443 5 ____R_____P 2 0
zsh-2950 [003] 879.500872: dump_page_cache: 6324448 26 ___AR_____P 2 0
zsh-2950 [003] 879.500873: dump_page_cache: 6324474 6 ___A______P 2 0
zsh-2950 [003] 879.500874: dump_page_cache: 6324480 4 ____R_____P 2 0
zsh-2950 [003] 879.500879: dump_page_cache: 6324484 28 ___AR_____P 2 0
zsh-2950 [003] 879.500880: dump_page_cache: 6324512 4 ___A______P 2 0
zsh-2950 [003] 879.500881: dump_page_cache: 6324516 1 ____R_____P 2 0
zsh-2950 [003] 879.500881: dump_page_cache: 6324517 1 ___A______P 2 0
zsh-2950 [003] 879.500882: dump_page_cache: 6324518 2 ___AR_____P 2 0
zsh-2950 [003] 879.500888: dump_page_cache: 6324520 28 ___A______P 2 0
zsh-2950 [003] 879.500890: dump_page_cache: 6324548 2 ____R_____P 2 0
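The dump_page_cache lines above can be summarized with a small parser. This is a sketch under assumptions: the first two numbers after the tag are taken to be the starting page index and the run length (consistent with how consecutive lines follow each other in the output), and 'A'/'R' in the flags column mean Active/Referenced as stated above. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One run of pages as printed on a dump_page_cache line. */
struct page_run {
    unsigned long index;  /* assumed: first page index of the run */
    unsigned long len;    /* assumed: number of pages in the run */
    int active;           /* 'A' present in the flags column */
    int referenced;       /* 'R' present in the flags column */
};

/*
 * Parse one trace line. Returns 1 if it was a dump_page_cache line and
 * could be parsed, 0 otherwise (headers, dump_inode_cache lines, ...).
 */
int parse_page_run(const char *line, struct page_run *out)
{
    static const char tag[] = "dump_page_cache:";
    const char *p = strstr(line, tag);
    char flags[32];

    if (p == NULL)
        return 0;
    p += sizeof(tag) - 1;

    if (sscanf(p, "%lu %lu %31s", &out->index, &out->len, flags) != 3)
        return 0;

    out->active = strchr(flags, 'A') != NULL;
    out->referenced = strchr(flags, 'R') != NULL;
    return 1;
}
```

Summing len over the runs with referenced set would give a rough count of referenced block device pages for the mounted partition.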
> fincore could take a parameter, or return a bit, to distinguish
> referenced pages, but I don't think that's a good API. This should be
> transparent to userspace.
Users who care about the "cached" status may well be interested in the
"active/referenced" status too. They are correlated pieces of
information. fincore() won't be a simple replica of mincore() anyway:
fincore() has to deal with huge, sparsely accessed files. The accessed
bits of a file page are normally more meaningful than the accessed bits
of mapped (anonymous) pages.
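For reference, what userspace can do today without fincore() looks roughly like the following: map the file and ask mincore(2) which pages are resident. This is a sketch, not the proposed interface; the helper name count_resident_pages is hypothetical. It reports residency only (no active/referenced information) and needs a mapping of the whole file plus one byte per page, which gets awkward for huge, sparsely cached files.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: count how many pages of a regular file are
 * resident in the page cache. Returns -1 on error, 0 for an empty file.
 */
long count_resident_pages(const char *path)
{
    struct stat st;
    long page = sysconf(_SC_PAGESIZE);
    long resident = -1;
    unsigned char *vec = NULL;
    void *map = MAP_FAILED;
    size_t npages, i;
    int fd;

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0 || !S_ISREG(st.st_mode))
        goto out;
    if (st.st_size == 0) {
        resident = 0;
        goto out;
    }

    npages = ((size_t)st.st_size + (size_t)page - 1) / (size_t)page;
    vec = malloc(npages);
    /* Mapping alone does not fault pages in; mincore() only inspects. */
    map = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (vec == NULL || map == MAP_FAILED)
        goto out;

    if (mincore(map, (size_t)st.st_size, vec) == 0) {
        resident = 0;
        for (i = 0; i < npages; i++)
            resident += vec[i] & 1;   /* bit 0: page is resident */
    }
out:
    if (map != MAP_FAILED)
        munmap(map, (size_t)st.st_size);
    free(vec);
    close(fd);
    return resident;
}
```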
Another option may be to use the above
/debug/tracing/objects/mm/pages/dump-file interface.
> > - btrfs_metadata_readahead() can be passed to some (faked)
> > ->readpages() for use with fadvise.
> This needs a filesystem-specific hook too. The difference is that your
> proposal uses fadvise while I'm using an ioctl, which isn't a big
> difference.
True for btrfs. However, it makes a big difference for other filesystems.
> BTW, it's hard to hook btrfs_inode to a fd even with an ioctl; at least
> I didn't find an easy way to do it. It might be possible by, for
> example, adding a fake device or a fake fs (anon_inode doesn't work
> here, IIRC), which is a bit ugly. Until it's proven that a generic API
> can handle metadata readahead, I don't want to do it.
Right, it could be hard to export btrfs_inode. I'm glad you spelled it
out. If we cannot make it work, it's valuable to point out the problem
and let everyone know the root cause of why we turned to an ioctl-based
workaround. Then others will understand the design choices and, if we
are lucky, join us and help export the btrfs_inode.
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2 0/5] add new ioctls to do metadata readahead in btrfs
2011-01-11 3:07 ` Wu Fengguang
@ 2011-01-11 3:27 ` Shaohua Li
2011-01-11 9:13 ` Wu Fengguang
0 siblings, 1 reply; 17+ messages in thread
From: Shaohua Li @ 2011-01-11 3:27 UTC (permalink / raw)
To: Wu, Fengguang
Cc: linux-btrfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
Chris Mason, Christoph Hellwig, Andrew Morton, Arjan van de Ven,
Yan, Zheng, linux-api@vger.kernel.org, mtk.manpages@gmail.com
On Tue, 2011-01-11 at 11:07 +0800, Wu, Fengguang wrote:
> On Tue, Jan 11, 2011 at 10:03:16AM +0800, Li, Shaohua wrote:
> > On Tue, 2011-01-11 at 09:38 +0800, Wu, Fengguang wrote:
> > > On Tue, Jan 11, 2011 at 08:15:19AM +0800, Li, Shaohua wrote:
> > > > On Mon, 2011-01-10 at 22:26 +0800, Wu, Fengguang wrote:
> > > > > Shaohua,
> > > > >
> > > > > On Tue, Jan 04, 2011 at 01:40:30PM +0800, Li, Shaohua wrote:
> > > > > > Hi,
> > > > > > We have file readahead to do async file reads, but we have no metadata
> > > > > > readahead. For a list of files, their metadata is stored in fragmented
> > > > > > disk space and a metadata read is a sync operation, which hurts the
> > > > > > efficiency of readahead a lot. These patches try to add metadata
> > > > > > readahead for btrfs.
> > > > > > In btrfs, metadata is stored in btree_inode. Ideally, we could hook the
> > > > > > inode to a fd so that we could use existing syscalls (readahead, mincore
> > > > > > or the upcoming fincore) to do readahead, but the inode is hidden and,
> > > > > > as far as I understand, there is no easy way to do this. So we add two
> > > > > > ioctls for
> > > > >
> > > > > If that is the main obstacle, why not do straightforward fincore()/
> > > > > fadvise(), and add ioctls to btrfs to export/grab the hidden
> > > > > btree_inode in any form? This will address btrfs' specific issue, and
> > > > > have the benefit of making the VFS part general enough. You know
> > > > > ext2/3/4 already have block_dev ready for metadata readahead.
> > > > I forgot to update this comment. Please see patch 2 and patch 4, both
> > > > incore and readahead need btrfs specific staff involved, so we can't use
> > > > generic fincore or something.
> > >
> > > You can if you like :)
> > >
> > > - fincore() can return the referenced bit, which is generally
> > > useful information
> > metadata page in ext2/3 doesn't have reference bit set, while btrfs has.
> > we can't blindly filter out such pages with the bit.
>
> block_dev inodes have the accessed bits. Look at the below output.
>
> /dev/sda5 is a mounted ext4 partition. The 'A'/'R' in the
> dump_page_cache lines stand for Active/Referenced.
Does ext4 already do readahead? Please check other filesystems.
Filesystems use a bread-like API to read metadata, which definitely
doesn't set the referenced bit.
> root@bay /home/wfg# echo /dev/sda5 > /debug/tracing/objects/mm/pages/dump-file
> root@bay /home/wfg# cat /debug/tracing/trace
> # tracer: nop
> #
> # TASK-PID CPU# TIMESTAMP FUNCTION
> # | | | | |
> zsh-2950 [003] 879.500764: dump_inode_cache: 0 55643986944 1703936 21879 D___ BLK mount /dev/sda5
> zsh-2950 [003] 879.500774: dump_page_cache: 0 2 ___AR_____P 2 0
> zsh-2950 [003] 879.500776: dump_page_cache: 2 3 ____R_____P 2 0
> zsh-2950 [003] 879.500777: dump_page_cache: 1026 5 ___AR_____P 2 0
> zsh-2950 [003] 879.500778: dump_page_cache: 1031 3 ___A______P 2 0
> zsh-2950 [003] 879.500779: dump_page_cache: 1034 1 ___AR_____P 2 0
> zsh-2950 [003] 879.500780: dump_page_cache: 1035 2 ___A______P 2 0
> zsh-2950 [003] 879.500781: dump_page_cache: 1037 1 ___AR_____P 2 0
> zsh-2950 [003] 879.500782: dump_page_cache: 1038 3 ____R_____P 2 0
> zsh-2950 [003] 879.500782: dump_page_cache: 1041 1 ___A______P 2 0
> zsh-2950 [003] 879.500783: dump_page_cache: 1057 1 ___AR_D___P 2 0
> zsh-2950 [003] 879.500788: dump_page_cache: 1058 6 ___A______P 2 0
> zsh-2950 [003] 879.500788: dump_page_cache: 9249 1 ___AR_____P 2 0
> zsh-2950 [003] 879.500789: dump_page_cache: 524289 1 ____R_____P 2 0
> zsh-2950 [003] 879.500790: dump_page_cache: 524290 2 ___A______P 2 0
> zsh-2950 [003] 879.500790: dump_page_cache: 524292 1 ___AR_____P 2 0
> zsh-2950 [003] 879.500791: dump_page_cache: 524293 1 ___A______P 2 0
> zsh-2950 [003] 879.500796: dump_page_cache: 524294 9 ____R_____P 2 0
> zsh-2950 [003] 879.500797: dump_page_cache: 524303 1 ___A______P 2 0
> zsh-2950 [003] 879.500798: dump_page_cache: 987136 1 ___AR_____P 2 0
> zsh-2950 [003] 879.500798: dump_page_cache: 1048576 1 ____R_____P 2 0
> zsh-2950 [003] 879.500799: dump_page_cache: 1048577 2 ___A______P 2 0
> zsh-2950 [003] 879.500800: dump_page_cache: 1048579 1 ___AR_____P 2 0
> zsh-2950 [003] 879.500801: dump_page_cache: 1048580 5 ___A______P 2 0
> zsh-2950 [003] 879.500802: dump_page_cache: 1048585 1 ___AR_____P 2 0
> zsh-2950 [003] 879.500805: dump_page_cache: 1048586 5 ___A______P 2 0
> zsh-2950 [003] 879.500805: dump_page_cache: 1048591 1 ___AR_____P 2 0
> zsh-2950 [003] 879.500806: dump_page_cache: 1572864 1 ____R_____P 2 0
> zsh-2950 [003] 879.500807: dump_page_cache: 1572865 5 ___A______P 2 0
> zsh-2950 [003] 879.500808: dump_page_cache: 1572870 1 ___AR_____P 2 0
> zsh-2950 [003] 879.500811: dump_page_cache: 1572871 6 ___A______P 2 0
> zsh-2950 [003] 879.500812: dump_page_cache: 1572877 3 ____R_____P 2 0
> zsh-2950 [003] 879.500816: dump_page_cache: 2097153 8 ____R_____P 2 0
> zsh-2950 [003] 879.500817: dump_page_cache: 2097161 1 ___A______P 2 0
> zsh-2950 [003] 879.500818: dump_page_cache: 2097162 4 ____R_____P 2 0
> zsh-2950 [003] 879.500819: dump_page_cache: 6324224 1 ____R_D___P 2 0
> zsh-2950 [003] 879.500820: dump_page_cache: 6324225 3 ___AR_____P 2 0
> zsh-2950 [003] 879.500825: dump_page_cache: 6324228 29 ___A______P 2 0
> zsh-2950 [003] 879.500826: dump_page_cache: 6324257 1 ____R_____P 2 0
> zsh-2950 [003] 879.500828: dump_page_cache: 6324258 4 ___A______P 2 0
> zsh-2950 [003] 879.500830: dump_page_cache: 6324262 11 ____R_____P 2 0
> zsh-2950 [003] 879.500833: dump_page_cache: 6324273 16 ___AR_____P 2 0
> zsh-2950 [003] 879.500833: dump_page_cache: 6324289 1 ___A______P 2 0
> zsh-2950 [003] 879.500834: dump_page_cache: 6324290 2 ___AR_____P 2 0
> zsh-2950 [003] 879.500835: dump_page_cache: 6324292 8 ___A______P 2 0
> zsh-2950 [003] 879.500836: dump_page_cache: 6324300 2 ___AR_____P 2 0
> zsh-2950 [003] 879.500837: dump_page_cache: 6324302 3 ___A______P 2 0
> zsh-2950 [003] 879.500838: dump_page_cache: 6324305 4 ____R_____P 2 0
> zsh-2950 [003] 879.500843: dump_page_cache: 6324309 28 ___AR_____P 2 0
> zsh-2950 [003] 879.500844: dump_page_cache: 6324337 4 ___A______P 2 0
> zsh-2950 [003] 879.500845: dump_page_cache: 6324341 2 ____R_____P 2 0
> zsh-2950 [003] 879.500850: dump_page_cache: 6324343 30 ___AR_____P 2 0
> zsh-2950 [003] 879.500851: dump_page_cache: 6324373 2 ___A______P 2 0
> zsh-2950 [003] 879.500852: dump_page_cache: 6324375 2 ___AR_____P 2 0
> zsh-2950 [003] 879.500853: dump_page_cache: 6324377 9 ___A______P 2 0
> zsh-2950 [003] 879.500854: dump_page_cache: 6324386 2 ___AR_____P 2 0
> zsh-2950 [003] 879.500855: dump_page_cache: 6324388 5 ___A______P 2 0
> zsh-2950 [003] 879.500856: dump_page_cache: 6324393 3 ___AR_____P 2 0
> zsh-2950 [003] 879.500858: dump_page_cache: 6324396 11 ___A______P 2 0
> zsh-2950 [003] 879.500859: dump_page_cache: 6324407 1 ____R_____P 2 0
> zsh-2950 [003] 879.500864: dump_page_cache: 6324408 31 ___AR_____P 2 0
> zsh-2950 [003] 879.500864: dump_page_cache: 6324439 1 ___A______P 2 0
> zsh-2950 [003] 879.500865: dump_page_cache: 6324440 1 ____R_____P 2 0
> zsh-2950 [003] 879.500866: dump_page_cache: 6324441 2 ___A______P 2 0
> zsh-2950 [003] 879.500867: dump_page_cache: 6324443 5 ____R_____P 2 0
> zsh-2950 [003] 879.500872: dump_page_cache: 6324448 26 ___AR_____P 2 0
> zsh-2950 [003] 879.500873: dump_page_cache: 6324474 6 ___A______P 2 0
> zsh-2950 [003] 879.500874: dump_page_cache: 6324480 4 ____R_____P 2 0
> zsh-2950 [003] 879.500879: dump_page_cache: 6324484 28 ___AR_____P 2 0
> zsh-2950 [003] 879.500880: dump_page_cache: 6324512 4 ___A______P 2 0
> zsh-2950 [003] 879.500881: dump_page_cache: 6324516 1 ____R_____P 2 0
> zsh-2950 [003] 879.500881: dump_page_cache: 6324517 1 ___A______P 2 0
> zsh-2950 [003] 879.500882: dump_page_cache: 6324518 2 ___AR_____P 2 0
> zsh-2950 [003] 879.500888: dump_page_cache: 6324520 28 ___A______P 2 0
> zsh-2950 [003] 879.500890: dump_page_cache: 6324548 2 ____R_____P 2 0
>
> > fincore can takes a parameter or it returns a bit to distinguish
> > referenced pages, but I don't think it's a good API. This should be
> > transparent to userspace.
>
> Users care about the "cached" status may well be interested in the
> "active/referenced" status. They are co-related information. fincore()
> won't be a simple replication of mincore() anyway. fincore() has to
> deal with huge sparsely accessed files. The accessed bits of a file
> page are normally more meaningful than the accessed bits of mapped
> (anonymous) pages.
If all filesystems set the bit, I'll buy in. Otherwise, this isn't generic enough.
> Another option may be to use the above
> /debug/tracing/objects/mm/pages/dump-file interface.
>
> > > - btrfs_metadata_readahead() can be passed to some (faked)
> > > ->readpages() for use with fadvise.
> > this need filesystem specific hook too, the difference is your proposal
> > uses fadvise but I'm using ioctl. There isn't big difference.
>
> True for btrfs. However they make big differences for other file systems.
why?
> > BTW, it's hard to hook btrfs_inode to a fd even with a ioctl, at least I
> > didn't find a easy way to do this. It might be possible to do this for
> > example adding a fake device or fake fs (anon_inode doesn't work here,
> > IIRC), which is a bit ugly. Before it's proved generic API can handle
> > metadata readahead, I don't want to do it.
>
> Right, it could be hard to export btrfs_inode. I'm glad you speak it
> out. If we cannot make it, it's valuable to point out the problem and
> let everyone know the root cause we turn to an ioctl based workaround.
> Then others will understand the design choices, and if lucky, join us
> and help export the btrfs_inode.
I didn't hide anything. I actually spelled this out in the comments. This
is what I said:
> > > > > > > In btrfs, metadata is stored in btree_inode. Ideally, if we could hook
> > > > > > > the inode to a fd so we could use existing syscalls (readahead, mincore
> > > > > > > or upcoming fincore) to do readahead, but the inode is hidden, there is
> > > > > > > no easy way for this from my understanding.
Thanks,
Shaohua
> > > > > > this. One is like readahead syscall, the other is like micore/fincore
> > > > > > syscall.
> > > > > > Under a harddisk based netbook with Meego, the metadata readahead
> > > > > > reduced about 3.5s boot time in average from total 16s.
> > > > > > Last time I posted similar patches to btrfs maillist, which adds the
> > > > > > new ioctls in btrfs specific ioctl code. But Christoph Hellwig asks we
> > > > > > have a generic interface to do this so other filesystem can share some
> > > > > > code, so I came up with the new one. Comments and suggestions are
> > > > > > welcome!
> > > > > >
> > > > > > v1->v2:
> > > > > > 1. Added more comments and fix return values suggested by Andrew Morton
> > > > > > 2. fix a race condition pointed out by Yan Zheng
> > > > > >
> > > > > > initial post:
> > > > > > http://marc.info/?l=linux-fsdevel&m=129222493406353&w=2
> > > > > >
> > > > > > Thanks,
> > > > > > Shaohua
> > > > > >
> > > > > > --
> > > > > > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> > > > > > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> > > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html
> > > >
> > > >
> >
> >
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2 0/5] add new ioctls to do metadata readahead in btrfs
2011-01-11 3:27 ` Shaohua Li
@ 2011-01-11 9:13 ` Wu Fengguang
2011-01-12 2:55 ` Shaohua Li
0 siblings, 1 reply; 17+ messages in thread
From: Wu Fengguang @ 2011-01-11 9:13 UTC (permalink / raw)
To: Li, Shaohua
Cc: linux-btrfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
Chris Mason, Christoph Hellwig, Andrew Morton, Arjan van de Ven,
Yan, Zheng, linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org
On Tue, Jan 11, 2011 at 11:27:33AM +0800, Li, Shaohua wrote:
> On Tue, 2011-01-11 at 11:07 +0800, Wu, Fengguang wrote:
> > On Tue, Jan 11, 2011 at 10:03:16AM +0800, Li, Shaohua wrote:
> > > On Tue, 2011-01-11 at 09:38 +0800, Wu, Fengguang wrote:
> > > > On Tue, Jan 11, 2011 at 08:15:19AM +0800, Li, Shaohua wrote:
> > > > > On Mon, 2011-01-10 at 22:26 +0800, Wu, Fengguang wrote:
> > > > > > Shaohua,
> > > > > >
> > > > > > On Tue, Jan 04, 2011 at 01:40:30PM +0800, Li, Shaohua wrote:
> > > > > > > Hi,
> > > > > > > We have file readahead to do asyn file read, but has no metadata
> > > > > > > readahead. For a list of files, their metadata is stored in fragmented
> > > > > > > disk space and metadata read is a sync operation, which impacts the
> > > > > > > efficiency of readahead much. The patches try to add meatadata readahead
> > > > > > > for btrfs.
> > > > > > > In btrfs, metadata is stored in btree_inode. Ideally, if we could hook
> > > > > > > the inode to a fd so we could use existing syscalls (readahead, mincore
> > > > > > > or upcoming fincore) to do readahead, but the inode is hidden, there is
> > > > > > > no easy way for this from my understanding. So we add two ioctls for
> > > > > >
> > > > > > If that is the main obstacle, why not do straightforward fincore()/
> > > > > > fadvise(), and add ioctls to btrfs to export/grab the hidden
> > > > > > btree_inode in any form? This will address btrfs' specific issue, and
> > > > > > have the benefit of making the VFS part general enough. You know
> > > > > > ext2/3/4 already have block_dev ready for metadata readahead.
> > > > > I forgot to update this comment. Please see patch 2 and patch 4, both
> > > > > incore and readahead need btrfs specific staff involved, so we can't use
> > > > > generic fincore or something.
> > > >
> > > > You can if you like :)
> > > >
> > > > - fincore() can return the referenced bit, which is generally
> > > > useful information
> > > metadata page in ext2/3 doesn't have reference bit set, while btrfs has.
> > > we can't blindly filter out such pages with the bit.
> >
> > block_dev inodes have the accessed bits. Look at the below output.
> >
> > /dev/sda5 is a mounted ext4 partition. The 'A'/'R' in the
> > dump_page_cache lines stand for Active/Referenced.
> ext4 already does readahead? please check other filesystems.
ext3/4 do readahead when accessing large directories. However, that's
an orthogonal feature to user-space metadata readahead. The latter is
still important for fast boot on ext3/4.
> filesystem sues bread like API to read metadata, which definitely
> doesn't set referenced bit.
__find_get_block() will call touch_buffer(), which is a synonym for
mark_page_accessed().
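To make that concrete, here is a minimal userspace sketch of the promotion
logic as I understand mark_page_accessed() in mm/swap.c of this era. The
struct and the model_* names are invented for illustration and are not
kernel API; only the flag transitions are meant to match.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for the page flags involved; NOT the kernel's struct page. */
struct page_model {
	bool lru;         /* page sits on an LRU list */
	bool active;      /* shown as 'A' in the dump_page_cache lines */
	bool referenced;  /* shown as 'R' in the dump_page_cache lines */
	bool unevictable;
};

/* Models the two-step promotion: first touch marks, second touch activates. */
static void model_mark_page_accessed(struct page_model *p)
{
	assert(p != NULL);
	if (!p->active && !p->unevictable && p->referenced && p->lru) {
		/* inactive + referenced: promote to active, clear referenced */
		p->active = true;
		p->referenced = false;
	} else if (!p->referenced) {
		/* first touch, or first touch after activation */
		p->referenced = true;
	}
}

/* touch_buffer(bh) amounts to mark_page_accessed() on the buffer's page. */
static void model_touch_buffer(struct page_model *buffer_page)
{
	model_mark_page_accessed(buffer_page);
}
```

Under this model a page shows 'R' after one touch, 'A' after two and 'AR'
after three, which is the mix of flags seen in the dump_page_cache lines.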
> > root@bay /home/wfg# echo /dev/sda5 > /debug/tracing/objects/mm/pages/dump-file
> > root@bay /home/wfg# cat /debug/tracing/trace
> > # tracer: nop
> > #
> > # TASK-PID CPU# TIMESTAMP FUNCTION
> > # | | | | |
> > zsh-2950 [003] 879.500764: dump_inode_cache: 0 55643986944 1703936 21879 D___ BLK mount /dev/sda5
> > zsh-2950 [003] 879.500774: dump_page_cache: 0 2 ___AR_____P 2 0
> > zsh-2950 [003] 879.500776: dump_page_cache: 2 3 ____R_____P 2 0
> > zsh-2950 [003] 879.500777: dump_page_cache: 1026 5 ___AR_____P 2 0
> > zsh-2950 [003] 879.500778: dump_page_cache: 1031 3 ___A______P 2 0
> > zsh-2950 [003] 879.500779: dump_page_cache: 1034 1 ___AR_____P 2 0
> > zsh-2950 [003] 879.500780: dump_page_cache: 1035 2 ___A______P 2 0
> > zsh-2950 [003] 879.500781: dump_page_cache: 1037 1 ___AR_____P 2 0
> > zsh-2950 [003] 879.500782: dump_page_cache: 1038 3 ____R_____P 2 0
> > zsh-2950 [003] 879.500782: dump_page_cache: 1041 1 ___A______P 2 0
> > zsh-2950 [003] 879.500783: dump_page_cache: 1057 1 ___AR_D___P 2 0
> > zsh-2950 [003] 879.500788: dump_page_cache: 1058 6 ___A______P 2 0
> > zsh-2950 [003] 879.500788: dump_page_cache: 9249 1 ___AR_____P 2 0
> > zsh-2950 [003] 879.500789: dump_page_cache: 524289 1 ____R_____P 2 0
> > zsh-2950 [003] 879.500790: dump_page_cache: 524290 2 ___A______P 2 0
> > zsh-2950 [003] 879.500790: dump_page_cache: 524292 1 ___AR_____P 2 0
> > zsh-2950 [003] 879.500791: dump_page_cache: 524293 1 ___A______P 2 0
> > zsh-2950 [003] 879.500796: dump_page_cache: 524294 9 ____R_____P 2 0
> > zsh-2950 [003] 879.500797: dump_page_cache: 524303 1 ___A______P 2 0
> > zsh-2950 [003] 879.500798: dump_page_cache: 987136 1 ___AR_____P 2 0
> > zsh-2950 [003] 879.500798: dump_page_cache: 1048576 1 ____R_____P 2 0
> > zsh-2950 [003] 879.500799: dump_page_cache: 1048577 2 ___A______P 2 0
> > zsh-2950 [003] 879.500800: dump_page_cache: 1048579 1 ___AR_____P 2 0
> > zsh-2950 [003] 879.500801: dump_page_cache: 1048580 5 ___A______P 2 0
> > zsh-2950 [003] 879.500802: dump_page_cache: 1048585 1 ___AR_____P 2 0
> > zsh-2950 [003] 879.500805: dump_page_cache: 1048586 5 ___A______P 2 0
> > zsh-2950 [003] 879.500805: dump_page_cache: 1048591 1 ___AR_____P 2 0
> > zsh-2950 [003] 879.500806: dump_page_cache: 1572864 1 ____R_____P 2 0
> > zsh-2950 [003] 879.500807: dump_page_cache: 1572865 5 ___A______P 2 0
> > zsh-2950 [003] 879.500808: dump_page_cache: 1572870 1 ___AR_____P 2 0
> > zsh-2950 [003] 879.500811: dump_page_cache: 1572871 6 ___A______P 2 0
> > zsh-2950 [003] 879.500812: dump_page_cache: 1572877 3 ____R_____P 2 0
> > zsh-2950 [003] 879.500816: dump_page_cache: 2097153 8 ____R_____P 2 0
> > zsh-2950 [003] 879.500817: dump_page_cache: 2097161 1 ___A______P 2 0
> > zsh-2950 [003] 879.500818: dump_page_cache: 2097162 4 ____R_____P 2 0
> > zsh-2950 [003] 879.500819: dump_page_cache: 6324224 1 ____R_D___P 2 0
> > zsh-2950 [003] 879.500820: dump_page_cache: 6324225 3 ___AR_____P 2 0
> > zsh-2950 [003] 879.500825: dump_page_cache: 6324228 29 ___A______P 2 0
> > zsh-2950 [003] 879.500826: dump_page_cache: 6324257 1 ____R_____P 2 0
> > zsh-2950 [003] 879.500828: dump_page_cache: 6324258 4 ___A______P 2 0
> > zsh-2950 [003] 879.500830: dump_page_cache: 6324262 11 ____R_____P 2 0
> > zsh-2950 [003] 879.500833: dump_page_cache: 6324273 16 ___AR_____P 2 0
> > zsh-2950 [003] 879.500833: dump_page_cache: 6324289 1 ___A______P 2 0
> > zsh-2950 [003] 879.500834: dump_page_cache: 6324290 2 ___AR_____P 2 0
> > zsh-2950 [003] 879.500835: dump_page_cache: 6324292 8 ___A______P 2 0
> > zsh-2950 [003] 879.500836: dump_page_cache: 6324300 2 ___AR_____P 2 0
> > zsh-2950 [003] 879.500837: dump_page_cache: 6324302 3 ___A______P 2 0
> > zsh-2950 [003] 879.500838: dump_page_cache: 6324305 4 ____R_____P 2 0
> > zsh-2950 [003] 879.500843: dump_page_cache: 6324309 28 ___AR_____P 2 0
> > zsh-2950 [003] 879.500844: dump_page_cache: 6324337 4 ___A______P 2 0
> > zsh-2950 [003] 879.500845: dump_page_cache: 6324341 2 ____R_____P 2 0
> > zsh-2950 [003] 879.500850: dump_page_cache: 6324343 30 ___AR_____P 2 0
> > zsh-2950 [003] 879.500851: dump_page_cache: 6324373 2 ___A______P 2 0
> > zsh-2950 [003] 879.500852: dump_page_cache: 6324375 2 ___AR_____P 2 0
> > zsh-2950 [003] 879.500853: dump_page_cache: 6324377 9 ___A______P 2 0
> > zsh-2950 [003] 879.500854: dump_page_cache: 6324386 2 ___AR_____P 2 0
> > zsh-2950 [003] 879.500855: dump_page_cache: 6324388 5 ___A______P 2 0
> > zsh-2950 [003] 879.500856: dump_page_cache: 6324393 3 ___AR_____P 2 0
> > zsh-2950 [003] 879.500858: dump_page_cache: 6324396 11 ___A______P 2 0
> > zsh-2950 [003] 879.500859: dump_page_cache: 6324407 1 ____R_____P 2 0
> > zsh-2950 [003] 879.500864: dump_page_cache: 6324408 31 ___AR_____P 2 0
> > zsh-2950 [003] 879.500864: dump_page_cache: 6324439 1 ___A______P 2 0
> > zsh-2950 [003] 879.500865: dump_page_cache: 6324440 1 ____R_____P 2 0
> > zsh-2950 [003] 879.500866: dump_page_cache: 6324441 2 ___A______P 2 0
> > zsh-2950 [003] 879.500867: dump_page_cache: 6324443 5 ____R_____P 2 0
> > zsh-2950 [003] 879.500872: dump_page_cache: 6324448 26 ___AR_____P 2 0
> > zsh-2950 [003] 879.500873: dump_page_cache: 6324474 6 ___A______P 2 0
> > zsh-2950 [003] 879.500874: dump_page_cache: 6324480 4 ____R_____P 2 0
> > zsh-2950 [003] 879.500879: dump_page_cache: 6324484 28 ___AR_____P 2 0
> > zsh-2950 [003] 879.500880: dump_page_cache: 6324512 4 ___A______P 2 0
> > zsh-2950 [003] 879.500881: dump_page_cache: 6324516 1 ____R_____P 2 0
> > zsh-2950 [003] 879.500881: dump_page_cache: 6324517 1 ___A______P 2 0
> > zsh-2950 [003] 879.500882: dump_page_cache: 6324518 2 ___AR_____P 2 0
> > zsh-2950 [003] 879.500888: dump_page_cache: 6324520 28 ___A______P 2 0
> > zsh-2950 [003] 879.500890: dump_page_cache: 6324548 2 ____R_____P 2 0
> >
> > > fincore can takes a parameter or it returns a bit to distinguish
> > > referenced pages, but I don't think it's a good API. This should be
> > > transparent to userspace.
> >
> > Users care about the "cached" status may well be interested in the
> > "active/referenced" status. They are co-related information. fincore()
> > won't be a simple replication of mincore() anyway. fincore() has to
> > deal with huge sparsely accessed files. The accessed bits of a file
> > page are normally more meaningful than the accessed bits of mapped
> > (anonymous) pages.
> if all filesystems have the bit set, I'll buy-in. Otherwise, this isn't generic enough.
It's a reasonable thing to set the accessed bits. So I believe the
various filesystems are calling mark_page_accessed() on their metadata
inode, or can be changed to do it.
> > Another option may be to use the above
> > /debug/tracing/objects/mm/pages/dump-file interface.
> >
> > > > - btrfs_metadata_readahead() can be passed to some (faked)
> > > > ->readpages() for use with fadvise.
> > > this need filesystem specific hook too, the difference is your proposal
> > > uses fadvise but I'm using ioctl. There isn't big difference.
> >
> > True for btrfs. However they make big differences for other file systems.
> why?
The block_dev of ext2/3/4 can do metadata query/readahead directly
with fincore()+fadvise(), with no need for any additional ioctls.
Given that the vast majority of desktops are running ext2/3/4, it seems
worthwhile to have a straightforward solution for them.
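A rough userspace sketch of that flow, using only calls that exist today.
fincore() is still just a proposal, so mmap()+mincore() stands in for the
query side here; that substitution, and the helper names hint_readahead()
and count_cached_pages(), are my assumptions for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Ask the kernel to start reading [offset, offset + len) of fd into the
 * page cache. Returns 0 on success or an errno value on failure.
 */
static int hint_readahead(int fd, off_t offset, off_t len)
{
	return posix_fadvise(fd, offset, len, POSIX_FADV_WILLNEED);
}

/*
 * Count how many pages of the first len bytes of fd are resident,
 * via mmap() + mincore(). Returns -1 on error.
 */
static long count_cached_pages(int fd, size_t len)
{
	size_t page_size = (size_t)sysconf(_SC_PAGESIZE);
	size_t npages, i;
	unsigned char *vec;
	void *map;
	long cached = 0;

	assert(len > 0);
	npages = (len + page_size - 1) / page_size;

	map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		return -1;

	vec = malloc(npages);
	if (vec != NULL && mincore(map, len, vec) == 0) {
		for (i = 0; i < npages; i++)
			cached += vec[i] & 1;   /* low bit: page is resident */
	} else {
		cached = -1;
	}

	free(vec);
	munmap(map, len);
	return cached;
}
```

For the ext2/3/4 case the fd would be the block device itself (for example
the /dev/sda5 above), opened read-only, which normally requires root.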
> > > BTW, it's hard to hook btrfs_inode to a fd even with a ioctl, at least I
> > > didn't find a easy way to do this. It might be possible to do this for
> > > example adding a fake device or fake fs (anon_inode doesn't work here,
> > > IIRC), which is a bit ugly. Before it's proved generic API can handle
> > > metadata readahead, I don't want to do it.
> >
> > Right, it could be hard to export btrfs_inode. I'm glad you speak it
> > out. If we cannot make it, it's valuable to point out the problem and
> > let everyone know the root cause we turn to an ioctl based workaround.
> > Then others will understand the design choices, and if lucky, join us
> > and help export the btrfs_inode.
> I didn't hide anything. I actually tell out this in the comments. this
> is what I said.
Ah, sorry for overlooking this message!
Thanks,
Fengguang
> > > > > > > > In btrfs, metadata is stored in btree_inode. Ideally, if we could hook
> > > > > > > > the inode to a fd so we could use existing syscalls (readahead, mincore
> > > > > > > > or upcoming fincore) to do readahead, but the inode is hidden, there is
> > > > > > > > no easy way for this from my understanding.
>
>
> Thanks,
> Shaohua
> > > > > > > this. One is like readahead syscall, the other is like micore/fincore
> > > > > > > syscall.
> > > > > > > Under a harddisk based netbook with Meego, the metadata readahead
> > > > > > > reduced about 3.5s boot time in average from total 16s.
> > > > > > > Last time I posted similar patches to btrfs maillist, which adds the
> > > > > > > new ioctls in btrfs specific ioctl code. But Christoph Hellwig asks we
> > > > > > > have a generic interface to do this so other filesystem can share some
> > > > > > > code, so I came up with the new one. Comments and suggestions are
> > > > > > > welcome!
> > > > > > >
> > > > > > > v1->v2:
> > > > > > > 1. Added more comments and fix return values suggested by Andrew Morton
> > > > > > > 2. fix a race condition pointed out by Yan Zheng
> > > > > > >
> > > > > > > initial post:
> > > > > > > http://marc.info/?l=linux-fsdevel&m=129222493406353&w=2
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Shaohua
> > > > > > >
> > > > > > > --
> > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> > > > > > > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> > > > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html
> > > > >
> > > > >
> > >
> > >
>
>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2 0/5] add new ioctls to do metadata readahead in btrfs
2011-01-11 9:13 ` Wu Fengguang
@ 2011-01-12 2:55 ` Shaohua Li
[not found] ` <20110112025516.GA11303-yAZKuqJtXNMXR+D7ky4Foa2pdiUAq4bhAL8bYrjMMd8@public.gmane.org>
0 siblings, 1 reply; 17+ messages in thread
From: Shaohua Li @ 2011-01-12 2:55 UTC (permalink / raw)
To: Wu, Fengguang
Cc: linux-btrfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
Chris Mason, Christoph Hellwig, Andrew Morton, Arjan van de Ven,
Yan, Zheng, linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org
On Tue, Jan 11, 2011 at 05:13:53PM +0800, Wu, Fengguang wrote:
> On Tue, Jan 11, 2011 at 11:27:33AM +0800, Li, Shaohua wrote:
> > On Tue, 2011-01-11 at 11:07 +0800, Wu, Fengguang wrote:
> > > On Tue, Jan 11, 2011 at 10:03:16AM +0800, Li, Shaohua wrote:
> > > > On Tue, 2011-01-11 at 09:38 +0800, Wu, Fengguang wrote:
> > > > > On Tue, Jan 11, 2011 at 08:15:19AM +0800, Li, Shaohua wrote:
> > > > > > On Mon, 2011-01-10 at 22:26 +0800, Wu, Fengguang wrote:
> > > > > > > Shaohua,
> > > > > > >
> > > > > > > On Tue, Jan 04, 2011 at 01:40:30PM +0800, Li, Shaohua wrote:
> > > > > > > > Hi,
> > > > > > > > We have file readahead to do asyn file read, but has no metadata
> > > > > > > > readahead. For a list of files, their metadata is stored in fragmented
> > > > > > > > disk space and metadata read is a sync operation, which impacts the
> > > > > > > > efficiency of readahead much. The patches try to add meatadata readahead
> > > > > > > > for btrfs.
> > > > > > > > In btrfs, metadata is stored in btree_inode. Ideally, if we could hook
> > > > > > > > the inode to a fd so we could use existing syscalls (readahead, mincore
> > > > > > > > or upcoming fincore) to do readahead, but the inode is hidden, there is
> > > > > > > > no easy way for this from my understanding. So we add two ioctls for
> > > > > > >
> > > > > > > If that is the main obstacle, why not do straightforward fincore()/
> > > > > > > fadvise(), and add ioctls to btrfs to export/grab the hidden
> > > > > > > btree_inode in any form? This will address btrfs' specific issue, and
> > > > > > > have the benefit of making the VFS part general enough. You know
> > > > > > > ext2/3/4 already have block_dev ready for metadata readahead.
> > > > > > I forgot to update this comment. Please see patch 2 and patch 4, both
> > > > > > incore and readahead need btrfs specific staff involved, so we can't use
> > > > > > generic fincore or something.
> > > > >
> > > > > You can if you like :)
> > > > >
> > > > > - fincore() can return the referenced bit, which is generally
> > > > > useful information
> > > > metadata page in ext2/3 doesn't have reference bit set, while btrfs has.
> > > > we can't blindly filter out such pages with the bit.
> > >
> > > block_dev inodes have the accessed bits. Look at the below output.
> > >
> > > /dev/sda5 is a mounted ext4 partition. The 'A'/'R' in the
> > > dump_page_cache lines stand for Active/Referenced.
> > ext4 already does readahead? please check other filesystems.
>
> ext3/4 does readahead on accessing large directories. However that's
> orthogonal feature to the user space metadata readahead. The latter is
> still important for fast boot on ext3/4.
>
> > filesystem sues bread like API to read metadata, which definitely
> > doesn't set referenced bit.
>
> __find_get_block() will call touch_buffer() which is a synonymous for
> mark_page_accessed().
Yes, but only when the buffer is accessed for the second time.
> > > > fincore can takes a parameter or it returns a bit to distinguish
> > > > referenced pages, but I don't think it's a good API. This should be
> > > > transparent to userspace.
> > >
> > > Users care about the "cached" status may well be interested in the
> > > "active/referenced" status. They are co-related information. fincore()
> > > won't be a simple replication of mincore() anyway. fincore() has to
> > > deal with huge sparsely accessed files. The accessed bits of a file
> > > page are normally more meaningful than the accessed bits of mapped
> > > (anonymous) pages.
> > if all filesystems have the bit set, I'll buy-in. Otherwise, this isn't generic enough.
>
> It's a reasonable thing to set the accessed bits. So I believe the
> various filesystems are calling mark_page_accessed() on their metadata
> inode, or can be changed to do it.
Yes, we can, but with a lot of pain. And filesystems must be smart enough to avoid marking
the bit for pages which are read ahead but are actually invalid. The second patch in the
series has more detailed information about this issue. The question is whether this is
really worth it for metadata readahead. Some filesystems might not care about metadata
readahead. If we make fincore check the bit, then the fincore syscall will not work for
such filesystems, which is bad.
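A toy illustration of that concern, as plain userspace C. The struct and
model_incore_count() are made up for this mail; it only shows how a
referenced-bit filter changes what an incore-style query would report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-page state as an incore-style query would see it. */
struct model_page {
	bool cached;      /* page is in the page cache */
	bool referenced;  /* filesystem marked the page as accessed */
};

/*
 * Count pages that would be reported. With require_referenced set,
 * cached pages that were never marked accessed are silently dropped.
 */
static size_t model_incore_count(const struct model_page *pages, size_t n,
				 bool require_referenced)
{
	size_t hits = 0;
	size_t i;

	assert(pages != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (!pages[i].cached)
			continue;
		if (require_referenced && !pages[i].referenced)
			continue;
		hits++;
	}
	return hits;
}
```

For a filesystem that never marks its metadata pages, the filtered count is
zero even though every page is cached, so a filtering fincore would look
broken there.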
> > > Another option may be to use the above
> > > /debug/tracing/objects/mm/pages/dump-file interface.
> > >
> > > > > - btrfs_metadata_readahead() can be passed to some (faked)
> > > > > ->readpages() for use with fadvise.
> > > > this need filesystem specific hook too, the difference is your proposal
> > > > uses fadvise but I'm using ioctl. There isn't big difference.
> > >
> > > True for btrfs. However they make big differences for other file systems.
> > why?
>
> The block_dev of ext2/3/4 can do metadata query/readahead directly
> with fincore()+fadvise(), with no need for any additional ioctls.
>
> Given that the vast majority desktops are running ext2/3/4, it seems
> worthwhile to have a straightforward solution for them.
This does make ext filesystem metadata readahead straightforward, but causes a lot
of pain for other filesystems. And even for ext filesystems, we need to take care
of the 'invalid page' issue above.
On the other hand, with the ioctl approach, we can still make ext filesystem
metadata readahead straightforward (just several lines of code; we can even
add a lib API for such filesystems).
We'd better have a more generic approach for all filesystems, and the ioctl
approach is better for that.
Thanks,
Shaohua
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2 0/5] add new ioctls to do metadata readahead in btrfs
[not found] ` <20110112025516.GA11303-yAZKuqJtXNMXR+D7ky4Foa2pdiUAq4bhAL8bYrjMMd8@public.gmane.org>
@ 2011-01-16 3:38 ` Wu Fengguang
2011-01-17 1:32 ` Shaohua Li
0 siblings, 1 reply; 17+ messages in thread
From: Wu Fengguang @ 2011-01-16 3:38 UTC (permalink / raw)
To: Li, Shaohua
Cc: linux-btrfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
Chris Mason, Christoph Hellwig, Andrew Morton, Arjan van de Ven,
Yan, Zheng, linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org
On Wed, Jan 12, 2011 at 10:55:16AM +0800, Li, Shaohua wrote:
> On Tue, Jan 11, 2011 at 05:13:53PM +0800, Wu, Fengguang wrote:
> > On Tue, Jan 11, 2011 at 11:27:33AM +0800, Li, Shaohua wrote:
> > > On Tue, 2011-01-11 at 11:07 +0800, Wu, Fengguang wrote:
> > > > On Tue, Jan 11, 2011 at 10:03:16AM +0800, Li, Shaohua wrote:
> > > > > On Tue, 2011-01-11 at 09:38 +0800, Wu, Fengguang wrote:
> > > > > > On Tue, Jan 11, 2011 at 08:15:19AM +0800, Li, Shaohua wrote:
> > > > > > > On Mon, 2011-01-10 at 22:26 +0800, Wu, Fengguang wrote:
> > > > > > > > Shaohua,
> > > > > > > >
> > > > > > > > On Tue, Jan 04, 2011 at 01:40:30PM +0800, Li, Shaohua wrote:
> > > > > > > > > Hi,
> > > > > > > > > We have file readahead to do asyn file read, but has no metadata
> > > > > > > > > readahead. For a list of files, their metadata is stored in fragmented
> > > > > > > > > disk space and metadata read is a sync operation, which impacts the
> > > > > > > > > efficiency of readahead much. The patches try to add meatadata readahead
> > > > > > > > > for btrfs.
> > > > > > > > > In btrfs, metadata is stored in btree_inode. Ideally, if we could hook
> > > > > > > > > the inode to a fd so we could use existing syscalls (readahead, mincore
> > > > > > > > > or upcoming fincore) to do readahead, but the inode is hidden, there is
> > > > > > > > > no easy way for this from my understanding. So we add two ioctls for
> > > > > > > >
> > > > > > > > If that is the main obstacle, why not do straightforward fincore()/
> > > > > > > > fadvise(), and add ioctls to btrfs to export/grab the hidden
> > > > > > > > btree_inode in any form? This will address btrfs' specific issue, and
> > > > > > > > have the benefit of making the VFS part general enough. You know
> > > > > > > > ext2/3/4 already have block_dev ready for metadata readahead.
> > > > > > > I forgot to update this comment. Please see patch 2 and patch 4, both
> > > > > > > incore and readahead need btrfs specific staff involved, so we can't use
> > > > > > > generic fincore or something.
> > > > > >
> > > > > > You can if you like :)
> > > > > >
> > > > > > - fincore() can return the referenced bit, which is generally
> > > > > > useful information
> > > > > metadata page in ext2/3 doesn't have reference bit set, while btrfs has.
> > > > > we can't blindly filter out such pages with the bit.
> > > >
> > > > block_dev inodes have the accessed bits. Look at the below output.
> > > >
> > > > /dev/sda5 is a mounted ext4 partition. The 'A'/'R' in the
> > > > dump_page_cache lines stand for Active/Referenced.
> > > ext4 already does readahead? please check other filesystems.
> >
> > ext3/4 does readahead on accessing large directories. However that's
> > orthogonal feature to the user space metadata readahead. The latter is
> > still important for fast boot on ext3/4.
> >
> > > filesystem uses bread like API to read metadata, which definitely
> > > doesn't set referenced bit.
> >
> > __find_get_block() will call touch_buffer() which is a synonymous for
> > mark_page_accessed().
> yes, but only when the buffer is accessed at the second time.
Not likely. Otherwise it would be a performance bug.
__getblk() has two code paths, both will call touch_buffer().
a)
    __find_get_block()
        touch_buffer()
b)
    __getblk_slow()
        __find_get_block()
            touch_buffer()
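
For what it's worth, the promotion rule behind touch_buffer() can be sketched in plain userspace C. This is only a model under stated assumptions, not the kernel source: struct model_page and the model_* names are made up for illustration, and the LRU and unevictable checks are left out. It shows that the first touch already sets the referenced bit, and that a later touch moves the page to the active state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two page flags discussed in this thread. */
struct model_page {
	bool referenced;	/* models PG_referenced */
	bool active;		/* models PG_active */
};

/*
 * Models the promotion rule of mark_page_accessed():
 *   inactive,unreferenced -> inactive,referenced
 *   inactive,referenced   -> active,unreferenced
 *   active,unreferenced   -> active,referenced
 */
static void model_mark_page_accessed(struct model_page *page)
{
	assert(page != NULL);
	if (!page->active && page->referenced) {
		page->active = true;
		page->referenced = false;
	} else if (!page->referenced) {
		page->referenced = true;
	}
}

/* touch_buffer() is treated as a plain alias, as described above. */
static void model_touch_buffer(struct model_page *page)
{
	model_mark_page_accessed(page);
}
```

Starting from a page with both flags clear, one call leaves it referenced but inactive; a second call makes it active and clears the referenced flag.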
> > > > > fincore can takes a parameter or it returns a bit to distinguish
> > > > > referenced pages, but I don't think it's a good API. This should be
> > > > > transparent to userspace.
> > > >
> > > > Users care about the "cached" status may well be interested in the
> > > > "active/referenced" status. They are co-related information. fincore()
> > > > won't be a simple replication of mincore() anyway. fincore() has to
> > > > deal with huge sparsely accessed files. The accessed bits of a file
> > > > page are normally more meaningful than the accessed bits of mapped
> > > > (anonymous) pages.
> > > if all filesystems have the bit set, I'll buy-in. Otherwise, this isn't generic enough.
> >
> > It's a reasonable thing to set the accessed bits. So I believe the
> > various filesystems are calling mark_page_accessed() on their metadata
> > inode, or can be changed to do it.
> yes, we can, with a lot of pain. And filesystems must be smart to avoid marking the bit
> for pages which are readahead in but actually are invalid. The second patch in the series
"invalid" means !PG_uptodate? I wonder why there is a need to test
that bit at all. !PG_uptodate seems an unrelated transitional state.
> has more detailed information about this issue. The problem is if this is really worthy
> for metadata readahead. Some filesystems might don't care about metadata readahead. If
> we make fincore check the bit, then fincore syscall will not work for such filesystems,
> which is bad.
fincore() will always work as is. If the filesystem doesn't care about
metadata readahead, then the metadata readahead that makes use of the
bits will naturally not work for it?
> > > > Another option may be to use the above
> > > > /debug/tracing/objects/mm/pages/dump-file interface.
> > > >
> > > > > > - btrfs_metadata_readahead() can be passed to some (faked)
> > > > > > ->readpages() for use with fadvise.
> > > > > this need filesystem specific hook too, the difference is your proposal
> > > > > uses fadvise but I'm using ioctl. There isn't big difference.
> > > >
> > > > True for btrfs. However they make big differences for other file systems.
> > > why?
> >
> > The block_dev of ext2/3/4 can do metadata query/readahead directly
> > with fincore()+fadvise(), with no need for any additional ioctls.
> >
> > Given that the vast majority desktops are running ext2/3/4, it seems
> > worthwhile to have a straightforward solution for them.
> This does make ext filesystem metadata readahead straightforward, but gives a lot
> of pain for other filesystems. And even for ext filesystem, we need take care
> about the 'invalid page' issue above.
> On the other hand, with the ioctls approach, we can still make ext filesystem
> metadata readahead straightforward (just several lines of code, we can even
> add a lib API for such filesystems)
> We'd better have a more generic approach for all filesystems, while the ioctl
> approach is better.
Although I'm not all that fond of adding ioctls, I can understand the
difficulties and won't insist on you doing it the other way.
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2 0/5] add new ioctls to do metadata readahead in btrfs
2011-01-16 3:38 ` Wu Fengguang
@ 2011-01-17 1:32 ` Shaohua Li
2011-01-18 4:41 ` Wu Fengguang
0 siblings, 1 reply; 17+ messages in thread
From: Shaohua Li @ 2011-01-17 1:32 UTC (permalink / raw)
To: Wu, Fengguang
Cc: linux-btrfs@vger.kernel.org,
linux-fsdevel@vger.kernel.org,
Chris Mason, Christoph Hellwig, Andrew Morton, Arjan van de Ven,
Yan, Zheng, linux-api@vger.kernel.org,
mtk.manpages@gmail.com
On Sun, 2011-01-16 at 11:38 +0800, Wu, Fengguang wrote:
> On Wed, Jan 12, 2011 at 10:55:16AM +0800, Li, Shaohua wrote:
> > On Tue, Jan 11, 2011 at 05:13:53PM +0800, Wu, Fengguang wrote:
> > > On Tue, Jan 11, 2011 at 11:27:33AM +0800, Li, Shaohua wrote:
> > > > On Tue, 2011-01-11 at 11:07 +0800, Wu, Fengguang wrote:
> > > > > On Tue, Jan 11, 2011 at 10:03:16AM +0800, Li, Shaohua wrote:
> > > > > > On Tue, 2011-01-11 at 09:38 +0800, Wu, Fengguang wrote:
> > > > > > > On Tue, Jan 11, 2011 at 08:15:19AM +0800, Li, Shaohua wrote:
> > > > > > > > On Mon, 2011-01-10 at 22:26 +0800, Wu, Fengguang wrote:
> > > > > > > > > Shaohua,
> > > > > > > > >
> > > > > > > > > On Tue, Jan 04, 2011 at 01:40:30PM +0800, Li, Shaohua wrote:
> > > > > > > > > > Hi,
> > > > > > > > > > > We have file readahead to do async file read, but has no metadata
> > > > > > > > > > > readahead. For a list of files, their metadata is stored in fragmented
> > > > > > > > > > > disk space and metadata read is a sync operation, which impacts the
> > > > > > > > > > > efficiency of readahead much. The patches try to add metadata readahead
> > > > > > > > > > for btrfs.
> > > > > > > > > > In btrfs, metadata is stored in btree_inode. Ideally, if we could hook
> > > > > > > > > > the inode to a fd so we could use existing syscalls (readahead, mincore
> > > > > > > > > > or upcoming fincore) to do readahead, but the inode is hidden, there is
> > > > > > > > > > no easy way for this from my understanding. So we add two ioctls for
> > > > > > > > >
> > > > > > > > > If that is the main obstacle, why not do straightforward fincore()/
> > > > > > > > > fadvise(), and add ioctls to btrfs to export/grab the hidden
> > > > > > > > > btree_inode in any form? This will address btrfs' specific issue, and
> > > > > > > > > have the benefit of making the VFS part general enough. You know
> > > > > > > > > ext2/3/4 already have block_dev ready for metadata readahead.
> > > > > > > > I forgot to update this comment. Please see patch 2 and patch 4, both
> > > > > > > > > incore and readahead need btrfs specific stuff involved, so we can't use
> > > > > > > > generic fincore or something.
> > > > > > >
> > > > > > > You can if you like :)
> > > > > > >
> > > > > > > - fincore() can return the referenced bit, which is generally
> > > > > > > useful information
> > > > > > metadata page in ext2/3 doesn't have reference bit set, while btrfs has.
> > > > > > we can't blindly filter out such pages with the bit.
> > > > >
> > > > > block_dev inodes have the accessed bits. Look at the below output.
> > > > >
> > > > > /dev/sda5 is a mounted ext4 partition. The 'A'/'R' in the
> > > > > dump_page_cache lines stand for Active/Referenced.
> > > > ext4 already does readahead? please check other filesystems.
> > >
> > > ext3/4 does readahead on accessing large directories. However that's
> > > orthogonal feature to the user space metadata readahead. The latter is
> > > still important for fast boot on ext3/4.
> > >
> > > > filesystem uses bread like API to read metadata, which definitely
> > > > doesn't set referenced bit.
> > >
> > > __find_get_block() will call touch_buffer() which is a synonymous for
> > > mark_page_accessed().
> > yes, but only when the buffer is accessed at the second time.
>
> Not likely. Otherwise it would be a performance bug.
>
> __getblk() has two code paths, both will call touch_buffer().
>
> a)
>     __find_get_block()
>         touch_buffer()
> b)
>     __getblk_slow()
>         __find_get_block()
>             touch_buffer()
I missed this, sorry.
> > > > > > fincore can takes a parameter or it returns a bit to distinguish
> > > > > > referenced pages, but I don't think it's a good API. This should be
> > > > > > transparent to userspace.
> > > > >
> > > > > Users care about the "cached" status may well be interested in the
> > > > > "active/referenced" status. They are co-related information. fincore()
> > > > > won't be a simple replication of mincore() anyway. fincore() has to
> > > > > deal with huge sparsely accessed files. The accessed bits of a file
> > > > > page are normally more meaningful than the accessed bits of mapped
> > > > > (anonymous) pages.
> > > > if all filesystems have the bit set, I'll buy-in. Otherwise, this isn't generic enough.
> > >
> > > It's a reasonable thing to set the accessed bits. So I believe the
> > > various filesystems are calling mark_page_accessed() on their metadata
> > > inode, or can be changed to do it.
> > yes, we can, with a lot of pain. And filesystems must be smart to avoid marking the bit
> > for pages which are readahead in but actually are invalid. The second patch in the series
>
> "invalid" means !PG_uptodate? I wonder why there is a need to test
> that bit at all. !PG_uptodate seems an unrelated transitional state.
Not PG_uptodate; I mean the referenced bit. A readahead metadata page will have the uptodate bit set,
but it might not have the referenced bit if it's an obsolete page. btrfs
doesn't use buffer_head.
> > has more detailed information about this issue. The problem is if this is really worthy
> > for metadata readahead. Some filesystems might don't care about metadata readahead. If
> > we make fincore check the bit, then fincore syscall will not work for such filesystems,
> > which is bad.
>
> fincore() will always work as is. If the filesystem don't care about
> metadata readahead, then the metadata readahead that makes use of the
> bits will naturally not work for them?
Yes, they don't care about readahead, but they do care about fincore
output.
If fincore() checks the bits, it doesn't work even for normal file pages once the pages get
deactivated.
> > > > > Another option may be to use the above
> > > > > /debug/tracing/objects/mm/pages/dump-file interface.
> > > > >
> > > > > > > - btrfs_metadata_readahead() can be passed to some (faked)
> > > > > > > ->readpages() for use with fadvise.
> > > > > > this need filesystem specific hook too, the difference is your proposal
> > > > > > uses fadvise but I'm using ioctl. There isn't big difference.
> > > > >
> > > > > True for btrfs. However they make big differences for other file systems.
> > > > why?
> > >
> > > The block_dev of ext2/3/4 can do metadata query/readahead directly
> > > with fincore()+fadvise(), with no need for any additional ioctls.
> > >
> > > Given that the vast majority desktops are running ext2/3/4, it seems
> > > worthwhile to have a straightforward solution for them.
> > This does make ext filesystem metadata readahead straightforward, but gives a lot
> > of pain for other filesystems. And even for ext filesystem, we need take care
> > about the 'invalid page' issue above.
> > On the other hand, with the ioctls approach, we can still make ext filesystem
> > metadata readahead straightforward (just several lines of code, we can even
> > add a lib API for such filesystems)
> > We'd better have a more generic approach for all filesystems, while the ioctl
> > approach is better.
>
> Although I'm not all that fond of adding ioctls, I can understand the
> difficulties and won't insist on you doing it the other way.
Thanks!
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2 0/5] add new ioctls to do metadata readahead in btrfs
2011-01-17 1:32 ` Shaohua Li
@ 2011-01-18 4:41 ` Wu Fengguang
2011-01-18 5:15 ` Shaohua Li
0 siblings, 1 reply; 17+ messages in thread
From: Wu Fengguang @ 2011-01-18 4:41 UTC (permalink / raw)
To: Li, Shaohua
Cc: linux-btrfs@vger.kernel.org,
linux-fsdevel@vger.kernel.org,
Chris Mason, Christoph Hellwig, Andrew Morton, Arjan van de Ven,
Yan, Zheng, linux-api@vger.kernel.org,
mtk.manpages@gmail.com
On Mon, Jan 17, 2011 at 09:32:37AM +0800, Li, Shaohua wrote:
> On Sun, 2011-01-16 at 11:38 +0800, Wu, Fengguang wrote:
> > On Wed, Jan 12, 2011 at 10:55:16AM +0800, Li, Shaohua wrote:
> > > On Tue, Jan 11, 2011 at 05:13:53PM +0800, Wu, Fengguang wrote:
> > > > On Tue, Jan 11, 2011 at 11:27:33AM +0800, Li, Shaohua wrote:
> > > > > On Tue, 2011-01-11 at 11:07 +0800, Wu, Fengguang wrote:
> > > > > > On Tue, Jan 11, 2011 at 10:03:16AM +0800, Li, Shaohua wrote:
> > > > > > > fincore can takes a parameter or it returns a bit to distinguish
> > > > > > > referenced pages, but I don't think it's a good API. This should be
> > > > > > > transparent to userspace.
> > > > > >
> > > > > > Users care about the "cached" status may well be interested in the
> > > > > > "active/referenced" status. They are co-related information. fincore()
> > > > > > won't be a simple replication of mincore() anyway. fincore() has to
> > > > > > deal with huge sparsely accessed files. The accessed bits of a file
> > > > > > page are normally more meaningful than the accessed bits of mapped
> > > > > > (anonymous) pages.
> > > > > if all filesystems have the bit set, I'll buy-in. Otherwise, this isn't generic enough.
> > > >
> > > > It's a reasonable thing to set the accessed bits. So I believe the
> > > > various filesystems are calling mark_page_accessed() on their metadata
> > > > inode, or can be changed to do it.
> > > yes, we can, with a lot of pain. And filesystems must be smart to avoid marking the bit
> > > for pages which are readahead in but actually are invalid. The second patch in the series
> >
> > "invalid" means !PG_uptodate? I wonder why there is a need to test
> > that bit at all. !PG_uptodate seems an unrelated transitional state.
> Not PG_uptodate; I mean the referenced bit. A readahead metadata page will have the uptodate bit set,
> but it might not have the referenced bit if it's an obsolete page. btrfs
> doesn't use buffer_head.
I do see PageUptodate() tests in your patch; perhaps they can be removed?
> > > has more detailed information about this issue. The problem is if this is really worthy
> > > for metadata readahead. Some filesystems might don't care about metadata readahead. If
> > > we make fincore check the bit, then fincore syscall will not work for such filesystems,
> > > which is bad.
> >
> > fincore() will always work as is. If the filesystem don't care about
> > metadata readahead, then the metadata readahead that makes use of the
> > bits will naturally not work for them?
> yes, they don't care about readahead, but they do care about fincore
> output.
fincore() just reports the accessed bits as is. If the filesystem does
not use blockdev or export its internal metadata inode, the user won't
be able to run fincore() on the metadata inode at all.
> if fincore() checks the bits, it doesn't work even for normal file
> pages, if the pages get deactivated.
That's a problem independent of the interface. And for user space
readahead, it can be nicely fixed by collecting the pages-to-readahead
before the free pages drop low, i.e. before any page reclaim actions.
It's "nice" because you don't want to read ahead more data than can be
cached anyway, and it avoids thrashing on small-memory systems.
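
A minimal sketch of that collect-then-replay idea, using calls that exist today (mmap(), mincore() and posix_fadvise()), might look like the following. It assumes a regular file; a block device would need its size from somewhere other than fstat(), and mincore() only reports residency, not the accessed bits. The function names are made up for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Collect: record which pages of fd are resident in the page cache.
 * Returns a malloc'ed vector with one byte per page (bit 0 set means
 * resident) and stores the page count in *npages. Returns NULL on
 * error or for an empty file.
 */
static unsigned char *snapshot_resident(int fd, size_t *npages)
{
	long psz = sysconf(_SC_PAGESIZE);
	struct stat st;
	unsigned char *vec;
	void *map;

	assert(npages != NULL);
	if (psz <= 0 || fstat(fd, &st) < 0 || st.st_size == 0)
		return NULL;
	*npages = ((size_t)st.st_size + (size_t)psz - 1) / (size_t)psz;
	vec = malloc(*npages);
	if (!vec)
		return NULL;
	map = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED) {
		free(vec);
		return NULL;
	}
	if (mincore(map, (size_t)st.st_size, vec) < 0) {
		munmap(map, (size_t)st.st_size);
		free(vec);
		return NULL;
	}
	munmap(map, (size_t)st.st_size);
	return vec;
}

/* Replay: ask the kernel to prefetch every page marked resident in vec. */
static int replay_readahead(int fd, const unsigned char *vec, size_t npages)
{
	long psz = sysconf(_SC_PAGESIZE);
	size_t i;
	int err;

	for (i = 0; i < npages; i++) {
		if (!(vec[i] & 1))
			continue;
		/* posix_fadvise() returns the error number directly. */
		err = posix_fadvise(fd, (off_t)i * psz, psz,
				    POSIX_FADV_WILLNEED);
		if (err)
			return err;
	}
	return 0;
}
```

The snapshot would be taken early, while memory is still plentiful, and replayed on the next boot before the files are needed.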
> > > > > > Another option may be to use the above
> > > > > > /debug/tracing/objects/mm/pages/dump-file interface.
> > > > > >
> > > > > > > > - btrfs_metadata_readahead() can be passed to some (faked)
> > > > > > > > ->readpages() for use with fadvise.
> > > > > > > this need filesystem specific hook too, the difference is your proposal
> > > > > > > uses fadvise but I'm using ioctl. There isn't big difference.
> > > > > >
> > > > > > True for btrfs. However they make big differences for other file systems.
> > > > > why?
> > > >
> > > > The block_dev of ext2/3/4 can do metadata query/readahead directly
> > > > with fincore()+fadvise(), with no need for any additional ioctls.
> > > >
> > > > Given that the vast majority desktops are running ext2/3/4, it seems
> > > > worthwhile to have a straightforward solution for them.
> > > This does make ext filesystem metadata readahead straightforward, but gives a lot
> > > of pain for other filesystems. And even for ext filesystem, we need take care
> > > about the 'invalid page' issue above.
> > > On the other hand, with the ioctls approach, we can still make ext filesystem
> > > metadata readahead straightforward (just several lines of code, we can even
> > > add a lib API for such filesystems)
> > > We'd better have a more generic approach for all filesystems, while the ioctl
> > > approach is better.
> >
> > Although I'm not all that fond of adding ioctls, I can understand the
> > difficulties and won't insist on you doing it the other way.
> Thanks!
I'm not sure how realistic it is, but the other wild idea that
intrigued me about exporting the btrfs_inode in the initial plan is that it
might enable some interesting btrfs use cases. For example, one could write
a user space lib/tool to examine btrfs_inode and do a live fsck on
some snapshot, or mount btrfs only to read/write btrfs_inode, to
make use of btrfs' low-level (RAID?) functionality.
Just playing for fun :)
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2 0/5] add new ioctls to do metadata readahead in btrfs
2011-01-18 4:41 ` Wu Fengguang
@ 2011-01-18 5:15 ` Shaohua Li
2011-01-18 6:22 ` Wu Fengguang
0 siblings, 1 reply; 17+ messages in thread
From: Shaohua Li @ 2011-01-18 5:15 UTC (permalink / raw)
To: Wu, Fengguang
Cc: linux-btrfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
Chris Mason, Christoph Hellwig, Andrew Morton, Arjan van de Ven,
Yan, Zheng, linux-api@vger.kernel.org, mtk.manpages@gmail.com
On Tue, 2011-01-18 at 12:41 +0800, Wu, Fengguang wrote:
> On Mon, Jan 17, 2011 at 09:32:37AM +0800, Li, Shaohua wrote:
> > On Sun, 2011-01-16 at 11:38 +0800, Wu, Fengguang wrote:
> > > On Wed, Jan 12, 2011 at 10:55:16AM +0800, Li, Shaohua wrote:
> > > > On Tue, Jan 11, 2011 at 05:13:53PM +0800, Wu, Fengguang wrote:
> > > > > On Tue, Jan 11, 2011 at 11:27:33AM +0800, Li, Shaohua wrote:
> > > > > > On Tue, 2011-01-11 at 11:07 +0800, Wu, Fengguang wrote:
> > > > > > > On Tue, Jan 11, 2011 at 10:03:16AM +0800, Li, Shaohua wrote:
>
> > > > > > > > fincore can takes a parameter or it returns a bit to distinguish
> > > > > > > > referenced pages, but I don't think it's a good API. This should be
> > > > > > > > transparent to userspace.
> > > > > > >
> > > > > > > Users care about the "cached" status may well be interested in the
> > > > > > > "active/referenced" status. They are co-related information. fincore()
> > > > > > > won't be a simple replication of mincore() anyway. fincore() has to
> > > > > > > deal with huge sparsely accessed files. The accessed bits of a file
> > > > > > > page are normally more meaningful than the accessed bits of mapped
> > > > > > > (anonymous) pages.
> > > > > > if all filesystems have the bit set, I'll buy-in. Otherwise, this isn't generic enough.
> > > > >
> > > > > It's a reasonable thing to set the accessed bits. So I believe the
> > > > > various filesystems are calling mark_page_accessed() on their metadata
> > > > > inode, or can be changed to do it.
> > > > yes, we can, with a lot of pain. And filesystems must be smart to avoid marking the bit
> > > > for pages which are readahead in but actually are invalid. The second patch in the series
> > >
> > > "invalid" means !PG_uptodate? I wonder why there is a need to test
> > > that bit at all. !PG_uptodate seems an unrelated transitional state.
> > Not PG_uptodate; I mean the referenced bit. A readahead metadata page will have the uptodate bit set,
> > but it might not have the referenced bit if it's an obsolete page. btrfs
> > doesn't use buffer_head.
>
> I do see PageUptodate() tests in your patch; perhaps they can be removed?
uptodate bit isn't really needed, but I added it to make sure the page
is valid.
> > > > has more detailed information about this issue. The problem is if this is really worthy
> > > > for metadata readahead. Some filesystems might don't care about metadata readahead. If
> > > > we make fincore check the bit, then fincore syscall will not work for such filesystems,
> > > > which is bad.
> > >
> > > fincore() will always work as is. If the filesystem don't care about
> > > metadata readahead, then the metadata readahead that makes use of the
> > > bits will naturally not work for them?
> > yes, they don't care about readahead, but they do care about fincore
> > output.
>
> fincore() just reports the accessed bits as is. If the filesystem does
> not use blockdev or export its internal metadata inode, the user won't
> be able to run fincore() on the metadata inode at all.
>
> > if fincore() checks the bits, it doesn't work even for normal file
> > pages, if the pages get deactivated.
>
> That's a problem independent of the interface. And for user space
> readahead, it can be nicely fixed by collecting the pages-to-readahead
> before the free pages drop low, ie. before any page reclaim actions.
> It's "nice" because you don't want to readahead more data than
> cache-able anyway and avoid thrashing for small memory systems.
My point is that fincore() isn't designed only for readahead. People will use
it like mincore, which is its normal usage. Checking the bits will break
that normal usage, because fincore doesn't check whether the fd refers to a
metadata inode.
> > > > > > > Another option may be to use the above
> > > > > > > /debug/tracing/objects/mm/pages/dump-file interface.
> > > > > > >
> > > > > > > > > - btrfs_metadata_readahead() can be passed to some (faked)
> > > > > > > > > ->readpages() for use with fadvise.
> > > > > > > > this need filesystem specific hook too, the difference is your proposal
> > > > > > > > uses fadvise but I'm using ioctl. There isn't big difference.
> > > > > > >
> > > > > > > True for btrfs. However they make big differences for other file systems.
> > > > > > why?
> > > > >
> > > > > The block_dev of ext2/3/4 can do metadata query/readahead directly
> > > > > with fincore()+fadvise(), with no need for any additional ioctls.
> > > > >
> > > > > Given that the vast majority desktops are running ext2/3/4, it seems
> > > > > worthwhile to have a straightforward solution for them.
> > > > This does make ext filesystem metadata readahead straightforward, but gives a lot
> > > > of pain for other filesystems. And even for ext filesystem, we need take care
> > > > about the 'invalid page' issue above.
> > > > On the other hand, with the ioctls approach, we can still make ext filesystem
> > > > metadata readahead straightforward (just several lines of code, we can even
> > > > add a lib API for such filesystems)
> > > > We'd better have a more generic approach for all filesystems, while the ioctl
> > > > approach is better.
> > >
> > > Although I'm not all that fond of adding ioctls, I can understand the
> > > difficulties and won't insist on you doing it the other way.
> > Thanks!
>
> I'm not sure how realistic it is, but the other wild idea that
> intrigued me about exporting the btrfs_inode in the initial plan is that it
> might enable some interesting btrfs use cases. For example, one could write
> a user space lib/tool to examine btrfs_inode and do a live fsck on
> some snapshot, or mount btrfs only to read/write btrfs_inode, to
> make use of btrfs' low-level (RAID?) functionality.
>
> Just playing for fun :)
Got it. I don't know if there is a valid use for it, though.
Thanks,
Shaohua
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2 0/5] add new ioctls to do metadata readahead in btrfs
2011-01-18 5:15 ` Shaohua Li
@ 2011-01-18 6:22 ` Wu Fengguang
2011-01-18 6:35 ` Shaohua Li
0 siblings, 1 reply; 17+ messages in thread
From: Wu Fengguang @ 2011-01-18 6:22 UTC (permalink / raw)
To: Li, Shaohua
Cc: linux-btrfs@vger.kernel.org,
linux-fsdevel@vger.kernel.org,
Chris Mason, Christoph Hellwig, Andrew Morton, Arjan van de Ven,
Yan, Zheng, linux-api@vger.kernel.org,
mtk.manpages@gmail.com
On Tue, Jan 18, 2011 at 01:15:27PM +0800, Li, Shaohua wrote:
> On Tue, 2011-01-18 at 12:41 +0800, Wu, Fengguang wrote:
> > On Mon, Jan 17, 2011 at 09:32:37AM +0800, Li, Shaohua wrote:
> > > On Sun, 2011-01-16 at 11:38 +0800, Wu, Fengguang wrote:
> > > > On Wed, Jan 12, 2011 at 10:55:16AM +0800, Li, Shaohua wrote:
> > > > > On Tue, Jan 11, 2011 at 05:13:53PM +0800, Wu, Fengguang wrote:
> > > > > > On Tue, Jan 11, 2011 at 11:27:33AM +0800, Li, Shaohua wrote:
> > > > > > > On Tue, 2011-01-11 at 11:07 +0800, Wu, Fengguang wrote:
> > > > > > > > On Tue, Jan 11, 2011 at 10:03:16AM +0800, Li, Shaohua wrote:
> >
> > > > > > > > > fincore can takes a parameter or it returns a bit to distinguish
> > > > > > > > > referenced pages, but I don't think it's a good API. This should be
> > > > > > > > > transparent to userspace.
> > > > > > > >
> > > > > > > > Users care about the "cached" status may well be interested in the
> > > > > > > > "active/referenced" status. They are co-related information. fincore()
> > > > > > > > won't be a simple replication of mincore() anyway. fincore() has to
> > > > > > > > deal with huge sparsely accessed files. The accessed bits of a file
> > > > > > > > page are normally more meaningful than the accessed bits of mapped
> > > > > > > > (anonymous) pages.
> > > > > > > if all filesystems have the bit set, I'll buy-in. Otherwise, this isn't generic enough.
> > > > > >
> > > > > > It's a reasonable thing to set the accessed bits. So I believe the
> > > > > > various filesystems are calling mark_page_accessed() on their metadata
> > > > > > inode, or can be changed to do it.
> > > > > yes, we can, with a lot of pain. And filesystems must be smart to avoid marking the bit
> > > > > for pages which are readahead in but actually are invalid. The second patch in the series
> > > >
> > > > "invalid" means !PG_uptodate? I wonder why there is a need to test
> > > > that bit at all. !PG_uptodate seems an unrelated transitional state.
> > > Not PG_uptodate; I mean the referenced bit. A readahead metadata page will have the uptodate bit set,
> > > but it might not have the referenced bit if it's an obsolete page. btrfs
> > > doesn't use buffer_head.
> >
> > I do see PageUptodate() tests in your patch; perhaps they can be removed?
> uptodate bit isn't really needed, but I added it to make sure the page
> is valid.
It may be nitpicking, but I always try to remove optional code. The
PageUptodate() test looks irrelevant and is a good candidate for
removal.
> > > > > has more detailed information about this issue. The problem is if this is really worthy
> > > > > for metadata readahead. Some filesystems might don't care about metadata readahead. If
> > > > > we make fincore check the bit, then fincore syscall will not work for such filesystems,
> > > > > which is bad.
> > > >
> > > > fincore() will always work as is. If the filesystem don't care about
> > > > metadata readahead, then the metadata readahead that makes use of the
> > > > bits will naturally not work for them?
> > > yes, they don't care about readahead, but they do care about fincore
> > > output.
> >
> > fincore() just reports the accessed bits as is. If the filesystem does
> > not use blockdev or export its internal metadata inode, the user won't
> > be able to run fincore() on the metadata inode at all.
> >
> > > if fincore() checks the bits, it doesn't work even for normal file
> > > pages, if the pages get deactivated.
> >
> > That's a problem independent of the interface. And for user space
> > readahead, it can be nicely fixed by collecting the pages-to-readahead
> > before the free pages drop low, ie. before any page reclaim actions.
> > It's "nice" because you don't want to readahead more data than
> > cache-able anyway and avoid thrashing for small memory systems.
> My point is fincore() isn't designed only for readahead. People will use
> it like mincore, which is its normal usage. Checking the bits will break
> its normal usage, because fincore just doesn't check if the fd means a
> metadata inode.
Sorry, you missed my point :) I mean to export the accessed bits as-is
via the fincore() interface, not to check the accessed bits and then
report "page not cached" to user space for !PG_referenced pages.
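
As a rough sketch of what I mean (the FINCORE_* bit names and the one-byte-per-page layout are hypothetical; nothing like this is defined in the patches), the kernel would hand back the flags untouched and each consumer would apply its own policy:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-page result bits; not an existing ABI. */
#define FINCORE_CACHED		0x1
#define FINCORE_REFERENCED	0x2
#define FINCORE_ACTIVE		0x4

/* mincore()-style consumer: only asks "is the page cached?". */
static size_t count_cached(const unsigned char *vec, size_t n)
{
	size_t i, cached = 0;

	assert(vec != NULL || n == 0);
	for (i = 0; i < n; i++)
		if (vec[i] & FINCORE_CACHED)
			cached++;
	return cached;
}

/*
 * Readahead-tool consumer: keeps cached pages that were really accessed
 * (referenced or already promoted to active) and stores their indexes in
 * out[], up to out_max entries. Returns the number of indexes stored.
 */
static size_t collect_accessed(const unsigned char *vec, size_t n,
			       size_t *out, size_t out_max)
{
	size_t i, found = 0;

	assert(vec != NULL || n == 0);
	for (i = 0; i < n && found < out_max; i++) {
		if (!(vec[i] & FINCORE_CACHED))
			continue;
		if (vec[i] & (FINCORE_REFERENCED | FINCORE_ACTIVE))
			out[found++] = i;
	}
	return found;
}
```

A mincore-like user ignores the extra bits and still sees every cached page, while a readahead tool can skip pages that were read in but never touched.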
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2 0/5] add new ioctls to do metadata readahead in btrfs
2011-01-18 6:22 ` Wu Fengguang
@ 2011-01-18 6:35 ` Shaohua Li
0 siblings, 0 replies; 17+ messages in thread
From: Shaohua Li @ 2011-01-18 6:35 UTC (permalink / raw)
To: Wu, Fengguang
Cc: linux-btrfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
Chris Mason, Christoph Hellwig, Andrew Morton, Arjan van de Ven,
Yan, Zheng, linux-api@vger.kernel.org, mtk.manpages@gmail.com
On Tue, 2011-01-18 at 14:22 +0800, Wu, Fengguang wrote:
> On Tue, Jan 18, 2011 at 01:15:27PM +0800, Li, Shaohua wrote:
> > On Tue, 2011-01-18 at 12:41 +0800, Wu, Fengguang wrote:
> > > On Mon, Jan 17, 2011 at 09:32:37AM +0800, Li, Shaohua wrote:
> > > > On Sun, 2011-01-16 at 11:38 +0800, Wu, Fengguang wrote:
> > > > > On Wed, Jan 12, 2011 at 10:55:16AM +0800, Li, Shaohua wrote:
> > > > > > On Tue, Jan 11, 2011 at 05:13:53PM +0800, Wu, Fengguang wrote:
> > > > > > > On Tue, Jan 11, 2011 at 11:27:33AM +0800, Li, Shaohua wrote:
> > > > > > > > On Tue, 2011-01-11 at 11:07 +0800, Wu, Fengguang wrote:
> > > > > > > > > On Tue, Jan 11, 2011 at 10:03:16AM +0800, Li, Shaohua wrote:
> > >
> > > > > > > > > > fincore can takes a parameter or it returns a bit to distinguish
> > > > > > > > > > referenced pages, but I don't think it's a good API. This should be
> > > > > > > > > > transparent to userspace.
> > > > > > > > >
> > > > > > > > > Users care about the "cached" status may well be interested in the
> > > > > > > > > "active/referenced" status. They are co-related information. fincore()
> > > > > > > > > won't be a simple replication of mincore() anyway. fincore() has to
> > > > > > > > > deal with huge sparsely accessed files. The accessed bits of a file
> > > > > > > > > page are normally more meaningful than the accessed bits of mapped
> > > > > > > > > (anonymous) pages.
> > > > > > > > if all filesystems have the bit set, I'll buy-in. Otherwise, this isn't generic enough.
> > > > > > >
> > > > > > > It's a reasonable thing to set the accessed bits. So I believe the
> > > > > > > various filesystems are calling mark_page_accessed() on their metadata
> > > > > > > inode, or can be changed to do it.
> > > > > > yes, we can, with a lot of pain. And filesystems must be smart to avoid marking the bit
> > > > > > for pages which are readahead in but actually are invalid. The second patch in the series
> > > > >
> > > > > "invalid" means !PG_uptodate? I wonder why there is a need to test
> > > > > that bit at all. !PG_uptodate seems an unrelated transitional state.
> > > > Not PG_uptodate; I mean the referenced bit. A readahead metadata page will have the uptodate bit set,
> > > > but it might not have the referenced bit if it's an obsolete page. btrfs
> > > > doesn't use buffer_head.
> > >
> > > I do see PageUptodate() tests in your patch; perhaps they can be removed?
> > uptodate bit isn't really needed, but I added it to make sure the page
> > is valid.
>
> It may be a nitpick, but I always try to remove optional code. The
> PageUptodate() looks like an irrelevant test and a good candidate to
> remove.
ok, I can do this.
> > > > > > > has more detailed information about this issue. The problem is whether this is really worth it
> > > > > > > for metadata readahead. Some filesystems might not care about metadata readahead. If
> > > > > > > we make fincore check the bit, then the fincore syscall will not work for such filesystems,
> > > > > > > which is bad.
> > > > >
> > > > > > fincore() will always work as is. If the filesystem doesn't care about
> > > > > metadata readahead, then the metadata readahead that makes use of the
> > > > > bits will naturally not work for them?
> > > > yes, they don't care about readahead, but they do care about fincore
> > > > output.
> > >
> > > fincore() just reports the accessed bits as is. If the filesystem does
> > > not use blockdev or export its internal metadata inode, the user won't
> > > be able to run fincore() on the metadata inode at all.
> > >
> > > > if fincore() checks the bits, it doesn't work even for normal file
> > > > pages, if the pages get deactivated.
> > >
> > > That's a problem independent of the interface. And for user space
> > > readahead, it can be nicely fixed by collecting the pages-to-readahead
> > > before the free pages drop low, ie. before any page reclaim actions.
> > > It's "nice" because you don't want to readahead more data than
> > > cache-able anyway and avoid thrashing for small memory systems.
> > My point is fincore() isn't designed only for readahead. People will use
> > it like mincore, which is its normal usage. Checking the bits will break
> > its normal usage, because fincore just doesn't check if the fd means a
> > metadata inode.
>
> Sorry, you missed my point :) I mean to export the accessed bits as-is
> via the fincore() interface, not to check the accessed bits and then
> report "page not cached" to user space for !PG_referenced pages.
I thought you said this before, and I think it's a bad API. userspace
should not be aware of such bits, because they are kernel internal.
Except for the readahead usage, I can't imagine why userspace needs to know
the bits.
Thanks,
Shaohua
end of thread, other threads:[~2011-01-18 6:35 UTC | newest]
Thread overview: 17+ messages
2011-01-04 5:40 [PATCH v2 0/5] add new ioctls to do metadata readahead in btrfs Shaohua Li
2011-01-04 16:14 ` Jeff Moyer
[not found] ` <x498vz0abov.fsf-RRHT56Q3PSP4kTEheFKJxxDDeQx5vsVwAInAS/Ez/D0@public.gmane.org>
2011-01-05 2:10 ` Shaohua Li
2011-01-10 14:26 ` Wu Fengguang
2011-01-11 0:15 ` Shaohua Li
2011-01-11 1:38 ` Wu Fengguang
2011-01-11 2:03 ` Shaohua Li
2011-01-11 3:07 ` Wu Fengguang
2011-01-11 3:27 ` Shaohua Li
2011-01-11 9:13 ` Wu Fengguang
2011-01-12 2:55 ` Shaohua Li
[not found] ` <20110112025516.GA11303-yAZKuqJtXNMXR+D7ky4Foa2pdiUAq4bhAL8bYrjMMd8@public.gmane.org>
2011-01-16 3:38 ` Wu Fengguang
2011-01-17 1:32 ` Shaohua Li
2011-01-18 4:41 ` Wu Fengguang
2011-01-18 5:15 ` Shaohua Li
2011-01-18 6:22 ` Wu Fengguang
2011-01-18 6:35 ` Shaohua Li