From: "Xin Zhao" <uszhaoxin@gmail.com>
To: "Jan Kara" <jack@suse.cz>
Cc: "Dave Kleikamp" <shaggy@linux.vnet.ibm.com>,
linux-kernel <linux-kernel@vger.kernel.org>,
linux-fsdevel <linux-fsdevel@vger.kernel.org>
Subject: Re: Linux page cache issue?
Date: Thu, 29 Mar 2007 10:41:01 -0400 [thread overview]
Message-ID: <4ae3c140703290741p58199472u3bf9f3f58e4d1db1@mail.gmail.com> (raw)
In-Reply-To: <20070329092745.GA14616@atrey.karlin.mff.cuni.cz>
Hi Jan,
Many thanks for your kind reply.
I know we can use device inode's radix tree to achieve the same goal.
The only downside could be: First, by default, Linux will not add the
data pages into that radix tree. Only when a file is opened in
O_DIRECT, the data pages will be put into dev's radix tree. Moreover,
if the partition is big, I am not sure whether the lookup overhead is
an issue. So it might need some optimization.
Can you elaborate more about the aliasing issues mentioned in your
email? I do have some mechanisms to handle the following situation:
suppose two files share same data blocks. Now two processes open the
two files separately. If one process writes a file, the other file
will be affected. Is this the aliasing issue you referred to?
Thanks,
xin
On 3/29/07, Jan Kara <jack@suse.cz> wrote:
> Hello,
>
> > Now I want to explain the problem that leads me to explore the Linux
> > disk cache management. This is actually from my project. In a file
> > system I am working on, two files may have different inodes, but share
> > the same data blocks. Of course additional block-level reference
> > counting and copy-on-write mechanisms are needed to prevent operations
> > on one file from disrupting the other file. But the point is, the two
> > files share the same data blocks.
> >
> > I hope that consequential reads to the two files can benefit from disk
> > cache, since they have the same data blocks. But I noticed that Linux
> > splits disk buffer cache into many small parts and associate a file's
> > data with its mapping object. Linux determines whether a data page is
> > cached or not by lookup the file's mapping radix tree. So this is a
> > per-file radix tree. This design obviously makes each tree smaller and
> > faster to look up. But this design eliminates the possibility of
> > sharing disk cache across two files. For example, if a process reads
> > file 2 right after file 1 (both file 1 and 2 share the same data block
> > set). Even if the data blocks are already loaded in memory, but they
> > can only be located via file 1's mapping object. When Linux reads file
> > 2, it still think the data is not present in memory. So the process
> > still needs to load the data from disk again.
> Actually, there is one inode - the device inode - whose mapping can
> contain all the blocks of the filesystem. That is basically the radix
> tree you are looking for. ext3 for example uses it for accessing its
> metadata (indirect blocks etc.). But you have to be really careful to
> avoid aliasing issues and such when you'd like to map copies of those
> pages into mappings of several different inodes (BTW ext3cow filesystem
> may be interesting for you www.ext3cow.com).
>
> Honza
>
> > On 3/28/07, Dave Kleikamp <shaggy@linux.vnet.ibm.com> wrote:
> > >On Wed, 2007-03-28 at 02:45 -0400, Xin Zhao wrote:
> > >> Hi,
> > >>
> > >> If a Linux process opens and reads a file A, then it closes the file.
> > >> Will Linux keep the file A's data in cache for a while in case another
> > >> process opens and reads the same in a short time? I think that is what
> > >> I heard before.
> > >
> > >Yes.
> > >
> > >> But after I digged into the kernel code, I am confused.
> > >>
> > >> When a process closes the file A, iput() will be called, which in turn
> > >> calls the follows two functions:
> > >> iput_final()->generic_drop_inode()
> > >
> > >A comment from the top of fs/dcache.c:
> > >
> > >/*
> > > * Notes on the allocation strategy:
> > > *
> > > * The dcache is a master of the icache - whenever a dcache entry
> > > * exists, the inode will always exist. "iput()" is done either when
> > > * the dcache entry is deleted or garbage collected.
> > > */
> > >
> > >Basically, as long a a dentry is present, iput_final won't be called on
> > >the inode.
> > >
> > >> But from the following calling chain, we can see that file close will
> > >> eventually lead to evict and free all cached pages. Actually in
> > >> truncate_complete_page(), the pages will be freed. This seems to
> > >> imply that Linux has to re-read the same data from disk even if
> > >> another process B read the same file right after process A closes the
> > >> file. That does not make sense to me.
> > >>
> > >> /***calling chain ***/
> > >> generic_delete_inode/generic_forget_inode()->
> > >> truncate_inode_pages()->truncate_inode_pages_range()->
> > >> truncate_complete_page()->remove_from_page_cache()->
> > >> __remove_from_page_cache()->radix_tree_delete()
> > >>
> > >> Am I missing something? Can someone please provide some advise?
> > >>
> > >> Thanks a lot
> > >> -x
> > >
> > >Shaggy
> > >--
> > >David Kleikamp
> > >IBM Linux Technology Center
> > >
> > >
> > -
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at http://www.tux.org/lkml/
> --
> Jan Kara <jack@suse.cz>
> SuSE CR Labs
>
next prev parent reply other threads:[~2007-03-29 14:41 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
2007-03-28 6:45 Linux page cache issue? Xin Zhao
2007-03-28 7:35 ` junjie cai
2007-03-28 7:38 ` Matthias Kaehlcke
2007-03-28 14:10 ` Dave Kleikamp
2007-03-28 15:39 ` Xin Zhao
[not found] ` <alpine.DEB.0.83.0703281157010.2527@sigma.j-a-k-j.com>
2007-03-28 16:15 ` Xin Zhao
2007-03-29 9:27 ` Jan Kara
2007-03-29 14:41 ` Xin Zhao [this message]
2007-04-02 12:51 ` Jan Kara
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=4ae3c140703290741p58199472u3bf9f3f58e4d1db1@mail.gmail.com \
--to=uszhaoxin@gmail.com \
--cc=jack@suse.cz \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=shaggy@linux.vnet.ibm.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).