From: "Kirill A. Shutemov" <kirill@shutemov.name>
To: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Konstantin Khlebnikov <koct9i@gmail.com>,
Wu Fengguang <fengguang.wu@intel.com>,
Arnaldo Carvalho de Melo <acme@redhat.com>,
Borislav Petkov <bp@alien8.de>,
Johannes Weiner <hannes@cmpxchg.org>,
Rusty Russell <rusty@rustcorp.com.au>,
David Miller <davem@davemloft.net>,
Andres Freund <andres@2ndquadrant.com>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH 2/3] mm: introduce fincore()
Date: Mon, 2 Jun 2014 15:23:22 +0300
Message-ID: <20140602122322.GB8691@node.dhcp.inet.fi>
In-Reply-To: <1401686699-9723-3-git-send-email-n-horiguchi@ah.jp.nec.com>
On Mon, Jun 02, 2014 at 01:24:58AM -0400, Naoya Horiguchi wrote:
> This patch provides a new system call, fincore(2), which provides
> mincore()-like information, i.e. page residency of a given file. But unlike
> mincore(), fincore() takes a mode flag, which lets us extract more detailed
> information about the page cache, such as pfn and page flags. This kind of
> information is very helpful, for example, when applications want to know the
> file cache status to control IO in their own way.
>
> Details about the data format passed to userspace are explained in the
> inline comment, but generally, in the long entry format, we can choose which
> information is extracted, so you don't have to waste memory by
> extracting unnecessary information. And with the FINCORE_SKIP_HOLE flag,
> we can skip hole pages (not in memory), which avoids a flood of
> meaningless zero entries when calling it on an extremely large file of which
> only a few pages are loaded in memory.
>
> A basic test set is added in the next patch under tools/testing/selftests/fincore/.
>
> [1] http://thread.gmane.org/gmane.linux.kernel/1439212/focus=1441919
>
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
...
> diff --git v3.15-rc7.orig/mm/fincore.c v3.15-rc7/mm/fincore.c
> new file mode 100644
> index 000000000000..3fc3ef465471
> --- /dev/null
> +++ v3.15-rc7/mm/fincore.c
> @@ -0,0 +1,362 @@
> +/*
> + * fincore(2) system call
> + *
> + * Copyright (C) 2014 NEC Corporation, Naoya Horiguchi
> + */
> +
> +#include <linux/syscalls.h>
> +#include <linux/pagemap.h>
> +#include <linux/file.h>
> +#include <linux/fs.h>
> +#include <linux/mm.h>
> +#include <linux/slab.h>
> +#include <linux/hugetlb.h>
> +
> +/*
> + * You can control how the buffer in userspace is filled with these mode
> + * parameters:
> + *
> + * - FINCORE_BMAP:
> + * The page status is returned in a vector of bytes.
> + * The least significant bit of each byte is 1 if the referenced page
> + * is in memory, otherwise it is zero.
I'm okay with a bytemap. Just wonder why not a bitmap?
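A minimal sketch of the trade-off, not taken from the patch: how userspace would read either layout and how large the buffer has to be. The helper names are hypothetical, and the LSB-first bit order of the bitmap variant is an assumption.

```c
#include <stddef.h>

/* Bytemap (mincore()-style): one byte per page, residency in bit 0. */
static inline int bytemap_resident(const unsigned char *vec, size_t page)
{
	return vec[page] & 1;
}

/* Bitmap: one bit per page. LSB-first ordering is an assumption here. */
static inline int bitmap_resident(const unsigned char *vec, size_t page)
{
	return (vec[page / 8] >> (page % 8)) & 1;
}

/* Buffer size in bytes needed to describe nr_pages pages. */
static inline size_t bytemap_size(size_t nr_pages)
{
	return nr_pages;
}

static inline size_t bitmap_size(size_t nr_pages)
{
	return (nr_pages + 7) / 8;	/* round up to whole bytes */
}
```

For a 1 GiB file with 4 KiB pages that is 262144 bytes for the bytemap against 32768 for the bitmap, at the cost of a shift and mask on every lookup.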
> + *
> + * - FINCORE_PFN:
> + * stores pfn, using 8 bytes.
> + *
> + * - FINCORE_PAGEFLAGS:
> + * stores page flags, using 8 bytes. See definition of KPF_* for details.
> + *
> + * - FINCORE_PAGECACHE_TAGS:
> + * stores pagecache tags, using 8 bytes. See definition of PAGECACHE_TAG_*
> + * for details.
Is it safe to expose this info to an unprivileged process (consider all three
flags above)?
> + * - FINCORE_SKIP_HOLE: if this flag is set, fincore() doesn't store any
> + * information about holes. Instead, each per-page record starts with an
> + * entry holding the page offset (using 8 bytes.) This mode is useful if we
> + * handle a large file and only a few of its pages are in memory.
Hm.. It's probably overkill, but instead of filling a userspace buffer we
could return a file descriptor and define lseek(SEEK_HOLE) on it. Just thinking.
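For reference, a sketch of how the existing SEEK_DATA/SEEK_HOLE interface is walked on an ordinary sparse file. A residency descriptor as floated above would presumably be iterated the same way, with "data" meaning "page in cache"; that descriptor is purely hypothetical and nothing like it exists in the patch. walk_data_extents() and extent_fn are illustrative names, and the fallback values for SEEK_DATA/SEEK_HOLE are the Linux ones.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for SEEK_DATA / SEEK_HOLE on glibc */
#endif
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SEEK_DATA		/* Linux values, in case the headers hide them */
#define SEEK_DATA 3
#define SEEK_HOLE 4
#endif

/* Callback invoked for every data extent [start, end). */
typedef void (*extent_fn)(off_t start, off_t end, void *arg);

/*
 * Walk all data extents of fd from offset 0.
 * Returns the number of extents found, or -1 on error.
 */
static inline long walk_data_extents(int fd, extent_fn fn, void *arg)
{
	off_t pos = 0;
	long n = 0;

	for (;;) {
		off_t data = lseek(fd, pos, SEEK_DATA);
		if (data < 0)	/* ENXIO: no more data before EOF */
			return errno == ENXIO ? n : -1;

		off_t hole = lseek(fd, data, SEEK_HOLE);
		if (hole < 0)
			return -1;

		if (fn)
			fn(data, hole, arg);
		n++;
		pos = hole;	/* continue after this extent */
	}
}
```

The caller never has to size a buffer up front, which is the attraction for huge, mostly non-resident files; the price is two system calls per extent.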
> + *
> + * FINCORE_BMAP shouldn't be combined with any other flags, and the returned
> + * data in this mode looks like this:
> + *
> + *   page offset 0   1   2   3   4
> + *             +---+---+---+---+---+
> + *             | 1 | 0 | 0 | 1 | 1 | ...
> + *             +---+---+---+---+---+
> + *              <->
> + *              1 byte
> + *
> + * For FINCORE_PFN, page data is formatted like this:
> + *
> + *   page offset 0       1       2       3       4
> + *           +-------+-------+-------+-------+-------+
> + *           |  pfn  |  pfn  |  pfn  |  pfn  |  pfn  | ...
> + *           +-------+-------+-------+-------+-------+
> + *            <----->
> + *            8 byte
> + *
> + * We can use multiple flags among FINCORE_(PFN|PAGEFLAGS|PAGECACHE_TAGS).
> + * For example, when the mode is FINCORE_PFN|FINCORE_PAGEFLAGS, the per-page
> + * information is stored like this:
> + *
> + *     page offset 0   page offset 1   page offset 2
> + *   +-------+-------+-------+-------+-------+-------+
> + *   |  pfn  | flags |  pfn  | flags |  pfn  | flags | ...
> + *   +-------+-------+-------+-------+-------+-------+
> + *    <-------------> <-------------> <------------->
> + *        16 bytes        16 bytes        16 bytes
> + *
> + * When FINCORE_SKIP_HOLE is set, we ignore holes and add a page offset entry
> + * (8 bytes) instead. For example, the data format of mode
> + * FINCORE_PFN|FINCORE_SKIP_HOLE is as follows:
> + *
> + *   +-------+-------+-------+-------+-------+-------+
> + *   | pgoff |  pfn  | pgoff |  pfn  | pgoff |  pfn  | ...
> + *   +-------+-------+-------+-------+-------+-------+
> + *    <-------------> <-------------> <------------->
> + *        16 bytes        16 bytes        16 bytes
> + */
> +#define FINCORE_BMAP 0x01 /* bytemap mode */
> +#define FINCORE_PFN 0x02
> +#define FINCORE_PAGE_FLAGS 0x04
> +#define FINCORE_PAGECACHE_TAGS 0x08
> +#define FINCORE_SKIP_HOLE 0x10
FINCORE_SKIP_HOLE is greater than FINCORE_PFN, but pgoff precedes pfn in
records. It's confusing. We need a clear definition of the record format.
What about renaming FINCORE_SKIP_HOLE -> FINCORE_PGOFF and moving it before
FINCORE_PFN? Then FINCORE_PGOFF is less than FINCORE_PFN, which is less than
FINCORE_PAGE_FLAGS, which is less than FINCORE_PAGECACHE_TAGS. That matches
the order in records:
FINCORE_PGOFF|FINCORE_PFN|FINCORE_PAGE_FLAGS|FINCORE_PAGECACHE_TAGS
+-------+-------+-------+-------+-------+-------+-------+-------+
| pgoff |  pfn  | flags | tags  | pgoff |  pfn  | flags | tags  | ...
+-------+-------+-------+-------+-------+-------+-------+-------+
 <-----------------------------> <----------------------------->
            32 bytes                        32 bytes
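With that ordering the record layout follows mechanically from the mode bits. A minimal sketch, assuming the renamed flag and renumbered values suggested here (the values and helper names below are hypothetical and not from the patch):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical renumbering: bit order == field order within a record. */
#define FINCORE_BMAP		0x01	/* bytemap mode */
#define FINCORE_PGOFF		0x02
#define FINCORE_PFN		0x04
#define FINCORE_PAGE_FLAGS	0x08
#define FINCORE_PAGECACHE_TAGS	0x10

#define FINCORE_LONGENTRY_MASK	(FINCORE_PGOFF | FINCORE_PFN | \
				 FINCORE_PAGE_FLAGS | FINCORE_PAGECACHE_TAGS)

/* Count set bits (Kernighan's method). */
static inline size_t count_bits(unsigned int v)
{
	size_t n = 0;

	for (; v; v &= v - 1)
		n++;
	return n;
}

/* Size in bytes of one record; 0 if BMAP is mixed with other flags. */
static inline size_t fincore_record_size(unsigned int mode)
{
	if (mode & FINCORE_BMAP)
		return mode == FINCORE_BMAP ? 1 : 0;
	return count_bits(mode & FINCORE_LONGENTRY_MASK) * sizeof(uint64_t);
}

/*
 * Byte offset of a single-bit field inside a record, or -1 if the field
 * is not part of mode. Every enabled lower bit adds one 8-byte slot.
 */
static inline long fincore_field_offset(unsigned int mode, unsigned int field)
{
	if (!(mode & field & FINCORE_LONGENTRY_MASK))
		return -1;
	return (long)(count_bits(mode & FINCORE_LONGENTRY_MASK & (field - 1)) *
		      sizeof(uint64_t));
}
```

Userspace then needs no per-combination table: for FINCORE_PGOFF|FINCORE_PFN the pfn sits at offset 8 of a 16-byte record, and for all four flags the tags sit at offset 24 of a 32-byte record.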
> +
> +#define FINCORE_MODE_MASK 0x1f
> +#define FINCORE_LONGENTRY_MASK (FINCORE_PFN | FINCORE_PAGE_FLAGS | \
> + FINCORE_PAGECACHE_TAGS | FINCORE_SKIP_HOLE)
> +
--
Kirill A. Shutemov