linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: "Kirill A. Shutemov" <kirill@shutemov.name>
To: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Konstantin Khlebnikov <koct9i@gmail.com>,
	Wu Fengguang <fengguang.wu@intel.com>,
	Arnaldo Carvalho de Melo <acme@redhat.com>,
	Borislav Petkov <bp@alien8.de>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Rusty Russell <rusty@rustcorp.com.au>,
	David Miller <davem@davemloft.net>,
	Andres Freund <andres@2ndquadrant.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH 2/3] mm: introduce fincore()
Date: Mon, 2 Jun 2014 15:23:22 +0300	[thread overview]
Message-ID: <20140602122322.GB8691@node.dhcp.inet.fi> (raw)
In-Reply-To: <1401686699-9723-3-git-send-email-n-horiguchi@ah.jp.nec.com>

On Mon, Jun 02, 2014 at 01:24:58AM -0400, Naoya Horiguchi wrote:
> This patch provides a new system call fincore(2), which provides mincore()-
> like information, i.e. page residency of a given file. But unlike mincore(),
> fincore() can have a mode flag and it enables us to extract more detailed
> information about page cache like pfn and page flag. This kind of information
> is very helpful for example when applications want to know the file cache
> status to control IO on their own way.
> 
> Detail about the data format being passed to userspace are explained in
> inline comment, but generally in long entry format, we can choose which
> information is extraced flexibly, so you don't have to waste memory by
> extracting unnecessary information. And with FINCORE_SKIP_HOLE flag,
> we can skip hole pages (not on memory,) which makes us avoid a flood of
> meaningless zero entries when calling on extremely large (but only few
> pages of it are loaded on memory) file.
> 
> Basic testset is added in a next patch on tools/testing/selftests/fincore/.
> 
> [1] http://thread.gmane.org/gmane.linux.kernel/1439212/focus=1441919
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

...

> diff --git v3.15-rc7.orig/mm/fincore.c v3.15-rc7/mm/fincore.c
> new file mode 100644
> index 000000000000..3fc3ef465471
> --- /dev/null
> +++ v3.15-rc7/mm/fincore.c
> @@ -0,0 +1,362 @@
> +/*
> + * fincore(2) system call
> + *
> + * Copyright (C) 2014 NEC Corporation, Naoya Horiguchi
> + */
> +
> +#include <linux/syscalls.h>
> +#include <linux/pagemap.h>
> +#include <linux/file.h>
> +#include <linux/fs.h>
> +#include <linux/mm.h>
> +#include <linux/slab.h>
> +#include <linux/hugetlb.h>
> +
> +/*
> + * You can control how the buffer in userspace is filled with this mode
> + * parameters:
> + *
> + * - FINCORE_BMAP:
> + *     The page status is returned in a vector of bytes.
> + *     The least significant bit of each byte is 1 if the referenced page
> + *     is in memory, otherwise it is zero.

I'm okay with bytemap. Just wounder why not bitmap?

> + *
> + * - FINCORE_PFN:
> + *     stores pfn, using 8 bytes.
> + *
> + * - FINCORE_PAGEFLAGS:
> + *     stores page flags, using 8 bytes. See definition of KPF_* for details.
> + *
> + * - FINCORE_PAGECACHE_TAGS:
> + *     stores pagecache tags, using 8 bytes. See definition of PAGECACHE_TAG_*
> + *     for details.

Is it safe to expose this info to unprivilaged process (consider all three
flags above)?

> + * - FINCORE_SKIP_HOLE: if this flag is set, fincore() doesn't store any
> + *     information about hole. Instead each records per page has the entry
> + *     of page offset (using 8 bytes.) This mode is useful if we handle
> + *     large file and only few pages are on memory for the file.

Hm.. It's probably overkill, but instead of filling userspace buffer we
could return file descriptor and define lseek(SEEK_HOLE). Just thinking.

> + *
> + * FINCORE_BMAP shouldn't be used combined with any other flags, and returnd
> + * data in this mode is like this:
> + *
> + *   page offset  0   1   2   3   4
> + *              +---+---+---+---+---+
> + *              | 1 | 0 | 0 | 1 | 1 | ...
> + *              +---+---+---+---+---+
> + *               <->
> + *              1 byte
> + *
> + * For FINCORE_PFN, page data is formatted like this:
> + *
> + *   page offset    0       1       2       3       4
> + *              +-------+-------+-------+-------+-------+
> + *              |  pfn  |  pfn  |  pfn  |  pfn  |  pfn  | ...
> + *              +-------+-------+-------+-------+-------+
> + *               <----->
> + *               8 byte
> + *
> + * We can use multiple flags among FINCORE_(PFN|PAGEFLAGS|PAGECACHE_TAGS).
> + * For example, when the mode is FINCORE_PFN|FINCORE_PAGEFLAGS, the per-page
> + * information is stored like this:
> + *
> + *    page offset 0    page offset 1   page offset 2
> + *   +-------+-------+-------+-------+-------+-------+
> + *   |  pfn  | flags |  pfn  | flags |  pfn  | flags | ...
> + *   +-------+-------+-------+-------+-------+-------+
> + *    <-------------> <-------------> <------------->
> + *       16 bytes        16 bytes        16 bytes
> + *
> + * When FINCORE_SKIP_HOLE is set, we ignore holes and add page offset entry
> + * (8 bytes) instead. For example, the data format of mode
> + * FINCORE_PFN|FINCORE_SKIP_HOLE is like follows:
> + *
> + *   +-------+-------+-------+-------+-------+-------+
> + *   | pgoff |  pfn  | pgoff |  pfn  | pgoff |  pfn  | ...
> + *   +-------+-------+-------+-------+-------+-------+
> + *    <-------------> <-------------> <------------->
> + *       16 bytes        16 bytes        16 bytes
> + */
> +#define FINCORE_BMAP		0x01	/* bytemap mode */
> +#define FINCORE_PFN		0x02
> +#define FINCORE_PAGE_FLAGS	0x04
> +#define FINCORE_PAGECACHE_TAGS	0x08
> +#define FINCORE_SKIP_HOLE	0x10

FINCORE_SKIP_HOLE is greater then FINCORE_PFN but pgoff precedes pfn in
records. It's confusing. We need clear definition of record format.

What about rename FINCORE_SKIP_HOLE -> FINCORE_PGOFF, move it before
FINCORE_PFN. So FINCORE_PGOFF is less than FINCORE_PFN, which is less than
FINCORE_PAGE_FLAGS, which is less than FINCORE_PAGECACHE_TAGS. It matches
order in records:

FINCORE_PGOFF|FINCORE_PFN|FINCORE_PAGEFLAGS|FINCORE_PAGECACHE_TAGS

 +-------+-------+-------+-------+-------+-------+-------+-------+
 | pgoff |  pfn  | flags |  tags | pgoff |  pfn  | flags |  tags | ...
 +-------+-------+-------+-------+-------+-------+-------+-------+
  <-----------------------------> <------------------------------>
             32 bytes                        32 bytes

> +
> +#define FINCORE_MODE_MASK	0x1f
> +#define FINCORE_LONGENTRY_MASK	(FINCORE_PFN | FINCORE_PAGE_FLAGS | \
> +				 FINCORE_PAGECACHE_TAGS | FINCORE_SKIP_HOLE)
> +
-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  parent reply	other threads:[~2014-06-02 12:23 UTC|newest]

Thread overview: 34+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-05-21  2:26 [PATCH 0/4] pagecache scanning with /proc/kpagecache Naoya Horiguchi
2014-05-21  2:26 ` [PATCH 1/4] radix-tree: add end_index to support ranged iteration Naoya Horiguchi
2014-05-21  8:21   ` Konstantin Khlebnikov
2014-05-21 19:26     ` Naoya Horiguchi
2014-05-21  2:26 ` [PATCH 2/4] fs/proc/page.c: introduce /proc/kpagecache interface Naoya Horiguchi
2014-05-21  2:26 ` [PATCH 3/4] tools/vm/page-types.c: rework on file cache scanning mode Naoya Horiguchi
2014-05-21  2:26 ` [PATCH 4/4] Documentation: update Documentation/vm/pagemap.txt Naoya Horiguchi
2014-05-21 22:42 ` [PATCH 0/4] pagecache scanning with /proc/kpagecache Andrew Morton
2014-05-22  2:19   ` Naoya Horiguchi
     [not found]   ` <537d5ee4.4914e00a.5672.ffff85d5SMTPIN_ADDED_BROKEN@mx.google.com>
2014-05-22  2:33     ` Andrew Morton
2014-05-22  9:50       ` Konstantin Khlebnikov
2014-05-22 10:36         ` Kirill A. Shutemov
2014-05-22 17:47           ` Naoya Horiguchi
2014-05-22 21:02             ` Naoya Horiguchi
2014-06-02  5:24       ` [RFC][PATCH 0/3] mm: introduce fincore() Naoya Horiguchi
2014-06-02  5:24         ` [PATCH 1/3] replace PAGECACHE_TAG_* definition with enumeration Naoya Horiguchi
2014-06-02 16:12           ` Dave Hansen
2014-06-02 16:37             ` Naoya Horiguchi
     [not found]             ` <1401727052-f7v7kykv@n-horiguchi@ah.jp.nec.com>
2014-06-02 16:45               ` Dave Hansen
2014-06-02 17:14                 ` Naoya Horiguchi
2014-06-02 18:19                   ` Dave Hansen
2014-06-02 18:48                     ` Naoya Horiguchi
2014-06-02 21:16             ` Andrew Morton
2014-06-02 21:51               ` Naoya Horiguchi
2014-06-02  5:24         ` [PATCH 2/3] mm: introduce fincore() Naoya Horiguchi
2014-06-02  6:42           ` Christoph Hellwig
2014-06-02 14:19             ` Naoya Horiguchi
2014-06-02  7:06           ` Michael Kerrisk
2014-06-02 14:21             ` Naoya Horiguchi
2014-06-02 12:23           ` Kirill A. Shutemov [this message]
2014-06-02 14:52             ` Naoya Horiguchi
2014-06-02 16:11           ` Dave Hansen
2014-06-02 16:22             ` Naoya Horiguchi
2014-06-02  5:24         ` [PATCH 3/3] selftest: add test code for fincore() Naoya Horiguchi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20140602122322.GB8691@node.dhcp.inet.fi \
    --to=kirill@shutemov.name \
    --cc=acme@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=andres@2ndquadrant.com \
    --cc=bp@alien8.de \
    --cc=davem@davemloft.net \
    --cc=fengguang.wu@intel.com \
    --cc=hannes@cmpxchg.org \
    --cc=koct9i@gmail.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=n-horiguchi@ah.jp.nec.com \
    --cc=rusty@rustcorp.com.au \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).