All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Kirill A. Shutemov" <kirill@shutemov.name>
To: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Konstantin Khlebnikov <koct9i@gmail.com>,
	Wu Fengguang <fengguang.wu@intel.com>,
	Arnaldo Carvalho de Melo <acme@redhat.com>,
	Borislav Petkov <bp@alien8.de>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Rusty Russell <rusty@rustcorp.com.au>,
	David Miller <davem@davemloft.net>,
	Andres Freund <andres@2ndquadrant.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH 2/3] mm: introduce fincore()
Date: Mon, 2 Jun 2014 15:23:22 +0300	[thread overview]
Message-ID: <20140602122322.GB8691@node.dhcp.inet.fi> (raw)
In-Reply-To: <1401686699-9723-3-git-send-email-n-horiguchi@ah.jp.nec.com>

On Mon, Jun 02, 2014 at 01:24:58AM -0400, Naoya Horiguchi wrote:
> This patch provides a new system call fincore(2), which provides mincore()-
> like information, i.e. page residency of a given file. But unlike mincore(),
> fincore() can have a mode flag and it enables us to extract more detailed
> information about page cache like pfn and page flag. This kind of information
> is very helpful for example when applications want to know the file cache
> status to control IO on their own way.
> 
> Detail about the data format being passed to userspace are explained in
> inline comment, but generally in long entry format, we can choose which
> information is extraced flexibly, so you don't have to waste memory by
> extracting unnecessary information. And with FINCORE_SKIP_HOLE flag,
> we can skip hole pages (not on memory,) which makes us avoid a flood of
> meaningless zero entries when calling on extremely large (but only few
> pages of it are loaded on memory) file.
> 
> Basic testset is added in a next patch on tools/testing/selftests/fincore/.
> 
> [1] http://thread.gmane.org/gmane.linux.kernel/1439212/focus=1441919
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

...

> diff --git v3.15-rc7.orig/mm/fincore.c v3.15-rc7/mm/fincore.c
> new file mode 100644
> index 000000000000..3fc3ef465471
> --- /dev/null
> +++ v3.15-rc7/mm/fincore.c
> @@ -0,0 +1,362 @@
> +/*
> + * fincore(2) system call
> + *
> + * Copyright (C) 2014 NEC Corporation, Naoya Horiguchi
> + */
> +
> +#include <linux/syscalls.h>
> +#include <linux/pagemap.h>
> +#include <linux/file.h>
> +#include <linux/fs.h>
> +#include <linux/mm.h>
> +#include <linux/slab.h>
> +#include <linux/hugetlb.h>
> +
> +/*
> + * You can control how the buffer in userspace is filled with this mode
> + * parameters:
> + *
> + * - FINCORE_BMAP:
> + *     The page status is returned in a vector of bytes.
> + *     The least significant bit of each byte is 1 if the referenced page
> + *     is in memory, otherwise it is zero.

I'm okay with bytemap. Just wounder why not bitmap?

> + *
> + * - FINCORE_PFN:
> + *     stores pfn, using 8 bytes.
> + *
> + * - FINCORE_PAGEFLAGS:
> + *     stores page flags, using 8 bytes. See definition of KPF_* for details.
> + *
> + * - FINCORE_PAGECACHE_TAGS:
> + *     stores pagecache tags, using 8 bytes. See definition of PAGECACHE_TAG_*
> + *     for details.

Is it safe to expose this info to unprivilaged process (consider all three
flags above)?

> + * - FINCORE_SKIP_HOLE: if this flag is set, fincore() doesn't store any
> + *     information about hole. Instead each records per page has the entry
> + *     of page offset (using 8 bytes.) This mode is useful if we handle
> + *     large file and only few pages are on memory for the file.

Hm.. It's probably overkill, but instead of filling userspace buffer we
could return file descriptor and define lseek(SEEK_HOLE). Just thinking.

> + *
> + * FINCORE_BMAP shouldn't be used combined with any other flags, and returnd
> + * data in this mode is like this:
> + *
> + *   page offset  0   1   2   3   4
> + *              +---+---+---+---+---+
> + *              | 1 | 0 | 0 | 1 | 1 | ...
> + *              +---+---+---+---+---+
> + *               <->
> + *              1 byte
> + *
> + * For FINCORE_PFN, page data is formatted like this:
> + *
> + *   page offset    0       1       2       3       4
> + *              +-------+-------+-------+-------+-------+
> + *              |  pfn  |  pfn  |  pfn  |  pfn  |  pfn  | ...
> + *              +-------+-------+-------+-------+-------+
> + *               <----->
> + *               8 byte
> + *
> + * We can use multiple flags among FINCORE_(PFN|PAGEFLAGS|PAGECACHE_TAGS).
> + * For example, when the mode is FINCORE_PFN|FINCORE_PAGEFLAGS, the per-page
> + * information is stored like this:
> + *
> + *    page offset 0    page offset 1   page offset 2
> + *   +-------+-------+-------+-------+-------+-------+
> + *   |  pfn  | flags |  pfn  | flags |  pfn  | flags | ...
> + *   +-------+-------+-------+-------+-------+-------+
> + *    <-------------> <-------------> <------------->
> + *       16 bytes        16 bytes        16 bytes
> + *
> + * When FINCORE_SKIP_HOLE is set, we ignore holes and add page offset entry
> + * (8 bytes) instead. For example, the data format of mode
> + * FINCORE_PFN|FINCORE_SKIP_HOLE is like follows:
> + *
> + *   +-------+-------+-------+-------+-------+-------+
> + *   | pgoff |  pfn  | pgoff |  pfn  | pgoff |  pfn  | ...
> + *   +-------+-------+-------+-------+-------+-------+
> + *    <-------------> <-------------> <------------->
> + *       16 bytes        16 bytes        16 bytes
> + */
> +#define FINCORE_BMAP		0x01	/* bytemap mode */
> +#define FINCORE_PFN		0x02
> +#define FINCORE_PAGE_FLAGS	0x04
> +#define FINCORE_PAGECACHE_TAGS	0x08
> +#define FINCORE_SKIP_HOLE	0x10

FINCORE_SKIP_HOLE is greater then FINCORE_PFN but pgoff precedes pfn in
records. It's confusing. We need clear definition of record format.

What about rename FINCORE_SKIP_HOLE -> FINCORE_PGOFF, move it before
FINCORE_PFN. So FINCORE_PGOFF is less than FINCORE_PFN, which is less than
FINCORE_PAGE_FLAGS, which is less than FINCORE_PAGECACHE_TAGS. It matches
order in records:

FINCORE_PGOFF|FINCORE_PFN|FINCORE_PAGEFLAGS|FINCORE_PAGECACHE_TAGS

 +-------+-------+-------+-------+-------+-------+-------+-------+
 | pgoff |  pfn  | flags |  tags | pgoff |  pfn  | flags |  tags | ...
 +-------+-------+-------+-------+-------+-------+-------+-------+
  <-----------------------------> <------------------------------>
             32 bytes                        32 bytes

> +
> +#define FINCORE_MODE_MASK	0x1f
> +#define FINCORE_LONGENTRY_MASK	(FINCORE_PFN | FINCORE_PAGE_FLAGS | \
> +				 FINCORE_PAGECACHE_TAGS | FINCORE_SKIP_HOLE)
> +
-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

WARNING: multiple messages have this Message-ID (diff)
From: "Kirill A. Shutemov" <kirill@shutemov.name>
To: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Konstantin Khlebnikov <koct9i@gmail.com>,
	Wu Fengguang <fengguang.wu@intel.com>,
	Arnaldo Carvalho de Melo <acme@redhat.com>,
	Borislav Petkov <bp@alien8.de>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Rusty Russell <rusty@rustcorp.com.au>,
	David Miller <davem@davemloft.net>,
	Andres Freund <andres@2ndquadrant.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH 2/3] mm: introduce fincore()
Date: Mon, 2 Jun 2014 15:23:22 +0300	[thread overview]
Message-ID: <20140602122322.GB8691@node.dhcp.inet.fi> (raw)
In-Reply-To: <1401686699-9723-3-git-send-email-n-horiguchi@ah.jp.nec.com>

On Mon, Jun 02, 2014 at 01:24:58AM -0400, Naoya Horiguchi wrote:
> This patch provides a new system call fincore(2), which provides mincore()-
> like information, i.e. page residency of a given file. But unlike mincore(),
> fincore() can have a mode flag and it enables us to extract more detailed
> information about page cache like pfn and page flag. This kind of information
> is very helpful for example when applications want to know the file cache
> status to control IO on their own way.
> 
> Detail about the data format being passed to userspace are explained in
> inline comment, but generally in long entry format, we can choose which
> information is extraced flexibly, so you don't have to waste memory by
> extracting unnecessary information. And with FINCORE_SKIP_HOLE flag,
> we can skip hole pages (not on memory,) which makes us avoid a flood of
> meaningless zero entries when calling on extremely large (but only few
> pages of it are loaded on memory) file.
> 
> Basic testset is added in a next patch on tools/testing/selftests/fincore/.
> 
> [1] http://thread.gmane.org/gmane.linux.kernel/1439212/focus=1441919
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

...

> diff --git v3.15-rc7.orig/mm/fincore.c v3.15-rc7/mm/fincore.c
> new file mode 100644
> index 000000000000..3fc3ef465471
> --- /dev/null
> +++ v3.15-rc7/mm/fincore.c
> @@ -0,0 +1,362 @@
> +/*
> + * fincore(2) system call
> + *
> + * Copyright (C) 2014 NEC Corporation, Naoya Horiguchi
> + */
> +
> +#include <linux/syscalls.h>
> +#include <linux/pagemap.h>
> +#include <linux/file.h>
> +#include <linux/fs.h>
> +#include <linux/mm.h>
> +#include <linux/slab.h>
> +#include <linux/hugetlb.h>
> +
> +/*
> + * You can control how the buffer in userspace is filled with this mode
> + * parameters:
> + *
> + * - FINCORE_BMAP:
> + *     The page status is returned in a vector of bytes.
> + *     The least significant bit of each byte is 1 if the referenced page
> + *     is in memory, otherwise it is zero.

I'm okay with bytemap. Just wounder why not bitmap?

> + *
> + * - FINCORE_PFN:
> + *     stores pfn, using 8 bytes.
> + *
> + * - FINCORE_PAGEFLAGS:
> + *     stores page flags, using 8 bytes. See definition of KPF_* for details.
> + *
> + * - FINCORE_PAGECACHE_TAGS:
> + *     stores pagecache tags, using 8 bytes. See definition of PAGECACHE_TAG_*
> + *     for details.

Is it safe to expose this info to unprivilaged process (consider all three
flags above)?

> + * - FINCORE_SKIP_HOLE: if this flag is set, fincore() doesn't store any
> + *     information about hole. Instead each records per page has the entry
> + *     of page offset (using 8 bytes.) This mode is useful if we handle
> + *     large file and only few pages are on memory for the file.

Hm.. It's probably overkill, but instead of filling userspace buffer we
could return file descriptor and define lseek(SEEK_HOLE). Just thinking.

> + *
> + * FINCORE_BMAP shouldn't be used combined with any other flags, and returnd
> + * data in this mode is like this:
> + *
> + *   page offset  0   1   2   3   4
> + *              +---+---+---+---+---+
> + *              | 1 | 0 | 0 | 1 | 1 | ...
> + *              +---+---+---+---+---+
> + *               <->
> + *              1 byte
> + *
> + * For FINCORE_PFN, page data is formatted like this:
> + *
> + *   page offset    0       1       2       3       4
> + *              +-------+-------+-------+-------+-------+
> + *              |  pfn  |  pfn  |  pfn  |  pfn  |  pfn  | ...
> + *              +-------+-------+-------+-------+-------+
> + *               <----->
> + *               8 byte
> + *
> + * We can use multiple flags among FINCORE_(PFN|PAGEFLAGS|PAGECACHE_TAGS).
> + * For example, when the mode is FINCORE_PFN|FINCORE_PAGEFLAGS, the per-page
> + * information is stored like this:
> + *
> + *    page offset 0    page offset 1   page offset 2
> + *   +-------+-------+-------+-------+-------+-------+
> + *   |  pfn  | flags |  pfn  | flags |  pfn  | flags | ...
> + *   +-------+-------+-------+-------+-------+-------+
> + *    <-------------> <-------------> <------------->
> + *       16 bytes        16 bytes        16 bytes
> + *
> + * When FINCORE_SKIP_HOLE is set, we ignore holes and add page offset entry
> + * (8 bytes) instead. For example, the data format of mode
> + * FINCORE_PFN|FINCORE_SKIP_HOLE is like follows:
> + *
> + *   +-------+-------+-------+-------+-------+-------+
> + *   | pgoff |  pfn  | pgoff |  pfn  | pgoff |  pfn  | ...
> + *   +-------+-------+-------+-------+-------+-------+
> + *    <-------------> <-------------> <------------->
> + *       16 bytes        16 bytes        16 bytes
> + */
> +#define FINCORE_BMAP		0x01	/* bytemap mode */
> +#define FINCORE_PFN		0x02
> +#define FINCORE_PAGE_FLAGS	0x04
> +#define FINCORE_PAGECACHE_TAGS	0x08
> +#define FINCORE_SKIP_HOLE	0x10

FINCORE_SKIP_HOLE is greater then FINCORE_PFN but pgoff precedes pfn in
records. It's confusing. We need clear definition of record format.

What about rename FINCORE_SKIP_HOLE -> FINCORE_PGOFF, move it before
FINCORE_PFN. So FINCORE_PGOFF is less than FINCORE_PFN, which is less than
FINCORE_PAGE_FLAGS, which is less than FINCORE_PAGECACHE_TAGS. It matches
order in records:

FINCORE_PGOFF|FINCORE_PFN|FINCORE_PAGEFLAGS|FINCORE_PAGECACHE_TAGS

 +-------+-------+-------+-------+-------+-------+-------+-------+
 | pgoff |  pfn  | flags |  tags | pgoff |  pfn  | flags |  tags | ...
 +-------+-------+-------+-------+-------+-------+-------+-------+
  <-----------------------------> <------------------------------>
             32 bytes                        32 bytes

> +
> +#define FINCORE_MODE_MASK	0x1f
> +#define FINCORE_LONGENTRY_MASK	(FINCORE_PFN | FINCORE_PAGE_FLAGS | \
> +				 FINCORE_PAGECACHE_TAGS | FINCORE_SKIP_HOLE)
> +
-- 
 Kirill A. Shutemov

  parent reply	other threads:[~2014-06-02 12:23 UTC|newest]

Thread overview: 58+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-05-21  2:26 [PATCH 0/4] pagecache scanning with /proc/kpagecache Naoya Horiguchi
2014-05-21  2:26 ` Naoya Horiguchi
2014-05-21  2:26 ` [PATCH 1/4] radix-tree: add end_index to support ranged iteration Naoya Horiguchi
2014-05-21  2:26   ` Naoya Horiguchi
2014-05-21  8:21   ` Konstantin Khlebnikov
2014-05-21  8:21     ` Konstantin Khlebnikov
2014-05-21 19:26     ` Naoya Horiguchi
2014-05-21  2:26 ` [PATCH 2/4] fs/proc/page.c: introduce /proc/kpagecache interface Naoya Horiguchi
2014-05-21  2:26   ` Naoya Horiguchi
2014-05-21  2:26 ` [PATCH 3/4] tools/vm/page-types.c: rework on file cache scanning mode Naoya Horiguchi
2014-05-21  2:26   ` Naoya Horiguchi
2014-05-21  2:26 ` [PATCH 4/4] Documentation: update Documentation/vm/pagemap.txt Naoya Horiguchi
2014-05-21  2:26   ` Naoya Horiguchi
2014-05-21 22:42 ` [PATCH 0/4] pagecache scanning with /proc/kpagecache Andrew Morton
2014-05-21 22:42   ` Andrew Morton
2014-05-22  2:19   ` Naoya Horiguchi
     [not found]   ` <537d5ee4.4914e00a.5672.ffff85d5SMTPIN_ADDED_BROKEN@mx.google.com>
2014-05-22  2:33     ` Andrew Morton
2014-05-22  2:33       ` Andrew Morton
2014-05-22  9:50       ` Konstantin Khlebnikov
2014-05-22  9:50         ` Konstantin Khlebnikov
2014-05-22 10:36         ` Kirill A. Shutemov
2014-05-22 10:36           ` Kirill A. Shutemov
2014-05-22 17:47           ` Naoya Horiguchi
2014-05-22 21:02             ` Naoya Horiguchi
2014-06-02  5:24       ` [RFC][PATCH 0/3] mm: introduce fincore() Naoya Horiguchi
2014-06-02  5:24         ` Naoya Horiguchi
2014-06-02  5:24         ` [PATCH 1/3] replace PAGECACHE_TAG_* definition with enumeration Naoya Horiguchi
2014-06-02  5:24           ` Naoya Horiguchi
2014-06-02 16:12           ` Dave Hansen
2014-06-02 16:12             ` Dave Hansen
2014-06-02 16:37             ` Naoya Horiguchi
     [not found]             ` <1401727052-f7v7kykv@n-horiguchi@ah.jp.nec.com>
2014-06-02 16:45               ` Dave Hansen
2014-06-02 16:45                 ` Dave Hansen
2014-06-02 17:14                 ` Naoya Horiguchi
2014-06-02 18:19                   ` Dave Hansen
2014-06-02 18:19                     ` Dave Hansen
2014-06-02 18:48                     ` Naoya Horiguchi
2014-06-02 21:16             ` Andrew Morton
2014-06-02 21:16               ` Andrew Morton
2014-06-02 21:51               ` Naoya Horiguchi
2014-06-02  5:24         ` [PATCH 2/3] mm: introduce fincore() Naoya Horiguchi
2014-06-02  5:24           ` Naoya Horiguchi
     [not found]           ` <1401686699-9723-3-git-send-email-n-horiguchi-PaJj6Psr51x8UrSeD/g0lQ@public.gmane.org>
2014-06-02  6:42             ` Christoph Hellwig
2014-06-02  6:42               ` Christoph Hellwig
2014-06-02  6:42               ` Christoph Hellwig
2014-06-02 14:19               ` Naoya Horiguchi
2014-06-02  7:06             ` Michael Kerrisk
2014-06-02  7:06               ` Michael Kerrisk
2014-06-02  7:06               ` Michael Kerrisk
2014-06-02 14:21               ` Naoya Horiguchi
2014-06-02 12:23           ` Kirill A. Shutemov [this message]
2014-06-02 12:23             ` Kirill A. Shutemov
2014-06-02 14:52             ` Naoya Horiguchi
2014-06-02 16:11           ` Dave Hansen
2014-06-02 16:11             ` Dave Hansen
2014-06-02 16:22             ` Naoya Horiguchi
2014-06-02  5:24         ` [PATCH 3/3] selftest: add test code for fincore() Naoya Horiguchi
2014-06-02  5:24           ` Naoya Horiguchi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20140602122322.GB8691@node.dhcp.inet.fi \
    --to=kirill@shutemov.name \
    --cc=acme@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=andres@2ndquadrant.com \
    --cc=bp@alien8.de \
    --cc=davem@davemloft.net \
    --cc=fengguang.wu@intel.com \
    --cc=hannes@cmpxchg.org \
    --cc=koct9i@gmail.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=n-horiguchi@ah.jp.nec.com \
    --cc=rusty@rustcorp.com.au \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.