linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Dave Hansen <dave.hansen@intel.com>
To: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>,
	Andrew Morton <akpm@linux-foundation.org>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>,
	Wu Fengguang <fengguang.wu@intel.com>,
	Arnaldo Carvalho de Melo <acme@redhat.com>,
	Borislav Petkov <bp@alien8.de>,
	"Kirill A. Shutemov" <kirill@shutemov.name>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Rusty Russell <rusty@rustcorp.com.au>,
	David Miller <davem@davemloft.net>,
	Andres Freund <andres@2ndquadrant.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Christoph Hellwig <hch@infradead.org>,
	Dave Chinner <david@fromorbit.com>,
	Michael Kerrisk <mtk.manpages@gmail.com>,
	Linux API <linux-api@vger.kernel.org>,
	Naoya Horiguchi <nao.horiguchi@gmail.com>,
	Kees Cook <kees@outflux.net>
Subject: Re: [PATCH v3 1/3] mm: introduce fincore()
Date: Mon, 07 Jul 2014 12:01:41 -0700	[thread overview]
Message-ID: <53BAEE95.50807@intel.com> (raw)
In-Reply-To: <1404756006-23794-2-git-send-email-n-horiguchi@ah.jp.nec.com>

> +/*
> + * You can control how the buffer in userspace is filled with this mode
> + * parameters:

I agree that we don't have any good mechanisms for looking at the page
cache from userspace.  I've hacked some things up using mincore() and
they weren't pretty, so I welcome _something_ like this.

But, is this trying to do too many things at once?  Do we have solid use
cases spelled out for each of these modes?  Have we thought out how they
will be used in practice?

The biggest question for me, though, is whether we want to start
designing these per-page interfaces to consider different page sizes, or
whether we're going to just continue to pretend that the entire world is
4k pages.  Using FINCORE_BMAP on 1GB hugetlbfs files would be a bit
silly, for instance.

> + * - FINCORE_BMAP:
> + *     the page status is returned in a vector of bytes.
> + *     The least significant bit of each byte is 1 if the referenced page
> + *     is in memory, otherwise it is zero.

I know this is consistent with mincore(), but it did always bother me
that mincore() was so sparse.  Seems like it is wasting 7/8 of its bits.

> + * - FINCORE_PGOFF:
> + *     if this flag is set, fincore() doesn't store any information about
> + *     holes. Instead each records per page has the entry of page offset,
> + *     using 8 bytes. This mode is useful if we handle a large file and
> + *     only few pages are on memory.

This bothers me a bit.  How would someone know how sparse file was
before calling this?  If it's not sparse, and they use this, they'll end
up using 8x the memory they would have using FINCORE_BMAP.  If it *is*
sparse, and they use FINCORE_BMAP, they will either waste tons of memory
on buffers, or have to make a ton of calls.

I guess this could also be used to do *searches*, which would let you
search out holes.  Let's say you have a 2TB file.  You could call this
with a buffer size of 1 entry and do searches, say 0->1TB.  If you get
your one entry back, you know it's not completely sparse.

But, that wouldn't work with it as-designed.  The length of the buffer
and the range of the file being checked are coupled together, so you
can't say:

	vec = malloc(sizeof(long));
	fincore(fd, 0, 1TB, FINCORE_PGOFF, vec, extra);

without overflowing vec.

Is it really right to say this is going to be 8 bytes?  Would we want it
to share types with something else, like be an loff_t?

> + * - FINCORE_PFN:
> + *     stores pfn, using 8 bytes.

These are all an unprivileged operations from what I can tell.  I know
we're going to a lot of trouble to hide kernel addresses from being seen
in userspace.  This seems like it would be undesirable for the folks
that care about not leaking kernel addresses, especially for
unprivileged users.

This would essentially tell userspace where in the kernel's address
space some user-controlled data will be.

> + * We can use multiple flags among the flags in FINCORE_LONGENTRY_MASK.
> + * For example, when the mode is FINCORE_PFN|FINCORE_PAGEFLAGS, the per-page
> + * information is stored like this:

Instead of specifying the ordering in the manpages alone, would it be
smarter to just say that the ordering of the items is dependent on the
ordering of the flags?  In other words if FINCORE_PFN <
FINCORE_PAGEFLAGS, then its field comes first?


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2014-07-07 19:02 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-07-07 18:00 [PATCH v3 0/3] mm: introduce fincore() v3 Naoya Horiguchi
2014-07-07 18:00 ` [PATCH v3 1/3] mm: introduce fincore() Naoya Horiguchi
2014-07-07 19:01   ` Dave Hansen [this message]
2014-07-07 20:21     ` Naoya Horiguchi
2014-07-07 20:43       ` Dave Hansen
2014-07-07 21:48         ` Naoya Horiguchi
2014-07-07 22:44           ` Dave Hansen
2014-07-08 15:35             ` Naoya Horiguchi
2014-07-08 19:03     ` Naoya Horiguchi
2014-07-08 19:42       ` Dave Hansen
2014-07-08 20:41         ` Naoya Horiguchi
2014-07-08 22:32           ` Dave Hansen
2014-07-11 16:53             ` Naoya Horiguchi
2014-07-07 18:00 ` [PATCH v3 2/3] selftests/fincore: add test code for fincore() Naoya Horiguchi
2014-07-07 18:00 ` [PATCH v3 3/3] man2/fincore.2: document general description about fincore(2) Naoya Horiguchi
2014-07-07 19:08   ` Dave Hansen
2014-07-07 20:59     ` Naoya Horiguchi
2014-07-07 22:34       ` Dave Hansen
2014-07-08 15:43         ` Naoya Horiguchi
2014-07-08 12:16 ` [PATCH v3 0/3] mm: introduce fincore() v3 Christoph Hellwig
2014-07-08 13:27   ` Naoya Horiguchi
2014-07-09  8:51     ` Christoph Hellwig

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=53BAEE95.50807@intel.com \
    --to=dave.hansen@intel.com \
    --cc=acme@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=andres@2ndquadrant.com \
    --cc=bp@alien8.de \
    --cc=davem@davemloft.net \
    --cc=david@fromorbit.com \
    --cc=fengguang.wu@intel.com \
    --cc=hannes@cmpxchg.org \
    --cc=hch@infradead.org \
    --cc=kees@outflux.net \
    --cc=kirill@shutemov.name \
    --cc=koct9i@gmail.com \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mtk.manpages@gmail.com \
    --cc=n-horiguchi@ah.jp.nec.com \
    --cc=nao.horiguchi@gmail.com \
    --cc=rusty@rustcorp.com.au \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).