All of lore.kernel.org
 help / color / mirror / Atom feed
From: Vladimir Davydov <vdavydov-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
To: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
Cc: Andres Lagar-Cavilla
	<andreslc-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>,
	Minchan Kim <minchan-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>,
	Raghavendra K T
	<raghavendra.kt-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>,
	Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>,
	Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>,
	Greg Thelen <gthelen-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>,
	Michel Lespinasse
	<walken-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>,
	David Rientjes <rientjes-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>,
	Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>,
	Cyrill Gorcunov
	<gorcunov-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>,
	Jonathan Corbet <corbet-T1hC0tSOHrs@public.gmane.org>,
	linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-doc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org,
	cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Subject: Re: [PATCH -mm v9 6/8] proc: add kpageidle file
Date: Wed, 22 Jul 2015 18:20:29 +0300	[thread overview]
Message-ID: <20150722152029.GL23374@esperanza> (raw)
In-Reply-To: <20150721163452.c1e4075a2b193bcd325fad56-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>

On Tue, Jul 21, 2015 at 04:34:52PM -0700, Andrew Morton wrote:
> On Sun, 19 Jul 2015 15:31:15 +0300 Vladimir Davydov <vdavydov-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> wrote:
> 
> > Knowing the portion of memory that is not used by a certain application
> > or memory cgroup (idle memory) can be useful for partitioning the system
> > efficiently, e.g. by setting memory cgroup limits appropriately.
> > Currently, the only means to estimate the amount of idle memory provided
> > by the kernel is /proc/PID/{clear_refs,smaps}: the user can clear the
> > access bit for all pages mapped to a particular process by writing 1 to
> > clear_refs, wait for some time, and then count smaps:Referenced.
> > However, this method has two serious shortcomings:
> > 
> >  - it does not count unmapped file pages
> >  - it affects the reclaimer logic
> > 
> > To overcome these drawbacks, this patch introduces two new page flags,
> > Idle and Young, and a new proc file, /proc/kpageidle. A page's Idle flag
> > can only be set from userspace by setting bit in /proc/kpageidle at the
> > offset corresponding to the page, and it is cleared whenever the page is
> > accessed either through page tables (it is cleared in page_referenced()
> > in this case) or using the read(2) system call (mark_page_accessed()).
> > Thus by setting the Idle flag for pages of a particular workload, which
> > can be found e.g. by reading /proc/PID/pagemap, waiting for some time to
> > let the workload access its working set, and then reading the kpageidle
> > file, one can estimate the amount of pages that are not used by the
> > workload.
> > 
> > The Young page flag is used to avoid interference with the memory
> > reclaimer. A page's Young flag is set whenever the Access bit of a page
> > table entry pointing to the page is cleared by writing to kpageidle. If
> > page_referenced() is called on a Young page, it will add 1 to its return
> > value, therefore concealing the fact that the Access bit was cleared.
> > 
> > Note, since there is no room for extra page flags on 32 bit, this
> > feature uses extended page flags when compiled on 32 bit.
> > 
> > ...
> >
> >
> > ...
> >
> > +static void kpageidle_clear_pte_refs(struct page *page)
> > +{
> > +	struct rmap_walk_control rwc = {
> > +		.rmap_one = kpageidle_clear_pte_refs_one,
> > +		.anon_lock = page_lock_anon_vma_read,
> > +	};
> 
> I think this can be static const, since `arg' is unused?  That would
> save some cycles and stack.

Good catch, thanks.

> 
> > +	bool need_lock;
> > +
> > +	if (!page_mapped(page) ||
> > +	    !page_rmapping(page))
> > +		return;
> > +
> > +	need_lock = !PageAnon(page) || PageKsm(page);
> > +	if (need_lock && !trylock_page(page))
> 
> Oh.  So the feature is a bit unreliable.
> 
> I'm not immediately seeing anything which would prevent us from using
> plain old lock_page() here.  What's going on?

A page may be locked for quite a long period of time, e.g.
truncate_inode_pages_range() may wait until a page writeback finishes
under the page lock. Instead of stalling kpageidle scan, we'd better
move on to the next page. Of course, the result won't be 100% accurate.
In fact, it isn't accurate anyway, because we skip isolated pages,
neither can it possibly be 100% accurate, because the scan itself is not
instant so that while we are performing it the system usage pattern
might change. This new API is only supposed to give a good estimate of
memory usage pattern, which could be used as a hint for adjusting the
system configuration to improve performance.

> 
> > +		return;
> > +
> > +	rmap_walk(page, &rwc);
> > +
> > +	if (need_lock)
> > +		unlock_page(page);
> > +}
> > +
> > +static ssize_t kpageidle_read(struct file *file, char __user *buf,
> > +			      size_t count, loff_t *ppos)
> > +{
> > +	u64 __user *out = (u64 __user *)buf;
> > +	struct page *page;
> > +	unsigned long pfn, end_pfn;
> > +	ssize_t ret = 0;
> > +	u64 idle_bitmap = 0;
> > +	int bit;
> > +
> > +	if (*ppos & KPMMASK || count & KPMMASK)
> > +		return -EINVAL;
> 
> Interface requires 8-byte aligned offset and size.
> 
> > +	pfn = *ppos * BITS_PER_BYTE;
> > +	if (pfn >= max_pfn)
> > +		return 0;
> > +
> > +	end_pfn = pfn + count * BITS_PER_BYTE;
> > +	if (end_pfn > max_pfn)
> > +		end_pfn = ALIGN(max_pfn, KPMBITS);
> 
> So we lose up to 63 pages.  Presumably max_pfn is well enough aligned
> for this to not matter, dunno.

ALIGN(x, a) resolves to ((x + a - 1) & ~(a - 1)), which is >= x, so we
shouldn't loose anything.

> 
> > +	for (; pfn < end_pfn; pfn++) {
> > +		bit = pfn % KPMBITS;
> > +		page = kpageidle_get_page(pfn);
> > +		if (page) {
> > +			if (page_is_idle(page)) {
> > +				/*
> > +				 * The page might have been referenced via a
> > +				 * pte, in which case it is not idle. Clear
> > +				 * refs and recheck.
> > +				 */
> > +				kpageidle_clear_pte_refs(page);
> > +				if (page_is_idle(page))
> > +					idle_bitmap |= 1ULL << bit;
> 
> I don't understand what's going on here.  More details, please?

The output is a bitmap, which is stored as an array of 8-byte elements,
where byte order within each word is native, i.e. if page at pfn #i is
idle we need to set bit #i%64 of element #i/64 of the array. I'll
reflect this in the documentation.

> 
> > +			}
> > +			put_page(page);
> > +		}
> > +		if (bit == KPMBITS - 1) {
> > +			if (put_user(idle_bitmap, out)) {
> > +				ret = -EFAULT;
> > +				break;
> > +			}
> > +			idle_bitmap = 0;
> > +			out++;
> > +		}
> > +	}
> > +
> > +	*ppos += (char __user *)out - buf;
> > +	if (!ret)
> > +		ret = (char __user *)out - buf;
> > +	return ret;
> > +}
> > +
> > +static ssize_t kpageidle_write(struct file *file, const char __user *buf,
> > +			       size_t count, loff_t *ppos)
> > +{
> > +	const u64 __user *in = (const u64 __user *)buf;
> > +	struct page *page;
> > +	unsigned long pfn, end_pfn;
> > +	ssize_t ret = 0;
> > +	u64 idle_bitmap = 0;
> > +	int bit;
> > +
> > +	if (*ppos & KPMMASK || count & KPMMASK)
> > +		return -EINVAL;
> > +
> > +	pfn = *ppos * BITS_PER_BYTE;
> > +	if (pfn >= max_pfn)
> > +		return -ENXIO;
> > +
> > +	end_pfn = pfn + count * BITS_PER_BYTE;
> > +	if (end_pfn > max_pfn)
> > +		end_pfn = ALIGN(max_pfn, KPMBITS);
> > +
> > +	for (; pfn < end_pfn; pfn++) {
> > +		bit = pfn % KPMBITS;
> > +		if (bit == 0) {
> > +			if (get_user(idle_bitmap, in)) {
> > +				ret = -EFAULT;
> > +				break;
> > +			}
> > +			in++;
> > +		}
> > +		if (idle_bitmap >> bit & 1) {
> 
> Hate it when I have to go look up a C precedence table.  This is
> 
> 		if ((idle_bitmap >> bit) & 1) {

Fixed.

Here goes the incremental patch with all the fixes:
---
diff --git a/fs/proc/page.c b/fs/proc/page.c
index 7ff7cba8617b..9daa6e92450f 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -362,7 +362,11 @@ static int kpageidle_clear_pte_refs_one(struct page *page,
 
 static void kpageidle_clear_pte_refs(struct page *page)
 {
-	struct rmap_walk_control rwc = {
+	/*
+	 * Since rwc.arg is unused, rwc is effectively immutable, so we
+	 * can make it static const to save some cycles and stack.
+	 */
+	static const struct rmap_walk_control rwc = {
 		.rmap_one = kpageidle_clear_pte_refs_one,
 		.anon_lock = page_lock_anon_vma_read,
 	};
@@ -376,7 +380,7 @@ static void kpageidle_clear_pte_refs(struct page *page)
 	if (need_lock && !trylock_page(page))
 		return;
 
-	rmap_walk(page, &rwc);
+	rmap_walk(page, (struct rmap_walk_control *)&rwc);
 
 	if (need_lock)
 		unlock_page(page);
@@ -466,7 +470,7 @@ static ssize_t kpageidle_write(struct file *file, const char __user *buf,
 			}
 			in++;
 		}
-		if (idle_bitmap >> bit & 1) {
+		if ((idle_bitmap >> bit) & 1) {
 			page = kpageidle_get_page(pfn);
 			if (page) {
 				kpageidle_clear_pte_refs(page);

WARNING: multiple messages have this Message-ID (diff)
From: Vladimir Davydov <vdavydov@parallels.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Andres Lagar-Cavilla <andreslc@google.com>,
	Minchan Kim <minchan@kernel.org>,
	Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Michal Hocko <mhocko@suse.cz>, Greg Thelen <gthelen@google.com>,
	Michel Lespinasse <walken@google.com>,
	David Rientjes <rientjes@google.com>,
	Pavel Emelyanov <xemul@parallels.com>,
	Cyrill Gorcunov <gorcunov@openvz.org>,
	Jonathan Corbet <corbet@lwn.net>,
	linux-api@vger.kernel.org, linux-doc@vger.kernel.org,
	linux-mm@kvack.org, cgroups@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH -mm v9 6/8] proc: add kpageidle file
Date: Wed, 22 Jul 2015 18:20:29 +0300	[thread overview]
Message-ID: <20150722152029.GL23374@esperanza> (raw)
In-Reply-To: <20150721163452.c1e4075a2b193bcd325fad56@linux-foundation.org>

On Tue, Jul 21, 2015 at 04:34:52PM -0700, Andrew Morton wrote:
> On Sun, 19 Jul 2015 15:31:15 +0300 Vladimir Davydov <vdavydov@parallels.com> wrote:
> 
> > Knowing the portion of memory that is not used by a certain application
> > or memory cgroup (idle memory) can be useful for partitioning the system
> > efficiently, e.g. by setting memory cgroup limits appropriately.
> > Currently, the only means to estimate the amount of idle memory provided
> > by the kernel is /proc/PID/{clear_refs,smaps}: the user can clear the
> > access bit for all pages mapped to a particular process by writing 1 to
> > clear_refs, wait for some time, and then count smaps:Referenced.
> > However, this method has two serious shortcomings:
> > 
> >  - it does not count unmapped file pages
> >  - it affects the reclaimer logic
> > 
> > To overcome these drawbacks, this patch introduces two new page flags,
> > Idle and Young, and a new proc file, /proc/kpageidle. A page's Idle flag
> > can only be set from userspace by setting bit in /proc/kpageidle at the
> > offset corresponding to the page, and it is cleared whenever the page is
> > accessed either through page tables (it is cleared in page_referenced()
> > in this case) or using the read(2) system call (mark_page_accessed()).
> > Thus by setting the Idle flag for pages of a particular workload, which
> > can be found e.g. by reading /proc/PID/pagemap, waiting for some time to
> > let the workload access its working set, and then reading the kpageidle
> > file, one can estimate the amount of pages that are not used by the
> > workload.
> > 
> > The Young page flag is used to avoid interference with the memory
> > reclaimer. A page's Young flag is set whenever the Access bit of a page
> > table entry pointing to the page is cleared by writing to kpageidle. If
> > page_referenced() is called on a Young page, it will add 1 to its return
> > value, therefore concealing the fact that the Access bit was cleared.
> > 
> > Note, since there is no room for extra page flags on 32 bit, this
> > feature uses extended page flags when compiled on 32 bit.
> > 
> > ...
> >
> >
> > ...
> >
> > +static void kpageidle_clear_pte_refs(struct page *page)
> > +{
> > +	struct rmap_walk_control rwc = {
> > +		.rmap_one = kpageidle_clear_pte_refs_one,
> > +		.anon_lock = page_lock_anon_vma_read,
> > +	};
> 
> I think this can be static const, since `arg' is unused?  That would
> save some cycles and stack.

Good catch, thanks.

> 
> > +	bool need_lock;
> > +
> > +	if (!page_mapped(page) ||
> > +	    !page_rmapping(page))
> > +		return;
> > +
> > +	need_lock = !PageAnon(page) || PageKsm(page);
> > +	if (need_lock && !trylock_page(page))
> 
> Oh.  So the feature is a bit unreliable.
> 
> I'm not immediately seeing anything which would prevent us from using
> plain old lock_page() here.  What's going on?

A page may be locked for quite a long period of time, e.g.
truncate_inode_pages_range() may wait until a page writeback finishes
under the page lock. Instead of stalling kpageidle scan, we'd better
move on to the next page. Of course, the result won't be 100% accurate.
In fact, it isn't accurate anyway, because we skip isolated pages,
neither can it possibly be 100% accurate, because the scan itself is not
instant so that while we are performing it the system usage pattern
might change. This new API is only supposed to give a good estimate of
memory usage pattern, which could be used as a hint for adjusting the
system configuration to improve performance.

> 
> > +		return;
> > +
> > +	rmap_walk(page, &rwc);
> > +
> > +	if (need_lock)
> > +		unlock_page(page);
> > +}
> > +
> > +static ssize_t kpageidle_read(struct file *file, char __user *buf,
> > +			      size_t count, loff_t *ppos)
> > +{
> > +	u64 __user *out = (u64 __user *)buf;
> > +	struct page *page;
> > +	unsigned long pfn, end_pfn;
> > +	ssize_t ret = 0;
> > +	u64 idle_bitmap = 0;
> > +	int bit;
> > +
> > +	if (*ppos & KPMMASK || count & KPMMASK)
> > +		return -EINVAL;
> 
> Interface requires 8-byte aligned offset and size.
> 
> > +	pfn = *ppos * BITS_PER_BYTE;
> > +	if (pfn >= max_pfn)
> > +		return 0;
> > +
> > +	end_pfn = pfn + count * BITS_PER_BYTE;
> > +	if (end_pfn > max_pfn)
> > +		end_pfn = ALIGN(max_pfn, KPMBITS);
> 
> So we lose up to 63 pages.  Presumably max_pfn is well enough aligned
> for this to not matter, dunno.

ALIGN(x, a) resolves to ((x + a - 1) & ~(a - 1)), which is >= x, so we
shouldn't loose anything.

> 
> > +	for (; pfn < end_pfn; pfn++) {
> > +		bit = pfn % KPMBITS;
> > +		page = kpageidle_get_page(pfn);
> > +		if (page) {
> > +			if (page_is_idle(page)) {
> > +				/*
> > +				 * The page might have been referenced via a
> > +				 * pte, in which case it is not idle. Clear
> > +				 * refs and recheck.
> > +				 */
> > +				kpageidle_clear_pte_refs(page);
> > +				if (page_is_idle(page))
> > +					idle_bitmap |= 1ULL << bit;
> 
> I don't understand what's going on here.  More details, please?

The output is a bitmap, which is stored as an array of 8-byte elements,
where byte order within each word is native, i.e. if page at pfn #i is
idle we need to set bit #i%64 of element #i/64 of the array. I'll
reflect this in the documentation.

> 
> > +			}
> > +			put_page(page);
> > +		}
> > +		if (bit == KPMBITS - 1) {
> > +			if (put_user(idle_bitmap, out)) {
> > +				ret = -EFAULT;
> > +				break;
> > +			}
> > +			idle_bitmap = 0;
> > +			out++;
> > +		}
> > +	}
> > +
> > +	*ppos += (char __user *)out - buf;
> > +	if (!ret)
> > +		ret = (char __user *)out - buf;
> > +	return ret;
> > +}
> > +
> > +static ssize_t kpageidle_write(struct file *file, const char __user *buf,
> > +			       size_t count, loff_t *ppos)
> > +{
> > +	const u64 __user *in = (const u64 __user *)buf;
> > +	struct page *page;
> > +	unsigned long pfn, end_pfn;
> > +	ssize_t ret = 0;
> > +	u64 idle_bitmap = 0;
> > +	int bit;
> > +
> > +	if (*ppos & KPMMASK || count & KPMMASK)
> > +		return -EINVAL;
> > +
> > +	pfn = *ppos * BITS_PER_BYTE;
> > +	if (pfn >= max_pfn)
> > +		return -ENXIO;
> > +
> > +	end_pfn = pfn + count * BITS_PER_BYTE;
> > +	if (end_pfn > max_pfn)
> > +		end_pfn = ALIGN(max_pfn, KPMBITS);
> > +
> > +	for (; pfn < end_pfn; pfn++) {
> > +		bit = pfn % KPMBITS;
> > +		if (bit == 0) {
> > +			if (get_user(idle_bitmap, in)) {
> > +				ret = -EFAULT;
> > +				break;
> > +			}
> > +			in++;
> > +		}
> > +		if (idle_bitmap >> bit & 1) {
> 
> Hate it when I have to go look up a C precedence table.  This is
> 
> 		if ((idle_bitmap >> bit) & 1) {

Fixed.

Here goes the incremental patch with all the fixes:
---
diff --git a/fs/proc/page.c b/fs/proc/page.c
index 7ff7cba8617b..9daa6e92450f 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -362,7 +362,11 @@ static int kpageidle_clear_pte_refs_one(struct page *page,
 
 static void kpageidle_clear_pte_refs(struct page *page)
 {
-	struct rmap_walk_control rwc = {
+	/*
+	 * Since rwc.arg is unused, rwc is effectively immutable, so we
+	 * can make it static const to save some cycles and stack.
+	 */
+	static const struct rmap_walk_control rwc = {
 		.rmap_one = kpageidle_clear_pte_refs_one,
 		.anon_lock = page_lock_anon_vma_read,
 	};
@@ -376,7 +380,7 @@ static void kpageidle_clear_pte_refs(struct page *page)
 	if (need_lock && !trylock_page(page))
 		return;
 
-	rmap_walk(page, &rwc);
+	rmap_walk(page, (struct rmap_walk_control *)&rwc);
 
 	if (need_lock)
 		unlock_page(page);
@@ -466,7 +470,7 @@ static ssize_t kpageidle_write(struct file *file, const char __user *buf,
 			}
 			in++;
 		}
-		if (idle_bitmap >> bit & 1) {
+		if ((idle_bitmap >> bit) & 1) {
 			page = kpageidle_get_page(pfn);
 			if (page) {
 				kpageidle_clear_pte_refs(page);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

WARNING: multiple messages have this Message-ID (diff)
From: Vladimir Davydov <vdavydov@parallels.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Andres Lagar-Cavilla <andreslc@google.com>,
	Minchan Kim <minchan@kernel.org>,
	Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Michal Hocko <mhocko@suse.cz>, "Greg Thelen" <gthelen@google.com>,
	Michel Lespinasse <walken@google.com>,
	"David Rientjes" <rientjes@google.com>,
	Pavel Emelyanov <xemul@parallels.com>,
	Cyrill Gorcunov <gorcunov@openvz.org>,
	Jonathan Corbet <corbet@lwn.net>, <linux-api@vger.kernel.org>,
	<linux-doc@vger.kernel.org>, <linux-mm@kvack.org>,
	<cgroups@vger.kernel.org>, <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH -mm v9 6/8] proc: add kpageidle file
Date: Wed, 22 Jul 2015 18:20:29 +0300	[thread overview]
Message-ID: <20150722152029.GL23374@esperanza> (raw)
In-Reply-To: <20150721163452.c1e4075a2b193bcd325fad56@linux-foundation.org>

On Tue, Jul 21, 2015 at 04:34:52PM -0700, Andrew Morton wrote:
> On Sun, 19 Jul 2015 15:31:15 +0300 Vladimir Davydov <vdavydov@parallels.com> wrote:
> 
> > Knowing the portion of memory that is not used by a certain application
> > or memory cgroup (idle memory) can be useful for partitioning the system
> > efficiently, e.g. by setting memory cgroup limits appropriately.
> > Currently, the only means to estimate the amount of idle memory provided
> > by the kernel is /proc/PID/{clear_refs,smaps}: the user can clear the
> > access bit for all pages mapped to a particular process by writing 1 to
> > clear_refs, wait for some time, and then count smaps:Referenced.
> > However, this method has two serious shortcomings:
> > 
> >  - it does not count unmapped file pages
> >  - it affects the reclaimer logic
> > 
> > To overcome these drawbacks, this patch introduces two new page flags,
> > Idle and Young, and a new proc file, /proc/kpageidle. A page's Idle flag
> > can only be set from userspace by setting bit in /proc/kpageidle at the
> > offset corresponding to the page, and it is cleared whenever the page is
> > accessed either through page tables (it is cleared in page_referenced()
> > in this case) or using the read(2) system call (mark_page_accessed()).
> > Thus by setting the Idle flag for pages of a particular workload, which
> > can be found e.g. by reading /proc/PID/pagemap, waiting for some time to
> > let the workload access its working set, and then reading the kpageidle
> > file, one can estimate the amount of pages that are not used by the
> > workload.
> > 
> > The Young page flag is used to avoid interference with the memory
> > reclaimer. A page's Young flag is set whenever the Access bit of a page
> > table entry pointing to the page is cleared by writing to kpageidle. If
> > page_referenced() is called on a Young page, it will add 1 to its return
> > value, therefore concealing the fact that the Access bit was cleared.
> > 
> > Note, since there is no room for extra page flags on 32 bit, this
> > feature uses extended page flags when compiled on 32 bit.
> > 
> > ...
> >
> >
> > ...
> >
> > +static void kpageidle_clear_pte_refs(struct page *page)
> > +{
> > +	struct rmap_walk_control rwc = {
> > +		.rmap_one = kpageidle_clear_pte_refs_one,
> > +		.anon_lock = page_lock_anon_vma_read,
> > +	};
> 
> I think this can be static const, since `arg' is unused?  That would
> save some cycles and stack.

Good catch, thanks.

> 
> > +	bool need_lock;
> > +
> > +	if (!page_mapped(page) ||
> > +	    !page_rmapping(page))
> > +		return;
> > +
> > +	need_lock = !PageAnon(page) || PageKsm(page);
> > +	if (need_lock && !trylock_page(page))
> 
> Oh.  So the feature is a bit unreliable.
> 
> I'm not immediately seeing anything which would prevent us from using
> plain old lock_page() here.  What's going on?

A page may be locked for quite a long period of time, e.g.
truncate_inode_pages_range() may wait until a page writeback finishes
under the page lock. Instead of stalling kpageidle scan, we'd better
move on to the next page. Of course, the result won't be 100% accurate.
In fact, it isn't accurate anyway, because we skip isolated pages,
neither can it possibly be 100% accurate, because the scan itself is not
instant so that while we are performing it the system usage pattern
might change. This new API is only supposed to give a good estimate of
memory usage pattern, which could be used as a hint for adjusting the
system configuration to improve performance.

> 
> > +		return;
> > +
> > +	rmap_walk(page, &rwc);
> > +
> > +	if (need_lock)
> > +		unlock_page(page);
> > +}
> > +
> > +static ssize_t kpageidle_read(struct file *file, char __user *buf,
> > +			      size_t count, loff_t *ppos)
> > +{
> > +	u64 __user *out = (u64 __user *)buf;
> > +	struct page *page;
> > +	unsigned long pfn, end_pfn;
> > +	ssize_t ret = 0;
> > +	u64 idle_bitmap = 0;
> > +	int bit;
> > +
> > +	if (*ppos & KPMMASK || count & KPMMASK)
> > +		return -EINVAL;
> 
> Interface requires 8-byte aligned offset and size.
> 
> > +	pfn = *ppos * BITS_PER_BYTE;
> > +	if (pfn >= max_pfn)
> > +		return 0;
> > +
> > +	end_pfn = pfn + count * BITS_PER_BYTE;
> > +	if (end_pfn > max_pfn)
> > +		end_pfn = ALIGN(max_pfn, KPMBITS);
> 
> So we lose up to 63 pages.  Presumably max_pfn is well enough aligned
> for this to not matter, dunno.

ALIGN(x, a) resolves to ((x + a - 1) & ~(a - 1)), which is >= x, so we
shouldn't loose anything.

> 
> > +	for (; pfn < end_pfn; pfn++) {
> > +		bit = pfn % KPMBITS;
> > +		page = kpageidle_get_page(pfn);
> > +		if (page) {
> > +			if (page_is_idle(page)) {
> > +				/*
> > +				 * The page might have been referenced via a
> > +				 * pte, in which case it is not idle. Clear
> > +				 * refs and recheck.
> > +				 */
> > +				kpageidle_clear_pte_refs(page);
> > +				if (page_is_idle(page))
> > +					idle_bitmap |= 1ULL << bit;
> 
> I don't understand what's going on here.  More details, please?

The output is a bitmap, which is stored as an array of 8-byte elements,
where byte order within each word is native, i.e. if page at pfn #i is
idle we need to set bit #i%64 of element #i/64 of the array. I'll
reflect this in the documentation.

> 
> > +			}
> > +			put_page(page);
> > +		}
> > +		if (bit == KPMBITS - 1) {
> > +			if (put_user(idle_bitmap, out)) {
> > +				ret = -EFAULT;
> > +				break;
> > +			}
> > +			idle_bitmap = 0;
> > +			out++;
> > +		}
> > +	}
> > +
> > +	*ppos += (char __user *)out - buf;
> > +	if (!ret)
> > +		ret = (char __user *)out - buf;
> > +	return ret;
> > +}
> > +
> > +static ssize_t kpageidle_write(struct file *file, const char __user *buf,
> > +			       size_t count, loff_t *ppos)
> > +{
> > +	const u64 __user *in = (const u64 __user *)buf;
> > +	struct page *page;
> > +	unsigned long pfn, end_pfn;
> > +	ssize_t ret = 0;
> > +	u64 idle_bitmap = 0;
> > +	int bit;
> > +
> > +	if (*ppos & KPMMASK || count & KPMMASK)
> > +		return -EINVAL;
> > +
> > +	pfn = *ppos * BITS_PER_BYTE;
> > +	if (pfn >= max_pfn)
> > +		return -ENXIO;
> > +
> > +	end_pfn = pfn + count * BITS_PER_BYTE;
> > +	if (end_pfn > max_pfn)
> > +		end_pfn = ALIGN(max_pfn, KPMBITS);
> > +
> > +	for (; pfn < end_pfn; pfn++) {
> > +		bit = pfn % KPMBITS;
> > +		if (bit == 0) {
> > +			if (get_user(idle_bitmap, in)) {
> > +				ret = -EFAULT;
> > +				break;
> > +			}
> > +			in++;
> > +		}
> > +		if (idle_bitmap >> bit & 1) {
> 
> Hate it when I have to go look up a C precedence table.  This is
> 
> 		if ((idle_bitmap >> bit) & 1) {

Fixed.

Here goes the incremental patch with all the fixes:
---
diff --git a/fs/proc/page.c b/fs/proc/page.c
index 7ff7cba8617b..9daa6e92450f 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -362,7 +362,11 @@ static int kpageidle_clear_pte_refs_one(struct page *page,
 
 static void kpageidle_clear_pte_refs(struct page *page)
 {
-	struct rmap_walk_control rwc = {
+	/*
+	 * Since rwc.arg is unused, rwc is effectively immutable, so we
+	 * can make it static const to save some cycles and stack.
+	 */
+	static const struct rmap_walk_control rwc = {
 		.rmap_one = kpageidle_clear_pte_refs_one,
 		.anon_lock = page_lock_anon_vma_read,
 	};
@@ -376,7 +380,7 @@ static void kpageidle_clear_pte_refs(struct page *page)
 	if (need_lock && !trylock_page(page))
 		return;
 
-	rmap_walk(page, &rwc);
+	rmap_walk(page, (struct rmap_walk_control *)&rwc);
 
 	if (need_lock)
 		unlock_page(page);
@@ -466,7 +470,7 @@ static ssize_t kpageidle_write(struct file *file, const char __user *buf,
 			}
 			in++;
 		}
-		if (idle_bitmap >> bit & 1) {
+		if ((idle_bitmap >> bit) & 1) {
 			page = kpageidle_get_page(pfn);
 			if (page) {
 				kpageidle_clear_pte_refs(page);

  parent reply	other threads:[~2015-07-22 15:20 UTC|newest]

Thread overview: 135+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-07-19 12:31 [PATCH -mm v9 0/8] idle memory tracking Vladimir Davydov
2015-07-19 12:31 ` Vladimir Davydov
2015-07-19 12:31 ` Vladimir Davydov
2015-07-19 12:31 ` [PATCH -mm v9 1/8] memcg: add page_cgroup_ino helper Vladimir Davydov
2015-07-19 12:31   ` Vladimir Davydov
2015-07-21 23:34   ` Andrew Morton
2015-07-21 23:34     ` Andrew Morton
2015-07-21 23:34     ` Andrew Morton
2015-07-22  9:21     ` Vladimir Davydov
2015-07-22  9:21       ` Vladimir Davydov
2015-07-22  9:21       ` Vladimir Davydov
2015-07-19 12:31 ` [PATCH -mm v9 2/8] hwpoison: use page_cgroup_ino for filtering by memcg Vladimir Davydov
2015-07-19 12:31   ` Vladimir Davydov
2015-07-21 23:34   ` Andrew Morton
2015-07-21 23:34     ` Andrew Morton
     [not found]     ` <20150721163412.1b44e77f5ac3b742734d1ce6-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
2015-07-22  9:45       ` Vladimir Davydov
2015-07-22  9:45         ` Vladimir Davydov
2015-07-22  9:45         ` Vladimir Davydov
2015-07-19 12:31 ` [PATCH -mm v9 3/8] memcg: zap try_get_mem_cgroup_from_page Vladimir Davydov
2015-07-19 12:31   ` Vladimir Davydov
2015-07-19 12:31 ` [PATCH -mm v9 4/8] proc: add kpagecgroup file Vladimir Davydov
2015-07-19 12:31   ` Vladimir Davydov
2015-07-21 23:34   ` Andrew Morton
2015-07-21 23:34     ` Andrew Morton
     [not found]     ` <20150721163433.618855e1f61536a09dfac30b-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
2015-07-22 10:33       ` Vladimir Davydov
2015-07-22 10:33         ` Vladimir Davydov
2015-07-22 10:33         ` Vladimir Davydov
2015-07-19 12:31 ` [PATCH -mm v9 5/8] mmu-notifier: add clear_young callback Vladimir Davydov
2015-07-19 12:31   ` Vladimir Davydov
2015-07-20 18:34   ` Andres Lagar-Cavilla
2015-07-21  8:51     ` Vladimir Davydov
2015-07-21  8:51       ` Vladimir Davydov
2015-07-22 16:33       ` Vladimir Davydov
2015-07-22 16:33         ` Vladimir Davydov
2015-07-19 12:31 ` [PATCH -mm v9 6/8] proc: add kpageidle file Vladimir Davydov
2015-07-19 12:31   ` Vladimir Davydov
2015-07-21 23:34   ` Andrew Morton
2015-07-21 23:34     ` Andrew Morton
     [not found]     ` <20150721163452.c1e4075a2b193bcd325fad56-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
2015-07-22 15:20       ` Vladimir Davydov [this message]
2015-07-22 15:20         ` Vladimir Davydov
2015-07-22 15:20         ` Vladimir Davydov
     [not found]   ` <d7a78b72053cf529c0c9ff6cbc02ffbb3d58fe35.1437303956.git.vdavydov-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2015-07-24 14:08     ` Paul Gortmaker
2015-07-24 14:08       ` Paul Gortmaker
2015-07-24 14:08       ` Paul Gortmaker
     [not found]       ` <CAP=VYLqiNfQJ6oyQg2GszeHwdOmeY_uD3XPvw=++weJOKdx4_g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-07-24 14:17         ` Vladimir Davydov
2015-07-24 14:17           ` Vladimir Davydov
2015-07-24 14:17           ` Vladimir Davydov
2015-07-19 12:31 ` [PATCH -mm v9 7/8] proc: export idle flag via kpageflags Vladimir Davydov
2015-07-19 12:31   ` Vladimir Davydov
2015-07-21 23:35   ` Andrew Morton
2015-07-21 23:35     ` Andrew Morton
     [not found]     ` <20150721163500.528bd39bbbc71abc3c8d429b-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
2015-07-22 16:25       ` Vladimir Davydov
2015-07-22 16:25         ` Vladimir Davydov
2015-07-22 16:25         ` Vladimir Davydov
2015-07-22 19:44         ` Andrew Morton
2015-07-22 19:44           ` Andrew Morton
2015-07-22 19:44           ` Andrew Morton
2015-07-22 20:46           ` Andres Lagar-Cavilla
2015-07-23  7:57             ` Vladimir Davydov
2015-07-23  7:57               ` Vladimir Davydov
2015-07-23  7:57               ` Vladimir Davydov
2015-07-19 12:31 ` [PATCH -mm v9 8/8] proc: add cond_resched to /proc/kpage* read/write loop Vladimir Davydov
2015-07-19 12:31   ` Vladimir Davydov
     [not found] ` <cover.1437303956.git.vdavydov-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2015-07-19 12:37   ` [PATCH -mm v9 0/8] idle memory tracking Vladimir Davydov
2015-07-19 12:37     ` Vladimir Davydov
2015-07-19 12:37     ` Vladimir Davydov
2015-07-21 21:39 ` Andres Lagar-Cavilla
2015-07-21 23:34 ` Andrew Morton
2015-07-21 23:34   ` Andrew Morton
2015-07-22 16:23   ` Vladimir Davydov
2015-07-22 16:23     ` Vladimir Davydov
2015-07-22 16:23     ` Vladimir Davydov
2015-07-25 16:24     ` Vladimir Davydov
2015-07-25 16:24       ` Vladimir Davydov
2015-07-25 16:24       ` Vladimir Davydov
2015-07-27 19:18   ` Kees Cook
2015-07-27 19:18     ` Kees Cook
2015-07-27 19:25     ` Andrew Morton
2015-07-27 19:25       ` Andrew Morton
2015-07-29 12:36 ` Michal Hocko
2015-07-29 12:36   ` Michal Hocko
     [not found]   ` <20150729123629.GI15801-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2015-07-29 13:59     ` Vladimir Davydov
2015-07-29 13:59       ` Vladimir Davydov
2015-07-29 13:59       ` Vladimir Davydov
2015-07-29 14:12       ` Michel Lespinasse
     [not found]         ` <CANN689HJX2ZL891uOd8TW9ct4PNH9d5odQZm86WMxkpkCWhA-w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-07-29 14:13           ` Michel Lespinasse
2015-07-29 14:13             ` Michel Lespinasse
2015-07-29 14:13             ` Michel Lespinasse
2015-07-29 14:45           ` Vladimir Davydov
2015-07-29 14:45             ` Vladimir Davydov
2015-07-29 14:45             ` Vladimir Davydov
2015-07-29 15:08             ` Michel Lespinasse
     [not found]               ` <CANN689Euq3Y-CHQo8q88vzFAYZX4S6rK+rZRfbuSKfS74u=gcg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-07-29 15:31                 ` Vladimir Davydov
2015-07-29 15:31                   ` Vladimir Davydov
2015-07-29 15:31                   ` Vladimir Davydov
2015-07-29 15:34                   ` Michel Lespinasse
2015-07-29 15:08             ` Michal Hocko
2015-07-29 15:08               ` Michal Hocko
     [not found]               ` <20150729150855.GM15801-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2015-07-29 15:36                 ` Vladimir Davydov
2015-07-29 15:36                   ` Vladimir Davydov
2015-07-29 15:36                   ` Vladimir Davydov
2015-07-29 15:58                   ` Michal Hocko
2015-07-29 15:58                     ` Michal Hocko
2015-07-29 14:26       ` Michal Hocko
2015-07-29 14:26         ` Michal Hocko
2015-07-29 15:28         ` Vladimir Davydov
2015-07-29 15:28           ` Vladimir Davydov
2015-07-29 15:47           ` Michal Hocko
2015-07-29 15:47             ` Michal Hocko
2015-07-29 15:47             ` Michal Hocko
     [not found]             ` <20150729154718.GN15801-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2015-07-29 16:29               ` Vladimir Davydov
2015-07-29 16:29                 ` Vladimir Davydov
2015-07-29 16:29                 ` Vladimir Davydov
2015-07-29 21:30                 ` Andrew Morton
2015-07-29 21:30                   ` Andrew Morton
2015-07-29 21:30                   ` Andrew Morton
2015-07-30  9:12                   ` Vladimir Davydov
2015-07-30  9:12                     ` Vladimir Davydov
2015-07-30 13:01                     ` Vladimir Davydov
2015-07-30 13:01                       ` Vladimir Davydov
2015-07-30 13:01                       ` Vladimir Davydov
2015-07-31  9:34                       ` Vladimir Davydov
2015-07-31  9:34                         ` Vladimir Davydov
2015-07-31  9:34                         ` Vladimir Davydov
2015-07-30  9:07                 ` Michal Hocko
2015-07-30  9:07                   ` Michal Hocko
2015-07-30  9:07                   ` Michal Hocko
     [not found]                   ` <20150730090708.GE9387-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2015-07-30  9:31                     ` Vladimir Davydov
2015-07-30  9:31                       ` Vladimir Davydov
2015-07-30  9:31                       ` Vladimir Davydov
2015-07-29 15:55           ` Andres Lagar-Cavilla
2015-07-29 15:55             ` Andres Lagar-Cavilla
     [not found]             ` <CAJu=L59RdowYjTyVM0Vhz79A4d=d8=ZmU7PB59CmEj5B0_c48Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-07-29 16:37               ` Vladimir Davydov
2015-07-29 16:37                 ` Vladimir Davydov
2015-07-29 16:37                 ` Vladimir Davydov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20150722152029.GL23374@esperanza \
    --to=vdavydov-bzqdu9zft3wakbo8gow8eq@public.gmane.org \
    --cc=akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org \
    --cc=andreslc-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org \
    --cc=cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org \
    --cc=corbet-T1hC0tSOHrs@public.gmane.org \
    --cc=gorcunov-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org \
    --cc=gthelen-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org \
    --cc=hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org \
    --cc=linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org \
    --cc=linux-doc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org \
    --cc=linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org \
    --cc=linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org \
    --cc=mhocko-AlSwsSmVLrQ@public.gmane.org \
    --cc=minchan-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org \
    --cc=raghavendra.kt-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org \
    --cc=rientjes-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org \
    --cc=walken-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org \
    --cc=xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.