From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
stable@vger.kernel.org, Mel Gorman <mgorman@suse.de>,
Rob van der Heij <rvdheij@gmail.com>,
Andrew Morton <akpm@linux-foundation.org>,
Linus Torvalds <torvalds@linux-foundation.org>
Subject: [ 22/53] mm/fadvise.c: drain all pagevecs if POSIX_FADV_DONTNEED fails to discard all pages
Date: Tue, 26 Feb 2013 15:57:51 -0800
Message-ID: <20130226235622.126822927@linuxfoundation.org>
In-Reply-To: <20130226235619.844721947@linuxfoundation.org>
3.0-stable review patch. If anyone has any objections, please let me know.
------------------
From: Mel Gorman <mgorman@suse.de>
commit 67d46b296a1ba1477c0df8ff3bc5e0167a0b0732 upstream.
Rob van der Heij reported the following (paraphrased) in private mail.

	The scenario is that I want to avoid backups filling up the page
	cache and purging stuff that is more likely to be used again (this is
	with s390x Linux on z/VM, so I don't give it so much memory that
	we don't care anymore). So I have something with LD_PRELOAD that
	intercepts the close() call (from tar, in this case) and issues
	a posix_fadvise() just before closing the file.

	This mostly works, except for small files (less than 14 pages)
	that remain in the page cache after the fact.
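
For illustration, a close() interposer of the kind described in the report
might look roughly like the sketch below. This is not Rob's actual code: the
helper name drop_cached_pages() is hypothetical, and issuing the real close
through syscall(SYS_close, ...) is only one way to keep the wrapper from
calling itself.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: ask the kernel to drop the cached pages of fd.
 * Returns 0 on success or an error number, like posix_fadvise() itself.
 */
static int drop_cached_pages(int fd)
{
	struct stat st;

	/* Only bother with regular files; skip pipes, sockets, ttys, ... */
	if (fstat(fd, &st) == -1 || !S_ISREG(st.st_mode))
		return 0;

	/*
	 * offset 0 with len 0 covers the file up to its end. Dirty pages
	 * are not discarded; a backup tool that only reads leaves its
	 * pages clean.
	 */
	return posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
}

/* Replacement for libc's close(): drop the cache first, then really close. */
int close(int fd)
{
	drop_cached_pages(fd);	/* best effort, errors are ignored */
	return (int)syscall(SYS_close, fd);
}
```

Built as a shared object (cc -shared -fPIC) and named in LD_PRELOAD when
running tar, a wrapper like this would run on every close() the program makes.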
Unfortunately Rob has not had a chance to test this exact patch but the
test program below should reproduce the problem he described.

The issue is the per-cpu pagevecs for LRU additions. If the pages are
added by one CPU but fadvise() is called on another then the pages
remain resident because invalidate_mapping_pages() only drains the local
pagevecs via its call to pagevec_release(). The user-visible effect is
that a program that uses fadvise() properly is not obeyed.
A possible fix for this is to put the necessary smarts into
invalidate_mapping_pages() to globally drain the LRU pagevecs if a
pagevec page could not be discarded. The downside is that an inode
cache shrink would then send a global IPI, and memory pressure
potentially causing global IPI storms is very undesirable.

Instead, this patch adds a check during fadvise(POSIX_FADV_DONTNEED) to
see whether invalidate_mapping_pages() discarded all the requested pages.
If only a subset of the pages was discarded it drains the LRU pagevecs and
tries again. If the second attempt fails, it assumes this is due to the
pages being mapped, locked or dirty and does not care. With this patch, an
application using fadvise() correctly will be obeyed but there is a
downside that a malicious application can force the kernel to send
global IPIs and increase overhead.
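
The drain-and-retry logic can be pictured with a toy user-space model. The
sketch below is not kernel code: every name in it (toy_page, toy_invalidate(),
toy_dontneed() and so on) is made up for illustration, and it ignores the
mapped, locked and dirty cases entirely. It only models pages sitting in a
per-CPU pending batch that a local drain cannot reach.

```c
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS  2
#define NR_PAGES 8

/* Toy page: cached or not, and which CPU's pending batch still holds it. */
struct toy_page {
	bool cached;
	int pending_cpu;	/* -1 once the page has reached the LRU */
};

static struct toy_page pages[NR_PAGES];

/* A CPU adds a page to the cache; it first lands in that CPU's batch. */
static void toy_add_page(size_t idx, int cpu)
{
	pages[idx].cached = true;
	pages[idx].pending_cpu = cpu;
}

/* Move the pending pages of one CPU onto the LRU. */
static void toy_drain_cpu(int cpu)
{
	for (size_t i = 0; i < NR_PAGES; i++)
		if (pages[i].cached && pages[i].pending_cpu == cpu)
			pages[i].pending_cpu = -1;
}

/* The expensive global operation: drain the batches of every CPU. */
static void toy_drain_all(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		toy_drain_cpu(cpu);
}

/*
 * Discard pages in [start, end], draining only the calling CPU's batch.
 * Pages still pending on another CPU are skipped. Returns the number
 * of pages discarded.
 */
static unsigned long toy_invalidate(size_t start, size_t end, int cpu)
{
	unsigned long count = 0;

	toy_drain_cpu(cpu);
	for (size_t i = start; i <= end; i++) {
		if (pages[i].cached && pages[i].pending_cpu == -1) {
			pages[i].cached = false;
			count++;
		}
	}
	return count;
}

/* Same shape as the patch: invalidate, and on a shortfall drain all and retry. */
static unsigned long toy_dontneed(size_t start, size_t end, int cpu)
{
	unsigned long count = toy_invalidate(start, end, cpu);

	if (count < end - start + 1) {
		toy_drain_all();
		count += toy_invalidate(start, end, cpu);
	}
	return count;
}
```

If every page is added from CPU 0, toy_invalidate() called from CPU 1
discards nothing, whereas toy_dontneed() from CPU 1 discards all of them
because of the global drain before the second attempt.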
If accepted, I would like this to be considered as a -stable candidate.
It's not an urgent issue but it's a system call that is not working as
advertised which is weak.
The following test program demonstrates the problem. It should never
report that pages are still resident but will without this patch. It
assumes that CPU 0 and 1 exist.
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Assumed value: the definition is not part of the program as quoted.
 * The report above concerns files smaller than 14 pages.
 */
#define FILESIZE_PAGES 5

int main(void) {
	int fd;
	int pagesize = getpagesize();
	ssize_t written = 0, expected;
	char *buf;
	unsigned char *vec;
	int resident, i;
	cpu_set_t set;

	/* Prepare a buffer for writing */
	expected = FILESIZE_PAGES * pagesize;
	buf = malloc(expected + 1);
	if (buf == NULL) {
		printf("ENOMEM\n");
		exit(EXIT_FAILURE);
	}
	buf[expected] = 0;
	memset(buf, 'a', expected);

	/* Prepare the mincore vec */
	vec = malloc(FILESIZE_PAGES);
	if (vec == NULL) {
		printf("ENOMEM\n");
		exit(EXIT_FAILURE);
	}

	/* Bind ourselves to CPU 0 */
	CPU_ZERO(&set);
	CPU_SET(0, &set);
	if (sched_setaffinity(getpid(), sizeof(set), &set) == -1) {
		perror("sched_setaffinity");
		exit(EXIT_FAILURE);
	}

	/* open file, unlink and write buffer (O_CREAT requires a mode) */
	fd = open("fadvise-test-file", O_CREAT|O_EXCL|O_RDWR, 0600);
	if (fd == -1) {
		perror("open");
		exit(EXIT_FAILURE);
	}
	unlink("fadvise-test-file");
	while (written < expected) {
		ssize_t this_write;
		this_write = write(fd, buf + written, expected - written);

		if (this_write == -1) {
			perror("write");
			exit(EXIT_FAILURE);
		}

		written += this_write;
	}
	free(buf);

	/*
	 * Force ourselves to another CPU. If fadvise only flushes the local
	 * CPUs pagevecs then the fadvise will fail to discard all file pages
	 */
	CPU_ZERO(&set);
	CPU_SET(1, &set);
	if (sched_setaffinity(getpid(), sizeof(set), &set) == -1) {
		perror("sched_setaffinity");
		exit(EXIT_FAILURE);
	}

	/* sync and fadvise to discard the page cache */
	fsync(fd);
	/* posix_fadvise() returns the error number instead of setting errno */
	errno = posix_fadvise(fd, 0, expected, POSIX_FADV_DONTNEED);
	if (errno != 0) {
		perror("posix_fadvise");
		exit(EXIT_FAILURE);
	}

	/* map the file and use mincore to see which parts of it are resident */
	buf = mmap(NULL, expected, PROT_READ, MAP_SHARED, fd, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		exit(EXIT_FAILURE);
	}
	if (mincore(buf, expected, vec) == -1) {
		perror("mincore");
		exit(EXIT_FAILURE);
	}

	/* Check residency */
	for (i = 0, resident = 0; i < FILESIZE_PAGES; i++) {
		if (vec[i])
			resident++;
	}
	if (resident != 0) {
		printf("Nr unexpected pages resident: %d\n", resident);
		exit(EXIT_FAILURE);
	}

	munmap(buf, expected);
	close(fd);
	free(vec);
	exit(EXIT_SUCCESS);
}
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Rob van der Heij <rvdheij@gmail.com>
Tested-by: Rob van der Heij <rvdheij@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
mm/fadvise.c | 18 ++++++++++++++++--
1 file changed, 16 insertions(+), 2 deletions(-)
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -17,6 +17,7 @@
 #include <linux/fadvise.h>
 #include <linux/writeback.h>
 #include <linux/syscalls.h>
+#include <linux/swap.h>

 #include <asm/unistd.h>

@@ -123,9 +124,22 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, lof
 		start_index = (offset+(PAGE_CACHE_SIZE-1)) >> PAGE_CACHE_SHIFT;
 		end_index = (endbyte >> PAGE_CACHE_SHIFT);

-		if (end_index >= start_index)
-			invalidate_mapping_pages(mapping, start_index,
+		if (end_index >= start_index) {
+			unsigned long count = invalidate_mapping_pages(mapping,
+						start_index, end_index);
+
+			/*
+			 * If fewer pages were invalidated than expected then
+			 * it is possible that some of the pages were on
+			 * a per-cpu pagevec for a remote CPU. Drain all
+			 * pagevecs and try again.
+			 */
+			if (count < (end_index - start_index + 1)) {
+				lru_add_drain_all();
+				invalidate_mapping_pages(mapping, start_index,
 						end_index);
+			}
+		}
 		break;
 	default:
 		ret = -EINVAL;