From: Fengguang Wu <fengguang.wu@intel.com>
To: Jaegeuk Hanse <jaegeuk.hanse@gmail.com>
Cc: metin d <metdos@yahoo.com>, Jan Kara <jack@suse.cz>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>
Subject: Re: Problem in Page Cache Replacement
Date: Wed, 21 Nov 2012 17:02:04 +0800
Message-ID: <20121121090204.GA9064@localhost>
In-Reply-To: <50AC9220.70202@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 3200 bytes --]

On Wed, Nov 21, 2012 at 04:34:40PM +0800, Jaegeuk Hanse wrote:
> Cc Fengguang Wu.
> 
> On 11/21/2012 04:13 PM, metin d wrote:
> >>   Curious. Added linux-mm list to CC to catch more attention. If you run
> >>echo 1 >/proc/sys/vm/drop_caches does it evict data-1 pages from memory?
> >I'm guessing it'd evict the entries, but am wondering if we could run any more diagnostics before trying this.
> >
> >We regularly use a setup where we have two databases; one gets used frequently and the other about once a month. It seems like the memory manager keeps unused pages in memory at the expense of the frequently used database's performance.

> >My understanding was that under memory pressure from heavily
> >accessed pages, unused pages would eventually get evicted. Is there
> >anything else we can try on this host to understand why this is
> >happening?

We can debug it this way:

1) Run 'fadvise data-2 0 0 dontneed' to drop data-2's cached pages.
   (Please double-check via /proc/vmstat that it does the expected work.)

2) Run 'page-types -r' as root to view the page status of the
   remaining data-1 pages.
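
A minimal sketch of the check in step 1, for illustration only (it is not
the attached tool): it reads one counter from /proc/vmstat and calls
posix_fadvise() with POSIX_FADV_DONTNEED on a single file, so the counter
can be compared before and after. The helper names vmstat_counter() and
drop_file_cache() are made up for this example, and nr_file_pages is only
one counter you might watch. DONTNEED does not discard dirty pages, and
other activity on the host also moves the counter, so treat the difference
as a rough indication.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Return the value of one /proc/vmstat counter, or -1 if it is not found. */
static long long vmstat_counter(const char *name)
{
	char key[64];
	long long value;
	long long result = -1;
	FILE *fp;

	assert(name != NULL);
	fp = fopen("/proc/vmstat", "r");
	if (!fp)
		return -1;
	/* Every line has the form "<counter name> <value>". */
	while (fscanf(fp, "%63s %lld", key, &value) == 2) {
		if (strcmp(key, name) == 0) {
			result = value;
			break;
		}
	}
	fclose(fp);
	return result;
}

/*
 * Ask the kernel to drop the cached pages of a whole file.
 * Offset 0 with length 0 means "from the start to the end of the file".
 * Returns -1 if the file cannot be opened, otherwise the posix_fadvise()
 * result, which is 0 on success or a positive error number.
 */
static int drop_file_cache(const char *path)
{
	int fd = open(path, O_RDONLY);
	int err;

	if (fd < 0)
		return -1;
	err = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
	close(fd);
	return err;
}
```

Calling vmstat_counter("nr_file_pages") before and after drop_file_cache()
on one of data-2's files should show the counter falling by roughly the
number of clean pages that file had cached.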

The fadvise tool comes from Andrew Morton's ext3-tools (source code attached).
Please compile it with the options "-Dlinux -I. -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE".

page-types can be found in the kernel source tree at tools/vm/page-types.c.

Sorry if that sounds a bit convoluted. I do have a patch that directly
dumps the page cache status of a user-specified file, but it has not been
upstreamed yet.
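
Per-file residency can also be approximated from user space with mmap()
and mincore(). The following is a rough sketch under that assumption, not
the patch mentioned above: resident_pages() is a hypothetical helper name,
and it only reports counts, not the per-page flags that page-types shows.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* for mincore() */
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Count how many pages of `path` are resident in the page cache.
 * Stores the file's total page count in *total on success.
 * Returns the resident page count, or -1 on error.
 */
static long resident_pages(const char *path, long *total)
{
	long page_size = sysconf(_SC_PAGESIZE);
	long npages, i, resident = 0;
	unsigned char *vec;
	struct stat st;
	void *map;
	int fd;

	assert(path != NULL && total != NULL);
	*total = 0;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	if (fstat(fd, &st) < 0 || page_size <= 0) {
		close(fd);
		return -1;
	}
	if (st.st_size == 0) {		/* mmap() rejects zero-length maps */
		close(fd);
		return 0;
	}

	npages = (st.st_size + page_size - 1) / page_size;
	map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	close(fd);			/* the mapping stays valid after close */
	if (map == MAP_FAILED)
		return -1;

	vec = malloc(npages);
	if (vec && mincore(map, st.st_size, vec) == 0) {
		for (i = 0; i < npages; i++)
			resident += vec[i] & 1;	/* bit 0 set: page is in core */
		*total = npages;
	} else {
		resident = -1;
	}
	free(vec);
	munmap(map, st.st_size);
	return resident;
}
```

Running it over the files of data-1 and data-2 should give per-file
residency counts that can be compared with what fincore reports.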

Thanks,
Fengguang

> >On Tue 20-11-12 09:42:42, metin d wrote:
> >>I have two PostgreSQL databases named data-1 and data-2 that sit on the
> >>same machine. Both databases keep 40 GB of data, and the total memory
> >>available on the machine is 68GB.
> >>
> >>I started data-1 and data-2, and ran several queries to go over all their
> >>data. Then, I shut down data-1 and kept issuing queries against data-2.
> >>For some reason, the OS still holds on to large parts of data-1's pages
> >>in its page cache, and reserves about 35 GB of RAM for data-2's files. As
> >>a result, my queries on data-2 keep hitting disk.
> >>
> >>I'm checking page cache usage with fincore. When I run a table scan query
> >>against data-2, I see that data-2's pages get evicted and put back into
> >>the cache in a round-robin manner. Nothing happens to data-1's pages,
> >>although they haven't been touched for days.
> >>
> >>Does anybody know why data-1's pages aren't evicted from the page cache?
> >>I'm open to any suggestions you think might relate to the problem.
> >   Curious. Added linux-mm list to CC to catch more attention. If you run
> >echo 1 >/proc/sys/vm/drop_caches
> >   does it evict data-1 pages from memory?
> >
> >>This is an EC2 m2.4xlarge instance on Amazon with 68 GB of RAM and no
> >>swap space. The kernel version is:
> >>
> >>$ uname -r
> >>3.2.28-45.62.amzn1.x86_64
> >>Edit:
> >>
> >>It also seems that I have a single NUMA node, in case you think that could be a problem.
> >>
> >>$ numactl --hardware
> >>available: 1 nodes (0)
> >>node 0 cpus: 0 1 2 3 4 5 6 7
> >>node 0 size: 70007 MB
> >>node 0 free: 360 MB
> >>node distances:
> >>node   0
> >>    0:  10

[-- Attachment #2: fadvise.c --]
[-- Type: text/x-csrc, Size: 1904 bytes --]

#include <unistd.h>
#include <stdlib.h>
#include <fcntl.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

#include "fadvise.h"

char *progname;

static void usage(void)
{
	fprintf(stderr, "Usage: %s filename offset length advice [loops]\n", progname);
	fprintf(stderr, "      advice: normal sequential willneed noreuse "
					"dontneed asyncwrite writewait\n");
	exit(1);
}

int
main(int argc, char *argv[])
{
	int c;
	int fd;
	char *sadvice;
	char *filename;
	loff_t offset;
	unsigned long length;
	int advice = 0;
	int ret;
	int loops = 1;

	progname = argv[0];

	while ((c = getopt(argc, argv, "")) != -1) {
		switch (c) {
		}
	}

	if (optind == argc)
		usage();
	filename = argv[optind++];

	if (optind == argc)
		usage();
	offset = strtoull(argv[optind++], NULL, 0);

	if (optind == argc)
		usage();
	length = strtoul(argv[optind++], NULL, 0);	/* length is unsigned long */

	if (optind == argc)
		usage();
	sadvice = argv[optind++];

	if (optind != argc)
		loops = strtol(argv[optind++], NULL, 0);

	if (optind != argc)
		usage();

	if (!strcmp(sadvice, "normal"))
		advice = POSIX_FADV_NORMAL;
	else if (!strcmp(sadvice, "sequential"))
		advice = POSIX_FADV_SEQUENTIAL;
	else if (!strcmp(sadvice, "willneed"))
		advice = POSIX_FADV_WILLNEED;
	else if (!strcmp(sadvice, "noreuse"))
		advice = POSIX_FADV_NOREUSE;
	else if (!strcmp(sadvice, "dontneed"))
		advice = POSIX_FADV_DONTNEED;
	else if (!strcmp(sadvice, "asyncwrite"))
		advice = LINUX_FADV_ASYNC_WRITE;
	else if (!strcmp(sadvice, "writewait"))
		advice = LINUX_FADV_WRITE_WAIT;
	else
		usage();

	fd = open(filename, O_RDONLY);
	if (fd < 0) {
		fprintf(stderr, "%s: cannot open `%s': %s\n",
			progname, filename, strerror(errno));
		exit(1);
	}

	while (loops--) {
		ret = __posix_fadvise64(fd, offset, length, advice);
		if (ret) {
			fprintf(stderr, "%s: fadvise() failed: %s\n",
				progname, strerror(errno));
			exit(1);
		}
	}
	close(fd);
	exit(0);
}

[-- Attachment #3: fadvise.h --]
[-- Type: text/x-chdr, Size: 2375 bytes --]

#include <asm/unistd.h>
#include <sys/errno.h>
#include <unistd.h>	/* syscall() */

#ifndef __NR_fadvise64
#if defined (__i386__)
#define __NR_fadvise64          250
#elif defined(__powerpc__)
#define __NR_fadvise64          233
#elif defined(__ia64__)
#define __NR_fadvise64		1234
#elif defined(__x86_64__)
#define __NR_fadvise64		221
#endif
#endif

#ifndef LINUX_FADV_ASYNC_WRITE
#define LINUX_FADV_ASYNC_WRITE 32
#endif

#ifndef LINUX_FADV_WRITE_WAIT
#define LINUX_FADV_WRITE_WAIT 33
#endif

#ifndef __x86_64__
_syscall5(int,fadvise64, int,fd, long,offset_lo,
		long,offset_hi, size_t,len, int,advice)
#endif

/* Works by luck on ppc32, fails on ppc64 */
#if defined(__i386__)
int __posix_fadvise(int fd, off_t offset, size_t len, int advice)
{
	return fadvise64(fd, offset, 0, len, advice);
}

int __posix_fadvise64(int fd, loff_t offset, size_t len, int advice)
{
	return fadvise64(fd, offset, offset >> 32, len, advice);
}
#elif defined(__powerpc64__)
int __posix_fadvise(int fd, off_t offset, size_t len, int advice)
{
	return fadvise64(fd, offset, len, advice);
}

int __posix_fadvise64(int fd, loff_t offset, size_t len, int advice)
{
	return fadvise64(fd, offset, len, advice);
}
#elif defined(__powerpc__)

/* 
 * long longs are passed in an odd even register pair on ppc32 so
 * we need to pad before offset
 *
 * Note also the glibc syscall() function for ppc has been broken for
 * 6 argument syscalls until recently (~2.3.1 CVS)
 */
#define ppc_fadvise64(fd, offset_hi, offset_lo, len, advice) \
	syscall(__NR_fadvise64, fd, 0, offset_hi, offset_lo, len, advice)

int __posix_fadvise(int fd, off_t offset, size_t len, int advice)
{
	return ppc_fadvise64(fd, 0, offset, len, advice);
}

/* big endian, akpm. */
int __posix_fadvise64(int fd, loff_t offset, size_t len, int advice)
{
	return ppc_fadvise64(fd, (unsigned int)(offset >> 32),
			(unsigned int)(offset & 0xffffffff), len, advice);
}
#elif defined(__ia64__)
int __posix_fadvise(int fd, off_t offset, size_t len, int advice)
{
	return fadvise64(fd, offset, len, advice);
}

int __posix_fadvise64(int fd, loff_t offset, size_t len, int advice)
{
	return fadvise64(fd, offset, len, advice);
}
#elif defined(__x86_64__)
int __posix_fadvise(int fd, off_t offset, size_t len, int advice)
{
	return -1;
}

int __posix_fadvise64(int fd, loff_t offset, size_t len, int advice)
{
	return syscall(__NR_fadvise64, fd, offset, len, advice);
}
#endif
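
The i386 and ppc32 wrappers above pass the 64-bit file offset to the kernel
as two 32-bit halves. A small sketch of that split and its inverse, using
hypothetical helper names (offset_hi, offset_lo, offset_join) that are not
part of fadvise.h:

```c
#include <assert.h>
#include <stdint.h>

/* Upper 32 bits of a 64-bit file offset. */
static uint32_t offset_hi(int64_t offset)
{
	return (uint32_t)((uint64_t)offset >> 32);
}

/* Lower 32 bits of a 64-bit file offset. */
static uint32_t offset_lo(int64_t offset)
{
	return (uint32_t)((uint64_t)offset & 0xffffffffu);
}

/* Rebuild the 64-bit offset from its two halves. */
static int64_t offset_join(uint32_t hi, uint32_t lo)
{
	return (int64_t)(((uint64_t)hi << 32) | lo);
}

/* Round-trip check for an offset of 5 GiB, which does not fit in 32 bits. */
static void offset_split_selfcheck(void)
{
	int64_t off = 0x140000000LL;

	assert(offset_hi(off) == 1u);
	assert(offset_lo(off) == 0x40000000u);
	assert(offset_join(offset_hi(off), offset_lo(off)) == off);
}
```

The casts through uint64_t keep the shift well defined for any offset
value; which half goes in which argument slot still depends on the
architecture, as the ppc32 comment above points out.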
