From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx201.postini.com [74.125.245.201])
	by kanga.kvack.org (Postfix) with SMTP id 9621F6B0073
	for ; Tue, 20 Nov 2012 13:25:21 -0500 (EST)
Date: Tue, 20 Nov 2012 19:25:00 +0100
From: Jan Kara
Subject: Re: Problem in Page Cache Replacement
Message-ID: <20121120182500.GH1408@quack.suse.cz>
References: <1353433362.85184.YahooMailNeo@web141101.mail.bf1.yahoo.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <1353433362.85184.YahooMailNeo@web141101.mail.bf1.yahoo.com>
Sender: owner-linux-mm@kvack.org
List-ID:
To: metin d
Cc: "linux-kernel@vger.kernel.org", linux-mm@kvack.org

On Tue 20-11-12 09:42:42, metin d wrote:
> I have two PostgreSQL databases named data-1 and data-2 that sit on the
> same machine. Both databases keep 40 GB of data, and the total memory
> available on the machine is 68GB.
>
> I started data-1 and data-2, and ran several queries to go over all their
> data. Then, I shut down data-1 and kept issuing queries against data-2.
> For some reason, the OS still holds on to large parts of data-1's pages
> in its page cache, and reserves about 35 GB of RAM to data-2's files. As
> a result, my queries on data-2 keep hitting disk.
>
> I'm checking page cache usage with fincore. When I run a table scan query
> against data-2, I see that data-2's pages get evicted and put back into
> the cache in a round-robin manner. Nothing happens to data-1's pages,
> although they haven't been touched for days.
>
> Does anybody know why data-1's pages aren't evicted from the page cache?
> I'm open to all kind of suggestions you think it might relate to problem.
  Curious. Added linux-mm list to CC to catch more attention. If you run
echo 1 >/proc/sys/vm/drop_caches
  does it evict data-1 pages from memory?

> This is an EC2 m2.4xlarge instance on Amazon with 68 GB of RAM and no
> swap space. The kernel version is:
>
> $ uname -r
> 3.2.28-45.62.amzn1.x86_64
> Edit:
>
> and it seems that I use one NUMA instance, if you think that it can a problem.
>
> $ numactl --hardware
> available: 1 nodes (0)
> node 0 cpus: 0 1 2 3 4 5 6 7
> node 0 size: 70007 MB
> node 0 free: 360 MB
> node distances:
> node   0
>   0:  10

								Honza
-- 
Jan Kara
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx168.postini.com [74.125.245.168])
	by kanga.kvack.org (Postfix) with SMTP id C960C6B0062
	for ; Wed, 21 Nov 2012 03:03:41 -0500 (EST)
References: <1353433362.85184.YahooMailNeo@web141101.mail.bf1.yahoo.com>
	<20121120182500.GH1408@quack.suse.cz>
Message-ID: <1353485020.53500.YahooMailNeo@web141104.mail.bf1.yahoo.com>
Date: Wed, 21 Nov 2012 00:03:40 -0800 (PST)
From: metin d
Reply-To: metin d
Subject: Re: Problem in Page Cache Replacement
In-Reply-To: <20121120182500.GH1408@quack.suse.cz>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: quoted-printable
Sender: owner-linux-mm@kvack.org
List-ID:
To: Jan Kara
Cc: "linux-kernel@vger.kernel.org", "linux-mm@kvack.org"

> Curious. Added linux-mm list to CC to catch more attention. If you run
> echo 1 >/proc/sys/vm/drop_caches does it evict data-1 pages from memory?

I'm guessing it'd evict the entries, but am wondering if we could run any more diagnostics before trying this.

We regularly use a setup where we have two databases; one gets used frequently and the other one about once a month. It seems like the memory manager keeps unused pages in memory at the expense of frequently used database's performance.

My understanding was that under memory pressure from heavily accessed pages, unused pages would eventually get evicted. Is there anything else we can try on this host to understand why this is happening?

Thank you,

Metin

----- Original Message -----
From: Jan Kara
To: metin d
Cc: "linux-kernel@vger.kernel.org"; linux-mm@kvack.org
Sent: Tuesday, November 20, 2012 8:25 PM
Subject: Re: Problem in Page Cache Replacement

On Tue 20-11-12 09:42:42, metin d wrote:
> I have two PostgreSQL databases named data-1 and data-2 that sit on the
> same machine. Both databases keep 40 GB of data, and the total memory
> available on the machine is 68GB.
>
> I started data-1 and data-2, and ran several queries to go over all their
> data. Then, I shut down data-1 and kept issuing queries against data-2.
> For some reason, the OS still holds on to large parts of data-1's pages
> in its page cache, and reserves about 35 GB of RAM to data-2's files. As
> a result, my queries on data-2 keep hitting disk.
>
> I'm checking page cache usage with fincore. When I run a table scan query
> against data-2, I see that data-2's pages get evicted and put back into
> the cache in a round-robin manner. Nothing happens to data-1's pages,
> although they haven't been touched for days.
>
> Does anybody know why data-1's pages aren't evicted from the page cache?
> I'm open to all kind of suggestions you think it might relate to problem.
  Curious. Added linux-mm list to CC to catch more attention. If you run
echo 1 >/proc/sys/vm/drop_caches
  does it evict data-1 pages from memory?

> This is an EC2 m2.4xlarge instance on Amazon with 68 GB of RAM and no
> swap space. The kernel version is:
>
> $ uname -r
> 3.2.28-45.62.amzn1.x86_64
> Edit:
>
> and it seems that I use one NUMA instance, if you think that it can a problem.
>
> $ numactl --hardware
> available: 1 nodes (0)
> node 0 cpus: 0 1 2 3 4 5 6 7
> node 0 size: 70007 MB
> node 0 free: 360 MB
> node distances:
> node   0
>   0:  10

								Honza
-- 
Jan Kara
SUSE Labs, CR

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx137.postini.com [74.125.245.137])
	by kanga.kvack.org (Postfix) with SMTP id 1E0146B006C
	for ; Wed, 21 Nov 2012 03:13:52 -0500 (EST)
References: <1353433362.85184.YahooMailNeo@web141101.mail.bf1.yahoo.com>
	<20121120182500.GH1408@quack.suse.cz>
	<1353485020.53500.YahooMailNeo@web141104.mail.bf1.yahoo.com>
Message-ID: <1353485630.17455.YahooMailNeo@web141106.mail.bf1.yahoo.com>
Date: Wed, 21 Nov 2012 00:13:50 -0800 (PST)
From: metin d
Reply-To: metin d
Subject: Re: Problem in Page Cache Replacement
In-Reply-To: <1353485020.53500.YahooMailNeo@web141104.mail.bf1.yahoo.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: quoted-printable
Sender: owner-linux-mm@kvack.org
List-ID:
To: Jan Kara
Cc: "linux-kernel@vger.kernel.org", "linux-mm@kvack.org"

> Curious. Added linux-mm list to CC to catch more attention. If you run
> echo 1 >/proc/sys/vm/drop_caches does it evict data-1 pages from memory?

I'm guessing it'd evict the entries, but am wondering if we could run any more diagnostics before trying this.

We regularly use a setup where we have two databases; one gets used frequently and the other one about once a month. It seems like the memory manager keeps unused pages in memory at the expense of frequently used database's performance.

My understanding was that under memory pressure from heavily accessed pages, unused pages would eventually get evicted. Is there anything else we can try on this host to understand why this is happening?

Thank you,

Metin

On Tue 20-11-12 09:42:42, metin d wrote:
> I have two PostgreSQL databases named data-1 and data-2 that sit on the
> same machine. Both databases keep 40 GB of data, and the total memory
> available on the machine is 68GB.
>
> I started data-1 and data-2, and ran several queries to go over all their
> data. Then, I shut down data-1 and kept issuing queries against data-2.
> For some reason, the OS still holds on to large parts of data-1's pages
> in its page cache, and reserves about 35 GB of RAM to data-2's files. As
> a result, my queries on data-2 keep hitting disk.
>
> I'm checking page cache usage with fincore. When I run a table scan query
> against data-2, I see that data-2's pages get evicted and put back into
> the cache in a round-robin manner. Nothing happens to data-1's pages,
> although they haven't been touched for days.
>
> Does anybody know why data-1's pages aren't evicted from the page cache?
> I'm open to all kind of suggestions you think it might relate to problem.
  Curious. Added linux-mm list to CC to catch more attention. If you run
echo 1 >/proc/sys/vm/drop_caches
  does it evict data-1 pages from memory?

> This is an EC2 m2.4xlarge instance on Amazon with 68 GB of RAM and no
> swap space. The kernel version is:
>
> $ uname -r
> 3.2.28-45.62.amzn1.x86_64
> Edit:
>
> and it seems that I use one NUMA instance, if you think that it can a problem.
>
> $ numactl --hardware
> available: 1 nodes (0)
> node 0 cpus: 0 1 2 3 4 5 6 7
> node 0 size: 70007 MB
> node 0 free: 360 MB
> node distances:
> node   0
>   0:  10

-- 
Jan Kara
SUSE Labs, CR

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx131.postini.com [74.125.245.131])
	by kanga.kvack.org (Postfix) with SMTP id 420DA6B0071
	for ; Wed, 21 Nov 2012 03:34:47 -0500 (EST)
Received: by mail-ob0-f169.google.com with SMTP id lz20so8566939obb.14
	for ; Wed, 21 Nov 2012 00:34:46 -0800 (PST)
Message-ID: <50AC9220.70202@gmail.com>
Date: Wed, 21 Nov 2012 16:34:40 +0800
From: Jaegeuk Hanse
MIME-Version: 1.0
Subject: Re: Problem in Page Cache Replacement
References: <1353433362.85184.YahooMailNeo@web141101.mail.bf1.yahoo.com>
	<20121120182500.GH1408@quack.suse.cz>
	<1353485020.53500.YahooMailNeo@web141104.mail.bf1.yahoo.com>
	<1353485630.17455.YahooMailNeo@web141106.mail.bf1.yahoo.com>
In-Reply-To: <1353485630.17455.YahooMailNeo@web141106.mail.bf1.yahoo.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: metin d, Fengguang Wu
Cc: Jan Kara, "linux-kernel@vger.kernel.org", "linux-mm@kvack.org"

Cc Fengguang Wu.

On 11/21/2012 04:13 PM, metin d wrote:
>> Curious. Added linux-mm list to CC to catch more attention. If you run
>> echo 1 >/proc/sys/vm/drop_caches does it evict data-1 pages from memory?
> I'm guessing it'd evict the entries, but am wondering if we could run any more diagnostics before trying this.
>
> We regularly use a setup where we have two databases; one gets used frequently and the other one about once a month. It seems like the memory manager keeps unused pages in memory at the expense of frequently used database's performance.
>
> My understanding was that under memory pressure from heavily accessed pages, unused pages would eventually get evicted. Is there anything else we can try on this host to understand why this is happening?
>
> Thank you,
>
> Metin
>
> On Tue 20-11-12 09:42:42, metin d wrote:
>> I have two PostgreSQL databases named data-1 and data-2 that sit on the
>> same machine. Both databases keep 40 GB of data, and the total memory
>> available on the machine is 68GB.
>>
>> I started data-1 and data-2, and ran several queries to go over all their
>> data. Then, I shut down data-1 and kept issuing queries against data-2.
>> For some reason, the OS still holds on to large parts of data-1's pages
>> in its page cache, and reserves about 35 GB of RAM to data-2's files. As
>> a result, my queries on data-2 keep hitting disk.
>>
>> I'm checking page cache usage with fincore. When I run a table scan query
>> against data-2, I see that data-2's pages get evicted and put back into
>> the cache in a round-robin manner. Nothing happens to data-1's pages,
>> although they haven't been touched for days.
>>
>> Does anybody know why data-1's pages aren't evicted from the page cache?
>> I'm open to all kind of suggestions you think it might relate to problem.
>   Curious. Added linux-mm list to CC to catch more attention. If you run
> echo 1 >/proc/sys/vm/drop_caches
>   does it evict data-1 pages from memory?
>
>> This is an EC2 m2.4xlarge instance on Amazon with 68 GB of RAM and no
>> swap space. The kernel version is:
>>
>> $ uname -r
>> 3.2.28-45.62.amzn1.x86_64
>> Edit:
>>
>> and it seems that I use one NUMA instance, if you think that it can a problem.
>>
>> $ numactl --hardware
>> available: 1 nodes (0)
>> node 0 cpus: 0 1 2 3 4 5 6 7
>> node 0 size: 70007 MB
>> node 0 free: 360 MB
>> node distances:
>> node   0
>>   0:  10

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx194.postini.com [74.125.245.194])
	by kanga.kvack.org (Postfix) with SMTP id 07BB86B0070
	for ; Wed, 21 Nov 2012 04:02:44 -0500 (EST)
Date: Wed, 21 Nov 2012 17:02:04 +0800
From: Fengguang Wu
Subject: Re: Problem in Page Cache Replacement
Message-ID: <20121121090204.GA9064@localhost>
References: <1353433362.85184.YahooMailNeo@web141101.mail.bf1.yahoo.com>
	<20121120182500.GH1408@quack.suse.cz>
	<1353485020.53500.YahooMailNeo@web141104.mail.bf1.yahoo.com>
	<1353485630.17455.YahooMailNeo@web141106.mail.bf1.yahoo.com>
	<50AC9220.70202@gmail.com>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="FCuugMFkClbJLl1L"
Content-Disposition: inline
In-Reply-To: <50AC9220.70202@gmail.com>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Jaegeuk Hanse
Cc: metin d, Jan Kara, "linux-kernel@vger.kernel.org", "linux-mm@kvack.org"

--FCuugMFkClbJLl1L
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

On Wed, Nov 21, 2012 at 04:34:40PM +0800, Jaegeuk Hanse wrote:
> Cc Fengguang Wu.
>
> On 11/21/2012 04:13 PM, metin d wrote:
> >>   Curious. Added linux-mm list to CC to catch more attention. If you run
> >>echo 1 >/proc/sys/vm/drop_caches does it evict data-1 pages from memory?
> >I'm guessing it'd evict the entries, but am wondering if we could run any more diagnostics before trying this.
> >
> >We regularly use a setup where we have two databases; one gets used frequently and the other one about once a month. It seems like the memory manager keeps unused pages in memory at the expense of frequently used database's performance.
> >My understanding was that under memory pressure from heavily
> >accessed pages, unused pages would eventually get evicted. Is there
> >anything else we can try on this host to understand why this is
> >happening?

We may debug it this way.

1) run 'fadvise data-2 0 0 dontneed' to drop data-2 cached pages
   (please double check via /proc/vmstat whether it does the expected work)

2) run 'page-types -r' with root, to view the page status for the
   remaining pages of data-1

The fadvise tool comes from Andrew Morton's ext3-tools. (source code attached)
Please compile them with options "-Dlinux -I. -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE"

page-types can be found in the kernel source tree tools/vm/page-types.c

Sorry that sounds a bit twisted.. I do have a patch to directly dump
page cache status of a user specified file, however it's not
upstreamed yet.

Thanks,
Fengguang

> >On Tue 20-11-12 09:42:42, metin d wrote:
> >>I have two PostgreSQL databases named data-1 and data-2 that sit on the
> >>same machine. Both databases keep 40 GB of data, and the total memory
> >>available on the machine is 68GB.
> >>
> >>I started data-1 and data-2, and ran several queries to go over all their
> >>data. Then, I shut down data-1 and kept issuing queries against data-2.
> >>For some reason, the OS still holds on to large parts of data-1's pages
> >>in its page cache, and reserves about 35 GB of RAM to data-2's files. As
> >>a result, my queries on data-2 keep hitting disk.
> >>
> >>I'm checking page cache usage with fincore. When I run a table scan query
> >>against data-2, I see that data-2's pages get evicted and put back into
> >>the cache in a round-robin manner. Nothing happens to data-1's pages,
> >>although they haven't been touched for days.
> >>
> >>Does anybody know why data-1's pages aren't evicted from the page cache?
> >>I'm open to all kind of suggestions you think it might relate to problem.
> >   Curious. Added linux-mm list to CC to catch more attention. If you run
> >echo 1 >/proc/sys/vm/drop_caches
> >   does it evict data-1 pages from memory?
> >
> >>This is an EC2 m2.4xlarge instance on Amazon with 68 GB of RAM and no
> >>swap space. The kernel version is:
> >>
> >>$ uname -r
> >>3.2.28-45.62.amzn1.x86_64
> >>Edit:
> >>
> >>and it seems that I use one NUMA instance, if you think that it can a problem.
> >>
> >>$ numactl --hardware
> >>available: 1 nodes (0)
> >>node 0 cpus: 0 1 2 3 4 5 6 7
> >>node 0 size: 70007 MB
> >>node 0 free: 360 MB
> >>node distances:
> >>node   0
> >>  0:  10

--FCuugMFkClbJLl1L
Content-Type: text/x-csrc; charset=us-ascii
Content-Disposition: attachment; filename="fadvise.c"

/* The header names were lost in this archive copy; these six are the ones the code below needs. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <errno.h>
#include <fcntl.h>

#include "fadvise.h"

char *progname;

static void usage(void)
{
	fprintf(stderr, "Usage: %s filename offset length advice [loops]\n", progname);
	fprintf(stderr, " advice: normal sequential willneed noreuse "
			"dontneed asyncwrite writewait\n");
	exit(1);
}

int
main(int argc, char *argv[])
{
	int c;
	int fd;
	char *sadvice;
	char *filename;
	loff_t offset;
	unsigned long length;
	int advice = 0;
	int ret;
	int loops = 1;

	progname = argv[0];

	while ((c = getopt(argc, argv, "")) != -1) {
		switch (c) {
		}
	}

	if (optind == argc)
		usage();
	filename = argv[optind++];

	if (optind == argc)
		usage();
	offset = strtoull(argv[optind++], NULL, 0);

	if (optind == argc)
		usage();
	length = strtol(argv[optind++], NULL, 0);

	if (optind == argc)
		usage();
	sadvice = argv[optind++];

	if (optind != argc)
		loops = strtol(argv[optind++], NULL, 0);

	if (optind != argc)
		usage();

	if (!strcmp(sadvice, "normal"))
		advice = POSIX_FADV_NORMAL;
	else if (!strcmp(sadvice, "sequential"))
		advice = POSIX_FADV_SEQUENTIAL;
	else if (!strcmp(sadvice, "willneed"))
		advice = POSIX_FADV_WILLNEED;
	else if (!strcmp(sadvice, "noreuse"))
		advice = POSIX_FADV_NOREUSE;
	else if (!strcmp(sadvice, "dontneed"))
		advice = POSIX_FADV_DONTNEED;
	else if (!strcmp(sadvice, "asyncwrite"))
		advice = LINUX_FADV_ASYNC_WRITE;
	else if (!strcmp(sadvice, "writewait"))
		advice = LINUX_FADV_WRITE_WAIT;
	else
		usage();

	fd = open(filename, O_RDONLY);
	if (fd < 0) {
		fprintf(stderr, "%s: cannot open `%s': %s\n",
			progname, filename, strerror(errno));
		exit(1);
	}

	while (loops--) {
		ret = __posix_fadvise64(fd, offset, length, advice);
		if (ret) {
			fprintf(stderr, "%s: fadvise() failed: %s\n",
				progname, strerror(errno));
			exit(1);
		}
	}
	close(fd);
	exit(0);
}

--FCuugMFkClbJLl1L
Content-Type: text/x-chdr; charset=us-ascii
Content-Disposition: attachment; filename="fadvise.h"

/* The header names were lost in this archive copy; the two below are an assumption. */
#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_fadvise64
#if defined (__i386__)
#define __NR_fadvise64	250
#elif defined(__powerpc__)
#define __NR_fadvise64	233
#elif defined(__ia64__)
#define __NR_fadvise64	1234
#elif defined(__x86_64__)
#define __NR_fadvise64	221
#endif
#endif

#ifndef LINUX_FADV_ASYNC_WRITE
#define LINUX_FADV_ASYNC_WRITE	32
#endif

#ifndef LINUX_FADV_WRITE_WAIT
#define LINUX_FADV_WRITE_WAIT	33
#endif

#ifndef __x86_64__
_syscall5(int,fadvise64, int,fd, long,offset_lo,
		long,offset_hi, size_t,len, int,advice)
#endif

/* Works by luck on ppc32, fails on ppc64 */
#if defined(__i386__)
int __posix_fadvise(int fd, off_t offset, size_t len, int advice)
{
	return fadvise64(fd, offset, 0, len, advice);
}

int __posix_fadvise64(int fd, loff_t offset, size_t len, int advice)
{
	return fadvise64(fd, offset, offset >> 32, len, advice);
}
#elif defined(__powerpc64__)
int __posix_fadvise(int fd, off_t offset, size_t len, int advice)
{
	return fadvise64(fd, offset, len, advice);
}

int __posix_fadvise64(int fd, loff_t offset, size_t len, int advice)
{
	return fadvise64(fd, offset, len, advice);
}
#elif defined(__powerpc__)

/*
 * long longs are passed in an odd even register pair on ppc32 so
 * we need to pad before offset
 *
 * Note also the glibc syscall() function for ppc has been broken for
 * 6 argument syscalls until recently (~2.3.1 CVS)
 */
#define ppc_fadvise64(fd, offset_hi, offset_lo, len, advice) \
	syscall(__NR_fadvise64, fd, 0, offset_hi, offset_lo, len, advice)

int __posix_fadvise(int fd, off_t offset, size_t len, int advice)
{
	return ppc_fadvise64(fd, 0, offset, len, advice);
}

/* big endian, akpm. */
int __posix_fadvise64(int fd, loff_t offset, size_t len, int advice)
{
	return ppc_fadvise64(fd, (unsigned int)(offset >> 32),
			(unsigned int)(offset & 0xffffffff), len, advice);
}
#elif defined(__ia64__)
int __posix_fadvise(int fd, off_t offset, size_t len, int advice)
{
	return fadvise64(fd, offset, len, advice);
}

int __posix_fadvise64(int fd, loff_t offset, size_t len, int advice)
{
	return fadvise64(fd, offset, len, advice);
}
#elif defined(__x86_64__)
int __posix_fadvise(int fd, off_t offset, size_t len, int advice)
{
	return -1;
}

int __posix_fadvise64(int fd, loff_t offset, size_t len, int advice)
{
	return syscall(__NR_fadvise64, fd, offset, len, advice);
}
#endif

--FCuugMFkClbJLl1L--

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx174.postini.com [74.125.245.174])
	by kanga.kvack.org (Postfix) with SMTP id BF4B46B006C
	for ; Wed, 21 Nov 2012 04:10:42 -0500 (EST)
Date: Wed, 21 Nov 2012 17:10:02 +0800
From: Fengguang Wu
Subject: Re: Problem in Page Cache Replacement
Message-ID: <20121121091002.GA10255@localhost>
References: <1353433362.85184.YahooMailNeo@web141101.mail.bf1.yahoo.com>
	<20121120182500.GH1408@quack.suse.cz>
	<1353485020.53500.YahooMailNeo@web141104.mail.bf1.yahoo.com>
	<1353485630.17455.YahooMailNeo@web141106.mail.bf1.yahoo.com>
	<50AC9220.70202@gmail.com>
	<20121121090204.GA9064@localhost>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20121121090204.GA9064@localhost>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Jaegeuk Hanse
Cc: metin d, Jan Kara, "linux-kernel@vger.kernel.org", "linux-mm@kvack.org"

On Wed, Nov 21, 2012 at 05:02:04PM +0800, Fengguang Wu wrote:
> On Wed, Nov 21, 2012 at 04:34:40PM +0800, Jaegeuk Hanse wrote:
> > Cc Fengguang Wu.
> >
> > On 11/21/2012 04:13 PM, metin d wrote:
> > >>   Curious. Added linux-mm list to CC to catch more attention. If you run
> > >>echo 1 >/proc/sys/vm/drop_caches does it evict data-1 pages from memory?
> > >I'm guessing it'd evict the entries, but am wondering if we could run any more diagnostics before trying this.
> > >
> > >We regularly use a setup where we have two databases; one gets used frequently and the other one about once a month. It seems like the memory manager keeps unused pages in memory at the expense of frequently used database's performance.
> > > >My understanding was that under memory pressure from heavily
> > >accessed pages, unused pages would eventually get evicted. Is there
> > >anything else we can try on this host to understand why this is
> > >happening?
>
> We may debug it this way.

Better to add a step

0) run 'page-types -r' to get an initial view of the page cache status.

Thanks,
Fengguang

> 1) run 'fadvise data-2 0 0 dontneed' to drop data-2 cached pages
>    (please double check via /proc/vmstat whether it does the expected work)
>
> 2) run 'page-types -r' with root, to view the page status for the
>    remaining pages of data-1
>
> The fadvise tool comes from Andrew Morton's ext3-tools. (source code attached)
> Please compile them with options "-Dlinux -I. -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE"
>
> page-types can be found in the kernel source tree tools/vm/page-types.c
>
> Sorry that sounds a bit twisted.. I do have a patch to directly dump
> page cache status of a user specified file, however it's not
> upstreamed yet.
>
> Thanks,
> Fengguang
>
> > >On Tue 20-11-12 09:42:42, metin d wrote:
> > >>I have two PostgreSQL databases named data-1 and data-2 that sit on the
> > >>same machine. Both databases keep 40 GB of data, and the total memory
> > >>available on the machine is 68GB.
> > >>
> > >>I started data-1 and data-2, and ran several queries to go over all their
> > >>data. Then, I shut down data-1 and kept issuing queries against data-2.
> > >>For some reason, the OS still holds on to large parts of data-1's pages
> > >>in its page cache, and reserves about 35 GB of RAM to data-2's files. As
> > >>a result, my queries on data-2 keep hitting disk.
> > >>
> > >>I'm checking page cache usage with fincore. When I run a table scan query
> > >>against data-2, I see that data-2's pages get evicted and put back into
> > >>the cache in a round-robin manner. Nothing happens to data-1's pages,
> > >>although they haven't been touched for days.
> > >>
> > >>Does anybody know why data-1's pages aren't evicted from the page cache?
> > >>I'm open to all kind of suggestions you think it might relate to problem.
> > >   Curious. Added linux-mm list to CC to catch more attention. If you run
> > >echo 1 >/proc/sys/vm/drop_caches
> > >   does it evict data-1 pages from memory?
> > >
> > >>This is an EC2 m2.4xlarge instance on Amazon with 68 GB of RAM and no
> > >>swap space. The kernel version is:
> > >>
> > >>$ uname -r
> > >>3.2.28-45.62.amzn1.x86_64
> > >>Edit:
> > >>
> > >>and it seems that I use one NUMA instance, if you think that it can a problem.
> > >>
> > >>$ numactl --hardware
> > >>available: 1 nodes (0)
> > >>node 0 cpus: 0 1 2 3 4 5 6 7
> > >>node 0 size: 70007 MB
> > >>node 0 free: 360 MB
> > >>node distances:
> > >>node   0
> > >>  0:  10
>
> /* The header names were lost in this archive copy; these six are the ones the code below needs. */
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
> #include <errno.h>
> #include <fcntl.h>
>
> #include "fadvise.h"
>
> char *progname;
>
> static void usage(void)
> {
> 	fprintf(stderr, "Usage: %s filename offset length advice [loops]\n", progname);
> 	fprintf(stderr, " advice: normal sequential willneed noreuse "
> 			"dontneed asyncwrite writewait\n");
> 	exit(1);
> }
>
> int
> main(int argc, char *argv[])
> {
> 	int c;
> 	int fd;
> 	char *sadvice;
> 	char *filename;
> 	loff_t offset;
> 	unsigned long length;
> 	int advice = 0;
> 	int ret;
> 	int loops = 1;
>
> 	progname = argv[0];
>
> 	while ((c = getopt(argc, argv, "")) != -1) {
> 		switch (c) {
> 		}
> 	}
>
> 	if (optind == argc)
> 		usage();
> 	filename = argv[optind++];
>
> 	if (optind == argc)
> 		usage();
> 	offset = strtoull(argv[optind++], NULL, 0);
>
> 	if (optind == argc)
> 		usage();
> 	length = strtol(argv[optind++], NULL, 0);
>
> 	if (optind == argc)
> 		usage();
> 	sadvice = argv[optind++];
>
> 	if (optind != argc)
> 		loops = strtol(argv[optind++], NULL, 0);
>
> 	if (optind != argc)
> 		usage();
>
> 	if (!strcmp(sadvice, "normal"))
> 		advice = POSIX_FADV_NORMAL;
> 	else if (!strcmp(sadvice, "sequential"))
> 		advice = POSIX_FADV_SEQUENTIAL;
> 	else if (!strcmp(sadvice, "willneed"))
> 		advice = POSIX_FADV_WILLNEED;
> 	else if (!strcmp(sadvice, "noreuse"))
> 		advice = POSIX_FADV_NOREUSE;
> 	else if (!strcmp(sadvice, "dontneed"))
> 		advice = POSIX_FADV_DONTNEED;
> 	else if (!strcmp(sadvice, "asyncwrite"))
> 		advice = LINUX_FADV_ASYNC_WRITE;
> 	else if (!strcmp(sadvice, "writewait"))
> 		advice = LINUX_FADV_WRITE_WAIT;
> 	else
> 		usage();
>
> 	fd = open(filename, O_RDONLY);
> 	if (fd < 0) {
> 		fprintf(stderr, "%s: cannot open `%s': %s\n",
> 			progname, filename, strerror(errno));
> 		exit(1);
> 	}
>
> 	while (loops--) {
> 		ret = __posix_fadvise64(fd, offset, length, advice);
> 		if (ret) {
> 			fprintf(stderr, "%s: fadvise() failed: %s\n",
> 				progname, strerror(errno));
> 			exit(1);
> 		}
> 	}
> 	close(fd);
> 	exit(0);
> }
>
> /* The header names were lost in this archive copy; the two below are an assumption. */
> #include <unistd.h>
> #include <sys/syscall.h>
>
> #ifndef __NR_fadvise64
> #if defined (__i386__)
> #define __NR_fadvise64	250
> #elif defined(__powerpc__)
> #define __NR_fadvise64	233
> #elif defined(__ia64__)
> #define __NR_fadvise64	1234
> #elif defined(__x86_64__)
> #define __NR_fadvise64	221
> #endif
> #endif
>
> #ifndef LINUX_FADV_ASYNC_WRITE
> #define LINUX_FADV_ASYNC_WRITE	32
> #endif
>
> #ifndef LINUX_FADV_WRITE_WAIT
> #define LINUX_FADV_WRITE_WAIT	33
> #endif
>
> #ifndef __x86_64__
> _syscall5(int,fadvise64, int,fd, long,offset_lo,
> 		long,offset_hi, size_t,len, int,advice)
> #endif
>
> /* Works by luck on ppc32, fails on ppc64 */
> #if defined(__i386__)
> int __posix_fadvise(int fd, off_t offset, size_t len, int advice)
> {
> 	return fadvise64(fd, offset, 0, len, advice);
> }
>
> int __posix_fadvise64(int fd, loff_t offset, size_t len, int advice)
> {
> 	return fadvise64(fd, offset, offset >> 32, len, advice);
> }
> #elif defined(__powerpc64__)
> int __posix_fadvise(int fd, off_t offset, size_t len, int advice)
> {
> 	return fadvise64(fd, offset, len, advice);
> }
>
> int __posix_fadvise64(int fd, loff_t offset, size_t len, int advice)
> {
> 	return fadvise64(fd, offset, len, advice);
> }
> #elif defined(__powerpc__)
>
> /*
>  * long longs are passed in an odd even register pair on ppc32 so
>  * we need to pad before offset
>  *
>  * Note also the glibc syscall() function for ppc has been broken for
>  * 6 argument syscalls until recently (~2.3.1 CVS)
>  */
> #define ppc_fadvise64(fd, offset_hi, offset_lo, len, advice) \
> 	syscall(__NR_fadvise64, fd, 0, offset_hi, offset_lo, len, advice)
>
> int __posix_fadvise(int fd, off_t offset, size_t len, int advice)
> {
> 	return ppc_fadvise64(fd, 0, offset, len, advice);
> }
>
> /* big endian, akpm. */
> int __posix_fadvise64(int fd, loff_t offset, size_t len, int advice)
> {
> 	return ppc_fadvise64(fd, (unsigned int)(offset >> 32),
> 			(unsigned int)(offset & 0xffffffff), len, advice);
> }
> #elif defined(__ia64__)
> int __posix_fadvise(int fd, off_t offset, size_t len, int advice)
> {
> 	return fadvise64(fd, offset, len, advice);
> }
>
> int __posix_fadvise64(int fd, loff_t offset, size_t len, int advice)
> {
> 	return fadvise64(fd, offset, len, advice);
> }
> #elif defined(__x86_64__)
> int __posix_fadvise(int fd, off_t offset, size_t len, int advice)
> {
> 	return -1;
> }
>
> int __posix_fadvise64(int fd, loff_t offset, size_t len, int advice)
> {
> 	return syscall(__NR_fadvise64, fd, offset, len, advice);
> }
> #endif

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx124.postini.com [74.125.245.124])
	by kanga.kvack.org (Postfix) with SMTP id 0A3DB6B0044
	for ; Wed, 21 Nov 2012 04:42:41 -0500 (EST)
Received: by mail-da0-f41.google.com with SMTP id e20so1621705dak.14
	for ; Wed, 21 Nov 2012 01:42:41 -0800 (PST)
Message-ID: <50ACA209.9000101@gmail.com>
Date: Wed, 21 Nov 2012 17:42:33 +0800
From: Jaegeuk Hanse
MIME-Version: 1.0
Subject: Re: Problem in Page Cache Replacement
References: <1353433362.85184.YahooMailNeo@web141101.mail.bf1.yahoo.com>
	<20121120182500.GH1408@quack.suse.cz>
	<1353485020.53500.YahooMailNeo@web141104.mail.bf1.yahoo.com>
	<1353485630.17455.YahooMailNeo@web141106.mail.bf1.yahoo.com>
	<50AC9220.70202@gmail.com>
	<20121121090204.GA9064@localhost>
In-Reply-To: <20121121090204.GA9064@localhost>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: Fengguang Wu
Cc: metin d, Jan Kara, "linux-kernel@vger.kernel.org", "linux-mm@kvack.org"

On 11/21/2012 05:02 PM, Fengguang Wu wrote:
> On Wed, Nov 21, 2012 at 04:34:40PM +0800, Jaegeuk Hanse wrote:
>> Cc Fengguang Wu.
>>
>> On 11/21/2012 04:13 PM, metin d wrote:
>>>>    Curious. Added linux-mm list to CC to catch more attention. If you run
>>>> echo 1 >/proc/sys/vm/drop_caches does it evict data-1 pages from memory?
>>> I'm guessing it'd evict the entries, but am wondering if we could run any more diagnostics before trying this.
>>>
>>> We regularly use a setup where we have two databases; one gets used frequently and the other one about once a month. It seems like the memory manager keeps unused pages in memory at the expense of frequently used database's performance.
>>> My understanding was that under memory pressure from heavily
>>> accessed pages, unused pages would eventually get evicted. Is there
>>> anything else we can try on this host to understand why this is
>>> happening?
> We may debug it this way.
>
> 1) run 'fadvise data-2 0 0 dontneed' to drop data-2 cached pages
>    (please double check via /proc/vmstat whether it does the expected work)
>
> 2) run 'page-types -r' with root, to view the page status for the
>    remaining pages of data-1
>
> The fadvise tool comes from Andrew Morton's ext3-tools. (source code attached)
> Please compile them with options "-Dlinux -I. -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE"
>
> page-types can be found in the kernel source tree tools/vm/page-types.c
>
> Sorry that sounds a bit twisted.. I do have a patch to directly dump
> page cache status of a user specified file, however it's not
> upstreamed yet.

Hi Fengguang,

Thanks for your detailed steps. I think Metin can give them a try.

             flags  page-count       MB  symbolic-flags                       long-symbolic-flags
0x0000000000000000      607699     2373  ___________________________________
0x0000000100000000      343227     1340  _______________________r___________  reserved

But I have some questions about the page-types output:

Does 2373 MB here mean the total memory in use, including the page cache? I don't think so.
Which kinds of pages are marked reserved?
Which line of long-symbolic-flags is for the page cache?

Regards,
Jaegeuk

>
> Thanks,
> Fengguang
>
>>> On Tue 20-11-12 09:42:42, metin d wrote:
>>>> I have two PostgreSQL databases named data-1 and data-2 that sit on the
>>>> same machine. Both databases keep 40 GB of data, and the total memory
>>>> available on the machine is 68GB.
>>>>
>>>> I started data-1 and data-2, and ran several queries to go over all their
>>>> data. Then, I shut down data-1 and kept issuing queries against data-2.
>>>> For some reason, the OS still holds on to large parts of data-1's pages
>>>> in its page cache, and reserves about 35 GB of RAM to data-2's files. As
>>>> a result, my queries on data-2 keep hitting disk.
>>>>
>>>> I'm checking page cache usage with fincore. When I run a table scan query
>>>> against data-2, I see that data-2's pages get evicted and put back into
>>>> the cache in a round-robin manner. Nothing happens to data-1's pages,
>>>> although they haven't been touched for days.
>>>>
>>>> Does anybody know why data-1's pages aren't evicted from the page cache?
>>>> I'm open to all kind of suggestions you think it might relate to problem.
>>>    Curious. Added linux-mm list to CC to catch more attention. If you run
>>> echo 1 >/proc/sys/vm/drop_caches
>>>    does it evict data-1 pages from memory?
>>>
>>>> This is an EC2 m2.4xlarge instance on Amazon with 68 GB of RAM and no
>>>> swap space. The kernel version is:
>>>>
>>>> $ uname -r
>>>> 3.2.28-45.62.amzn1.x86_64
>>>> Edit:
>>>>
>>>> and it seems that I use one NUMA instance, if you think that it can a problem.
>>>>
>>>> $ numactl --hardware
>>>> available: 1 nodes (0)
>>>> node 0 cpus: 0 1 2 3 4 5 6 7
>>>> node 0 size: 70007 MB
>>>> node 0 free: 360 MB
>>>> node distances:
>>>> node   0
>>>>   0:  10
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx109.postini.com [74.125.245.109]) by kanga.kvack.org (Postfix) with SMTP id E7C7E6B0044 for ; Wed, 21 Nov 2012 04:58:01 -0500 (EST) References: <1353433362.85184.YahooMailNeo@web141101.mail.bf1.yahoo.com> <20121120182500.GH1408@quack.suse.cz> <1353485020.53500.YahooMailNeo@web141104.mail.bf1.yahoo.com> <1353485630.17455.YahooMailNeo@web141106.mail.bf1.yahoo.com> <50AC9220.70202@gmail.com> <20121121090204.GA9064@localhost> <50ACA209.9000101@gmail.com> Message-ID: <1353491880.11679.YahooMailNeo@web141102.mail.bf1.yahoo.com> Date: Wed, 21 Nov 2012 01:58:00 -0800 (PST) From: metin d Reply-To: metin d Subject: Re: Problem in Page Cache Replacement In-Reply-To: <50ACA209.9000101@gmail.com> MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="-2140344373-809632473-1353491880=:11679" Sender: owner-linux-mm@kvack.org List-ID: To: Jaegeuk Hanse , Fengguang Wu Cc: Jan Kara , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , =?utf-8?B?TWV0aW4gRMO2xZ9sw7w=?= ---2140344373-809632473-1353491880=:11679 Content-Type: multipart/alternative; boundary="-2140344373-2060836916-1353491880=:11679" ---2140344373-2060836916-1353491880=:11679 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Hi Fengguang,=0A=0AI run tests and attached the results. 
The line below I g= uess shows the data-1 page caches.=0A=0A0x000000080000006c=C2=A0=C2=A0=C2= =A0 =C2=A0=C2=A0 6584051=C2=A0=C2=A0=C2=A0 25718=C2=A0 __RU_lA_____________= ______P________=C2=A0=C2=A0=C2=A0 referenced,uptodate,lru,active,private=0A= =0AMetin=0A=0A=0A=0A=0A----- Original Message -----=0AFrom: Jaegeuk Hanse <= jaegeuk.hanse@gmail.com>=0ATo: Fengguang Wu =0ACc: = metin d ; Jan Kara ; "linux-kernel@vger.ker= nel.org" ; "linux-mm@kvack.org" =0ASent: Wednesday, November 21, 2012 11:42 AM=0ASubject: Re: Proble= m in Page Cache Replacement=0A=0AOn 11/21/2012 05:02 PM, Fengguang Wu wrote= :=0A> On Wed, Nov 21, 2012 at 04:34:40PM +0800, Jaegeuk Hanse wrote:=0A>> C= c Fengguang Wu.=0A>>=0A>> On 11/21/2012 04:13 PM, metin d wrote:=0A>>>>=C2= =A0 =C2=A0 Curious. Added linux-mm list to CC to catch more attention. If y= ou run=0A>>>> echo 1 >/proc/sys/vm/drop_caches does it evict data-1 pages f= rom memory?=0A>>> I'm guessing it'd evict the entries, but am wondering if = we could run any more diagnostics before trying this.=0A>>>=0A>>> We regula= rly use a setup where we have two databases; one gets used frequently and t= he other one about once a month. It seems like the memory manager keeps unu= sed pages in memory at the expense of frequently used database's performanc= e.=0A>>> My understanding was that under memory pressure from heavily=0A>>>= accessed pages, unused pages would eventually get evicted. Is there=0A>>> = anything else we can try on this host to understand why this is=0A>>> happe= ning?=0A> We may debug it this way.=0A>=0A> 1) run 'fadvise data-2 0 0 dont= need' to drop data-2 cached pages=0A>=C2=A0 =C2=A0 (please double check vi= a /proc/vmstat whether it does the expected work)=0A>=0A> 2) run 'page-type= s -r' with root, to view the page status for the=0A>=C2=A0 =C2=A0 remainin= g pages of data-1=0A>=0A> The fadvise tool comes from Andrew Morton's ext3-= tools. (source code attached)=0A> Please compile them with options "-Dlinux= -I. 
-D_GNU_SOURCE -D_FILE_OFFSET_BITS=3D64 -D_LARGEFILE64_SOURCE"=0A>=0A> = page-types can be found in the kernel source tree tools/vm/page-types.c=0A>= =0A> Sorry that sounds a bit twisted.. I do have a patch to directly dump= =0A> page cache status of a user specified file, however it's not=0A> upstr= eamed yet.=0A=0AHi Fengguang,=0A=0AThanks for you detail steps, I think met= in can have a try.=0A=0A=C2=A0 =C2=A0 =C2=A0 =C2=A0 flags=C2=A0 =C2=A0 pag= e-count=C2=A0 =C2=A0 =C2=A0 MB=C2=A0 symbolic-flags long-symbolic-flags=0A= 0x0000000000000000=C2=A0 =C2=A0 =C2=A0 =C2=A0 607699=C2=A0 =C2=A0 2373 =0A= ___________________________________=0A0x0000000100000000=C2=A0 =C2=A0 =C2= =A0 =C2=A0 343227=C2=A0 =C2=A0 1340 =0A_______________________r___________= =C2=A0 =C2=A0 reserved=0A=0ABut I have some questions of the print of page-= type:=0A=0AIs 2373MB here mean total memory in used include page cache? I d= on't =0Athink so.=0AWhich kind of pages will be marked reserved?=0AWhich li= ne of long-symbolic-flags is for page cache?=0A=0ARegards,=0AJaegeuk=0A=0A>= =0A> Thanks,=0A> Fengguang=0A>=0A>>> On Tue 20-11-12 09:42:42, metin d wrot= e:=0A>>>> I have two PostgreSQL databases named data-1 and data-2 that sit = on the=0A>>>> same machine. Both databases keep 40 GB of data, and the tota= l memory=0A>>>> available on the machine is 68GB.=0A>>>>=0A>>>> I started d= ata-1 and data-2, and ran several queries to go over all their=0A>>>> data.= Then, I shut down data-1 and kept issuing queries against data-2.=0A>>>> F= or some reason, the OS still holds on to large parts of data-1's pages=0A>>= >> in its page cache, and reserves about 35 GB of RAM to data-2's files. As= =0A>>>> a result, my queries on data-2 keep hitting disk.=0A>>>>=0A>>>> I'm= checking page cache usage with fincore. When I run a table scan query=0A>>= >> against data-2, I see that data-2's pages get evicted and put back into= =0A>>>> the cache in a round-robin manner. 
Nothing happens to data-1's page= s,=0A>>>> although they haven't been touched for days.=0A>>>>=0A>>>> Does a= nybody know why data-1's pages aren't evicted from the page cache?=0A>>>> I= 'm open to all kind of suggestions you think it might relate to problem.=0A= >>>=C2=A0 =C2=A0 Curious. Added linux-mm list to CC to catch more attention= . If you run=0A>>> echo 1 >/proc/sys/vm/drop_caches=0A>>>=C2=A0 =C2=A0 does= it evict data-1 pages from memory?=0A>>>=0A>>>> This is an EC2 m2.4xlarge = instance on Amazon with 68 GB of RAM and no=0A>>>> swap space. The kernel v= ersion is:=0A>>>>=0A>>>> $ uname -r=0A>>>> 3.2.28-45.62.amzn1.x86_64=0A>>>>= Edit:=0A>>>>=0A>>>> and it seems that I use one NUMA instance, if=C2=A0 yo= u think that it can a problem.=0A>>>>=0A>>>> $ numactl --hardware=0A>>>> av= ailable: 1 nodes (0)=0A>>>> node 0 cpus: 0 1 2 3 4 5 6 7=0A>>>> node 0 size= : 70007 MB=0A>>>> node 0 free: 360 MB=0A>>>> node distances:=0A>>>> node=C2= =A0 0=0A>>>>=C2=A0 =C2=A0 0:=C2=A0 10 ---2140344373-2060836916-1353491880=:11679 Content-Type: text/html; charset=utf-8 Content-Transfer-Encoding: quoted-printable
Hi Fengguang,

I ran the tests and attached the results. I guess the line below shows data-1's page cache pages.

0x000000080000006c  6584051  25718  __RU_lA___________________P________  referenced,uptodate,lru,active,private

Metin
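
For anyone reading that line without page-types at hand, here is a minimal sketch in C of how the flags word maps onto the symbolic names. The bit positions are my reading of the kernel's kernel-page-flags.h and tools/vm/page-types.c, so please treat them as assumptions and check them against your own tree. kpf_decode() and pages_to_mb() are hypothetical helper names, and the MB conversion assumes 4 KiB pages.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Assumed bit positions (verify against kernel-page-flags.h and
 * page-types.c). Only the flags that show up in this thread are listed;
 * any other set bit is silently ignored. */
struct kpf_name {
    unsigned bit;
    const char *name;
};

static const struct kpf_name kpf_names[] = {
    { 0, "locked" },         { 2, "referenced" },     { 3, "uptodate" },
    { 4, "dirty" },          { 5, "lru" },            { 6, "active" },
    { 7, "slab" },           { 10, "buddy" },         { 11, "mmap" },
    { 12, "anonymous" },     { 14, "swapbacked" },    { 15, "compound_head" },
    { 16, "compound_tail" }, { 18, "unevictable" },   { 32, "reserved" },
    { 33, "mlocked" },       { 34, "mappedtodisk" },  { 35, "private" },
    { 37, "owner_private" }, { 48, "readahead" },
};

/* Hypothetical helper: write the comma-separated names of the set bits
 * into buf (truncating if buf is too small) and return buf. */
static char *kpf_decode(uint64_t flags, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    buf[0] = '\0';
    for (size_t i = 0; i < sizeof(kpf_names) / sizeof(kpf_names[0]); i++) {
        if (!(flags & (UINT64_C(1) << kpf_names[i].bit)))
            continue;
        if (buf[0] != '\0')
            strncat(buf, ",", len - strlen(buf) - 1);
        strncat(buf, kpf_names[i].name, len - strlen(buf) - 1);
    }
    return buf;
}

/* Hypothetical helper: page count to whole MB, assuming 4 KiB pages
 * (256 pages per MiB, rounded down). */
static uint64_t pages_to_mb(uint64_t pages)
{
    return pages / 256;
}
```

Calling kpf_decode(0x000000080000006c, buf, sizeof buf) gives "referenced,uptodate,lru,active,private", and pages_to_mb(6584051) gives 25718, which matches the columns in the line above.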



---2140344373-2060836916-1353491880=:11679-- ---2140344373-809632473-1353491880=:11679 Content-Type: text/plain; name="page-types_after.txt" Content-Transfer-Encoding: base64 Content-Disposition: attachment; filename="page-types_after.txt" ICAgICAgICAgICAgIGZsYWdzCXBhZ2UtY291bnQgICAgICAgTUIgIHN5bWJv bGljLWZsYWdzCQkJbG9uZy1zeW1ib2xpYy1mbGFncwoweDAwMDAwMDAwMDAw MDAwMDAJICAgNTUwODMxNyAgICAyMTUxNiAgX19fX19fX19fX19fX19fX19f X19fX19fX19fX19fX19fX18JCjB4MDAwMDAwMDEwMDAwMDAwMAkgICAgMzM1 OTkzICAgICAxMzEyICBfX19fX19fX19fX19fX19fX19fX19fX3JfX19fX19f X19fXwlyZXNlcnZlZAoweDAwMDAwMDIxMDAwMDAwMDAJICAgICAzNTYzNCAg ICAgIDEzOSAgX19fX19fX19fX19fX19fX19fX19fX19yX19fX09fX19fX18J cmVzZXJ2ZWQsb3duZXJfcHJpdmF0ZQoweDAwMDAwMDAwMDAwMTAwMDAJICAg ICA0NTA2OSAgICAgIDE3NiAgX19fX19fX19fX19fX19fX1RfX19fX19fX19f X19fX19fX18JY29tcG91bmRfdGFpbAoweDAwMDAwMDIwMDAwMDAwMDAJICAg ICAgMTUxNiAgICAgICAgNSAgX19fX19fX19fX19fX19fX19fX19fX19fX19f X09fX19fX18Jb3duZXJfcHJpdmF0ZQoweDAwMDAwMDA4MDAwMDAwMDQJICAg ICAgICAgMSAgICAgICAgMCAgX19SX19fX19fX19fX19fX19fX19fX19fX19Q X19fX19fX18JcmVmZXJlbmNlZCxwcml2YXRlCjB4MDAwMDAwMDAwMDAwODAw MAkgICAgICAgIDEwICAgICAgICAwICBfX19fX19fX19fX19fX19IX19fX19f X19fX19fX19fX19fXwljb21wb3VuZF9oZWFkCjB4MDAwMDAwMDAwMDAwMDAw NAkgICAgICAgICAxICAgICAgICAwICBfX1JfX19fX19fX19fX19fX19fX19f X19fX19fX19fX19fXwlyZWZlcmVuY2VkCjB4MDAwMDAwMDgwMDAwMDAyNAkg ICAgICAgMTY2ICAgICAgICAwICBfX1JfX2xfX19fX19fX19fX19fX19fX19f X1BfX19fX19fXwlyZWZlcmVuY2VkLGxydSxwcml2YXRlCjB4MDAwMDAwMDQw MDAwMDAyOAkgICAgICAgMjk1ICAgICAgICAxICBfX19VX2xfX19fX19fX19f X19fX19fX19fZF9fX19fX19fXwl1cHRvZGF0ZSxscnUsbWFwcGVkdG9kaXNr CjB4MDAwMTAwMDQwMDAwMDAyOAkgICAgICAgICAzICAgICAgICAwICBfX19V X2xfX19fX19fX19fX19fX19fX19fZF9fX19fSV9fXwl1cHRvZGF0ZSxscnUs bWFwcGVkdG9kaXNrLHJlYWRhaGVhZAoweDAwMDAwMDAwMDAwMDAwMjgJICAg ICAgICAgMSAgICAgICAgMCAgX19fVV9sX19fX19fX19fX19fX19fX19fX19f X19fX19fX18JdXB0b2RhdGUsbHJ1CjB4MDAwMDAwMDQwMDAwMDAyYwkgICAg MjYyMTQ0ICAgICAxMDI0ICBfX1JVX2xfX19fX19fX19fX19fX19fX19fZF9f 
X19fX19fXwlyZWZlcmVuY2VkLHVwdG9kYXRlLGxydSxtYXBwZWR0b2Rpc2sK MHgwMDAwMDAwODAwMDAwMDJjCSAgICAgICAgIDUgICAgICAgIDAgIF9fUlVf bF9fX19fX19fX19fX19fX19fX19fUF9fX19fX19fCXJlZmVyZW5jZWQsdXB0 b2RhdGUsbHJ1LHByaXZhdGUKMHgwMDAwMDAwMDAwMDA0MDNjCSAgICAgICAx ODUgICAgICAgIDAgIF9fUlVEbF9fX19fX19fYl9fX19fX19fX19fX19fX19f X19fCXJlZmVyZW5jZWQsdXB0b2RhdGUsZGlydHksbHJ1LHN3YXBiYWNrZWQK MHgwMDAwMDAwODAwMDAwMDYwCSAgICAgICAxNjMgICAgICAgIDAgIF9fX19f bEFfX19fX19fX19fX19fX19fX19fUF9fX19fX19fCWxydSxhY3RpdmUscHJp dmF0ZQoweDAwMDAwMDA4MDAwMDAwNjQJICAgICAzNjczOSAgICAgIDE0MyAg X19SX19sQV9fX19fX19fX19fX19fX19fX19QX19fX19fX18JcmVmZXJlbmNl ZCxscnUsYWN0aXZlLHByaXZhdGUKMHgwMDAwMDAwNDAwMDAwMDY4CSAgICA1 Mjc4MTAgICAgIDIwNjEgIF9fX1VfbEFfX19fX19fX19fX19fX19fX19kX19f X19fX19fCXVwdG9kYXRlLGxydSxhY3RpdmUsbWFwcGVkdG9kaXNrCjB4MDAw MDAwMDgwMDAwMDA2OAkgICAgICAgNTc2ICAgICAgICAyICBfX19VX2xBX19f X19fX19fX19fX19fX19fX1BfX19fX19fXwl1cHRvZGF0ZSxscnUsYWN0aXZl LHByaXZhdGUKMHgwMDAwMDAwYzAwMDAwMDY4CSAgICAgICAxMTYgICAgICAg IDAgIF9fX1VfbEFfX19fX19fX19fX19fX19fX19kUF9fX19fX19fCXVwdG9k YXRlLGxydSxhY3RpdmUsbWFwcGVkdG9kaXNrLHByaXZhdGUKMHgwMDAwMDAw ODAwMDAwMDZjCSAgIDY1ODQwNTEgICAgMjU3MTggIF9fUlVfbEFfX19fX19f X19fX19fX19fX19fUF9fX19fX19fCXJlZmVyZW5jZWQsdXB0b2RhdGUsbHJ1 LGFjdGl2ZSxwcml2YXRlCjB4MDAwMDAwMDQwMDAwMDA2YwkgICAxMzAyMjEx ICAgICA1MDg2ICBfX1JVX2xBX19fX19fX19fX19fX19fX19fZF9fX19fX19f XwlyZWZlcmVuY2VkLHVwdG9kYXRlLGxydSxhY3RpdmUsbWFwcGVkdG9kaXNr CjB4MDAwMDAwMGMwMDAwMDA2YwkgICAgICAgNDMxICAgICAgICAxICBfX1JV X2xBX19fX19fX19fX19fX19fX19fZFBfX19fX19fXwlyZWZlcmVuY2VkLHVw dG9kYXRlLGxydSxhY3RpdmUsbWFwcGVkdG9kaXNrLHByaXZhdGUKMHgwMDAw MDAwMDAwMDAwMDZjCSAgICAgICAxMjggICAgICAgIDAgIF9fUlVfbEFfX19f X19fX19fX19fX19fX19fX19fX19fX19fCXJlZmVyZW5jZWQsdXB0b2RhdGUs bHJ1LGFjdGl2ZQoweDAwMDAwMDA4MDAwMDAwNzQJICAgICAgICAgMiAgICAg ICAgMCAgX19SX0RsQV9fX19fX19fX19fX19fX19fX19QX19fX19fX18JcmVm ZXJlbmNlZCxkaXJ0eSxscnUsYWN0aXZlLHByaXZhdGUKMHgwMDAwMDAwMDAw MDA0MDc4CSAgICAgICAgNTYgICAgICAgIDAgIF9fX1VEbEFfX19fX19fYl9f 
X19fX19fX19fX19fX19fX19fCXVwdG9kYXRlLGRpcnR5LGxydSxhY3RpdmUs c3dhcGJhY2tlZAoweDAwMDAwMDAwMDAwMDQwN2MJICAgICAgIDEyMiAgICAg ICAgMCAgX19SVURsQV9fX19fX19iX19fX19fX19fX19fX19fX19fX18JcmVm ZXJlbmNlZCx1cHRvZGF0ZSxkaXJ0eSxscnUsYWN0aXZlLHN3YXBiYWNrZWQK MHgwMDAwMDAwODAwMDAwMDdjCSAgICAgICAgIDEgICAgICAgIDAgIF9fUlVE bEFfX19fX19fX19fX19fX19fX19fUF9fX19fX19fCXJlZmVyZW5jZWQsdXB0 b2RhdGUsZGlydHksbHJ1LGFjdGl2ZSxwcml2YXRlCjB4MDAwMDAwMDAwMDAw ODA4MAkgICAgIDE0NDk1ICAgICAgIDU2ICBfX19fX19fU19fX19fX19IX19f X19fX19fX19fX19fX19fXwlzbGFiLGNvbXBvdW5kX2hlYWQKMHgwMDAwMDAw MDAwMDAwMDgwCSAgICAyNTA0OTggICAgICA5NzggIF9fX19fX19TX19fX19f X19fX19fX19fX19fX19fX19fX19fCXNsYWIKMHgwMDAwMDAwMDAwMDAwNDAw CSAgIDI5OTA5MDggICAgMTE2ODMgIF9fX19fX19fX19CX19fX19fX19fX19f X19fX19fX19fX19fCWJ1ZGR5CjB4MDAwMDAwMDAwMDAwMDgwMAkgICAgICAg IDE2ICAgICAgICAwICBfX19fX19fX19fX01fX19fX19fX19fX19fX19fX19f X19fXwltbWFwCjB4MDAwMDAwMDEwMDAwMDgwNAkgICAgICAgICAxICAgICAg ICAwICBfX1JfX19fX19fX01fX19fX19fX19fX3JfX19fX19fX19fXwlyZWZl cmVuY2VkLG1tYXAscmVzZXJ2ZWQKMHgwMDAwMDAwNjAwMDQwODJjCSAgICAg ICAzOTEgICAgICAgIDEgIF9fUlVfbF9fX19fTV9fX19fX3VfX19fX21kX19f X19fX19fCXJlZmVyZW5jZWQsdXB0b2RhdGUsbHJ1LG1tYXAsdW5ldmljdGFi bGUsbWxvY2tlZCxtYXBwZWR0b2Rpc2sKMHgwMDAwMDAwYTAwMDQwODJjCSAg ICAgICAzMjEgICAgICAgIDEgIF9fUlVfbF9fX19fTV9fX19fX3VfX19fX21f UF9fX19fX19fCXJlZmVyZW5jZWQsdXB0b2RhdGUsbHJ1LG1tYXAsdW5ldmlj dGFibGUsbWxvY2tlZCxwcml2YXRlCjB4MDAwMDAwMDAwMDAwNDgzOAkgICAg ICA4NDUwICAgICAgIDMzICBfX19VRGxfX19fX01fX2JfX19fX19fX19fX19f X19fX19fXwl1cHRvZGF0ZSxkaXJ0eSxscnUsbW1hcCxzd2FwYmFja2VkCjB4 MDAwMDAwMDAwMDAwNDgzYwkgICAgICAyMDQ1ICAgICAgICA3ICBfX1JVRGxf X19fX01fX2JfX19fX19fX19fX19fX19fX19fXwlyZWZlcmVuY2VkLHVwdG9k YXRlLGRpcnR5LGxydSxtbWFwLHN3YXBiYWNrZWQKMHgwMDAwMDAwODAwMDAw ODY4CSAgICAgICAgMTkgICAgICAgIDAgIF9fX1VfbEFfX19fTV9fX19fX19f X19fX19fUF9fX19fX19fCXVwdG9kYXRlLGxydSxhY3RpdmUsbW1hcCxwcml2 YXRlCjB4MDAwMDAwMDQwMDAwMDg2OAkgICAgICAgICA1ICAgICAgICAwICBf X19VX2xBX19fX01fX19fX19fX19fX19fZF9fX19fX19fXwl1cHRvZGF0ZSxs 
cnUsYWN0aXZlLG1tYXAsbWFwcGVkdG9kaXNrCjB4MDAwMDAwMDQwMDAwMDg2 YwkgICAgICAxODkxICAgICAgICA3ICBfX1JVX2xBX19fX01fX19fX19fX19f X19fZF9fX19fX19fXwlyZWZlcmVuY2VkLHVwdG9kYXRlLGxydSxhY3RpdmUs bW1hcCxtYXBwZWR0b2Rpc2sKMHgwMDAwMDAwODAwMDAwODZjCSAgICAgICAx MjYgICAgICAgIDAgIF9fUlVfbEFfX19fTV9fX19fX19fX19fX19fUF9fX19f X19fCXJlZmVyZW5jZWQsdXB0b2RhdGUsbHJ1LGFjdGl2ZSxtbWFwLHByaXZh dGUKMHgwMDAwMDAwMDAwMDA0ODc4CSAgICAgICAgODUgICAgICAgIDAgIF9f X1VEbEFfX19fTV9fYl9fX19fX19fX19fX19fX19fX19fCXVwdG9kYXRlLGRp cnR5LGxydSxhY3RpdmUsbW1hcCxzd2FwYmFja2VkCjB4MDAwMDAwMDAwMDAw NDg3YwkgICAgICAyMjYzICAgICAgICA4ICBfX1JVRGxBX19fX01fX2JfX19f X19fX19fX19fX19fX19fXwlyZWZlcmVuY2VkLHVwdG9kYXRlLGRpcnR5LGxy dSxhY3RpdmUsbW1hcCxzd2FwYmFja2VkCjB4MDAwMDAwMDAwMDAwNTAwOAkg ICAgICAgIDEzICAgICAgICAwICBfX19VX19fX19fX19hX2JfX19fX19fX19f X19fX19fX19fXwl1cHRvZGF0ZSxhbm9ueW1vdXMsc3dhcGJhY2tlZAoweDAw MDAwMDAwMDAwMDU4MDgJICAgICAgICAxNiAgICAgICAgMCAgX19fVV9fX19f X19NYV9iX19fX19fX19fX19fX19fX19fX18JdXB0b2RhdGUsbW1hcCxhbm9u eW1vdXMsc3dhcGJhY2tlZAoweDAwMDAwMDAyMDAwNDU4MjgJICAgICAgICAg OCAgICAgICAgMCAgX19fVV9sX19fX19NYV9iX19fdV9fX19fbV9fX19fX19f X18JdXB0b2RhdGUsbHJ1LG1tYXAsYW5vbnltb3VzLHN3YXBiYWNrZWQsdW5l dmljdGFibGUsbWxvY2tlZAoweDAwMDAwMDAyMDAwNDU4MmMJICAgICAgIDY1 MSAgICAgICAgMiAgX19SVV9sX19fX19NYV9iX19fdV9fX19fbV9fX19fX19f X18JcmVmZXJlbmNlZCx1cHRvZGF0ZSxscnUsbW1hcCxhbm9ueW1vdXMsc3dh cGJhY2tlZCx1bmV2aWN0YWJsZSxtbG9ja2VkCjB4MDAwMDAwMDAwMDAwNTg2 OAkgICAgICA4MDU4ICAgICAgIDMxICBfX19VX2xBX19fX01hX2JfX19fX19f X19fX19fX19fX19fXwl1cHRvZGF0ZSxscnUsYWN0aXZlLG1tYXAsYW5vbnlt b3VzLHN3YXBiYWNrZWQKMHgwMDAwMDAwMDAwMDA1ODZjCSAgICAgICAgNDIg ICAgICAgIDAgIF9fUlVfbEFfX19fTWFfYl9fX19fX19fX19fX19fX19fX19f CXJlZmVyZW5jZWQsdXB0b2RhdGUsbHJ1LGFjdGl2ZSxtbWFwLGFub255bW91 cyxzd2FwYmFja2VkCiAgICAgICAgICAgICB0b3RhbAkgIDE3OTIyMDQ4ICAg IDcwMDA4Cgo= ---2140344373-809632473-1353491880=:11679 Content-Type: text/plain; name="page-types_before.txt" Content-Transfer-Encoding: base64 Content-Disposition: attachment; filename="page-types_before.txt" 
ICAgICAgICAgICAgIGZsYWdzCXBhZ2UtY291bnQgICAgICAgTUIgIHN5bWJv bGljLWZsYWdzCQkJbG9uZy1zeW1ib2xpYy1mbGFncwoweDAwMDAwMDAwMDAw MDAwMDAJICAgIDEyMTYyOCAgICAgIDQ3NSAgX19fX19fX19fX19fX19fX19f X19fX19fX19fX19fX19fX18JCjB4MDAwMDAwMDEwMDAwMDAwMAkgICAgMzM1 OTkzICAgICAxMzEyICBfX19fX19fX19fX19fX19fX19fX19fX3JfX19fX19f X19fXwlyZXNlcnZlZAoweDAwMDAwMDIxMDAwMDAwMDAJICAgICAzNTYzNCAg ICAgIDEzOSAgX19fX19fX19fX19fX19fX19fX19fX19yX19fX09fX19fX18J cmVzZXJ2ZWQsb3duZXJfcHJpdmF0ZQoweDAwMDAwMDAwMDAwMTAwMDAJICAg ICA0NTQyOSAgICAgIDE3NyAgX19fX19fX19fX19fX19fX1RfX19fX19fX19f X19fX19fX18JY29tcG91bmRfdGFpbAoweDAwMDAwMDIwMDAwMDAwMDAJICAg ICAgMTM4OSAgICAgICAgNSAgX19fX19fX19fX19fX19fX19fX19fX19fX19f X09fX19fX18Jb3duZXJfcHJpdmF0ZQoweDAwMDAwMDA0MDAwMDAwMDEJICAg ICAgICAgNiAgICAgICAgMCAgTF9fX19fX19fX19fX19fX19fX19fX19fX2Rf X19fX19fX18JbG9ja2VkLG1hcHBlZHRvZGlzawoweDAwMDAwMDAwMDAwMDgw MDAJICAgICAgICAxMCAgICAgICAgMCAgX19fX19fX19fX19fX19fSF9fX19f X19fX19fX19fX19fX18JY29tcG91bmRfaGVhZAoweDAwMDAwMDAwMDAwMDAw MDQJICAgICAgICAgMSAgICAgICAgMCAgX19SX19fX19fX19fX19fX19fX19f X19fX19fX19fX19fX18JcmVmZXJlbmNlZAoweDAwMDAwMDA0MDAwMDAwMjEJ ICAgICAgICA2NCAgICAgICAgMCAgTF9fX19sX19fX19fX19fX19fX19fX19f X2RfX19fX19fX18JbG9ja2VkLGxydSxtYXBwZWR0b2Rpc2sKMHgwMDAxMDAw NDAwMDAwMDIxCSAgICAgICAgIDEgICAgICAgIDAgIExfX19fbF9fX19fX19f X19fX19fX19fX19kX19fX19JX19fCWxvY2tlZCxscnUsbWFwcGVkdG9kaXNr LHJlYWRhaGVhZAoweDAwMDAwMDA4MDAwMDAwMjQJICAgICAgIDE3MSAgICAg ICAgMCAgX19SX19sX19fX19fX19fX19fX19fX19fX19QX19fX19fX18JcmVm ZXJlbmNlZCxscnUscHJpdmF0ZQoweDAwMDAwMDA0MDAwMDAwMjgJICAgICAg NDA5MyAgICAgICAxNSAgX19fVV9sX19fX19fX19fX19fX19fX19fX2RfX19f X19fX18JdXB0b2RhdGUsbHJ1LG1hcHBlZHRvZGlzawoweDAwMDEwMDA0MDAw MDAwMjgJICAgICAgICA1OSAgICAgICAgMCAgX19fVV9sX19fX19fX19fX19f X19fX19fX2RfX19fX0lfX18JdXB0b2RhdGUsbHJ1LG1hcHBlZHRvZGlzayxy ZWFkYWhlYWQKMHgwMDAwMDAwMDAwMDAwMDI4CSAgICAgICAgIDEgICAgICAg IDAgIF9fX1VfbF9fX19fX19fX19fX19fX19fX19fX19fX19fX19fCXVwdG9k YXRlLGxydQoweDAwMDAwMDA0MDAwMDAwMmMJICAgODU5ODAzMiAgICAzMzU4 
NiAgX19SVV9sX19fX19fX19fX19fX19fX19fX2RfX19fX19fX18JcmVmZXJl bmNlZCx1cHRvZGF0ZSxscnUsbWFwcGVkdG9kaXNrCjB4MDAwMDAwMDgwMDAw MDAyYwkgICAgICAgIDEwICAgICAgICAwICBfX1JVX2xfX19fX19fX19fX19f X19fX19fX1BfX19fX19fXwlyZWZlcmVuY2VkLHVwdG9kYXRlLGxydSxwcml2 YXRlCjB4MDAwMDAwMDAwMDAwNDAzYwkgICAgICAgMTg1ICAgICAgICAwICBf X1JVRGxfX19fX19fX2JfX19fX19fX19fX19fX19fX19fXwlyZWZlcmVuY2Vk LHVwdG9kYXRlLGRpcnR5LGxydSxzd2FwYmFja2VkCjB4MDAwMDAwMDgwMDAw MDA2MAkgICAgICAgMTYzICAgICAgICAwICBfX19fX2xBX19fX19fX19fX19f X19fX19fX1BfX19fX19fXwlscnUsYWN0aXZlLHByaXZhdGUKMHgwMDAwMDAw ODAwMDAwMDY0CSAgICAgMzY3NDEgICAgICAxNDMgIF9fUl9fbEFfX19fX19f X19fX19fX19fX19fUF9fX19fX19fCXJlZmVyZW5jZWQsbHJ1LGFjdGl2ZSxw cml2YXRlCjB4MDAwMDAwMDQwMDAwMDA2OAkgICAgNTI3ODM0ICAgICAyMDYx ICBfX19VX2xBX19fX19fX19fX19fX19fX19fZF9fX19fX19fXwl1cHRvZGF0 ZSxscnUsYWN0aXZlLG1hcHBlZHRvZGlzawoweDAwMDAwMDA4MDAwMDAwNjgJ ICAgICAgIDY5NSAgICAgICAgMiAgX19fVV9sQV9fX19fX19fX19fX19fX19f X19QX19fX19fX18JdXB0b2RhdGUsbHJ1LGFjdGl2ZSxwcml2YXRlCjB4MDAw MDAwMGMwMDAwMDA2OAkgICAgICAgMTE2ICAgICAgICAwICBfX19VX2xBX19f X19fX19fX19fX19fX19fZFBfX19fX19fXwl1cHRvZGF0ZSxscnUsYWN0aXZl LG1hcHBlZHRvZGlzayxwcml2YXRlCjB4MDAwMDAwMDgwMDAwMDA2YwkgICA2 NTg0MDY2ICAgIDI1NzE5ICBfX1JVX2xBX19fX19fX19fX19fX19fX19fX1Bf X19fX19fXwlyZWZlcmVuY2VkLHVwdG9kYXRlLGxydSxhY3RpdmUscHJpdmF0 ZQoweDAwMDAwMDA0MDAwMDAwNmMJICAgMTMyNTI3MyAgICAgNTE3NiAgX19S VV9sQV9fX19fX19fX19fX19fX19fX2RfX19fX19fX18JcmVmZXJlbmNlZCx1 cHRvZGF0ZSxscnUsYWN0aXZlLG1hcHBlZHRvZGlzawoweDAwMDAwMDBjMDAw MDAwNmMJICAgICAgIDQzMSAgICAgICAgMSAgX19SVV9sQV9fX19fX19fX19f X19fX19fX2RQX19fX19fX18JcmVmZXJlbmNlZCx1cHRvZGF0ZSxscnUsYWN0 aXZlLG1hcHBlZHRvZGlzayxwcml2YXRlCjB4MDAwMDAwMDAwMDAwMDA2Ywkg ICAgICAgMTI4ICAgICAgICAwICBfX1JVX2xBX19fX19fX19fX19fX19fX19f X19fX19fX19fXwlyZWZlcmVuY2VkLHVwdG9kYXRlLGxydSxhY3RpdmUKMHgw MDAwMDAwMDAwMDA0MDc4CSAgICAgICAgNTYgICAgICAgIDAgIF9fX1VEbEFf X19fX19fYl9fX19fX19fX19fX19fX19fX19fCXVwdG9kYXRlLGRpcnR5LGxy dSxhY3RpdmUsc3dhcGJhY2tlZAoweDAwMDAwMDAwMDAwMDQwN2MJICAgICAg 
IDEyMiAgICAgICAgMCAgX19SVURsQV9fX19fX19iX19fX19fX19fX19fX19f X19fX18JcmVmZXJlbmNlZCx1cHRvZGF0ZSxkaXJ0eSxscnUsYWN0aXZlLHN3 YXBiYWNrZWQKMHgwMDAwMDAwODAwMDAwMDdjCSAgICAgICAgIDEgICAgICAg IDAgIF9fUlVEbEFfX19fX19fX19fX19fX19fX19fUF9fX19fX19fCXJlZmVy ZW5jZWQsdXB0b2RhdGUsZGlydHksbHJ1LGFjdGl2ZSxwcml2YXRlCjB4MDAw MDAwMDAwMDAwODA4MAkgICAgIDE0NTcxICAgICAgIDU2ICBfX19fX19fU19f X19fX19IX19fX19fX19fX19fX19fX19fXwlzbGFiLGNvbXBvdW5kX2hlYWQK MHgwMDAwMDAwMDAwMDAwMDgwCSAgICAyNTA1NDYgICAgICA5NzggIF9fX19f X19TX19fX19fX19fX19fX19fX19fX19fX19fX19fCXNsYWIKMHgwMDAwMDAw MDAwMDAwNDAwCSAgICAgMTQ3MDEgICAgICAgNTcgIF9fX19fX19fX19CX19f X19fX19fX19fX19fX19fX19fX19fCWJ1ZGR5CjB4MDAwMDAwMDAwMDAwMDgw MAkgICAgICAgIDE2ICAgICAgICAwICBfX19fX19fX19fX01fX19fX19fX19f X19fX19fX19fX19fXwltbWFwCjB4MDAwMDAwMDEwMDAwMDgwNAkgICAgICAg ICAxICAgICAgICAwICBfX1JfX19fX19fX01fX19fX19fX19fX3JfX19fX19f X19fXwlyZWZlcmVuY2VkLG1tYXAscmVzZXJ2ZWQKMHgwMDAwMDAwNjAwMDQw ODJjCSAgICAgICAzOTEgICAgICAgIDEgIF9fUlVfbF9fX19fTV9fX19fX3Vf X19fX21kX19fX19fX19fCXJlZmVyZW5jZWQsdXB0b2RhdGUsbHJ1LG1tYXAs dW5ldmljdGFibGUsbWxvY2tlZCxtYXBwZWR0b2Rpc2sKMHgwMDAwMDAwYTAw MDQwODJjCSAgICAgICAzMjEgICAgICAgIDEgIF9fUlVfbF9fX19fTV9fX19f X3VfX19fX21fUF9fX19fX19fCXJlZmVyZW5jZWQsdXB0b2RhdGUsbHJ1LG1t YXAsdW5ldmljdGFibGUsbWxvY2tlZCxwcml2YXRlCjB4MDAwMDAwMDAwMDAw NDgzOAkgICAgICA4Mzg1ICAgICAgIDMyICBfX19VRGxfX19fX01fX2JfX19f X19fX19fX19fX19fX19fXwl1cHRvZGF0ZSxkaXJ0eSxscnUsbW1hcCxzd2Fw YmFja2VkCjB4MDAwMDAwMDAwMDAwNDgzYwkgICAgICAyMDQ1ICAgICAgICA3 ICBfX1JVRGxfX19fX01fX2JfX19fX19fX19fX19fX19fX19fXwlyZWZlcmVu Y2VkLHVwdG9kYXRlLGRpcnR5LGxydSxtbWFwLHN3YXBiYWNrZWQKMHgwMDAw MDAwODAwMDAwODY4CSAgICAgICAgMTkgICAgICAgIDAgIF9fX1VfbEFfX19f TV9fX19fX19fX19fX19fUF9fX19fX19fCXVwdG9kYXRlLGxydSxhY3RpdmUs bW1hcCxwcml2YXRlCjB4MDAwMDAwMDQwMDAwMDg2OAkgICAgICAgICA1ICAg ICAgICAwICBfX19VX2xBX19fX01fX19fX19fX19fX19fZF9fX19fX19fXwl1 cHRvZGF0ZSxscnUsYWN0aXZlLG1tYXAsbWFwcGVkdG9kaXNrCjB4MDAwMDAw MDQwMDAwMDg2YwkgICAgICAxODkxICAgICAgICA3ICBfX1JVX2xBX19fX01f 
X19fX19fX19fX19fZF9fX19fX19fXwlyZWZlcmVuY2VkLHVwdG9kYXRlLGxy dSxhY3RpdmUsbW1hcCxtYXBwZWR0b2Rpc2sKMHgwMDAwMDAwODAwMDAwODZj CSAgICAgICAxMjYgICAgICAgIDAgIF9fUlVfbEFfX19fTV9fX19fX19fX19f X19fUF9fX19fX19fCXJlZmVyZW5jZWQsdXB0b2RhdGUsbHJ1LGFjdGl2ZSxt bWFwLHByaXZhdGUKMHgwMDAwMDAwMDAwMDA0ODc4CSAgICAgICAgODUgICAg ICAgIDAgIF9fX1VEbEFfX19fTV9fYl9fX19fX19fX19fX19fX19fX19fCXVw dG9kYXRlLGRpcnR5LGxydSxhY3RpdmUsbW1hcCxzd2FwYmFja2VkCjB4MDAw MDAwMDAwMDAwNDg3YwkgICAgICAyMjYzICAgICAgICA4ICBfX1JVRGxBX19f X01fX2JfX19fX19fX19fX19fX19fX19fXwlyZWZlcmVuY2VkLHVwdG9kYXRl LGRpcnR5LGxydSxhY3RpdmUsbW1hcCxzd2FwYmFja2VkCjB4MDAwMDAwMDAw MDAwNTAwOAkgICAgICAgICA0ICAgICAgICAwICBfX19VX19fX19fX19hX2Jf X19fX19fX19fX19fX19fX19fXwl1cHRvZGF0ZSxhbm9ueW1vdXMsc3dhcGJh Y2tlZAoweDAwMDAwMDAwMDAwMDU4MDgJICAgICAgICAyNSAgICAgICAgMCAg X19fVV9fX19fX19NYV9iX19fX19fX19fX19fX19fX19fX18JdXB0b2RhdGUs bW1hcCxhbm9ueW1vdXMsc3dhcGJhY2tlZAoweDAwMDAwMDAyMDAwNDU4MjgJ ICAgICAgICAgOCAgICAgICAgMCAgX19fVV9sX19fX19NYV9iX19fdV9fX19f bV9fX19fX19fX18JdXB0b2RhdGUsbHJ1LG1tYXAsYW5vbnltb3VzLHN3YXBi YWNrZWQsdW5ldmljdGFibGUsbWxvY2tlZAoweDAwMDAwMDAyMDAwNDU4MmMJ ICAgICAgIDY1MSAgICAgICAgMiAgX19SVV9sX19fX19NYV9iX19fdV9fX19f bV9fX19fX19fX18JcmVmZXJlbmNlZCx1cHRvZGF0ZSxscnUsbW1hcCxhbm9u eW1vdXMsc3dhcGJhY2tlZCx1bmV2aWN0YWJsZSxtbG9ja2VkCjB4MDAwMDAw MDAwMDAwNTg2OAkgICAgICA3NjIzICAgICAgIDI5ICBfX19VX2xBX19fX01h X2JfX19fX19fX19fX19fX19fX19fXwl1cHRvZGF0ZSxscnUsYWN0aXZlLG1t YXAsYW5vbnltb3VzLHN3YXBiYWNrZWQKMHgwMDAwMDAwMDAwMDA1ODZjCSAg ICAgICAgMzkgICAgICAgIDAgIF9fUlVfbEFfX19fTWFfYl9fX19fX19fX19f X19fX19fX19fCXJlZmVyZW5jZWQsdXB0b2RhdGUsbHJ1LGFjdGl2ZSxtbWFw LGFub255bW91cyxzd2FwYmFja2VkCiAgICAgICAgICAgICB0b3RhbAkgIDE3 OTIyMDQ4ICAgIDcwMDA4Cg== ---2140344373-809632473-1353491880=:11679-- -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx182.postini.com [74.125.245.182]) by kanga.kvack.org (Postfix) with SMTP id 1A28E6B0062 for ; Wed, 21 Nov 2012 05:00:28 -0500 (EST) Received: by mail-ia0-f169.google.com with SMTP id r4so6100731iaj.14 for ; Wed, 21 Nov 2012 02:00:27 -0800 (PST) Message-ID: <50ACA634.5000007@gmail.com> Date: Wed, 21 Nov 2012 18:00:20 +0800 From: Jaegeuk Hanse MIME-Version: 1.0 Subject: Re: Problem in Page Cache Replacement References: <1353433362.85184.YahooMailNeo@web141101.mail.bf1.yahoo.com> <20121120182500.GH1408@quack.suse.cz> <1353485020.53500.YahooMailNeo@web141104.mail.bf1.yahoo.com> <1353485630.17455.YahooMailNeo@web141106.mail.bf1.yahoo.com> <50AC9220.70202@gmail.com> <20121121090204.GA9064@localhost> <50ACA209.9000101@gmail.com> <1353491880.11679.YahooMailNeo@web141102.mail.bf1.yahoo.com> In-Reply-To: <1353491880.11679.YahooMailNeo@web141102.mail.bf1.yahoo.com> Content-Type: multipart/alternative; boundary="------------000701020503000208040701" Sender: owner-linux-mm@kvack.org List-ID: To: metin d Cc: Fengguang Wu , Jan Kara , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , =?UTF-8?B?TWV0aW4gRMO2xZ9sw7w=?= This is a multi-part message in MIME format. --------------000701020503000208040701 Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit On 11/21/2012 05:58 PM, metin d wrote: > Hi Fengguang, > > I run tests and attached the results. The line below I guess shows the > data-1 page caches. > > 0x000000080000006c 6584051 25718 > __RU_lA___________________P________ referenced,uptodate,lru,active,private I think this is just one of the states that page cache pages can be in.
> > Metin > > > ----- Original Message ----- > From: Jaegeuk Hanse > To: Fengguang Wu > Cc: metin d ; Jan Kara ; > "linux-kernel@vger.kernel.org" ; > "linux-mm@kvack.org" > Sent: Wednesday, November 21, 2012 11:42 AM > Subject: Re: Problem in Page Cache Replacement > > On 11/21/2012 05:02 PM, Fengguang Wu wrote: > > On Wed, Nov 21, 2012 at 04:34:40PM +0800, Jaegeuk Hanse wrote: > >> Cc Fengguang Wu. > >> > >> On 11/21/2012 04:13 PM, metin d wrote: > >>>> Curious. Added linux-mm list to CC to catch more attention. If > you run > >>>> echo 1 >/proc/sys/vm/drop_caches does it evict data-1 pages from > memory? > >>> I'm guessing it'd evict the entries, but am wondering if we could > run any more diagnostics before trying this. > >>> > >>> We regularly use a setup where we have two databases; one gets > used frequently and the other one about once a month. It seems like > the memory manager keeps unused pages in memory at the expense of > frequently used database's performance. > >>> My understanding was that under memory pressure from heavily > >>> accessed pages, unused pages would eventually get evicted. Is there > >>> anything else we can try on this host to understand why this is > >>> happening? > > We may debug it this way. > > > > 1) run 'fadvise data-2 0 0 dontneed' to drop data-2 cached pages > > (please double check via /proc/vmstat whether it does the > expected work) > > > > 2) run 'page-types -r' with root, to view the page status for the > > remaining pages of data-1 > > > > The fadvise tool comes from Andrew Morton's ext3-tools. (source code > attached) > > Please compile them with options "-Dlinux -I. -D_GNU_SOURCE > -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE" > > > > page-types can be found in the kernel source tree tools/vm/page-types.c > > > > Sorry that sounds a bit twisted.. I do have a patch to directly dump > > page cache status of a user specified file, however it's not > > upstreamed yet. 
> > Hi Fengguang, > > Thanks for you detail steps, I think metin can have a try. > > flags page-count MB symbolic-flags long-symbolic-flags > 0x0000000000000000 607699 2373 > ___________________________________ > 0x0000000100000000 343227 1340 > _______________________r___________ reserved > > But I have some questions of the print of page-type: > > Is 2373MB here mean total memory in used include page cache? I don't > think so. > Which kind of pages will be marked reserved? > Which line of long-symbolic-flags is for page cache? > > Regards, > Jaegeuk > > > > > Thanks, > > Fengguang > > > >>> On Tue 20-11-12 09:42:42, metin d wrote: > >>>> I have two PostgreSQL databases named data-1 and data-2 that sit > on the > >>>> same machine. Both databases keep 40 GB of data, and the total memory > >>>> available on the machine is 68GB. > >>>> > >>>> I started data-1 and data-2, and ran several queries to go over > all their > >>>> data. Then, I shut down data-1 and kept issuing queries against > data-2. > >>>> For some reason, the OS still holds on to large parts of data-1's > pages > >>>> in its page cache, and reserves about 35 GB of RAM to data-2's > files. As > >>>> a result, my queries on data-2 keep hitting disk. > >>>> > >>>> I'm checking page cache usage with fincore. When I run a table > scan query > >>>> against data-2, I see that data-2's pages get evicted and put > back into > >>>> the cache in a round-robin manner. Nothing happens to data-1's pages, > >>>> although they haven't been touched for days. > >>>> > >>>> Does anybody know why data-1's pages aren't evicted from the page > cache? > >>>> I'm open to all kind of suggestions you think it might relate to > problem. > >>> Curious. Added linux-mm list to CC to catch more attention. If > you run > >>> echo 1 >/proc/sys/vm/drop_caches > >>> does it evict data-1 pages from memory? > >>> > >>>> This is an EC2 m2.4xlarge instance on Amazon with 68 GB of RAM and no > >>>> swap space. 
The kernel version is: > >>>> > >>>> $ uname -r > >>>> 3.2.28-45.62.amzn1.x86_64 > >>>> Edit: > >>>> > >>>> and it seems that I use one NUMA instance, if you think that it > can a problem. > >>>> > >>>> $ numactl --hardware > >>>> available: 1 nodes (0) > >>>> node 0 cpus: 0 1 2 3 4 5 6 7 > >>>> node 0 size: 70007 MB > >>>> node 0 free: 360 MB > >>>> node distances: > >>>> node 0 > >>>> 0: 10 > --------------000701020503000208040701 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: 8bit
On 11/21/2012 05:58 PM, metin d wrote:
Hi Fengguang,

I ran the tests and attached the results. I guess the line below shows the data-1 page cache pages.

0x000000080000006c      6584051    25718  __RU_lA___________________P________    referenced,uptodate,lru,active,private

I think this is just one state of page cache pages.
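
The first column of the page-types line above is a bitmask, and the flag names follow from which bits are set. A minimal sketch of that decoding, assuming the bit positions listed below: they are consistent with the rows quoted in this thread (for example 0x0000000100000000 is shown as "reserved"), but the table is partial, and the authoritative list is in the kernel's pagemap documentation and tools/vm/page-types.c. The names FLAG_BITS and decode_flags are made up for this illustration.

```python
# Partial bit table, assumed from the page-types rows quoted in this thread.
# It is not the kernel's full list of flags.
FLAG_BITS = {
    2: "referenced",
    3: "uptodate",
    4: "dirty",
    5: "lru",
    6: "active",
    7: "slab",
    10: "buddy",
    11: "mmap",
    12: "anonymous",
    14: "swapbacked",
    18: "unevictable",
    32: "reserved",
    33: "mlocked",
    34: "mappedtodisk",
    35: "private",
    37: "owner_private",
}


def decode_flags(value):
    """Return the names of the known flags set in a page-types flags value."""
    return [name for bit, name in sorted(FLAG_BITS.items()) if (value >> bit) & 1]


# The row that dominates the attached output, and a row without "active".
print(",".join(decode_flags(0x000000080000006C)))
print(",".join(decode_flags(0x0000000400000028)))
```

Read this way, the 25718 MB row differs from the "referenced,uptodate,lru,mappedtodisk" rows in two bits only: "active" is set, and "private" is set in place of "mappedtodisk".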


Metin


----- Original Message -----
From: Jaegeuk Hanse <jaegeuk.hanse@gmail.com>
To: Fengguang Wu <fengguang.wu@intel.com>
Cc: metin d <metdos@yahoo.com>; Jan Kara <jack@suse.cz>; "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>; "linux-mm@kvack.org" <linux-mm@kvack.org>
Sent: Wednesday, November 21, 2012 11:42 AM
Subject: Re: Problem in Page Cache Replacement

On 11/21/2012 05:02 PM, Fengguang Wu wrote:
> On Wed, Nov 21, 2012 at 04:34:40PM +0800, Jaegeuk Hanse wrote:
>> Cc Fengguang Wu.
>>
>> On 11/21/2012 04:13 PM, metin d wrote:
>>>>   Curious. Added linux-mm list to CC to catch more attention. If you run
>>>> echo 1 >/proc/sys/vm/drop_caches does it evict data-1 pages from memory?
>>> I'm guessing it'd evict the entries, but am wondering if we could run any more diagnostics before trying this.
>>>
>>> We regularly use a setup where we have two databases; one gets used frequently and the other one about once a month. It seems like the memory manager keeps unused pages in memory at the expense of the frequently used database's performance.
>>> My understanding was that under memory pressure from heavily
>>> accessed pages, unused pages would eventually get evicted. Is there
>>> anything else we can try on this host to understand why this is
>>> happening?
> We may debug it this way.
>
> 1) run 'fadvise data-2 0 0 dontneed' to drop data-2 cached pages
>    (please double check via /proc/vmstat whether it does the expected work)
>
> 2) run 'page-types -r' with root, to view the page status for the
>    remaining pages of data-1
>
> The fadvise tool comes from Andrew Morton's ext3-tools. (source code attached)
> Please compile them with options "-Dlinux -I. -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE"
>
> page-types can be found in the kernel source tree tools/vm/page-types.c
>
> Sorry that sounds a bit twisted.. I do have a patch to directly dump
> page cache status of a user specified file, however it's not
> upstreamed yet.
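
For readers without the ext3-tools source at hand, step 1 above can be approximated with the posix_fadvise() system call. This is a minimal sketch, not the fadvise tool itself. It assumes that data-2 is a directory of regular files and that every file below it should be advised; the helper names drop_cached_pages and drop_cached_tree are hypothetical.

```python
import os


def drop_cached_pages(path):
    """Ask the kernel to drop the cached pages of one file (best effort)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # DONTNEED does not discard dirty pages, so flush them first.
        os.fsync(fd)
        # offset=0 and length=0 mean "from the start to the end of the file",
        # mirroring the "0 0" arguments of the fadvise command above.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)


def drop_cached_tree(root):
    """Apply drop_cached_pages to every regular file below root."""
    count = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Skip symlinks and special files; only plain files are advised.
            if os.path.isfile(full) and not os.path.islink(full):
                drop_cached_pages(full)
                count += 1
    return count
```

The call is only advice to the kernel, so the result should still be checked, for example with fincore or /proc/vmstat as suggested above.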

Hi Fengguang,

Thanks for your detailed steps; I think Metin can give them a try.

             flags  page-count       MB  symbolic-flags                       long-symbolic-flags
0x0000000000000000      607699     2373  ___________________________________
0x0000000100000000      343227     1340  _______________________r___________  reserved

But I have some questions about the page-types output:

Does 2373 MB here mean the total memory in use, including page cache? I don't
think so.
Which kinds of pages are marked reserved?
Which long-symbolic-flags lines correspond to page cache pages?
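
On the MB question, the numbers in the thread fit a simple reading: the MB column is the page count of that row multiplied by the page size, rounded down. The sketch below assumes 4 KiB pages (the usual x86_64 base page size); pages_to_mb is a name made up for this illustration.

```python
PAGE_SIZE = 4096  # assumed: 4 KiB base pages


def pages_to_mb(page_count, page_size=PAGE_SIZE):
    """Convert a page count to whole MiB, rounding down."""
    return page_count * page_size // (1024 * 1024)


# Page counts taken from rows quoted in this thread and its attachments.
for count in (607699, 343227, 6584051, 17922048):
    print(count, pages_to_mb(count))
```

Under that assumption each MB value covers only the pages in its own row, so 2373 MB is the size of the pages with that one flag combination and not the total memory in use. The "total" row of the attached output (17922048 pages, 70008 MB) fits the same arithmetic.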

Regards,
Jaegeuk

>
> Thanks,
> Fengguang
>
>>> On Tue 20-11-12 09:42:42, metin d wrote:
>>>> I have two PostgreSQL databases named data-1 and data-2 that sit on the
>>>> same machine. Both databases keep 40 GB of data, and the total memory
>>>> available on the machine is 68GB.
>>>>
>>>> I started data-1 and data-2, and ran several queries to go over all their
>>>> data. Then, I shut down data-1 and kept issuing queries against data-2.
>>>> For some reason, the OS still holds on to large parts of data-1's pages
>>>> in its page cache, and reserves about 35 GB of RAM to data-2's files. As
>>>> a result, my queries on data-2 keep hitting disk.
>>>>
>>>> I'm checking page cache usage with fincore. When I run a table scan query
>>>> against data-2, I see that data-2's pages get evicted and put back into
>>>> the cache in a round-robin manner. Nothing happens to data-1's pages,
>>>> although they haven't been touched for days.
>>>>
>>>> Does anybody know why data-1's pages aren't evicted from the page cache?
>>>> I'm open to any kind of suggestion you think might relate to the problem.
>>>   Curious. Added linux-mm list to CC to catch more attention. If you run
>>> echo 1 >/proc/sys/vm/drop_caches
>>>   does it evict data-1 pages from memory?
>>>
>>>> This is an EC2 m2.4xlarge instance on Amazon with 68 GB of RAM and no
>>>> swap space. The kernel version is:
>>>>
>>>> $ uname -r
>>>> 3.2.28-45.62.amzn1.x86_64
>>>> Edit:
>>>>
>>>> and it seems that I use one NUMA node, if you think that can be a problem.
>>>>
>>>> $ numactl --hardware
>>>> available: 1 nodes (0)
>>>> node 0 cpus: 0 1 2 3 4 5 6 7
>>>> node 0 size: 70007 MB
>>>> node 0 free: 360 MB
>>>> node distances:
>>>> node   0
>>>>   0:  10


--------------000701020503000208040701--
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx187.postini.com [74.125.245.187]) by kanga.kvack.org (Postfix) with SMTP id 4BA4B6B006C for ; Wed, 21 Nov 2012 05:00:28 -0500 (EST) References: <1353433362.85184.YahooMailNeo@web141101.mail.bf1.yahoo.com> <20121120182500.GH1408@quack.suse.cz> <1353485020.53500.YahooMailNeo@web141104.mail.bf1.yahoo.com> <1353485630.17455.YahooMailNeo@web141106.mail.bf1.yahoo.com> <50AC9220.70202@gmail.com> <20121121090204.GA9064@localhost> <50ACA209.9000101@gmail.com> Message-ID: <1353492026.13449.YahooMailNeo@web141102.mail.bf1.yahoo.com> Date: Wed, 21 Nov 2012 02:00:26 -0800 (PST) From: metin d Reply-To: metin d Subject: Re: Problem in Page Cache Replacement In-Reply-To: <50ACA209.9000101@gmail.com> MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="-2140344373-170584175-1353492026=:13449" Sender: owner-linux-mm@kvack.org List-ID: To: Jaegeuk Hanse , Fengguang Wu Cc: Jan Kara , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , =?utf-8?B?TWV0aW4gRMO2xZ9sw7w=?= ---2140344373-170584175-1353492026=:13449 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable =0A=0AHi Fengguang,=0A=0AI run tests and attached the results.
The line bel= ow I guess shows the data-1 page caches.=0A=0A0x000000080000006c=C2=A0=C2= =A0=C2=A0 =C2=A0=C2=A0 6584051=C2=A0=C2=A0=C2=A0 25718=C2=A0 __RU_lA_______= ____________P________=C2=A0=C2=A0=C2=A0 referenced,uptodate,lru,active,priv= ate=0AMetin=0A=0A=0A________________________________=0AFrom: Jaegeuk Hanse = =0ATo: Fengguang Wu =0ACc= : metin d ; Jan Kara ; "linux-kernel@vger.k= ernel.org" ; "linux-mm@kvack.org" =0ASent: Wednesday, November 21, 2012 11:42 AM=0ASubject: Re: Pro= blem in Page Cache Replacement=0A=0AOn 11/21/2012 05:02 PM, Fengguang Wu wr= ote:=0A> On Wed, Nov 21, 2012 at 04:34:40PM +0800, Jaegeuk Hanse wrote:=0A>= > Cc Fengguang Wu.=0A>>=0A>> On 11/21/2012 04:13 PM, metin d wrote:=0A>>>>= =C2=A0 =C2=A0 Curious. Added linux-mm list to CC to catch more attention. I= f you run=0A>>>> echo 1 >/proc/sys/vm/drop_caches does it evict data-1 page= s from memory?=0A>>> I'm guessing it'd evict the entries, but am wondering = if we could run any more diagnostics before trying this.=0A>>>=0A>>> We reg= ularly use a setup where we have two databases; one gets used frequently an= d the other one about once a month. It seems like the memory manager keeps = unused pages in memory at the expense of frequently used database's perform= ance.=0A>>> My understanding was that under memory pressure from heavily=0A= >>> accessed pages, unused pages would eventually get evicted. Is there=0A>= >> anything else we can try on this host to understand why this is=0A>>> ha= ppening?=0A> We may debug it this way.=0A>=0A> 1) run 'fadvise data-2 0 0 d= ontneed' to drop data-2 cached pages=0A>=C2=A0 =C2=A0=C2=A0=C2=A0(please do= uble check via /proc/vmstat whether it does the expected work)=0A>=0A> 2) r= un 'page-types -r' with root, to view the page status for the=0A>=C2=A0 =C2= =A0=C2=A0=C2=A0remaining pages of data-1=0A>=0A> The fadvise tool comes fro= m Andrew Morton's ext3-tools. (source code attached)=0A> Please compile the= m with options "-Dlinux -I. 
-D_GNU_SOURCE -D_FILE_OFFSET_BITS=3D64 -D_LARGE= FILE64_SOURCE"=0A>=0A> page-types can be found in the kernel source tree to= ols/vm/page-types.c=0A>=0A> Sorry that sounds a bit twisted.. I do have a p= atch to directly dump=0A> page cache status of a user specified file, howev= er it's not=0A> upstreamed yet.=0A=0AHi Fengguang,=0A=0AThanks for you deta= il steps, I think metin can have a try.=0A=0A=C2=A0 =C2=A0 =C2=A0 =C2=A0=C2= =A0=C2=A0flags=C2=A0 =C2=A0 page-count=C2=A0 =C2=A0 =C2=A0=C2=A0=C2=A0MB=C2= =A0 symbolic-flags long-symbolic-flags=0A0x0000000000000000=C2=A0 =C2=A0 = =C2=A0 =C2=A0 607699=C2=A0 =C2=A0=C2=A0=C2=A02373 =0A______________________= _____________=0A0x0000000100000000=C2=A0 =C2=A0 =C2=A0 =C2=A0 343227=C2=A0 = =C2=A0=C2=A0=C2=A01340 =0A_______________________r___________=C2=A0 =C2=A0 = reserved=0A=0ABut I have some questions of the print of page-type:=0A=0AIs = 2373MB here mean total memory in used include page cache? I don't =0Athink = so.=0AWhich kind of pages will be marked reserved?=0AWhich line of long-sym= bolic-flags is for page cache?=0A=0ARegards,=0AJaegeuk=0A=0A>=0A> Thanks,= =0A> Fengguang=0A>=0A>>> On Tue 20-11-12 09:42:42, metin d wrote:=0A>>>> I = have two PostgreSQL databases named data-1 and data-2 that sit on the=0A>>>= > same machine. Both databases keep 40 GB of data, and the total memory=0A>= >>> available on the machine is 68GB.=0A>>>>=0A>>>> I started data-1 and da= ta-2, and ran several queries to go over all their=0A>>>> data. Then, I shu= t down data-1 and kept issuing queries against data-2.=0A>>>> For some reas= on, the OS still holds on to large parts of data-1's pages=0A>>>> in its pa= ge cache, and reserves about 35 GB of RAM to data-2's files. As=0A>>>> a re= sult, my queries on data-2 keep hitting disk.=0A>>>>=0A>>>> I'm checking pa= ge cache usage with fincore. When I run a table scan query=0A>>>> against d= ata-2, I see that data-2's pages get evicted and put back into=0A>>>> the c= ache in a round-robin manner. 
Nothing happens to data-1's pages,=0A>>>> alt= hough they haven't been touched for days.=0A>>>>=0A>>>> Does anybody know w= hy data-1's pages aren't evicted from the page cache?=0A>>>> I'm open to al= l kind of suggestions you think it might relate to problem.=0A>>>=C2=A0 =C2= =A0 Curious. Added linux-mm list to CC to catch more attention. If you run= =0A>>> echo 1 >/proc/sys/vm/drop_caches=0A>>>=C2=A0 =C2=A0 does it evict da= ta-1 pages from memory?=0A>>>=0A>>>> This is an EC2 m2.4xlarge instance on = Amazon with 68 GB of RAM and no=0A>>>> swap space. The kernel version is:= =0A>>>>=0A>>>> $ uname -r=0A>>>> 3.2.28-45.62.amzn1.x86_64=0A>>>> Edit:=0A>= >>>=0A>>>> and it seems that I use one NUMA instance, if=C2=A0 you think th= at it can a problem.=0A>>>>=0A>>>> $ numactl --hardware=0A>>>> available: 1= nodes (0)=0A>>>> node 0 cpus: 0 1 2 3 4 5 6 7=0A>>>> node 0 size: 70007 MB= =0A>>>> node 0 free: 360 MB=0A>>>> node distances:=0A>>>> node=C2=A0=C2=A0= =C2=A00=0A>>>>=C2=A0 =C2=A0=C2=A0=C2=A00:=C2=A0 10 ---2140344373-170584175-1353492026=:13449 Content-Type: text/plain; name="page-types_after.txt" Content-Transfer-Encoding: base64 Content-Disposition: attachment; filename="page-types_after.txt" ICAgICAgICAgICAgIGZsYWdzCXBhZ2UtY291bnQgICAgICAgTUIgIHN5bWJv bGljLWZsYWdzCQkJbG9uZy1zeW1ib2xpYy1mbGFncwoweDAwMDAwMDAwMDAw MDAwMDAJICAgNTUwODMxNyAgICAyMTUxNiAgX19fX19fX19fX19fX19fX19f X19fX19fX19fX19fX19fX18JCjB4MDAwMDAwMDEwMDAwMDAwMAkgICAgMzM1 OTkzICAgICAxMzEyICBfX19fX19fX19fX19fX19fX19fX19fX3JfX19fX19f X19fXwlyZXNlcnZlZAoweDAwMDAwMDIxMDAwMDAwMDAJICAgICAzNTYzNCAg ICAgIDEzOSAgX19fX19fX19fX19fX19fX19fX19fX19yX19fX09fX19fX18J cmVzZXJ2ZWQsb3duZXJfcHJpdmF0ZQoweDAwMDAwMDAwMDAwMTAwMDAJICAg ICA0NTA2OSAgICAgIDE3NiAgX19fX19fX19fX19fX19fX1RfX19fX19fX19f X19fX19fX18JY29tcG91bmRfdGFpbAoweDAwMDAwMDIwMDAwMDAwMDAJICAg ICAgMTUxNiAgICAgICAgNSAgX19fX19fX19fX19fX19fX19fX19fX19fX19f X09fX19fX18Jb3duZXJfcHJpdmF0ZQoweDAwMDAwMDA4MDAwMDAwMDQJICAg 
ICAgICAgMSAgICAgICAgMCAgX19SX19fX19fX19fX19fX19fX19fX19fX19Q X19fX19fX18JcmVmZXJlbmNlZCxwcml2YXRlCjB4MDAwMDAwMDAwMDAwODAw MAkgICAgICAgIDEwICAgICAgICAwICBfX19fX19fX19fX19fX19IX19fX19f X19fX19fX19fX19fXwljb21wb3VuZF9oZWFkCjB4MDAwMDAwMDAwMDAwMDAw NAkgICAgICAgICAxICAgICAgICAwICBfX1JfX19fX19fX19fX19fX19fX19f X19fX19fX19fX19fXwlyZWZlcmVuY2VkCjB4MDAwMDAwMDgwMDAwMDAyNAkg ICAgICAgMTY2ICAgICAgICAwICBfX1JfX2xfX19fX19fX19fX19fX19fX19f X1BfX19fX19fXwlyZWZlcmVuY2VkLGxydSxwcml2YXRlCjB4MDAwMDAwMDQw MDAwMDAyOAkgICAgICAgMjk1ICAgICAgICAxICBfX19VX2xfX19fX19fX19f X19fX19fX19fZF9fX19fX19fXwl1cHRvZGF0ZSxscnUsbWFwcGVkdG9kaXNr CjB4MDAwMTAwMDQwMDAwMDAyOAkgICAgICAgICAzICAgICAgICAwICBfX19V X2xfX19fX19fX19fX19fX19fX19fZF9fX19fSV9fXwl1cHRvZGF0ZSxscnUs bWFwcGVkdG9kaXNrLHJlYWRhaGVhZAoweDAwMDAwMDAwMDAwMDAwMjgJICAg ICAgICAgMSAgICAgICAgMCAgX19fVV9sX19fX19fX19fX19fX19fX19fX19f X19fX19fX18JdXB0b2RhdGUsbHJ1CjB4MDAwMDAwMDQwMDAwMDAyYwkgICAg MjYyMTQ0ICAgICAxMDI0ICBfX1JVX2xfX19fX19fX19fX19fX19fX19fZF9f X19fX19fXwlyZWZlcmVuY2VkLHVwdG9kYXRlLGxydSxtYXBwZWR0b2Rpc2sK MHgwMDAwMDAwODAwMDAwMDJjCSAgICAgICAgIDUgICAgICAgIDAgIF9fUlVf bF9fX19fX19fX19fX19fX19fX19fUF9fX19fX19fCXJlZmVyZW5jZWQsdXB0 b2RhdGUsbHJ1LHByaXZhdGUKMHgwMDAwMDAwMDAwMDA0MDNjCSAgICAgICAx ODUgICAgICAgIDAgIF9fUlVEbF9fX19fX19fYl9fX19fX19fX19fX19fX19f X19fCXJlZmVyZW5jZWQsdXB0b2RhdGUsZGlydHksbHJ1LHN3YXBiYWNrZWQK MHgwMDAwMDAwODAwMDAwMDYwCSAgICAgICAxNjMgICAgICAgIDAgIF9fX19f bEFfX19fX19fX19fX19fX19fX19fUF9fX19fX19fCWxydSxhY3RpdmUscHJp dmF0ZQoweDAwMDAwMDA4MDAwMDAwNjQJICAgICAzNjczOSAgICAgIDE0MyAg X19SX19sQV9fX19fX19fX19fX19fX19fX19QX19fX19fX18JcmVmZXJlbmNl ZCxscnUsYWN0aXZlLHByaXZhdGUKMHgwMDAwMDAwNDAwMDAwMDY4CSAgICA1 Mjc4MTAgICAgIDIwNjEgIF9fX1VfbEFfX19fX19fX19fX19fX19fX19kX19f X19fX19fCXVwdG9kYXRlLGxydSxhY3RpdmUsbWFwcGVkdG9kaXNrCjB4MDAw MDAwMDgwMDAwMDA2OAkgICAgICAgNTc2ICAgICAgICAyICBfX19VX2xBX19f X19fX19fX19fX19fX19fX1BfX19fX19fXwl1cHRvZGF0ZSxscnUsYWN0aXZl LHByaXZhdGUKMHgwMDAwMDAwYzAwMDAwMDY4CSAgICAgICAxMTYgICAgICAg 
IDAgIF9fX1VfbEFfX19fX19fX19fX19fX19fX19kUF9fX19fX19fCXVwdG9k YXRlLGxydSxhY3RpdmUsbWFwcGVkdG9kaXNrLHByaXZhdGUKMHgwMDAwMDAw ODAwMDAwMDZjCSAgIDY1ODQwNTEgICAgMjU3MTggIF9fUlVfbEFfX19fX19f X19fX19fX19fX19fUF9fX19fX19fCXJlZmVyZW5jZWQsdXB0b2RhdGUsbHJ1 LGFjdGl2ZSxwcml2YXRlCjB4MDAwMDAwMDQwMDAwMDA2YwkgICAxMzAyMjEx ICAgICA1MDg2ICBfX1JVX2xBX19fX19fX19fX19fX19fX19fZF9fX19fX19f XwlyZWZlcmVuY2VkLHVwdG9kYXRlLGxydSxhY3RpdmUsbWFwcGVkdG9kaXNr CjB4MDAwMDAwMGMwMDAwMDA2YwkgICAgICAgNDMxICAgICAgICAxICBfX1JV X2xBX19fX19fX19fX19fX19fX19fZFBfX19fX19fXwlyZWZlcmVuY2VkLHVw dG9kYXRlLGxydSxhY3RpdmUsbWFwcGVkdG9kaXNrLHByaXZhdGUKMHgwMDAw MDAwMDAwMDAwMDZjCSAgICAgICAxMjggICAgICAgIDAgIF9fUlVfbEFfX19f X19fX19fX19fX19fX19fX19fX19fX19fCXJlZmVyZW5jZWQsdXB0b2RhdGUs bHJ1LGFjdGl2ZQoweDAwMDAwMDA4MDAwMDAwNzQJICAgICAgICAgMiAgICAg ICAgMCAgX19SX0RsQV9fX19fX19fX19fX19fX19fX19QX19fX19fX18JcmVm ZXJlbmNlZCxkaXJ0eSxscnUsYWN0aXZlLHByaXZhdGUKMHgwMDAwMDAwMDAw MDA0MDc4CSAgICAgICAgNTYgICAgICAgIDAgIF9fX1VEbEFfX19fX19fYl9f X19fX19fX19fX19fX19fX19fCXVwdG9kYXRlLGRpcnR5LGxydSxhY3RpdmUs c3dhcGJhY2tlZAoweDAwMDAwMDAwMDAwMDQwN2MJICAgICAgIDEyMiAgICAg ICAgMCAgX19SVURsQV9fX19fX19iX19fX19fX19fX19fX19fX19fX18JcmVm ZXJlbmNlZCx1cHRvZGF0ZSxkaXJ0eSxscnUsYWN0aXZlLHN3YXBiYWNrZWQK MHgwMDAwMDAwODAwMDAwMDdjCSAgICAgICAgIDEgICAgICAgIDAgIF9fUlVE bEFfX19fX19fX19fX19fX19fX19fUF9fX19fX19fCXJlZmVyZW5jZWQsdXB0 b2RhdGUsZGlydHksbHJ1LGFjdGl2ZSxwcml2YXRlCjB4MDAwMDAwMDAwMDAw ODA4MAkgICAgIDE0NDk1ICAgICAgIDU2ICBfX19fX19fU19fX19fX19IX19f X19fX19fX19fX19fX19fXwlzbGFiLGNvbXBvdW5kX2hlYWQKMHgwMDAwMDAw MDAwMDAwMDgwCSAgICAyNTA0OTggICAgICA5NzggIF9fX19fX19TX19fX19f X19fX19fX19fX19fX19fX19fX19fCXNsYWIKMHgwMDAwMDAwMDAwMDAwNDAw CSAgIDI5OTA5MDggICAgMTE2ODMgIF9fX19fX19fX19CX19fX19fX19fX19f X19fX19fX19fX19fCWJ1ZGR5CjB4MDAwMDAwMDAwMDAwMDgwMAkgICAgICAg IDE2ICAgICAgICAwICBfX19fX19fX19fX01fX19fX19fX19fX19fX19fX19f X19fXwltbWFwCjB4MDAwMDAwMDEwMDAwMDgwNAkgICAgICAgICAxICAgICAg ICAwICBfX1JfX19fX19fX01fX19fX19fX19fX3JfX19fX19fX19fXwlyZWZl 
cmVuY2VkLG1tYXAscmVzZXJ2ZWQKMHgwMDAwMDAwNjAwMDQwODJjCSAgICAg ICAzOTEgICAgICAgIDEgIF9fUlVfbF9fX19fTV9fX19fX3VfX19fX21kX19f X19fX19fCXJlZmVyZW5jZWQsdXB0b2RhdGUsbHJ1LG1tYXAsdW5ldmljdGFi bGUsbWxvY2tlZCxtYXBwZWR0b2Rpc2sKMHgwMDAwMDAwYTAwMDQwODJjCSAg ICAgICAzMjEgICAgICAgIDEgIF9fUlVfbF9fX19fTV9fX19fX3VfX19fX21f UF9fX19fX19fCXJlZmVyZW5jZWQsdXB0b2RhdGUsbHJ1LG1tYXAsdW5ldmlj dGFibGUsbWxvY2tlZCxwcml2YXRlCjB4MDAwMDAwMDAwMDAwNDgzOAkgICAg ICA4NDUwICAgICAgIDMzICBfX19VRGxfX19fX01fX2JfX19fX19fX19fX19f X19fX19fXwl1cHRvZGF0ZSxkaXJ0eSxscnUsbW1hcCxzd2FwYmFja2VkCjB4 MDAwMDAwMDAwMDAwNDgzYwkgICAgICAyMDQ1ICAgICAgICA3ICBfX1JVRGxf X19fX01fX2JfX19fX19fX19fX19fX19fX19fXwlyZWZlcmVuY2VkLHVwdG9k YXRlLGRpcnR5LGxydSxtbWFwLHN3YXBiYWNrZWQKMHgwMDAwMDAwODAwMDAw ODY4CSAgICAgICAgMTkgICAgICAgIDAgIF9fX1VfbEFfX19fTV9fX19fX19f X19fX19fUF9fX19fX19fCXVwdG9kYXRlLGxydSxhY3RpdmUsbW1hcCxwcml2 YXRlCjB4MDAwMDAwMDQwMDAwMDg2OAkgICAgICAgICA1ICAgICAgICAwICBf X19VX2xBX19fX01fX19fX19fX19fX19fZF9fX19fX19fXwl1cHRvZGF0ZSxs cnUsYWN0aXZlLG1tYXAsbWFwcGVkdG9kaXNrCjB4MDAwMDAwMDQwMDAwMDg2 YwkgICAgICAxODkxICAgICAgICA3ICBfX1JVX2xBX19fX01fX19fX19fX19f X19fZF9fX19fX19fXwlyZWZlcmVuY2VkLHVwdG9kYXRlLGxydSxhY3RpdmUs bW1hcCxtYXBwZWR0b2Rpc2sKMHgwMDAwMDAwODAwMDAwODZjCSAgICAgICAx MjYgICAgICAgIDAgIF9fUlVfbEFfX19fTV9fX19fX19fX19fX19fUF9fX19f X19fCXJlZmVyZW5jZWQsdXB0b2RhdGUsbHJ1LGFjdGl2ZSxtbWFwLHByaXZh dGUKMHgwMDAwMDAwMDAwMDA0ODc4CSAgICAgICAgODUgICAgICAgIDAgIF9f X1VEbEFfX19fTV9fYl9fX19fX19fX19fX19fX19fX19fCXVwdG9kYXRlLGRp cnR5LGxydSxhY3RpdmUsbW1hcCxzd2FwYmFja2VkCjB4MDAwMDAwMDAwMDAw NDg3YwkgICAgICAyMjYzICAgICAgICA4ICBfX1JVRGxBX19fX01fX2JfX19f X19fX19fX19fX19fX19fXwlyZWZlcmVuY2VkLHVwdG9kYXRlLGRpcnR5LGxy dSxhY3RpdmUsbW1hcCxzd2FwYmFja2VkCjB4MDAwMDAwMDAwMDAwNTAwOAkg ICAgICAgIDEzICAgICAgICAwICBfX19VX19fX19fX19hX2JfX19fX19fX19f X19fX19fX19fXwl1cHRvZGF0ZSxhbm9ueW1vdXMsc3dhcGJhY2tlZAoweDAw MDAwMDAwMDAwMDU4MDgJICAgICAgICAxNiAgICAgICAgMCAgX19fVV9fX19f X19NYV9iX19fX19fX19fX19fX19fX19fX18JdXB0b2RhdGUsbW1hcCxhbm9u 
eW1vdXMsc3dhcGJhY2tlZAoweDAwMDAwMDAyMDAwNDU4MjgJICAgICAgICAg OCAgICAgICAgMCAgX19fVV9sX19fX19NYV9iX19fdV9fX19fbV9fX19fX19f X18JdXB0b2RhdGUsbHJ1LG1tYXAsYW5vbnltb3VzLHN3YXBiYWNrZWQsdW5l dmljdGFibGUsbWxvY2tlZAoweDAwMDAwMDAyMDAwNDU4MmMJICAgICAgIDY1 MSAgICAgICAgMiAgX19SVV9sX19fX19NYV9iX19fdV9fX19fbV9fX19fX19f X18JcmVmZXJlbmNlZCx1cHRvZGF0ZSxscnUsbW1hcCxhbm9ueW1vdXMsc3dh cGJhY2tlZCx1bmV2aWN0YWJsZSxtbG9ja2VkCjB4MDAwMDAwMDAwMDAwNTg2 OAkgICAgICA4MDU4ICAgICAgIDMxICBfX19VX2xBX19fX01hX2JfX19fX19f X19fX19fX19fX19fXwl1cHRvZGF0ZSxscnUsYWN0aXZlLG1tYXAsYW5vbnlt b3VzLHN3YXBiYWNrZWQKMHgwMDAwMDAwMDAwMDA1ODZjCSAgICAgICAgNDIg ICAgICAgIDAgIF9fUlVfbEFfX19fTWFfYl9fX19fX19fX19fX19fX19fX19f CXJlZmVyZW5jZWQsdXB0b2RhdGUsbHJ1LGFjdGl2ZSxtbWFwLGFub255bW91 cyxzd2FwYmFja2VkCiAgICAgICAgICAgICB0b3RhbAkgIDE3OTIyMDQ4ICAg IDcwMDA4Cgo= ---2140344373-170584175-1353492026=:13449 Content-Type: text/plain; name="page-types_before.txt" Content-Transfer-Encoding: base64 Content-Disposition: attachment; filename="page-types_before.txt" ICAgICAgICAgICAgIGZsYWdzCXBhZ2UtY291bnQgICAgICAgTUIgIHN5bWJv bGljLWZsYWdzCQkJbG9uZy1zeW1ib2xpYy1mbGFncwoweDAwMDAwMDAwMDAw MDAwMDAJICAgIDEyMTYyOCAgICAgIDQ3NSAgX19fX19fX19fX19fX19fX19f X19fX19fX19fX19fX19fX18JCjB4MDAwMDAwMDEwMDAwMDAwMAkgICAgMzM1 OTkzICAgICAxMzEyICBfX19fX19fX19fX19fX19fX19fX19fX3JfX19fX19f X19fXwlyZXNlcnZlZAoweDAwMDAwMDIxMDAwMDAwMDAJICAgICAzNTYzNCAg ICAgIDEzOSAgX19fX19fX19fX19fX19fX19fX19fX19yX19fX09fX19fX18J cmVzZXJ2ZWQsb3duZXJfcHJpdmF0ZQoweDAwMDAwMDAwMDAwMTAwMDAJICAg ICA0NTQyOSAgICAgIDE3NyAgX19fX19fX19fX19fX19fX1RfX19fX19fX19f X19fX19fX18JY29tcG91bmRfdGFpbAoweDAwMDAwMDIwMDAwMDAwMDAJICAg ICAgMTM4OSAgICAgICAgNSAgX19fX19fX19fX19fX19fX19fX19fX19fX19f X09fX19fX18Jb3duZXJfcHJpdmF0ZQoweDAwMDAwMDA0MDAwMDAwMDEJICAg ICAgICAgNiAgICAgICAgMCAgTF9fX19fX19fX19fX19fX19fX19fX19fX2Rf X19fX19fX18JbG9ja2VkLG1hcHBlZHRvZGlzawoweDAwMDAwMDAwMDAwMDgw MDAJICAgICAgICAxMCAgICAgICAgMCAgX19fX19fX19fX19fX19fSF9fX19f X19fX19fX19fX19fX18JY29tcG91bmRfaGVhZAoweDAwMDAwMDAwMDAwMDAw 
MDQJICAgICAgICAgMSAgICAgICAgMCAgX19SX19fX19fX19fX19fX19fX19f X19fX19fX19fX19fX18JcmVmZXJlbmNlZAoweDAwMDAwMDA0MDAwMDAwMjEJ ICAgICAgICA2NCAgICAgICAgMCAgTF9fX19sX19fX19fX19fX19fX19fX19f X2RfX19fX19fX18JbG9ja2VkLGxydSxtYXBwZWR0b2Rpc2sKMHgwMDAxMDAw NDAwMDAwMDIxCSAgICAgICAgIDEgICAgICAgIDAgIExfX19fbF9fX19fX19f X19fX19fX19fX19kX19fX19JX19fCWxvY2tlZCxscnUsbWFwcGVkdG9kaXNr LHJlYWRhaGVhZAoweDAwMDAwMDA4MDAwMDAwMjQJICAgICAgIDE3MSAgICAg ICAgMCAgX19SX19sX19fX19fX19fX19fX19fX19fX19QX19fX19fX18JcmVm ZXJlbmNlZCxscnUscHJpdmF0ZQoweDAwMDAwMDA0MDAwMDAwMjgJICAgICAg NDA5MyAgICAgICAxNSAgX19fVV9sX19fX19fX19fX19fX19fX19fX2RfX19f X19fX18JdXB0b2RhdGUsbHJ1LG1hcHBlZHRvZGlzawoweDAwMDEwMDA0MDAw MDAwMjgJICAgICAgICA1OSAgICAgICAgMCAgX19fVV9sX19fX19fX19fX19f X19fX19fX2RfX19fX0lfX18JdXB0b2RhdGUsbHJ1LG1hcHBlZHRvZGlzayxy ZWFkYWhlYWQKMHgwMDAwMDAwMDAwMDAwMDI4CSAgICAgICAgIDEgICAgICAg IDAgIF9fX1VfbF9fX19fX19fX19fX19fX19fX19fX19fX19fX19fCXVwdG9k YXRlLGxydQoweDAwMDAwMDA0MDAwMDAwMmMJICAgODU5ODAzMiAgICAzMzU4 NiAgX19SVV9sX19fX19fX19fX19fX19fX19fX2RfX19fX19fX18JcmVmZXJl bmNlZCx1cHRvZGF0ZSxscnUsbWFwcGVkdG9kaXNrCjB4MDAwMDAwMDgwMDAw MDAyYwkgICAgICAgIDEwICAgICAgICAwICBfX1JVX2xfX19fX19fX19fX19f X19fX19fX1BfX19fX19fXwlyZWZlcmVuY2VkLHVwdG9kYXRlLGxydSxwcml2 YXRlCjB4MDAwMDAwMDAwMDAwNDAzYwkgICAgICAgMTg1ICAgICAgICAwICBf X1JVRGxfX19fX19fX2JfX19fX19fX19fX19fX19fX19fXwlyZWZlcmVuY2Vk LHVwdG9kYXRlLGRpcnR5LGxydSxzd2FwYmFja2VkCjB4MDAwMDAwMDgwMDAw MDA2MAkgICAgICAgMTYzICAgICAgICAwICBfX19fX2xBX19fX19fX19fX19f X19fX19fX1BfX19fX19fXwlscnUsYWN0aXZlLHByaXZhdGUKMHgwMDAwMDAw ODAwMDAwMDY0CSAgICAgMzY3NDEgICAgICAxNDMgIF9fUl9fbEFfX19fX19f X19fX19fX19fX19fUF9fX19fX19fCXJlZmVyZW5jZWQsbHJ1LGFjdGl2ZSxw cml2YXRlCjB4MDAwMDAwMDQwMDAwMDA2OAkgICAgNTI3ODM0ICAgICAyMDYx ICBfX19VX2xBX19fX19fX19fX19fX19fX19fZF9fX19fX19fXwl1cHRvZGF0 ZSxscnUsYWN0aXZlLG1hcHBlZHRvZGlzawoweDAwMDAwMDA4MDAwMDAwNjgJ ICAgICAgIDY5NSAgICAgICAgMiAgX19fVV9sQV9fX19fX19fX19fX19fX19f X19QX19fX19fX18JdXB0b2RhdGUsbHJ1LGFjdGl2ZSxwcml2YXRlCjB4MDAw 
MDAwMGMwMDAwMDA2OAkgICAgICAgMTE2ICAgICAgICAwICBfX19VX2xBX19f X19fX19fX19fX19fX19fZFBfX19fX19fXwl1cHRvZGF0ZSxscnUsYWN0aXZl LG1hcHBlZHRvZGlzayxwcml2YXRlCjB4MDAwMDAwMDgwMDAwMDA2YwkgICA2 NTg0MDY2ICAgIDI1NzE5ICBfX1JVX2xBX19fX19fX19fX19fX19fX19fX1Bf X19fX19fXwlyZWZlcmVuY2VkLHVwdG9kYXRlLGxydSxhY3RpdmUscHJpdmF0 ZQoweDAwMDAwMDA0MDAwMDAwNmMJICAgMTMyNTI3MyAgICAgNTE3NiAgX19S VV9sQV9fX19fX19fX19fX19fX19fX2RfX19fX19fX18JcmVmZXJlbmNlZCx1 cHRvZGF0ZSxscnUsYWN0aXZlLG1hcHBlZHRvZGlzawoweDAwMDAwMDBjMDAw MDAwNmMJICAgICAgIDQzMSAgICAgICAgMSAgX19SVV9sQV9fX19fX19fX19f X19fX19fX2RQX19fX19fX18JcmVmZXJlbmNlZCx1cHRvZGF0ZSxscnUsYWN0 aXZlLG1hcHBlZHRvZGlzayxwcml2YXRlCjB4MDAwMDAwMDAwMDAwMDA2Ywkg ICAgICAgMTI4ICAgICAgICAwICBfX1JVX2xBX19fX19fX19fX19fX19fX19f X19fX19fX19fXwlyZWZlcmVuY2VkLHVwdG9kYXRlLGxydSxhY3RpdmUKMHgw MDAwMDAwMDAwMDA0MDc4CSAgICAgICAgNTYgICAgICAgIDAgIF9fX1VEbEFf X19fX19fYl9fX19fX19fX19fX19fX19fX19fCXVwdG9kYXRlLGRpcnR5LGxy dSxhY3RpdmUsc3dhcGJhY2tlZAoweDAwMDAwMDAwMDAwMDQwN2MJICAgICAg IDEyMiAgICAgICAgMCAgX19SVURsQV9fX19fX19iX19fX19fX19fX19fX19f X19fX18JcmVmZXJlbmNlZCx1cHRvZGF0ZSxkaXJ0eSxscnUsYWN0aXZlLHN3 YXBiYWNrZWQKMHgwMDAwMDAwODAwMDAwMDdjCSAgICAgICAgIDEgICAgICAg IDAgIF9fUlVEbEFfX19fX19fX19fX19fX19fX19fUF9fX19fX19fCXJlZmVy ZW5jZWQsdXB0b2RhdGUsZGlydHksbHJ1LGFjdGl2ZSxwcml2YXRlCjB4MDAw MDAwMDAwMDAwODA4MAkgICAgIDE0NTcxICAgICAgIDU2ICBfX19fX19fU19f X19fX19IX19fX19fX19fX19fX19fX19fXwlzbGFiLGNvbXBvdW5kX2hlYWQK MHgwMDAwMDAwMDAwMDAwMDgwCSAgICAyNTA1NDYgICAgICA5NzggIF9fX19f X19TX19fX19fX19fX19fX19fX19fX19fX19fX19fCXNsYWIKMHgwMDAwMDAw MDAwMDAwNDAwCSAgICAgMTQ3MDEgICAgICAgNTcgIF9fX19fX19fX19CX19f X19fX19fX19fX19fX19fX19fX19fCWJ1ZGR5CjB4MDAwMDAwMDAwMDAwMDgw MAkgICAgICAgIDE2ICAgICAgICAwICBfX19fX19fX19fX01fX19fX19fX19f X19fX19fX19fX19fXwltbWFwCjB4MDAwMDAwMDEwMDAwMDgwNAkgICAgICAg ICAxICAgICAgICAwICBfX1JfX19fX19fX01fX19fX19fX19fX3JfX19fX19f X19fXwlyZWZlcmVuY2VkLG1tYXAscmVzZXJ2ZWQKMHgwMDAwMDAwNjAwMDQw ODJjCSAgICAgICAzOTEgICAgICAgIDEgIF9fUlVfbF9fX19fTV9fX19fX3Vf 
X19fX21kX19fX19fX19fCXJlZmVyZW5jZWQsdXB0b2RhdGUsbHJ1LG1tYXAs dW5ldmljdGFibGUsbWxvY2tlZCxtYXBwZWR0b2Rpc2sKMHgwMDAwMDAwYTAw MDQwODJjCSAgICAgICAzMjEgICAgICAgIDEgIF9fUlVfbF9fX19fTV9fX19f X3VfX19fX21fUF9fX19fX19fCXJlZmVyZW5jZWQsdXB0b2RhdGUsbHJ1LG1t YXAsdW5ldmljdGFibGUsbWxvY2tlZCxwcml2YXRlCjB4MDAwMDAwMDAwMDAw NDgzOAkgICAgICA4Mzg1ICAgICAgIDMyICBfX19VRGxfX19fX01fX2JfX19f X19fX19fX19fX19fX19fXwl1cHRvZGF0ZSxkaXJ0eSxscnUsbW1hcCxzd2Fw YmFja2VkCjB4MDAwMDAwMDAwMDAwNDgzYwkgICAgICAyMDQ1ICAgICAgICA3 ICBfX1JVRGxfX19fX01fX2JfX19fX19fX19fX19fX19fX19fXwlyZWZlcmVu Y2VkLHVwdG9kYXRlLGRpcnR5LGxydSxtbWFwLHN3YXBiYWNrZWQKMHgwMDAw MDAwODAwMDAwODY4CSAgICAgICAgMTkgICAgICAgIDAgIF9fX1VfbEFfX19f TV9fX19fX19fX19fX19fUF9fX19fX19fCXVwdG9kYXRlLGxydSxhY3RpdmUs bW1hcCxwcml2YXRlCjB4MDAwMDAwMDQwMDAwMDg2OAkgICAgICAgICA1ICAg ICAgICAwICBfX19VX2xBX19fX01fX19fX19fX19fX19fZF9fX19fX19fXwl1 cHRvZGF0ZSxscnUsYWN0aXZlLG1tYXAsbWFwcGVkdG9kaXNrCjB4MDAwMDAw MDQwMDAwMDg2YwkgICAgICAxODkxICAgICAgICA3ICBfX1JVX2xBX19fX01f X19fX19fX19fX19fZF9fX19fX19fXwlyZWZlcmVuY2VkLHVwdG9kYXRlLGxy dSxhY3RpdmUsbW1hcCxtYXBwZWR0b2Rpc2sKMHgwMDAwMDAwODAwMDAwODZj CSAgICAgICAxMjYgICAgICAgIDAgIF9fUlVfbEFfX19fTV9fX19fX19fX19f X19fUF9fX19fX19fCXJlZmVyZW5jZWQsdXB0b2RhdGUsbHJ1LGFjdGl2ZSxt bWFwLHByaXZhdGUKMHgwMDAwMDAwMDAwMDA0ODc4CSAgICAgICAgODUgICAg ICAgIDAgIF9fX1VEbEFfX19fTV9fYl9fX19fX19fX19fX19fX19fX19fCXVw dG9kYXRlLGRpcnR5LGxydSxhY3RpdmUsbW1hcCxzd2FwYmFja2VkCjB4MDAw MDAwMDAwMDAwNDg3YwkgICAgICAyMjYzICAgICAgICA4ICBfX1JVRGxBX19f X01fX2JfX19fX19fX19fX19fX19fX19fXwlyZWZlcmVuY2VkLHVwdG9kYXRl LGRpcnR5LGxydSxhY3RpdmUsbW1hcCxzd2FwYmFja2VkCjB4MDAwMDAwMDAw MDAwNTAwOAkgICAgICAgICA0ICAgICAgICAwICBfX19VX19fX19fX19hX2Jf X19fX19fX19fX19fX19fX19fXwl1cHRvZGF0ZSxhbm9ueW1vdXMsc3dhcGJh Y2tlZAoweDAwMDAwMDAwMDAwMDU4MDgJICAgICAgICAyNSAgICAgICAgMCAg X19fVV9fX19fX19NYV9iX19fX19fX19fX19fX19fX19fX18JdXB0b2RhdGUs bW1hcCxhbm9ueW1vdXMsc3dhcGJhY2tlZAoweDAwMDAwMDAyMDAwNDU4MjgJ ICAgICAgICAgOCAgICAgICAgMCAgX19fVV9sX19fX19NYV9iX19fdV9fX19f 
bV9fX19fX19fX18JdXB0b2RhdGUsbHJ1LG1tYXAsYW5vbnltb3VzLHN3YXBi YWNrZWQsdW5ldmljdGFibGUsbWxvY2tlZAoweDAwMDAwMDAyMDAwNDU4MmMJ ICAgICAgIDY1MSAgICAgICAgMiAgX19SVV9sX19fX19NYV9iX19fdV9fX19f bV9fX19fX19fX18JcmVmZXJlbmNlZCx1cHRvZGF0ZSxscnUsbW1hcCxhbm9u eW1vdXMsc3dhcGJhY2tlZCx1bmV2aWN0YWJsZSxtbG9ja2VkCjB4MDAwMDAw MDAwMDAwNTg2OAkgICAgICA3NjIzICAgICAgIDI5ICBfX19VX2xBX19fX01h X2JfX19fX19fX19fX19fX19fX19fXwl1cHRvZGF0ZSxscnUsYWN0aXZlLG1t YXAsYW5vbnltb3VzLHN3YXBiYWNrZWQKMHgwMDAwMDAwMDAwMDA1ODZjCSAg ICAgICAgMzkgICAgICAgIDAgIF9fUlVfbEFfX19fTWFfYl9fX19fX19fX19f X19fX19fX19fCXJlZmVyZW5jZWQsdXB0b2RhdGUsbHJ1LGFjdGl2ZSxtbWFw LGFub255bW91cyxzd2FwYmFja2VkCiAgICAgICAgICAgICB0b3RhbAkgIDE3 OTIyMDQ4ICAgIDcwMDA4Cg== ---2140344373-170584175-1353492026=:13449-- -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx122.postini.com [74.125.245.122]) by kanga.kvack.org (Postfix) with SMTP id 6AA196B0044 for ; Wed, 21 Nov 2012 05:07:44 -0500 (EST) Received: by mail-qa0-f48.google.com with SMTP id s11so1674587qaa.14 for ; Wed, 21 Nov 2012 02:07:43 -0800 (PST) MIME-Version: 1.0 In-Reply-To: <50ACA634.5000007@gmail.com> References: <1353433362.85184.YahooMailNeo@web141101.mail.bf1.yahoo.com> <20121120182500.GH1408@quack.suse.cz> <1353485020.53500.YahooMailNeo@web141104.mail.bf1.yahoo.com> <1353485630.17455.YahooMailNeo@web141106.mail.bf1.yahoo.com> <50AC9220.70202@gmail.com> <20121121090204.GA9064@localhost> <50ACA209.9000101@gmail.com> <1353491880.11679.YahooMailNeo@web141102.mail.bf1.yahoo.com> <50ACA634.5000007@gmail.com> From: =?UTF-8?B?TWV0aW4gRMO2xZ9sw7w=?= Date: Wed, 21 Nov 2012 12:07:22 +0200 Message-ID: Subject: Re: Problem in Page Cache Replacement Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Sender: owner-linux-mm@kvack.org List-ID: To: 
Jaegeuk Hanse Cc: Fengguang Wu , Jan Kara , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" On Wed, Nov 21, 2012 at 12:00 PM, Jaegeuk Hanse wrote: > > On 11/21/2012 05:58 PM, metin d wrote: > > Hi Fengguang, > > I run tests and attached the results. The line below I guess shows the data-1 page caches. > > 0x000000080000006c 6584051 25718 __RU_lA___________________P________ referenced,uptodate,lru,active,private > > > I thinks this is just one state of page cache pages. But why these page caches are in this state as opposed to other page caches. From the results I conclude that: data-1 pages are in state : referenced,uptodate,lru,active,private data-2 pages are in state : referenced,uptodate,lru,mappedtodisk > > > > > Metin > > > ----- Original Message ----- > From: Jaegeuk Hanse > To: Fengguang Wu > Cc: metin d ; Jan Kara ; "linux-kernel@vger.kernel.org" ; "linux-mm@kvack.org" > Sent: Wednesday, November 21, 2012 11:42 AM > Subject: Re: Problem in Page Cache Replacement > > On 11/21/2012 05:02 PM, Fengguang Wu wrote: > > On Wed, Nov 21, 2012 at 04:34:40PM +0800, Jaegeuk Hanse wrote: > >> Cc Fengguang Wu. > >> > >> On 11/21/2012 04:13 PM, metin d wrote: > >>>> Curious. Added linux-mm list to CC to catch more attention. If you run > >>>> echo 1 >/proc/sys/vm/drop_caches does it evict data-1 pages from memory? > >>> I'm guessing it'd evict the entries, but am wondering if we could run any more diagnostics before trying this. > >>> > >>> We regularly use a setup where we have two databases; one gets used frequently and the other one about once a month. It seems like the memory manager keeps unused pages in memory at the expense of frequently used database's performance. > >>> My understanding was that under memory pressure from heavily > >>> accessed pages, unused pages would eventually get evicted. Is there > >>> anything else we can try on this host to understand why this is > >>> happening? > > We may debug it this way.
> > > > 1) run 'fadvise data-2 0 0 dontneed' to drop data-2 cached pages > > (please double check via /proc/vmstat whether it does the expected work) > > > > 2) run 'page-types -r' with root, to view the page status for the > > remaining pages of data-1 > > > > The fadvise tool comes from Andrew Morton's ext3-tools. (source code attached) > > Please compile them with options "-Dlinux -I. -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE" > > > > page-types can be found in the kernel source tree tools/vm/page-types.c > > > > Sorry that sounds a bit twisted.. I do have a patch to directly dump > > page cache status of a user specified file, however it's not > > upstreamed yet. > > Hi Fengguang, > > Thanks for you detail steps, I think metin can have a try. > > flags page-count MB symbolic-flags long-symbolic-flags > 0x0000000000000000 607699 2373 > ___________________________________ > 0x0000000100000000 343227 1340 > _______________________r___________ reserved > > But I have some questions of the print of page-type: > > Is 2373MB here mean total memory in used include page cache? I don't > think so. > Which kind of pages will be marked reserved? > Which line of long-symbolic-flags is for page cache? > > Regards, > Jaegeuk > > > > > Thanks, > > Fengguang > > > >>> On Tue 20-11-12 09:42:42, metin d wrote: > >>>> I have two PostgreSQL databases named data-1 and data-2 that sit on the > >>>> same machine. Both databases keep 40 GB of data, and the total memory > >>>> available on the machine is 68GB. > >>>> > >>>> I started data-1 and data-2, and ran several queries to go over all their > >>>> data. Then, I shut down data-1 and kept issuing queries against data-2. > >>>> For some reason, the OS still holds on to large parts of data-1's pages > >>>> in its page cache, and reserves about 35 GB of RAM to data-2's files. As > >>>> a result, my queries on data-2 keep hitting disk. > >>>> > >>>> I'm checking page cache usage with fincore.
When I run a table scan query > >>>> against data-2, I see that data-2's pages get evicted and put back into > >>>> the cache in a round-robin manner. Nothing happens to data-1's pages, > >>>> although they haven't been touched for days. > >>>> > >>>> Does anybody know why data-1's pages aren't evicted from the page cache? > >>>> I'm open to all kind of suggestions you think it might relate to problem. > >>> Curious. Added linux-mm list to CC to catch more attention. If you run > >>> echo 1 >/proc/sys/vm/drop_caches > >>> does it evict data-1 pages from memory? > >>> > >>>> This is an EC2 m2.4xlarge instance on Amazon with 68 GB of RAM and no > >>>> swap space. The kernel version is: > >>>> > >>>> $ uname -r > >>>> 3.2.28-45.62.amzn1.x86_64 > >>>> Edit: > >>>> > >>>> and it seems that I use one NUMA instance, if you think that it can a problem. > >>>> > >>>> $ numactl --hardware > >>>> available: 1 nodes (0) > >>>> node 0 cpus: 0 1 2 3 4 5 6 7 > >>>> node 0 size: 70007 MB > >>>> node 0 free: 360 MB > >>>> node distances: > >>>> node 0 > >>>> 0: 10 > > -- Metin Döşlü -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
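For readers following the page-types lines quoted in this message, here is a minimal sketch of how the hex flags column and the MB column can be read. It is not part of the original thread. The bit table is written from memory of the kernel's kernel-page-flags.h and tools/vm/page-types.c and should be treated as an assumption to check against your own tree; the function names are hypothetical, and the 4 KiB page size is assumed.

```python
# Bit number -> name, as recalled from kernel-page-flags.h and
# tools/vm/page-types.c. This table is an assumption; verify it
# against the kernel tree you are actually running.
KPF_BITS = {
    0: "locked", 1: "error", 2: "referenced", 3: "uptodate", 4: "dirty",
    5: "lru", 6: "active", 7: "slab", 8: "writeback", 9: "reclaim",
    10: "buddy", 11: "mmap", 12: "anonymous", 13: "swapcache",
    14: "swapbacked", 18: "unevictable",
    32: "reserved", 33: "mlocked", 34: "mappedtodisk", 35: "private",
}

PAGE_SIZE = 4096  # assumption: 4 KiB pages


def decode_flags(flags):
    """Return the names of all known bits set in a page-types flags value."""
    return [name for bit, name in sorted(KPF_BITS.items())
            if (flags >> bit) & 1]


def pages_to_mb(page_count):
    """Convert a page count to whole MB, rounding down."""
    return page_count * PAGE_SIZE // (1024 * 1024)


if __name__ == "__main__":
    # The line Metin singled out as data-1's pages.
    print(decode_flags(0x000000080000006C), pages_to_mb(6584051))
    # The two lines from Jaegeuk's own page-types run.
    print(decode_flags(0x0000000000000000), pages_to_mb(607699))
    print(decode_flags(0x0000000100000000), pages_to_mb(343227))
```

Under these assumptions the MB column is just the page count times the page size, and the "active" bit on the first line is what sets those pages apart from data-2's pages, which lack it.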
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx194.postini.com [74.125.245.194]) by kanga.kvack.org (Postfix) with SMTP id 40EC86B0070 for ; Wed, 21 Nov 2012 16:35:06 -0500 (EST) Date: Wed, 21 Nov 2012 16:34:18 -0500 From: Johannes Weiner Subject: Re: Problem in Page Cache Replacement Message-ID: <20121121213417.GC24381@cmpxchg.org> References: <1353433362.85184.YahooMailNeo@web141101.mail.bf1.yahoo.com> <20121120182500.GH1408@quack.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121120182500.GH1408@quack.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: Jan Kara Cc: metin d , "linux-kernel@vger.kernel.org" , linux-mm@kvack.org Hi, On Tue, Nov 20, 2012 at 07:25:00PM +0100, Jan Kara wrote: > On Tue 20-11-12 09:42:42, metin d wrote: > > I have two PostgreSQL databases named data-1 and data-2 that sit on the > > same machine. Both databases keep 40 GB of data, and the total memory > > available on the machine is 68GB. > > > > I started data-1 and data-2, and ran several queries to go over all their > > data. Then, I shut down data-1 and kept issuing queries against data-2. > > For some reason, the OS still holds on to large parts of data-1's pages > > in its page cache, and reserves about 35 GB of RAM to data-2's files. As > > a result, my queries on data-2 keep hitting disk. > > > > I'm checking page cache usage with fincore. When I run a table scan query > > against data-2, I see that data-2's pages get evicted and put back into > > the cache in a round-robin manner. Nothing happens to data-1's pages, > > although they haven't been touched for days. > > > > Does anybody know why data-1's pages aren't evicted from the page cache? > > I'm open to all kind of suggestions you think it might relate to problem. This might be because we do not deactive pages as long as there is cache on the inactive list. 
I'm guessing that the inter-reference distance of data-2 is bigger than half of memory, so it's never getting activated and data-1 is never challenged. I have a series of patches that detects a thrashing inactive list and handles working set changes up to the size of memory. Would you be willing to test them? They are currently based on 3.4, let me know what version works best for you. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx112.postini.com [74.125.245.112]) by kanga.kvack.org (Postfix) with SMTP id B0BCF6B005D for ; Wed, 21 Nov 2012 17:01:30 -0500 (EST) References: <1353433362.85184.YahooMailNeo@web141101.mail.bf1.yahoo.com> <20121120182500.GH1408@quack.suse.cz> <20121121213417.GC24381@cmpxchg.org> Message-ID: <1353535288.94916.YahooMailNeo@web141101.mail.bf1.yahoo.com> Date: Wed, 21 Nov 2012 14:01:28 -0800 (PST) From: metin d Reply-To: metin d Subject: Re: Problem in Page Cache Replacement In-Reply-To: <20121121213417.GC24381@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner , Jan Kara Cc: "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , =?utf-8?B?TWV0aW4gRMO2xZ9sw7w=?= Hi, Yes data-2 is bigger than half of memory. I'm willing to try those patches.
This is the version of this machine: $ uname -r 3.2.28-45.62.amzn1.x86_64 ----- Original Message ----- From: Johannes Weiner To: Jan Kara Cc: metin d ; "linux-kernel@vger.kernel.org" ; linux-mm@kvack.org Sent: Wednesday, November 21, 2012 11:34 PM Subject: Re: Problem in Page Cache Replacement Hi, On Tue, Nov 20, 2012 at 07:25:00PM +0100, Jan Kara wrote: > On Tue 20-11-12 09:42:42, metin d wrote: > > I have two PostgreSQL databases named data-1 and data-2 that sit on the > > same machine. Both databases keep 40 GB of data, and the total memory > > available on the machine is 68GB. > > > > I started data-1 and data-2, and ran several queries to go over all their > > data. Then, I shut down data-1 and kept issuing queries against data-2. > > For some reason, the OS still holds on to large parts of data-1's pages > > in its page cache, and reserves about 35 GB of RAM to data-2's files. As > > a result, my queries on data-2 keep hitting disk. > > > > I'm checking page cache usage with fincore. When I run a table scan query > > against data-2, I see that data-2's pages get evicted and put back into > > the cache in a round-robin manner.
Nothing happens to data-1's pages, > > although they haven't been touched for days. > > > > Does anybody know why data-1's pages aren't evicted from the page cache? > > I'm open to all kind of suggestions you think it might relate to problem. This might be because we do not deactive pages as long as there is cache on the inactive list. I'm guessing that the inter-reference distance of data-2 is bigger than half of memory, so it's never getting activated and data-1 is never challenged. I have a series of patches that detects a thrashing inactive list and handles working set changes up to the size of memory. Would you be willing to test them? They are currently based on 3.4, let me know what version works best for you. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx150.postini.com [74.125.245.150]) by kanga.kvack.org (Postfix) with SMTP id 395FC6B0070 for ; Wed, 21 Nov 2012 19:48:14 -0500 (EST) Received: by mail-ob0-f169.google.com with SMTP id lz20so9679061obb.14 for ; Wed, 21 Nov 2012 16:48:13 -0800 (PST) Message-ID: <50AD7647.7050200@gmail.com> Date: Thu, 22 Nov 2012 08:48:07 +0800 From: Jaegeuk Hanse MIME-Version: 1.0 Subject: Re: Problem in Page Cache Replacement References: <1353433362.85184.YahooMailNeo@web141101.mail.bf1.yahoo.com> <20121120182500.GH1408@quack.suse.cz> <20121121213417.GC24381@cmpxchg.org> In-Reply-To: <20121121213417.GC24381@cmpxchg.org> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner Cc: Jan Kara , metin d , "linux-kernel@vger.kernel.org" , linux-mm@kvack.org On 11/22/2012 05:34 AM, Johannes Weiner wrote: > Hi, > > On Tue, Nov 20, 2012 at
07:25:00PM +0100, Jan Kara wrote: >> On Tue 20-11-12 09:42:42, metin d wrote: >>> I have two PostgreSQL databases named data-1 and data-2 that sit on the >>> same machine. Both databases keep 40 GB of data, and the total memory >>> available on the machine is 68GB. >>> >>> I started data-1 and data-2, and ran several queries to go over all their >>> data. Then, I shut down data-1 and kept issuing queries against data-2. >>> For some reason, the OS still holds on to large parts of data-1's pages >>> in its page cache, and reserves about 35 GB of RAM to data-2's files. As >>> a result, my queries on data-2 keep hitting disk. >>> >>> I'm checking page cache usage with fincore. When I run a table scan query >>> against data-2, I see that data-2's pages get evicted and put back into >>> the cache in a round-robin manner. Nothing happens to data-1's pages, >>> although they haven't been touched for days. >>> >>> Does anybody know why data-1's pages aren't evicted from the page cache? >>> I'm open to all kind of suggestions you think it might relate to problem. > This might be because we do not deactive pages as long as there is > cache on the inactive list. I'm guessing that the inter-reference > distance of data-2 is bigger than half of memory, so it's never > getting activated and data-1 is never challenged. Hi Johannes, What's the meaning of "inter-reference distance" and why compare it with half of memoy, what's the trick? Regards, Jaegeuk > > I have a series of patches that detects a thrashing inactive list and > handles working set changes up to the size of memory. Would you be > willing to test them? They are currently based on 3.4, let me know > what version works best for you. > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . 
> Don't email: email@kvack.org -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx128.postini.com [74.125.245.128]) by kanga.kvack.org (Postfix) with SMTP id 155C96B0070 for ; Wed, 21 Nov 2012 20:10:48 -0500 (EST) Date: Wed, 21 Nov 2012 20:09:59 -0500 From: Johannes Weiner Subject: Re: Problem in Page Cache Replacement Message-ID: <20121122010959.GF24381@cmpxchg.org> References: <1353433362.85184.YahooMailNeo@web141101.mail.bf1.yahoo.com> <20121120182500.GH1408@quack.suse.cz> <20121121213417.GC24381@cmpxchg.org> <50AD7647.7050200@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <50AD7647.7050200@gmail.com> Sender: owner-linux-mm@kvack.org List-ID: To: Jaegeuk Hanse Cc: Jan Kara , metin d , "linux-kernel@vger.kernel.org" , linux-mm@kvack.org On Thu, Nov 22, 2012 at 08:48:07AM +0800, Jaegeuk Hanse wrote: > On 11/22/2012 05:34 AM, Johannes Weiner wrote: > >Hi, > > > >On Tue, Nov 20, 2012 at 07:25:00PM +0100, Jan Kara wrote: > >>On Tue 20-11-12 09:42:42, metin d wrote: > >>>I have two PostgreSQL databases named data-1 and data-2 that sit on the > >>>same machine. Both databases keep 40 GB of data, and the total memory > >>>available on the machine is 68GB. > >>> > >>>I started data-1 and data-2, and ran several queries to go over all their > >>>data. Then, I shut down data-1 and kept issuing queries against data-2. > >>>For some reason, the OS still holds on to large parts of data-1's pages > >>>in its page cache, and reserves about 35 GB of RAM to data-2's files. As > >>>a result, my queries on data-2 keep hitting disk. > >>> > >>>I'm checking page cache usage with fincore. 
When I run a table scan query > >>>against data-2, I see that data-2's pages get evicted and put back into > >>>the cache in a round-robin manner. Nothing happens to data-1's pages, > >>>although they haven't been touched for days. > >>> > >>>Does anybody know why data-1's pages aren't evicted from the page cache? > >>>I'm open to all kind of suggestions you think it might relate to problem. > >This might be because we do not deactive pages as long as there is > >cache on the inactive list. I'm guessing that the inter-reference > >distance of data-2 is bigger than half of memory, so it's never > >getting activated and data-1 is never challenged. > > Hi Johannes, > > What's the meaning of "inter-reference distance" It's the number of memory accesses between two accesses to the same page: A B C D A B C E ... |_______| | | > and why compare it with half of memoy, what's the trick? If B gets accessed twice, it gets activated. If it gets evicted in between, the second access will be a fresh page fault and B will not be recognized as frequently used. Our cutoff for scanning the active list is cache size / 2 right now (inactive_file_is_low), leaving 50% of memory to the inactive list. If the inter-reference distance for pages on the inactive list is bigger than that, they get evicted before their second access. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
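The explanation in this message can be tried out with a toy model. The sketch below is not from the thread and is not the kernel's algorithm: it is a deliberately simplified two-list cache in which new pages enter the inactive list, a second access while still cached promotes a page to the active list, and the active list is only touched when the inactive list is smaller than it (a stand-in for the inactive_file_is_low check mentioned above). All names and parameters are hypothetical.

```python
from collections import OrderedDict


def simulate(total_pages, stale_active, scan_pages, passes, touches=1):
    """Toy two-list page cache.

    Starts with `stale_active` never-again-used pages on the active list,
    then scans `scan_pages` distinct pages cyclically for `passes` passes,
    touching each page `touches` times in a row.
    Returns (cache hits, stale pages still cached).
    """
    active = OrderedDict((("old", i), None) for i in range(stale_active))
    inactive = OrderedDict()
    hits = 0

    def reclaim_one():
        # Only look at the active list when the inactive list is "low".
        if len(inactive) < len(active):
            page, _ = active.popitem(last=False)   # oldest active page
            inactive[page] = None                  # demote it
        inactive.popitem(last=False)               # evict oldest inactive

    for _ in range(passes):
        for i in range(scan_pages):
            page = ("new", i)
            for _ in range(touches):
                if page in active:
                    active.move_to_end(page)
                    hits += 1
                elif page in inactive:
                    # Second access while still cached: activate.
                    del inactive[page]
                    active[page] = None
                    hits += 1
                else:
                    # Miss: make room, then insert on the inactive list.
                    while len(active) + len(inactive) >= total_pages:
                        reclaim_one()
                    inactive[page] = None

    stale = sum(1 for p in list(active) + list(inactive) if p[0] == "old")
    return hits, stale


if __name__ == "__main__":
    # Scan larger than the inactive half: every access misses.
    print(simulate(100, 50, 60, passes=3))
    # Scan that fits in the inactive half: pages get activated.
    print(simulate(100, 50, 40, passes=3))
    # Same large scan, but each page touched twice back to back.
    print(simulate(100, 50, 60, passes=1, touches=2))
```

In the first case the scanned pages cycle through the inactive list without ever being seen twice while cached, so the stale active pages are never challenged, which is the situation described in the thread. In the third case the short inter-reference distance lets the new pages reach the active list, and the stale pages start to be pushed out.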
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx152.postini.com [74.125.245.152]) by kanga.kvack.org (Postfix) with SMTP id CC4276B004D for ; Thu, 22 Nov 2012 04:37:49 -0500 (EST) References: <1353433362.85184.YahooMailNeo@web141101.mail.bf1.yahoo.com> <20121120182500.GH1408@quack.suse.cz> <20121121213417.GC24381@cmpxchg.org> <50AD7647.7050200@gmail.com> <20121122010959.GF24381@cmpxchg.org> Message-ID: <1353577068.19982.YahooMailNeo@web141101.mail.bf1.yahoo.com> Date: Thu, 22 Nov 2012 01:37:48 -0800 (PST) From: metin d Reply-To: metin d Subject: Re: Problem in Page Cache Replacement In-Reply-To: <20121122010959.GF24381@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner , Jaegeuk Hanse Cc: Jan Kara , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , =?utf-8?B?TWV0aW4gRMO2xZ9sw7w=?= Hi Johannes, Yes, the problem was as you projected.
I tried to make data-2's pages "active" by manually reading them twice, and data-1's pages finally got evicted from the page cache. We have large files in PostgreSQL and Hadoop that we scan over sequentially, and we try to fit our working set into total memory. So I hope your patches will make it into a Linux kernel release soon. Thanks, Metin ----- Original Message ----- From: Johannes Weiner To: Jaegeuk Hanse Cc: Jan Kara ; metin d ; "linux-kernel@vger.kernel.org" ; linux-mm@kvack.org Sent: Thursday, November 22, 2012 3:09 AM Subject: Re: Problem in Page Cache Replacement On Thu, Nov 22, 2012 at 08:48:07AM +0800, Jaegeuk Hanse wrote: > On 11/22/2012 05:34 AM, Johannes Weiner wrote: > >Hi, > > > >On Tue, Nov 20, 2012 at 07:25:00PM +0100, Jan Kara wrote: > >>On Tue 20-11-12 09:42:42, metin d wrote: > >>>I have two PostgreSQL databases named data-1 and data-2 that sit on the > >>>same machine. Both databases keep 40 GB of data, and the total memory > >>>available on the machine is 68GB. > >>> > >>>I started data-1 and data-2, and ran several queries to go over all their > >>>data. Then, I shut down data-1 and kept issuing queries against data-2. > >>>For some reason, the OS still holds on to large parts of data-1's pages > >>>in its page cache, and reserves about 35 GB of RAM to data-2's files. As > >>>a result, my queries on data-2 keep hitting disk. > >>> > >>>I'm checking page cache usage with fincore. When I run a table scan query > >>>against data-2, I see that data-2's pages get evicted and put back into > >>>the cache in a round-robin manner.
Nothing happens to data-1's pages, > >>>although they haven't been touched for days. > >>> > >>>Does anybody know why data-1's pages aren't evicted from the page cache? > >>>I'm open to all kind of suggestions you think it might relate to problem. > >This might be because we do not deactive pages as long as there is > >cache on the inactive list. I'm guessing that the inter-reference > >distance of data-2 is bigger than half of memory, so it's never > >getting activated and data-1 is never challenged. > > Hi Johannes, > > What's the meaning of "inter-reference distance" It's the number of memory accesses between two accesses to the same page: A B C D A B C E ... |_______| | | > and why compare it with half of memoy, what's the trick? If B gets accessed twice, it gets activated. If it gets evicted in between, the second access will be a fresh page fault and B will not be recognized as frequently used. Our cutoff for scanning the active list is cache size / 2 right now (inactive_file_is_low), leaving 50% of memory to the inactive list. If the inter-reference distance for pages on the inactive list is bigger than that, they get evicted before their second access. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx197.postini.com [74.125.245.197]) by kanga.kvack.org (Postfix) with SMTP id BDB896B0070 for ; Thu, 22 Nov 2012 08:00:12 -0500 (EST) Received: by mail-ie0-f169.google.com with SMTP id 10so14807286ied.14 for ; Thu, 22 Nov 2012 05:00:12 -0800 (PST) Message-ID: <50AE21D2.5070105@gmail.com> Date: Thu, 22 Nov 2012 21:00:02 +0800 From: Jaegeuk Hanse MIME-Version: 1.0 Subject: Re: Problem in Page Cache Replacement References: <1353433362.85184.YahooMailNeo@web141101.mail.bf1.yahoo.com> <20121120182500.GH1408@quack.suse.cz> <1353485020.53500.YahooMailNeo@web141104.mail.bf1.yahoo.com> <1353485630.17455.YahooMailNeo@web141106.mail.bf1.yahoo.com> <50AC9220.70202@gmail.com> <20121121090204.GA9064@localhost> <50ACA166.70705@gmail.com> In-Reply-To: <50ACA166.70705@gmail.com> Content-Type: multipart/alternative; boundary="------------020109050305020204070302" Sender: owner-linux-mm@kvack.org List-ID: To: Jaegeuk Hanse <" jaegeuk.hanse"@gmail.com> Cc: Fengguang Wu , metin d , Jan Kara , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" This is a multi-part message in MIME format. --------------020109050305020204070302 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit On 11/21/2012 05:39 PM, Jaegeuk Hanse wrote: > On 11/21/2012 05:02 PM, Fengguang Wu wrote: >> On Wed, Nov 21, 2012 at 04:34:40PM +0800, Jaegeuk Hanse wrote: >>> Cc Fengguang Wu. >>> >>> On 11/21/2012 04:13 PM, metin d wrote: >>>>> Curious. Added linux-mm list to CC to catch more attention. If you run >>>>> echo 1 >/proc/sys/vm/drop_caches does it evict data-1 pages from memory? >>>> I'm guessing it'd evict the entries, but am wondering if we could run any more diagnostics before trying this. >>>> >>>> We regularly use a setup where we have two databases; one gets used frequently and the other one about once a month. 
It seems like the memory manager keeps unused pages in memory at the expense of frequently used database's performance. >>>> My understanding was that under memory pressure from heavily >>>> accessed pages, unused pages would eventually get evicted. Is there >>>> anything else we can try on this host to understand why this is >>>> happening? >> We may debug it this way. >> >> 1) run 'fadvise data-2 0 0 dontneed' to drop data-2 cached pages >> (please double check via /proc/vmstat whether it does the expected work) >> >> 2) run 'page-types -r' with root, to view the page status for the >> remaining pages of data-1 >> >> The fadvise tool comes from Andrew Morton's ext3-tools. (source code attached) >> Please compile them with options "-Dlinux -I. -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE" >> >> page-types can be found in the kernel source tree tools/vm/page-types.c >> >> Sorry that sounds a bit twisted.. I do have a patch to directly dump >> page cache status of a user specified file, however it's not >> upstreamed yet. > > Hi Fengguang, > > Thanks for you detail steps, I think metin can have a try. > > flags page-count MB symbolic-flags long-symbolic-flags > 0x0000000000000000 607699 2373 > ___________________________________ > 0x0000000100000000 343227 1340 > _______________________r___________ reserved > > But I have some questions of page-type Hi Fengguang, Could you explain confusion mentioned above? thanks in advance. Regards, Jaegeuk > >> Thanks, >> Fengguang >> >>>> On Tue 20-11-12 09:42:42, metin d wrote: >>>>> I have two PostgreSQL databases named data-1 and data-2 that sit on the >>>>> same machine. Both databases keep 40 GB of data, and the total memory >>>>> available on the machine is 68GB. >>>>> >>>>> I started data-1 and data-2, and ran several queries to go over all their >>>>> data. Then, I shut down data-1 and kept issuing queries against data-2. 
>>>>> For some reason, the OS still holds on to large parts of data-1's pages >>>>> in its page cache, and reserves about 35 GB of RAM to data-2's files. As >>>>> a result, my queries on data-2 keep hitting disk. >>>>> >>>>> I'm checking page cache usage with fincore. When I run a table scan query >>>>> against data-2, I see that data-2's pages get evicted and put back into >>>>> the cache in a round-robin manner. Nothing happens to data-1's pages, >>>>> although they haven't been touched for days. >>>>> >>>>> Does anybody know why data-1's pages aren't evicted from the page cache? >>>>> I'm open to all kind of suggestions you think it might relate to problem. >>>> Curious. Added linux-mm list to CC to catch more attention. If you run >>>> echo 1 >/proc/sys/vm/drop_caches >>>> does it evict data-1 pages from memory? >>>> >>>>> This is an EC2 m2.4xlarge instance on Amazon with 68 GB of RAM and no >>>>> swap space. The kernel version is: >>>>> >>>>> $ uname -r >>>>> 3.2.28-45.62.amzn1.x86_64 >>>>> Edit: >>>>> >>>>> and it seems that I use one NUMA instance, if you think that it can a problem. >>>>> >>>>> $ numactl --hardware >>>>> available: 1 nodes (0) >>>>> node 0 cpus: 0 1 2 3 4 5 6 7 >>>>> node 0 size: 70007 MB >>>>> node 0 free: 360 MB >>>>> node distances: >>>>> node 0 >>>>> 0: 10 > --------------020109050305020204070302 Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit
--------------020109050305020204070302-- -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx195.postini.com [74.125.245.195]) by kanga.kvack.org (Postfix) with SMTP id 577C06B0072 for ; Thu, 22 Nov 2012 08:16:37 -0500 (EST) Received: by mail-oa0-f41.google.com with SMTP id k14so10227473oag.14 for ; Thu, 22 Nov 2012 05:16:36 -0800 (PST) Message-ID: <50AE25AB.2060808@gmail.com> Date: Thu, 22 Nov 2012 21:16:27 +0800 From: Jaegeuk Hanse MIME-Version: 1.0 Subject: Re: Problem in Page Cache Replacement References: <1353433362.85184.YahooMailNeo@web141101.mail.bf1.yahoo.com> <20121120182500.GH1408@quack.suse.cz> <20121121213417.GC24381@cmpxchg.org> <50AD7647.7050200@gmail.com> <20121122010959.GF24381@cmpxchg.org> In-Reply-To: <20121122010959.GF24381@cmpxchg.org> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner Cc: Jan Kara , metin d , "linux-kernel@vger.kernel.org" , linux-mm@kvack.org On 11/22/2012 09:09 AM, Johannes Weiner wrote: > On Thu, Nov 22, 2012 at 08:48:07AM +0800, Jaegeuk Hanse wrote: >> On 11/22/2012 05:34 AM, Johannes Weiner wrote: >>> Hi, >>> >>> On Tue, Nov 20, 2012 at 07:25:00PM +0100, Jan Kara wrote: >>>> On Tue 20-11-12 09:42:42, metin d wrote: >>>>> I have two PostgreSQL databases named data-1 and data-2 that sit on the >>>>> same machine. Both databases keep 40 GB of data, and the total memory >>>>> available on the machine is 68GB. >>>>> >>>>> I started data-1 and data-2, and ran several queries to go over all their >>>>> data. Then, I shut down data-1 and kept issuing queries against data-2. 
>>>>> For some reason, the OS still holds on to large parts of data-1's pages >>>>> in its page cache, and reserves about 35 GB of RAM to data-2's files. As >>>>> a result, my queries on data-2 keep hitting disk. >>>>> >>>>> I'm checking page cache usage with fincore. When I run a table scan query >>>>> against data-2, I see that data-2's pages get evicted and put back into >>>>> the cache in a round-robin manner. Nothing happens to data-1's pages, >>>>> although they haven't been touched for days. >>>>> >>>>> Does anybody know why data-1's pages aren't evicted from the page cache? >>>>> I'm open to all kind of suggestions you think it might relate to problem. >>> This might be because we do not deactive pages as long as there is >>> cache on the inactive list. I'm guessing that the inter-reference >>> distance of data-2 is bigger than half of memory, so it's never >>> getting activated and data-1 is never challenged. >> Hi Johannes, >> >> What's the meaning of "inter-reference distance" > It's the number of memory accesses between two accesses to the same > page: > > A B C D A B C E ... > |_______| > | | > >> and why compare it with half of memoy, what's the trick? > If B gets accessed twice, it gets activated. If it gets evicted in > between, the second access will be a fresh page fault and B will not > be recognized as frequently used. > > Our cutoff for scanning the active list is cache size / 2 right now > (inactive_file_is_low), leaving 50% of memory to the inactive list. > If the inter-reference distance for pages on the inactive list is > bigger than that, they get evicted before their second access. Hi Johannes, Thanks for your explanation. But could you give a short description of how you resolve this inactive list thrashing issues? Regards, Jaegeuk -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
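The activation rule Johannes describes above can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: the function name scan_hits, the fixed inactive capacity, and the never-shrinking active set are illustrative and do not reflect the kernel's real data structures. It shows that a cyclic scan whose inter-reference distance exceeds the inactive list size never gets a second access while resident, and so is never activated.

```python
from collections import OrderedDict

def scan_hits(num_pages, inactive_capacity, passes=3):
    """Cyclically read num_pages pages through a fixed-size inactive list.

    Returns (cache_hits, activations). A page is activated only when it is
    referenced again while still resident on the inactive list.
    Assumption: the active set is never scanned or shrunk in this model.
    """
    inactive = OrderedDict()  # oldest entry first
    active = set()
    hits = 0
    activations = 0
    for _ in range(passes):
        for page in range(num_pages):
            if page in active:
                hits += 1
            elif page in inactive:
                # Second access while still cached: promote to active.
                del inactive[page]
                active.add(page)
                hits += 1
                activations += 1
            else:
                # Miss: insert at the head, evict the oldest if over capacity.
                inactive[page] = True
                if len(inactive) > inactive_capacity:
                    inactive.popitem(last=False)
    return hits, activations

# Distance (10 pages) larger than the inactive list (5): every access misses.
print(scan_hits(10, 5))
# Distance (4 pages) fits in the inactive list: pages get activated.
print(scan_hits(4, 5))
```

In the second call the working set fits, so the second pass promotes every page and later passes hit in the active set.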
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx177.postini.com [74.125.245.177]) by kanga.kvack.org (Postfix) with SMTP id 079896B005D for ; Thu, 22 Nov 2012 10:26:19 -0500 (EST) Date: Thu, 22 Nov 2012 23:26:11 +0800 From: Fengguang Wu Subject: Re: Problem in Page Cache Replacement Message-ID: <20121122152611.GA11736@localhost> References: <1353433362.85184.YahooMailNeo@web141101.mail.bf1.yahoo.com> <20121120182500.GH1408@quack.suse.cz> <1353485020.53500.YahooMailNeo@web141104.mail.bf1.yahoo.com> <1353485630.17455.YahooMailNeo@web141106.mail.bf1.yahoo.com> <50AC9220.70202@gmail.com> <20121121090204.GA9064@localhost> <50ACA209.9000101@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <50ACA209.9000101@gmail.com> Sender: owner-linux-mm@kvack.org List-ID: To: Jaegeuk Hanse Cc: metin d , Jan Kara , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" Hi Jaegeuk, Sorry for the delay. I'm traveling these days.. On Wed, Nov 21, 2012 at 05:42:33PM +0800, Jaegeuk Hanse wrote: > On 11/21/2012 05:02 PM, Fengguang Wu wrote: > >On Wed, Nov 21, 2012 at 04:34:40PM +0800, Jaegeuk Hanse wrote: > >>Cc Fengguang Wu. > >> > >>On 11/21/2012 04:13 PM, metin d wrote: > >>>> Curious. Added linux-mm list to CC to catch more attention. If you run > >>>>echo 1 >/proc/sys/vm/drop_caches does it evict data-1 pages from memory? > >>>I'm guessing it'd evict the entries, but am wondering if we could run any more diagnostics before trying this. > >>> > >>>We regularly use a setup where we have two databases; one gets used frequently and the other one about once a month. It seems like the memory manager keeps unused pages in memory at the expense of frequently used database's performance. > >>>My understanding was that under memory pressure from heavily > >>>accessed pages, unused pages would eventually get evicted. 
Is there > >>>anything else we can try on this host to understand why this is > >>>happening? > >We may debug it this way. > > > >1) run 'fadvise data-2 0 0 dontneed' to drop data-2 cached pages > > (please double check via /proc/vmstat whether it does the expected work) > > > >2) run 'page-types -r' with root, to view the page status for the > > remaining pages of data-1 > > > >The fadvise tool comes from Andrew Morton's ext3-tools. (source code attached) > >Please compile them with options "-Dlinux -I. -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE" > > > >page-types can be found in the kernel source tree tools/vm/page-types.c > > > >Sorry that sounds a bit twisted.. I do have a patch to directly dump > >page cache status of a user specified file, however it's not > >upstreamed yet. > > Hi Fengguang, > > Thanks for you detail steps, I think metin can have a try. > > flags page-count MB symbolic-flags long-symbolic-flags > 0x0000000000000000 607699 2373 > ___________________________________ > 0x0000000100000000 343227 1340 > _______________________r___________ reserved We don't need to care about the above two pages states actually. Page cache pages will never be in the special reserved or all-flags-cleared state. > But I have some questions of the print of page-type: > > Is 2373MB here mean total memory in used include page cache? I don't > think so. > Which kind of pages will be marked reserved? > Which line of long-symbolic-flags is for page cache? The (lru && !anonymous) pages are page cache pages. Thanks, Fengguang > >>>On Tue 20-11-12 09:42:42, metin d wrote: > >>>>I have two PostgreSQL databases named data-1 and data-2 that sit on the > >>>>same machine. Both databases keep 40 GB of data, and the total memory > >>>>available on the machine is 68GB. > >>>> > >>>>I started data-1 and data-2, and ran several queries to go over all their > >>>>data. Then, I shut down data-1 and kept issuing queries against data-2. 
> >>>>For some reason, the OS still holds on to large parts of data-1's pages > >>>>in its page cache, and reserves about 35 GB of RAM to data-2's files. As > >>>>a result, my queries on data-2 keep hitting disk. > >>>> > >>>>I'm checking page cache usage with fincore. When I run a table scan query > >>>>against data-2, I see that data-2's pages get evicted and put back into > >>>>the cache in a round-robin manner. Nothing happens to data-1's pages, > >>>>although they haven't been touched for days. > >>>> > >>>>Does anybody know why data-1's pages aren't evicted from the page cache? > >>>>I'm open to all kind of suggestions you think it might relate to problem. > >>> Curious. Added linux-mm list to CC to catch more attention. If you run > >>>echo 1 >/proc/sys/vm/drop_caches > >>> does it evict data-1 pages from memory? > >>> > >>>>This is an EC2 m2.4xlarge instance on Amazon with 68 GB of RAM and no > >>>>swap space. The kernel version is: > >>>> > >>>>$ uname -r > >>>>3.2.28-45.62.amzn1.x86_64 > >>>>Edit: > >>>> > >>>>and it seems that I use one NUMA instance, if you think that it can a problem. > >>>> > >>>>$ numactl --hardware > >>>>available: 1 nodes (0) > >>>>node 0 cpus: 0 1 2 3 4 5 6 7 > >>>>node 0 size: 70007 MB > >>>>node 0 free: 360 MB > >>>>node distances: > >>>>node 0 > >>>> 0: 10 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx166.postini.com [74.125.245.166]) by kanga.kvack.org (Postfix) with SMTP id 211F86B002B for ; Thu, 22 Nov 2012 10:41:12 -0500 (EST) Date: Thu, 22 Nov 2012 23:41:07 +0800 From: Fengguang Wu Subject: Re: Problem in Page Cache Replacement Message-ID: <20121122154107.GB11736@localhost> References: <1353433362.85184.YahooMailNeo@web141101.mail.bf1.yahoo.com> <20121120182500.GH1408@quack.suse.cz> <1353485020.53500.YahooMailNeo@web141104.mail.bf1.yahoo.com> <1353485630.17455.YahooMailNeo@web141106.mail.bf1.yahoo.com> <50AC9220.70202@gmail.com> <20121121090204.GA9064@localhost> <50ACA209.9000101@gmail.com> <1353491880.11679.YahooMailNeo@web141102.mail.bf1.yahoo.com> <50ACA634.5000007@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Metin =?utf-8?B?RMO2xZ9sw7w=?= Cc: Jaegeuk Hanse , Jan Kara , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" On Wed, Nov 21, 2012 at 12:07:22PM +0200, Metin Döşlü wrote: > On Wed, Nov 21, 2012 at 12:00 PM, Jaegeuk Hanse wrote: > > > > On 11/21/2012 05:58 PM, metin d wrote: > > > > Hi Fengguang, > > > > I run tests and attached the results. The line below I guess shows the data-1 page caches. > > > > 0x000000080000006c 6584051 25718 __RU_lA___________________P________ referenced,uptodate,lru,active,private > > > > > > I think this is just one state of page cache pages. > > But why these page caches are in this state as opposed to other page > caches. From the results I conclude that: > > data-1 pages are in state : referenced,uptodate,lru,active,private I wonder if it's this code that stops data-1 pages from being reclaimed: shrink_page_list(): if (page_has_private(page)) { if (!try_to_release_page(page, sc->gfp_mask)) goto activate_locked; What's the filesystem used?
> data-2 pages are in state : referenced,uptodate,lru,mappedtodisk Thanks, Fengguang -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx119.postini.com [74.125.245.119]) by kanga.kvack.org (Postfix) with SMTP id BA2F06B005D for ; Thu, 22 Nov 2012 10:53:21 -0500 (EST) Date: Thu, 22 Nov 2012 23:53:18 +0800 From: Fengguang Wu Subject: Re: Problem in Page Cache Replacement Message-ID: <20121122155318.GA12636@localhost> References: <20121120182500.GH1408@quack.suse.cz> <1353485020.53500.YahooMailNeo@web141104.mail.bf1.yahoo.com> <1353485630.17455.YahooMailNeo@web141106.mail.bf1.yahoo.com> <50AC9220.70202@gmail.com> <20121121090204.GA9064@localhost> <50ACA209.9000101@gmail.com> <1353491880.11679.YahooMailNeo@web141102.mail.bf1.yahoo.com> <50ACA634.5000007@gmail.com> <20121122154107.GB11736@localhost> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20121122154107.GB11736@localhost> Sender: owner-linux-mm@kvack.org List-ID: To: Metin =?utf-8?B?RMO2xZ9sw7w=?= Cc: Jaegeuk Hanse , Jan Kara , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" On Thu, Nov 22, 2012 at 11:41:07PM +0800, Fengguang Wu wrote: > On Wed, Nov 21, 2012 at 12:07:22PM +0200, Metin Döşlü wrote: > > On Wed, Nov 21, 2012 at 12:00 PM, Jaegeuk Hanse wrote: > > > > > > On 11/21/2012 05:58 PM, metin d wrote: > > > > > > Hi Fengguang, > > > > > > I run tests and attached the results. The line below I guess shows the data-1 page caches. > > > > > > 0x000000080000006c 6584051 25718 __RU_lA___________________P________ referenced,uptodate,lru,active,private > > > > > > > > > I think this is just one state of page cache pages.
> > > > But why these page caches are in this state as opposed to other page > > caches. From the results I conclude that: > > > > data-1 pages are in state : referenced,uptodate,lru,active,private > > I wonder if it's this code that stops data-1 pages from being > reclaimed: > > shrink_page_list(): > > if (page_has_private(page)) { > if (!try_to_release_page(page, sc->gfp_mask)) > goto activate_locked; > > What's the filesystem used? Ah it's more likely caused by this logic: if (is_active_lru(lru)) { if (inactive_list_is_low(mz, file)) shrink_active_list(nr_to_scan, mz, sc, priority, file); The active file list won't be scanned at all if it's smaller than the inactive list. In this case, it's inactive=33586MB > active=25719MB. So the data-1 pages in the active list will never be scanned and reclaimed. > > data-2 pages are in state : referenced,uptodate,lru,mappedtodisk > > Thanks, > Fengguang -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
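The check Fengguang quotes above can be modeled with the numbers from this report. This is a simplified sketch: the function names lists_to_scan and inactive_file_is_low are illustrative, and the assumption that the inactive list counts as "low" only when it is smaller than the active list follows the discussion in this thread rather than a reading of any particular kernel tree.

```python
def inactive_file_is_low(active_mb, inactive_mb):
    # Assumed rule from the thread: the inactive file list is "low" only
    # when the active file list is larger than it.
    return active_mb > inactive_mb

def lists_to_scan(active_mb, inactive_mb):
    """Return which file LRU lists reclaim would look at in this model."""
    lists = ["inactive_file"]  # assumption: the inactive list is always scanned
    if inactive_file_is_low(active_mb, inactive_mb):
        lists.append("active_file")
    return lists

# Numbers reported in this thread: inactive=33586MB, active=25719MB.
print(lists_to_scan(active_mb=25719, inactive_mb=33586))
```

With the reported sizes the model returns only the inactive list, which matches the observation that data-1's active pages are never challenged.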
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx126.postini.com [74.125.245.126]) by kanga.kvack.org (Postfix) with SMTP id 3285F6B0073 for ; Thu, 22 Nov 2012 11:17:57 -0500 (EST) Date: Thu, 22 Nov 2012 11:17:43 -0500 From: Johannes Weiner Subject: Re: Problem in Page Cache Replacement Message-ID: <20121122161743.GH24381@cmpxchg.org> References: <1353433362.85184.YahooMailNeo@web141101.mail.bf1.yahoo.com> <20121120182500.GH1408@quack.suse.cz> <20121121213417.GC24381@cmpxchg.org> <50AD7647.7050200@gmail.com> <20121122010959.GF24381@cmpxchg.org> <50AE25AB.2060808@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <50AE25AB.2060808@gmail.com> Sender: owner-linux-mm@kvack.org List-ID: To: Jaegeuk Hanse Cc: Jan Kara , metin d , "linux-kernel@vger.kernel.org" , linux-mm@kvack.org On Thu, Nov 22, 2012 at 09:16:27PM +0800, Jaegeuk Hanse wrote: > On 11/22/2012 09:09 AM, Johannes Weiner wrote: > >On Thu, Nov 22, 2012 at 08:48:07AM +0800, Jaegeuk Hanse wrote: > >>On 11/22/2012 05:34 AM, Johannes Weiner wrote: > >>>Hi, > >>> > >>>On Tue, Nov 20, 2012 at 07:25:00PM +0100, Jan Kara wrote: > >>>>On Tue 20-11-12 09:42:42, metin d wrote: > >>>>>I have two PostgreSQL databases named data-1 and data-2 that sit on the > >>>>>same machine. Both databases keep 40 GB of data, and the total memory > >>>>>available on the machine is 68GB. > >>>>> > >>>>>I started data-1 and data-2, and ran several queries to go over all their > >>>>>data. Then, I shut down data-1 and kept issuing queries against data-2. > >>>>>For some reason, the OS still holds on to large parts of data-1's pages > >>>>>in its page cache, and reserves about 35 GB of RAM to data-2's files. As > >>>>>a result, my queries on data-2 keep hitting disk. > >>>>> > >>>>>I'm checking page cache usage with fincore. 
When I run a table scan query > >>>>>against data-2, I see that data-2's pages get evicted and put back into > >>>>>the cache in a round-robin manner. Nothing happens to data-1's pages, > >>>>>although they haven't been touched for days. > >>>>> > >>>>>Does anybody know why data-1's pages aren't evicted from the page cache? > >>>>>I'm open to all kind of suggestions you think it might relate to problem. > >>>This might be because we do not deactive pages as long as there is > >>>cache on the inactive list. I'm guessing that the inter-reference > >>>distance of data-2 is bigger than half of memory, so it's never > >>>getting activated and data-1 is never challenged. > >>Hi Johannes, > >> > >>What's the meaning of "inter-reference distance" > >It's the number of memory accesses between two accesses to the same > >page: > > > > A B C D A B C E ... > > |_______| > > | | > > > >>and why compare it with half of memoy, what's the trick? > >If B gets accessed twice, it gets activated. If it gets evicted in > >between, the second access will be a fresh page fault and B will not > >be recognized as frequently used. > > > >Our cutoff for scanning the active list is cache size / 2 right now > >(inactive_file_is_low), leaving 50% of memory to the inactive list. > >If the inter-reference distance for pages on the inactive list is > >bigger than that, they get evicted before their second access. > > Hi Johannes, > > Thanks for your explanation. But could you give a short description > of how you resolve this inactive list thrashing issues? I remember a time stamp of evicted file pages in the page cache radix tree that let me reconstruct the inter-reference distance even after a page has been evicted from cache when it's faulted back in. This way I can tell a one-time sequence from thrashing, no matter how small the inactive list. 
When thrashing is detected, I start deactivating protected pages and put them next to the refaulted cache on the head of the inactive list and let them fight it out as usual. In this reported case, the old data will be challenged and since it's no longer used, it will just drop off the inactive list eventually. If the guess is wrong and the deactivated memory is used more heavily than the refaulting pages, they will just get activated again without incurring any disruption like a major fault. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx163.postini.com [74.125.245.163]) by kanga.kvack.org (Postfix) with SMTP id 9689B6B005D for ; Thu, 22 Nov 2012 20:32:13 -0500 (EST) Received: by mail-da0-f41.google.com with SMTP id e20so2418507dak.14 for ; Thu, 22 Nov 2012 17:32:12 -0800 (PST) Message-ID: <50AED214.4000701@gmail.com> Date: Fri, 23 Nov 2012 09:32:04 +0800 From: Jaegeuk Hanse MIME-Version: 1.0 Subject: Re: Problem in Page Cache Replacement References: <1353433362.85184.YahooMailNeo@web141101.mail.bf1.yahoo.com> <20121120182500.GH1408@quack.suse.cz> <1353485020.53500.YahooMailNeo@web141104.mail.bf1.yahoo.com> <1353485630.17455.YahooMailNeo@web141106.mail.bf1.yahoo.com> <50AC9220.70202@gmail.com> <20121121090204.GA9064@localhost> <50ACA209.9000101@gmail.com> <20121122152611.GA11736@localhost> In-Reply-To: <20121122152611.GA11736@localhost> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Fengguang Wu Cc: metin d , Jan Kara , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" On 11/22/2012 11:26 PM, Fengguang Wu wrote: > Hi Jaegeuk, > > Sorry for the delay. I'm traveling these days.. 
> > On Wed, Nov 21, 2012 at 05:42:33PM +0800, Jaegeuk Hanse wrote: >> On 11/21/2012 05:02 PM, Fengguang Wu wrote: >>> On Wed, Nov 21, 2012 at 04:34:40PM +0800, Jaegeuk Hanse wrote: >>>> Cc Fengguang Wu. >>>> >>>> On 11/21/2012 04:13 PM, metin d wrote: >>>>>> Curious. Added linux-mm list to CC to catch more attention. If you run >>>>>> echo 1 >/proc/sys/vm/drop_caches does it evict data-1 pages from memory? >>>>> I'm guessing it'd evict the entries, but am wondering if we could run any more diagnostics before trying this. >>>>> >>>>> We regularly use a setup where we have two databases; one gets used frequently and the other one about once a month. It seems like the memory manager keeps unused pages in memory at the expense of frequently used database's performance. >>>>> My understanding was that under memory pressure from heavily >>>>> accessed pages, unused pages would eventually get evicted. Is there >>>>> anything else we can try on this host to understand why this is >>>>> happening? >>> We may debug it this way. >>> >>> 1) run 'fadvise data-2 0 0 dontneed' to drop data-2 cached pages >>> (please double check via /proc/vmstat whether it does the expected work) >>> >>> 2) run 'page-types -r' with root, to view the page status for the >>> remaining pages of data-1 >>> >>> The fadvise tool comes from Andrew Morton's ext3-tools. (source code attached) >>> Please compile them with options "-Dlinux -I. -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE" >>> >>> page-types can be found in the kernel source tree tools/vm/page-types.c >>> >>> Sorry that sounds a bit twisted.. I do have a patch to directly dump >>> page cache status of a user specified file, however it's not >>> upstreamed yet. >> Hi Fengguang, >> >> Thanks for you detail steps, I think metin can have a try. 
>> >> flags page-count MB symbolic-flags long-symbolic-flags >> 0x0000000000000000 607699 2373 >> ___________________________________ >> 0x0000000100000000 343227 1340 >> _______________________r___________ reserved > > We don't need to care about the above two pages states actually. > Page cache pages will never be in the special reserved or > all-flags-cleared state. Hi Fengguang, Thanks for your response. But which kind of pages are in the special reserved and which are all-flags-cleared? Regards, Jaegeuk > >> But I have some questions of the print of page-type: >> >> Is 2373MB here mean total memory in used include page cache? I don't >> think so. >> Which kind of pages will be marked reserved? >> Which line of long-symbolic-flags is for page cache? > The (lru && !anonymous) pages are page cache pages. > > Thanks, > Fengguang > >>>>> On Tue 20-11-12 09:42:42, metin d wrote: >>>>>> I have two PostgreSQL databases named data-1 and data-2 that sit on the >>>>>> same machine. Both databases keep 40 GB of data, and the total memory >>>>>> available on the machine is 68GB. >>>>>> >>>>>> I started data-1 and data-2, and ran several queries to go over all their >>>>>> data. Then, I shut down data-1 and kept issuing queries against data-2. >>>>>> For some reason, the OS still holds on to large parts of data-1's pages >>>>>> in its page cache, and reserves about 35 GB of RAM to data-2's files. As >>>>>> a result, my queries on data-2 keep hitting disk. >>>>>> >>>>>> I'm checking page cache usage with fincore. When I run a table scan query >>>>>> against data-2, I see that data-2's pages get evicted and put back into >>>>>> the cache in a round-robin manner. Nothing happens to data-1's pages, >>>>>> although they haven't been touched for days. >>>>>> >>>>>> Does anybody know why data-1's pages aren't evicted from the page cache? >>>>>> I'm open to all kind of suggestions you think it might relate to problem. >>>>> Curious. 
Added linux-mm list to CC to catch more attention. If you run >>>>> echo 1 >/proc/sys/vm/drop_caches >>>>> does it evict data-1 pages from memory? >>>>> >>>>>> This is an EC2 m2.4xlarge instance on Amazon with 68 GB of RAM and no >>>>>> swap space. The kernel version is: >>>>>> >>>>>> $ uname -r >>>>>> 3.2.28-45.62.amzn1.x86_64 >>>>>> Edit: >>>>>> >>>>>> and it seems that I use one NUMA instance, if you think that it can a problem. >>>>>> >>>>>> $ numactl --hardware >>>>>> available: 1 nodes (0) >>>>>> node 0 cpus: 0 1 2 3 4 5 6 7 >>>>>> node 0 size: 70007 MB >>>>>> node 0 free: 360 MB >>>>>> node distances: >>>>>> node 0 >>>>>> 0: 10 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx187.postini.com [74.125.245.187]) by kanga.kvack.org (Postfix) with SMTP id D78E86B0070 for ; Thu, 22 Nov 2012 20:58:50 -0500 (EST) Received: by mail-ia0-f169.google.com with SMTP id r4so7618039iaj.14 for ; Thu, 22 Nov 2012 17:58:50 -0800 (PST) Message-ID: <50AED854.7080300@gmail.com> Date: Fri, 23 Nov 2012 09:58:44 +0800 From: Jaegeuk Hanse MIME-Version: 1.0 Subject: Re: Problem in Page Cache Replacement References: <1353433362.85184.YahooMailNeo@web141101.mail.bf1.yahoo.com> <20121120182500.GH1408@quack.suse.cz> In-Reply-To: <20121120182500.GH1408@quack.suse.cz> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: metin d Cc: Jan Kara , "linux-kernel@vger.kernel.org" , linux-mm@kvack.org On 11/21/2012 02:25 AM, Jan Kara wrote: > On Tue 20-11-12 09:42:42, metin d wrote: >> I have two PostgreSQL databases named data-1 and data-2 that sit on the >> same machine. Both databases keep 40 GB of data, and the total memory >> available on the machine is 68GB. 
>> >> I started data-1 and data-2, and ran several queries to go over all their >> data. Then, I shut down data-1 and kept issuing queries against data-2. >> For some reason, the OS still holds on to large parts of data-1's pages >> in its page cache, and reserves about 35 GB of RAM to data-2's files. As >> a result, my queries on data-2 keep hitting disk. >> >> I'm checking page cache usage with fincore. When I run a table scan query >> against data-2, I see that data-2's pages get evicted and put back into >> the cache in a round-robin manner. Nothing happens to data-1's pages, >> although they haven't been touched for days. Hi metin d, fincore is a tool or ...? How could I get it? Regards, Jaegeuk >> >> Does anybody know why data-1's pages aren't evicted from the page cache? >> I'm open to all kind of suggestions you think it might relate to problem. > Curious. Added linux-mm list to CC to catch more attention. If you run > echo 1 >/proc/sys/vm/drop_caches > does it evict data-1 pages from memory? > >> This is an EC2 m2.4xlarge instance on Amazon with 68 GB of RAM and no >> swap space. The kernel version is: >> >> $ uname -r >> 3.2.28-45.62.amzn1.x86_64 >> Edit: >> >> and it seems that I use one NUMA instance, if you think that it can a problem. >> >> $ numactl --hardware >> available: 1 nodes (0) >> node 0 cpus: 0 1 2 3 4 5 6 7 >> node 0 size: 70007 MB >> node 0 free: 360 MB >> node distances: >> node 0 >> 0: 10 > Honza -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx119.postini.com [74.125.245.119]) by kanga.kvack.org (Postfix) with SMTP id 567B46B005D for ; Thu, 22 Nov 2012 21:10:33 -0500 (EST) Received: by mail-ia0-f169.google.com with SMTP id r4so7623774iaj.14 for ; Thu, 22 Nov 2012 18:10:32 -0800 (PST) Message-ID: <50AEDB12.6090300@gmail.com> Date: Fri, 23 Nov 2012 10:10:26 +0800 From: Jaegeuk Hanse MIME-Version: 1.0 Subject: Re: Problem in Page Cache Replacement References: <20121120182500.GH1408@quack.suse.cz> <1353485020.53500.YahooMailNeo@web141104.mail.bf1.yahoo.com> <1353485630.17455.YahooMailNeo@web141106.mail.bf1.yahoo.com> <50AC9220.70202@gmail.com> <20121121090204.GA9064@localhost> <50ACA209.9000101@gmail.com> <1353491880.11679.YahooMailNeo@web141102.mail.bf1.yahoo.com> <50ACA634.5000007@gmail.com> <20121122154107.GB11736@localhost> <20121122155318.GA12636@localhost> In-Reply-To: <20121122155318.GA12636@localhost> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit Sender: owner-linux-mm@kvack.org List-ID: To: Fengguang Wu Cc: =?UTF-8?B?TWV0aW4gRMO2xZ9sw7w=?= , Jan Kara , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" On 11/22/2012 11:53 PM, Fengguang Wu wrote: > On Thu, Nov 22, 2012 at 11:41:07PM +0800, Fengguang Wu wrote: >> On Wed, Nov 21, 2012 at 12:07:22PM +0200, Metin Döşlü wrote: >>> On Wed, Nov 21, 2012 at 12:00 PM, Jaegeuk Hanse wrote: >>>> On 11/21/2012 05:58 PM, metin d wrote: >>>> >>>> Hi Fengguang, >>>> >>>> I run tests and attached the results. The line below I guess shows the data-1 page caches. >>>> >>>> 0x000000080000006c 6584051 25718 __RU_lA___________________P________ referenced,uptodate,lru,active,private >>>> >>>> >>>> I think this is just one state of page cache pages. >>> But why these page caches are in this state as opposed to other page >>> caches.
From the results I conclude that: >>> >>> data-1 pages are in state : referenced,uptodate,lru,active,private >> I wonder if it's this code that stops data-1 pages from being >> reclaimed: >> >> shrink_page_list(): >> >> if (page_has_private(page)) { >> if (!try_to_release_page(page, sc->gfp_mask)) >> goto activate_locked; >> >> What's the filesystem used? > Ah it's more likely caused by this logic: > > if (is_active_lru(lru)) { > if (inactive_list_is_low(mz, file)) > shrink_active_list(nr_to_scan, mz, sc, priority, file); > > The active file list won't be scanned at all if it's smaller than the > inactive list. In this case, it's inactive=33586MB > active=25719MB. So > the data-1 pages in the active list will never be scanned and reclaimed. Hi Fengguang, It seems that most of data-1 file pages are in active lru cache and most of data-2 file pages are in inactive lru cache. As Johannes mentioned, if inter-reference distance is bigger than half of memory, the pages will not be activated. How do you intend to resolve this issue? Is Johannes's inactive list thrashing idea available? Regards, Jaegeuk > >>> data-2 pages are in state : referenced,uptodate,lru,mappedtodisk >> Thanks, >> Fengguang -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx128.postini.com [74.125.245.128]) by kanga.kvack.org (Postfix) with SMTP id 67AB56B005D for ; Thu, 22 Nov 2012 21:14:13 -0500 (EST) Received: by mail-ie0-f169.google.com with SMTP id 10so15812919ied.14 for ; Thu, 22 Nov 2012 18:14:12 -0800 (PST) Message-ID: <50AEDBEF.8070408@gmail.com> Date: Fri, 23 Nov 2012 10:14:07 +0800 From: Jaegeuk Hanse MIME-Version: 1.0 Subject: Re: Problem in Page Cache Replacement References: <1353433362.85184.YahooMailNeo@web141101.mail.bf1.yahoo.com> <20121120182500.GH1408@quack.suse.cz> <20121121213417.GC24381@cmpxchg.org> <50AD7647.7050200@gmail.com> <20121122010959.GF24381@cmpxchg.org> <50AE25AB.2060808@gmail.com> <20121122161743.GH24381@cmpxchg.org> In-Reply-To: <20121122161743.GH24381@cmpxchg.org> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Johannes Weiner Cc: Jan Kara , metin d , "linux-kernel@vger.kernel.org" , linux-mm@kvack.org On 11/23/2012 12:17 AM, Johannes Weiner wrote: > On Thu, Nov 22, 2012 at 09:16:27PM +0800, Jaegeuk Hanse wrote: >> On 11/22/2012 09:09 AM, Johannes Weiner wrote: >>> On Thu, Nov 22, 2012 at 08:48:07AM +0800, Jaegeuk Hanse wrote: >>>> On 11/22/2012 05:34 AM, Johannes Weiner wrote: >>>>> Hi, >>>>> >>>>> On Tue, Nov 20, 2012 at 07:25:00PM +0100, Jan Kara wrote: >>>>>> On Tue 20-11-12 09:42:42, metin d wrote: >>>>>>> I have two PostgreSQL databases named data-1 and data-2 that sit on the >>>>>>> same machine. Both databases keep 40 GB of data, and the total memory >>>>>>> available on the machine is 68GB. >>>>>>> >>>>>>> I started data-1 and data-2, and ran several queries to go over all their >>>>>>> data. Then, I shut down data-1 and kept issuing queries against data-2. 
>>>>>>> For some reason, the OS still holds on to large parts of data-1's pages >>>>>>> in its page cache, and reserves about 35 GB of RAM to data-2's files. As >>>>>>> a result, my queries on data-2 keep hitting disk. >>>>>>> >>>>>>> I'm checking page cache usage with fincore. When I run a table scan query >>>>>>> against data-2, I see that data-2's pages get evicted and put back into >>>>>>> the cache in a round-robin manner. Nothing happens to data-1's pages, >>>>>>> although they haven't been touched for days. >>>>>>> >>>>>>> Does anybody know why data-1's pages aren't evicted from the page cache? >>>>>>> I'm open to all kind of suggestions you think it might relate to problem. >>>>> This might be because we do not deactive pages as long as there is >>>>> cache on the inactive list. I'm guessing that the inter-reference >>>>> distance of data-2 is bigger than half of memory, so it's never >>>>> getting activated and data-1 is never challenged. >>>> Hi Johannes, >>>> >>>> What's the meaning of "inter-reference distance" >>> It's the number of memory accesses between two accesses to the same >>> page: >>> >>> A B C D A B C E ... >>> |_______| >>> | | >>> >>>> and why compare it with half of memoy, what's the trick? >>> If B gets accessed twice, it gets activated. If it gets evicted in >>> between, the second access will be a fresh page fault and B will not >>> be recognized as frequently used. >>> >>> Our cutoff for scanning the active list is cache size / 2 right now >>> (inactive_file_is_low), leaving 50% of memory to the inactive list. >>> If the inter-reference distance for pages on the inactive list is >>> bigger than that, they get evicted before their second access. >> Hi Johannes, >> >> Thanks for your explanation. But could you give a short description >> of how you resolve this inactive list thrashing issues? 
> I remember a time stamp of evicted file pages in the page cache radix
> tree that let me reconstruct the inter-reference distance even after a
> page has been evicted from cache when it's faulted back in. This way
> I can tell a one-time sequence from thrashing, no matter how small the
> inactive list.
>
> When thrashing is detected, I start deactivating protected pages and
> put them next to the refaulted cache on the head of the inactive list
> and let them fight it out as usual. In this reported case, the old
> data will be challenged and since it's no longer used, it will just
> drop off the inactive list eventually. If the guess is wrong and the
> deactivated memory is used more heavily than the refaulting pages,
> they will just get activated again without incurring any disruption
> like a major fault.

Hi Johannes,

Do you also add the time stamp to the protected pages that you
deactivate when thrashing occurs?

Regards,
Jaegeuk
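A minimal sketch of the behaviour Johannes describes, written as a toy two-list LRU in Python rather than the kernel's actual reclaim code. The names TwoListLRU and run_scenario and the sizes (68 cache slots standing in for 68 GB, 40 pages per database) are illustrative assumptions; the only rule taken from the thread is that the active list is scanned when it has grown past half of the cache, i.e. when it is larger than the inactive list.

```python
from collections import OrderedDict


class TwoListLRU:
    """Toy model of a file LRU split into an inactive and an active list."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.inactive = OrderedDict()  # oldest entry first
        self.active = OrderedDict()    # oldest entry first

    def inactive_is_low(self):
        # Assumed simplification of inactive_file_is_low: the active list
        # is only scanned when it is larger than the inactive list.
        return len(self.active) > len(self.inactive)

    def _reclaim_one(self):
        if self.inactive_is_low():
            # Demote the oldest active page to the head of the inactive list.
            page, _ = self.active.popitem(last=False)
            self.inactive[page] = None
        # Evict the oldest inactive page.
        self.inactive.popitem(last=False)

    def access(self, page):
        """Touch a page; return True on a cache hit, False on a miss."""
        if page in self.active:
            self.active.move_to_end(page)
            return True
        if page in self.inactive:
            # Second access while still cached: promote to the active list.
            del self.inactive[page]
            self.active[page] = None
            return True
        if len(self.active) + len(self.inactive) >= self.capacity:
            self._reclaim_one()
        self.inactive[page] = None
        return False


def run_scenario(capacity=68, db_pages=40, scans=5):
    """Warm up data-1, then repeatedly table-scan data-2 only."""
    lru = TwoListLRU(capacity)
    data1 = [("data-1", i) for i in range(db_pages)]
    data2 = [("data-2", i) for i in range(db_pages)]
    # Two passes over data-1: the second access activates every page.
    for _ in range(2):
        for page in data1:
            lru.access(page)
    hits = 0
    for _ in range(scans):
        for page in data2:
            hits += lru.access(page)
    stale = sum(1 for name, _ in lru.active if name == "data-1")
    return hits, stale, len(lru.inactive)
```

With these defaults run_scenario() returns (0, 34, 34): the scans of data-2 never hit the cache, because each page is evicted from the 34-slot inactive list before its next access 40 references later, while 34 untouched data-1 pages stay on the active list. That is roughly the half-and-half split reported in the thread.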
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx128.postini.com [74.125.245.128]) by kanga.kvack.org (Postfix) with SMTP id 2FF426B0044 for ; Thu, 22 Nov 2012 21:26:08 -0500 (EST)
Date: Fri, 23 Nov 2012 10:25:57 +0800
From: Fengguang Wu
Subject: Re: Problem in Page Cache Replacement
Message-ID: <20121123022557.GA3954@localhost>
References: <1353433362.85184.YahooMailNeo@web141101.mail.bf1.yahoo.com> <20121120182500.GH1408@quack.suse.cz> <1353485020.53500.YahooMailNeo@web141104.mail.bf1.yahoo.com> <1353485630.17455.YahooMailNeo@web141106.mail.bf1.yahoo.com> <50AC9220.70202@gmail.com> <20121121090204.GA9064@localhost> <50ACA209.9000101@gmail.com> <20121122152611.GA11736@localhost> <50AED214.4000701@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <50AED214.4000701@gmail.com>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Jaegeuk Hanse
Cc: metin d , Jan Kara , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org"

Jaegeuk,

> Thanks for your response. But which kind of pages are in the special
> reserved and which are all-flags-cleared?

The all-flags-cleared pages are mostly free pages in the buddy system.
The pages with flag "buddy" are also free pages: the buddy system only
marks the head page of each order-N free range with flag "buddy".

The reserved pages come from many sources: they may be set for memory
reserved for BIOS, memory holes, offlined memory, or used by some device
drivers.

Thanks,
Fengguang
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx153.postini.com [74.125.245.153]) by kanga.kvack.org (Postfix) with SMTP id 72F856B005D for ; Fri, 23 Nov 2012 03:08:45 -0500 (EST)
References: <1353433362.85184.YahooMailNeo@web141101.mail.bf1.yahoo.com> <20121120182500.GH1408@quack.suse.cz> <50AED854.7080300@gmail.com>
Message-ID: <1353658123.36385.YahooMailNeo@web141101.mail.bf1.yahoo.com>
Date: Fri, 23 Nov 2012 00:08:43 -0800 (PST)
From: metin d
Reply-To: metin d
Subject: Re: Problem in Page Cache Replacement
In-Reply-To: <50AED854.7080300@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: quoted-printable
Sender: owner-linux-mm@kvack.org
List-ID:
To: Jaegeuk Hanse
Cc: Jan Kara , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org"

----- Original Message -----

From: Jaegeuk Hanse
To: metin d
Cc: Jan Kara ; "linux-kernel@vger.kernel.org" ; linux-mm@kvack.org
Sent: Friday, November 23, 2012 3:58 AM
Subject: Re: Problem in Page Cache Replacement

On 11/21/2012 02:25 AM, Jan Kara wrote:
> On Tue 20-11-12 09:42:42, metin d wrote:
>> I have two PostgreSQL databases named data-1 and data-2 that sit on the
>> same machine. Both databases keep 40 GB of data, and the total memory
>> available on the machine is 68GB.
>>
>> I started data-1 and data-2, and ran several queries to go over all their
>> data. Then, I shut down data-1 and kept issuing queries against data-2.
>> For some reason, the OS still holds on to large parts of data-1's pages
>> in its page cache, and reserves about 35 GB of RAM to data-2's files. As
>> a result, my queries on data-2 keep hitting disk.
>>
>> I'm checking page cache usage with fincore. When I run a table scan query
>> against data-2, I see that data-2's pages get evicted and put back into
>> the cache in a round-robin manner. Nothing happens to data-1's pages,
>> although they haven't been touched for days.

> Hi metin d,

> fincore is a tool or ...? How could I get it?

> Regards,
> Jaegeuk


Hi Jaegeuk,

Yes, it is a tool, you get it from here :
http://code.google.com/p/linux-ftools/


Regards,
Metin
>>
>> Does anybody know why data-1's pages aren't evicted from the page cache?
>> I'm open to all kind of suggestions you think it might relate to problem.
>    Curious. Added linux-mm list to CC to catch more attention. If you run
> echo 1 >/proc/sys/vm/drop_caches
>    does it evict data-1 pages from memory?
>
>> This is an EC2 m2.4xlarge instance on Amazon with 68 GB of RAM and no
>> swap space. The kernel version is:
>>
>> $ uname -r
>> 3.2.28-45.62.amzn1.x86_64
>> Edit:
>>
>> and it seems that I use one NUMA instance, if you think that it can a problem.
>>
>> $ numactl --hardware
>> available: 1 nodes (0)
>> node 0 cpus: 0 1 2 3 4 5 6 7
>> node 0 size: 70007 MB
>> node 0 free: 360 MB
>> node distances:
>> node  0
>>    0:  10
>                                 Honza
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx173.postini.com [74.125.245.173]) by kanga.kvack.org (Postfix) with SMTP id 9EB3D6B006C for ; Fri, 23 Nov 2012 03:18:04 -0500 (EST) Received: by mail-da0-f41.google.com with SMTP id e20so2550586dak.14 for ; Fri, 23 Nov 2012 00:18:04 -0800 (PST) Message-ID: <50AF3134.3090803@gmail.com> Date: Fri, 23 Nov 2012 16:17:56 +0800 From: Jaegeuk Hanse MIME-Version: 1.0 Subject: Re: Problem in Page Cache Replacement References: <1353433362.85184.YahooMailNeo@web141101.mail.bf1.yahoo.com> <20121120182500.GH1408@quack.suse.cz> <50AED854.7080300@gmail.com> <1353658123.36385.YahooMailNeo@web141101.mail.bf1.yahoo.com> In-Reply-To: <1353658123.36385.YahooMailNeo@web141101.mail.bf1.yahoo.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: metin d Cc: Jan Kara , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" On 11/23/2012 04:08 PM, metin d wrote: > ----- Original Message ----- > > From: Jaegeuk Hanse > To: metin d > Cc: Jan Kara ; "linux-kernel@vger.kernel.org" ; linux-mm@kvack.org > Sent: Friday, November 23, 2012 3:58 AM > Subject: Re: Problem in Page Cache Replacement > > On 11/21/2012 02:25 AM, Jan Kara wrote: >> On Tue 20-11-12 09:42:42, metin d wrote: >>> I have two PostgreSQL databases named data-1 and data-2 that sit on the >>> same machine. Both databases keep 40 GB of data, and the total memory >>> available on the machine is 68GB. >>> >>> I started data-1 and data-2, and ran several queries to go over all their >>> data. Then, I shut down data-1 and kept issuing queries against data-2. >>> For some reason, the OS still holds on to large parts of data-1's pages >>> in its page cache, and reserves about 35 GB of RAM to data-2's files. As >>> a result, my queries on data-2 keep hitting disk. >>> >>> I'm checking page cache usage with fincore. 
When I run a table scan query >>> against data-2, I see that data-2's pages get evicted and put back into >>> the cache in a round-robin manner. Nothing happens to data-1's pages, >>> although they haven't been touched for days. >> Hi metin d, >> fincore is a tool or ...? How could I get it? >> Regards, >> Jaegeuk > > Hi Jaegeuk, > > Yes, it is a tool, you get it from here : > http://code.google.com/p/linux-ftools/ Hi Metin, Could you give me a link to download it? I can't get it from the link you give me. Thanks in advance. :-) Regards, Jaegeuk > > > Regards, > Metin >>> Does anybody know why data-1's pages aren't evicted from the page cache? >>> I'm open to all kind of suggestions you think it might relate to problem. >> Curious. Added linux-mm list to CC to catch more attention. If you run >> echo 1 >/proc/sys/vm/drop_caches >> does it evict data-1 pages from memory? >> >>> This is an EC2 m2.4xlarge instance on Amazon with 68 GB of RAM and no >>> swap space. The kernel version is: >>> >>> $ uname -r >>> 3.2.28-45.62.amzn1.x86_64 >>> Edit: >>> >>> and it seems that I use one NUMA instance, if you think that it can a problem. >>> >>> $ numactl --hardware >>> available: 1 nodes (0) >>> node 0 cpus: 0 1 2 3 4 5 6 7 >>> node 0 size: 70007 MB >>> node 0 free: 360 MB >>> node distances: >>> node 0 >>> 0: 10 >> Honza -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx104.postini.com [74.125.245.104]) by kanga.kvack.org (Postfix) with SMTP id 90CEC6B0071 for ; Fri, 23 Nov 2012 03:25:17 -0500 (EST)
References: <1353433362.85184.YahooMailNeo@web141101.mail.bf1.yahoo.com> <20121120182500.GH1408@quack.suse.cz> <50AED854.7080300@gmail.com> <1353658123.36385.YahooMailNeo@web141101.mail.bf1.yahoo.com> <50AF3134.3090803@gmail.com>
Message-ID: <1353659115.24777.YahooMailNeo@web141102.mail.bf1.yahoo.com>
Date: Fri, 23 Nov 2012 00:25:15 -0800 (PST)
From: metin d
Reply-To: metin d
Subject: Re: Problem in Page Cache Replacement
In-Reply-To: <50AF3134.3090803@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: quoted-printable
Sender: owner-linux-mm@kvack.org
List-ID:
To: Jaegeuk Hanse
Cc: Jan Kara , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org"

----- Original Message -----

From: Jaegeuk Hanse
To: metin d
Cc: Jan Kara ; "linux-kernel@vger.kernel.org" ; "linux-mm@kvack.org"
Sent: Friday, November 23, 2012 10:17 AM
Subject: Re: Problem in Page Cache Replacement

On 11/23/2012 04:08 PM, metin d wrote:
> ----- Original Message -----
>
> From: Jaegeuk Hanse
> To: metin d
> Cc: Jan Kara <jack@suse.cz>; "linux-kernel@vger.kernel.org" ; linux-mm@kvack.org
> Sent: Friday, November 23, 2012 3:58 AM
> Subject: Re: Problem in Page Cache Replacement
>
> On 11/21/2012 02:25 AM, Jan Kara wrote:
>> On Tue 20-11-12 09:42:42, metin d wrote:
>>> I have two PostgreSQL databases named data-1 and data-2 that sit on the
>>> same machine. Both databases keep 40 GB of data, and the total memory
>>> available on the machine is 68GB.
>>>
>>> I started data-1 and data-2, and ran several queries to go over all their
>>> data. Then, I shut down data-1 and kept issuing queries against data-2.
>>> For some reason, the OS still holds on to large parts of data-1's pages
>>> in its page cache, and reserves about 35 GB of RAM to data-2's files. As
>>> a result, my queries on data-2 keep hitting disk.
>>>
>>> I'm checking page cache usage with fincore. When I run a table scan query
>>> against data-2, I see that data-2's pages get evicted and put back into
>>> the cache in a round-robin manner. Nothing happens to data-1's pages,
>>> although they haven't been touched for days.
>> Hi metin d,
>> fincore is a tool or ...? How could I get it?
>> Regards,
>> Jaegeuk
>
> Hi Jaegeuk,
>
> Yes, it is a tool, you get it from here :
> http://code.google.com/p/linux-ftools/


> Hi Metin,

> Could you give me a link to download it? I can't get it from the link
> you give me. Thanks in advance. :-)

> Regards,
> Jaegeuk

Hi Jaegeuk,

You may need to install mercurial on your system, I'm able to download source code with this command:

hg clone https://code.google.com/p/linux-ftools/


Regards,
Metin

>
>
> Regards,
> Metin
>>> Does anybody know why data-1's pages aren't evicted from the page cache?
>>> I'm open to all kind of suggestions you think it might relate to problem.
>>      Curious. Added linux-mm list to CC to catch more attention. If you run
>> echo 1 >/proc/sys/vm/drop_caches
>>      does it evict data-1 pages from memory?
>>
>>> This is an EC2 m2.4xlarge instance on Amazon with 68 GB of RAM and no
>>> swap space. The kernel version is:
>>>
>>> $ uname -r
>>> 3.2.28-45.62.amzn1.x86_64
>>> Edit:
>>>
>>> and it seems that I use one NUMA instance, if you think that it can a problem.
>>>
>>> $ numactl --hardware
>>> available: 1 nodes (0)
>>> node 0 cpus: 0 1 2 3 4 5 6 7
>>> node 0 size: 70007 MB
>>> node 0 free: 360 MB
>>> node distances:
>>> node  0
>>>      0:  10
>>                                  Honza

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx195.postini.com [74.125.245.195]) by kanga.kvack.org (Postfix) with SMTP id 069346B0044 for ; Sat, 24 Nov 2012 10:06:30 -0500 (EST)
Received: by mail-qa0-f48.google.com with SMTP id s11so2295273qaa.14 for ; Sat, 24 Nov 2012 07:06:30 -0800 (PST)
MIME-Version: 1.0
In-Reply-To: <20121122154107.GB11736@localhost>
References: <1353433362.85184.YahooMailNeo@web141101.mail.bf1.yahoo.com> <20121120182500.GH1408@quack.suse.cz> <1353485020.53500.YahooMailNeo@web141104.mail.bf1.yahoo.com> <1353485630.17455.YahooMailNeo@web141106.mail.bf1.yahoo.com> <50AC9220.70202@gmail.com> <20121121090204.GA9064@localhost> <50ACA209.9000101@gmail.com> <1353491880.11679.YahooMailNeo@web141102.mail.bf1.yahoo.com> <50ACA634.5000007@gmail.com> <20121122154107.GB11736@localhost>
From: =?UTF-8?B?TWV0aW4gRMO2xZ9sw7w=?=
Date: Sat, 24 Nov 2012 17:06:09 +0200
Message-ID:
Subject: Re: Problem in Page Cache Replacement
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Sender: owner-linux-mm@kvack.org
List-ID:
To: Fengguang Wu
Cc: Jaegeuk Hanse , Jan Kara , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org"

On Thu, Nov 22, 2012 at 5:41 PM, Fengguang Wu wrote:
> On Wed, Nov 21, 2012 at 12:07:22PM +0200, Metin Döşlü wrote:
>> On Wed, Nov 21, 2012 at 12:00 PM, Jaegeuk Hanse wrote:
>> >
>> > On 11/21/2012 05:58 PM, metin d wrote:
>> >
>> > Hi Fengguang,
>> >
>> > I ran tests and attached the results. The line below I guess shows the data-1 page caches.
>> >
>> > 0x000000080000006c 6584051 25718 __RU_lA___________________P________ referenced,uptodate,lru,active,private
>> >
>> >
>> > I think this is just one state of page cache pages.
>>
>> But why these page caches are in this state as opposed to other page
>> caches. From the results I conclude that:
>>
>> data-1 pages are in state : referenced,uptodate,lru,active,private
>
> I wonder if it's this code that stops data-1 pages from being
> reclaimed:
>
> shrink_page_list():
>
>         if (page_has_private(page)) {
>                 if (!try_to_release_page(page, sc->gfp_mask))
>                         goto activate_locked;
>
> What's the filesystem used?

It was ext3.

>> data-2 pages are in state : referenced,uptodate,lru,mappedtodisk
>
> Thanks,
> Fengguang

--
Metin Döşlü
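The flags word in the line Metin quotes can be decoded by hand. A small sketch, assuming the bit numbers used by /proc/kpageflags and by the kernel's page-types tool (bits 32 and up are kernel-internal and may differ between kernel versions) and a 4 KiB page size. The names KPF_BITS, decode_flags and pages_to_mb are made up for illustration, and only a subset of the bits is listed.

```python
# Subset of page flag bit numbers, assumed from /proc/kpageflags and the
# page-types tool; bits >= 32 are kernel-internal "hacking" bits.
KPF_BITS = {
    0: "locked",
    1: "error",
    2: "referenced",
    3: "uptodate",
    4: "dirty",
    5: "lru",
    6: "active",
    7: "slab",
    8: "writeback",
    9: "reclaim",
    10: "buddy",
    32: "reserved",
    33: "mlocked",
    34: "mappedtodisk",
    35: "private",
}


def decode_flags(flags):
    """Return the names of all known bits set in a kpageflags word."""
    return [name for bit, name in sorted(KPF_BITS.items()) if (flags >> bit) & 1]


def pages_to_mb(pages, page_size=4096):
    """Convert a page count to whole megabytes (rounded down)."""
    return pages * page_size // (1 << 20)


# Values taken from the quoted page-types output line.
print(decode_flags(0x000000080000006C))
print(pages_to_mb(6584051))
```

The decoded names match the symbolic column of the quoted line (referenced, uptodate, lru, active, private), and 6584051 pages of 4 KiB come to 25718 MB, the figure shown next to the page count.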
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx140.postini.com [74.125.245.140]) by kanga.kvack.org (Postfix) with SMTP id 69E196B005A for ; Sun, 25 Nov 2012 15:08:56 -0500 (EST)
Message-ID: <50B27AD1.6010703@redhat.com>
Date: Sun, 25 Nov 2012 15:08:49 -0500
From: Rik van Riel
MIME-Version: 1.0
Subject: Re: Problem in Page Cache Replacement
References: <20121120182500.GH1408@quack.suse.cz> <1353485020.53500.YahooMailNeo@web141104.mail.bf1.yahoo.com> <1353485630.17455.YahooMailNeo@web141106.mail.bf1.yahoo.com> <50AC9220.70202@gmail.com> <20121121090204.GA9064@localhost> <50ACA209.9000101@gmail.com> <1353491880.11679.YahooMailNeo@web141102.mail.bf1.yahoo.com> <50ACA634.5000007@gmail.com> <20121122154107.GB11736@localhost> <20121122155318.GA12636@localhost>
In-Reply-To: <20121122155318.GA12636@localhost>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: Fengguang Wu
Cc: =?UTF-8?B?TWV0aW4gRMO2xZ9sw7w=?= , Jaegeuk Hanse , Jan Kara , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , Johannes Weiner

On 11/22/2012 10:53 AM, Fengguang Wu wrote:

> Ah it's more likely caused by this logic:
>
>         if (is_active_lru(lru)) {
>                 if (inactive_list_is_low(mz, file))
>                         shrink_active_list(nr_to_scan, mz, sc, priority, file);
>
> The active file list won't be scanned at all if it's smaller than the
> inactive list. In this case, it's inactive=33586MB > active=25719MB. So
> the data-1 pages in the active list will never be scanned and reclaimed.

That's it, indeed. The reason we have that code is that otherwise
one large streaming IO could easily end up evicting the entire
page cache working set.

Usually it works well, because the new page cache working set tends
to get touched twice while on the inactive list, and the old working
set gets demoted from the active list.
Only in a few very specific cases, where the inter-reference distance
of the new working set is larger than the size of the inactive list,
does it fail.

Something like Johannes's patches should solve the problem.

--
All rights reversed

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752113Ab2KTRmp (ORCPT ); Tue, 20 Nov 2012 12:42:45 -0500
Received: from nm4.bullet.mail.bf1.yahoo.com ([98.139.212.163]:32862 "EHLO nm4.bullet.mail.bf1.yahoo.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751238Ab2KTRmo convert rfc822-to-8bit (ORCPT ); Tue, 20 Nov 2012 12:42:44 -0500
X-Yahoo-Newman-Property: ymail-3
X-Yahoo-Newman-Id: 305692.73846.bm@omp1015.mail.bf1.yahoo.com
X-Mailer: YahooMailWebService/0.8.126.470
Message-ID: <1353433362.85184.YahooMailNeo@web141101.mail.bf1.yahoo.com>
Date: Tue, 20 Nov 2012 09:42:42 -0800 (PST)
From: metin d
Reply-To: metin d
Subject: Problem in Page Cache Replacement
To: "linux-kernel@vger.kernel.org"
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8BIT
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

I have two PostgreSQL databases named data-1 and data-2 that sit on the same machine. Both databases keep 40 GB of data, and the total memory available on the machine is 68GB.

I started data-1 and data-2, and ran several queries to go over all their data. Then, I shut down data-1 and kept issuing queries against data-2. For some reason, the OS still holds on to large parts of data-1's pages in its page cache, and reserves about 35 GB of RAM to data-2's files. As a result, my queries on data-2 keep hitting disk.

I'm checking page cache usage with fincore. When I run a table scan query against data-2, I see that data-2's pages get evicted and put back into the cache in a round-robin manner. Nothing happens to data-1's pages, although they haven't been touched for days.

Does anybody know why data-1's pages aren't evicted from the page cache? I'm open to all kind of suggestions you think it might relate to problem.

This is an EC2 m2.4xlarge instance on Amazon with 68 GB of RAM and no swap space.
The kernel version is: $ uname -r 3.2.28-45.62.amzn1.x86_64 Edit: and it seems that I use one NUMA instance, if  you think that it can a problem. $ numactl --hardware available: 1 nodes (0) node 0 cpus: 0 1 2 3 4 5 6 7 node 0 size: 70007 MB node 0 free: 360 MB node distances: node   0   0:  10 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753405Ab2KTSZW (ORCPT ); Tue, 20 Nov 2012 13:25:22 -0500 Received: from cantor2.suse.de ([195.135.220.15]:38129 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751582Ab2KTSZU (ORCPT ); Tue, 20 Nov 2012 13:25:20 -0500 Date: Tue, 20 Nov 2012 19:25:00 +0100 From: Jan Kara To: metin d Cc: "linux-kernel@vger.kernel.org" , linux-mm@kvack.org Subject: Re: Problem in Page Cache Replacement Message-ID: <20121120182500.GH1408@quack.suse.cz> References: <1353433362.85184.YahooMailNeo@web141101.mail.bf1.yahoo.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <1353433362.85184.YahooMailNeo@web141101.mail.bf1.yahoo.com> User-Agent: Mutt/1.5.20 (2009-06-14) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue 20-11-12 09:42:42, metin d wrote: > I have two PostgreSQL databases named data-1 and data-2 that sit on the > same machine. Both databases keep 40 GB of data, and the total memory > available on the machine is 68GB. > > I started data-1 and data-2, and ran several queries to go over all their > data. Then, I shut down data-1 and kept issuing queries against data-2. > For some reason, the OS still holds on to large parts of data-1's pages > in its page cache, and reserves about 35 GB of RAM to data-2's files. As > a result, my queries on data-2 keep hitting disk. > > I'm checking page cache usage with fincore. 
When I run a table scan query
> against data-2, I see that data-2's pages get evicted and put back into
> the cache in a round-robin manner. Nothing happens to data-1's pages,
> although they haven't been touched for days.
>
> Does anybody know why data-1's pages aren't evicted from the page cache?
> I'm open to all kind of suggestions you think it might relate to problem.
  Curious. Added linux-mm list to CC to catch more attention. If you run
echo 1 >/proc/sys/vm/drop_caches
  does it evict data-1 pages from memory?

> This is an EC2 m2.4xlarge instance on Amazon with 68 GB of RAM and no
> swap space. The kernel version is:
>
> $ uname -r
> 3.2.28-45.62.amzn1.x86_64
> Edit:
>
> and it seems that I use one NUMA instance, if you think that it can a problem.
>
> $ numactl --hardware
> available: 1 nodes (0)
> node 0 cpus: 0 1 2 3 4 5 6 7
> node 0 size: 70007 MB
> node 0 free: 360 MB
> node distances:
> node  0
>   0:  10

                                Honza
--
Jan Kara
SUSE Labs, CR

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753451Ab2KUIKC (ORCPT ); Wed, 21 Nov 2012 03:10:02 -0500
Received: from nm6-vm0.bullet.mail.bf1.yahoo.com ([98.139.213.146]:20908 "EHLO nm6-vm0.bullet.mail.bf1.yahoo.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751634Ab2KUIKB convert rfc822-to-8bit (ORCPT ); Wed, 21 Nov 2012 03:10:01 -0500
X-Greylist: delayed 379 seconds by postgrey-1.27 at vger.kernel.org; Wed, 21 Nov 2012 03:10:00 EST
X-Yahoo-Newman-Property: ymail-3
X-Yahoo-Newman-Id: 529279.44201.bm@omp1005.mail.bf1.yahoo.com
X-Mailer: YahooMailWebService/0.8.123.460
References: <1353433362.85184.YahooMailNeo@web141101.mail.bf1.yahoo.com> <20121120182500.GH1408@quack.suse.cz>
Message-ID: <1353485020.53500.YahooMailNeo@web141104.mail.bf1.yahoo.com>
Date: Wed, 21 Nov 2012 00:03:40 -0800 (PST)
From: metin d
Reply-To: metin d
Subject: Re: Problem in Page Cache Replacement
To: Jan Kara
Cc: "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org"
In-Reply-To: <20121120182500.GH1408@quack.suse.cz>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8BIT
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

>  Curious. Added linux-mm list to CC to catch more attention. If you run
> echo 1 >/proc/sys/vm/drop_caches does it evict data-1 pages from memory?
I'm guessing it'd evict the entries, but am wondering if we could run any more diagnostics before trying this. We regularly use a setup where we have two databases; one gets used frequently and the other one about once a month. It seems like the memory manager keeps unused pages in memory at the expense of frequently used database's performance. My understanding was that under memory pressure from heavily accessed pages, unused pages would eventually get evicted. Is there anything else we can try on this host to understand why this is happening? Thank you, Metin ----- Original Message ----- From: Jan Kara To: metin d Cc: "linux-kernel@vger.kernel.org" ; linux-mm@kvack.org Sent: Tuesday, November 20, 2012 8:25 PM Subject: Re: Problem in Page Cache Replacement On Tue 20-11-12 09:42:42, metin d wrote: > I have two PostgreSQL databases named data-1 and data-2 that sit on the > same machine. Both databases keep 40 GB of data, and the total memory > available on the machine is 68GB. > > I started data-1 and data-2, and ran several queries to go over all their > data. Then, I shut down data-1 and kept issuing queries against data-2. > For some reason, the OS still holds on to large parts of data-1's pages > in its page cache, and reserves about 35 GB of RAM to data-2's files. As > a result, my queries on data-2 keep hitting disk. > > I'm checking page cache usage with fincore. When I run a table scan query > against data-2, I see that data-2's pages get evicted and put back into > the cache in a round-robin manner. Nothing happens to data-1's pages, > although they haven't been touched for days. > > Does anybody know why data-1's pages aren't evicted from the page cache? > I'm open to all kind of suggestions you think it might relate to problem.   Curious. Added linux-mm list to CC to catch more attention. If you run echo 1 >/proc/sys/vm/drop_caches   does it evict data-1 pages from memory? 
> This is an EC2 m2.4xlarge instance on Amazon with 68 GB of RAM and no
> swap space. The kernel version is:
>
> $ uname -r
> 3.2.28-45.62.amzn1.x86_64
> Edit:
>
> and it seems that I use one NUMA instance, if you think that it can a problem.
>
> $ numactl --hardware
> available: 1 nodes (0)
> node 0 cpus: 0 1 2 3 4 5 6 7
> node 0 size: 70007 MB
> node 0 free: 360 MB
> node distances:
> node  0
>   0:  10

                                Honza
--
Jan Kara
SUSE Labs, CR

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753469Ab2KUINy (ORCPT ); Wed, 21 Nov 2012 03:13:54 -0500
Received: from nm1.bullet.mail.bf1.yahoo.com ([98.139.212.160]:27667 "EHLO nm1.bullet.mail.bf1.yahoo.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752126Ab2KUINv convert rfc822-to-8bit (ORCPT ); Wed, 21 Nov 2012 03:13:51 -0500
X-Yahoo-Newman-Property: ymail-3
X-Yahoo-Newman-Id: 754567.44531.bm@omp1012.mail.bf1.yahoo.com
X-Mailer: YahooMailWebService/0.8.123.460
References: <1353433362.85184.YahooMailNeo@web141101.mail.bf1.yahoo.com> <20121120182500.GH1408@quack.suse.cz> <1353485020.53500.YahooMailNeo@web141104.mail.bf1.yahoo.com>
Message-ID: <1353485630.17455.YahooMailNeo@web141106.mail.bf1.yahoo.com>
Date: Wed, 21 Nov 2012 00:13:50 -0800 (PST)
From: metin d
Reply-To: metin d
Subject: Re: Problem in Page Cache Replacement
To: Jan Kara
Cc: "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org"
In-Reply-To: <1353485020.53500.YahooMailNeo@web141104.mail.bf1.yahoo.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8BIT
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

>  Curious. Added linux-mm list to CC to catch more attention. If you run
> echo 1 >/proc/sys/vm/drop_caches does it evict data-1 pages from memory?

I'm guessing it'd evict the entries, but am wondering if we could run any more diagnostics before trying this.

We regularly use a setup where we have two databases; one gets used frequently and the other one about once a month. It seems like the memory manager keeps unused pages in memory at the expense of frequently used database's performance.

My understanding was that under memory pressure from heavily accessed pages, unused pages would eventually get evicted. Is there anything else we can try on this host to understand why this is happening?
Thank you, Metin On Tue 20-11-12 09:42:42, metin d wrote: > I have two PostgreSQL databases named data-1 and data-2 that sit on the > same machine. Both databases keep 40 GB of data, and the total memory > available on the machine is 68GB. > > I started data-1 and data-2, and ran several queries to go over all their > data. Then, I shut down data-1 and kept issuing queries against data-2. > For some reason, the OS still holds on to large parts of data-1's pages > in its page cache, and reserves about 35 GB of RAM to data-2's files. As > a result, my queries on data-2 keep hitting disk. > > I'm checking page cache usage with fincore. When I run a table scan query > against data-2, I see that data-2's pages get evicted and put back into > the cache in a round-robin manner. Nothing happens to data-1's pages, > although they haven't been touched for days. > > Does anybody know why data-1's pages aren't evicted from the page cache? > I'm open to all kind of suggestions you think it might relate to problem.   Curious. Added linux-mm list to CC to catch more attention. If you run echo 1 >/proc/sys/vm/drop_caches   does it evict data-1 pages from memory? > This is an EC2 m2.4xlarge instance on Amazon with 68 GB of RAM and no > swap space. The kernel version is: > > $ uname -r > 3.2.28-45.62.amzn1.x86_64 > Edit: > > and it seems that I use one NUMA instance, if  you think that it can a problem. 
> > $ numactl --hardware > available: 1 nodes (0) > node 0 cpus: 0 1 2 3 4 5 6 7 > node 0 size: 70007 MB > node 0 free: 360 MB > node distances: > node   0 >   0:  10 -- Jan Kara SUSE Labs, CR From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754012Ab2KUIet (ORCPT ); Wed, 21 Nov 2012 03:34:49 -0500 Received: from mail-ob0-f174.google.com ([209.85.214.174]:60086 "EHLO mail-ob0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753590Ab2KUIer (ORCPT ); Wed, 21 Nov 2012 03:34:47 -0500 Message-ID: <50AC9220.70202@gmail.com> Date: Wed, 21 Nov 2012 16:34:40 +0800 From: Jaegeuk Hanse User-Agent: Mozilla/5.0 (X11; Linux i686; rv:16.0) Gecko/20121028 Thunderbird/16.0.2 MIME-Version: 1.0 To: metin d , Fengguang Wu CC: Jan Kara , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" Subject: Re: Problem in Page Cache Replacement References: <1353433362.85184.YahooMailNeo@web141101.mail.bf1.yahoo.com> <20121120182500.GH1408@quack.suse.cz> <1353485020.53500.YahooMailNeo@web141104.mail.bf1.yahoo.com> <1353485630.17455.YahooMailNeo@web141106.mail.bf1.yahoo.com> In-Reply-To: <1353485630.17455.YahooMailNeo@web141106.mail.bf1.yahoo.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Cc Fengguang Wu. On 11/21/2012 04:13 PM, metin d wrote: >> Curious. Added linux-mm list to CC to catch more attention. If you run >> echo 1 >/proc/sys/vm/drop_caches does it evict data-1 pages from memory? > I'm guessing it'd evict the entries, but am wondering if we could run any more diagnostics before trying this. > > We regularly use a setup where we have two databases; one gets used frequently and the other one about once a month. It seems like the memory manager keeps unused pages in memory at the expense of frequently used database's performance. 
> > My understanding was that under memory pressure from heavily accessed pages, unused pages would eventually get evicted. Is there anything else we can try on this host to understand why this is happening? > > Thank you, > > Metin > > On Tue 20-11-12 09:42:42, metin d wrote: >> I have two PostgreSQL databases named data-1 and data-2 that sit on the >> same machine. Both databases keep 40 GB of data, and the total memory >> available on the machine is 68GB. >> >> I started data-1 and data-2, and ran several queries to go over all their >> data. Then, I shut down data-1 and kept issuing queries against data-2. >> For some reason, the OS still holds on to large parts of data-1's pages >> in its page cache, and reserves about 35 GB of RAM to data-2's files. As >> a result, my queries on data-2 keep hitting disk. >> >> I'm checking page cache usage with fincore. When I run a table scan query >> against data-2, I see that data-2's pages get evicted and put back into >> the cache in a round-robin manner. Nothing happens to data-1's pages, >> although they haven't been touched for days. >> >> Does anybody know why data-1's pages aren't evicted from the page cache? >> I'm open to all kind of suggestions you think it might relate to problem. > Curious. Added linux-mm list to CC to catch more attention. If you run > echo 1 >/proc/sys/vm/drop_caches > does it evict data-1 pages from memory? > >> This is an EC2 m2.4xlarge instance on Amazon with 68 GB of RAM and no >> swap space. The kernel version is: >> >> $ uname -r >> 3.2.28-45.62.amzn1.x86_64 >> Edit: >> >> and it seems that I use one NUMA instance, if you think that it can a problem. 
>> >> $ numactl --hardware >> available: 1 nodes (0) >> node 0 cpus: 0 1 2 3 4 5 6 7 >> node 0 size: 70007 MB >> node 0 free: 360 MB >> node distances: >> node 0 >> 0: 10 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753526Ab2KUJKq (ORCPT ); Wed, 21 Nov 2012 04:10:46 -0500 Received: from mga02.intel.com ([134.134.136.20]:23173 "EHLO mga02.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752359Ab2KUJKm (ORCPT ); Wed, 21 Nov 2012 04:10:42 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.83,291,1352102400"; d="scan'208";a="245245298" Date: Wed, 21 Nov 2012 17:10:02 +0800 From: Fengguang Wu To: Jaegeuk Hanse Cc: metin d , Jan Kara , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" Subject: Re: Problem in Page Cache Replacement Message-ID: <20121121091002.GA10255@localhost> References: <1353433362.85184.YahooMailNeo@web141101.mail.bf1.yahoo.com> <20121120182500.GH1408@quack.suse.cz> <1353485020.53500.YahooMailNeo@web141104.mail.bf1.yahoo.com> <1353485630.17455.YahooMailNeo@web141106.mail.bf1.yahoo.com> <50AC9220.70202@gmail.com> <20121121090204.GA9064@localhost> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121121090204.GA9064@localhost> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Nov 21, 2012 at 05:02:04PM +0800, Fengguang Wu wrote: > On Wed, Nov 21, 2012 at 04:34:40PM +0800, Jaegeuk Hanse wrote: > > Cc Fengguang Wu. > > > > On 11/21/2012 04:13 PM, metin d wrote: > > >> Curious. Added linux-mm list to CC to catch more attention. If you run > > >>echo 1 >/proc/sys/vm/drop_caches does it evict data-1 pages from memory? > > >I'm guessing it'd evict the entries, but am wondering if we could run any more diagnostics before trying this. 
> > >
> > >We regularly use a setup where we have two databases; one gets used frequently and the other one about once a month. It seems like the memory manager keeps unused pages in memory at the expense of frequently used database's performance.
>
> > >My understanding was that under memory pressure from heavily
> > >accessed pages, unused pages would eventually get evicted. Is there
> > >anything else we can try on this host to understand why this is
> > >happening?
>
> We may debug it this way.

Better to add a step

0) run 'page-types -r' to get an initial view of the page cache status.

Thanks,
Fengguang

> 1) run 'fadvise data-2 0 0 dontneed' to drop data-2 cached pages
> (please double check via /proc/vmstat whether it does the expected work)
>
> 2) run 'page-types -r' with root, to view the page status for the
> remaining pages of data-1
>
> The fadvise tool comes from Andrew Morton's ext3-tools. (source code attached)
> Please compile them with options "-Dlinux -I. -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE"
>
> page-types can be found in the kernel source tree tools/vm/page-types.c
>
> Sorry that sounds a bit twisted.. I do have a patch to directly dump
> page cache status of a user specified file, however it's not
> upstreamed yet.
>
> Thanks,
> Fengguang
>
> > >On Tue 20-11-12 09:42:42, metin d wrote:
> > >>I have two PostgreSQL databases named data-1 and data-2 that sit on the
> > >>same machine. Both databases keep 40 GB of data, and the total memory
> > >>available on the machine is 68GB.
> > >>
> > >>I started data-1 and data-2, and ran several queries to go over all their
> > >>data. Then, I shut down data-1 and kept issuing queries against data-2.
> > >>For some reason, the OS still holds on to large parts of data-1's pages
> > >>in its page cache, and reserves about 35 GB of RAM to data-2's files. As
> > >>a result, my queries on data-2 keep hitting disk.
> > >>
> > >>I'm checking page cache usage with fincore.
When I run a table scan query
> > >>against data-2, I see that data-2's pages get evicted and put back into
> > >>the cache in a round-robin manner. Nothing happens to data-1's pages,
> > >>although they haven't been touched for days.
> > >>
> > >>Does anybody know why data-1's pages aren't evicted from the page cache?
> > >>I'm open to all kind of suggestions you think it might relate to problem.
> > > Curious. Added linux-mm list to CC to catch more attention. If you run
> > >echo 1 >/proc/sys/vm/drop_caches
> > > does it evict data-1 pages from memory?
> > >
> > >>This is an EC2 m2.4xlarge instance on Amazon with 68 GB of RAM and no
> > >>swap space. The kernel version is:
> > >>
> > >>$ uname -r
> > >>3.2.28-45.62.amzn1.x86_64
> > >>Edit:
> > >>
> > >>and it seems that I use one NUMA instance, if you think that it can a problem.
> > >>
> > >>$ numactl --hardware
> > >>available: 1 nodes (0)
> > >>node 0 cpus: 0 1 2 3 4 5 6 7
> > >>node 0 size: 70007 MB
> > >>node 0 free: 360 MB
> > >>node distances:
> > >>node 0
> > >> 0: 10

> /* Header names were stripped in the archive; these six are inferred from the calls used below. */
> #include <unistd.h>
> #include <stdlib.h>
> #include <stdio.h>
> #include <string.h>
> #include <errno.h>
> #include <fcntl.h>
>
> #include "fadvise.h"
>
> char *progname;
>
> static void usage(void)
> {
> 	fprintf(stderr, "Usage: %s filename offset length advice [loops]\n", progname);
> 	fprintf(stderr, " advice: normal sequential willneed noreuse "
> 			"dontneed asyncwrite writewait\n");
> 	exit(1);
> }
>
> int
> main(int argc, char *argv[])
> {
> 	int c;
> 	int fd;
> 	char *sadvice;
> 	char *filename;
> 	loff_t offset;
> 	unsigned long length;
> 	int advice = 0;
> 	int ret;
> 	int loops = 1;
>
> 	progname = argv[0];
>
> 	while ((c = getopt(argc, argv, "")) != -1) {
> 		switch (c) {
> 		}
> 	}
>
> 	if (optind == argc)
> 		usage();
> 	filename = argv[optind++];
>
> 	if (optind == argc)
> 		usage();
> 	offset = strtoull(argv[optind++], NULL, 0);
>
> 	if (optind == argc)
> 		usage();
> 	length = strtol(argv[optind++], NULL, 0);
>
> 	if (optind == argc)
> 		usage();
> 	sadvice = argv[optind++];
>
> 	if (optind != argc)
> 		loops = strtol(argv[optind++], NULL, 0);
>
> 	if (optind != argc)
> 		usage();
>
> 	if (!strcmp(sadvice, "normal"))
> 		advice = POSIX_FADV_NORMAL;
> 	else if (!strcmp(sadvice, "sequential"))
> 		advice = POSIX_FADV_SEQUENTIAL;
> 	else if (!strcmp(sadvice, "willneed"))
> 		advice = POSIX_FADV_WILLNEED;
> 	else if (!strcmp(sadvice, "noreuse"))
> 		advice = POSIX_FADV_NOREUSE;
> 	else if (!strcmp(sadvice, "dontneed"))
> 		advice = POSIX_FADV_DONTNEED;
> 	else if (!strcmp(sadvice, "asyncwrite"))
> 		advice = LINUX_FADV_ASYNC_WRITE;
> 	else if (!strcmp(sadvice, "writewait"))
> 		advice = LINUX_FADV_WRITE_WAIT;
> 	else
> 		usage();
>
> 	fd = open(filename, O_RDONLY);
> 	if (fd < 0) {
> 		fprintf(stderr, "%s: cannot open `%s': %s\n",
> 			progname, filename, strerror(errno));
> 		exit(1);
> 	}
>
> 	while (loops--) {
> 		ret = __posix_fadvise64(fd, offset, length, advice);
> 		if (ret) {
> 			fprintf(stderr, "%s: fadvise() failed: %s\n",
> 				progname, strerror(errno));
> 			exit(1);
> 		}
> 	}
> 	close(fd);
> 	exit(0);
> }

> /* Header names were stripped in the archive; these two are an assumption. */
> #include <unistd.h>
> #include <sys/syscall.h>
>
> #ifndef __NR_fadvise64
> #if defined (__i386__)
> #define __NR_fadvise64 250
> #elif defined(__powerpc__)
> #define __NR_fadvise64 233
> #elif defined(__ia64__)
> #define __NR_fadvise64 1234
> #elif defined(__x86_64__)
> #define __NR_fadvise64 221
> #endif
> #endif
>
> #ifndef LINUX_FADV_ASYNC_WRITE
> #define LINUX_FADV_ASYNC_WRITE 32
> #endif
>
> #ifndef LINUX_FADV_WRITE_WAIT
> #define LINUX_FADV_WRITE_WAIT 33
> #endif
>
> #ifndef __x86_64__
> _syscall5(int,fadvise64, int,fd, long,offset_lo,
> 	long,offset_hi, size_t,len, int,advice)
> #endif
>
> /* Works by luck on ppc32, fails on ppc64 */
> #if defined(__i386__)
> int __posix_fadvise(int fd, off_t offset, size_t len, int advice)
> {
> 	return fadvise64(fd, offset, 0, len, advice);
> }
>
> int __posix_fadvise64(int fd, loff_t offset, size_t len, int advice)
> {
> 	return fadvise64(fd, offset, offset >> 32, len, advice);
> }
> #elif defined(__powerpc64__)
> int __posix_fadvise(int fd, off_t offset, size_t len, int advice)
> {
> 	return fadvise64(fd, offset, len, advice);
> }
>
> int __posix_fadvise64(int fd, loff_t offset, size_t len, int advice)
> {
> 	return fadvise64(fd, offset, len, advice);
> }
> #elif defined(__powerpc__)
>
> /*
>  * long longs are passed in an odd even register pair on ppc32 so
>  * we need to pad before offset
>  *
>  * Note also the glibc syscall() function for ppc has been broken for
>  * 6 argument syscalls until recently (~2.3.1 CVS)
>  */
> #define ppc_fadvise64(fd, offset_hi, offset_lo, len, advice) \
> 	syscall(__NR_fadvise64, fd, 0, offset_hi, offset_lo, len, advice)
>
> int __posix_fadvise(int fd, off_t offset, size_t len, int advice)
> {
> 	return ppc_fadvise64(fd, 0, offset, len, advice);
> }
>
> /* big endian, akpm. */
> int __posix_fadvise64(int fd, loff_t offset, size_t len, int advice)
> {
> 	return ppc_fadvise64(fd, (unsigned int)(offset >> 32),
> 		(unsigned int)(offset & 0xffffffff), len, advice);
> }
> #elif defined(__ia64__)
> int __posix_fadvise(int fd, off_t offset, size_t len, int advice)
> {
> 	return fadvise64(fd, offset, len, advice);
> }
>
> int __posix_fadvise64(int fd, loff_t offset, size_t len, int advice)
> {
> 	return fadvise64(fd, offset, len, advice);
> }
> #elif defined(__x86_64__)
> int __posix_fadvise(int fd, off_t offset, size_t len, int advice)
> {
> 	return -1;
> }
>
> int __posix_fadvise64(int fd, loff_t offset, size_t len, int advice)
> {
> 	return syscall(__NR_fadvise64, fd, offset, len, advice);
> }
> #endif

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754117Ab2KUJmo (ORCPT ); Wed, 21 Nov 2012 04:42:44 -0500
Received: from mail-pa0-f46.google.com ([209.85.220.46]:64732 "EHLO mail-pa0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752729Ab2KUJml (ORCPT ); Wed, 21 Nov 2012 04:42:41 -0500
Message-ID: <50ACA209.9000101@gmail.com>
Date: Wed, 21 Nov 2012 17:42:33 +0800
From: Jaegeuk Hanse
User-Agent: Mozilla/5.0 (X11; Linux i686; rv:16.0)
Gecko/20121028 Thunderbird/16.0.2 MIME-Version: 1.0 To: Fengguang Wu CC: metin d , Jan Kara , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" Subject: Re: Problem in Page Cache Replacement References: <1353433362.85184.YahooMailNeo@web141101.mail.bf1.yahoo.com> <20121120182500.GH1408@quack.suse.cz> <1353485020.53500.YahooMailNeo@web141104.mail.bf1.yahoo.com> <1353485630.17455.YahooMailNeo@web141106.mail.bf1.yahoo.com> <50AC9220.70202@gmail.com> <20121121090204.GA9064@localhost> In-Reply-To: <20121121090204.GA9064@localhost> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 11/21/2012 05:02 PM, Fengguang Wu wrote: > On Wed, Nov 21, 2012 at 04:34:40PM +0800, Jaegeuk Hanse wrote: >> Cc Fengguang Wu. >> >> On 11/21/2012 04:13 PM, metin d wrote: >>>> Curious. Added linux-mm list to CC to catch more attention. If you run >>>> echo 1 >/proc/sys/vm/drop_caches does it evict data-1 pages from memory? >>> I'm guessing it'd evict the entries, but am wondering if we could run any more diagnostics before trying this. >>> >>> We regularly use a setup where we have two databases; one gets used frequently and the other one about once a month. It seems like the memory manager keeps unused pages in memory at the expense of frequently used database's performance. >>> My understanding was that under memory pressure from heavily >>> accessed pages, unused pages would eventually get evicted. Is there >>> anything else we can try on this host to understand why this is >>> happening? > We may debug it this way. > > 1) run 'fadvise data-2 0 0 dontneed' to drop data-2 cached pages > (please double check via /proc/vmstat whether it does the expected work) > > 2) run 'page-types -r' with root, to view the page status for the > remaining pages of data-1 > > The fadvise tool comes from Andrew Morton's ext3-tools. 
(source code attached)
> Please compile them with options "-Dlinux -I. -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE"
>
> page-types can be found in the kernel source tree tools/vm/page-types.c
>
> Sorry that sounds a bit twisted.. I do have a patch to directly dump
> page cache status of a user specified file, however it's not
> upstreamed yet.

Hi Fengguang,

Thanks for your detailed steps; I think Metin can give them a try.

             flags  page-count       MB  symbolic-flags                       long-symbolic-flags
0x0000000000000000      607699     2373  ___________________________________
0x0000000100000000      343227     1340  _______________________r___________  reserved

But I have some questions about the page-types output:

Does the 2373 MB here mean the total memory in use, including the page cache? I don't think so.
Which kinds of pages will be marked reserved?
Which long-symbolic-flags line corresponds to the page cache?

Regards,
Jaegeuk

>
> Thanks,
> Fengguang
>
>>> On Tue 20-11-12 09:42:42, metin d wrote:
>>>> I have two PostgreSQL databases named data-1 and data-2 that sit on the
>>>> same machine. Both databases keep 40 GB of data, and the total memory
>>>> available on the machine is 68GB.
>>>>
>>>> I started data-1 and data-2, and ran several queries to go over all their
>>>> data. Then, I shut down data-1 and kept issuing queries against data-2.
>>>> For some reason, the OS still holds on to large parts of data-1's pages
>>>> in its page cache, and reserves about 35 GB of RAM to data-2's files. As
>>>> a result, my queries on data-2 keep hitting disk.
>>>>
>>>> I'm checking page cache usage with fincore. When I run a table scan query
>>>> against data-2, I see that data-2's pages get evicted and put back into
>>>> the cache in a round-robin manner. Nothing happens to data-1's pages,
>>>> although they haven't been touched for days.
>>>>
>>>> Does anybody know why data-1's pages aren't evicted from the page cache?
>>>> I'm open to all kind of suggestions you think it might relate to problem.
>>> Curious.
Added linux-mm list to CC to catch more attention. If you run >>> echo 1 >/proc/sys/vm/drop_caches >>> does it evict data-1 pages from memory? >>> >>>> This is an EC2 m2.4xlarge instance on Amazon with 68 GB of RAM and no >>>> swap space. The kernel version is: >>>> >>>> $ uname -r >>>> 3.2.28-45.62.amzn1.x86_64 >>>> Edit: >>>> >>>> and it seems that I use one NUMA instance, if you think that it can a problem. >>>> >>>> $ numactl --hardware >>>> available: 1 nodes (0) >>>> node 0 cpus: 0 1 2 3 4 5 6 7 >>>> node 0 size: 70007 MB >>>> node 0 free: 360 MB >>>> node distances: >>>> node 0 >>>> 0: 10 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754282Ab2KUKHs (ORCPT ); Wed, 21 Nov 2012 05:07:48 -0500 Received: from mail-qc0-f174.google.com ([209.85.216.174]:63618 "EHLO mail-qc0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753308Ab2KUKHo convert rfc822-to-8bit (ORCPT ); Wed, 21 Nov 2012 05:07:44 -0500 MIME-Version: 1.0 In-Reply-To: <50ACA634.5000007@gmail.com> References: <1353433362.85184.YahooMailNeo@web141101.mail.bf1.yahoo.com> <20121120182500.GH1408@quack.suse.cz> <1353485020.53500.YahooMailNeo@web141104.mail.bf1.yahoo.com> <1353485630.17455.YahooMailNeo@web141106.mail.bf1.yahoo.com> <50AC9220.70202@gmail.com> <20121121090204.GA9064@localhost> <50ACA209.9000101@gmail.com> <1353491880.11679.YahooMailNeo@web141102.mail.bf1.yahoo.com> <50ACA634.5000007@gmail.com> From: =?UTF-8?B?TWV0aW4gRMO2xZ9sw7w=?= Date: Wed, 21 Nov 2012 12:07:22 +0200 Message-ID: Subject: Re: Problem in Page Cache Replacement To: Jaegeuk Hanse Cc: Fengguang Wu , Jan Kara , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8BIT Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Nov 21, 2012 at 12:00 PM, Jaegeuk Hanse wrote: > > On 11/21/2012 05:58 PM, metin d 
wrote:
>
> Hi Fengguang,
>
> I run tests and attached the results. The line below I guess shows the data-1 page caches.
>
> 0x000000080000006c 6584051 25718 __RU_lA___________________P________ referenced,uptodate,lru,active,private
>
>
> I thinks this is just one state of page cache pages.

But why are these page cache pages in this state, as opposed to the other page cache pages? From the results I conclude that:

data-1 pages are in state: referenced,uptodate,lru,active,private
data-2 pages are in state: referenced,uptodate,lru,mappedtodisk

>
>
>
>
> Metin
>
>
> ----- Original Message -----
> From: Jaegeuk Hanse
> To: Fengguang Wu
> Cc: metin d ; Jan Kara ; "linux-kernel@vger.kernel.org" ; "linux-mm@kvack.org"
> Sent: Wednesday, November 21, 2012 11:42 AM
> Subject: Re: Problem in Page Cache Replacement
>
> On 11/21/2012 05:02 PM, Fengguang Wu wrote:
> > On Wed, Nov 21, 2012 at 04:34:40PM +0800, Jaegeuk Hanse wrote:
> >> Cc Fengguang Wu.
> >>
> >> On 11/21/2012 04:13 PM, metin d wrote:
> >>>> Curious. Added linux-mm list to CC to catch more attention. If you run
> >>>> echo 1 >/proc/sys/vm/drop_caches does it evict data-1 pages from memory?
> >>> I'm guessing it'd evict the entries, but am wondering if we could run any more diagnostics before trying this.
> >>>
> >>> We regularly use a setup where we have two databases; one gets used frequently and the other one about once a month. It seems like the memory manager keeps unused pages in memory at the expense of frequently used database's performance.
> >>> My understanding was that under memory pressure from heavily
> >>> accessed pages, unused pages would eventually get evicted. Is there
> >>> anything else we can try on this host to understand why this is
> >>> happening?
> > We may debug it this way.
> > > > 1) run 'fadvise data-2 0 0 dontneed' to drop data-2 cached pages > > (please double check via /proc/vmstat whether it does the expected work) > > > > 2) run 'page-types -r' with root, to view the page status for the > > remaining pages of data-1 > > > > The fadvise tool comes from Andrew Morton's ext3-tools. (source code attached) > > Please compile them with options "-Dlinux -I. -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE" > > > > page-types can be found in the kernel source tree tools/vm/page-types.c > > > > Sorry that sounds a bit twisted.. I do have a patch to directly dump > > page cache status of a user specified file, however it's not > > upstreamed yet. > > Hi Fengguang, > > Thanks for you detail steps, I think metin can have a try. > > flags page-count MB symbolic-flags long-symbolic-flags > 0x0000000000000000 607699 2373 > ___________________________________ > 0x0000000100000000 343227 1340 > _______________________r___________ reserved > > But I have some questions of the print of page-type: > > Is 2373MB here mean total memory in used include page cache? I don't > think so. > Which kind of pages will be marked reserved? > Which line of long-symbolic-flags is for page cache? > > Regards, > Jaegeuk > > > > > Thanks, > > Fengguang > > > >>> On Tue 20-11-12 09:42:42, metin d wrote: > >>>> I have two PostgreSQL databases named data-1 and data-2 that sit on the > >>>> same machine. Both databases keep 40 GB of data, and the total memory > >>>> available on the machine is 68GB. > >>>> > >>>> I started data-1 and data-2, and ran several queries to go over all their > >>>> data. Then, I shut down data-1 and kept issuing queries against data-2. > >>>> For some reason, the OS still holds on to large parts of data-1's pages > >>>> in its page cache, and reserves about 35 GB of RAM to data-2's files. As > >>>> a result, my queries on data-2 keep hitting disk. > >>>> > >>>> I'm checking page cache usage with fincore. 
When I run a table scan query > >>>> against data-2, I see that data-2's pages get evicted and put back into > >>>> the cache in a round-robin manner. Nothing happens to data-1's pages, > >>>> although they haven't been touched for days. > >>>> > >>>> Does anybody know why data-1's pages aren't evicted from the page cache? > >>>> I'm open to all kind of suggestions you think it might relate to problem. > >>> Curious. Added linux-mm list to CC to catch more attention. If you run > >>> echo 1 >/proc/sys/vm/drop_caches > >>> does it evict data-1 pages from memory? > >>> > >>>> This is an EC2 m2.4xlarge instance on Amazon with 68 GB of RAM and no > >>>> swap space. The kernel version is: > >>>> > >>>> $ uname -r > >>>> 3.2.28-45.62.amzn1.x86_64 > >>>> Edit: > >>>> > >>>> and it seems that I use one NUMA instance, if you think that it can a problem. > >>>> > >>>> $ numactl --hardware > >>>> available: 1 nodes (0) > >>>> node 0 cpus: 0 1 2 3 4 5 6 7 > >>>> node 0 size: 70007 MB > >>>> node 0 free: 360 MB > >>>> node distances: > >>>> node 0 > >>>> 0: 10 > > -- Metin Döşlü From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755888Ab2KVSxJ (ORCPT ); Thu, 22 Nov 2012 13:53:09 -0500 Received: from mga11.intel.com ([192.55.52.93]:34490 "EHLO mga11.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755890Ab2KVSvv (ORCPT ); Thu, 22 Nov 2012 13:51:51 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.83,301,1352102400"; d="scan'208";a="251105653" Date: Thu, 22 Nov 2012 23:26:11 +0800 From: Fengguang Wu To: Jaegeuk Hanse Cc: metin d , Jan Kara , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" Subject: Re: Problem in Page Cache Replacement Message-ID: <20121122152611.GA11736@localhost> References: <1353433362.85184.YahooMailNeo@web141101.mail.bf1.yahoo.com> <20121120182500.GH1408@quack.suse.cz> <1353485020.53500.YahooMailNeo@web141104.mail.bf1.yahoo.com> 
<1353485630.17455.YahooMailNeo@web141106.mail.bf1.yahoo.com> <50AC9220.70202@gmail.com> <20121121090204.GA9064@localhost> <50ACA209.9000101@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <50ACA209.9000101@gmail.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi Jaegeuk, Sorry for the delay. I'm traveling these days.. On Wed, Nov 21, 2012 at 05:42:33PM +0800, Jaegeuk Hanse wrote: > On 11/21/2012 05:02 PM, Fengguang Wu wrote: > >On Wed, Nov 21, 2012 at 04:34:40PM +0800, Jaegeuk Hanse wrote: > >>Cc Fengguang Wu. > >> > >>On 11/21/2012 04:13 PM, metin d wrote: > >>>> Curious. Added linux-mm list to CC to catch more attention. If you run > >>>>echo 1 >/proc/sys/vm/drop_caches does it evict data-1 pages from memory? > >>>I'm guessing it'd evict the entries, but am wondering if we could run any more diagnostics before trying this. > >>> > >>>We regularly use a setup where we have two databases; one gets used frequently and the other one about once a month. It seems like the memory manager keeps unused pages in memory at the expense of frequently used database's performance. > >>>My understanding was that under memory pressure from heavily > >>>accessed pages, unused pages would eventually get evicted. Is there > >>>anything else we can try on this host to understand why this is > >>>happening? > >We may debug it this way. > > > >1) run 'fadvise data-2 0 0 dontneed' to drop data-2 cached pages > > (please double check via /proc/vmstat whether it does the expected work) > > > >2) run 'page-types -r' with root, to view the page status for the > > remaining pages of data-1 > > > >The fadvise tool comes from Andrew Morton's ext3-tools. (source code attached) > >Please compile them with options "-Dlinux -I. 
-D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE"
> >
> >page-types can be found in the kernel source tree tools/vm/page-types.c
> >
> >Sorry that sounds a bit twisted.. I do have a patch to directly dump
> >page cache status of a user specified file, however it's not
> >upstreamed yet.
>
> Hi Fengguang,
>
> Thanks for you detail steps, I think metin can have a try.
>
> flags page-count MB symbolic-flags long-symbolic-flags
> 0x0000000000000000 607699 2373
> ___________________________________
> 0x0000000100000000 343227 1340
> _______________________r___________ reserved

We don't need to care about the above two page states, actually. Page cache
pages will never be in the special reserved or all-flags-cleared state.

> But I have some questions of the print of page-type:
>
> Is 2373MB here mean total memory in used include page cache? I don't
> think so.
> Which kind of pages will be marked reserved?
> Which line of long-symbolic-flags is for page cache?

The (lru && !anonymous) pages are page cache pages.

Thanks,
Fengguang

> >>>On Tue 20-11-12 09:42:42, metin d wrote:
> >>>>I have two PostgreSQL databases named data-1 and data-2 that sit on the
> >>>>same machine. Both databases keep 40 GB of data, and the total memory
> >>>>available on the machine is 68GB.
> >>>>
> >>>>I started data-1 and data-2, and ran several queries to go over all their
> >>>>data. Then, I shut down data-1 and kept issuing queries against data-2.
> >>>>For some reason, the OS still holds on to large parts of data-1's pages
> >>>>in its page cache, and reserves about 35 GB of RAM to data-2's files. As
> >>>>a result, my queries on data-2 keep hitting disk.
> >>>>
> >>>>I'm checking page cache usage with fincore. When I run a table scan query
> >>>>against data-2, I see that data-2's pages get evicted and put back into
> >>>>the cache in a round-robin manner. Nothing happens to data-1's pages,
> >>>>although they haven't been touched for days.
> >>>> > >>>>Does anybody know why data-1's pages aren't evicted from the page cache? > >>>>I'm open to all kind of suggestions you think it might relate to problem. > >>> Curious. Added linux-mm list to CC to catch more attention. If you run > >>>echo 1 >/proc/sys/vm/drop_caches > >>> does it evict data-1 pages from memory? > >>> > >>>>This is an EC2 m2.4xlarge instance on Amazon with 68 GB of RAM and no > >>>>swap space. The kernel version is: > >>>> > >>>>$ uname -r > >>>>3.2.28-45.62.amzn1.x86_64 > >>>>Edit: > >>>> > >>>>and it seems that I use one NUMA instance, if you think that it can a problem. > >>>> > >>>>$ numactl --hardware > >>>>available: 1 nodes (0) > >>>>node 0 cpus: 0 1 2 3 4 5 6 7 > >>>>node 0 size: 70007 MB > >>>>node 0 free: 360 MB > >>>>node distances: > >>>>node 0 > >>>> 0: 10 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756047Ab2KVTGv (ORCPT ); Thu, 22 Nov 2012 14:06:51 -0500 Received: from mga11.intel.com ([192.55.52.93]:59181 "EHLO mga11.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754380Ab2KVTGt (ORCPT ); Thu, 22 Nov 2012 14:06:49 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.83,301,1352102400"; d="scan'208";a="251109879" Date: Thu, 22 Nov 2012 23:41:07 +0800 From: Fengguang Wu To: Metin =?utf-8?B?RMO2xZ9sw7w=?= Cc: Jaegeuk Hanse , Jan Kara , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" Subject: Re: Problem in Page Cache Replacement Message-ID: <20121122154107.GB11736@localhost> References: <1353433362.85184.YahooMailNeo@web141101.mail.bf1.yahoo.com> <20121120182500.GH1408@quack.suse.cz> <1353485020.53500.YahooMailNeo@web141104.mail.bf1.yahoo.com> <1353485630.17455.YahooMailNeo@web141106.mail.bf1.yahoo.com> <50AC9220.70202@gmail.com> <20121121090204.GA9064@localhost> <50ACA209.9000101@gmail.com> <1353491880.11679.YahooMailNeo@web141102.mail.bf1.yahoo.com> <50ACA634.5000007@gmail.com> MIME-Version: 1.0 Content-Type: 
text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To:
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Nov 21, 2012 at 12:07:22PM +0200, Metin Döşlü wrote:
> On Wed, Nov 21, 2012 at 12:00 PM, Jaegeuk Hanse wrote:
> >
> > On 11/21/2012 05:58 PM, metin d wrote:
> >
> > Hi Fengguang,
> >
> > I run tests and attached the results. The line below I guess shows the data-1 page caches.
> >
> > 0x000000080000006c 6584051 25718 __RU_lA___________________P________ referenced,uptodate,lru,active,private
> >
> >
> > I think this is just one state of page cache pages.
>
> But why these page caches are in this state as opposed to other page
> caches. From the results I conclude that:
>
> data-1 pages are in state : referenced,uptodate,lru,active,private

I wonder if it's this code that stops data-1 pages from being
reclaimed:

shrink_page_list():

        if (page_has_private(page)) {
                if (!try_to_release_page(page, sc->gfp_mask))
                        goto activate_locked;

What's the filesystem used?
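
For readers who want to check how a page-types flags value maps to the symbolic names quoted above, here is a small sketch. It is not part of page-types itself. The bit table is an assumption based on a reading of the kernel's page flag numbering; only bits 2, 3, 5, 6, 32 and 35 can be cross-checked against the output quoted in this thread, and bits not listed in the table are simply ignored. The `is_page_cache` helper applies the "(lru && !anonymous)" rule given earlier in the thread.

```python
# Assumed bit positions (hypothetical table for illustration; only the
# bits that appear in the quoted page-types output are confirmed here).
KPF_BITS = {
    2: "referenced",
    3: "uptodate",
    5: "lru",
    6: "active",
    12: "anonymous",
    32: "reserved",
    34: "mappedtodisk",
    35: "private",
}


def decode_flags(value):
    """Return the names of the known flag bits set in a flags value."""
    return [name for bit, name in sorted(KPF_BITS.items()) if (value >> bit) & 1]


def is_page_cache(value):
    """Apply the rule from the thread: lru and not anonymous."""
    names = decode_flags(value)
    return "lru" in names and "anonymous" not in names


print(decode_flags(0x000000080000006C))
print(is_page_cache(0x000000080000006C))
print(decode_flags(0x0000000100000000))
```

The first value is the data-1 line from the attached results; the last one is the "reserved" line from the earlier page-types output.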
> data-2 pages are in state : referenced,uptodate,lru,mappedtodisk Thanks, Fengguang From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756700Ab2KVTS1 (ORCPT ); Thu, 22 Nov 2012 14:18:27 -0500 Received: from mga11.intel.com ([192.55.52.93]:59422 "EHLO mga11.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S965165Ab2KVTST (ORCPT ); Thu, 22 Nov 2012 14:18:19 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.83,301,1352102400"; d="scan'208";a="251113228" Date: Thu, 22 Nov 2012 23:53:18 +0800 From: Fengguang Wu To: Metin =?utf-8?B?RMO2xZ9sw7w=?= Cc: Jaegeuk Hanse , Jan Kara , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" Subject: Re: Problem in Page Cache Replacement Message-ID: <20121122155318.GA12636@localhost> References: <20121120182500.GH1408@quack.suse.cz> <1353485020.53500.YahooMailNeo@web141104.mail.bf1.yahoo.com> <1353485630.17455.YahooMailNeo@web141106.mail.bf1.yahoo.com> <50AC9220.70202@gmail.com> <20121121090204.GA9064@localhost> <50ACA209.9000101@gmail.com> <1353491880.11679.YahooMailNeo@web141102.mail.bf1.yahoo.com> <50ACA634.5000007@gmail.com> <20121122154107.GB11736@localhost> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20121122154107.GB11736@localhost> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Nov 22, 2012 at 11:41:07PM +0800, Fengguang Wu wrote: > On Wed, Nov 21, 2012 at 12:07:22PM +0200, Metin Döşlü wrote: > > On Wed, Nov 21, 2012 at 12:00 PM, Jaegeuk Hanse wrote: > > > > > > On 11/21/2012 05:58 PM, metin d wrote: > > > > > > Hi Fengguang, > > > > > > I run tests and attached the results. The line below I guess shows the data-1 page caches. 
> > >
> > > 0x000000080000006c 6584051 25718 __RU_lA___________________P________ referenced,uptodate,lru,active,private
> > >
> > >
> > > I think this is just one state of page cache pages.
> >
> > But why these page caches are in this state as opposed to other page
> > caches. From the results I conclude that:
> >
> > data-1 pages are in state : referenced,uptodate,lru,active,private
>
> I wonder if it's this code that stops data-1 pages from being
> reclaimed:
>
> shrink_page_list():
>
>         if (page_has_private(page)) {
>                 if (!try_to_release_page(page, sc->gfp_mask))
>                         goto activate_locked;
>
> What's the filesystem used?

Ah it's more likely caused by this logic:

        if (is_active_lru(lru)) {
                if (inactive_list_is_low(mz, file))
                        shrink_active_list(nr_to_scan, mz, sc, priority, file);

The active file list won't be scanned at all if it's smaller than the
inactive list. In this case, it's inactive=33586MB > active=25719MB. So
the data-1 pages in the active list will never be scanned and reclaimed.
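
The balance check described above can be restated in a few lines, using the list sizes reported in this message. This is a simplified sketch of the rule as stated in the thread (the active file list is only scanned when it is larger than the inactive file list), not the kernel function itself; the Python names are chosen for illustration only.

```python
def active_file_list_gets_scanned(active_mb, inactive_mb):
    """Simplified rule from the thread: the inactive file list counts as
    'low', and the active list is scanned, only when active > inactive."""
    inactive_is_low = active_mb > inactive_mb
    return inactive_is_low


# Sizes reported in this thread: inactive=33586MB, active=25719MB.
print(active_file_list_gets_scanned(active_mb=25719, inactive_mb=33586))
```

With these numbers the check is false, which matches the observation that the data-1 pages on the active list are never scanned.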
> > data-2 pages are in state : referenced,uptodate,lru,mappedtodisk > > Thanks, > Fengguang From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757193Ab2KVTgE (ORCPT ); Thu, 22 Nov 2012 14:36:04 -0500 Received: from mail-oa0-f46.google.com ([209.85.219.46]:34357 "EHLO mail-oa0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757148Ab2KVTfy (ORCPT ); Thu, 22 Nov 2012 14:35:54 -0500 Message-ID: <50AD7647.7050200@gmail.com> Date: Thu, 22 Nov 2012 08:48:07 +0800 From: Jaegeuk Hanse User-Agent: Mozilla/5.0 (X11; Linux i686; rv:16.0) Gecko/20121028 Thunderbird/16.0.2 MIME-Version: 1.0 To: Johannes Weiner CC: Jan Kara , metin d , "linux-kernel@vger.kernel.org" , linux-mm@kvack.org Subject: Re: Problem in Page Cache Replacement References: <1353433362.85184.YahooMailNeo@web141101.mail.bf1.yahoo.com> <20121120182500.GH1408@quack.suse.cz> <20121121213417.GC24381@cmpxchg.org> In-Reply-To: <20121121213417.GC24381@cmpxchg.org> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 11/22/2012 05:34 AM, Johannes Weiner wrote: > Hi, > > On Tue, Nov 20, 2012 at 07:25:00PM +0100, Jan Kara wrote: >> On Tue 20-11-12 09:42:42, metin d wrote: >>> I have two PostgreSQL databases named data-1 and data-2 that sit on the >>> same machine. Both databases keep 40 GB of data, and the total memory >>> available on the machine is 68GB. >>> >>> I started data-1 and data-2, and ran several queries to go over all their >>> data. Then, I shut down data-1 and kept issuing queries against data-2. >>> For some reason, the OS still holds on to large parts of data-1's pages >>> in its page cache, and reserves about 35 GB of RAM to data-2's files. As >>> a result, my queries on data-2 keep hitting disk. >>> >>> I'm checking page cache usage with fincore. 
When I run a table scan query >>> against data-2, I see that data-2's pages get evicted and put back into >>> the cache in a round-robin manner. Nothing happens to data-1's pages, >>> although they haven't been touched for days. >>> >>> Does anybody know why data-1's pages aren't evicted from the page cache? >>> I'm open to all kind of suggestions you think it might relate to problem. > This might be because we do not deactive pages as long as there is > cache on the inactive list. I'm guessing that the inter-reference > distance of data-2 is bigger than half of memory, so it's never > getting activated and data-1 is never challenged. Hi Johannes, What's the meaning of "inter-reference distance" and why compare it with half of memoy, what's the trick? Regards, Jaegeuk > > I have a series of patches that detects a thrashing inactive list and > handles working set changes up to the size of memory. Would you be > willing to test them? They are currently based on 3.4, let me know > what version works best for you. > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . 
> Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932241Ab2KVUlE (ORCPT ); Thu, 22 Nov 2012 15:41:04 -0500 Received: from mail-ob0-f174.google.com ([209.85.214.174]:60532 "EHLO mail-ob0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756338Ab2KVUk7 (ORCPT ); Thu, 22 Nov 2012 15:40:59 -0500 Message-ID: <50AE25AB.2060808@gmail.com> Date: Thu, 22 Nov 2012 21:16:27 +0800 From: Jaegeuk Hanse User-Agent: Mozilla/5.0 (X11; Linux i686; rv:16.0) Gecko/20121028 Thunderbird/16.0.2 MIME-Version: 1.0 To: Johannes Weiner CC: Jan Kara , metin d , "linux-kernel@vger.kernel.org" , linux-mm@kvack.org Subject: Re: Problem in Page Cache Replacement References: <1353433362.85184.YahooMailNeo@web141101.mail.bf1.yahoo.com> <20121120182500.GH1408@quack.suse.cz> <20121121213417.GC24381@cmpxchg.org> <50AD7647.7050200@gmail.com> <20121122010959.GF24381@cmpxchg.org> In-Reply-To: <20121122010959.GF24381@cmpxchg.org> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 11/22/2012 09:09 AM, Johannes Weiner wrote: > On Thu, Nov 22, 2012 at 08:48:07AM +0800, Jaegeuk Hanse wrote: >> On 11/22/2012 05:34 AM, Johannes Weiner wrote: >>> Hi, >>> >>> On Tue, Nov 20, 2012 at 07:25:00PM +0100, Jan Kara wrote: >>>> On Tue 20-11-12 09:42:42, metin d wrote: >>>>> I have two PostgreSQL databases named data-1 and data-2 that sit on the >>>>> same machine. Both databases keep 40 GB of data, and the total memory >>>>> available on the machine is 68GB. >>>>> >>>>> I started data-1 and data-2, and ran several queries to go over all their >>>>> data. Then, I shut down data-1 and kept issuing queries against data-2. 
>>>>> For some reason, the OS still holds on to large parts of data-1's pages >>>>> in its page cache, and reserves about 35 GB of RAM to data-2's files. As >>>>> a result, my queries on data-2 keep hitting disk. >>>>> >>>>> I'm checking page cache usage with fincore. When I run a table scan query >>>>> against data-2, I see that data-2's pages get evicted and put back into >>>>> the cache in a round-robin manner. Nothing happens to data-1's pages, >>>>> although they haven't been touched for days. >>>>> >>>>> Does anybody know why data-1's pages aren't evicted from the page cache? >>>>> I'm open to all kind of suggestions you think it might relate to problem. >>> This might be because we do not deactive pages as long as there is >>> cache on the inactive list. I'm guessing that the inter-reference >>> distance of data-2 is bigger than half of memory, so it's never >>> getting activated and data-1 is never challenged. >> Hi Johannes, >> >> What's the meaning of "inter-reference distance" > It's the number of memory accesses between two accesses to the same > page: > > A B C D A B C E ... > |_______| > | | > >> and why compare it with half of memoy, what's the trick? > If B gets accessed twice, it gets activated. If it gets evicted in > between, the second access will be a fresh page fault and B will not > be recognized as frequently used. > > Our cutoff for scanning the active list is cache size / 2 right now > (inactive_file_is_low), leaving 50% of memory to the inactive list. > If the inter-reference distance for pages on the inactive list is > bigger than that, they get evicted before their second access. Hi Johannes, Thanks for your explanation. But could you give a short description of how you resolve this inactive list thrashing issues? 
Regards,
Jaegeuk

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755610Ab2KVVOj (ORCPT ); Thu, 22 Nov 2012 16:14:39 -0500
Received: from nm24-vm0.bullet.mail.bf1.yahoo.com ([98.139.213.161]:34605 "EHLO nm24-vm0.bullet.mail.bf1.yahoo.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755635Ab2KVVOg convert rfc822-to-8bit (ORCPT ); Thu, 22 Nov 2012 16:14:36 -0500
X-Yahoo-Newman-Property: ymail-3
X-Yahoo-Newman-Id: 605159.12783.bm@omp1044.mail.bf1.yahoo.com
X-Mailer: YahooMailWebService/0.8.123.460
References: <1353433362.85184.YahooMailNeo@web141101.mail.bf1.yahoo.com> <20121120182500.GH1408@quack.suse.cz> <20121121213417.GC24381@cmpxchg.org> Message-ID: <1353535288.94916.YahooMailNeo@web141101.mail.bf1.yahoo.com> Date: Wed, 21 Nov 2012 14:01:28 -0800 (PST) From: metin d Reply-To: metin d Subject: Re: Problem in Page Cache Replacement To: Johannes Weiner , Jan Kara Cc: "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , =?utf-8?B?TWV0aW4gRMO2xZ9sw7w=?= In-Reply-To: <20121121213417.GC24381@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8BIT Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi, Yes data-2 is bigger than half of memory. I'm willing to try those patches. This is the version of this machine: $ uname -r 3.2.28-45.62.amzn1.x86_64 ----- Original Message ----- From: Johannes Weiner To: Jan Kara Cc: metin d ; "linux-kernel@vger.kernel.org" ; linux-mm@kvack.org Sent: Wednesday, November 21, 2012 11:34 PM Subject: Re: Problem in Page Cache Replacement Hi, On Tue, Nov 20, 2012 at 07:25:00PM +0100, Jan Kara wrote: > On Tue 20-11-12 09:42:42, metin d wrote: > > I have two PostgreSQL databases named data-1 and data-2 that sit on the > > same machine. Both databases keep 40 GB of data, and the total memory > > available on the machine is 68GB. > > > > I started data-1 and data-2, and ran several queries to go over all their > > data. Then, I shut down data-1 and kept issuing queries against data-2. > > For some reason, the OS still holds on to large parts of data-1's pages > > in its page cache, and reserves about 35 GB of RAM to data-2's files. As > > a result, my queries on data-2 keep hitting disk. > > > > I'm checking page cache usage with fincore. When I run a table scan query > > against data-2, I see that data-2's pages get evicted and put back into > > the cache in a round-robin manner. 
Nothing happens to data-1's pages, > > although they haven't been touched for days. > > > > Does anybody know why data-1's pages aren't evicted from the page cache? > > I'm open to all kind of suggestions you think it might relate to problem. This might be because we do not deactive pages as long as there is cache on the inactive list.  I'm guessing that the inter-reference distance of data-2 is bigger than half of memory, so it's never getting activated and data-1 is never challenged. I have a series of patches that detects a thrashing inactive list and handles working set changes up to the size of memory.  Would you be willing to test them?  They are currently based on 3.4, let me know what version works best for you. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758781Ab2KVXDz (ORCPT ); Thu, 22 Nov 2012 18:03:55 -0500 Received: from zene.cmpxchg.org ([85.214.230.12]:34814 "EHLO zene.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752475Ab2KVXDx (ORCPT ); Thu, 22 Nov 2012 18:03:53 -0500 Date: Thu, 22 Nov 2012 11:17:43 -0500 From: Johannes Weiner To: Jaegeuk Hanse Cc: Jan Kara , metin d , "linux-kernel@vger.kernel.org" , linux-mm@kvack.org Subject: Re: Problem in Page Cache Replacement Message-ID: <20121122161743.GH24381@cmpxchg.org> References: <1353433362.85184.YahooMailNeo@web141101.mail.bf1.yahoo.com> <20121120182500.GH1408@quack.suse.cz> <20121121213417.GC24381@cmpxchg.org> <50AD7647.7050200@gmail.com> <20121122010959.GF24381@cmpxchg.org> <50AE25AB.2060808@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <50AE25AB.2060808@gmail.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Nov 22, 2012 at 09:16:27PM +0800, Jaegeuk Hanse wrote: > On 11/22/2012 09:09 AM, Johannes Weiner wrote: > >On Thu, Nov 22, 2012 at 08:48:07AM +0800, Jaegeuk 
Hanse wrote: > >>On 11/22/2012 05:34 AM, Johannes Weiner wrote: > >>>Hi, > >>> > >>>On Tue, Nov 20, 2012 at 07:25:00PM +0100, Jan Kara wrote: > >>>>On Tue 20-11-12 09:42:42, metin d wrote: > >>>>>I have two PostgreSQL databases named data-1 and data-2 that sit on the > >>>>>same machine. Both databases keep 40 GB of data, and the total memory > >>>>>available on the machine is 68GB. > >>>>> > >>>>>I started data-1 and data-2, and ran several queries to go over all their > >>>>>data. Then, I shut down data-1 and kept issuing queries against data-2. > >>>>>For some reason, the OS still holds on to large parts of data-1's pages > >>>>>in its page cache, and reserves about 35 GB of RAM to data-2's files. As > >>>>>a result, my queries on data-2 keep hitting disk. > >>>>> > >>>>>I'm checking page cache usage with fincore. When I run a table scan query > >>>>>against data-2, I see that data-2's pages get evicted and put back into > >>>>>the cache in a round-robin manner. Nothing happens to data-1's pages, > >>>>>although they haven't been touched for days. > >>>>> > >>>>>Does anybody know why data-1's pages aren't evicted from the page cache? > >>>>>I'm open to all kind of suggestions you think it might relate to problem. > >>>This might be because we do not deactive pages as long as there is > >>>cache on the inactive list. I'm guessing that the inter-reference > >>>distance of data-2 is bigger than half of memory, so it's never > >>>getting activated and data-1 is never challenged. > >>Hi Johannes, > >> > >>What's the meaning of "inter-reference distance" > >It's the number of memory accesses between two accesses to the same > >page: > > > > A B C D A B C E ... > > |_______| > > | | > > > >>and why compare it with half of memoy, what's the trick? > >If B gets accessed twice, it gets activated. If it gets evicted in > >between, the second access will be a fresh page fault and B will not > >be recognized as frequently used. 
> >
> >Our cutoff for scanning the active list is cache size / 2 right now
> >(inactive_file_is_low), leaving 50% of memory to the inactive list.
> >If the inter-reference distance for pages on the inactive list is
> >bigger than that, they get evicted before their second access.
>
> Hi Johannes,
>
> Thanks for your explanation. But could you give a short description
> of how you resolve this inactive list thrashing issues?

I remember a time stamp of evicted file pages in the page cache radix
tree that let me reconstruct the inter-reference distance even after a
page has been evicted from cache when it's faulted back in. This way
I can tell a one-time sequence from thrashing, no matter how small the
inactive list.

When thrashing is detected, I start deactivating protected pages and
put them next to the refaulted cache on the head of the inactive list
and let them fight it out as usual. In this reported case, the old
data will be challenged and since it's no longer used, it will just
drop off the inactive list eventually. If the guess is wrong and the
deactivated memory is used more heavily than the refaulting pages,
they will just get activated again without incurring any disruption
like a major fault.
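
The approach described above can be tried out in a toy model. The sketch below is not the patch series and makes several assumptions that are not in the original message: the class and method names are hypothetical, the "time stamp" is modelled as a running count of inactive-list evictions, and a refault is treated as thrashing when the number of evictions since the page was dropped is no larger than the current active list. One unit stands for roughly 1 GB, rounded from the figures in this thread (40 GB working set, about 26 GB of stale active pages, about 59 GB of cache in total).

```python
from collections import OrderedDict


class TwoListCache:
    """Toy active/inactive cache with optional refault detection."""

    def __init__(self, total_pages, stale_active, detect_refaults):
        self.total = total_pages
        self.detect = detect_refaults
        # Both lists are ordered oldest first; the newest end is the "head".
        self.active = OrderedDict((("old", i), True) for i in range(stale_active))
        self.inactive = OrderedDict()
        self.evictions = 0  # running count of inactive-list evictions
        self.shadow = {}    # page -> eviction count when it was evicted

    def _evict_if_needed(self):
        # Only the inactive list is ever evicted from.
        while self.inactive and len(self.active) + len(self.inactive) > self.total:
            victim, _ = self.inactive.popitem(last=False)
            self.shadow[victim] = self.evictions
            self.evictions += 1

    def access(self, page):
        """Touch one page and return True on a cache hit."""
        if page in self.active:
            return True
        if page in self.inactive:
            # Second access while still cached: activate the page.
            del self.inactive[page]
            self.active[page] = True
            return True
        if self.detect and page in self.shadow:
            refault_distance = self.evictions - self.shadow.pop(page)
            # Assumed thrashing criterion (illustrative only).
            if self.active and refault_distance <= len(self.active):
                # Deactivate the oldest protected page so it has to compete.
                old, _ = self.active.popitem(last=False)
                self.inactive[old] = True
        self.inactive[page] = True
        self._evict_if_needed()
        return False


def hits_per_pass(working_set, passes, **cache_args):
    """Sequentially scan pages 0..working_set-1 and count hits per pass."""
    cache = TwoListCache(**cache_args)
    return [sum(cache.access(p) for p in range(working_set)) for _ in range(passes)]


print(hits_per_pass(40, 5, total_pages=59, stale_active=26, detect_refaults=False))
print(hits_per_pass(40, 5, total_pages=59, stale_active=26, detect_refaults=True))
```

Without detection every pass misses on every page, because the scan never fits in the inactive list. With detection the stale pages are gradually pushed onto the inactive list and the scanned working set ends up fully cached after a few passes.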
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758740Ab2KVXDw (ORCPT ); Thu, 22 Nov 2012 18:03:52 -0500 Received: from zene.cmpxchg.org ([85.214.230.12]:34812 "EHLO zene.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752475Ab2KVXDt (ORCPT ); Thu, 22 Nov 2012 18:03:49 -0500 Date: Wed, 21 Nov 2012 20:09:59 -0500 From: Johannes Weiner To: Jaegeuk Hanse Cc: Jan Kara , metin d , "linux-kernel@vger.kernel.org" , linux-mm@kvack.org Subject: Re: Problem in Page Cache Replacement Message-ID: <20121122010959.GF24381@cmpxchg.org> References: <1353433362.85184.YahooMailNeo@web141101.mail.bf1.yahoo.com> <20121120182500.GH1408@quack.suse.cz> <20121121213417.GC24381@cmpxchg.org> <50AD7647.7050200@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <50AD7647.7050200@gmail.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Nov 22, 2012 at 08:48:07AM +0800, Jaegeuk Hanse wrote: > On 11/22/2012 05:34 AM, Johannes Weiner wrote: > >Hi, > > > >On Tue, Nov 20, 2012 at 07:25:00PM +0100, Jan Kara wrote: > >>On Tue 20-11-12 09:42:42, metin d wrote: > >>>I have two PostgreSQL databases named data-1 and data-2 that sit on the > >>>same machine. Both databases keep 40 GB of data, and the total memory > >>>available on the machine is 68GB. > >>> > >>>I started data-1 and data-2, and ran several queries to go over all their > >>>data. Then, I shut down data-1 and kept issuing queries against data-2. > >>>For some reason, the OS still holds on to large parts of data-1's pages > >>>in its page cache, and reserves about 35 GB of RAM to data-2's files. As > >>>a result, my queries on data-2 keep hitting disk. > >>> > >>>I'm checking page cache usage with fincore. 
When I run a table scan query
> >>>against data-2, I see that data-2's pages get evicted and put back into
> >>>the cache in a round-robin manner. Nothing happens to data-1's pages,
> >>>although they haven't been touched for days.
> >>>
> >>>Does anybody know why data-1's pages aren't evicted from the page cache?
> >>>I'm open to all kind of suggestions you think it might relate to problem.
> >This might be because we do not deactivate pages as long as there is
> >cache on the inactive list. I'm guessing that the inter-reference
> >distance of data-2 is bigger than half of memory, so it's never
> >getting activated and data-1 is never challenged.
>
> Hi Johannes,
>
> What's the meaning of "inter-reference distance"

It's the number of memory accesses between two accesses to the same
page:

  A B C D A B C E ...
    |_______|
    |       |

> and why compare it with half of memory, what's the trick?

If B gets accessed twice, it gets activated. If it gets evicted in
between, the second access will be a fresh page fault and B will not
be recognized as frequently used.

Our cutoff for scanning the active list is cache size / 2 right now
(inactive_file_is_low), leaving 50% of memory to the inactive list.
If the inter-reference distance for pages on the inactive list is
bigger than that, they get evicted before their second access.
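
A toy model may help make the explanation above concrete. It is a sketch under stated assumptions, not kernel code: the function names are made up for illustration, the distance is counted as the number of accesses strictly between two accesses to the same page, and active pages are assumed never to be deactivated, so only the inactive list can be evicted from. One unit stands for roughly 1 GB, rounded from figures reported in this thread (about 59 GB of cache in total, about 26 GB of it stale active pages).

```python
from collections import OrderedDict


def inter_reference_distances(trace):
    """For every repeated access, count the accesses strictly between it
    and the previous access to the same page."""
    last_seen = {}
    distances = []
    for position, page in enumerate(trace):
        if page in last_seen:
            distances.append((page, position - last_seen[page] - 1))
        last_seen[page] = position
    return distances


def scan_hits(working_set, total_pages, stale_active, passes):
    """Repeatedly scan pages 0..working_set-1 and count hits per pass."""
    active = {("old", i) for i in range(stale_active)}  # never deactivated
    inactive = OrderedDict()  # oldest first
    hits_per_pass = []
    for _ in range(passes):
        hits = 0
        for page in range(working_set):
            if page in active:
                hits += 1
            elif page in inactive:
                # Second access while still cached: activate the page.
                hits += 1
                del inactive[page]
                active.add(page)
            else:
                # Miss: fault the page in and evict from the inactive tail.
                inactive[page] = True
                while inactive and len(active) + len(inactive) > total_pages:
                    inactive.popitem(last=False)
        hits_per_pass.append(hits)
    return hits_per_pass


print(inter_reference_distances("ABCDABCE"))
print(scan_hits(working_set=40, total_pages=59, stale_active=26, passes=3))
print(scan_hits(working_set=30, total_pages=59, stale_active=26, passes=3))
```

A 40-unit scan does not fit into the 33 units left for the inactive list, so every page is evicted before its second access and never activated. A 30-unit scan fits, is activated on the second pass, and stays cached.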
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753180Ab2KVXE0 (ORCPT ); Thu, 22 Nov 2012 18:04:26 -0500 Received: from zene.cmpxchg.org ([85.214.230.12]:34813 "EHLO zene.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1758522Ab2KVXDw (ORCPT ); Thu, 22 Nov 2012 18:03:52 -0500 Date: Wed, 21 Nov 2012 16:34:18 -0500 From: Johannes Weiner To: Jan Kara Cc: metin d , "linux-kernel@vger.kernel.org" , linux-mm@kvack.org Subject: Re: Problem in Page Cache Replacement Message-ID: <20121121213417.GC24381@cmpxchg.org> References: <1353433362.85184.YahooMailNeo@web141101.mail.bf1.yahoo.com> <20121120182500.GH1408@quack.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121120182500.GH1408@quack.suse.cz> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi, On Tue, Nov 20, 2012 at 07:25:00PM +0100, Jan Kara wrote: > On Tue 20-11-12 09:42:42, metin d wrote: > > I have two PostgreSQL databases named data-1 and data-2 that sit on the > > same machine. Both databases keep 40 GB of data, and the total memory > > available on the machine is 68GB. > > > > I started data-1 and data-2, and ran several queries to go over all their > > data. Then, I shut down data-1 and kept issuing queries against data-2. > > For some reason, the OS still holds on to large parts of data-1's pages > > in its page cache, and reserves about 35 GB of RAM to data-2's files. As > > a result, my queries on data-2 keep hitting disk. > > > > I'm checking page cache usage with fincore. When I run a table scan query > > against data-2, I see that data-2's pages get evicted and put back into > > the cache in a round-robin manner. Nothing happens to data-1's pages, > > although they haven't been touched for days. > > > > Does anybody know why data-1's pages aren't evicted from the page cache? 
> > I'm open to all kind of suggestions you think it might relate to problem.

This might be because we do not deactivate pages as long as there is
cache on the inactive list. I'm guessing that the inter-reference
distance of data-2 is bigger than half of memory, so it's never
getting activated and data-1 is never challenged.

I have a series of patches that detects a thrashing inactive list and
handles working set changes up to the size of memory. Would you be
willing to test them? They are currently based on 3.4, let me know
what version works best for you.

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756970Ab2KWAsS (ORCPT ); Thu, 22 Nov 2012 19:48:18 -0500
Received: from nm40-vm1.bullet.mail.bf1.yahoo.com ([72.30.239.209]:32805 "EHLO nm40-vm1.bullet.mail.bf1.yahoo.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756547Ab2KWAsN convert rfc822-to-8bit (ORCPT ); Thu, 22 Nov 2012 19:48:13 -0500
X-Greylist: delayed 14417 seconds by postgrey-1.27 at vger.kernel.org; Thu, 22 Nov 2012 19:48:13 EST
X-Yahoo-Newman-Property: ymail-3
X-Yahoo-Newman-Id: 533563.18359.bm@omp1047.mail.bf1.yahoo.com
X-Mailer: YahooMailWebService/0.8.123.460
References: <1353433362.85184.YahooMailNeo@web141101.mail.bf1.yahoo.com> <20121120182500.GH1408@quack.suse.cz> <20121121213417.GC24381@cmpxchg.org> <50AD7647.7050200@gmail.com> <20121122010959.GF24381@cmpxchg.org>
Message-ID: <1353577068.19982.YahooMailNeo@web141101.mail.bf1.yahoo.com>
Date: Thu, 22 Nov 2012 01:37:48 -0800 (PST)
From: metin d
Reply-To: metin d
Subject: Re: Problem in Page Cache Replacement
To: Johannes Weiner , Jaegeuk Hanse
Cc: Jan Kara , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , =?utf-8?B?TWV0aW4gRMO2xZ9sw7w=?=
In-Reply-To: <20121122010959.GF24381@cmpxchg.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8BIT
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

Hi Johannes,

Yes, the problem was as you predicted. I tried making data-2 pages
"active" by manually reading them twice, and data-1 finally got
evicted from the page cache.

We have large files in PostgreSQL and Hadoop that we scan
sequentially, and we try to fit our working set into total memory. So
I hope your patches make it into the Linux kernel as soon as possible.
Thanks, Metin ----- Original Message ----- From: Johannes Weiner To: Jaegeuk Hanse Cc: Jan Kara ; metin d ; "linux-kernel@vger.kernel.org" ; linux-mm@kvack.org Sent: Thursday, November 22, 2012 3:09 AM Subject: Re: Problem in Page Cache Replacement On Thu, Nov 22, 2012 at 08:48:07AM +0800, Jaegeuk Hanse wrote: > On 11/22/2012 05:34 AM, Johannes Weiner wrote: > >Hi, > > > >On Tue, Nov 20, 2012 at 07:25:00PM +0100, Jan Kara wrote: > >>On Tue 20-11-12 09:42:42, metin d wrote: > >>>I have two PostgreSQL databases named data-1 and data-2 that sit on the > >>>same machine. Both databases keep 40 GB of data, and the total memory > >>>available on the machine is 68GB. > >>> > >>>I started data-1 and data-2, and ran several queries to go over all their > >>>data. Then, I shut down data-1 and kept issuing queries against data-2. > >>>For some reason, the OS still holds on to large parts of data-1's pages > >>>in its page cache, and reserves about 35 GB of RAM to data-2's files. As > >>>a result, my queries on data-2 keep hitting disk. > >>> > >>>I'm checking page cache usage with fincore. When I run a table scan query > >>>against data-2, I see that data-2's pages get evicted and put back into > >>>the cache in a round-robin manner. Nothing happens to data-1's pages, > >>>although they haven't been touched for days. > >>> > >>>Does anybody know why data-1's pages aren't evicted from the page cache? > >>>I'm open to all kind of suggestions you think it might relate to problem. > >This might be because we do not deactive pages as long as there is > >cache on the inactive list.  I'm guessing that the inter-reference > >distance of data-2 is bigger than half of memory, so it's never > >getting activated and data-1 is never challenged. > > Hi Johannes, > > What's the meaning of "inter-reference distance" It's the number of memory accesses between two accesses to the same page:   A B C D A B C E ...     
|_______|     |      | > and why compare it with half of memoy, what's the trick? If B gets accessed twice, it gets activated.  If it gets evicted in between, the second access will be a fresh page fault and B will not be recognized as frequently used. Our cutoff for scanning the active list is cache size / 2 right now (inactive_file_is_low), leaving 50% of memory to the inactive list. If the inter-reference distance for pages on the inactive list is bigger than that, they get evicted before their second access. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758601Ab2KWBcO (ORCPT ); Thu, 22 Nov 2012 20:32:14 -0500 Received: from mail-pb0-f46.google.com ([209.85.160.46]:45989 "EHLO mail-pb0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755862Ab2KWBcN (ORCPT ); Thu, 22 Nov 2012 20:32:13 -0500 Message-ID: <50AED214.4000701@gmail.com> Date: Fri, 23 Nov 2012 09:32:04 +0800 From: Jaegeuk Hanse User-Agent: Mozilla/5.0 (X11; Linux i686; rv:17.0) Gecko/17.0 Thunderbird/17.0 MIME-Version: 1.0 To: Fengguang Wu CC: metin d , Jan Kara , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" Subject: Re: Problem in Page Cache Replacement References: <1353433362.85184.YahooMailNeo@web141101.mail.bf1.yahoo.com> <20121120182500.GH1408@quack.suse.cz> <1353485020.53500.YahooMailNeo@web141104.mail.bf1.yahoo.com> <1353485630.17455.YahooMailNeo@web141106.mail.bf1.yahoo.com> <50AC9220.70202@gmail.com> <20121121090204.GA9064@localhost> <50ACA209.9000101@gmail.com> <20121122152611.GA11736@localhost> In-Reply-To: <20121122152611.GA11736@localhost> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 11/22/2012 11:26 PM, Fengguang Wu wrote: > Hi Jaegeuk, > > Sorry for the delay. I'm traveling these days.. 
> > On Wed, Nov 21, 2012 at 05:42:33PM +0800, Jaegeuk Hanse wrote: >> On 11/21/2012 05:02 PM, Fengguang Wu wrote: >>> On Wed, Nov 21, 2012 at 04:34:40PM +0800, Jaegeuk Hanse wrote: >>>> Cc Fengguang Wu. >>>> >>>> On 11/21/2012 04:13 PM, metin d wrote: >>>>>> Curious. Added linux-mm list to CC to catch more attention. If you run >>>>>> echo 1 >/proc/sys/vm/drop_caches does it evict data-1 pages from memory? >>>>> I'm guessing it'd evict the entries, but am wondering if we could run any more diagnostics before trying this. >>>>> >>>>> We regularly use a setup where we have two databases; one gets used frequently and the other one about once a month. It seems like the memory manager keeps unused pages in memory at the expense of frequently used database's performance. >>>>> My understanding was that under memory pressure from heavily >>>>> accessed pages, unused pages would eventually get evicted. Is there >>>>> anything else we can try on this host to understand why this is >>>>> happening? >>> We may debug it this way. >>> >>> 1) run 'fadvise data-2 0 0 dontneed' to drop data-2 cached pages >>> (please double check via /proc/vmstat whether it does the expected work) >>> >>> 2) run 'page-types -r' with root, to view the page status for the >>> remaining pages of data-1 >>> >>> The fadvise tool comes from Andrew Morton's ext3-tools. (source code attached) >>> Please compile them with options "-Dlinux -I. -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE" >>> >>> page-types can be found in the kernel source tree tools/vm/page-types.c >>> >>> Sorry that sounds a bit twisted.. I do have a patch to directly dump >>> page cache status of a user specified file, however it's not >>> upstreamed yet. >> Hi Fengguang, >> >> Thanks for you detail steps, I think metin can have a try. 
>> >> flags page-count MB symbolic-flags long-symbolic-flags >> 0x0000000000000000 607699 2373 >> ___________________________________ >> 0x0000000100000000 343227 1340 >> _______________________r___________ reserved > > We don't need to care about the above two pages states actually. > Page cache pages will never be in the special reserved or > all-flags-cleared state. Hi Fengguang, Thanks for your response. But which kind of pages are in the special reserved and which are all-flags-cleared? Regards, Jaegeuk > >> But I have some questions of the print of page-type: >> >> Is 2373MB here mean total memory in used include page cache? I don't >> think so. >> Which kind of pages will be marked reserved? >> Which line of long-symbolic-flags is for page cache? > The (lru && !anonymous) pages are page cache pages. > > Thanks, > Fengguang > >>>>> On Tue 20-11-12 09:42:42, metin d wrote: >>>>>> I have two PostgreSQL databases named data-1 and data-2 that sit on the >>>>>> same machine. Both databases keep 40 GB of data, and the total memory >>>>>> available on the machine is 68GB. >>>>>> >>>>>> I started data-1 and data-2, and ran several queries to go over all their >>>>>> data. Then, I shut down data-1 and kept issuing queries against data-2. >>>>>> For some reason, the OS still holds on to large parts of data-1's pages >>>>>> in its page cache, and reserves about 35 GB of RAM to data-2's files. As >>>>>> a result, my queries on data-2 keep hitting disk. >>>>>> >>>>>> I'm checking page cache usage with fincore. When I run a table scan query >>>>>> against data-2, I see that data-2's pages get evicted and put back into >>>>>> the cache in a round-robin manner. Nothing happens to data-1's pages, >>>>>> although they haven't been touched for days. >>>>>> >>>>>> Does anybody know why data-1's pages aren't evicted from the page cache? >>>>>> I'm open to all kind of suggestions you think it might relate to problem. >>>>> Curious. 
Added linux-mm list to CC to catch more attention. If you run >>>>> echo 1 >/proc/sys/vm/drop_caches >>>>> does it evict data-1 pages from memory? >>>>> >>>>>> This is an EC2 m2.4xlarge instance on Amazon with 68 GB of RAM and no >>>>>> swap space. The kernel version is: >>>>>> >>>>>> $ uname -r >>>>>> 3.2.28-45.62.amzn1.x86_64 >>>>>> Edit: >>>>>> >>>>>> and it seems that I use one NUMA instance, if you think that it can a problem. >>>>>> >>>>>> $ numactl --hardware >>>>>> available: 1 nodes (0) >>>>>> node 0 cpus: 0 1 2 3 4 5 6 7 >>>>>> node 0 size: 70007 MB >>>>>> node 0 free: 360 MB >>>>>> node distances: >>>>>> node 0 >>>>>> 0: 10 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758860Ab2KWB6x (ORCPT ); Thu, 22 Nov 2012 20:58:53 -0500 Received: from mail-ie0-f174.google.com ([209.85.223.174]:55031 "EHLO mail-ie0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753039Ab2KWB6v (ORCPT ); Thu, 22 Nov 2012 20:58:51 -0500 Message-ID: <50AED854.7080300@gmail.com> Date: Fri, 23 Nov 2012 09:58:44 +0800 From: Jaegeuk Hanse User-Agent: Mozilla/5.0 (X11; Linux i686; rv:17.0) Gecko/17.0 Thunderbird/17.0 MIME-Version: 1.0 To: metin d CC: Jan Kara , "linux-kernel@vger.kernel.org" , linux-mm@kvack.org Subject: Re: Problem in Page Cache Replacement References: <1353433362.85184.YahooMailNeo@web141101.mail.bf1.yahoo.com> <20121120182500.GH1408@quack.suse.cz> In-Reply-To: <20121120182500.GH1408@quack.suse.cz> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 11/21/2012 02:25 AM, Jan Kara wrote: > On Tue 20-11-12 09:42:42, metin d wrote: >> I have two PostgreSQL databases named data-1 and data-2 that sit on the >> same machine. Both databases keep 40 GB of data, and the total memory >> available on the machine is 68GB. 
>> >> I started data-1 and data-2, and ran several queries to go over all their >> data. Then, I shut down data-1 and kept issuing queries against data-2. >> For some reason, the OS still holds on to large parts of data-1's pages >> in its page cache, and reserves about 35 GB of RAM to data-2's files. As >> a result, my queries on data-2 keep hitting disk. >> >> I'm checking page cache usage with fincore. When I run a table scan query >> against data-2, I see that data-2's pages get evicted and put back into >> the cache in a round-robin manner. Nothing happens to data-1's pages, >> although they haven't been touched for days. Hi metin d, fincore is a tool or ...? How could I get it? Regards, Jaegeuk >> >> Does anybody know why data-1's pages aren't evicted from the page cache? >> I'm open to all kind of suggestions you think it might relate to problem. > Curious. Added linux-mm list to CC to catch more attention. If you run > echo 1 >/proc/sys/vm/drop_caches > does it evict data-1 pages from memory? > >> This is an EC2 m2.4xlarge instance on Amazon with 68 GB of RAM and no >> swap space. The kernel version is: >> >> $ uname -r >> 3.2.28-45.62.amzn1.x86_64 >> Edit: >> >> and it seems that I use one NUMA instance, if you think that it can a problem. 
>> >> $ numactl --hardware >> available: 1 nodes (0) >> node 0 cpus: 0 1 2 3 4 5 6 7 >> node 0 size: 70007 MB >> node 0 free: 360 MB >> node distances: >> node 0 >> 0: 10 > Honza From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758900Ab2KWCKe (ORCPT ); Thu, 22 Nov 2012 21:10:34 -0500 Received: from mail-ia0-f174.google.com ([209.85.210.174]:53084 "EHLO mail-ia0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752382Ab2KWCKd (ORCPT ); Thu, 22 Nov 2012 21:10:33 -0500 Message-ID: <50AEDB12.6090300@gmail.com> Date: Fri, 23 Nov 2012 10:10:26 +0800 From: Jaegeuk Hanse User-Agent: Mozilla/5.0 (X11; Linux i686; rv:17.0) Gecko/17.0 Thunderbird/17.0 MIME-Version: 1.0 To: Fengguang Wu CC: =?UTF-8?B?TWV0aW4gRMO2xZ9sw7w=?= , Jan Kara , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" Subject: Re: Problem in Page Cache Replacement References: <20121120182500.GH1408@quack.suse.cz> <1353485020.53500.YahooMailNeo@web141104.mail.bf1.yahoo.com> <1353485630.17455.YahooMailNeo@web141106.mail.bf1.yahoo.com> <50AC9220.70202@gmail.com> <20121121090204.GA9064@localhost> <50ACA209.9000101@gmail.com> <1353491880.11679.YahooMailNeo@web141102.mail.bf1.yahoo.com> <50ACA634.5000007@gmail.com> <20121122154107.GB11736@localhost> <20121122155318.GA12636@localhost> In-Reply-To: <20121122155318.GA12636@localhost> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 11/22/2012 11:53 PM, Fengguang Wu wrote: > On Thu, Nov 22, 2012 at 11:41:07PM +0800, Fengguang Wu wrote: >> On Wed, Nov 21, 2012 at 12:07:22PM +0200, Metin Döşlü wrote: >>> On Wed, Nov 21, 2012 at 12:00 PM, Jaegeuk Hanse wrote: >>>> On 11/21/2012 05:58 PM, metin d wrote: >>>> >>>> Hi Fengguang, >>>> >>>> I run tests and attached the results. The line below I guess shows the data-1 page caches. 
>>>> >>>> 0x000000080000006c 6584051 25718 __RU_lA___________________P________ referenced,uptodate,lru,active,private >>>> >>>> >>>> I think this is just one state of page cache pages. >>> But why are these page cache pages in this state, as opposed to other page >>> cache pages? From the results I conclude that: >>> >>> data-1 pages are in state : referenced,uptodate,lru,active,private >> I wonder if it's this code that stops data-1 pages from being >> reclaimed: >> >> shrink_page_list(): >> >> if (page_has_private(page)) { >> if (!try_to_release_page(page, sc->gfp_mask)) >> goto activate_locked; >> >> What's the filesystem used? > Ah it's more likely caused by this logic: > > if (is_active_lru(lru)) { > if (inactive_list_is_low(mz, file)) > shrink_active_list(nr_to_scan, mz, sc, priority, file); > > The active file list won't be scanned at all if it's smaller than the > inactive list. In this case, it's inactive=33586MB > active=25719MB. So > the data-1 pages in the active list will never be scanned and reclaimed. Hi Fengguang, It seems that most of data-1's file pages are on the active LRU list and most of data-2's file pages are on the inactive LRU list. As Johannes mentioned, if the inter-reference distance is bigger than half of memory, the pages will not be activated. How do you intend to resolve this issue? Is Johannes's work on inactive list thrashing available?
Regards, Jaegeuk > >>> data-2 pages are in state : referenced,uptodate,lru,mappedtodisk >> Thanks, >> Fengguang From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758918Ab2KWCOO (ORCPT ); Thu, 22 Nov 2012 21:14:14 -0500 Received: from mail-ie0-f174.google.com ([209.85.223.174]:61925 "EHLO mail-ie0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752382Ab2KWCON (ORCPT ); Thu, 22 Nov 2012 21:14:13 -0500 Message-ID: <50AEDBEF.8070408@gmail.com> Date: Fri, 23 Nov 2012 10:14:07 +0800 From: Jaegeuk Hanse User-Agent: Mozilla/5.0 (X11; Linux i686; rv:17.0) Gecko/17.0 Thunderbird/17.0 MIME-Version: 1.0 To: Johannes Weiner CC: Jan Kara , metin d , "linux-kernel@vger.kernel.org" , linux-mm@kvack.org Subject: Re: Problem in Page Cache Replacement References: <1353433362.85184.YahooMailNeo@web141101.mail.bf1.yahoo.com> <20121120182500.GH1408@quack.suse.cz> <20121121213417.GC24381@cmpxchg.org> <50AD7647.7050200@gmail.com> <20121122010959.GF24381@cmpxchg.org> <50AE25AB.2060808@gmail.com> <20121122161743.GH24381@cmpxchg.org> In-Reply-To: <20121122161743.GH24381@cmpxchg.org> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 11/23/2012 12:17 AM, Johannes Weiner wrote: > On Thu, Nov 22, 2012 at 09:16:27PM +0800, Jaegeuk Hanse wrote: >> On 11/22/2012 09:09 AM, Johannes Weiner wrote: >>> On Thu, Nov 22, 2012 at 08:48:07AM +0800, Jaegeuk Hanse wrote: >>>> On 11/22/2012 05:34 AM, Johannes Weiner wrote: >>>>> Hi, >>>>> >>>>> On Tue, Nov 20, 2012 at 07:25:00PM +0100, Jan Kara wrote: >>>>>> On Tue 20-11-12 09:42:42, metin d wrote: >>>>>>> I have two PostgreSQL databases named data-1 and data-2 that sit on the >>>>>>> same machine. Both databases keep 40 GB of data, and the total memory >>>>>>> available on the machine is 68GB. 
>>>>>>> >>>>>>> I started data-1 and data-2, and ran several queries to go over all their >>>>>>> data. Then, I shut down data-1 and kept issuing queries against data-2. >>>>>>> For some reason, the OS still holds on to large parts of data-1's pages >>>>>>> in its page cache, and reserves about 35 GB of RAM to data-2's files. As >>>>>>> a result, my queries on data-2 keep hitting disk. >>>>>>> >>>>>>> I'm checking page cache usage with fincore. When I run a table scan query >>>>>>> against data-2, I see that data-2's pages get evicted and put back into >>>>>>> the cache in a round-robin manner. Nothing happens to data-1's pages, >>>>>>> although they haven't been touched for days. >>>>>>> >>>>>>> Does anybody know why data-1's pages aren't evicted from the page cache? >>>>>>> I'm open to all kind of suggestions you think it might relate to problem. >>>>> This might be because we do not deactive pages as long as there is >>>>> cache on the inactive list. I'm guessing that the inter-reference >>>>> distance of data-2 is bigger than half of memory, so it's never >>>>> getting activated and data-1 is never challenged. >>>> Hi Johannes, >>>> >>>> What's the meaning of "inter-reference distance" >>> It's the number of memory accesses between two accesses to the same >>> page: >>> >>> A B C D A B C E ... >>> |_______| >>> | | >>> >>>> and why compare it with half of memoy, what's the trick? >>> If B gets accessed twice, it gets activated. If it gets evicted in >>> between, the second access will be a fresh page fault and B will not >>> be recognized as frequently used. >>> >>> Our cutoff for scanning the active list is cache size / 2 right now >>> (inactive_file_is_low), leaving 50% of memory to the inactive list. >>> If the inter-reference distance for pages on the inactive list is >>> bigger than that, they get evicted before their second access. >> Hi Johannes, >> >> Thanks for your explanation. 
But could you give a short description >> of how you resolve this inactive list thrashing issues? > I remember a time stamp of evicted file pages in the page cache radix > tree that let me reconstruct the inter-reference distance even after a > page has been evicted from cache when it's faulted back in. This way > I can tell a one-time sequence from thrashing, no matter how small the > inactive list. > > When thrashing is detected, I start deactivating protected pages and > put them next to the refaulted cache on the head of the inactive list > and let them fight it out as usual. In this reported case, the old > data will be challenged and since it's no longer used, it will just > drop off the inactive list eventually. If the guess is wrong and the > deactivated memory is used more heavily than the refaulting pages, > they will just get activated again without incurring any disruption > like a major fault. Hi Johannes, If you also add the time stamp to the protected pages which you deactive when incur thrashing? 
Regards, Jaegeuk From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758908Ab2KWC0J (ORCPT ); Thu, 22 Nov 2012 21:26:09 -0500 Received: from mga01.intel.com ([192.55.52.88]:18728 "EHLO mga01.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756873Ab2KWC0I (ORCPT ); Thu, 22 Nov 2012 21:26:08 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.83,304,1352102400"; d="scan'208";a="251261115" Date: Fri, 23 Nov 2012 10:25:57 +0800 From: Fengguang Wu To: Jaegeuk Hanse Cc: metin d , Jan Kara , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" Subject: Re: Problem in Page Cache Replacement Message-ID: <20121123022557.GA3954@localhost> References: <1353433362.85184.YahooMailNeo@web141101.mail.bf1.yahoo.com> <20121120182500.GH1408@quack.suse.cz> <1353485020.53500.YahooMailNeo@web141104.mail.bf1.yahoo.com> <1353485630.17455.YahooMailNeo@web141106.mail.bf1.yahoo.com> <50AC9220.70202@gmail.com> <20121121090204.GA9064@localhost> <50ACA209.9000101@gmail.com> <20121122152611.GA11736@localhost> <50AED214.4000701@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <50AED214.4000701@gmail.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Jaegeuk, > Thanks for your response. But which kind of pages are in the special > reserved and which are all-flags-cleared? The all-flags-cleared pages are mostly free pages in the buddy system. The pages with flag "buddy" are also free pages: the buddy system only marks the head pages of each order-2 free range with flag "buddy". The reserved pages come from many sources, they may be set for memory reserved for BIOS, memory holes, offlined memory, or used by some device drivers. 
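[Editorial illustration, not part of the original message.] The flag words in the page-types output quoted in this thread can be decoded by hand, which makes it easier to see which lines are page cache. The following is a minimal sketch in Python. It assumes the bit positions documented for /proc/kpageflags and the extra bits used by tools/vm/page-types.c, lists only a subset of them, and assumes 4 KiB pages. The helper names (decode_flags, is_page_cache, pages_to_mb) are hypothetical. The page cache test follows the "(lru && !anonymous)" rule given earlier in the thread.

```python
# Subset of page flag bits. Positions are assumed from the kernel's pagemap
# documentation and tools/vm/page-types.c; bits not listed are ignored here.
KPF_BITS = {
    2: "referenced",
    3: "uptodate",
    5: "lru",
    6: "active",
    10: "buddy",
    12: "anonymous",
    32: "reserved",
    34: "mappedtodisk",
    35: "private",
}

PAGE_SIZE = 4096  # assumption: 4 KiB pages, as on x86_64


def decode_flags(flags):
    """Return the names of the known bits set in a page-types flag word."""
    return [name for bit, name in sorted(KPF_BITS.items()) if flags & (1 << bit)]


def is_page_cache(flags):
    """Apply the rule from the thread: lru set and anonymous clear."""
    names = decode_flags(flags)
    return "lru" in names and "anonymous" not in names


def pages_to_mb(page_count, page_size=PAGE_SIZE):
    """Convert a page count to whole megabytes, rounding down."""
    return page_count * page_size // (1024 * 1024)


if __name__ == "__main__":
    # The three flag words and page counts quoted in the thread.
    for flags, count in [
        (0x0000000000000000, 607699),
        (0x0000000100000000, 343227),
        (0x000000080000006C, 6584051),
    ]:
        print(hex(flags), pages_to_mb(count), decode_flags(flags),
              "page cache" if is_page_cache(flags) else "not page cache")
```

Under these assumptions, the 0x000000080000006c line decodes to referenced,uptodate,lru,active,private and counts as page cache, while the all-flags-cleared and reserved lines do not.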
Thanks, Fengguang From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1161008Ab2KWIIq (ORCPT ); Fri, 23 Nov 2012 03:08:46 -0500 Received: from nm6.bullet.mail.bf1.yahoo.com ([98.139.212.165]:47243 "EHLO nm6.bullet.mail.bf1.yahoo.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1030255Ab2KWIIp convert rfc822-to-8bit (ORCPT ); Fri, 23 Nov 2012 03:08:45 -0500 X-Mailer: YahooMailWebService/0.8.123.460 References: <1353433362.85184.YahooMailNeo@web141101.mail.bf1.yahoo.com> <20121120182500.GH1408@quack.suse.cz> <50AED854.7080300@gmail.com> Message-ID: <1353658123.36385.YahooMailNeo@web141101.mail.bf1.yahoo.com> Date: Fri, 23 Nov 2012 00:08:43 -0800 (PST) From: metin d Reply-To: metin d Subject: Re: Problem in Page Cache Replacement To: Jaegeuk Hanse Cc: Jan Kara , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" In-Reply-To: <50AED854.7080300@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Transfer-Encoding: 8BIT Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org ----- Original Message ----- From: Jaegeuk Hanse To: metin d Cc: Jan Kara ; "linux-kernel@vger.kernel.org" ; linux-mm@kvack.org Sent: Friday, November 23, 2012 3:58 AM Subject: Re: Problem in Page Cache Replacement On 11/21/2012 02:25 AM, Jan Kara wrote: > On Tue 20-11-12 09:42:42, metin d wrote: >> I have two PostgreSQL databases named data-1 and data-2 that sit on the >> same machine. Both databases keep 40 GB of data, and the total memory >> available on the machine is 68GB. >> >> I started data-1 and data-2, and ran several queries to go over all their >> data. Then, I shut down data-1 and kept issuing queries against data-2. >> For some reason, the OS still holds on to large parts of data-1's pages >> in its page cache, and reserves about 35 GB of RAM to data-2's files.
As >> a result, my queries on data-2 keep hitting disk. >> >> I'm checking page cache usage with fincore. When I run a table scan query >> against data-2, I see that data-2's pages get evicted and put back into >> the cache in a round-robin manner. Nothing happens to data-1's pages, >> although they haven't been touched for days. > Hi metin d, > fincore is a tool or ...? How could I get it? > Regards, > Jaegeuk Hi Jaegeuk, Yes, it is a tool, you get it from here : http://code.google.com/p/linux-ftools/ Regards, Metin >> >> Does anybody know why data-1's pages aren't evicted from the page cache? >> I'm open to all kind of suggestions you think it might relate to problem. >    Curious. Added linux-mm list to CC to catch more attention. If you run > echo 1 >/proc/sys/vm/drop_caches >    does it evict data-1 pages from memory? > >> This is an EC2 m2.4xlarge instance on Amazon with 68 GB of RAM and no >> swap space. The kernel version is: >> >> $ uname -r >> 3.2.28-45.62.amzn1.x86_64 >> Edit: >> >> and it seems that I use one NUMA instance, if  you think that it can a problem. 
>> >> $ numactl --hardware >> available: 1 nodes (0) >> node 0 cpus: 0 1 2 3 4 5 6 7 >> node 0 size: 70007 MB >> node 0 free: 360 MB >> node distances: >> node  0 >>    0:  10 >                                 Honza From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757864Ab2KWISG (ORCPT ); Fri, 23 Nov 2012 03:18:06 -0500 Received: from mail-pa0-f46.google.com ([209.85.220.46]:51429 "EHLO mail-pa0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753754Ab2KWISE (ORCPT ); Fri, 23 Nov 2012 03:18:04 -0500 Message-ID: <50AF3134.3090803@gmail.com> Date: Fri, 23 Nov 2012 16:17:56 +0800 From: Jaegeuk Hanse User-Agent: Mozilla/5.0 (X11; Linux i686; rv:17.0) Gecko/17.0 Thunderbird/17.0 MIME-Version: 1.0 To: metin d CC: Jan Kara , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" Subject: Re: Problem in Page Cache Replacement References: <1353433362.85184.YahooMailNeo@web141101.mail.bf1.yahoo.com> <20121120182500.GH1408@quack.suse.cz> <50AED854.7080300@gmail.com> <1353658123.36385.YahooMailNeo@web141101.mail.bf1.yahoo.com> In-Reply-To: <1353658123.36385.YahooMailNeo@web141101.mail.bf1.yahoo.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 11/23/2012 04:08 PM, metin d wrote: > ----- Original Message ----- > > From: Jaegeuk Hanse > To: metin d > Cc: Jan Kara ; "linux-kernel@vger.kernel.org" ; linux-mm@kvack.org > Sent: Friday, November 23, 2012 3:58 AM > Subject: Re: Problem in Page Cache Replacement > > On 11/21/2012 02:25 AM, Jan Kara wrote: >> On Tue 20-11-12 09:42:42, metin d wrote: >>> I have two PostgreSQL databases named data-1 and data-2 that sit on the >>> same machine. Both databases keep 40 GB of data, and the total memory >>> available on the machine is 68GB. 
>>> >>> I started data-1 and data-2, and ran several queries to go over all their >>> data. Then, I shut down data-1 and kept issuing queries against data-2. >>> For some reason, the OS still holds on to large parts of data-1's pages >>> in its page cache, and reserves about 35 GB of RAM to data-2's files. As >>> a result, my queries on data-2 keep hitting disk. >>> >>> I'm checking page cache usage with fincore. When I run a table scan query >>> against data-2, I see that data-2's pages get evicted and put back into >>> the cache in a round-robin manner. Nothing happens to data-1's pages, >>> although they haven't been touched for days. >> Hi metin d, >> fincore is a tool or ...? How could I get it? >> Regards, >> Jaegeuk > > Hi Jaegeuk, > > Yes, it is a tool, you get it from here : > http://code.google.com/p/linux-ftools/ Hi Metin, Could you give me a link to download it? I can't get it from the link you give me. Thanks in advance. :-) Regards, Jaegeuk > > > Regards, > Metin >>> Does anybody know why data-1's pages aren't evicted from the page cache? >>> I'm open to all kind of suggestions you think it might relate to problem. >> Curious. Added linux-mm list to CC to catch more attention. If you run >> echo 1 >/proc/sys/vm/drop_caches >> does it evict data-1 pages from memory? >> >>> This is an EC2 m2.4xlarge instance on Amazon with 68 GB of RAM and no >>> swap space. The kernel version is: >>> >>> $ uname -r >>> 3.2.28-45.62.amzn1.x86_64 >>> Edit: >>> >>> and it seems that I use one NUMA instance, if you think that it can a problem. 
>>> >>> $ numactl --hardware >>> available: 1 nodes (0) >>> node 0 cpus: 0 1 2 3 4 5 6 7 >>> node 0 size: 70007 MB >>> node 0 free: 360 MB >>> node distances: >>> node 0 >>> 0: 10 >> Honza From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757945Ab2KWIZT (ORCPT ); Fri, 23 Nov 2012 03:25:19 -0500 Received: from nm2-vm0.bullet.mail.bf1.yahoo.com ([98.139.213.127]:37648 "EHLO nm2-vm0.bullet.mail.bf1.yahoo.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754013Ab2KWIZR convert rfc822-to-8bit (ORCPT ); Fri, 23 Nov 2012 03:25:17 -0500 X-Mailer: YahooMailWebService/0.8.123.460 References: <1353433362.85184.YahooMailNeo@web141101.mail.bf1.yahoo.com> <20121120182500.GH1408@quack.suse.cz> <50AED854.7080300@gmail.com> <1353658123.36385.YahooMailNeo@web141101.mail.bf1.yahoo.com> <50AF3134.3090803@gmail.com> Message-ID: <1353659115.24777.YahooMailNeo@web141102.mail.bf1.yahoo.com> Date: Fri, 23 Nov 2012 00:25:15 -0800 (PST) From: metin d Reply-To: metin d Subject: Re: Problem in Page Cache Replacement To: Jaegeuk Hanse Cc: Jan Kara , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" In-Reply-To: <50AF3134.3090803@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Transfer-Encoding: 8BIT Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org ----- Original Message ----- From: Jaegeuk Hanse To: metin d Cc: Jan Kara ; "linux-kernel@vger.kernel.org" ; "linux-mm@kvack.org" Sent: Friday, November 23, 2012 10:17 AM Subject: Re: Problem in Page Cache Replacement On 11/23/2012 04:08 PM, metin d wrote: > ----- Original Message ----- > > From: Jaegeuk Hanse > To: metin d > Cc: Jan Kara ; "linux-kernel@vger.kernel.org" ; linux-mm@kvack.org > Sent: Friday, November 23, 2012 3:58 AM > Subject: Re: Problem in Page Cache Replacement > > On 11/21/2012 02:25 AM, Jan Kara wrote: >> On Tue 20-11-12 09:42:42, metin d wrote: >>> I have two PostgreSQL databases named data-1 and data-2 that sit on the >>> same machine. Both databases keep 40 GB of data, and the total memory >>> available on the machine is 68GB.
>>> >>> I started data-1 and data-2, and ran several queries to go over all their >>> data. Then, I shut down data-1 and kept issuing queries against data-2. >>> For some reason, the OS still holds on to large parts of data-1's pages >>> in its page cache, and reserves about 35 GB of RAM to data-2's files. As >>> a result, my queries on data-2 keep hitting disk. >>> >>> I'm checking page cache usage with fincore. When I run a table scan query >>> against data-2, I see that data-2's pages get evicted and put back into >>> the cache in a round-robin manner. Nothing happens to data-1's pages, >>> although they haven't been touched for days. >> Hi metin d, >> fincore is a tool or ...? How could I get it? >> Regards, >> Jaegeuk > > Hi Jaegeuk, > > Yes, it is a tool, you get it from here : > http://code.google.com/p/linux-ftools/ > Hi Metin, > Could you give me a link to download it? I can't get it from the link > you give me. Thanks in advance. :-) > Regards, > Jaegeuk Hi Jaegeuk, You may need to install mercurial on your system, I'm able to download source code with this command: hg clone https://code.google.com/p/linux-ftools/ Regards, Metin > > > Regards, > Metin >>> Does anybody know why data-1's pages aren't evicted from the page cache? >>> I'm open to all kind of suggestions you think it might relate to problem. >>      Curious. Added linux-mm list to CC to catch more attention. If you run >> echo 1 >/proc/sys/vm/drop_caches >>      does it evict data-1 pages from memory? >> >>> This is an EC2 m2.4xlarge instance on Amazon with 68 GB of RAM and no >>> swap space. The kernel version is: >>> >>> $ uname -r >>> 3.2.28-45.62.amzn1.x86_64 >>> Edit: >>> >>> and it seems that I use one NUMA instance, if  you think that it can a problem. 
>>> >>> $ numactl --hardware >>> available: 1 nodes (0) >>> node 0 cpus: 0 1 2 3 4 5 6 7 >>> node 0 size: 70007 MB >>> node 0 free: 360 MB >>> node distances: >>> node  0 >>>      0:  10 >>                                  Honza From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751807Ab2KXPGc (ORCPT ); Sat, 24 Nov 2012 10:06:32 -0500 Received: from mail-qa0-f53.google.com ([209.85.216.53]:41920 "EHLO mail-qa0-f53.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751536Ab2KXPGa convert rfc822-to-8bit (ORCPT ); Sat, 24 Nov 2012 10:06:30 -0500 MIME-Version: 1.0 In-Reply-To: <20121122154107.GB11736@localhost> References: <1353433362.85184.YahooMailNeo@web141101.mail.bf1.yahoo.com> <20121120182500.GH1408@quack.suse.cz> <1353485020.53500.YahooMailNeo@web141104.mail.bf1.yahoo.com> <1353485630.17455.YahooMailNeo@web141106.mail.bf1.yahoo.com> <50AC9220.70202@gmail.com> <20121121090204.GA9064@localhost> <50ACA209.9000101@gmail.com> <1353491880.11679.YahooMailNeo@web141102.mail.bf1.yahoo.com> <50ACA634.5000007@gmail.com> <20121122154107.GB11736@localhost> From: =?UTF-8?B?TWV0aW4gRMO2xZ9sw7w=?= Date: Sat, 24 Nov 2012 17:06:09 +0200 Message-ID: Subject: Re: Problem in Page Cache Replacement To: Fengguang Wu Cc: Jaegeuk Hanse , Jan Kara , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8BIT Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Nov 22, 2012 at 5:41 PM, Fengguang Wu wrote: > On Wed, Nov 21, 2012 at 12:07:22PM +0200, Metin Döşlü wrote: >> On Wed, Nov 21, 2012 at 12:00 PM, Jaegeuk Hanse wrote: >> > >> > On 11/21/2012 05:58 PM, metin d wrote: >> > >> > Hi Fengguang, >> > >> > I run tests and attached the results. The line below I guess shows the data-1 page caches. 
>> > >> > 0x000000080000006c 6584051 25718 __RU_lA___________________P________ referenced,uptodate,lru,active,private >> > >> > >> > I think this is just one state of page cache pages. >> >> But why are these page caches in this state as opposed to other page >> caches? From the results I conclude that: >> >> data-1 pages are in state : referenced,uptodate,lru,active,private > > I wonder if it's this code that stops data-1 pages from being > reclaimed: > > shrink_page_list(): > > if (page_has_private(page)) { > if (!try_to_release_page(page, sc->gfp_mask)) > goto activate_locked; > > What's the filesystem used? It was ext3. >> data-2 pages are in state : referenced,uptodate,lru,mappedtodisk > > Thanks, > Fengguang -- Metin Döşlü From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753405Ab2KYUJk (ORCPT ); Sun, 25 Nov 2012 15:09:40 -0500 Received: from mx1.redhat.com ([209.132.183.28]:50883 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753369Ab2KYUJj (ORCPT ); Sun, 25 Nov 2012 15:09:39 -0500 Message-ID: <50B27AD1.6010703@redhat.com> Date: Sun, 25 Nov 2012 15:08:49 -0500 From: Rik van Riel User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:15.0) Gecko/20120827 Thunderbird/15.0 MIME-Version: 1.0 To: Fengguang Wu CC: =?UTF-8?B?TWV0aW4gRMO2xZ9sw7w=?= , Jaegeuk Hanse , Jan Kara , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , Johannes Weiner Subject: Re: Problem in Page Cache Replacement References: <20121120182500.GH1408@quack.suse.cz> <1353485020.53500.YahooMailNeo@web141104.mail.bf1.yahoo.com> <1353485630.17455.YahooMailNeo@web141106.mail.bf1.yahoo.com> <50AC9220.70202@gmail.com> <20121121090204.GA9064@localhost> <50ACA209.9000101@gmail.com> <1353491880.11679.YahooMailNeo@web141102.mail.bf1.yahoo.com> <50ACA634.5000007@gmail.com> <20121122154107.GB11736@localhost> <20121122155318.GA12636@localhost> In-Reply-To: <20121122155318.GA12636@localhost> 
Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 11/22/2012 10:53 AM, Fengguang Wu wrote: > Ah it's more likely caused by this logic: > > if (is_active_lru(lru)) { > if (inactive_list_is_low(mz, file)) > shrink_active_list(nr_to_scan, mz, sc, priority, file); > > The active file list won't be scanned at all if it's smaller than the > inactive list. In this case, it's inactive=33586MB > active=25719MB. So > the data-1 pages in the active list will never be scanned and reclaimed. That's it, indeed. The reason we have that code is that otherwise one large streaming IO could easily end up evicting the entire page cache working set. Usually it works well, because the new page cache working set tends to get touched twice while on the inactive list, and the old working set gets demoted from the active list. Only in a few very specific cases, where the inter-reference distance of the new working set is larger than the size of the inactive list, does it fail. Something like Johannes's patches should solve the problem. -- All rights reversed
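For readers following the thread, the check discussed above can be modelled in a few lines of userspace C. This is only a rough sketch, not the kernel's code: struct file_lru and active_list_gets_scanned() are hypothetical names made up for illustration, sizes are in MB rather than pages, and the comparison assumes the file-LRU rule described in this thread (the active list is scanned only while it is larger than the inactive list).

```c
#include <stdbool.h>

/*
 * Hypothetical stand-in for the file LRU sizes of one zone.
 * The kernel tracks these in pages; MB is used here only to
 * match the figures quoted in the thread.
 */
struct file_lru {
	unsigned long active_mb;
	unsigned long inactive_mb;
};

/*
 * Simplified model of the check: the inactive file list counts
 * as "low" only when the active file list is larger than it.
 */
static bool inactive_file_is_low(const struct file_lru *lru)
{
	return lru->active_mb > lru->inactive_mb;
}

/*
 * The active list is scanned (and its pages become candidates
 * for demotion) only while the inactive list is low.
 */
static bool active_list_gets_scanned(const struct file_lru *lru)
{
	return inactive_file_is_low(lru);
}
```

With the numbers reported here (active=25719MB, inactive=33586MB) the sketch returns false, so under this model the data-1 pages sitting on the active list are never visited by reclaim, while data-2's pages cycle through the inactive list.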