* Re: mmap vs fs cache
From: Jan Kara @ 2013-03-07 15:43 UTC
To: Howard Chu; +Cc: linux-kernel, linux-mm

  Added mm list to CC.

On Tue 05-03-13 09:57:34, Howard Chu wrote:
> I'm testing our memory-mapped database code on a small VM. The machine has
> 32GB of RAM and the size of the DB on disk is ~44GB. The database library
> mmaps the entire file as a single region and starts accessing it as a tree
> of B+trees. Running on an Ubuntu 3.5.0-23 kernel, XFS on a local disk.
>
> If I start running read-only queries against the DB with a freshly started
> server, I see that my process (OpenLDAP slapd) quickly grows to an RSS of
> about 16GB in tandem with the FS cache. (I.e., "top" shows 16GB cached, and
> slapd is 16GB.) If I confine my queries to the first 20% of the data then
> it all fits in RAM and queries are nice and fast.
>
> If I extend the query range to cover more of the data, approaching the
> size of physical RAM, I see something strange - the FS cache keeps growing,
> but the slapd process size grows at a slower rate. This is rather puzzling
> to me since the only thing triggering reads is accesses through the mmap
> region. Eventually the FS cache grows to basically all of the 32GB of RAM
> (+/- some text/data space...) but the slapd process only reaches 25GB, at
> which point it actually starts to shrink - apparently the FS cache is now
> stealing pages from it. I find that a bit puzzling; if the pages are
> present in memory, and the only reason they were paged in was to satisfy
> an mmap reference, why aren't they simply assigned to the slapd process?
>
> The current behavior gets even more aggravating: I can run a test that
> spans exactly 30GB of the data. One would expect that the slapd process
> should simply grow to 30GB in size, and then remain static for the
> remainder of the test. Instead, the server grows to 25GB, the FS cache
> grows to 32GB and starts stealing pages from the server, shrinking it back
> down to 19GB or so.
>
> If I do an "echo 1 > /proc/sys/vm/drop_caches" at the onset of this
> condition, the FS cache shrinks back to 25GB, matching the slapd process
> size. This then frees up enough RAM for slapd to grow further. If I don't
> do this, the test is constantly paging in data from disk. Even so, the FS
> cache continues to grow faster than the slapd process size, so the system
> may run out of free RAM again, and I have to drop caches multiple times
> before slapd finally grows to the full 30GB. Once it gets to that size the
> test runs entirely from RAM with zero I/Os, but it doesn't get there
> without a lot of babysitting.
>
> 2 questions:
> 1) Why is there data in the FS cache that isn't owned by (the mmap of) the
>    process that caused it to be paged in in the first place?
> 2) Is there a tunable knob to discourage the page cache from stealing from
>    the process?
>
> --
>   -- Howard Chu
>   CTO, Symas Corp.           http://www.symas.com
>   Director, Highland Sun     http://highlandsun.com/hyc/
>   Chief Architect, OpenLDAP  http://www.openldap.org/project/

--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
* Re: mmap vs fs cache
From: Johannes Weiner @ 2013-03-08 2:08 UTC
To: Jan Kara; +Cc: Howard Chu, linux-kernel, linux-mm

On Thu, Mar 07, 2013 at 04:43:12PM +0100, Jan Kara wrote:
> On Tue 05-03-13 09:57:34, Howard Chu wrote:
> > 2 questions:
> > 1) Why is there data in the FS cache that isn't owned by (the mmap of)
> >    the process that caused it to be paged in in the first place?

The filesystem cache is shared among processes because the filesystem is
also shared among processes. If another task were to access the same file,
we should still have only one copy of that data in memory.

It sounds to me like slapd is itself caching all the data it reads. If that
is true, shouldn't it really be using direct IO to prevent this double
buffering of filesystem data in memory?

> > 2) Is there a tunable knob to discourage the page cache from stealing
> >    from the process?

Try reducing /proc/sys/vm/swappiness, which ranges from 0-100 and defaults
to 60.
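For reference, the "direct IO" being suggested here means opening the file
with O_DIRECT so reads bypass the page cache entirely. A minimal sketch
follows; the file name, block size, and alignment are illustrative
assumptions (O_DIRECT has filesystem- and device-specific alignment rules),
and this is not how slapd actually works, as Howard clarifies below.

    /* build with: cc -D_GNU_SOURCE -o odirect odirect.c */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        /* O_DIRECT transfers data straight between disk and the user
         * buffer, so the same data is not also held in the page cache. */
        int fd = open("data.mdb", O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        /* O_DIRECT buffers must be suitably aligned (often 512 or 4096
         * bytes); posix_memalign() gives us that. */
        void *buf;
        if (posix_memalign(&buf, 4096, 4096) != 0) { close(fd); return 1; }

        ssize_t n = pread(fd, buf, 4096, 0);   /* read one aligned block */
        if (n < 0)
            perror("pread");
        else
            printf("read %zd bytes\n", n);

        free(buf);
        close(fd);
        return 0;
    }

As Howard notes in his reply, this suggestion does not apply to his case:
slapd keeps no cache of its own and relies entirely on the shared mmap, so
there is no double buffering to eliminate.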
* Re: mmap vs fs cache
From: Howard Chu @ 2013-03-08 7:46 UTC
To: Johannes Weiner, Jan Kara; +Cc: linux-kernel, linux-mm

Johannes Weiner wrote:
> On Thu, Mar 07, 2013 at 04:43:12PM +0100, Jan Kara wrote:
>>> 2 questions:
>>> 1) Why is there data in the FS cache that isn't owned by (the mmap of)
>>>    the process that caused it to be paged in in the first place?
>
> The filesystem cache is shared among processes because the filesystem
> is also shared among processes. If another task were to access the
> same file, we should still have only one copy of that data in memory.

That's irrelevant to the question. As I already explained, the first 16GB
that was paged in didn't behave this way. Perhaps "owned" was the wrong
word, since this is a MAP_SHARED mapping. But the point is that the memory
is not being accounted in slapd's process size, when it was before, up to
16GB.

> It sounds to me like slapd is itself caching all the data it reads.

You're misreading the information then. slapd is doing no caching of its
own; its RSS and SHR memory sizes are the same. All it is using is the
mmap, nothing else. RSS == SHR == FS cache, up to 16GB. RSS is always ==
SHR, but above 16GB they grow more slowly than the FS cache.

> If that is true, shouldn't it really be using direct IO to prevent
> this double buffering of filesystem data in memory?

There is no double buffering.

>>> 2) Is there a tunable knob to discourage the page cache from stealing
>>>    from the process?
>
> Try reducing /proc/sys/vm/swappiness, which ranges from 0-100 and
> defaults to 60.

I've already tried setting it to 0, with no effect.

--
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/
* Re: mmap vs fs cache
From: Kirill A. Shutemov @ 2013-03-08 8:42 UTC
To: Howard Chu; +Cc: Johannes Weiner, Jan Kara, linux-kernel, linux-mm

On Thu, Mar 07, 2013 at 11:46:39PM -0800, Howard Chu wrote:
> You're misreading the information then. slapd is doing no caching of its
> own; its RSS and SHR memory sizes are the same. All it is using is the
> mmap, nothing else. RSS == SHR == FS cache, up to 16GB. RSS is always ==
> SHR, but above 16GB they grow more slowly than the FS cache.

That only means that some pages got unmapped from your process. It can
happen, for instance, due to page migration. There's nothing to worry
about: the page will be mapped back on the next page fault, and it's only a
minor fault since the page is in the page cache anyway.

--
 Kirill A. Shutemov
* Re: mmap vs fs cache
From: Howard Chu @ 2013-03-08 9:40 UTC
To: Kirill A. Shutemov; +Cc: Johannes Weiner, Jan Kara, linux-kernel, linux-mm

Kirill A. Shutemov wrote:
> On Thu, Mar 07, 2013 at 11:46:39PM -0800, Howard Chu wrote:
>> You're misreading the information then. slapd is doing no caching of its
>> own; its RSS and SHR memory sizes are the same. All it is using is the
>> mmap, nothing else. RSS == SHR == FS cache, up to 16GB. RSS is always ==
>> SHR, but above 16GB they grow more slowly than the FS cache.
>
> That only means that some pages got unmapped from your process. It can
> happen, for instance, due to page migration. There's nothing to worry
> about: the page will be mapped back on the next page fault, and it's only
> a minor fault since the page is in the page cache anyway.

Unfortunately there *is* something to worry about. As I said already - when
the test spans 30GB, the FS cache fills up the rest of RAM and the test is
doing a lot of real I/O even though it shouldn't need to. Please read the
entire original post before replying.

There is no way that a process that is accessing only 30GB of a mmap should
be able to fill up 32GB of RAM. There's nothing else running on the
machine; I've killed or suspended everything else in userland besides a
couple of shells running top and vmstat. When I manually drop_caches
repeatedly, then eventually slapd RSS/SHR grows to 30GB and the physical
I/O stops.

--
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/
* Re: mmap vs fs cache
From: Chris Friesen @ 2013-03-08 14:47 UTC
To: Howard Chu; +Cc: Kirill A. Shutemov, Johannes Weiner, Jan Kara, linux-kernel, linux-mm

On 03/08/2013 03:40 AM, Howard Chu wrote:
> There is no way that a process that is accessing only 30GB of a mmap
> should be able to fill up 32GB of RAM. There's nothing else running on
> the machine; I've killed or suspended everything else in userland besides
> a couple of shells running top and vmstat. When I manually drop_caches
> repeatedly, then eventually slapd RSS/SHR grows to 30GB and the physical
> I/O stops.

Is it possible that the kernel is doing some sort of automatic readahead,
but it ends up reading pages corresponding to data that isn't ever queried
and so doesn't get mapped by the application?

Chris
* Re: mmap vs fs cache
From: Howard Chu @ 2013-03-08 15:00 UTC
To: Chris Friesen; +Cc: Kirill A. Shutemov, Johannes Weiner, Jan Kara, linux-kernel, linux-mm

Chris Friesen wrote:
> On 03/08/2013 03:40 AM, Howard Chu wrote:
>> There is no way that a process that is accessing only 30GB of a mmap
>> should be able to fill up 32GB of RAM. There's nothing else running on
>> the machine; I've killed or suspended everything else in userland
>> besides a couple of shells running top and vmstat. When I manually
>> drop_caches repeatedly, then eventually slapd RSS/SHR grows to 30GB and
>> the physical I/O stops.
>
> Is it possible that the kernel is doing some sort of automatic readahead,
> but it ends up reading pages corresponding to data that isn't ever
> queried and so doesn't get mapped by the application?

Yes, that's what I was thinking. I added a
posix_madvise(..POSIX_MADV_RANDOM) call, but that had no effect on the
test.

First obvious conclusion - kswapd is being too aggressive. When free memory
hits the low watermark, the reclaim shrinks slapd down from 25GB to
18-19GB, while the page cache still contains ~7GB of unmapped pages.
Ideally I'd like a tuning knob so I can say to keep no more than 2GB of
unmapped pages in the cache. (And the desired effect of that would be to
allow user processes to grow to 30GB total, in this case.)

I mentioned this "unmapped page cache control" post already,
http://lwn.net/Articles/436010/, but it seems that the idea was ultimately
rejected. Is there anything else similar in current kernels?

--
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/
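For context, a minimal sketch of applying POSIX_MADV_RANDOM across an
entire file mapping, in the spirit of what is being described here. The
file name and error handling are illustrative assumptions, not taken from
the slapd/LMDB code. Note that the advice only affects the address range
actually passed in - which, as it turns out at the end of the thread, was
exactly where the problem lay.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("data.mdb", O_RDONLY);       /* illustrative name */
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        /* Map the whole file read-only, shared with the page cache. */
        void *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (map == MAP_FAILED) { perror("mmap"); return 1; }

        /* Declare a random access pattern so the kernel can skip
         * readahead.  The address/length must cover the *entire* region
         * of interest; advising only part of it leaves readahead active
         * on the rest. */
        int rc = posix_madvise(map, st.st_size, POSIX_MADV_RANDOM);
        if (rc != 0)
            fprintf(stderr, "posix_madvise failed: %d\n", rc);

        /* ... B+tree accesses through 'map' would go here ... */

        munmap(map, st.st_size);
        close(fd);
        return 0;
    }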
* Re: mmap vs fs cache
From: Chris Friesen @ 2013-03-08 15:25 UTC
To: Howard Chu; +Cc: Kirill A. Shutemov, Johannes Weiner, Jan Kara, linux-kernel, linux-mm

On 03/08/2013 09:00 AM, Howard Chu wrote:
> First obvious conclusion - kswapd is being too aggressive. When free
> memory hits the low watermark, the reclaim shrinks slapd down from 25GB
> to 18-19GB, while the page cache still contains ~7GB of unmapped pages.
> Ideally I'd like a tuning knob so I can say to keep no more than 2GB of
> unmapped pages in the cache. (And the desired effect of that would be to
> allow user processes to grow to 30GB total, in this case.)
>
> I mentioned this "unmapped page cache control" post already,
> http://lwn.net/Articles/436010/, but it seems that the idea was
> ultimately rejected. Is there anything else similar in current kernels?

Sorry, I'm not aware of anything. I'm not a filesystem/vm guy though, so
maybe there's something I don't know about. I would have expected both
posix_madvise(..POSIX_MADV_RANDOM) and swappiness to help, but it doesn't
sound like they're working.

Chris
* Re: mmap vs fs cache
From: Johannes Weiner @ 2013-03-08 16:16 UTC
To: Howard Chu; +Cc: Chris Friesen, Kirill A. Shutemov, Jan Kara, Mel Gorman, Rik van Riel, linux-kernel, linux-mm

On Fri, Mar 08, 2013 at 07:00:55AM -0800, Howard Chu wrote:
> Chris Friesen wrote:
> > Is it possible that the kernel is doing some sort of automatic
> > readahead, but it ends up reading pages corresponding to data that
> > isn't ever queried and so doesn't get mapped by the application?
>
> Yes, that's what I was thinking. I added a
> posix_madvise(..POSIX_MADV_RANDOM) call, but that had no effect on the
> test.
>
> First obvious conclusion - kswapd is being too aggressive. When free
> memory hits the low watermark, the reclaim shrinks slapd down from 25GB
> to 18-19GB, while the page cache still contains ~7GB of unmapped pages.
> Ideally I'd like a tuning knob so I can say to keep no more than 2GB of
> unmapped pages in the cache. (And the desired effect of that would be to
> allow user processes to grow to 30GB total, in this case.)

We should find out where the unmapped page cache is coming from, if you are
only accessing mapped file cache and have disabled readahead.

How do you arrive at this number of unmapped page cache?

What could happen is that previously used and activated pages do not get
evicted anymore since there is a constant supply of younger reclaimable
cache that is actually thrashing. Whenever you drop the caches, you get rid
of those stale active pages and allow the previously thrashing cache to get
activated. However, that would require that there is already a significant
amount of active file pages before your workload starts (check the
nr_active_file number in /proc/vmstat before launching slapd; try
sync; echo 3 >drop_caches before launching to eliminate this option), OR
that the set of pages accessed during your workload changes and the
combined set of pages accessed by your workload is bigger than available
memory -- which you claimed would not happen because you only access the
30GB file area on that system.
* Re: mmap vs fs cache
From: Howard Chu @ 2013-03-08 20:04 UTC
To: Johannes Weiner; +Cc: Chris Friesen, Kirill A. Shutemov, Jan Kara, Mel Gorman, Rik van Riel, linux-kernel, linux-mm

Johannes Weiner wrote:
> On Fri, Mar 08, 2013 at 07:00:55AM -0800, Howard Chu wrote:
>> First obvious conclusion - kswapd is being too aggressive. When free
>> memory hits the low watermark, the reclaim shrinks slapd down from 25GB
>> to 18-19GB, while the page cache still contains ~7GB of unmapped pages.
>> Ideally I'd like a tuning knob so I can say to keep no more than 2GB of
>> unmapped pages in the cache. (And the desired effect of that would be to
>> allow user processes to grow to 30GB total, in this case.)
>
> We should find out where the unmapped page cache is coming from, if you
> are only accessing mapped file cache and have disabled readahead.
>
> How do you arrive at this number of unmapped page cache?

This number is pretty obvious. When slapd has grown to 25GB, the page cache
has grown to 32GB (less about 200MB, the minfree). So: 7GB unmapped in the
cache.

> What could happen is that previously used and activated pages do not get
> evicted anymore since there is a constant supply of younger reclaimable
> cache that is actually thrashing. Whenever you drop the caches, you get
> rid of those stale active pages and allow the previously thrashing cache
> to get activated. However, that would require that there is already a
> significant amount of active file pages before your workload starts
> (check the nr_active_file number in /proc/vmstat before launching slapd;
> try sync; echo 3 >drop_caches before launching to eliminate this option),
> OR that the set of pages accessed during your workload changes and the
> combined set of pages accessed by your workload is bigger than available
> memory -- which you claimed would not happen because you only access the
> 30GB file area on that system.

There are no other active pages before the test begins. There's nothing
else running. Caches have been dropped completely at the beginning.

The test clearly is accessing only 30GB of data. Once slapd reaches this
process size, the test can be stopped and restarted any number of times,
run for any number of hours continuously, and memory use on the system is
unchanged, and no pageins occur.

--
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/
* Re: mmap vs fs cache
From: Jan Kara @ 2013-03-11 12:04 UTC
To: Howard Chu; +Cc: Johannes Weiner, Chris Friesen, Kirill A. Shutemov, Jan Kara, Mel Gorman, Rik van Riel, linux-kernel, linux-mm

On Fri 08-03-13 12:04:46, Howard Chu wrote:
> Johannes Weiner wrote:
> > How do you arrive at this number of unmapped page cache?
>
> This number is pretty obvious. When slapd has grown to 25GB, the

  This 25G is presumably from /proc/pid/statm, right?

> page cache has grown to 32GB (less about 200MB, the minfree). So:

  And this value is from where? /proc/meminfo - the Cached line?

> 7GB unmapped in the cache.
>
> There are no other active pages before the test begins. There's nothing
> else running. Caches have been dropped completely at the beginning.
>
> The test clearly is accessing only 30GB of data. Once slapd reaches this
> process size, the test can be stopped and restarted any number of times,
> run for any number of hours continuously, and memory use on the system is
> unchanged, and no pageins occur.

  Interesting. It might be worth trying what happens if you do
madvise(..., MADV_DONTNEED) on the data file instead of dropping caches
with /proc/sys/vm/drop_caches. That way we can establish whether the extra
cached data is in the data file (things will look the same way as with
drop_caches) or somewhere else (there will still be unmapped page cache).

                                                        Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
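A minimal sketch of the experiment Jan is proposing - advising away only
this mapping's pages rather than dropping all caches globally. The 'map'
pointer and length are assumed to come from the mmap() sketch shown
earlier; whether and how much cached data this actually releases is exactly
what the experiment is meant to establish.

    #include <stddef.h>
    #include <stdio.h>
    #include <sys/mman.h>

    /* Tell the kernel this process no longer needs the given range of a
     * file-backed, MAP_SHARED mapping.  The pages are unmapped from the
     * process (becoming easy reclaim targets); later accesses fault the
     * data back in through the page cache. */
    static int drop_mapping(void *map, size_t len)
    {
        if (madvise(map, len, MADV_DONTNEED) != 0) {
            perror("madvise(MADV_DONTNEED)");
            return -1;
        }
        return 0;
    }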
* Re: mmap vs fs cache
From: Howard Chu @ 2013-03-11 12:40 UTC
To: Jan Kara; +Cc: Johannes Weiner, Chris Friesen, Kirill A. Shutemov, Mel Gorman, Rik van Riel, linux-kernel, linux-mm

Jan Kara wrote:
> On Fri 08-03-13 12:04:46, Howard Chu wrote:
>> The test clearly is accessing only 30GB of data. Once slapd reaches this
>> process size, the test can be stopped and restarted any number of times,
>> run for any number of hours continuously, and memory use on the system
>> is unchanged, and no pageins occur.
>
>   Interesting. It might be worth trying what happens if you do
> madvise(..., MADV_DONTNEED) on the data file instead of dropping caches
> with /proc/sys/vm/drop_caches. That way we can establish whether the
> extra cached data is in the data file (things will look the same way as
> with drop_caches) or somewhere else (there will still be unmapped page
> cache).

I screwed up. My madvise(RANDOM) call used the wrong address/len, so it
didn't cover the whole region. After fixing this, the test now runs as
expected - the slapd process size grows to 30GB without any problem.

Sorry for the noise.

--
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/
* Re: mmap vs fs cache
From: Ric Mason @ 2013-03-09 3:28 UTC
To: Johannes Weiner; +Cc: Howard Chu, Chris Friesen, Kirill A. Shutemov, Jan Kara, Mel Gorman, Rik van Riel, linux-kernel, linux-mm

Hi Johannes,

On 03/09/2013 12:16 AM, Johannes Weiner wrote:
> What could happen is that previously used and activated pages do not get
> evicted anymore since there is a constant supply of younger

If a user process exits, are its file pages and anonymous pages freed
immediately, or do they go through page reclaim?

> reclaimable cache that is actually thrashing. Whenever you drop the
> caches, you get rid of those stale active pages and allow the previously
> thrashing cache to get activated. However, that would require that there
> is already a significant amount of active file

Why do you emphasize a *significant* amount of active file pages?

> pages before your workload starts (check the nr_active_file number in
> /proc/vmstat before launching slapd; try sync; echo 3 >drop_caches
> before launching to eliminate this option), OR that the set of pages
> accessed during your workload changes and the combined set of pages
> accessed by your workload is bigger than available memory -- which you
> claimed would not happen because you only access the 30GB file area on
> that system.
* Re: mmap vs fs cache
From: Phillip Susi @ 2013-03-09 1:22 UTC
To: Howard Chu; +Cc: Chris Friesen, Kirill A. Shutemov, Johannes Weiner, Jan Kara, linux-kernel, linux-mm

On 03/08/2013 10:00 AM, Howard Chu wrote:
> Yes, that's what I was thinking. I added a
> posix_madvise(..POSIX_MADV_RANDOM) call, but that had no effect on the
> test.

Yep, that's because it isn't implemented. You might try MADV_WILLNEED to
schedule it to be read in first. I believe that will only read in the
requested page, without additional readahead, and then when you fault on
the page, it already has IO scheduled, so the extra readahead will also be
skipped.
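A sketch of the MADV_WILLNEED suggestion - hinting a range of the mapping
before touching it so the kernel can start bringing it into the page cache.
The 'map' pointer is assumed to come from the earlier mmap() sketch, and
the offset/length are purely illustrative. (Note that Phillip's claim that
POSIX_MADV_RANDOM is unimplemented is corrected by Jan in the next
message.)

    #include <stddef.h>
    #include <stdio.h>
    #include <sys/mman.h>

    /* Hint that a range of the mapping will be needed soon.  'offset'
     * should be page-aligned; madvise() rejects unaligned addresses. */
    static void prefetch_range(void *map, size_t offset, size_t len)
    {
        unsigned char *start = (unsigned char *)map + offset;
        if (madvise(start, len, MADV_WILLNEED) != 0)
            perror("madvise(MADV_WILLNEED)");
    }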
* Re: mmap vs fs cache
From: Jan Kara @ 2013-03-11 11:52 UTC
To: Phillip Susi; +Cc: Howard Chu, Chris Friesen, Kirill A. Shutemov, Johannes Weiner, Jan Kara, linux-kernel, linux-mm

On Fri 08-03-13 20:22:19, Phillip Susi wrote:
> On 03/08/2013 10:00 AM, Howard Chu wrote:
> > Yes, that's what I was thinking. I added a
> > posix_madvise(..POSIX_MADV_RANDOM) call, but that had no effect on the
> > test.
>
> Yep, that's because it isn't implemented.

  Why do you think so? AFAICS it is implemented by setting the VM_RAND_READ
flag in the VMA, and do_async_mmap_readahead() and do_sync_mmap_readahead()
check for the flag and don't do anything if it is set...

                                                        Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
* Re: mmap vs fs cache
From: Phillip Susi @ 2013-03-11 15:03 UTC
To: Jan Kara; +Cc: Howard Chu, Chris Friesen, Kirill A. Shutemov, Johannes Weiner, linux-kernel, linux-mm

On 3/11/2013 7:52 AM, Jan Kara wrote:
> > Yep, that's because it isn't implemented.
>
>   Why do you think so? AFAICS it is implemented by setting the
> VM_RAND_READ flag in the VMA, and do_async_mmap_readahead() and
> do_sync_mmap_readahead() check for the flag and don't do anything if it
> is set...

Oh, I don't know how I missed that... I was just looking for it the other
day and couldn't find any references to VM_RandomReadHint, so I assumed it
hadn't been implemented.
* Re: mmap vs fs cache
From: Ric Mason @ 2013-03-09 2:34 UTC
To: Johannes Weiner; +Cc: Jan Kara, Howard Chu, linux-kernel, linux-mm

Hi Johannes,

On 03/08/2013 10:08 AM, Johannes Weiner wrote:
> It sounds to me like slapd is itself caching all the data it reads.
> If that is true, shouldn't it really be using direct IO to prevent
> this double buffering of filesystem data in memory?

When is using direct IO better, and when is using the page cache better?

>>> 2) Is there a tunable knob to discourage the page cache from stealing
>>>    from the process?
>
> Try reducing /proc/sys/vm/swappiness, which ranges from 0-100 and
> defaults to 60.

Why reduce it? IIUC, swappiness determines how aggressively anonymous pages
are reclaimed: if the value is high, more anonymous pages will be
reclaimed.