* Re: mmap vs fs cache
[not found] <5136320E.8030109@symas.com>
@ 2013-03-07 15:43 ` Jan Kara
2013-03-08 2:08 ` Johannes Weiner
0 siblings, 1 reply; 17+ messages in thread
From: Jan Kara @ 2013-03-07 15:43 UTC (permalink / raw)
To: Howard Chu; +Cc: linux-kernel, linux-mm
Added mm list to CC.
On Tue 05-03-13 09:57:34, Howard Chu wrote:
> I'm testing our memory-mapped database code on a small VM. The
> machine has 32GB of RAM and the size of the DB on disk is ~44GB. The
> database library mmaps the entire file as a single region and starts
> accessing it as a tree of B+trees. Running on an Ubuntu 3.5.0-23
> kernel, XFS on a local disk.
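  (Illustration: a minimal sketch of the kind of mapping described above,
  i.e. one large read-only MAP_SHARED region backed by the database file.
  The file name, error handling and the tree-walk placeholder are
  hypothetical; this is not the actual LMDB code.)

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("data.mdb", O_RDONLY);      /* hypothetical DB file */
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        /* Map the whole ~44GB file as one read-only shared region; pages
         * are faulted in on demand and live in the shared page cache. */
        void *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (map == MAP_FAILED) { perror("mmap"); return 1; }

        /* ... walk the B+tree structures through 'map' here ... */

        munmap(map, st.st_size);
        close(fd);
        return 0;
    }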
>
> If I start running read-only queries against the DB with a freshly
> started server, I see that my process (OpenLDAP slapd) quickly grows
> to an RSS of about 16GB in tandem with the FS cache. (I.e., "top"
> shows 16GB cached, and slapd is 16GB.)
> If I confine my queries to the first 20% of the data then it all
> fits in RAM and queries are nice and fast.
>
> if I extend the query range to cover more of the data, approaching
> the size of physical RAM, I see something strange - the FS cache
> keeps growing, but the slapd process size grows at a slower rate.
> This is rather puzzling to me since the only thing triggering reads
> is accesses through the mmap region. Eventually the FS cache grows
> to basically all of the 32GB of RAM (+/- some text/data space...)
> but the slapd process only reaches 25GB, at which point it actually
> starts to shrink - apparently the FS cache is now stealing pages
> from it. I find that a bit puzzling; if the pages are present in
> memory, and the only reason they were paged in was to satisfy an
> mmap reference, why aren't they simply assigned to the slapd
> process?
>
> The current behavior gets even more aggravating: I can run a test
> that spans exactly 30GB of the data. One would expect that the slapd
> process should simply grow to 30GB in size, and then remain static
> for the remainder of the test. Instead, the server grows to 25GB,
> the FS cache grows to 32GB, and starts stealing pages from the
> server, shrinking it back down to 19GB or so.
>
> If I do an "echo 1 > /proc/sys/vm/drop_caches" at the onset of this
> condition, the FS cache shrinks back to 25GB, matching the slapd
> process size.
> This then frees up enough RAM for slapd to grow further. If I don't
> do this, the test is constantly paging in data from disk. Even so,
> the FS cache continues to grow faster than the slapd process size,
> so the system may run out of free RAM again, and I have to drop
> caches multiple times before slapd finally grows to the full 30GB.
> Once it gets to that size the test runs entirely from RAM with zero
> I/Os, but it doesn't get there without a lot of babysitting.
>
> 2 questions:
> why is there data in the FS cache that isn't owned by (the mmap
> of) the process that caused it to be paged in in the first place?
> is there a tunable knob to discourage the page cache from stealing
> from the process?
>
> --
> -- Howard Chu
> CTO, Symas Corp. http://www.symas.com
> Director, Highland Sun http://highlandsun.com/hyc/
> Chief Architect, OpenLDAP http://www.openldap.org/project/
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
* Re: mmap vs fs cache
2013-03-07 15:43 ` mmap vs fs cache Jan Kara
@ 2013-03-08 2:08 ` Johannes Weiner
2013-03-08 7:46 ` Howard Chu
2013-03-09 2:34 ` Ric Mason
0 siblings, 2 replies; 17+ messages in thread
From: Johannes Weiner @ 2013-03-08 2:08 UTC (permalink / raw)
To: Jan Kara; +Cc: Howard Chu, linux-kernel, linux-mm
On Thu, Mar 07, 2013 at 04:43:12PM +0100, Jan Kara wrote:
> Added mm list to CC.
>
> On Tue 05-03-13 09:57:34, Howard Chu wrote:
> > I'm testing our memory-mapped database code on a small VM. The
> > machine has 32GB of RAM and the size of the DB on disk is ~44GB. The
> > database library mmaps the entire file as a single region and starts
> > accessing it as a tree of B+trees. Running on an Ubuntu 3.5.0-23
> > kernel, XFS on a local disk.
> >
> > If I start running read-only queries against the DB with a freshly
> > started server, I see that my process (OpenLDAP slapd) quickly grows
> > to an RSS of about 16GB in tandem with the FS cache. (I.e., "top"
> > shows 16GB cached, and slapd is 16GB.)
> > If I confine my queries to the first 20% of the data then it all
> > fits in RAM and queries are nice and fast.
> >
> > if I extend the query range to cover more of the data, approaching
> > the size of physical RAM, I see something strange - the FS cache
> > keeps growing, but the slapd process size grows at a slower rate.
> > This is rather puzzling to me since the only thing triggering reads
> > is accesses through the mmap region. Eventually the FS cache grows
> > to basically all of the 32GB of RAM (+/- some text/data space...)
> > but the slapd process only reaches 25GB, at which point it actually
> > starts to shrink - apparently the FS cache is now stealing pages
> > from it. I find that a bit puzzling; if the pages are present in
> > memory, and the only reason they were paged in was to satisfy an
> > mmap reference, why aren't they simply assigned to the slapd
> > process?
> >
> > The current behavior gets even more aggravating: I can run a test
> > that spans exactly 30GB of the data. One would expect that the slapd
> > process should simply grow to 30GB in size, and then remain static
> > for the remainder of the test. Instead, the server grows to 25GB,
> > the FS cache grows to 32GB, and starts stealing pages from the
> > server, shrinking it back down to 19GB or so.
> >
> > If I do an "echo 1 > /proc/sys/vm/drop_caches" at the onset of this
> > condition, the FS cache shrinks back to 25GB, matching the slapd
> > process size.
> > This then frees up enough RAM for slapd to grow further. If I don't
> > do this, the test is constantly paging in data from disk. Even so,
> > the FS cache continues to grow faster than the slapd process size,
> > so the system may run out of free RAM again, and I have to drop
> > caches multiple times before slapd finally grows to the full 30GB.
> > Once it gets to that size the test runs entirely from RAM with zero
> > I/Os, but it doesn't get there without a lot of babysitting.
> >
> > 2 questions:
> > why is there data in the FS cache that isn't owned by (the mmap
> > of) the process that caused it to be paged in in the first place?
The filesystem cache is shared among processes because the filesystem
is also shared among processes. If another task were to access the
same file, we still should only have one copy of that data in memory.
It sounds to me like slapd is itself caching all the data it reads.
If that is true, shouldn't it really be using direct IO to prevent
this double buffering of filesystem data in memory?
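  (Illustration: a rough sketch of the direct I/O alternative suggested
  here, purely to show what it would look like; Howard explains below that
  there is in fact no double buffering in his setup. The file name and the
  4096-byte alignment are assumptions; actual O_DIRECT alignment
  requirements depend on the filesystem and device.)

    #define _GNU_SOURCE                 /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("data.mdb", O_RDONLY | O_DIRECT);  /* hypothetical file */
        if (fd < 0) { perror("open"); return 1; }

        /* O_DIRECT requires the buffer, offset and length to be aligned. */
        void *buf;
        if (posix_memalign(&buf, 4096, 4096) != 0) {
            fprintf(stderr, "posix_memalign failed\n");
            return 1;
        }

        /* Read one block at offset 0, bypassing the page cache entirely. */
        ssize_t n = pread(fd, buf, 4096, 0);
        if (n < 0) perror("pread");

        free(buf);
        close(fd);
        return 0;
    }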
> > is there a tunable knob to discourage the page cache from stealing
> > from the process?
Try reducing /proc/sys/vm/swappiness, which ranges from 0-100 and
defaults to 60.
* Re: mmap vs fs cache
2013-03-08 2:08 ` Johannes Weiner
@ 2013-03-08 7:46 ` Howard Chu
2013-03-08 8:42 ` Kirill A. Shutemov
2013-03-09 2:34 ` Ric Mason
1 sibling, 1 reply; 17+ messages in thread
From: Howard Chu @ 2013-03-08 7:46 UTC (permalink / raw)
To: Johannes Weiner, Jan Kara; +Cc: linux-kernel, linux-mm
Johannes Weiner wrote:
> On Thu, Mar 07, 2013 at 04:43:12PM +0100, Jan Kara wrote:
>>> 2 questions:
>>> why is there data in the FS cache that isn't owned by (the mmap
>>> of) the process that caused it to be paged in in the first place?
>
> The filesystem cache is shared among processes because the filesystem
> is also shared among processes. If another task were to access the
> same file, we still should only have one copy of that data in memory.
That's irrelevant to the question. As I already explained, the first 16GB that
was paged in didn't behave this way. Perhaps "owned" was the wrong word, since
this is a MAP_SHARED mapping. But the point is that the memory is not being
accounted in slapd's process size, when it was before, up to 16GB.
> It sounds to me like slapd is itself caching all the data it reads.
You're misreading the information then. slapd is doing no caching of its own,
its RSS and SHR memory size are both the same. All it is using is the mmap,
nothing else. The RSS == SHR == FS cache, up to 16GB. RSS is always == SHR,
but above 16GB they grow more slowly than the FS cache.
> If that is true, shouldn't it really be using direct IO to prevent
> this double buffering of filesystem data in memory?
There is no double buffering.
>>> is there a tunable knob to discourage the page cache from stealing
>>> from the process?
>
> Try reducing /proc/sys/vm/swappiness, which ranges from 0-100 and
> defaults to 60.
I've already tried setting it to 0 with no effect.
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
* Re: mmap vs fs cache
2013-03-08 7:46 ` Howard Chu
@ 2013-03-08 8:42 ` Kirill A. Shutemov
2013-03-08 9:40 ` Howard Chu
0 siblings, 1 reply; 17+ messages in thread
From: Kirill A. Shutemov @ 2013-03-08 8:42 UTC (permalink / raw)
To: Howard Chu; +Cc: Johannes Weiner, Jan Kara, linux-kernel, linux-mm
On Thu, Mar 07, 2013 at 11:46:39PM -0800, Howard Chu wrote:
> You're misreading the information then. slapd is doing no caching of
> its own, its RSS and SHR memory size are both the same. All it is
> using is the mmap, nothing else. The RSS == SHR == FS cache, up to
> 16GB. RSS is always == SHR, but above 16GB they grow more slowly
> than the FS cache.
It only means that some pages got unmapped from your process. It can
happen, for instance, due to page migration. There's nothing to worry
about: the page will be mapped back on the next page fault, and it's only
a minor fault since the page is in the page cache anyway.
--
Kirill A. Shutemov
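  (Illustration: Kirill's point, that a page can stay in the page cache
  while being unmapped from the process, can be checked from userspace with
  mincore(2), which reports per-page residency for a mapping regardless of
  whether the page currently counts toward RSS. A rough sketch; 'map' and
  'len' are assumed to be the existing mapping and its length.)

    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Returns the number of pages of [map, map+len) that are resident in
     * the page cache, or -1 on error. */
    long resident_pages(void *map, size_t len)
    {
        long page = sysconf(_SC_PAGESIZE);
        size_t npages = (len + page - 1) / page;
        unsigned char *vec = malloc(npages);
        if (!vec)
            return -1;
        if (mincore(map, len, vec) != 0) {
            free(vec);
            return -1;
        }
        long resident = 0;
        for (size_t i = 0; i < npages; i++)
            if (vec[i] & 1)             /* low bit set: page is resident */
                resident++;
        free(vec);
        return resident;
    }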
* Re: mmap vs fs cache
2013-03-08 8:42 ` Kirill A. Shutemov
@ 2013-03-08 9:40 ` Howard Chu
2013-03-08 14:47 ` Chris Friesen
0 siblings, 1 reply; 17+ messages in thread
From: Howard Chu @ 2013-03-08 9:40 UTC (permalink / raw)
To: Kirill A. Shutemov; +Cc: Johannes Weiner, Jan Kara, linux-kernel, linux-mm
Kirill A. Shutemov wrote:
> On Thu, Mar 07, 2013 at 11:46:39PM -0800, Howard Chu wrote:
>> You're misreading the information then. slapd is doing no caching of
>> its own, its RSS and SHR memory size are both the same. All it is
>> using is the mmap, nothing else. The RSS == SHR == FS cache, up to
>> 16GB. RSS is always == SHR, but above 16GB they grow more slowly
>> than the FS cache.
>
> It only means that some pages got unmapped from your process. It can
> happen, for instance, due to page migration. There's nothing to worry
> about: the page will be mapped back on the next page fault, and it's
> only a minor fault since the page is in the page cache anyway.
Unfortunately there *is* something to worry about. As I said already - when
the test spans 30GB, the FS cache fills up the rest of RAM and the test is
doing a lot of real I/O even though it shouldn't need to. Please, read the
entire original post before replying.
There is no way that a process that is accessing only 30GB of a mmap should be
able to fill up 32GB of RAM. There's nothing else running on the machine, I've
killed or suspended everything else in userland besides a couple shells
running top and vmstat. When I manually drop_caches repeatedly, then
eventually slapd RSS/SHR grows to 30GB and the physical I/O stops.
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
* Re: mmap vs fs cache
2013-03-08 9:40 ` Howard Chu
@ 2013-03-08 14:47 ` Chris Friesen
2013-03-08 15:00 ` Howard Chu
0 siblings, 1 reply; 17+ messages in thread
From: Chris Friesen @ 2013-03-08 14:47 UTC (permalink / raw)
To: Howard Chu
Cc: Kirill A. Shutemov, Johannes Weiner, Jan Kara, linux-kernel,
linux-mm
On 03/08/2013 03:40 AM, Howard Chu wrote:
> There is no way that a process that is accessing only 30GB of a mmap
> should be able to fill up 32GB of RAM. There's nothing else running on
> the machine, I've killed or suspended everything else in userland
> besides a couple shells running top and vmstat. When I manually
> drop_caches repeatedly, then eventually slapd RSS/SHR grows to 30GB and
> the physical I/O stops.
Is it possible that the kernel is doing some sort of automatic
readahead, but it ends up reading pages corresponding to data that isn't
ever queried and so doesn't get mapped by the application?
Chris
* Re: mmap vs fs cache
2013-03-08 14:47 ` Chris Friesen
@ 2013-03-08 15:00 ` Howard Chu
2013-03-08 15:25 ` Chris Friesen
` (2 more replies)
0 siblings, 3 replies; 17+ messages in thread
From: Howard Chu @ 2013-03-08 15:00 UTC (permalink / raw)
To: Chris Friesen
Cc: Kirill A. Shutemov, Johannes Weiner, Jan Kara, linux-kernel,
linux-mm
Chris Friesen wrote:
> On 03/08/2013 03:40 AM, Howard Chu wrote:
>
>> There is no way that a process that is accessing only 30GB of a mmap
>> should be able to fill up 32GB of RAM. There's nothing else running on
>> the machine, I've killed or suspended everything else in userland
>> besides a couple shells running top and vmstat. When I manually
>> drop_caches repeatedly, then eventually slapd RSS/SHR grows to 30GB and
>> the physical I/O stops.
>
> Is it possible that the kernel is doing some sort of automatic
> readahead, but it ends up reading pages corresponding to data that isn't
> ever queried and so doesn't get mapped by the application?
Yes, that's what I was thinking. I added a posix_madvise(..POSIX_MADV_RANDOM)
but that had no effect on the test.
First obvious conclusion - kswapd is being too aggressive. When free memory
hits the low watermark, the reclaim shrinks slapd down from 25GB to 18-19GB,
while the page cache still contains ~7GB of unmapped pages. Ideally I'd like a
tuning knob so I can say to keep no more than 2GB of unmapped pages in the
cache. (And the desired effect of that would be to allow user processes to
grow to 30GB total, in this case.)
I mentioned this "unmapped page cache control" post already
http://lwn.net/Articles/436010/ but it seems that the idea was ultimately
rejected. Is there anything else similar in current kernels?
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
* Re: mmap vs fs cache
2013-03-08 15:00 ` Howard Chu
@ 2013-03-08 15:25 ` Chris Friesen
2013-03-08 16:16 ` Johannes Weiner
2013-03-09 1:22 ` Phillip Susi
2 siblings, 0 replies; 17+ messages in thread
From: Chris Friesen @ 2013-03-08 15:25 UTC (permalink / raw)
To: Howard Chu
Cc: Kirill A. Shutemov, Johannes Weiner, Jan Kara, linux-kernel,
linux-mm
On 03/08/2013 09:00 AM, Howard Chu wrote:
> First obvious conclusion - kswapd is being too aggressive. When free
> memory hits the low watermark, the reclaim shrinks slapd down from 25GB
> to 18-19GB, while the page cache still contains ~7GB of unmapped pages.
> Ideally I'd like a tuning knob so I can say to keep no more than 2GB of
> unmapped pages in the cache. (And the desired effect of that would be to
> allow user processes to grow to 30GB total, in this case.)
>
> I mentioned this "unmapped page cache control" post already
> http://lwn.net/Articles/436010/ but it seems that the idea was
> ultimately rejected. Is there anything else similar in current kernels?
Sorry, I'm not aware of anything. I'm not a filesystem/vm guy though,
so maybe there's something I don't know about.
I would have expected both posix_madvise(..POSIX_MADV_RANDOM) and
swappiness to help, but it doesn't sound like they're working.
Chris
* Re: mmap vs fs cache
2013-03-08 15:00 ` Howard Chu
2013-03-08 15:25 ` Chris Friesen
@ 2013-03-08 16:16 ` Johannes Weiner
2013-03-08 20:04 ` Howard Chu
2013-03-09 3:28 ` Ric Mason
2013-03-09 1:22 ` Phillip Susi
2 siblings, 2 replies; 17+ messages in thread
From: Johannes Weiner @ 2013-03-08 16:16 UTC (permalink / raw)
To: Howard Chu
Cc: Chris Friesen, Kirill A. Shutemov, Jan Kara, Mel Gorman,
Rik van Riel, linux-kernel, linux-mm
On Fri, Mar 08, 2013 at 07:00:55AM -0800, Howard Chu wrote:
> Chris Friesen wrote:
> >On 03/08/2013 03:40 AM, Howard Chu wrote:
> >
> >>There is no way that a process that is accessing only 30GB of a mmap
> >>should be able to fill up 32GB of RAM. There's nothing else running on
> >>the machine, I've killed or suspended everything else in userland
> >>besides a couple shells running top and vmstat. When I manually
> >>drop_caches repeatedly, then eventually slapd RSS/SHR grows to 30GB and
> >>the physical I/O stops.
> >
> >Is it possible that the kernel is doing some sort of automatic
> >readahead, but it ends up reading pages corresponding to data that isn't
> >ever queried and so doesn't get mapped by the application?
>
> Yes, that's what I was thinking. I added a
> posix_madvise(..POSIX_MADV_RANDOM) but that had no effect on the
> test.
>
> First obvious conclusion - kswapd is being too aggressive. When free
> memory hits the low watermark, the reclaim shrinks slapd down from
> 25GB to 18-19GB, while the page cache still contains ~7GB of
> unmapped pages. Ideally I'd like a tuning knob so I can say to keep
> no more than 2GB of unmapped pages in the cache. (And the desired
> effect of that would be to allow user processes to grow to 30GB
> total, in this case.)
We should find out where the unmapped page cache is coming from if you
are only accessing mapped file cache and disabled readahead.
How do you arrive at this number of unmapped page cache?
What could happen is that previously used and activated pages do not
get evicted anymore since there is a constant supply of younger
reclaimable cache that is actually thrashing. Whenever you drop the
caches, you get rid of those stale active pages and allow the
previously thrashing cache to get activated. However, that would
require that there is already a significant amount of active file
pages before your workload starts (check the nr_active_file number in
/proc/vmstat before launching slapd, try sync; echo 3 >drop_caches
before launching to eliminate this option) OR that the set of pages
accessed during your workload changes and the combined set of pages
accessed by your workload is bigger than available memory -- which you
claimed would not happen because you only access the 30GB file area on
that system.
* Re: mmap vs fs cache
2013-03-08 16:16 ` Johannes Weiner
@ 2013-03-08 20:04 ` Howard Chu
2013-03-11 12:04 ` Jan Kara
2013-03-09 3:28 ` Ric Mason
1 sibling, 1 reply; 17+ messages in thread
From: Howard Chu @ 2013-03-08 20:04 UTC (permalink / raw)
To: Johannes Weiner
Cc: Chris Friesen, Kirill A. Shutemov, Jan Kara, Mel Gorman,
Rik van Riel, linux-kernel, linux-mm
Johannes Weiner wrote:
> On Fri, Mar 08, 2013 at 07:00:55AM -0800, Howard Chu wrote:
>> Chris Friesen wrote:
>>> On 03/08/2013 03:40 AM, Howard Chu wrote:
>>>
>>>> There is no way that a process that is accessing only 30GB of a mmap
>>>> should be able to fill up 32GB of RAM. There's nothing else running on
>>>> the machine, I've killed or suspended everything else in userland
>>>> besides a couple shells running top and vmstat. When I manually
>>>> drop_caches repeatedly, then eventually slapd RSS/SHR grows to 30GB and
>>>> the physical I/O stops.
>>>
>>> Is it possible that the kernel is doing some sort of automatic
>>> readahead, but it ends up reading pages corresponding to data that isn't
>>> ever queried and so doesn't get mapped by the application?
>>
>> Yes, that's what I was thinking. I added a
>> posix_madvise(..POSIX_MADV_RANDOM) but that had no effect on the
>> test.
>>
>> First obvious conclusion - kswapd is being too aggressive. When free
>> memory hits the low watermark, the reclaim shrinks slapd down from
>> 25GB to 18-19GB, while the page cache still contains ~7GB of
>> unmapped pages. Ideally I'd like a tuning knob so I can say to keep
>> no more than 2GB of unmapped pages in the cache. (And the desired
>> effect of that would be to allow user processes to grow to 30GB
>> total, in this case.)
>
> We should find out where the unmapped page cache is coming from if you
> are only accessing mapped file cache and disabled readahead.
>
> How do you arrive at this number of unmapped page cache?
This number is pretty obvious. When slapd has grown to 25GB, the page cache
has grown to 32GB (less about 200MB, the minfree). So: 7GB unmapped in the cache.
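  (Illustration: one way to derive numbers like the 25GB/32GB/7GB above
  programmatically rather than from top: compare the target process's
  resident shared, i.e. file-backed, pages from /proc/<pid>/statm with the
  global "Cached:" figure from /proc/meminfo. This is only a rough
  approximation, since "Cached" also counts things like tmpfs pages.)

    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        const char *pid = argc > 1 ? argv[1] : "self";
        char path[64];
        long size, resident, shared;        /* statm fields, in pages */

        snprintf(path, sizeof(path), "/proc/%s/statm", pid);
        FILE *f = fopen(path, "r");
        if (!f)
            return 1;
        if (fscanf(f, "%ld %ld %ld", &size, &resident, &shared) != 3) {
            fclose(f);
            return 1;
        }
        fclose(f);

        long cached_kb = 0;
        char line[256];
        FILE *m = fopen("/proc/meminfo", "r");
        while (m && fgets(line, sizeof(line), m))
            if (sscanf(line, "Cached: %ld kB", &cached_kb) == 1)
                break;
        if (m)
            fclose(m);

        long page_kb = sysconf(_SC_PAGESIZE) / 1024;
        printf("mapped file pages: %ld MB  page cache: %ld MB  difference: %ld MB\n",
               shared * page_kb / 1024, cached_kb / 1024,
               cached_kb / 1024 - shared * page_kb / 1024);
        return 0;
    }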
> What could happen is that previously used and activated pages do not
> get evicted anymore since there is a constant supply of younger
> reclaimable cache that is actually thrashing. Whenever you drop the
> caches, you get rid of those stale active pages and allow the
> previously thrashing cache to get activated. However, that would
> require that there is already a significant amount of active file
> pages before your workload starts (check the nr_active_file number in
> /proc/vmstat before launching slapd, try sync; echo 3 >drop_caches
> before launching to eliminate this option) OR that the set of pages
> accessed during your workload changes and the combined set of pages
> accessed by your workload is bigger than available memory -- which you
> claimed would not happen because you only access the 30GB file area on
> that system.
There are no other active pages before the test begins. There's nothing else
running. caches have been dropped completely at the beginning.
The test clearly is accessing only 30GB of data. Once slapd reaches this
process size, the test can be stopped and restarted any number of times, run
for any number of hours continuously, and memory use on the system is
unchanged, and no pageins occur.
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
* Re: mmap vs fs cache
2013-03-08 15:00 ` Howard Chu
2013-03-08 15:25 ` Chris Friesen
2013-03-08 16:16 ` Johannes Weiner
@ 2013-03-09 1:22 ` Phillip Susi
2013-03-11 11:52 ` Jan Kara
2 siblings, 1 reply; 17+ messages in thread
From: Phillip Susi @ 2013-03-09 1:22 UTC (permalink / raw)
To: Howard Chu
Cc: Chris Friesen, Kirill A. Shutemov, Johannes Weiner, Jan Kara,
linux-kernel, linux-mm
On 03/08/2013 10:00 AM, Howard Chu wrote:
> Yes, that's what I was thinking. I added a
> posix_madvise(..POSIX_MADV_RANDOM) but that had no effect on the
> test.
Yep, that's because it isn't implemented.
You might try MADV_WILLNEED to schedule it to be read in first. I
believe that will only read in the requested page, without additional
readahead, and then when you fault on the page, it already has IO
scheduled, so the extra readahead will also be skipped.
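  (Illustration: a minimal sketch of the MADV_WILLNEED idea suggested here.
  'map', 'offset' and 'length' are assumed to describe the part of the
  existing mapping the next lookup will touch, already rounded to page
  boundaries by the caller.)

    #include <sys/mman.h>

    /* Hint that [map+offset, map+offset+length) will be needed soon, so
     * the kernel can start reading it in before the first page fault. */
    static int prefetch_range(void *map, size_t offset, size_t length)
    {
        return madvise((char *)map + offset, length, MADV_WILLNEED);
    }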
* Re: mmap vs fs cache
2013-03-08 2:08 ` Johannes Weiner
2013-03-08 7:46 ` Howard Chu
@ 2013-03-09 2:34 ` Ric Mason
1 sibling, 0 replies; 17+ messages in thread
From: Ric Mason @ 2013-03-09 2:34 UTC (permalink / raw)
To: Johannes Weiner; +Cc: Jan Kara, Howard Chu, linux-kernel, linux-mm
Hi Johannes,
On 03/08/2013 10:08 AM, Johannes Weiner wrote:
> On Thu, Mar 07, 2013 at 04:43:12PM +0100, Jan Kara wrote:
>> Added mm list to CC.
>>
>> On Tue 05-03-13 09:57:34, Howard Chu wrote:
>>> I'm testing our memory-mapped database code on a small VM. The
>>> machine has 32GB of RAM and the size of the DB on disk is ~44GB. The
>>> database library mmaps the entire file as a single region and starts
>>> accessing it as a tree of B+trees. Running on an Ubuntu 3.5.0-23
>>> kernel, XFS on a local disk.
>>>
>>> If I start running read-only queries against the DB with a freshly
>>> started server, I see that my process (OpenLDAP slapd) quickly grows
>>> to an RSS of about 16GB in tandem with the FS cache. (I.e., "top"
>>> shows 16GB cached, and slapd is 16GB.)
>>> If I confine my queries to the first 20% of the data then it all
>>> fits in RAM and queries are nice and fast.
>>>
>>> if I extend the query range to cover more of the data, approaching
>>> the size of physical RAM, I see something strange - the FS cache
>>> keeps growing, but the slapd process size grows at a slower rate.
>>> This is rather puzzling to me since the only thing triggering reads
>>> is accesses through the mmap region. Eventually the FS cache grows
>>> to basically all of the 32GB of RAM (+/- some text/data space...)
>>> but the slapd process only reaches 25GB, at which point it actually
>>> starts to shrink - apparently the FS cache is now stealing pages
>>> from it. I find that a bit puzzling; if the pages are present in
>>> memory, and the only reason they were paged in was to satisfy an
>>> mmap reference, why aren't they simply assigned to the slapd
>>> process?
>>>
>>> The current behavior gets even more aggravating: I can run a test
>>> that spans exactly 30GB of the data. One would expect that the slapd
>>> process should simply grow to 30GB in size, and then remain static
>>> for the remainder of the test. Instead, the server grows to 25GB,
>>> the FS cache grows to 32GB, and starts stealing pages from the
>>> server, shrinking it back down to 19GB or so.
>>>
>>> If I do an "echo 1 > /proc/sys/vm/drop_caches" at the onset of this
>>> condition, the FS cache shrinks back to 25GB, matching the slapd
>>> process size.
>>> This then frees up enough RAM for slapd to grow further. If I don't
>>> do this, the test is constantly paging in data from disk. Even so,
>>> the FS cache continues to grow faster than the slapd process size,
>>> so the system may run out of free RAM again, and I have to drop
>>> caches multiple times before slapd finally grows to the full 30GB.
>>> Once it gets to that size the test runs entirely from RAM with zero
>>> I/Os, but it doesn't get there without a lot of babysitting.
>>>
>>> 2 questions:
>>> why is there data in the FS cache that isn't owned by (the mmap
>>> of) the process that caused it to be paged in in the first place?
> The filesystem cache is shared among processes because the filesystem
> is also shared among processes. If another task were to access the
> same file, we still should only have one copy of that data in memory.
>
> It sounds to me like slapd is itself caching all the data it reads.
> If that is true, shouldn't it really be using direct IO to prevent
> this double buffering of filesystem data in memory?
When is using direct IO better, and when is using the page cache better?
>
>>> is there a tunable knob to discourage the page cache from stealing
>>> from the process?
> Try reducing /proc/sys/vm/swappiness, which ranges from 0-100 and
> defaults to 60.
Why reduce it? IIUC, swappiness determines how aggressively anonymous
pages are reclaimed; if the value is high, more anonymous pages will be
reclaimed.
* Re: mmap vs fs cache
2013-03-08 16:16 ` Johannes Weiner
2013-03-08 20:04 ` Howard Chu
@ 2013-03-09 3:28 ` Ric Mason
1 sibling, 0 replies; 17+ messages in thread
From: Ric Mason @ 2013-03-09 3:28 UTC (permalink / raw)
To: Johannes Weiner
Cc: Howard Chu, Chris Friesen, Kirill A. Shutemov, Jan Kara,
Mel Gorman, Rik van Riel, linux-kernel, linux-mm
Hi Johannes,
On 03/09/2013 12:16 AM, Johannes Weiner wrote:
> On Fri, Mar 08, 2013 at 07:00:55AM -0800, Howard Chu wrote:
>> Chris Friesen wrote:
>>> On 03/08/2013 03:40 AM, Howard Chu wrote:
>>>
>>>> There is no way that a process that is accessing only 30GB of a mmap
>>>> should be able to fill up 32GB of RAM. There's nothing else running on
>>>> the machine, I've killed or suspended everything else in userland
>>>> besides a couple shells running top and vmstat. When I manually
>>>> drop_caches repeatedly, then eventually slapd RSS/SHR grows to 30GB and
>>>> the physical I/O stops.
>>> Is it possible that the kernel is doing some sort of automatic
>>> readahead, but it ends up reading pages corresponding to data that isn't
>>> ever queried and so doesn't get mapped by the application?
>> Yes, that's what I was thinking. I added a
>> posix_madvise(..POSIX_MADV_RANDOM) but that had no effect on the
>> test.
>>
>> First obvious conclusion - kswapd is being too aggressive. When free
>> memory hits the low watermark, the reclaim shrinks slapd down from
>> 25GB to 18-19GB, while the page cache still contains ~7GB of
>> unmapped pages. Ideally I'd like a tuning knob so I can say to keep
>> no more than 2GB of unmapped pages in the cache. (And the desired
>> effect of that would be to allow user processes to grow to 30GB
>> total, in this case.)
> We should find out where the unmapped page cache is coming from if you
> are only accessing mapped file cache and disabled readahead.
>
> How do you arrive at this number of unmapped page cache?
>
> What could happen is that previously used and activated pages do not
> get evicted anymore since there is a constant supply of younger
If a user process exits, are its file pages and anonymous pages freed
immediately, or do they go through page reclaim?
> reclaimable cache that is actually thrashing. Whenever you drop the
> caches, you get rid of those stale active pages and allow the
> previously thrashing cache to get activated. However, that would
> require that there is already a significant amount of active file
Why do you emphasize a *significant* amount of active file pages?
> pages before your workload starts (check the nr_active_file number in
> /proc/vmstat before launching slapd, try sync; echo 3 >drop_caches
> before launching to eliminate this option) OR that the set of pages
> accessed during your workload changes and the combined set of pages
> accessed by your workload is bigger than available memory -- which you
> claimed would not happen because you only access the 30GB file area on
> that system.
* Re: mmap vs fs cache
2013-03-09 1:22 ` Phillip Susi
@ 2013-03-11 11:52 ` Jan Kara
2013-03-11 15:03 ` Phillip Susi
0 siblings, 1 reply; 17+ messages in thread
From: Jan Kara @ 2013-03-11 11:52 UTC (permalink / raw)
To: Phillip Susi
Cc: Howard Chu, Chris Friesen, Kirill A. Shutemov, Johannes Weiner,
Jan Kara, linux-kernel, linux-mm
On Fri 08-03-13 20:22:19, Phillip Susi wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 03/08/2013 10:00 AM, Howard Chu wrote:
> > Yes, that's what I was thinking. I added a
> > posix_madvise(..POSIX_MADV_RANDOM) but that had no effect on the
> > test.
>
> Yep, that's because it isn't implemented.
Why do you think so? AFAICS it is implemented by setting the VM_RAND_READ
flag in the VMA; do_async_mmap_readahead() and do_sync_mmap_readahead()
check for the flag and don't do anything if it is set...
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
* Re: mmap vs fs cache
2013-03-08 20:04 ` Howard Chu
@ 2013-03-11 12:04 ` Jan Kara
2013-03-11 12:40 ` Howard Chu
0 siblings, 1 reply; 17+ messages in thread
From: Jan Kara @ 2013-03-11 12:04 UTC (permalink / raw)
To: Howard Chu
Cc: Johannes Weiner, Chris Friesen, Kirill A. Shutemov, Jan Kara,
Mel Gorman, Rik van Riel, linux-kernel, linux-mm
On Fri 08-03-13 12:04:46, Howard Chu wrote:
> Johannes Weiner wrote:
> >On Fri, Mar 08, 2013 at 07:00:55AM -0800, Howard Chu wrote:
> >>Chris Friesen wrote:
> >>>On 03/08/2013 03:40 AM, Howard Chu wrote:
> >>>
> >>>>There is no way that a process that is accessing only 30GB of a mmap
> >>>>should be able to fill up 32GB of RAM. There's nothing else running on
> >>>>the machine, I've killed or suspended everything else in userland
> >>>>besides a couple shells running top and vmstat. When I manually
> >>>>drop_caches repeatedly, then eventually slapd RSS/SHR grows to 30GB and
> >>>>the physical I/O stops.
> >>>
> >>>Is it possible that the kernel is doing some sort of automatic
> >>>readahead, but it ends up reading pages corresponding to data that isn't
> >>>ever queried and so doesn't get mapped by the application?
> >>
> >>Yes, that's what I was thinking. I added a
> >>posix_madvise(..POSIX_MADV_RANDOM) but that had no effect on the
> >>test.
> >>
> >>First obvious conclusion - kswapd is being too aggressive. When free
> >>memory hits the low watermark, the reclaim shrinks slapd down from
> >>25GB to 18-19GB, while the page cache still contains ~7GB of
> >>unmapped pages. Ideally I'd like a tuning knob so I can say to keep
> >>no more than 2GB of unmapped pages in the cache. (And the desired
> >>effect of that would be to allow user processes to grow to 30GB
> >>total, in this case.)
> >
> >We should find out where the unmapped page cache is coming from if you
> >are only accessing mapped file cache and disabled readahead.
> >
> >How do you arrive at this number of unmapped page cache?
>
> This number is pretty obvious. When slapd has grown to 25GB, the
This 25G is presumably from /proc/pid/statm, right?
> page cache has grown to 32GB (less about 200MB, the minfree). So:
And this value is from where? /proc/meminfo - Cached line?
> 7GB unmapped in the cache.
>
> >What could happen is that previously used and activated pages do not
> >get evicted anymore since there is a constant supply of younger
> >reclaimable cache that is actually thrashing. Whenever you drop the
> >caches, you get rid of those stale active pages and allow the
> >previously thrashing cache to get activated. However, that would
> >require that there is already a significant amount of active file
> >pages before your workload starts (check the nr_active_file number in
> >/proc/vmstat before launching slapd, try sync; echo 3 >drop_caches
> >before launching to eliminate this option) OR that the set of pages
> >accessed during your workload changes and the combined set of pages
> >accessed by your workload is bigger than available memory -- which you
> >claimed would not happen because you only access the 30GB file area on
> >that system.
>
> There are no other active pages before the test begins. There's
> nothing else running. caches have been dropped completely at the
> beginning.
>
> The test clearly is accessing only 30GB of data. Once slapd reaches
> this process size, the test can be stopped and restarted any number
> of times, run for any number of hours continuously, and memory use
> on the system is unchanged, and no pageins occur.
Interesting. It might be worth trying what happens if you do
madvise(..., MADV_DONTNEED) on the data file instead of dropping caches
with /proc/sys/vm/drop_caches. That way we can establish whether the extra
cached data is in the data file (things will look the same way as with
drop_caches) or somewhere else (there will be still unmapped page cache).
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
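  (Illustration: a sketch of the experiment Jan suggests here, dropping the
  data file's region from the process instead of using the global
  drop_caches knob. 'map' and 'maplen' are assumed to be the existing
  mapping and its length.)

    #include <sys/mman.h>

    /* Discard the process's mappings of the data file region; later
     * accesses fault the pages back in as needed. */
    static int drop_mapping(void *map, size_t maplen)
    {
        return madvise(map, maplen, MADV_DONTNEED);
    }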
* Re: mmap vs fs cache
2013-03-11 12:04 ` Jan Kara
@ 2013-03-11 12:40 ` Howard Chu
0 siblings, 0 replies; 17+ messages in thread
From: Howard Chu @ 2013-03-11 12:40 UTC (permalink / raw)
To: Jan Kara
Cc: Johannes Weiner, Chris Friesen, Kirill A. Shutemov, Mel Gorman,
Rik van Riel, linux-kernel, linux-mm
Jan Kara wrote:
> On Fri 08-03-13 12:04:46, Howard Chu wrote:
>> The test clearly is accessing only 30GB of data. Once slapd reaches
>> this process size, the test can be stopped and restarted any number
>> of times, run for any number of hours continuously, and memory use
>> on the system is unchanged, and no pageins occur.
> Interesting. It might be worth trying what happens if you do
> madvise(..., MADV_DONTNEED) on the data file instead of dropping caches
> with /proc/sys/vm/drop_caches. That way we can establish whether the extra
> cached data is in the data file (things will look the same way as with
> drop_caches) or somewhere else (there will be still unmapped page cache).
I screwed up. My madvise(RANDOM) call used the wrong address/len so it didn't
cover the whole region. After fixing this, the test now runs as expected - the
slapd process size grows to 30GB without any problem. Sorry for the noise.
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
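  (Illustration: the fix Howard describes amounts to making sure the
  random-access hint covers the entire mapping, i.e. the same address and
  length that mmap() returned. A minimal sketch, not the actual slapd/LMDB
  code.)

    #include <stdio.h>
    #include <sys/mman.h>

    /* Mark the whole mapped region as randomly accessed so fault-time
     * readahead does not pull in pages that are never referenced.
     * posix_madvise() returns an error number rather than setting errno. */
    static int disable_readahead(void *map, size_t maplen)
    {
        int err = posix_madvise(map, maplen, POSIX_MADV_RANDOM);
        if (err != 0)
            fprintf(stderr, "posix_madvise: error %d\n", err);
        return err;
    }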
* Re: mmap vs fs cache
2013-03-11 11:52 ` Jan Kara
@ 2013-03-11 15:03 ` Phillip Susi
0 siblings, 0 replies; 17+ messages in thread
From: Phillip Susi @ 2013-03-11 15:03 UTC (permalink / raw)
To: Jan Kara
Cc: Howard Chu, Chris Friesen, Kirill A. Shutemov, Johannes Weiner,
linux-kernel, linux-mm
On 3/11/2013 7:52 AM, Jan Kara wrote:
>> Yep, that's because it isn't implemented.
> Why do you think so? AFAICS it is implemented by setting VM_RAND_READ
> flag in the VMA and do_async_mmap_readahead() and do_sync_mmap_readahead()
> check for the flag and don't do anything if it is set...
Oh, I don't know how I missed that... I was just looking for it the other
day and couldn't find any references to VM_RandomReadHint, so I assumed it
hadn't been implemented.