* [PATCH 0/3] mm: support bigger cache workingsets and protect against writes @ 2016-04-04 17:13 Johannes Weiner 2016-04-04 17:13 ` [PATCH 1/3] mm: workingset: only do workingset activations on reads Johannes Weiner ` (3 more replies) 0 siblings, 4 replies; 10+ messages in thread From: Johannes Weiner @ 2016-04-04 17:13 UTC (permalink / raw) To: Andres Freund, Rik van Riel, Andrew Morton Cc: linux-mm, linux-kernel, kernel-team Hi, this is a follow-up to http://www.spinics.net/lists/linux-mm/msg101739.html where Andres reported his database workingset being pushed out by the minimum size enforcement of the inactive file list - currently 50% of cache - as well as repeatedly written file pages that are never actually read. Two changes fell out of the discussions. The first change observes that pages that are only ever written don't benefit from caching beyond what the writeback cache does for partial page writes, and so we shouldn't promote them to the active file list where they compete with pages whose cached data is actually accessed repeatedly. This change comes in two patches - one for in-cache write accesses and one for refaults triggered by writes, neither of which should promote a cache page. Second, with the refault detection we don't need to set 50% of the cache aside for used-once cache anymore since we can detect frequently used pages even when they are evicted between accesses. We can allow the active list to be bigger and thus protect a bigger workingset that isn't challenged by streamers. Depending on the access patterns, this can increase major faults during workingset transitions for better performance during stable phases. Andres, I tried reproducing your postgres scenario, but I could never get the WAL to interfere even with wal_log = hot_standby mode. It's a 8G machine, I set shared_buffers = 2GB, ran pgbench -i -s 290, and then -c 32 -j 32 -M prepared -t 150000. Any input on how to trigger the thrashing you observed would be appreciated. But it would be great if you could test these patches on your known-problematic setup as well. Thanks! include/linux/memcontrol.h | 25 ----------- mm/filemap.c | 8 +++- mm/page_alloc.c | 44 ------------------ mm/vmscan.c | 104 +++++++++++++++++-------------------------- 4 files changed, 48 insertions(+), 133 deletions(-) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 10+ messages in thread
* [PATCH 1/3] mm: workingset: only do workingset activations on reads 2016-04-04 17:13 [PATCH 0/3] mm: support bigger cache workingsets and protect against writes Johannes Weiner @ 2016-04-04 17:13 ` Johannes Weiner 2016-04-04 17:13 ` [PATCH 2/3] mm: filemap: only do access " Johannes Weiner ` (2 subsequent siblings) 3 siblings, 0 replies; 10+ messages in thread From: Johannes Weiner @ 2016-04-04 17:13 UTC (permalink / raw) To: Andres Freund, Rik van Riel, Andrew Morton Cc: linux-mm, linux-kernel, kernel-team From: Rik van Riel <riel@redhat.com> When rewriting a page, the data in that page is replaced with new data. This means that evicting something else from the active file list, in order to cache data that will be replaced by something else, is likely to be a waste of memory. It is better to save the active list for frequently read pages, because reads actually use the data that is in the page. This patch ignores partial writes, because it is unclear whether the complexity of identifying those is worth any potential performance gain obtained from better caching pages that see repeated partial writes at large enough intervals to not get caught by the use-twice promotion code used for the inactive file list. Reported-by: Andres Freund <andres@anarazel.de> Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> --- mm/filemap.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/mm/filemap.c b/mm/filemap.c index a8c69c8..ca33816 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -713,8 +713,12 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping, * The page might have been evicted from cache only * recently, in which case it should be activated like * any other repeatedly accessed page. + * The exception is pages getting rewritten; evicting other + * data from the working set, only to cache data that will + * get overwritten with something else, is a waste of memory. */ - if (shadow && workingset_refault(shadow)) { + if (!(gfp_mask & __GFP_WRITE) && + shadow && workingset_refault(shadow)) { SetPageActive(page); workingset_activation(page); } else -- 2.8.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH 2/3] mm: filemap: only do access activations on reads 2016-04-04 17:13 [PATCH 0/3] mm: support bigger cache workingsets and protect against writes Johannes Weiner 2016-04-04 17:13 ` [PATCH 1/3] mm: workingset: only do workingset activations on reads Johannes Weiner @ 2016-04-04 17:13 ` Johannes Weiner 2016-04-04 21:22 ` Andrew Morton 2016-04-04 17:13 ` [PATCH 3/3] mm: vmscan: reduce size of inactive file list Johannes Weiner 2016-04-04 18:52 ` [PATCH 0/3] mm: support bigger cache workingsets and protect against writes Andres Freund 3 siblings, 1 reply; 10+ messages in thread From: Johannes Weiner @ 2016-04-04 17:13 UTC (permalink / raw) To: Andres Freund, Rik van Riel, Andrew Morton Cc: linux-mm, linux-kernel, kernel-team Andres Freund observed that his database workload is struggling with the transaction journal creating pressure on frequently read pages. Access patterns like transaction journals frequently write the same pages over and over, but in the majority of cases those pages are never read back. There are no caching benefits to be had for those pages, so activating them and having them put pressure on pages that do benefit from caching is a bad choice. Leave page activations to read accesses and don't promote pages based on writes alone. It could be said that partially written pages do contain cache-worthy data, because even if *userspace* does not access the unwritten part, the kernel still has to read it from the filesystem for correctness. However, a counter argument is that these pages enjoy at least *some* protection over other inactive file pages through the writeback cache, in the sense that dirty pages are written back with a delay and cache reclaim leaves them alone until they have been written back to disk. Should that turn out to be insufficient and we see increased read IO from partial writes under memory pressure, we can always go back and update grab_cache_page_write_begin() to take (pos, len) so that it can tell partial writes from pages that don't need partial reads. But for now, keep it simple. Reported-by: Andres Freund <andres@anarazel.de> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> --- mm/filemap.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/filemap.c b/mm/filemap.c index ca33816..edfec5e 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -2579,7 +2579,7 @@ struct page *grab_cache_page_write_begin(struct address_space *mapping, pgoff_t index, unsigned flags) { struct page *page; - int fgp_flags = FGP_LOCK|FGP_ACCESSED|FGP_WRITE|FGP_CREAT; + int fgp_flags = FGP_LOCK|FGP_WRITE|FGP_CREAT; if (flags & AOP_FLAG_NOFS) fgp_flags |= FGP_NOFS; -- 2.8.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH 2/3] mm: filemap: only do access activations on reads 2016-04-04 17:13 ` [PATCH 2/3] mm: filemap: only do access " Johannes Weiner @ 2016-04-04 21:22 ` Andrew Morton 2016-04-04 21:39 ` Rik van Riel 2016-04-04 22:47 ` Johannes Weiner 0 siblings, 2 replies; 10+ messages in thread From: Andrew Morton @ 2016-04-04 21:22 UTC (permalink / raw) To: Johannes Weiner Cc: Andres Freund, Rik van Riel, linux-mm, linux-kernel, kernel-team On Mon, 4 Apr 2016 13:13:37 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote: > Andres Freund observed that his database workload is struggling with > the transaction journal creating pressure on frequently read pages. > > Access patterns like transaction journals frequently write the same > pages over and over, but in the majority of cases those pages are > never read back. There are no caching benefits to be had for those > pages, so activating them and having them put pressure on pages that > do benefit from caching is a bad choice. Read-after-write is a pretty common pattern: temporary files for example. What are the opportunities for regressions here? Did you consider providing userspace with a way to hint "this file is probably write-then-not-read"? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 2/3] mm: filemap: only do access activations on reads 2016-04-04 21:22 ` Andrew Morton @ 2016-04-04 21:39 ` Rik van Riel 2016-04-04 21:55 ` Andrew Morton 2016-04-05 17:50 ` Johannes Weiner 2016-04-04 22:47 ` Johannes Weiner 1 sibling, 2 replies; 10+ messages in thread From: Rik van Riel @ 2016-04-04 21:39 UTC (permalink / raw) To: Andrew Morton, Johannes Weiner Cc: Andres Freund, linux-mm, linux-kernel, kernel-team [-- Attachment #1: Type: text/plain, Size: 1546 bytes --] On Mon, 2016-04-04 at 14:22 -0700, Andrew Morton wrote: > On Mon, 4 Apr 2016 13:13:37 -0400 Johannes Weiner <hannes@cmpxchg.or > g> wrote: > > > > > Andres Freund observed that his database workload is struggling > > with > > the transaction journal creating pressure on frequently read pages. > > > > Access patterns like transaction journals frequently write the same > > pages over and over, but in the majority of cases those pages are > > never read back. There are no caching benefits to be had for those > > pages, so activating them and having them put pressure on pages > > that > > do benefit from caching is a bad choice. > Read-after-write is a pretty common pattern: temporary files for > example. What are the opportunities for regressions here? > > Did you consider providing userspace with a way to hint "this file is > probably write-then-not-read"? I suspect the opportunity for regressions is fairly small, considering that temporary files usually have a very short life span, and will likely be read-after-written before they get evicted from the inactive list. As for hinting, I suspect it may make sense to differentiate between whole page and partial page writes, where partial page writes use FGP_ACCESSED, and whole page writes do not, under the assumption that if we write a partial page, there may be a higher chance that other parts of the page get accessed again for other writes (or reads). I do not know whether that assumption holds :) -- All Rights Reversed. [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 473 bytes --] ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 2/3] mm: filemap: only do access activations on reads 2016-04-04 21:39 ` Rik van Riel @ 2016-04-04 21:55 ` Andrew Morton 2016-04-05 17:50 ` Johannes Weiner 1 sibling, 0 replies; 10+ messages in thread From: Andrew Morton @ 2016-04-04 21:55 UTC (permalink / raw) To: Rik van Riel Cc: Johannes Weiner, Andres Freund, linux-mm, linux-kernel, kernel-team On Mon, 04 Apr 2016 17:39:47 -0400 Rik van Riel <riel@redhat.com> wrote: > On Mon, 2016-04-04 at 14:22 -0700, Andrew Morton wrote: > > On Mon,____4 Apr 2016 13:13:37 -0400 Johannes Weiner <hannes@cmpxchg.or > > g> wrote: > > > > > > > > Andres Freund observed that his database workload is struggling > > > with > > > the transaction journal creating pressure on frequently read pages. > > > > > > Access patterns like transaction journals frequently write the same > > > pages over and over, but in the majority of cases those pages are > > > never read back. There are no caching benefits to be had for those > > > pages, so activating them and having them put pressure on pages > > > that > > > do benefit from caching is a bad choice. > > Read-after-write is a pretty common pattern: temporary files for > > example.____What are the opportunities for regressions here? > > > > Did you consider providing userspace with a way to hint "this file is > > probably write-then-not-read"? > > I suspect the opportunity for regressions is fairly small, > considering that temporary files usually have a very short > life span, and will likely be read-after-written before they > get evicted from the inactive list. The opportunity for regressions in the current code is fairly small, but Andres found one :( If there's any possibility at all, someone will hit it. One possible way to move forward is to write testcases to deliberately hit the predicted problem, gain an understanding of how hard it is to hit, how bad the effects are. > As for hinting, I suspect it may make sense to differentiate > between whole page and partial page writes, where partial > page writes use FGP_ACCESSED, and whole page writes do not, > under the assumption that if we write a partial page, there > may be a higher chance that other parts of the page get > accessed again for other writes (or reads). hm, the FGP_foo documentation is a mess. There's some placed randomly at pagecache_get_page() and FGP_WRITE got missed altogether. The ext4 journal would be a decent (but not very significant) candidate for a "this is never read from" interface. I guess the fs could manually deactivate (or even free?) the pages. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 2/3] mm: filemap: only do access activations on reads 2016-04-04 21:39 ` Rik van Riel 2016-04-04 21:55 ` Andrew Morton @ 2016-04-05 17:50 ` Johannes Weiner 1 sibling, 0 replies; 10+ messages in thread From: Johannes Weiner @ 2016-04-05 17:50 UTC (permalink / raw) To: Rik van Riel Cc: Andrew Morton, Andres Freund, linux-mm, linux-kernel, kernel-team On Mon, Apr 04, 2016 at 05:39:47PM -0400, Rik van Riel wrote: > As for hinting, I suspect it may make sense to differentiate > between whole page and partial page writes, where partial > page writes use FGP_ACCESSED, and whole page writes do not, > under the assumption that if we write a partial page, there > may be a higher chance that other parts of the page get > accessed again for other writes (or reads). The writeback cache should handle at least the multiple subpage writes case. What I find a little weird about counting accesses from partial writes only is when a write covers a full page and then parts of the next. We would cache only a small piece of what's likely one coherent chunk. Or when a user writes out several pages in a loop of subpage chunks. This will get even worse to program against when we start having page cache transparently backed by pages of different sizes. Because of that I think it'd be better to apply LRU aging decisions based on type of access rather based on specific request sizes. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 2/3] mm: filemap: only do access activations on reads 2016-04-04 21:22 ` Andrew Morton 2016-04-04 21:39 ` Rik van Riel @ 2016-04-04 22:47 ` Johannes Weiner 1 sibling, 0 replies; 10+ messages in thread From: Johannes Weiner @ 2016-04-04 22:47 UTC (permalink / raw) To: Andrew Morton Cc: Andres Freund, Rik van Riel, linux-mm, linux-kernel, kernel-team On Mon, Apr 04, 2016 at 02:22:33PM -0700, Andrew Morton wrote: > On Mon, 4 Apr 2016 13:13:37 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote: > > > Andres Freund observed that his database workload is struggling with > > the transaction journal creating pressure on frequently read pages. > > > > Access patterns like transaction journals frequently write the same > > pages over and over, but in the majority of cases those pages are > > never read back. There are no caching benefits to be had for those > > pages, so activating them and having them put pressure on pages that > > do benefit from caching is a bad choice. > > Read-after-write is a pretty common pattern: temporary files for > example. What are the opportunities for regressions here? The read(s) following the write will call mark_page_accessed() and so promote the pages if their data is in fact repeatedly accessed. That makes sense, because the writes really don't say anything about the cache-worthiness. One write followed by one read shouldn't mean the data is strongly benefiting from being cached. Only multiple reads. What complicates that a little bit is that when the multiple reads do happen on write-instantiated pages, the pages might have already been aged somewhat in between, whereas fresh-faulting reads start counting accesses from the head of the LRU right away. If both have re-use distances shorter than memory, the LRU offset of pages instantiated by writes could push the second access past eviction. In that case, they would likely get picked up by refault detection and promoted after all. So it would be one more IO, but nothing permanent. This is also somewhat compensated by the dirty cache delaying reclaim and giving these pages another round-trip anyway - unless dirty limits cause the pages to be written back before they reach the LRU tail. It's really hard to tell whether that would even be an issue since it depends on whether a workload matching those parameters even exist. A synthetic test doesn't really say us much about that. I think all we can do here is decide whether the cache semantics make logical sense. One thing I proposed in the thread that would compensate for the LRU offset of write-instantiated pages would be to set PageReferenced on these pages but never call mark_page_accessed() from the write. This wouldn't be perfect because the distance between write and read does not necessarily predict the distance between the subsequent reads, but it would mean that the first read would promote the pages, whereas repeatedly written files would never be activated or refault-activate. Would that make sense? Is there something I'm missing? > Did you consider providing userspace with a way to hint "this file is > probably write-then-not-read"? Yes, but I'm not too confident in that working out :( -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 10+ messages in thread
* [PATCH 3/3] mm: vmscan: reduce size of inactive file list 2016-04-04 17:13 [PATCH 0/3] mm: support bigger cache workingsets and protect against writes Johannes Weiner 2016-04-04 17:13 ` [PATCH 1/3] mm: workingset: only do workingset activations on reads Johannes Weiner 2016-04-04 17:13 ` [PATCH 2/3] mm: filemap: only do access " Johannes Weiner @ 2016-04-04 17:13 ` Johannes Weiner 2016-04-04 18:52 ` [PATCH 0/3] mm: support bigger cache workingsets and protect against writes Andres Freund 3 siblings, 0 replies; 10+ messages in thread From: Johannes Weiner @ 2016-04-04 17:13 UTC (permalink / raw) To: Andres Freund, Rik van Riel, Andrew Morton Cc: linux-mm, linux-kernel, kernel-team From: Rik van Riel <riel@redhat.com> The inactive file list should still be large enough to contain readahead windows and freshly written file data, but it no longer is the only source for detecting multiple accesses to file pages. The workingset refault measurement code causes recently evicted file pages that get accessed again after a shorter interval to be promoted directly to the active list. With that mechanism in place, we can afford to (on a larger system) dedicate more memory to the active file list, so we can actually cache more of the frequently used file pages in memory, and not have them pushed out by streaming writes, once-used streaming file reads, etc. This can help things like database workloads, where only half the page cache can currently be used to cache the database working set. This patch automatically increases that fraction on larger systems, using the same ratio that has already been used for anonymous memory. Reported-by: Andres Freund <andres@anarazel.de> Signed-off-by: Rik van Riel <riel@redhat.com> [hannes@cmpxchg.org: cgroup-awareness] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> --- include/linux/memcontrol.h | 25 ----------- mm/page_alloc.c | 44 ------------------- mm/vmscan.c | 104 ++++++++++++++++++--------------------------- 3 files changed, 42 insertions(+), 131 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 1191d79..3694f88 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -415,25 +415,6 @@ unsigned long mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list lru) return mz->lru_size[lru]; } -static inline bool mem_cgroup_inactive_anon_is_low(struct lruvec *lruvec) -{ - unsigned long inactive_ratio; - unsigned long inactive; - unsigned long active; - unsigned long gb; - - inactive = mem_cgroup_get_lru_size(lruvec, LRU_INACTIVE_ANON); - active = mem_cgroup_get_lru_size(lruvec, LRU_ACTIVE_ANON); - - gb = (inactive + active) >> (30 - PAGE_SHIFT); - if (gb) - inactive_ratio = int_sqrt(10 * gb); - else - inactive_ratio = 1; - - return inactive * inactive_ratio < active; -} - void mem_cgroup_handle_over_high(void); void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, @@ -646,12 +627,6 @@ static inline bool mem_cgroup_online(struct mem_cgroup *memcg) return true; } -static inline bool -mem_cgroup_inactive_anon_is_low(struct lruvec *lruvec) -{ - return true; -} - static inline unsigned long mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list lru) { diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 59de90d..67db15d 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -6395,49 +6395,6 @@ void setup_per_zone_wmarks(void) } /* - * The inactive anon list should be small enough that the VM never has to - * do too much work, but large enough that each inactive page has a chance - * to be referenced again before it is swapped out. - * - * The inactive_anon ratio is the target ratio of ACTIVE_ANON to - * INACTIVE_ANON pages on this zone's LRU, maintained by the - * pageout code. A zone->inactive_ratio of 3 means 3:1 or 25% of - * the anonymous pages are kept on the inactive list. - * - * total target max - * memory ratio inactive anon - * ------------------------------------- - * 10MB 1 5MB - * 100MB 1 50MB - * 1GB 3 250MB - * 10GB 10 0.9GB - * 100GB 31 3GB - * 1TB 101 10GB - * 10TB 320 32GB - */ -static void __meminit calculate_zone_inactive_ratio(struct zone *zone) -{ - unsigned int gb, ratio; - - /* Zone size in gigabytes */ - gb = zone->managed_pages >> (30 - PAGE_SHIFT); - if (gb) - ratio = int_sqrt(10 * gb); - else - ratio = 1; - - zone->inactive_ratio = ratio; -} - -static void __meminit setup_per_zone_inactive_ratio(void) -{ - struct zone *zone; - - for_each_zone(zone) - calculate_zone_inactive_ratio(zone); -} - -/* * Initialise min_free_kbytes. * * For small machines we want it small (128k min). For large machines @@ -6482,7 +6439,6 @@ int __meminit init_per_zone_wmark_min(void) setup_per_zone_wmarks(); refresh_zone_stat_thresholds(); setup_per_zone_lowmem_reserve(); - setup_per_zone_inactive_ratio(); return 0; } module_init(init_per_zone_wmark_min) diff --git a/mm/vmscan.c b/mm/vmscan.c index b934223e..6f4f18c 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1865,83 +1865,63 @@ static void shrink_active_list(unsigned long nr_to_scan, free_hot_cold_page_list(&l_hold, true); } -#ifdef CONFIG_SWAP -static bool inactive_anon_is_low_global(struct zone *zone) -{ - unsigned long active, inactive; - - active = zone_page_state(zone, NR_ACTIVE_ANON); - inactive = zone_page_state(zone, NR_INACTIVE_ANON); - - return inactive * zone->inactive_ratio < active; -} - -/** - * inactive_anon_is_low - check if anonymous pages need to be deactivated - * @lruvec: LRU vector to check +/* + * The inactive anon list should be small enough that the VM never has + * to do too much work. * - * Returns true if the zone does not have enough inactive anon pages, - * meaning some active anon pages need to be deactivated. - */ -static bool inactive_anon_is_low(struct lruvec *lruvec) -{ - /* - * If we don't have swap space, anonymous page deactivation - * is pointless. - */ - if (!total_swap_pages) - return false; - - if (!mem_cgroup_disabled()) - return mem_cgroup_inactive_anon_is_low(lruvec); - - return inactive_anon_is_low_global(lruvec_zone(lruvec)); -} -#else -static inline bool inactive_anon_is_low(struct lruvec *lruvec) -{ - return false; -} -#endif - -/** - * inactive_file_is_low - check if file pages need to be deactivated - * @lruvec: LRU vector to check + * The inactive file list should be small enough to leave most memory + * to the established workingset on the scan-resistant active list, + * but large enough to avoid thrashing the aggregate readahead window. * - * When the system is doing streaming IO, memory pressure here - * ensures that active file pages get deactivated, until more - * than half of the file pages are on the inactive list. + * Both inactive lists should also be large enough that each inactive + * page has a chance to be referenced again before it is reclaimed. * - * Once we get to that situation, protect the system's working - * set from being evicted by disabling active file page aging. + * The inactive_ratio is the target ratio of ACTIVE to INACTIVE pages + * on this LRU, maintained by the pageout code. A zone->inactive_ratio + * of 3 means 3:1 or 25% of the pages are kept on the inactive list. * - * This uses a different ratio than the anonymous pages, because - * the page cache uses a use-once replacement algorithm. + * total target max + * memory ratio inactive + * ------------------------------------- + * 10MB 1 5MB + * 100MB 1 50MB + * 1GB 3 250MB + * 10GB 10 0.9GB + * 100GB 31 3GB + * 1TB 101 10GB + * 10TB 320 32GB */ -static bool inactive_file_is_low(struct lruvec *lruvec) +static bool inactive_list_is_low(struct lruvec *lruvec, bool file) { + unsigned long inactive_ratio; unsigned long inactive; unsigned long active; + unsigned long gb; - inactive = lruvec_lru_size(lruvec, LRU_INACTIVE_FILE); - active = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE); + /* + * If we don't have swap space, anonymous page deactivation + * is pointless. + */ + if (!file && !total_swap_pages) + return false; - return active > inactive; -} + inactive = lruvec_lru_size(lruvec, file * LRU_FILE); + active = lruvec_lru_size(lruvec, file * LRU_FILE + LRU_ACTIVE); -static bool inactive_list_is_low(struct lruvec *lruvec, enum lru_list lru) -{ - if (is_file_lru(lru)) - return inactive_file_is_low(lruvec); + gb = (inactive + active) >> (30 - PAGE_SHIFT); + if (gb) + inactive_ratio = int_sqrt(10 * gb); else - return inactive_anon_is_low(lruvec); + inactive_ratio = 1; + + return inactive * inactive_ratio < active; } static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan, struct lruvec *lruvec, struct scan_control *sc) { if (is_active_lru(lru)) { - if (inactive_list_is_low(lruvec, lru)) + if (inactive_list_is_low(lruvec, is_file_lru(lru))) shrink_active_list(nr_to_scan, lruvec, sc, lru); return 0; } @@ -2062,7 +2042,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg, * lruvec even if it has plenty of old anonymous pages unless the * system is under heavy pressure. */ - if (!inactive_file_is_low(lruvec) && + if (!inactive_list_is_low(lruvec, true) && lruvec_lru_size(lruvec, LRU_INACTIVE_FILE) >> sc->priority) { scan_balance = SCAN_FILE; goto out; @@ -2304,7 +2284,7 @@ static void shrink_zone_memcg(struct zone *zone, struct mem_cgroup *memcg, * Even if we did not try to evict anon pages at all, we want to * rebalance the anon lru active/inactive ratio. */ - if (inactive_anon_is_low(lruvec)) + if (inactive_list_is_low(lruvec, false)) shrink_active_list(SWAP_CLUSTER_MAX, lruvec, sc, LRU_ACTIVE_ANON); @@ -2965,7 +2945,7 @@ static void age_active_anon(struct zone *zone, struct scan_control *sc) do { struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg); - if (inactive_anon_is_low(lruvec)) + if (inactive_list_is_low(lruvec, false)) shrink_active_list(SWAP_CLUSTER_MAX, lruvec, sc, LRU_ACTIVE_ANON); -- 2.8.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH 0/3] mm: support bigger cache workingsets and protect against writes 2016-04-04 17:13 [PATCH 0/3] mm: support bigger cache workingsets and protect against writes Johannes Weiner ` (2 preceding siblings ...) 2016-04-04 17:13 ` [PATCH 3/3] mm: vmscan: reduce size of inactive file list Johannes Weiner @ 2016-04-04 18:52 ` Andres Freund 3 siblings, 0 replies; 10+ messages in thread From: Andres Freund @ 2016-04-04 18:52 UTC (permalink / raw) To: Johannes Weiner Cc: Rik van Riel, Andrew Morton, linux-mm, linux-kernel, kernel-team Hi Johannes, On 2016-04-04 13:13:35 -0400, Johannes Weiner wrote: > this is a follow-up to http://www.spinics.net/lists/linux-mm/msg101739.html > where Andres reported his database workingset being pushed out by the > minimum size enforcement of the inactive file list - currently 50% of cache > - as well as repeatedly written file pages that are never actually read. Thanks for following up! > Andres, I tried reproducing your postgres scenario, but I could never get > the WAL to interfere even with wal_log = hot_standby mode. It's a 8G > machine, I set shared_buffers = 2GB, ran pgbench -i -s 290, and then -c 32 > -j 32 -M prepared -t 150000. Any input on how to trigger the thrashing you > observed would be appreciated. But it would be great if you could test these > patches on your known-problematic setup as well. I'm unfortunately in the process of moving to the US (as in, I'm packing boxes), so I can't get back to you just now. I'll try ASAP (early next week). Regards, Andres -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2016-04-05 17:51 UTC | newest] Thread overview: 10+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2016-04-04 17:13 [PATCH 0/3] mm: support bigger cache workingsets and protect against writes Johannes Weiner 2016-04-04 17:13 ` [PATCH 1/3] mm: workingset: only do workingset activations on reads Johannes Weiner 2016-04-04 17:13 ` [PATCH 2/3] mm: filemap: only do access " Johannes Weiner 2016-04-04 21:22 ` Andrew Morton 2016-04-04 21:39 ` Rik van Riel 2016-04-04 21:55 ` Andrew Morton 2016-04-05 17:50 ` Johannes Weiner 2016-04-04 22:47 ` Johannes Weiner 2016-04-04 17:13 ` [PATCH 3/3] mm: vmscan: reduce size of inactive file list Johannes Weiner 2016-04-04 18:52 ` [PATCH 0/3] mm: support bigger cache workingsets and protect against writes Andres Freund
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).