linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Andrew Morton <akpm@linux-foundation.org>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andi Kleen <andi@firstfloor.org>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Greg Thelen <gthelen@google.com>,
	Christoph Hellwig <hch@infradead.org>,
	Hugh Dickins <hughd@google.com>, Jan Kara <jack@suse.cz>,
	KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
	Mel Gorman <mgorman@suse.de>, Minchan Kim <minchan.kim@gmail.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Rik van Riel <riel@redhat.com>,
	Michel Lespinasse <walken@google.com>,
	Seth Jennings <sjenning@linux.vnet.ibm.com>,
	Roman Gushchin <klamm@yandex-team.ru>,
	Ozgun Erdogan <ozgun@citusdata.com>,
	Metin Doslu <metin@citusdata.com>,
	Vlastimil Babka <vbabka@suse.cz>,
	linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: [patch 8/9] mm: thrash detection-based file cache sizing
Date: Tue, 20 Aug 2013 13:59:20 -0700	[thread overview]
Message-ID: <20130820135920.b8b7ea0eb2471dfa0034b175@linux-foundation.org> (raw)
In-Reply-To: <1376767883-4411-9-git-send-email-hannes@cmpxchg.org>

On Sat, 17 Aug 2013 15:31:22 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:

> The VM maintains cached filesystem pages on two types of lists.  One
> list holds the pages recently faulted into the cache, the other list
> holds pages that have been referenced repeatedly on that first list.
> The idea is to prefer reclaiming young pages over those that have
> shown to benefit from caching in the past.  We call the recently used
> list "inactive list" and the frequently used list "active list".
> 
> The tricky part of this model is finding the right balance between
> them.  A big inactive list may not leave enough room for the active
> list to protect all the frequently used pages.  A big active list may
> not leave enough room for the inactive list for a new set of
> frequently used pages, "working set", to establish itself because the
> young pages get pushed out of memory before having a chance to get
> promoted.
> 
> Historically, every reclaim scan of the inactive list also took a
> smaller number of pages from the tail of the active list and moved
> them to the head of the inactive list.  This model gave established
> working sets more gracetime in the face of temporary use once streams,
> but was not satisfactory when use once streaming persisted over longer
> periods of time and the established working set was temporarily
> suspended, like a nightly backup evicting all the interactive user
> program data.
> 
> Subsequently, the rules were changed to only age active pages when
> they exceeded the amount of inactive pages, i.e. leave the working set
> alone as long as the other half of memory is easy to reclaim use once
> pages.  This works well until working set transitions exceed the size
> of half of memory and the average access distance between the pages of
> the new working set is bigger than the inactive list.  The VM will
> mistake the thrashing new working set for use once streaming, while
> the unused old working set pages are stuck on the active list.
> 
> This patch solves this problem by maintaining a history of recently
> evicted file pages, which in turn allows the VM to tell used-once page
> streams from thrashing file cache.
> 
> To accomplish this, a per-zone counter is increased every time a page
> is evicted and a snapshot of that counter is stored as shadow entry in
> the page's now empty page cache radix tree slot.  Upon refault of that
> page, the difference between the current value of that counter and the
> shadow entry value is called the refault distance.  It tells how many
> pages have been evicted from the zone since that page's eviction,
> which is how many page slots at most are missing from the zone's
> inactive list for this page to get accessed twice while in memory.  If
> the number of missing slots is less than or equal to the number of
> active pages, increasing the inactive list at the cost of the active
> list would give this thrashing set a chance to establish itself:
> 
> eviction counter = 4
>                         evicted      inactive           active
>  Page cache data:       [ a b c d ]  [ e f g h i j k ]  [ l m n ]
>   Shadow entries:         0 1 2 3
> Refault distance:         4 3 2 1
> 
> When c is faulted back into memory, it is noted that at most two more
> page slots on the inactive list could have prevented the refault (it
> could be less if c is used out of order).  Thus, the active list needs
> to be challenged as it is possible that c is used more frequently than
> l, m, n.  However, there is no access frequency information available
> on active pages so the pages have to be put in direct competition with
> each other before deciding which one to keep.  Thus, 1) pages can not
> be directly reclaimed from the tail of the active list and b)
> refaulting pages can not be directly activated.  Instead, active pages
> are moved from the tail of the active list to the head of the inactive
> list and placed directly next to the refaulting pages.  This way, they
> both have the same time on the inactive list to prove which page is
> actually used more frequently without incurring unnecessary major
> faults or diluting the active page set in case the previously active
> page is in fact the more frequently used one.
> 
> Also, since the refault of c could have been due to a spurious access,
> only one active page per qualifying refault is challenged.  This will
> keep the impact of outliers low but still detect if bigger groups of
> pages are refaulting.
> 
> ...
>
> +/*
> + *		Double CLOCK lists
> + *
> + * Per zone, two clock lists are maintained for file pages: the
> + * inactive and the active list.  Freshly faulted pages start out at
> + * the head of the inactive list and page reclaim scans pages from the
> + * tail.  Pages that are accessed multiple times on the inactive list
> + * are promoted to the active list, to protect them from reclaim,
> + * whereas active pages are demoted to the inactive list when the
> + * inactive list requires more space to detect repeatedly accessed
> + * pages in the current workload and prevent them from thrashing:
> + *
> + *   fault -----------------------+
> + *                                |
> + *              +-------------+   |            +-------------+
> + *   reclaim <- | inactive    | <-+-- demotion | active      | <--+
> + *              +-------------+                +-------------+    |
> + *                       |                                        |
> + *                       +----------- promotion ------------------+
> + *
> + *
> + *		Access frequency and refault distance
> + *
> + * A workload is thrashing when the distances between the first and
> + * second access of pages that are frequently used is bigger than the
> + * current inactive clock list size, as the pages get reclaimed before
> + * the second access would have promoted them instead:
> + *
> + *    Access #: 1 2 3 4 5 6 7 8 9
> + *     Page ID: x y b c d e f x y
> + *                  | inactive  |
> + *
> + * To prevent this workload from thrashing, a bigger inactive list is
> + * required.  And the only way the inactive list can grow on a full
> + * zone is by taking away space from the corresponding active list.
> + *
> + *      +-inactive--+-active------+
> + *  x y | b c d e f | G H I J K L |
> + *      +-----------+-------------+
> + *
> + * Not every refault should lead to growing the inactive list at the
> + * cost of the active list, however: if the access distances are
> + * bigger than available memory overall, there is little point in
> + * challenging the protected pages on the active list, as those
> + * refaulting pages will not fit completely into memory.
> + *
> + * It is prohibitively expensive to track the access frequency of
> + * in-core pages, but it is possible to track their refault distance,
> + * which is the number of page slots shrunk from the inactive list
> + * between a page's eviction and subsequent refault.  This indicates
> + * how many page slots are missing on the inactive list in order to
> + * prevent future thrashing of that page.  Thus, instead of comparing
> + * access frequency to total available memory, one can compare the
> + * refault distance to the inactive list's potential for growth: the
> + * size of the active list.
> + *
> + *
> + *		Rebalancing the lists
> + *
> + * Shrinking the active list has to be done carefully because the
> + * pages on it may have vastly different access frequencies compared
> + * to the pages on the inactive list.  Thus, pages are not reclaimed
> + * directly from the tail of the active list, but instead moved to the
> + * head of the inactive list.  This way, they are competing directly
> + * with the pages that challenged their protected status.  If they are
> + * unused, they will eventually be reclaimed, but if they are indeed
> + * used more frequently than the challenging inactive pages, they will
> + * be reactivated.  This allows the existing protected set to be
> + * challenged without incurring major faults in case of a mistake.
> + */

The consequences of a 32-bit wraparound of the refault distance still
concern me.  It's a rare occurrence and it is difficult to determine
what will happen.  An explicit design-level description here would be
useful.

> +static void *pack_shadow(unsigned long time, struct zone *zone)
> +{
> +	time = (time << NODES_SHIFT) | zone_to_nid(zone);
> +	time = (time << ZONES_SHIFT) | zone_idx(zone);
> +	time = (time << RADIX_TREE_EXCEPTIONAL_SHIFT);
> +
> +	return (void *)(time | RADIX_TREE_EXCEPTIONAL_ENTRY);
> +}

"time" is normally in jiffies ;)

Some description of the underlying units of workingset_time would be
helpful.

> 
> ...
>
> +/**
> + * workingset_eviction - note the eviction of a page from memory
> + * @mapping: address space the page was backing
> + * @page: the page being evicted
> + *
> + * Returns a shadow entry to be stored in @mapping->page_tree in place
> + * of the evicted @page so that a later refault can be detected.  Or
> + * %NULL when the eviction should not be remembered.
> + */

Describe the locking requirements here.  It's part of the interface. 
And it's the part the compiler can't check, so extra care is needed.

> +void *workingset_eviction(struct address_space *mapping, struct page *page)
> +{
> +	struct zone *zone = page_zone(page);
> +	unsigned long time;
> +
> +	time = atomic_long_inc_return(&zone->workingset_time);
> +
> +	/*
> +	 * Don't store shadows in an inode that is being reclaimed.
> +	 * This is not just an optizimation, inode reclaim needs to
> +	 * empty out the radix tree or the nodes are lost, so don't
> +	 * plant shadows behind its back.
> +	 */
> +	if (mapping_exiting(mapping))
> +		return NULL;
> +
> +	return pack_shadow(time, zone);
> +}
> +
> +/**
> + * workingset_refault - note the refault of a previously evicted page
> + * @shadow: shadow entry of the evicted page
> + *
> + * Calculates and evaluates the refault distance of the previously
> + * evicted page in the context of the zone it was allocated in.
> + *
> + * This primes page reclaim to rebalance the zone's file lists if
> + * necessary, so it must be called before a page frame for the
> + * refaulting page is allocated.
> + */
> +void workingset_refault(void *shadow)
> +{
> +	unsigned long refault_distance;

Is the "refault distance" described somewhere?  What relationship (if
any) does it have with workingset_time?

> +	struct zone *zone;
> +
> +	unpack_shadow(shadow, &zone, &refault_distance);
> +
> +	inc_zone_state(zone, WORKINGSET_REFAULT);
> +
> +	/*
> +	 * Protected pages should be challenged when the refault
> +	 * distance indicates that thrashing could be stopped by
> +	 * increasing the inactive list at the cost of the active
> +	 * list.
> +	 */
> +	if (refault_distance <= zone_page_state(zone, NR_ACTIVE_FILE)) {
> +		inc_zone_state(zone, WORKINGSET_STALE);
> +		zone->shrink_active++;
> +	}
> +}
> +EXPORT_SYMBOL(workingset_refault);
> +
> +/**
> + * workingset_activation - note a page activation
> + * @page: page that is being activated
> + */
> +void workingset_activation(struct page *page)
> +{
> +	struct zone *zone = page_zone(page);
> +
> +	/*
> +	 * The lists are rebalanced when the inactive list is observed
> +	 * to be too small for activations.  An activation means that
> +	 * the inactive list is now big enough again for at least one
> +	 * page, so back off further deactivation.
> +	 */
> +	atomic_long_inc(&zone->workingset_time);
> +	if (zone->shrink_active > 0)
> +		zone->shrink_active--;
> +}

Strange mixture of exports and non-exports.  I assume you went with
"enough to make it build".

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2013-08-20 20:59 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-08-17 19:31 [patch 9/9] mm: thrash detection-based file cache sizing v4 Johannes Weiner
2013-08-17 19:31 ` [patch 1/9] lib: radix-tree: radix_tree_delete_item() Johannes Weiner
2013-08-20 20:59   ` Andrew Morton
2013-08-17 19:31 ` [patch 2/9] mm: shmem: save one radix tree lookup when truncating swapped pages Johannes Weiner
2013-08-17 19:31 ` [patch 3/9] mm: filemap: move radix tree hole searching here Johannes Weiner
2013-08-17 19:31 ` [patch 4/9] mm + fs: prepare for non-page entries in page cache radix trees Johannes Weiner
2013-08-20 20:59   ` Andrew Morton
2013-08-22  7:20     ` Johannes Weiner
2013-08-17 19:31 ` [patch 5/9] mm + fs: store shadow entries in page cache Johannes Weiner
2013-08-20 20:59   ` Andrew Morton
2013-08-17 19:31 ` [patch 6/9] mm + fs: provide shadow pages to page cache allocations Johannes Weiner
2013-08-17 19:31 ` [patch 7/9] mm: make global_dirtyable_memory() available to other mm code Johannes Weiner
2013-08-17 19:31 ` [patch 8/9] mm: thrash detection-based file cache sizing Johannes Weiner
2013-08-20 20:59   ` Andrew Morton [this message]
2013-08-22  8:45     ` Johannes Weiner
2013-08-17 19:31 ` [patch 9/9] mm: workingset: keep shadow entries in check Johannes Weiner
2013-08-20 20:59   ` Andrew Morton
2013-08-22  9:48     ` Johannes Weiner
2013-08-20 21:04 ` [patch 9/9] mm: thrash detection-based file cache sizing v4 Andrew Morton
2013-08-22  9:08   ` Metin Doslu
  -- strict thread matches above, loose matches on Subject: below --
2013-08-06 22:44 [patch 0/9] mm: thrash detection-based file cache sizing v3 Johannes Weiner
2013-08-06 22:44 ` [patch 8/9] mm: thrash detection-based file cache sizing Johannes Weiner
2013-08-09 22:49   ` Andrew Morton
2013-08-12 16:00     ` Johannes Weiner
2013-08-11 21:57   ` Vlastimil Babka
2013-08-12 16:27     ` Johannes Weiner
2013-08-06 22:22 [patch 0/9] mm: thrash detection-based file cache sizing v3 Johannes Weiner
2013-08-06 22:22 ` [patch 8/9] mm: thrash detection-based file cache sizing Johannes Weiner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20130820135920.b8b7ea0eb2471dfa0034b175@linux-foundation.org \
    --to=akpm@linux-foundation.org \
    --cc=aarcange@redhat.com \
    --cc=andi@firstfloor.org \
    --cc=gthelen@google.com \
    --cc=hannes@cmpxchg.org \
    --cc=hch@infradead.org \
    --cc=hughd@google.com \
    --cc=jack@suse.cz \
    --cc=klamm@yandex-team.ru \
    --cc=kosaki.motohiro@jp.fujitsu.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=metin@citusdata.com \
    --cc=mgorman@suse.de \
    --cc=minchan.kim@gmail.com \
    --cc=ozgun@citusdata.com \
    --cc=peterz@infradead.org \
    --cc=riel@redhat.com \
    --cc=sjenning@linux.vnet.ibm.com \
    --cc=vbabka@suse.cz \
    --cc=walken@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).