Re: [PATCH RFC] mm: lru milestones, timestamps and ages

From: Zlatko Calusic <zcalusic@bitsync.net>
To: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: linux-mm@kvack.org
Subject: Re: [PATCH RFC] mm: lru milestones, timestamps and ages
Date: Sat, 04 May 2013 23:36:32 +0200
Message-ID: <51857F60.8000209@bitsync.net>
In-Reply-To: <5185069A.1080306@openvz.org>

[-- Attachment #1: Type: text/plain, Size: 6155 bytes --]

On 04.05.2013 15:01, Konstantin Khlebnikov wrote:
> Konstantin Khlebnikov wrote:
>> Zlatko Calusic wrote:
>>> On 30.04.2013 13:02, Konstantin Khlebnikov wrote:
>>>> This patch adds an engine for estimating the rotation time of pages in
>>>> lru lists.
>>>>
>>>> It adds a bunch of 'milestones' to each struct lruvec and inserts them
>>>> into the lru lists periodically. A milestone flows through the lru together
>>>> with the pages and brings a timestamp to the end of the lru. Because
>>>> milestones are embedded into the lruvec, they can be easily distinguished
>>>> from pages by comparing pointers. Only a few functions should care about
>>>> that.
>>>>
>>>> This machinery provides a discrete-time estimate of the age of pages at
>>>> the end of each lru, and the average age of each kind of evictable lru in
>>>> each zone.
>>>
>>> Great stuff!
>>
>> Thanks!
>>
>>>
>>> Believe it or not, I had an idea of writing something similar to this, but
>>> of course having an idea and actually implementing it are two very different
>>> things. Thank you for your work!
>>>
>>> I will use this to prove (or not) that file pages in the normal zone on a
>>> 4GB RAM machine are reused waaaay too soon. Actually, I already have the
>>> patch applied and running on the desktop, but it should be much more useful
>>> on server workloads. Desktops have erratic load and can go for a long time
>>> with very little I/O activity. But, here are the current numbers anyway:
>>>
>>> Node 0, zone DMA32
>>> pages free 5371
>>> nr_inactive_anon 4257
>>> nr_active_anon 139719
>>> nr_inactive_file 617537
>>> nr_active_file 51671
>>> inactive_ratio: 5
>>> avg_age_inactive_anon: 2514752
>>> avg_age_active_anon: 2514752
>>> avg_age_inactive_file: 876416
>>> avg_age_active_file: 2514752
>>> Node 0, zone Normal
>>> pages free 424
>>> nr_inactive_anon 253
>>> nr_active_anon 54480
>>> nr_inactive_file 63274
>>> nr_active_file 44116
>>> inactive_ratio: 1
>>> avg_age_inactive_anon: 2531712
>>> avg_age_active_anon: 2531712
>>> avg_age_inactive_file: 901120
>>> avg_age_active_file: 2531712
>>>
>>>> In our kernel we use a similar engine as a source of statistics for the
>>>> scheduler in the memory reclaimer. This is an O(1) scheduler which shifts
>>>> vmscan priorities for lru vectors depending on their sizes, limits and
>>>> ages. It tries to balance memory pressure among containers. I'll try to
>>>> rework it for the mainline kernel soon.
>>>>
>>>> It seems these ages can also be used for optimal distribution of memory
>>>> pressure between file and anon pages, and probably for balancing pressure
>>>> among zones.
>>>
>>> This all sounds very promising. Especially because I currently
>>> observe quite some imbalance among zones.
>>
>> As I see it, the most likely reason for such imbalances is the 'break'
>> condition inside shrink_lruvec(). So you can try disabling it and see what
>> happens.
>>
>> But these numbers from your desktop don't actually prove this problem. It
>> seems the difference between zones is within the precision of this method.
>> I don't know how to describe this precisely. Probably the irregularity
>> between milestones should also be taken into account to describe the
>> current situation and the quality of the measurement.
>>
>> Here are the current numbers from my 8GB node. The main workload is a
>> torrent client.
>>
>> Node 0, zone DMA32
>> nr_inactive_anon 1
>> nr_active_anon 1494
>> nr_inactive_file 404028
>> nr_active_file 365525
>> nr_dirtied 855068
>> nr_written 854991
>> avg_age_inactive_anon: 64942528
>> avg_age_active_anon: 64942528
>> avg_age_inactive_file: 1281317
>> avg_age_active_file: 15813376
>> Node 0, zone Normal
>> nr_inactive_anon 376
>> nr_active_anon 13793
>> nr_inactive_file 542605
>> nr_active_file 542247
>> nr_dirtied 2746747
>> nr_written 2746266
>> avg_age_inactive_anon: 65064192
>> avg_age_active_anon: 65064192
>> avg_age_inactive_file: 1260611
>> avg_age_active_file: 8765240
>>
>> So here there is a noticeable imbalance in the ages of the active file lru
>> and in nr_dirtied/nr_written. I have no idea why, but the torrent client
>> uses the fadvise() syscall, which messes up the whole picture.
>
> Hey! I can reproduce this:
>
> Node 0, zone    DMA32
>      nr_inactive_anon 1
>      nr_active_anon 2368
>      nr_inactive_file 373642
>      nr_active_file 375462
>      nr_dirtied   2887369
>      nr_written   2887291
>    inactive_ratio:    5
>    avg_age_inactive_anon: 64942528
>    avg_age_active_anon:   64942528
>    avg_age_inactive_file: 389824
>    avg_age_active_file:   1330368
> Node 0, zone   Normal
>      nr_inactive_anon 376
>      nr_active_anon 17768
>      nr_inactive_file 534695
>      nr_active_file 533685
>      nr_dirtied   12071397
>      nr_written   11940007
>    inactive_ratio:    6
>    avg_age_inactive_anon: 65064192
>    avg_age_active_anon:   65064192
>    avg_age_inactive_file: 28074
>    avg_age_active_file:   1304800
>
> I'm just copying huge files from one disk to another with rsync.
>
> In /proc/vmstat, pgsteal_kswapd_normal and pgscan_kswapd_normal are rising
> rapidly, while the other pgscan_* and pgsteal_* counters are standing still.
> So the bug is somewhere in kswapd.
>

Not necessarily, because processes also do direct reclaim. Also, if you
had continued the copying, I bet you would have seen the DMA32 zone get
to play as well, just a bit later.
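
One way to check how much of the reclaim comes from kswapd and how much
from direct reclaim is to diff two snapshots of /proc/vmstat per zone.
Below is a minimal Python sketch of that. The kswapd counter names are
the ones quoted above; the pgscan_direct_*/pgsteal_direct_* names and the
list of zone suffixes are my assumptions about this kernel's
/proc/vmstat, and the function names are made up for illustration.

```python
import re

# Per-zone reclaim counters, e.g. "pgscan_kswapd_normal 1234".
# The kswapd names come from the thread; the "direct" names are assumed
# to follow the same pattern on this kernel.
COUNTER_RE = re.compile(r"^(pgscan|pgsteal)_(kswapd|direct)_(\w+)\s+(\d+)$")

# Assumed zone suffixes; anything else (e.g. a non-zone counter that
# happens to share the prefix) is ignored.
KNOWN_ZONES = {"dma", "dma32", "normal", "high", "movable"}


def parse_reclaim_counters(vmstat_text):
    """Return {(kind, source, zone): value} for the reclaim counters."""
    counters = {}
    for line in vmstat_text.splitlines():
        m = COUNTER_RE.match(line.strip())
        if m and m.group(3) in KNOWN_ZONES:
            kind, source, zone, value = m.groups()
            counters[(kind, source, zone)] = int(value)
    return counters


def reclaim_delta(before_text, after_text):
    """Difference of the reclaim counters between two snapshots."""
    before = parse_reclaim_counters(before_text)
    after = parse_reclaim_counters(after_text)
    return {key: after[key] - before.get(key, 0) for key in after}
```

Read /proc/vmstat once, let the copy run for a while, read it again and
pass both texts to reclaim_delta(). Keys whose delta stays at zero are
the ones "standing still".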

I can now see that effect nicely on the graphs I prepared. Attached is
one from the desktop. Where the red line suddenly drops, I copied a 2GB
file from the network to the machine. Half an hour later I copied
another 1.6GB file. That's when the blue line dropped. It all makes
sense, though: about 3GB of I/O was needed to expunge all old inactive
pages from both zones, so the first 2GB wasn't enough to push the old
pages out of the DMA32 zone.
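
As a rough sanity check of that "about 3GB" figure, here is a small
Python sketch that converts the nr_inactive_file counts from my desktop
snapshot quoted earlier into GiB. It assumes a 4096-byte page size, and
it assumes that every freshly read page pushes out exactly one old
inactive file page, which ignores free memory, page activation and the
order in which zones are used. The snapshot is also older than the graph,
so treat the result as a ballpark only.

```python
PAGE_SIZE = 4096  # bytes; assumed page size

# nr_inactive_file values from the desktop snapshot quoted earlier
INACTIVE_FILE_PAGES = {"DMA32": 617537, "Normal": 63274}


def pages_to_gib(pages, page_size=PAGE_SIZE):
    """Convert a page count to GiB."""
    return pages * page_size / 2**30


def flushes_all_inactive(io_gib, inactive_pages=INACTIVE_FILE_PAGES):
    """True if io_gib of newly read data is at least as large as all
    inactive file pages together (one page in, one page out)."""
    total = sum(inactive_pages.values())
    return io_gib >= pages_to_gib(total)


total_gib = pages_to_gib(sum(INACTIVE_FILE_PAGES.values()))
print(f"inactive file pages in both zones: {total_gib:.2f} GiB")
print("2.0 GiB of I/O enough:", flushes_all_inactive(2.0))
print("3.6 GiB of I/O enough:", flushes_all_inactive(3.6))
```

Under those assumptions a single 2GB copy falls short of the inactive
file total, while the two copies together exceed it.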

I'm of the opinion that your instrumentation will be of use only when
there's constant reclaim goin' on. Otherwise pages stay in memory for
a long, long time, and then it doesn't matter much if it's one hour or
two hours before some of them are reclaimed. For the same reason I will
cap graphs like these at some useful value, so as to get precision for
the important time periods when the reclaim is really active.
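
The capping itself is trivial. A minimal Python sketch, where the cap
value is an arbitrary placeholder and not something I have settled on,
and the sample values are avg_age_* numbers taken from the outputs
quoted above:

```python
def clamp_ages(samples, cap):
    """Limit every age sample to at most cap, so that long idle periods
    don't stretch the graph's vertical scale."""
    return [min(age, cap) for age in samples]


# Placeholder cap, in the same (unspecified here) unit as avg_age_*.
CAP = 1_000_000

ages = [876416, 2514752, 28074]
print(clamp_ages(ages, CAP))
```
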
-- 
Zlatko

[-- Attachment #2: memage-hourly.png --]
[-- Type: image/png, Size: 11890 bytes --]

Thread overview: 9+ messages
2013-04-30 11:02 [PATCH RFC] mm: lru milestones, timestamps and ages Konstantin Khlebnikov
2013-05-03 14:07 ` Zlatko Calusic
2013-05-04 11:53   ` Konstantin Khlebnikov
2013-05-04 13:01     ` Konstantin Khlebnikov
2013-05-04 21:36       ` Zlatko Calusic [this message]
2013-05-06 19:08       ` Johannes Weiner
2013-05-04 13:32     ` Zlatko Calusic
2013-05-10 10:28 ` Mel Gorman
2013-05-10 14:12   ` Konstantin Khlebnikov
