inactive_dirty list - Andrew Morton

From: Andrew Morton <akpm@digeo.com>
To: Rik van Riel <riel@conectiva.com.br>
Cc: "linux-mm@kvack.org" <linux-mm@kvack.org>
Subject: inactive_dirty list
Date: Fri, 06 Sep 2002 13:42:06 -0700
Message-ID: <3D79131E.837F08B3@digeo.com>

Rik, it seems that the time has come...

I was doing some testing overnight with mem=1024m.  Page reclaim
was pretty inefficient at that level: kswapd consumed 6% of CPU
on a permanent basis (workload was heavy dbench plus looping
make -j6 bzImage).  kswapd was reclaiming only 3% of the pages
which it was looking at.

This doesn't happen at mem=768m, and I'm sure it won't happen at
mem=1.5G.

What is happening here is that the logic which clamps dirty+writeback
pagecache at 40% of memory is working nicely, and the allocate-from-
highmem-first logic is ensuring that all of ZONE_HIGHMEM is dirty
or under writeback all the time.  kswapd isn't allowed to block
against that pagecache, so it's scanning zillions of pages.

This is a fundamental problem when the size of the highmem zone is
approximately equal to 40% of total memory.

We could fix it by changing the page allocator to balance its
allocations across zones, but I don't think we want to do that.

I think it's best to split the inactive list into reclaimable
and unreclaimable.  (inactive_clean/inactive_dirty).

I'll code that tonight; please let me run some thoughts by you:

- inactive_dirty holds pages which are dirty or under writeback.

- end_page_writeback() will move the page onto inactive_clean.

- everywhere we currently add a page to the inactive list, we will
  now add it to either inactive_clean or inactive_dirty, based on
  its PageDirty || PageWriteback state.

- the inactive target logic will remain the same.  So
  zone->nr_inactive_pages will be the sum of the pages on
  zone->inactive_clean and zone->inactive_dirty.

- swapcache pages don't go on inactive_dirty(!).  They remain on
  inactive_clean, so if a page allocator or kswapd hits a swapcache
  page, they block on it (swapout throttling).

  A result of this is that we never need to scan inactive_dirty.
  Those pages will always be written out in balance_dirty_pages
  by the write(2) caller, or by pdflush.

  (Hence: we don't need inactive_dirty at all.  We could just cut
  those pages off the LRU altogether.  But let's not do that).

- Hence: the only pages which are written out from within the VM
  are swapcache.

- So the only real source of throttling for tasks which aren't
  running generic_file_write() is the call to blk_congestion_wait()
  in try_to_free_pages().  Which seems sane to me - this will wake
  up after 1/4 of a second, or after someone frees a write request
  against *any* queue.  We know that the pages which were covered
  by that request were just placed onto inactive_clean, so off
  we go again.  Should work (heh).

- with this scheme, we don't actually need zone->nr_inactive_dirty_pages
  and zone->nr_inactive_clean_pages, but I may as well do that - it's
  easy enough.

- MAP_SHARED pages will be on inactive_clean, but if we change the
  logic in there to give these pages a second round on the LRU then
  the pages will automatically be added to inactive_dirty on the
  way out of shrink_zone().
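
To make the placement rules above concrete, here is a minimal user-space
sketch.  It is not kernel code: struct page_model, struct zone_model and
the helpers pick_inactive_list(), add_to_inactive() and
model_end_page_writeback() are hypothetical names made up for
illustration.  Only the rules themselves come from the list above.  One
extra assumption, not stated above: a page which was redirtied while
under writeback goes back onto inactive_dirty.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model types; the real struct page/zone look nothing like this. */
enum lru_list { LRU_NONE, LRU_INACTIVE_CLEAN, LRU_INACTIVE_DIRTY };

struct page_model {
	bool dirty;          /* stands in for PageDirty */
	bool writeback;      /* stands in for PageWriteback */
	bool swapcache;      /* stands in for a swapcache page */
	enum lru_list list;  /* which inactive list the page is on */
};

struct zone_model {
	unsigned long nr_inactive_clean_pages;
	unsigned long nr_inactive_dirty_pages;
};

/* The inactive target logic keeps seeing the sum of both lists. */
static unsigned long nr_inactive_pages(const struct zone_model *z)
{
	return z->nr_inactive_clean_pages + z->nr_inactive_dirty_pages;
}

/* Swapcache stays on inactive_clean; otherwise dirty || writeback decides. */
static enum lru_list pick_inactive_list(const struct page_model *p)
{
	if (p->swapcache)
		return LRU_INACTIVE_CLEAN;
	if (p->dirty || p->writeback)
		return LRU_INACTIVE_DIRTY;
	return LRU_INACTIVE_CLEAN;
}

static void del_from_inactive(struct zone_model *z, struct page_model *p)
{
	if (p->list == LRU_INACTIVE_CLEAN)
		z->nr_inactive_clean_pages--;
	else if (p->list == LRU_INACTIVE_DIRTY)
		z->nr_inactive_dirty_pages--;
	p->list = LRU_NONE;
}

static void add_to_inactive(struct zone_model *z, struct page_model *p)
{
	p->list = pick_inactive_list(p);
	if (p->list == LRU_INACTIVE_CLEAN)
		z->nr_inactive_clean_pages++;
	else
		z->nr_inactive_dirty_pages++;
}

/*
 * Writeback completion: the page leaves inactive_dirty.  Assumption: if it
 * was redirtied in the meantime, re-adding it puts it back on inactive_dirty.
 */
static void model_end_page_writeback(struct zone_model *z, struct page_model *p)
{
	p->writeback = false;
	if (p->list == LRU_INACTIVE_DIRTY) {
		del_from_inactive(z, p);
		add_to_inactive(z, p);
	}
}
```

In this model a scanner which only walks inactive_clean never sees a
dirty pagecache page, which is the point of the split.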

How does that all sound?

btw, it is approximately the case that the pages will come clean
in LRU order (oldest-first) because of the writeback logic.  fs-writeback.c
walks the inodes in oldest-dirtied to newest-dirtied order, and
it walks the inode pages in oldest-dirtied to newest-dirtied
order.   But I think that end_page_writeback() should still move
cleaned pages onto the far (hot) end of inactive_clean?
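
A rough sketch of why adding at the hot end keeps the ordering: if
pages come clean oldest-first and each one is inserted at the hot end,
then a scanner taking pages from the opposite (cold) end sees them
oldest-first too.  This is a toy circular list in user space; struct
node and the lru_*() helpers are hypothetical names, and the assumption
that reclaim takes pages from the cold end is mine.

```c
#include <assert.h>
#include <stddef.h>

/* Toy circular doubly-linked list; "id" stands in for a page. */
struct node {
	struct node *prev, *next;
	int id;
};

static void lru_init(struct node *head)
{
	head->prev = head->next = head;
}

/* Insert right after the head: the hot end of the list. */
static void lru_add_hot(struct node *head, struct node *n)
{
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

static void lru_del(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->prev = n->next = n;
}

/* Take the entry just before the head: the cold end.  NULL if empty. */
static struct node *lru_take_cold(struct node *head)
{
	struct node *n = head->prev;

	if (n == head)
		return NULL;
	lru_del(n);
	return n;
}
```

Adding ids 1, 2, 3 in the order they come clean and then draining with
lru_take_cold() hands them back as 1, 2, 3.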

I think all of this will not result in the zone balancing logic
going into a tailspin.  I'm just a bit worried about corner cases
when the number of reclaimable pages in highmem is getting low - the
classzone balancing code may keep on trying to refill that zone's free
memory pools too much.   We'll see...
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

