Re: inactive_dirty list - Andrew Morton

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Andrew Morton <akpm@digeo.com>
To: Rik van Riel <riel@conectiva.com.br>
Cc: "linux-mm@kvack.org" <linux-mm@kvack.org>
Subject: Re: inactive_dirty list
Date: Fri, 06 Sep 2002 22:28:44 -0700	[thread overview]
Message-ID: <3D798E8C.3DCD883C@digeo.com> (raw)
In-Reply-To: Pine.LNX.4.44L.0209062309580.1857-100000@imladris.surriel.com

Rik van Riel wrote:
> 
> On Fri, 6 Sep 2002, Andrew Morton wrote:
> 
> > I have a silly feeling that setting DEF_PRIORITY to "12" will
> > simply fix this.
> >
> > Duh.
> 
> Ideally we'd get rid of DEF_PRIORITY alltogether and would
> just scan each zone once.
> 

What I'm doing now is:

#define DEF_PRIORITY 12		/* puke */

        for (priority = DEF_PRIORITY; priority; priority--) {
		int total_scanned = 0;

		shrink_caches(priority, &total_scanned);
		if (that didn't work) {
			wakeup_bdflush(total_scanned);
			blk_congestion_wait(WRITE, HZ/4);
		}
	}

and in shrink_caches():

	max_scan = zone->nr_inactive >> priority;
        if (max_scan < nr_pages * 2)
                max_scan = nr_pages * 2;
        nr_pages = shrink_zone(zone, max_scan, gfp_mask, nr_pages);

So in effect, for a 32-page reclaim attempt we'll scan 64 pages
of ZONE_HIGHMEM, then 128 pages of ZONE_NORMAL/DMA.  If that doesn't
yield 32 pages we ask pdflush to write 3*64 pages.  Then take a nap.

Then do it again: scan 64 pages of ZONE_HIGHMEM, then 128 of ZONE_NORMAL/DMA,
then write back 192 pages then nap.

Then do it again: scan 128 pages of ZONE_HIGHMEM, then 256 of ZONE_NORMAL/DMA,
then write back 384 pages then nap.

etc.  Plus there are the actual pages which we started IO against
during the LRU scan - there can be up to 32 of those.

BTW, it turns out that the main reason why kswapd was going silly was
that the VM is *not* treating the `priority' as a logarithmic thing at
all:

        int max_scan = nr_inactive_pages / priority;

so the claims about scanning 1/64th of the list are crap.  That
thing scans 1/6th of the queue on the first pass.  In the mem=1G
case, that's 30,000 damn pages.   Maybe someone should take a look
at Marcelo's kernel?

There are a few warts: pdflush_operation will fail if all pdflush threads
are out doing something (pretty unlikely with the nonblocking stuff.
Might happen if writeback has to run get_block()).  But we'll be writing
back stuff anyway.

I changed blk_congestion_wait a bit too.  The first version would
return immediately if no queues were congested ( > 75% full). Now,
it will sleep even if no queues are congested.  It will return
as soon as someone puts back a write request.  If someone is silly
enough to call blk_congestion_wait() when there are no write requests
in flight at all, they get to take the full 1/4 second sleep.

The mem=1G corner case is fixed, and page reclaim just doesn't
figure:

c012c034 288      0.317709    do_wp_page              
c0144ae0 316      0.348597    __block_commit_write    
c012c910 342      0.377279    do_anonymous_page       
c0143efc 353      0.389414    __find_get_block        
c012f7e0 356      0.392724    find_lock_page          
c012f9f0 356      0.392724    do_generic_file_read    
c01832bc 367      0.404858    ext2_free_branches      
c0136e70 371      0.409271    __free_pages_ok         
c010e7b4 386      0.425818    timer_interrupt         
c01e3cfc 414      0.456707    radix_tree_lookup       
c0141894 434      0.47877     vfs_write               
c012f580 474      0.522896    unlock_page             
c0134348 500      0.551578    kmem_cache_alloc        
c01347d0 531      0.585776    kmem_cache_free         
c013712c 574      0.633212    rmqueue                 
c0141320 605      0.667409    generic_file_llseek     
c0156924 616      0.679544    count_list              
c0142c04 617      0.680647    fget                    
c01091e0 793      0.874803    system_call             
c0155914 860      0.948714    __d_lookup              
c0144674 1076     1.187       __block_prepare_write   
c014c63c 1184     1.30614     link_path_walk          
c012fcd4 10932    12.0597     file_read_actor         
c0130674 16443    18.1392     generic_file_write_nolock 
c0107048 31293    34.5211     poll_idle               

The balancing of the zones looks OK from a first glance and of course
the change in system behaviour under heavy writeout loads is profound.

Let's do the MAP_SHARED-pages-get-a-second-round thing, and it'd
be good if we could come up with some algorithm for setting the
current dirty pagecache clamping level rather than relying on the
dopey /proc/sys/vm/dirty_async_ratio magic number. 

I'm thinking that dirty_async_ratio becomes a maximum ratio, and
that we dynamically lower it when large amounts of dirty pagecache
would be embarrassing.  Or maybe there's just no need for this.  Dunno.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/

     prev parent reply	other threads:[~2002-09-07  5:14 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2002-09-06 20:42 inactive_dirty list Andrew Morton
2002-09-06 21:03 ` Rik van Riel
2002-09-06 21:40   ` Andrew Morton
2002-09-06 21:49     ` Rik van Riel
2002-09-06 21:58       ` Andrew Morton
2002-09-06 22:04         ` Rik van Riel
2002-09-06 22:19           ` Andrew Morton
2002-09-06 22:23             ` Rik van Riel
2002-09-06 22:48               ` Andrew Morton
2002-09-06 23:03                 ` Rik van Riel
2002-09-06 23:34                   ` Andrew Morton
2002-09-07  0:00                     ` Rik van Riel
2002-09-07  0:29                       ` Andrew Morton
2002-09-08 21:21                     ` Daniel Phillips
2002-09-06 22:22           ` Rik van Riel
2002-09-07  2:14 ` Andrew Morton
2002-09-07  2:10   ` Rik van Riel
2002-09-07  5:28     ` Andrew Morton [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=3D798E8C.3DCD883C@digeo.com \
    --to=akpm@digeo.com \
    --cc=linux-mm@kvack.org \
    --cc=riel@conectiva.com.br \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.