All of lore.kernel.org
 help / color / mirror / Atom feed
From: Qian Cai <cai@lca.pw>
To: Waiman Long <longman@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Johannes Weiner <hannes@cmpxchg.org>,
	Michal Hocko <mhocko@kernel.org>, Roman Gushchin <guro@fb.com>,
	Vlastimil Babka <vbabka@suse.cz>,
	Konstantin Khlebnikov <khlebnikov@yandex-team.ru>,
	Jann Horn <jannh@google.com>, Song Liu <songliubraving@fb.com>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	 Rafael Aquini <aquini@redhat.com>
Subject: Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo
Date: Wed, 23 Oct 2019 10:48:13 -0400	[thread overview]
Message-ID: <1571842093.5937.84.camel@lca.pw> (raw)
In-Reply-To: <2236495a-ead0-e08e-3fb6-f3ab906b75b6@redhat.com>

On Wed, 2019-10-23 at 10:30 -0400, Waiman Long wrote:
> On 10/22/19 5:59 PM, Andrew Morton wrote:
> > On Tue, 22 Oct 2019 12:21:56 -0400 Waiman Long <longman@redhat.com> wrote:
> > 
> > > The pagetypeinfo_showfree_print() function prints out the number of
> > > free blocks for each of the page orders and migrate types. The current
> > > code just iterates the each of the free lists to get counts.  There are
> > > bug reports about hard lockup panics when reading the /proc/pagetyeinfo
> > > file just because it look too long to iterate all the free lists within
> > > a zone while holing the zone lock with irq disabled.
> > > 
> > > Given the fact that /proc/pagetypeinfo is readable by all, the possiblity
> > > of crashing a system by the simple act of reading /proc/pagetypeinfo
> > > by any user is a security problem that needs to be addressed.
> > 
> > Yes.
> > 
> > > There is a free_area structure associated with each page order. There
> > > is also a nr_free count within the free_area for all the different
> > > migration types combined. Tracking the number of free list entries
> > > for each migration type will probably add some overhead to the fast
> > > paths like moving pages from one migration type to another which may
> > > not be desirable.
> > > 
> > > we can actually skip iterating the list of one of the migration types
> > > and used nr_free to compute the missing count. Since MIGRATE_MOVABLE
> > > is usually the largest one on large memory systems, this is the one
> > > to be skipped. Since the printing order is migration-type => order, we
> > > will have to store the counts in an internal 2D array before printing
> > > them out.
> > > 
> > > Even by skipping the MIGRATE_MOVABLE pages, we may still be holding the
> > > zone lock for too long blocking out other zone lock waiters from being
> > > run. This can be problematic for systems with large amount of memory.
> > > So a check is added to temporarily release the lock and reschedule if
> > > more than 64k of list entries have been iterated for each order. With
> > > a MAX_ORDER of 11, the worst case will be iterating about 700k of list
> > > entries before releasing the lock.
> > > 
> > > ...
> > > 
> > > --- a/mm/vmstat.c
> > > +++ b/mm/vmstat.c
> > > @@ -1373,23 +1373,54 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
> > >  					pg_data_t *pgdat, struct zone *zone)
> > >  {
> > >  	int order, mtype;
> > > +	unsigned long nfree[MAX_ORDER][MIGRATE_TYPES];
> > 
> > 600+ bytes is a bit much.  I guess it's OK in this situation.
> > 
> 
> This function is called by reading /proc/pagetypeinfo. The call stack is
> rather shallow:
> 
> PID: 58188  TASK: ffff938a4d4f1fa0  CPU: 2   COMMAND: "sosreport"
>  #0 [ffff9483bf488e48] crash_nmi_callback at ffffffffb8c551d7
>  #1 [ffff9483bf488e58] nmi_handle at ffffffffb931d8cc
>  #2 [ffff9483bf488eb0] do_nmi at ffffffffb931dba8
>  #3 [ffff9483bf488ef0] end_repeat_nmi at ffffffffb931cd69
>     [exception RIP: pagetypeinfo_showfree_print+0x73]
>     RIP: ffffffffb8db7173  RSP: ffff938b9fcbfda0  RFLAGS: 00000006
>     RAX: fffff0c9946d7020  RBX: ffff96073ffd5528  RCX: 0000000000000000
>     RDX: 00000000001c7764  RSI: ffffffffb9676ab1  RDI: 0000000000000000
>     RBP: ffff938b9fcbfdd0   R8: 000000000000000a   R9: 00000000fffffffe
>     R10: 0000000000000000  R11: ffff938b9fcbfc36  R12: ffff942b97758240
>     R13: ffffffffb942f730  R14: ffff96073ffd5000  R15: ffff96073ffd5180
>     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
> --- <NMI exception stack> ---
>  #4 [ffff938b9fcbfda0] pagetypeinfo_showfree_print at ffffffffb8db7173
>  #5 [ffff938b9fcbfdd8] walk_zones_in_node at ffffffffb8db74df
>  #6 [ffff938b9fcbfe20] pagetypeinfo_show at ffffffffb8db7a29
>  #7 [ffff938b9fcbfe48] seq_read at ffffffffb8e45c3c
>  #8 [ffff938b9fcbfeb8] proc_reg_read at ffffffffb8e95070
>  #9 [ffff938b9fcbfed8] vfs_read at ffffffffb8e1f2af
> #10 [ffff938b9fcbff08] sys_read at ffffffffb8e2017f
> #11 [ffff938b9fcbff50] system_call_fastpath at ffffffffb932579b
> 
> So we should not be in any risk of overflowing the stack.
> 
> > > -	for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
> > > -		seq_printf(m, "Node %4d, zone %8s, type %12s ",
> > > -					pgdat->node_id,
> > > -					zone->name,
> > > -					migratetype_names[mtype]);
> > > -		for (order = 0; order < MAX_ORDER; ++order) {
> > > +	lockdep_assert_held(&zone->lock);
> > > +	lockdep_assert_irqs_disabled();
> > > +
> > > +	/*
> > > +	 * MIGRATE_MOVABLE is usually the largest one in large memory
> > > +	 * systems. We skip iterating that list. Instead, we compute it by
> > > +	 * subtracting the total of the rests from free_area->nr_free.
> > > +	 */
> > > +	for (order = 0; order < MAX_ORDER; ++order) {
> > > +		unsigned long nr_total = 0;
> > > +		struct free_area *area = &(zone->free_area[order]);
> > > +
> > > +		for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
> > >  			unsigned long freecount = 0;
> > > -			struct free_area *area;
> > >  			struct list_head *curr;
> > >  
> > > -			area = &(zone->free_area[order]);
> > > -
> > > +			if (mtype == MIGRATE_MOVABLE)
> > > +				continue;
> > >  			list_for_each(curr, &area->free_list[mtype])
> > >  				freecount++;
> > > -			seq_printf(m, "%6lu ", freecount);
> > > +			nfree[order][mtype] = freecount;
> > > +			nr_total += freecount;
> > >  		}
> > > +		nfree[order][MIGRATE_MOVABLE] = area->nr_free - nr_total;
> > > +
> > > +		/*
> > > +		 * If we have already iterated more than 64k of list
> > > +		 * entries, we might have hold the zone lock for too long.
> > > +		 * Temporarily release the lock and reschedule before
> > > +		 * continuing so that other lock waiters have a chance
> > > +		 * to run.
> > > +		 */
> > > +		if (nr_total > (1 << 16)) {
> > > +			spin_unlock_irq(&zone->lock);
> > > +			cond_resched();
> > > +			spin_lock_irq(&zone->lock);
> > > +		}
> > > +	}
> > > +
> > > +	for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
> > > +		seq_printf(m, "Node %4d, zone %8s, type %12s ",
> > > +					pgdat->node_id,
> > > +					zone->name,
> > > +					migratetype_names[mtype]);
> > > +		for (order = 0; order < MAX_ORDER; ++order)
> > > +			seq_printf(m, "%6lu ", nfree[order][mtype]);
> > >  		seq_putc(m, '\n');
> > 
> > This is not exactly a thing of beauty :( Presumably there might still
> > be situations where the irq-off times remain excessive.
> 
> Yes, that is still possible.
> > 
> > Why are we actually holding zone->lock so much?  Can we get away with
> > holding it across the list_for_each() loop and nothing else?  If so,
> 
> We can certainly do that with the risk that the counts will be less
> reliable for a given order. I can send a v2 patch if you think this is
> safer.
> 
> 
> > this still isn't a bulletproof fix.  Maybe just terminate the list
> > walk if freecount reaches 1024.  Would anyone really care?
> > 
> > Sigh.  I wonder if anyone really uses this thing for anything
> > important.  Can we just remove it all?
> > 
> 
> Removing it will be a breakage of kernel API.

Who cares about breaking this part of the API that essentially nobody will use
this file?

> 
> Another alternative is to mark the migration type in the page structure
> so that we can do per-migration type nr_free tracking. That will be a
> major change to the mm code.
> 
> I consider this patch lesser of the two evils. 
> 
> Cheers,
> Longman
> 
> 


  reply	other threads:[~2019-10-23 14:48 UTC|newest]

Thread overview: 62+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-10-22 16:21 [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo Waiman Long
2019-10-22 16:57 ` Michal Hocko
2019-10-22 18:00   ` Waiman Long
2019-10-22 18:40     ` Waiman Long
2019-10-23  0:52       ` David Rientjes
2019-10-23  8:31   ` Mel Gorman
2019-10-23  9:04     ` Michal Hocko
2019-10-23  9:56       ` Mel Gorman
2019-10-23 10:27         ` [RFC PATCH 0/2] " Michal Hocko
2019-10-23 10:27           ` [RFC PATCH 1/2] mm, vmstat: hide /proc/pagetypeinfo from normal users Michal Hocko
2019-10-23 13:13             ` Mel Gorman
2019-10-23 13:27             ` Vlastimil Babka
2019-10-23 14:52             ` Waiman Long
2019-10-23 15:10             ` Rafael Aquini
2019-10-23 16:15             ` Vlastimil Babka
2019-10-24 19:01             ` David Rientjes
2019-10-23 10:27           ` [RFC PATCH 2/2] mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo Michal Hocko
2019-10-23 13:15             ` Mel Gorman
2019-10-23 13:32             ` Vlastimil Babka
2019-10-23 13:37               ` Michal Hocko
2019-10-23 13:48                 ` Vlastimil Babka
2019-10-23 14:31                   ` Michal Hocko
2019-10-23 16:20                     ` Vlastimil Babka
2019-10-23 13:46               ` Rafael Aquini
2019-10-23 14:56             ` Waiman Long
2019-10-23 15:21               ` Waiman Long
2019-10-23 16:10               ` Michal Hocko
2019-10-23 16:17                 ` Waiman Long
2019-10-23 16:21                   ` Waiman Long
2019-10-23 16:15             ` Vlastimil Babka
2019-10-23 16:41             ` Michal Hocko
2019-10-23 16:47               ` Waiman Long
2019-10-23 17:34             ` [PATCH 1/2] mm, vmstat: Release zone lock more frequently when reading /proc/pagetypeinfo Waiman Long
2019-10-23 18:01               ` Michal Hocko
2019-10-23 18:14                 ` Waiman Long
2019-10-23 20:02                   ` Michal Hocko
2019-10-23 17:34             ` [PATCH 2/2] mm, vmstat: List total free blocks for each order in /proc/pagetypeinfo Waiman Long
2019-10-23 18:02               ` Michal Hocko
2019-10-23 18:07                 ` Waiman Long
2019-10-24  8:20           ` [RFC PATCH 0/2] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo Michal Hocko
2019-10-24 16:16             ` Waiman Long
2019-10-23 12:42         ` [PATCH] " Qian Cai
2019-10-23 13:25         ` Vlastimil Babka
2019-10-22 21:59 ` Andrew Morton
2019-10-23  6:15   ` Michal Hocko
2019-10-23 14:30   ` Waiman Long
2019-10-23 14:48     ` Qian Cai [this message]
2019-10-23 15:01       ` Waiman Long
2019-10-23 15:05         ` Qian Cai
2019-10-23 22:30         ` Andrew Morton
2019-10-24  5:33           ` Qian Cai
2019-10-24  7:42             ` Michal Hocko
2019-10-24 11:11               ` Qian Cai
2019-10-24 13:38                 ` Michal Hocko
2019-10-24 14:55                   ` Qian Cai
2019-10-24  3:33         ` Feng Tang
2019-10-24  4:34           ` Qian Cai
2019-10-24  5:34             ` Feng Tang
2019-10-24 10:51               ` Qian Cai
2019-10-25  1:38                 ` Feng Tang
2019-10-23 15:03       ` Rafael Aquini
2019-10-23 15:51         ` Qian Cai

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1571842093.5937.84.camel@lca.pw \
    --to=cai@lca.pw \
    --cc=akpm@linux-foundation.org \
    --cc=aquini@redhat.com \
    --cc=gregkh@linuxfoundation.org \
    --cc=guro@fb.com \
    --cc=hannes@cmpxchg.org \
    --cc=jannh@google.com \
    --cc=khlebnikov@yandex-team.ru \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=longman@redhat.com \
    --cc=mhocko@kernel.org \
    --cc=songliubraving@fb.com \
    --cc=vbabka@suse.cz \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.