From: kemi <kemi.wang@intel.com>
To: Michal Hocko <mhocko@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Vlastimil Babka <vbabka@suse.cz>,
	Mel Gorman <mgorman@techsingularity.net>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Christopher Lameter <cl@linux.com>,
	YASUAKI ISHIMATSU <yasu.isimatu@gmail.com>,
	Andrey Ryabinin <aryabinin@virtuozzo.com>,
	Nikolay Borisov <nborisov@suse.com>,
	Pavel Tatashin <pasha.tatashin@oracle.com>,
	David Rientjes <rientjes@google.com>,
	Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
	Dave <dave.hansen@linux.intel.com>,
	Andi Kleen <andi.kleen@intel.com>,
	Tim Chen <tim.c.chen@intel.com>,
	Jesper Dangaard Brouer <brouer@redhat.com>,
	Ying Huang <ying.huang@intel.com>, Aaron Lu <aaron.lu@intel.com>,
	Aubrey Li <aubrey.li@intel.com>, Linux MM <linux-mm@kvack.org>,
	Linux Kernel <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 1/2] mm: NUMA stats code cleanup and enhancement
Date: Fri, 8 Dec 2017 16:38:46 +0800	[thread overview]
Message-ID: <9cd6cc9f-252a-3c6f-2f1f-e39d4ec0457b@intel.com> (raw)
In-Reply-To: <20171130094523.vvcljyfqjpbloe5e@dhcp22.suse.cz>



On 2017-11-30 17:45, Michal Hocko wrote:
> On Thu 30-11-17 17:32:08, kemi wrote:

> Do not get me wrong. If we want to make per-node stats more optimal,
> then by all means let's do that. But having 3 sets of counters is just
> way to much.
> 

Hi Michal,
  Apologies for the late reply in this thread.

After thinking about how to optimize our per-node stats more gracefully,
we could add u64 vm_numa_stat_diff[] to struct per_cpu_nodestat. That way
everything stays in per-cpu counters, and we sum them up only when the
NUMA stats are read from /proc or /sys.
What do you think? Thanks.
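
To make the idea concrete, here is a rough userspace sketch, not the actual
patch. All names with a _sketch suffix, the fixed CPU/node counts, and the
helper functions are hypothetical stand-ins; real kernel code would use the
percpu accessors rather than a plain 2-D array.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sizes; the kernel derives these at boot. */
#define NR_CPUS_SKETCH  4
#define NR_NODES_SKETCH 2

enum numa_stat_item_sketch {
	NUMA_HIT_SKETCH,
	NUMA_MISS_SKETCH,
	NR_NUMA_ITEMS_SKETCH
};

/* Stand-in for struct per_cpu_nodestat with the proposed u64 array. */
struct per_cpu_nodestat_sketch {
	uint64_t vm_numa_stat_diff[NR_NUMA_ITEMS_SKETCH];
};

/* One slot per (node, cpu); statically allocated, so always usable. */
struct per_cpu_nodestat_sketch nodestat_sketch[NR_NODES_SKETCH][NR_CPUS_SKETCH];

/* Hot path: touch only the local CPU's counter, no global update. */
void inc_numa_state_sketch(int node, int cpu, enum numa_stat_item_sketch item)
{
	assert(node >= 0 && node < NR_NODES_SKETCH);
	assert(cpu >= 0 && cpu < NR_CPUS_SKETCH);
	nodestat_sketch[node][cpu].vm_numa_stat_diff[item]++;
}

/* Slow path (a /proc or /sys read): sum the per-CPU values for one node. */
uint64_t sum_numa_state_sketch(int node, enum numa_stat_item_sketch item)
{
	uint64_t sum = 0;

	for (int cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
		sum += nodestat_sketch[node][cpu].vm_numa_stat_diff[item];
	return sum;
}
```

The point of the layout is that the allocation path only increments a local
u64, and the cost of walking all CPUs is paid by the (rare) reader.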

The motivation for this change is as follows:
1) Thanks to the 0-day system, a bug was reported against the V1 patch:

[    0.000000] BUG: unable to handle kernel paging request at 0392b000
[    0.000000] IP: __inc_numa_state+0x2a/0x34
[    0.000000] *pdpt = 0000000000000000 *pde = f000ff53f000ff53 
[    0.000000] Oops: 0002 [#1] PREEMPT SMP
[    0.000000] Modules linked in:
[    0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 4.14.0-12996-g81611e2 #1
[    0.000000] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[    0.000000] task: cbf56000 task.stack: cbf4e000
[    0.000000] EIP: __inc_numa_state+0x2a/0x34
[    0.000000] EFLAGS: 00210006 CPU: 0
[    0.000000] EAX: 0392b000 EBX: 00000000 ECX: 00000000 EDX: cbef90ef
[    0.000000] ESI: cffdb320 EDI: 00000004 EBP: cbf4fd80 ESP: cbf4fd7c
[    0.000000]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[    0.000000] CR0: 80050033 CR2: 0392b000 CR3: 0c0a8000 CR4: 000406b0
[    0.000000] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[    0.000000] DR6: fffe0ff0 DR7: 00000400
[    0.000000] Call Trace:
[    0.000000]  zone_statistics+0x4d/0x5b
[    0.000000]  get_page_from_freelist+0x257/0x993
[    0.000000]  __alloc_pages_nodemask+0x108/0x8c8
[    0.000000]  ? __bitmap_weight+0x38/0x41
[    0.000000]  ? pcpu_next_md_free_region+0xe/0xab
[    0.000000]  ? pcpu_chunk_refresh_hint+0x8b/0xbc
[    0.000000]  ? pcpu_chunk_slot+0x1e/0x24
[    0.000000]  ? pcpu_chunk_relocate+0x15/0x6d
[    0.000000]  ? find_next_bit+0xa/0xd
[    0.000000]  ? cpumask_next+0x15/0x18
[    0.000000]  ? pcpu_alloc+0x399/0x538
[    0.000000]  cache_grow_begin+0x85/0x31c
[    0.000000]  ____cache_alloc+0x147/0x1e0
[    0.000000]  ? debug_smp_processor_id+0x12/0x14
[    0.000000]  kmem_cache_alloc+0x80/0x145
[    0.000000]  create_kmalloc_cache+0x22/0x64
[    0.000000]  kmem_cache_init+0xf9/0x16c
[    0.000000]  start_kernel+0x1d4/0x3d6
[    0.000000]  i386_start_kernel+0x9a/0x9e
[    0.000000]  startup_32_smp+0x15f/0x170

That is because the u64 percpu pointer vm_numa_stat is used before it is initialized.

[...]
> +extern u64 __percpu *vm_numa_stat;
[...]
> +#ifdef CONFIG_NUMA
> +	size = sizeof(u64) * num_possible_nodes() * NR_VM_NUMA_STAT_ITEMS;
> +	align = __alignof__(u64[num_possible_nodes() * NR_VM_NUMA_STAT_ITEMS]);
> +	vm_numa_stat = (u64 __percpu *)__alloc_percpu(size, align);
> +#endif

The pointer is used in mm_init->kmem_cache_init->create_kmalloc_cache->...->
__alloc_pages() when CONFIG_SLAB/CONFIG_ZONE_DMA is set in Kconfig, whereas
vm_numa_stat is only initialized in setup_per_cpu_pageset(), which runs after
mm_init(). The proposal above fixes this by making the NUMA stats counters ready
before mm_init() is called (start_kernel->build_all_zonelists() can help with that).

2) Compared to the V1 patch, this change makes the semantics of the per-node NUMA
stats clearer for review and maintenance.
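
The ordering problem in 1) can be modeled in userspace roughly as below. This
is only a sketch under assumptions: every name is hypothetical, calloc() stands
in for __alloc_percpu(), and the helper returns -1 at the point where the real
kernel faulted instead of actually dereferencing the unset pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for the dynamically allocated counter area; NULL until setup. */
uint64_t *numa_stat_sketch;

/* Models the late allocation done in the V1 patch. */
int numa_stat_setup_sketch(size_t nr_counters)
{
	numa_stat_sketch = calloc(nr_counters, sizeof(*numa_stat_sketch));
	return numa_stat_sketch ? 0 : -1;
}

/* Test helper: return to the "not yet initialized" state. */
void numa_stat_reset_sketch(void)
{
	free(numa_stat_sketch);
	numa_stat_sketch = NULL;
}

/* Returns 0 on success, -1 if the counters are not ready yet. */
int inc_numa_stat_sketch(size_t idx)
{
	if (!numa_stat_sketch)
		return -1;	/* the V1 kernel code oopsed here */
	numa_stat_sketch[idx]++;
	return 0;
}

/* V1 ordering: an early allocation bumps the counter before setup runs. */
int boot_v1_order_sketch(void)
{
	int ret = inc_numa_stat_sketch(0);	/* early slab allocation */

	numa_stat_setup_sketch(4);		/* too late */
	return ret;
}

/* Proposed ordering: counters are ready before the first allocation. */
int boot_fixed_order_sketch(void)
{
	if (numa_stat_setup_sketch(4))
		return -1;
	return inc_numa_stat_sketch(0);
}
```

With the V1 ordering the first increment fails; with the counters set up
first, the same increment succeeds.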

