From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1753462AbbCXHIL (ORCPT <rfc822;w@1wt.eu>);
	Tue, 24 Mar 2015 03:08:11 -0400
Received: from mail-wi0-f172.google.com ([209.85.212.172]:37735 "EHLO
	mail-wi0-f172.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1753303AbbCXHII (ORCPT
	<rfc822;linux-kernel@vger.kernel.org>);
	Tue, 24 Mar 2015 03:08:08 -0400
Date: Tue, 24 Mar 2015 08:08:03 +0100
From: Ingo Molnar <mingo@kernel.org>
To: Joonsoo Kim <js1304@gmail.com>
Cc: Namhyung Kim <namhyung@kernel.org>,
        Arnaldo Carvalho de Melo <acme@kernel.org>,
        Peter Zijlstra <a.p.zijlstra@chello.nl>, Jiri Olsa <jolsa@redhat.com>,
        LKML <linux-kernel@vger.kernel.org>, David Ahern <dsahern@gmail.com>,
        Minchan Kim <minchan@kernel.org>
Subject: Re: [PATCH 2/5] perf kmem: Analyze page allocator events also
Message-ID: <20150324070802.GB28190@gmail.com>
References: <1427092244-22764-1-git-send-email-namhyung@kernel.org>
 <1427092244-22764-3-git-send-email-namhyung@kernel.org>
 <CAAmzW4MJmAfLrQGnhaeXdN3eCBsA8zaxy3q1O_dUkpnTXc+9YQ@mail.gmail.com>
 <20150324001828.GJ2782@sejong>
 <CAAmzW4MkS5M+X+e9w=zv873jjfMxqQxtpNTz3EAkxa+f5WnK4A@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <CAAmzW4MkS5M+X+e9w=zv873jjfMxqQxtpNTz3EAkxa+f5WnK4A@mail.gmail.com>
User-Agent: Mutt/1.5.23 (2014-03-12)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org


* Joonsoo Kim <js1304@gmail.com> wrote:

> 2015-03-24 9:18 GMT+09:00 Namhyung Kim <namhyung@kernel.org>:
> > On Tue, Mar 24, 2015 at 02:32:17AM +0900, Joonsoo Kim wrote:
> >> 2015-03-23 15:30 GMT+09:00 Namhyung Kim <namhyung@kernel.org>:
> >> > The perf kmem command records and analyzes kernel memory allocation
> >> > only for SLAB objects.  This patch implements a simple page allocator
> >> > analyzer using kmem:mm_page_alloc and kmem:mm_page_free events.
> >> >
> >> > It adds two new options, --slab and --page.  The --slab option is
> >> > for analyzing the SLAB allocator, which is what perf kmem currently does.
> >> >
> >> > The new --page option enables page allocator events and analyzes kernel
> >> > memory usage in page units.  Currently, only the 'stat --alloc'
> >> > subcommand is implemented.
> >> >
> >> > If neither --slab nor --page is specified, --slab is implied.
> >> >
> >> >   # perf kmem stat --page --alloc --line 10
> >> >
> >> >   -------------------------------------------------------------------------------------
> >> >    Page             | Total alloc (KB) | Hits     | Order | Migration type | GFP flags
> >> >   -------------------------------------------------------------------------------------
> >> >    ffffea0015e48e00 |               16 |        1 |     2 |    RECLAIMABLE |  00285250
> >> >    ffffea0015e47400 |               16 |        1 |     2 |    RECLAIMABLE |  00285250
> >> >    ffffea001440f600 |               16 |        1 |     2 |    RECLAIMABLE |  00285250
> >> >    ffffea001440cc00 |               16 |        1 |     2 |    RECLAIMABLE |  00285250
> >> >    ffffea00140c6300 |               16 |        1 |     2 |    RECLAIMABLE |  00285250
> >> >    ffffea00140c5c00 |               16 |        1 |     2 |    RECLAIMABLE |  00285250
> >> >    ffffea00140c5000 |               16 |        1 |     2 |    RECLAIMABLE |  00285250
> >> >    ffffea00140c4f00 |               16 |        1 |     2 |    RECLAIMABLE |  00285250
> >> >    ffffea00140c4e00 |               16 |        1 |     2 |    RECLAIMABLE |  00285250
> >> >    ffffea00140c4d00 |               16 |        1 |     2 |    RECLAIMABLE |  00285250
> >> >    ...              | ...              | ...      | ...   | ...            | ...
> >> >   -------------------------------------------------------------------------------------
> >>
> >> The mm_page_alloc tracepoint prints out the pfn as well as the struct page
> >> pointer.  How about printing the pfn rather than the struct page pointer?
> >
> > I'd really like to have pfn rather than struct page.  But I don't know
> > how to convert page pointer to pfn in userspace.
> >
> > The output of tracepoint via $debugfs/tracing/trace file is generated
> > from kernel-side, so it can easily have pfn from page pointer.  But
> > tracepoint itself only saves page pointer and we need to convert/print
> > it in userspace.
> 
> Ah...I didn't realize that perf doesn't use the output of the
> $debugfs/tracing/trace file. So, perf just uses the raw trace buffer
> directly? If the pfn is saved to the trace buffer, can perf print the pfn
> rather than the struct page pointer?
> 
> > Yes, perf script (or libtraceevent) shows pfn when printing those
> > events.  But that's bogus since it cannot determine the size of the
> > struct page so the pointer arithmetic in open-coded page_to_pfn()
> > which is saved in the print_fmt of the tracepoint will end up as
> > normal integer arithmetic.
> 
> How about the following change, which lets 'perf kmem' print the pfn?
> If we store the pfn in the trace buffer, we can print $debugfs/tracing/trace
> as is and 'perf kmem' can also print the pfn.
> 
> Thanks.
> 
> diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
> index 4ad10ba..9dcfd0b 100644
> --- a/include/trace/events/kmem.h
> +++ b/include/trace/events/kmem.h
> @@ -199,22 +199,22 @@ TRACE_EVENT(mm_page_alloc,
>         TP_ARGS(page, order, gfp_flags, migratetype),
> 
>         TP_STRUCT__entry(
> -               __field(        struct page *,  page            )
> +               __field(        unsigned long,  pfn             )
>                 __field(        unsigned int,   order           )
>                 __field(        gfp_t,          gfp_flags       )
>                 __field(        int,            migratetype     )
>         ),
> 
>         TP_fast_assign(
> -               __entry->page           = page;
> +               __entry->pfn            = page ? page_to_pfn(page) : -1;
>                 __entry->order          = order;
>                 __entry->gfp_flags      = gfp_flags;
>                 __entry->migratetype    = migratetype;
>         ),
> 
>         TP_printk("page=%p pfn=%lu order=%d migratetype=%d gfp_flags=%s",
> -               __entry->page,
> -               __entry->page ? page_to_pfn(__entry->page) : 0,
> +               __entry->pfn != -1 ? pfn_to_page(__entry->pfn) : NULL,
> +               __entry->pfn != -1 ? __entry->pfn : 0,
>                 __entry->order,
>                 __entry->migratetype,
>                 show_gfp_flags(__entry->gfp_flags))

Acked-by: Ingo Molnar <mingo@kernel.org>

It would be very nice to make all the other page-granular tracepoints 
output the pfn (a page frame number, i.e. a physical address in page 
units, which can be resolved to 'node' and other properties), not 
'struct page *' (which is a kernel-internal pointer with little 
meaning to user-space tooling).

I.e. the following tracepoints:

triton:~/tip> git grep -E '__field.*struct page *' include/trace/
include/trace/events/filemap.h:         __field(struct page *, page)
include/trace/events/kmem.h:            __field(        struct page *,  page            )
include/trace/events/kmem.h:            __field(        struct page *,  page            )
include/trace/events/kmem.h:            __field(        struct page *,  page            )
include/trace/events/kmem.h:            __field(        struct page *,  page            )
include/trace/events/kmem.h:            __field(        struct page *,  page                    )
include/trace/events/pagemap.h:         __field(struct page *,  page    )
include/trace/events/pagemap.h:         __field(struct page *,  page    )
include/trace/events/vmscan.h:          __field(struct page *, page)

There's very little breakage I can imagine: they have traced pointers 
to 'struct page', which is a pretty opaque page identifier to 
user-space, and they'll trace pfns in the future, which still serve 
as a page identifier.

One thing would be important: to do all these changes at once, to make 
sure that the various page identifiers can be compared.

Also, we might keep the 'page' field name if anything relies on that - 
but 'pfn' is even better.
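
To illustrate why a raw 'struct page *' is awkward for tooling, here is 
a rough user-space sketch of the conversion a tool would have to 
open-code. The vmemmap base, the 64-byte 'struct page' size and the 
page shift are assumptions (plausible for an x86-64 sparsemem-vmemmap 
configuration, but config- and version-dependent, and not something 
the trace buffer exports); the helper names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/*
 * ASSUMPTIONS for illustration only: these values are not exported by
 * the tracepoint and vary with architecture, kernel config and version.
 */
#define ASSUMED_VMEMMAP_BASE     0xffffea0000000000ULL
#define ASSUMED_STRUCT_PAGE_SIZE 64ULL
#define ASSUMED_PAGE_SHIFT       12

/* Hypothetical helper: open-coded page_to_pfn() for a vmemmap layout. */
static uint64_t page_ptr_to_pfn(uint64_t page_ptr)
{
	/* The pointer must lie inside the assumed vmemmap array. */
	assert(page_ptr >= ASSUMED_VMEMMAP_BASE);

	/*
	 * Pointer subtraction scales by sizeof(struct page); a tool that
	 * does not know that size would do plain integer subtraction and
	 * get a meaningless number.
	 */
	return (page_ptr - ASSUMED_VMEMMAP_BASE) / ASSUMED_STRUCT_PAGE_SIZE;
}

/* Hypothetical helper: a pfn maps to a physical address by a shift. */
static uint64_t pfn_to_phys(uint64_t pfn)
{
	return pfn << ASSUMED_PAGE_SHIFT;
}
```

Under those assumptions, the first page in the quoted output 
(ffffea0015e48e00) would correspond to pfn 0x579238. Once the pfn 
itself is in the trace buffer, none of these guesses are needed.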

Thanks,

	Ingo