From: Xunlei Pang <xpang-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
To: Dave Young <dyoung-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>,
Xunlei Pang <xlpang-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: Harald Hoyer <harald-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>,
initramfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
Pratyush Anand <panand-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Subject: Re: [RFC PATCH 1/2] add 99memdebug-ko dracut module
Date: Thu, 3 Nov 2016 15:28:07 +0800 [thread overview]
Message-ID: <581AE707.9050606@redhat.com> (raw)
In-Reply-To: <581AB9D7.2010905-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
On 2016/11/03 at 12:15, Xunlei Pang wrote:
> On 2016/11/03 at 11:01, Dave Young wrote:
>> Hi Xunlei,
>> On 11/02/16 at 05:03pm, Xunlei Pang wrote:
>>> The current method for memory debug is to use "rd.memdebug=[0-3]",
>>> it is not enough for debugging kernel modules. For example, when we
>>> want to find out which kernel module consumes a large amount of memory,
>>> "rd.memdebug=[0-3]" won't help too much.
>>>
>>> A better way is needed to achieve this requirement, this is useful for
>>> kdump OOM debugging.
>>>
>>> The principle of this patch is to use kernel trace to track slab and
>>> buddy allocation calls during kernel module loading(module_init), thus
>>> we can analyze all the trace data and get the total memory consumption.
>>> As for large slab allocation, it will probably fall into buddy allocation,
>>> thus tracing "mm_page_alloc" alone should be enough for the purpose(this
>>> saves quite some trace buffer memory, also large free is quite unlikey
>>> during module loading, we neglect those memory free events).
>>>
>>> The trace events include memory calls under "tracing/events/":
>>> kmem/mm_page_alloc
>>>
>>> We also inpect the following events to detect the module loading:
>>> module/module_load
>>> module/module_put
>>>
>>> Since we use filters to trace events, the final trace data size won't
>>> be too big. Users can adjust the trace buffer size via "trace_buf_size"
>>> kernel boot command line as needed.
>>>
>>> We can get the module name and task pid from "module_load" event which
>>> also mark the beginning of the loading, and module_put called by the
>>> same task pid implies the end of the loading. So the memory events
>>> recorded in between by the same task pid are consumed by this module
>>> during loading(i.e. modprobe or module_init()).
>>>
>>> With these information, we can record the total memory(the larger, the
>>> more precise the result will be) consumption involved by each kernel
>>> module loading.
>>>
>>> Thus we introduce this dracut module to find out which kernel module
>>> consumes a large amount of memory during loading. It uses "rd.memdebug=4"
>>> as the tigger.
>>>
>>> After applying is patch and specifying "rd.memdebug=4", during booting
>>> it will print out something below:
>>> == debug_mem for kernel modules during loading begin ==
>>> 0 pages consumed by "pata_acpi"
>>> 0 pages consumed by "ata_generic"
>>> 1 pages consumed by "drm"
>>> 0 pages consumed by "ttm"
>>> 0 pages consumed by "drm_kms_helper"
>>> 835 pages consumed by "qxl"
>>> 0 pages consumed by "mii"
>>> 6 pages consumed by "8139cp"
>>> 0 pages consumed by "virtio"
>>> 0 pages consumed by "virtio_ring"
>>> 9 pages consumed by "virtio_pci"
>>> 1 pages consumed by "8139too"
>>> 0 pages consumed by "serio_raw"
>>> 0 pages consumed by "crc32c_intel"
>>> 199 pages consumed by "virtio_console"
>>> 0 pages consumed by "libcrc32c"
>>> 9 pages consumed by "xfs"
>>> == debug_mem for kernel modules during loading end ==
>>>
>>> From the print, we see clearly that "qxl" consumed lots of memory.
>>>
>>> Signed-off-by: Xunlei Pang <xlpang-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>>> ---
>>> dracut.spec | 1 +
>>> modules.d/99memdebug-ko/memdebug-ko-parse.sh | 93 ++++++++++++++++++++++++++
>>> modules.d/99memdebug-ko/memdebug-ko-prepare.sh | 60 +++++++++++++++++
>>> modules.d/99memdebug-ko/module-setup.sh | 25 +++++++
>>> 4 files changed, 179 insertions(+)
>>> create mode 100755 modules.d/99memdebug-ko/memdebug-ko-parse.sh
>>> create mode 100755 modules.d/99memdebug-ko/memdebug-ko-prepare.sh
>>> create mode 100755 modules.d/99memdebug-ko/module-setup.sh
>>>
>>> diff --git a/dracut.spec b/dracut.spec
>>> index 7b9d675..3907b17 100644
>>> --- a/dracut.spec
>>> +++ b/dracut.spec
>>> @@ -407,6 +407,7 @@ rm -rf -- $RPM_BUILD_ROOT
>>> %{dracutlibdir}/modules.d/98usrmount
>>> %{dracutlibdir}/modules.d/99base
>>> %{dracutlibdir}/modules.d/99fs-lib
>>> +%{dracutlibdir}/modules.d/99memdebug-ko
>>> %{dracutlibdir}/modules.d/99shutdown
>>> %attr(0644,root,root) %ghost %config(missingok,noreplace) %{_localstatedir}/log/dracut.log
>>> %dir %{_sharedstatedir}/initramfs
>>> diff --git a/modules.d/99memdebug-ko/memdebug-ko-parse.sh b/modules.d/99memdebug-ko/memdebug-ko-parse.sh
>>> new file mode 100755
>>> index 0000000..1542658
>>> --- /dev/null
>>> +++ b/modules.d/99memdebug-ko/memdebug-ko-parse.sh
>>> @@ -0,0 +1,93 @@
>>> +# Parse the trace data to find out the modules consuming large amount of memory.
>>> +
>>> +is_trace_data_available() {
>>> + # Only "rd.memdebug=4" is for the purpose.
>>> + if [[ $(getargnum 0 0 4 rd.memdebug) != 4 ]]; then
>>> + return 1
>>> + fi
> Hi Dave,
>
> Firstly, thanks for your valuable comments.
>
>> The usage of rd.memdebug is different in other place, can we integrate
>> it to the original rd.memdbug code?
>>
>> One concern it in this patch only parse the trace data at pre-pivot
>> hook, so if oom happens earlier than that we will get nothing.
>>
>> Original rd.memdebug being added to several different point, if we
>> integrate this to the framework then we can get the out put early.
> Yes, they can be integrated together. My original thought is that to
> just output the memory information for all the modules at one stage
> where all modules are supposed to be loaded, as such can avoid printing
> redundant/partial information if at multiple stages(hooks).
>
> If OOM happens (mainly for kdump cases), we can enlarge the reserved
> crash memory and try again, because this feature itself might consume
> large trace buffer memory.
>
> But I can do in that way if you think it is fine.
>
>>> +
>>> + TRACE_BASE="/sys/kernel/debug"
>>> + # trace access through debugfs would be obsolete if "/sys/kernel/tracing" is available.
>>> + if [[ -d "/sys/kernel/tracing" ]]; then
>>> + TRACE_BASE="/sys/kernel"
>>> + fi
>>> +
>>> + if ! [[ -f "$TRACE_BASE/tracing/trace" ]]; then
>>> + warn "Trace inactive, skipping kernel module memory analyzing."
>>> + return 1
>>> + fi
>> This function is only for checking if tracing is needed by checking
>> cmdline, so getting the TRACE_BASE logic can be moved to
>> parse_trace_data so that it can be a local variable..
> It is actually in a function, will change it to be a local definition.
>
>>> +
>>> + return 0
>>> +}
>>> +
>>> +parse_trace_data() {
>>> + # Indexed by task pid.
>>> + declare -A current_module
>>> + # Indexed by module name.
>>> + declare -A module_loaded
>>> + declare -A nr_alloc_pages
>>> +
>>> + # For large trace data, parsing "tracing/trace" directly turns out to be time-consuming,
>>> + # hence copy it out first, then parse the copy file instead.
>> Have you get any data about the time used via traceing/trace parsing, if
>> the time is still acceptable then it will be better to avoid the tmp
>> file.
> For large trace data(tested on rhel7, the filter doesn't work on rhel7, and will produce huge trace data),
> the time consumption is huge, I am afraid in minutes because I once suspected the script was in some
> dead loop when parsing "tracing/trace" directly. It is the same situation when turning off tracing_on and
> try again.
Although I don't know why, after I replaced the following scripts
1)
while read pid cpu flags ts function
do
... ...
done < "$TRACE_BASE/tracing/trace"
with
2)
cat "$TRACE_BASE/tracing/trace" | while read pid cpu flags ts function
do
... ...
done
2) became not time-consuming just like parsing the copied filename in 1) ...
Regards,
Xunlei
>
>>> + TMPFILE=/tmp/kdump.trace.tmp.$$$$
>>> + cp $TRACE_BASE/tracing/trace $TMPFILE -f
>>> + while read pid cpu flags ts function ;
>>> + do
>>> + # Skip comment lines
>>> + if [[ $pid = "#" ]]; then
>>> + continue
>>> + fi
>>> +
>>> + if [[ $function = module_load* ]]; then
>>> + # One module is being loaded, save the task pid for tracking.
>>> + module_name=${function#*: }
>>> + # Remove the trailing after whitespace, there may be the module flags.
>>> + module_name=${module_name%% *}
>>> + module_names+=" $module_name"
>>> + current_module[$pid]="$module_name"
>>> + [[ ${module_loaded[$module_name]} ]] && warn "\"$module_name\" was loaded multiple times!"
>>> + unset module_loaded[$module_name]
>>> + nr_alloc_pages[$module_name]=0
>>> + fi
>>> +
>>> + if ! [[ ${current_module[$pid]} ]]; then
>>> + continue
>>> + fi
>>> +
>>> + if [[ $function = module_put* ]]; then
>>> + # Mark the module as loaded when the first module_put event happens after module_load.
>>> + module_loaded[${current_module[$pid]}]=1
>>> + # Module has been loaded when module_put is called, untrack the task
>>> + unset current_module[$pid]
>>> + continue
>>> + fi
>>> +
>>> + # Once we get here, the task is being tracked(is loading a module).
>>> + # Get the module name.
>>> + module_name=${current_module[$pid]}
>>> +
>>> + if [[ $function = mm_page_alloc* ]]; then
>>> + order=$(echo $function | sed -e 's/.*order=\([0-9]*\) .*/\1/')
>>> + nr_alloc_pages[$module_name]=$((${nr_alloc_pages[$module_name]}+$((2 ** $order))))
>>> + fi
>>> + done < $TMPFILE
>>> +
>>> + # Print the result out.
>>> + echo "== debug_mem for kernel modules during loading begin ==" >&2
>>> + for i in $module_names; do
>>> + status=""
>>> + if ! [[ ${module_loaded[$i]} ]]; then
>>> + status="[still loading]"
>>> + fi
>>> + echo "${nr_alloc_pages[$i]} pages consumed by \"$i\" $status" >&2
>>> + done
>>> + echo "== debug_mem for kernel modules during loading end ==" >&2
>>> +
>>> + unset module_names
>>> + unset module_loaded
>> unset nr_alloc_pages?
> nr_alloc_pages will be set to be 0 when module_load hits, but I will update or try to change them to be local.
>
>
>>> + rm $TMPFILE -f
>>> +}
>>> +
>>> +if is_trace_data_available ; then
>>> + parse_trace_data
>>> +fi
>>> diff --git a/modules.d/99memdebug-ko/memdebug-ko-prepare.sh b/modules.d/99memdebug-ko/memdebug-ko-prepare.sh
>>> new file mode 100755
>>> index 0000000..134821a
>>> --- /dev/null
>>> +++ b/modules.d/99memdebug-ko/memdebug-ko-prepare.sh
>>> @@ -0,0 +1,60 @@
>>> +# Try to find out kernel modules with large total memory allocation during loading.
>>> +# For large slab allocation, it will fall into buddy, thus tracing "mm_page_alloc"
>>> +# alone should be enough for the purpose.
>>> +#
>>> +# Prepare for the trace setup.
>>> +
>>> +prepare_trace() {
>>> + # Only "rd.memdebug=4" is for the purpose.
>>> + if [[ $(getargnum 0 0 4 rd.memdebug) != 4 ]]; then
>>> + return 0
>>> + fi
>>> +
>>> + TRACE_BASE="/sys/kernel/debug"
>>> + # trace access through debugfs would be obsolete if "/sys/kernel/tracing" is available.
>>> + if [[ -d "/sys/kernel/tracing" ]]; then
>>> + TRACE_BASE="/sys/kernel"
>>> + fi
>> Move TRACE_BASE as a local variable, maybe a function..
> It is in a function, will change it to be local.
>
>>> +
>>> + # old debugfs case.
>>> + if ! [[ -d "$TRACE_BASE/tracing" ]]; then
>>> + mount none -t debugfs $TRACE_BASE
>>> + # new tracefs case.
>>> + elif ! [[ -f "$TRACE_BASE/tracing/trace" ]]; then
>>> + mount none -t tracefs "$TRACE_BASE/tracing"
>>> + fi
>>> +
>>> + if ! [[ -f "$TRACE_BASE/tracing/trace" ]]; then
>>> + warn "Mount trace failed for kernel module memory analyzing."
>>> + return 1
>>> + fi
>>> +
>>> + MATCH_EVENTS="module:module_put module:module_load kmem:mm_page_alloc"
>> Do we have module_init event? what is the difference between module_load
>> and module_init?
> module_load event is in some path of finit_module/init_module syscall, quite similar to
> module_init, after the event there is do_init_module()->do_one_initcall() to invoke the
> module init function, then module_put() immediately once finished which produces
> module_put event. So between the two events, there are all the needed page allocation
> events for us. I haven't see other better events.
>
> Regards,
> Xunlei
>
>>> + SET_EVENTS=$(echo $(cat $TRACE_BASE/tracing/set_event))
>>> + # Check if trace was properly setup, prepare it if not.
>>> + if [[ $(cat $TRACE_BASE/tracing/tracing_on) != 1 ]] || \
>>> + [[ "$SET_EVENTS" != "$MATCH_EVENTS" ]]; then
>>> + # Set our trace events.
>>> + echo $MATCH_EVENTS > $TRACE_BASE/tracing/set_event
>>> +
>>> + # There are three kinds of known applications for module loading:
>>> + # "systemd-udevd", "modprobe" and "insmod".
>>> + # Set them to the mm_page_alloc event filter.
>>> + # NOTE: Some kernel may not support this format of filter, anyway
>>> + # the operation will fail and it doesn't matter.
>>> + page_alloc_filter="comm == systemd-udevd || comm == modprobe || comm == insmod"
>>> + echo $page_alloc_filter > $TRACE_BASE/tracing/events/kmem/mm_page_alloc/filter
>>> +
>>> + # Set the number of comm-pid.
>>> + # Thanks to filters, 4096 is big enough(also well supported).
>>> + echo 4096 > $TRACE_BASE/tracing/saved_cmdlines_size
>>> +
>>> + # Enable and clear trace data for the first time.
>>> + echo 1 > $TRACE_BASE/tracing/tracing_on
>>> + echo > $TRACE_BASE/tracing/trace
>>> + fi
>>> +
>>> + return 0
>>> +}
>>> +
>>> +prepare_trace
>>> diff --git a/modules.d/99memdebug-ko/module-setup.sh b/modules.d/99memdebug-ko/module-setup.sh
>>> new file mode 100755
>>> index 0000000..d6a7f0c
>>> --- /dev/null
>>> +++ b/modules.d/99memdebug-ko/module-setup.sh
>>> @@ -0,0 +1,25 @@
>>> +#!/bin/bash
>>> +
>>> +# called by dracut
>>> +check() {
>>> + # this is very small, always add it.
>>> + # It has no effect if no "rd.memdebug=4" is specified, also
>>> + # this stays consistent with "rd.memdebug=[0-3]".
>>> + return 0
>>> +}
>>> +
>>> +# called by dracut
>>> +depends() {
>>> + return 0
>>> +}
>>> +
>>> +# called by dracut
>>> +install() {
>>> + # Prepare trace at the early stage before any ko modules is loaded.
>>> + inst_hook cmdline 00 "$moddir/memdebug-ko-prepare.sh"
>>> +
>>> + # Before switching root to the real root, rootfs should be mounted
>>> + # and all dependent ko modules are expected to be loaded.
>>> + # We can parse trace data at this stage.
>>> + inst_hook pre-pivot 99 "$moddir/memdebug-ko-parse.sh"
>>> +}
>>> --
>>> 1.8.3.1
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe initramfs" in
>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>> --
>> To unsubscribe from this list: send the line "unsubscribe initramfs" in
>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe initramfs" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
next prev parent reply other threads:[~2016-11-03 7:28 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-11-02 9:03 [RFC PATCH 1/2] add 99memdebug-ko dracut module Xunlei Pang
[not found] ` <1478077386-30039-1-git-send-email-xlpang-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-11-02 9:03 ` [RFC PATCH 2/2] dracut.cmdline.7.asc: update document for rd.memdebug=4 Xunlei Pang
2016-11-03 3:01 ` [RFC PATCH 1/2] add 99memdebug-ko dracut module Dave Young
[not found] ` <20161103030142.GA3201-0VdLhd/A9Pl+NNSt+8eSiB/sF2h8X+2i0E9HWUfgJXw@public.gmane.org>
2016-11-03 4:15 ` Xunlei Pang
[not found] ` <581AB9D7.2010905-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-11-03 4:45 ` Xunlei Pang
2016-11-03 7:28 ` Xunlei Pang [this message]
[not found] ` <581AE707.9050606-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-11-03 8:38 ` Dave Young
[not found] ` <20161103083826.GA15431-0VdLhd/A9Pl+NNSt+8eSiB/sF2h8X+2i0E9HWUfgJXw@public.gmane.org>
2016-11-03 11:52 ` Xunlei Pang
[not found] ` <581B2518.6060503-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-11-04 5:50 ` Dave Young
[not found] ` <20161104055026.GB2889-0VdLhd/A9Pl+NNSt+8eSiB/sF2h8X+2i0E9HWUfgJXw@public.gmane.org>
2016-11-04 6:35 ` Xunlei Pang
[not found] ` <581C2C30.8070403-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-11-04 7:06 ` Dave Young
[not found] ` <20161104070629.GA17145-0VdLhd/A9Pl+NNSt+8eSiB/sF2h8X+2i0E9HWUfgJXw@public.gmane.org>
2016-11-04 7:55 ` Xunlei Pang
[not found] ` <581C3EDE.9080601-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-11-04 8:23 ` Dave Young
[not found] ` <20161104082328.GA19275-0VdLhd/A9Pl+NNSt+8eSiB/sF2h8X+2i0E9HWUfgJXw@public.gmane.org>
2016-11-04 9:12 ` Xunlei Pang
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=581AE707.9050606@redhat.com \
--to=xpang-h+wxahxf7alqt0dzr+alfa@public.gmane.org \
--cc=dyoung-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org \
--cc=harald-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org \
--cc=initramfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org \
--cc=panand-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org \
--cc=xlpang-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox