* [RFC PATCH 1/2] add 99memdebug-ko dracut module
@ 2016-11-02 9:03 Xunlei Pang
[not found] ` <1478077386-30039-1-git-send-email-xlpang-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
0 siblings, 1 reply; 14+ messages in thread
From: Xunlei Pang @ 2016-11-02 9:03 UTC (permalink / raw)
To: Harald Hoyer; +Cc: initramfs-u79uwXL29TY76Z2rM5mHXA, Dave Young, Xunlei Pang
The current method for memory debug is to use "rd.memdebug=[0-3]",
it is not enough for debugging kernel modules. For example, when we
want to find out which kernel module consumes a large amount of memory,
"rd.memdebug=[0-3]" won't help too much.
A better way is needed to achieve this requirement, this is useful for
kdump OOM debugging.
The principle of this patch is to use kernel trace to track slab and
buddy allocation calls during kernel module loading(module_init), thus
we can analyze all the trace data and get the total memory consumption.
As for large slab allocation, it will probably fall into buddy allocation,
thus tracing "mm_page_alloc" alone should be enough for the purpose(this
saves quite some trace buffer memory, also large free is quite unlikey
during module loading, we neglect those memory free events).
The trace events include memory calls under "tracing/events/":
kmem/mm_page_alloc
We also inpect the following events to detect the module loading:
module/module_load
module/module_put
Since we use filters to trace events, the final trace data size won't
be too big. Users can adjust the trace buffer size via "trace_buf_size"
kernel boot command line as needed.
We can get the module name and task pid from "module_load" event which
also mark the beginning of the loading, and module_put called by the
same task pid implies the end of the loading. So the memory events
recorded in between by the same task pid are consumed by this module
during loading(i.e. modprobe or module_init()).
With these information, we can record the total memory(the larger, the
more precise the result will be) consumption involved by each kernel
module loading.
Thus we introduce this dracut module to find out which kernel module
consumes a large amount of memory during loading. It uses "rd.memdebug=4"
as the tigger.
After applying is patch and specifying "rd.memdebug=4", during booting
it will print out something below:
== debug_mem for kernel modules during loading begin ==
0 pages consumed by "pata_acpi"
0 pages consumed by "ata_generic"
1 pages consumed by "drm"
0 pages consumed by "ttm"
0 pages consumed by "drm_kms_helper"
835 pages consumed by "qxl"
0 pages consumed by "mii"
6 pages consumed by "8139cp"
0 pages consumed by "virtio"
0 pages consumed by "virtio_ring"
9 pages consumed by "virtio_pci"
1 pages consumed by "8139too"
0 pages consumed by "serio_raw"
0 pages consumed by "crc32c_intel"
199 pages consumed by "virtio_console"
0 pages consumed by "libcrc32c"
9 pages consumed by "xfs"
== debug_mem for kernel modules during loading end ==
From the print, we see clearly that "qxl" consumed lots of memory.
Signed-off-by: Xunlei Pang <xlpang-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
dracut.spec | 1 +
modules.d/99memdebug-ko/memdebug-ko-parse.sh | 93 ++++++++++++++++++++++++++
modules.d/99memdebug-ko/memdebug-ko-prepare.sh | 60 +++++++++++++++++
modules.d/99memdebug-ko/module-setup.sh | 25 +++++++
4 files changed, 179 insertions(+)
create mode 100755 modules.d/99memdebug-ko/memdebug-ko-parse.sh
create mode 100755 modules.d/99memdebug-ko/memdebug-ko-prepare.sh
create mode 100755 modules.d/99memdebug-ko/module-setup.sh
diff --git a/dracut.spec b/dracut.spec
index 7b9d675..3907b17 100644
--- a/dracut.spec
+++ b/dracut.spec
@@ -407,6 +407,7 @@ rm -rf -- $RPM_BUILD_ROOT
%{dracutlibdir}/modules.d/98usrmount
%{dracutlibdir}/modules.d/99base
%{dracutlibdir}/modules.d/99fs-lib
+%{dracutlibdir}/modules.d/99memdebug-ko
%{dracutlibdir}/modules.d/99shutdown
%attr(0644,root,root) %ghost %config(missingok,noreplace) %{_localstatedir}/log/dracut.log
%dir %{_sharedstatedir}/initramfs
diff --git a/modules.d/99memdebug-ko/memdebug-ko-parse.sh b/modules.d/99memdebug-ko/memdebug-ko-parse.sh
new file mode 100755
index 0000000..1542658
--- /dev/null
+++ b/modules.d/99memdebug-ko/memdebug-ko-parse.sh
@@ -0,0 +1,93 @@
+# Parse the trace data to find out the modules consuming large amount of memory.
+
+is_trace_data_available() {
+ # Only "rd.memdebug=4" is for the purpose.
+ if [[ $(getargnum 0 0 4 rd.memdebug) != 4 ]]; then
+ return 1
+ fi
+
+ TRACE_BASE="/sys/kernel/debug"
+ # trace access through debugfs would be obsolete if "/sys/kernel/tracing" is available.
+ if [[ -d "/sys/kernel/tracing" ]]; then
+ TRACE_BASE="/sys/kernel"
+ fi
+
+ if ! [[ -f "$TRACE_BASE/tracing/trace" ]]; then
+ warn "Trace inactive, skipping kernel module memory analyzing."
+ return 1
+ fi
+
+ return 0
+}
+
+parse_trace_data() {
+ # Indexed by task pid.
+ declare -A current_module
+ # Indexed by module name.
+ declare -A module_loaded
+ declare -A nr_alloc_pages
+
+ # For large trace data, parsing "tracing/trace" directly turns out to be time-consuming,
+ # hence copy it out first, then parse the copy file instead.
+ TMPFILE=/tmp/kdump.trace.tmp.$$$$
+ cp $TRACE_BASE/tracing/trace $TMPFILE -f
+ while read pid cpu flags ts function ;
+ do
+ # Skip comment lines
+ if [[ $pid = "#" ]]; then
+ continue
+ fi
+
+ if [[ $function = module_load* ]]; then
+ # One module is being loaded, save the task pid for tracking.
+ module_name=${function#*: }
+ # Remove the trailing after whitespace, there may be the module flags.
+ module_name=${module_name%% *}
+ module_names+=" $module_name"
+ current_module[$pid]="$module_name"
+ [[ ${module_loaded[$module_name]} ]] && warn "\"$module_name\" was loaded multiple times!"
+ unset module_loaded[$module_name]
+ nr_alloc_pages[$module_name]=0
+ fi
+
+ if ! [[ ${current_module[$pid]} ]]; then
+ continue
+ fi
+
+ if [[ $function = module_put* ]]; then
+ # Mark the module as loaded when the first module_put event happens after module_load.
+ module_loaded[${current_module[$pid]}]=1
+ # Module has been loaded when module_put is called, untrack the task
+ unset current_module[$pid]
+ continue
+ fi
+
+ # Once we get here, the task is being tracked(is loading a module).
+ # Get the module name.
+ module_name=${current_module[$pid]}
+
+ if [[ $function = mm_page_alloc* ]]; then
+ order=$(echo $function | sed -e 's/.*order=\([0-9]*\) .*/\1/')
+ nr_alloc_pages[$module_name]=$((${nr_alloc_pages[$module_name]}+$((2 ** $order))))
+ fi
+ done < $TMPFILE
+
+ # Print the result out.
+ echo "== debug_mem for kernel modules during loading begin ==" >&2
+ for i in $module_names; do
+ status=""
+ if ! [[ ${module_loaded[$i]} ]]; then
+ status="[still loading]"
+ fi
+ echo "${nr_alloc_pages[$i]} pages consumed by \"$i\" $status" >&2
+ done
+ echo "== debug_mem for kernel modules during loading end ==" >&2
+
+ unset module_names
+ unset module_loaded
+ rm $TMPFILE -f
+}
+
+if is_trace_data_available ; then
+ parse_trace_data
+fi
diff --git a/modules.d/99memdebug-ko/memdebug-ko-prepare.sh b/modules.d/99memdebug-ko/memdebug-ko-prepare.sh
new file mode 100755
index 0000000..134821a
--- /dev/null
+++ b/modules.d/99memdebug-ko/memdebug-ko-prepare.sh
@@ -0,0 +1,60 @@
+# Try to find out kernel modules with large total memory allocation during loading.
+# For large slab allocation, it will fall into buddy, thus tracing "mm_page_alloc"
+# alone should be enough for the purpose.
+#
+# Prepare for the trace setup.
+
+prepare_trace() {
+ # Only "rd.memdebug=4" is for the purpose.
+ if [[ $(getargnum 0 0 4 rd.memdebug) != 4 ]]; then
+ return 0
+ fi
+
+ TRACE_BASE="/sys/kernel/debug"
+ # trace access through debugfs would be obsolete if "/sys/kernel/tracing" is available.
+ if [[ -d "/sys/kernel/tracing" ]]; then
+ TRACE_BASE="/sys/kernel"
+ fi
+
+ # old debugfs case.
+ if ! [[ -d "$TRACE_BASE/tracing" ]]; then
+ mount none -t debugfs $TRACE_BASE
+ # new tracefs case.
+ elif ! [[ -f "$TRACE_BASE/tracing/trace" ]]; then
+ mount none -t tracefs "$TRACE_BASE/tracing"
+ fi
+
+ if ! [[ -f "$TRACE_BASE/tracing/trace" ]]; then
+ warn "Mount trace failed for kernel module memory analyzing."
+ return 1
+ fi
+
+ MATCH_EVENTS="module:module_put module:module_load kmem:mm_page_alloc"
+ SET_EVENTS=$(echo $(cat $TRACE_BASE/tracing/set_event))
+ # Check if trace was properly setup, prepare it if not.
+ if [[ $(cat $TRACE_BASE/tracing/tracing_on) != 1 ]] || \
+ [[ "$SET_EVENTS" != "$MATCH_EVENTS" ]]; then
+ # Set our trace events.
+ echo $MATCH_EVENTS > $TRACE_BASE/tracing/set_event
+
+ # There are three kinds of known applications for module loading:
+ # "systemd-udevd", "modprobe" and "insmod".
+ # Set them to the mm_page_alloc event filter.
+ # NOTE: Some kernel may not support this format of filter, anyway
+ # the operation will fail and it doesn't matter.
+ page_alloc_filter="comm == systemd-udevd || comm == modprobe || comm == insmod"
+ echo $page_alloc_filter > $TRACE_BASE/tracing/events/kmem/mm_page_alloc/filter
+
+ # Set the number of comm-pid.
+ # Thanks to filters, 4096 is big enough(also well supported).
+ echo 4096 > $TRACE_BASE/tracing/saved_cmdlines_size
+
+ # Enable and clear trace data for the first time.
+ echo 1 > $TRACE_BASE/tracing/tracing_on
+ echo > $TRACE_BASE/tracing/trace
+ fi
+
+ return 0
+}
+
+prepare_trace
diff --git a/modules.d/99memdebug-ko/module-setup.sh b/modules.d/99memdebug-ko/module-setup.sh
new file mode 100755
index 0000000..d6a7f0c
--- /dev/null
+++ b/modules.d/99memdebug-ko/module-setup.sh
@@ -0,0 +1,25 @@
+#!/bin/bash
+
+# called by dracut
+check() {
+ # this is very small, always add it.
+ # It has no effect if no "rd.memdebug=4" is specified, also
+ # this stays consistent with "rd.memdebug=[0-3]".
+ return 0
+}
+
+# called by dracut
+depends() {
+ return 0
+}
+
+# called by dracut
+install() {
+ # Prepare trace at the early stage before any ko modules is loaded.
+ inst_hook cmdline 00 "$moddir/memdebug-ko-prepare.sh"
+
+ # Before switching root to the real root, rootfs should be mounted
+ # and all dependent ko modules are expected to be loaded.
+ # We can parse trace data at this stage.
+ inst_hook pre-pivot 99 "$moddir/memdebug-ko-parse.sh"
+}
--
1.8.3.1
^ permalink raw reply related [flat|nested] 14+ messages in thread[parent not found: <1478077386-30039-1-git-send-email-xlpang-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* [RFC PATCH 2/2] dracut.cmdline.7.asc: update document for rd.memdebug=4 [not found] ` <1478077386-30039-1-git-send-email-xlpang-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2016-11-02 9:03 ` Xunlei Pang 2016-11-03 3:01 ` [RFC PATCH 1/2] add 99memdebug-ko dracut module Dave Young 1 sibling, 0 replies; 14+ messages in thread From: Xunlei Pang @ 2016-11-02 9:03 UTC (permalink / raw) To: Harald Hoyer; +Cc: initramfs-u79uwXL29TY76Z2rM5mHXA, Dave Young, Xunlei Pang Signed-off-by: Xunlei Pang <xlpang-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> --- dracut.cmdline.7.asc | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/dracut.cmdline.7.asc b/dracut.cmdline.7.asc index 0009ff2..70d8776 100644 --- a/dracut.cmdline.7.asc +++ b/dracut.cmdline.7.asc @@ -187,8 +187,8 @@ It should be attached to any report about dracut problems. _/run/initramfs/init.log_. If "quiet" is set, it also logs to the console. -**rd.memdebug=[0-3]**:: - Print memory usage info at various points, set the verbose level from 0 to 3. +**rd.memdebug=[0-4]**:: + Print memory usage info at various points, set the verbose level from 0 to 4. + Higher level means more debugging output: + @@ -197,6 +197,7 @@ It should be attached to any report about dracut problems. 1 - partial /proc/meminfo 2 - /proc/meminfo 3 - /proc/meminfo + /proc/slabinfo + 4 - to find out kernel modules consuming large memory during loading ---- **rd.break**:: -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [RFC PATCH 1/2] add 99memdebug-ko dracut module [not found] ` <1478077386-30039-1-git-send-email-xlpang-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2016-11-02 9:03 ` [RFC PATCH 2/2] dracut.cmdline.7.asc: update document for rd.memdebug=4 Xunlei Pang @ 2016-11-03 3:01 ` Dave Young [not found] ` <20161103030142.GA3201-0VdLhd/A9Pl+NNSt+8eSiB/sF2h8X+2i0E9HWUfgJXw@public.gmane.org> 1 sibling, 1 reply; 14+ messages in thread From: Dave Young @ 2016-11-03 3:01 UTC (permalink / raw) To: Xunlei Pang; +Cc: Harald Hoyer, initramfs-u79uwXL29TY76Z2rM5mHXA Hi Xunlei, On 11/02/16 at 05:03pm, Xunlei Pang wrote: > The current method for memory debug is to use "rd.memdebug=[0-3]", > it is not enough for debugging kernel modules. For example, when we > want to find out which kernel module consumes a large amount of memory, > "rd.memdebug=[0-3]" won't help too much. > > A better way is needed to achieve this requirement, this is useful for > kdump OOM debugging. > > The principle of this patch is to use kernel trace to track slab and > buddy allocation calls during kernel module loading(module_init), thus > we can analyze all the trace data and get the total memory consumption. > As for large slab allocation, it will probably fall into buddy allocation, > thus tracing "mm_page_alloc" alone should be enough for the purpose(this > saves quite some trace buffer memory, also large free is quite unlikey > during module loading, we neglect those memory free events). > > The trace events include memory calls under "tracing/events/": > kmem/mm_page_alloc > > We also inpect the following events to detect the module loading: > module/module_load > module/module_put > > Since we use filters to trace events, the final trace data size won't > be too big. Users can adjust the trace buffer size via "trace_buf_size" > kernel boot command line as needed. > > We can get the module name and task pid from "module_load" event which > also mark the beginning of the loading, and module_put called by the > same task pid implies the end of the loading. So the memory events > recorded in between by the same task pid are consumed by this module > during loading(i.e. modprobe or module_init()). > > With these information, we can record the total memory(the larger, the > more precise the result will be) consumption involved by each kernel > module loading. > > Thus we introduce this dracut module to find out which kernel module > consumes a large amount of memory during loading. It uses "rd.memdebug=4" > as the tigger. > > After applying is patch and specifying "rd.memdebug=4", during booting > it will print out something below: > == debug_mem for kernel modules during loading begin == > 0 pages consumed by "pata_acpi" > 0 pages consumed by "ata_generic" > 1 pages consumed by "drm" > 0 pages consumed by "ttm" > 0 pages consumed by "drm_kms_helper" > 835 pages consumed by "qxl" > 0 pages consumed by "mii" > 6 pages consumed by "8139cp" > 0 pages consumed by "virtio" > 0 pages consumed by "virtio_ring" > 9 pages consumed by "virtio_pci" > 1 pages consumed by "8139too" > 0 pages consumed by "serio_raw" > 0 pages consumed by "crc32c_intel" > 199 pages consumed by "virtio_console" > 0 pages consumed by "libcrc32c" > 9 pages consumed by "xfs" > == debug_mem for kernel modules during loading end == > > From the print, we see clearly that "qxl" consumed lots of memory. > > Signed-off-by: Xunlei Pang <xlpang-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > --- > dracut.spec | 1 + > modules.d/99memdebug-ko/memdebug-ko-parse.sh | 93 ++++++++++++++++++++++++++ > modules.d/99memdebug-ko/memdebug-ko-prepare.sh | 60 +++++++++++++++++ > modules.d/99memdebug-ko/module-setup.sh | 25 +++++++ > 4 files changed, 179 insertions(+) > create mode 100755 modules.d/99memdebug-ko/memdebug-ko-parse.sh > create mode 100755 modules.d/99memdebug-ko/memdebug-ko-prepare.sh > create mode 100755 modules.d/99memdebug-ko/module-setup.sh > > diff --git a/dracut.spec b/dracut.spec > index 7b9d675..3907b17 100644 > --- a/dracut.spec > +++ b/dracut.spec > @@ -407,6 +407,7 @@ rm -rf -- $RPM_BUILD_ROOT > %{dracutlibdir}/modules.d/98usrmount > %{dracutlibdir}/modules.d/99base > %{dracutlibdir}/modules.d/99fs-lib > +%{dracutlibdir}/modules.d/99memdebug-ko > %{dracutlibdir}/modules.d/99shutdown > %attr(0644,root,root) %ghost %config(missingok,noreplace) %{_localstatedir}/log/dracut.log > %dir %{_sharedstatedir}/initramfs > diff --git a/modules.d/99memdebug-ko/memdebug-ko-parse.sh b/modules.d/99memdebug-ko/memdebug-ko-parse.sh > new file mode 100755 > index 0000000..1542658 > --- /dev/null > +++ b/modules.d/99memdebug-ko/memdebug-ko-parse.sh > @@ -0,0 +1,93 @@ > +# Parse the trace data to find out the modules consuming large amount of memory. > + > +is_trace_data_available() { > + # Only "rd.memdebug=4" is for the purpose. > + if [[ $(getargnum 0 0 4 rd.memdebug) != 4 ]]; then > + return 1 > + fi The usage of rd.memdebug is different in other place, can we integrate it to the original rd.memdbug code? One concern it in this patch only parse the trace data at pre-pivot hook, so if oom happens earlier than that we will get nothing. Original rd.memdebug being added to several different point, if we integrate this to the framework then we can get the out put early. > + > + TRACE_BASE="/sys/kernel/debug" > + # trace access through debugfs would be obsolete if "/sys/kernel/tracing" is available. > + if [[ -d "/sys/kernel/tracing" ]]; then > + TRACE_BASE="/sys/kernel" > + fi > + > + if ! [[ -f "$TRACE_BASE/tracing/trace" ]]; then > + warn "Trace inactive, skipping kernel module memory analyzing." > + return 1 > + fi This function is only for checking if tracing is needed by checking cmdline, so getting the TRACE_BASE logic can be moved to parse_trace_data so that it can be a local variable.. > + > + return 0 > +} > + > +parse_trace_data() { > + # Indexed by task pid. > + declare -A current_module > + # Indexed by module name. > + declare -A module_loaded > + declare -A nr_alloc_pages > + > + # For large trace data, parsing "tracing/trace" directly turns out to be time-consuming, > + # hence copy it out first, then parse the copy file instead. Have you get any data about the time used via traceing/trace parsing, if the time is still acceptable then it will be better to avoid the tmp file. > + TMPFILE=/tmp/kdump.trace.tmp.$$$$ > + cp $TRACE_BASE/tracing/trace $TMPFILE -f > + while read pid cpu flags ts function ; > + do > + # Skip comment lines > + if [[ $pid = "#" ]]; then > + continue > + fi > + > + if [[ $function = module_load* ]]; then > + # One module is being loaded, save the task pid for tracking. > + module_name=${function#*: } > + # Remove the trailing after whitespace, there may be the module flags. > + module_name=${module_name%% *} > + module_names+=" $module_name" > + current_module[$pid]="$module_name" > + [[ ${module_loaded[$module_name]} ]] && warn "\"$module_name\" was loaded multiple times!" > + unset module_loaded[$module_name] > + nr_alloc_pages[$module_name]=0 > + fi > + > + if ! [[ ${current_module[$pid]} ]]; then > + continue > + fi > + > + if [[ $function = module_put* ]]; then > + # Mark the module as loaded when the first module_put event happens after module_load. > + module_loaded[${current_module[$pid]}]=1 > + # Module has been loaded when module_put is called, untrack the task > + unset current_module[$pid] > + continue > + fi > + > + # Once we get here, the task is being tracked(is loading a module). > + # Get the module name. > + module_name=${current_module[$pid]} > + > + if [[ $function = mm_page_alloc* ]]; then > + order=$(echo $function | sed -e 's/.*order=\([0-9]*\) .*/\1/') > + nr_alloc_pages[$module_name]=$((${nr_alloc_pages[$module_name]}+$((2 ** $order)))) > + fi > + done < $TMPFILE > + > + # Print the result out. > + echo "== debug_mem for kernel modules during loading begin ==" >&2 > + for i in $module_names; do > + status="" > + if ! [[ ${module_loaded[$i]} ]]; then > + status="[still loading]" > + fi > + echo "${nr_alloc_pages[$i]} pages consumed by \"$i\" $status" >&2 > + done > + echo "== debug_mem for kernel modules during loading end ==" >&2 > + > + unset module_names > + unset module_loaded unset nr_alloc_pages? > + rm $TMPFILE -f > +} > + > +if is_trace_data_available ; then > + parse_trace_data > +fi > diff --git a/modules.d/99memdebug-ko/memdebug-ko-prepare.sh b/modules.d/99memdebug-ko/memdebug-ko-prepare.sh > new file mode 100755 > index 0000000..134821a > --- /dev/null > +++ b/modules.d/99memdebug-ko/memdebug-ko-prepare.sh > @@ -0,0 +1,60 @@ > +# Try to find out kernel modules with large total memory allocation during loading. > +# For large slab allocation, it will fall into buddy, thus tracing "mm_page_alloc" > +# alone should be enough for the purpose. > +# > +# Prepare for the trace setup. > + > +prepare_trace() { > + # Only "rd.memdebug=4" is for the purpose. > + if [[ $(getargnum 0 0 4 rd.memdebug) != 4 ]]; then > + return 0 > + fi > + > + TRACE_BASE="/sys/kernel/debug" > + # trace access through debugfs would be obsolete if "/sys/kernel/tracing" is available. > + if [[ -d "/sys/kernel/tracing" ]]; then > + TRACE_BASE="/sys/kernel" > + fi Move TRACE_BASE as a local variable, maybe a function.. > + > + # old debugfs case. > + if ! [[ -d "$TRACE_BASE/tracing" ]]; then > + mount none -t debugfs $TRACE_BASE > + # new tracefs case. > + elif ! [[ -f "$TRACE_BASE/tracing/trace" ]]; then > + mount none -t tracefs "$TRACE_BASE/tracing" > + fi > + > + if ! [[ -f "$TRACE_BASE/tracing/trace" ]]; then > + warn "Mount trace failed for kernel module memory analyzing." > + return 1 > + fi > + > + MATCH_EVENTS="module:module_put module:module_load kmem:mm_page_alloc" Do we have module_init event? what is the difference between module_load and module_init? > + SET_EVENTS=$(echo $(cat $TRACE_BASE/tracing/set_event)) > + # Check if trace was properly setup, prepare it if not. > + if [[ $(cat $TRACE_BASE/tracing/tracing_on) != 1 ]] || \ > + [[ "$SET_EVENTS" != "$MATCH_EVENTS" ]]; then > + # Set our trace events. > + echo $MATCH_EVENTS > $TRACE_BASE/tracing/set_event > + > + # There are three kinds of known applications for module loading: > + # "systemd-udevd", "modprobe" and "insmod". > + # Set them to the mm_page_alloc event filter. > + # NOTE: Some kernel may not support this format of filter, anyway > + # the operation will fail and it doesn't matter. > + page_alloc_filter="comm == systemd-udevd || comm == modprobe || comm == insmod" > + echo $page_alloc_filter > $TRACE_BASE/tracing/events/kmem/mm_page_alloc/filter > + > + # Set the number of comm-pid. > + # Thanks to filters, 4096 is big enough(also well supported). > + echo 4096 > $TRACE_BASE/tracing/saved_cmdlines_size > + > + # Enable and clear trace data for the first time. > + echo 1 > $TRACE_BASE/tracing/tracing_on > + echo > $TRACE_BASE/tracing/trace > + fi > + > + return 0 > +} > + > +prepare_trace > diff --git a/modules.d/99memdebug-ko/module-setup.sh b/modules.d/99memdebug-ko/module-setup.sh > new file mode 100755 > index 0000000..d6a7f0c > --- /dev/null > +++ b/modules.d/99memdebug-ko/module-setup.sh > @@ -0,0 +1,25 @@ > +#!/bin/bash > + > +# called by dracut > +check() { > + # this is very small, always add it. > + # It has no effect if no "rd.memdebug=4" is specified, also > + # this stays consistent with "rd.memdebug=[0-3]". > + return 0 > +} > + > +# called by dracut > +depends() { > + return 0 > +} > + > +# called by dracut > +install() { > + # Prepare trace at the early stage before any ko modules is loaded. > + inst_hook cmdline 00 "$moddir/memdebug-ko-prepare.sh" > + > + # Before switching root to the real root, rootfs should be mounted > + # and all dependent ko modules are expected to be loaded. > + # We can parse trace data at this stage. > + inst_hook pre-pivot 99 "$moddir/memdebug-ko-parse.sh" > +} > -- > 1.8.3.1 > > -- > To unsubscribe from this list: send the line "unsubscribe initramfs" in > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 14+ messages in thread
[parent not found: <20161103030142.GA3201-0VdLhd/A9Pl+NNSt+8eSiB/sF2h8X+2i0E9HWUfgJXw@public.gmane.org>]
* Re: [RFC PATCH 1/2] add 99memdebug-ko dracut module [not found] ` <20161103030142.GA3201-0VdLhd/A9Pl+NNSt+8eSiB/sF2h8X+2i0E9HWUfgJXw@public.gmane.org> @ 2016-11-03 4:15 ` Xunlei Pang [not found] ` <581AB9D7.2010905-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 14+ messages in thread From: Xunlei Pang @ 2016-11-03 4:15 UTC (permalink / raw) To: Dave Young, Xunlei Pang; +Cc: Harald Hoyer, initramfs-u79uwXL29TY76Z2rM5mHXA On 2016/11/03 at 11:01, Dave Young wrote: > Hi Xunlei, > On 11/02/16 at 05:03pm, Xunlei Pang wrote: >> The current method for memory debug is to use "rd.memdebug=[0-3]", >> it is not enough for debugging kernel modules. For example, when we >> want to find out which kernel module consumes a large amount of memory, >> "rd.memdebug=[0-3]" won't help too much. >> >> A better way is needed to achieve this requirement, this is useful for >> kdump OOM debugging. >> >> The principle of this patch is to use kernel trace to track slab and >> buddy allocation calls during kernel module loading(module_init), thus >> we can analyze all the trace data and get the total memory consumption. >> As for large slab allocation, it will probably fall into buddy allocation, >> thus tracing "mm_page_alloc" alone should be enough for the purpose(this >> saves quite some trace buffer memory, also large free is quite unlikey >> during module loading, we neglect those memory free events). >> >> The trace events include memory calls under "tracing/events/": >> kmem/mm_page_alloc >> >> We also inpect the following events to detect the module loading: >> module/module_load >> module/module_put >> >> Since we use filters to trace events, the final trace data size won't >> be too big. Users can adjust the trace buffer size via "trace_buf_size" >> kernel boot command line as needed. >> >> We can get the module name and task pid from "module_load" event which >> also mark the beginning of the loading, and module_put called by the >> same task pid implies the end of the loading. So the memory events >> recorded in between by the same task pid are consumed by this module >> during loading(i.e. modprobe or module_init()). >> >> With these information, we can record the total memory(the larger, the >> more precise the result will be) consumption involved by each kernel >> module loading. >> >> Thus we introduce this dracut module to find out which kernel module >> consumes a large amount of memory during loading. It uses "rd.memdebug=4" >> as the tigger. >> >> After applying is patch and specifying "rd.memdebug=4", during booting >> it will print out something below: >> == debug_mem for kernel modules during loading begin == >> 0 pages consumed by "pata_acpi" >> 0 pages consumed by "ata_generic" >> 1 pages consumed by "drm" >> 0 pages consumed by "ttm" >> 0 pages consumed by "drm_kms_helper" >> 835 pages consumed by "qxl" >> 0 pages consumed by "mii" >> 6 pages consumed by "8139cp" >> 0 pages consumed by "virtio" >> 0 pages consumed by "virtio_ring" >> 9 pages consumed by "virtio_pci" >> 1 pages consumed by "8139too" >> 0 pages consumed by "serio_raw" >> 0 pages consumed by "crc32c_intel" >> 199 pages consumed by "virtio_console" >> 0 pages consumed by "libcrc32c" >> 9 pages consumed by "xfs" >> == debug_mem for kernel modules during loading end == >> >> From the print, we see clearly that "qxl" consumed lots of memory. >> >> Signed-off-by: Xunlei Pang <xlpang-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> >> --- >> dracut.spec | 1 + >> modules.d/99memdebug-ko/memdebug-ko-parse.sh | 93 ++++++++++++++++++++++++++ >> modules.d/99memdebug-ko/memdebug-ko-prepare.sh | 60 +++++++++++++++++ >> modules.d/99memdebug-ko/module-setup.sh | 25 +++++++ >> 4 files changed, 179 insertions(+) >> create mode 100755 modules.d/99memdebug-ko/memdebug-ko-parse.sh >> create mode 100755 modules.d/99memdebug-ko/memdebug-ko-prepare.sh >> create mode 100755 modules.d/99memdebug-ko/module-setup.sh >> >> diff --git a/dracut.spec b/dracut.spec >> index 7b9d675..3907b17 100644 >> --- a/dracut.spec >> +++ b/dracut.spec >> @@ -407,6 +407,7 @@ rm -rf -- $RPM_BUILD_ROOT >> %{dracutlibdir}/modules.d/98usrmount >> %{dracutlibdir}/modules.d/99base >> %{dracutlibdir}/modules.d/99fs-lib >> +%{dracutlibdir}/modules.d/99memdebug-ko >> %{dracutlibdir}/modules.d/99shutdown >> %attr(0644,root,root) %ghost %config(missingok,noreplace) %{_localstatedir}/log/dracut.log >> %dir %{_sharedstatedir}/initramfs >> diff --git a/modules.d/99memdebug-ko/memdebug-ko-parse.sh b/modules.d/99memdebug-ko/memdebug-ko-parse.sh >> new file mode 100755 >> index 0000000..1542658 >> --- /dev/null >> +++ b/modules.d/99memdebug-ko/memdebug-ko-parse.sh >> @@ -0,0 +1,93 @@ >> +# Parse the trace data to find out the modules consuming large amount of memory. >> + >> +is_trace_data_available() { >> + # Only "rd.memdebug=4" is for the purpose. >> + if [[ $(getargnum 0 0 4 rd.memdebug) != 4 ]]; then >> + return 1 >> + fi Hi Dave, Firstly, thanks for your valuable comments. > The usage of rd.memdebug is different in other place, can we integrate > it to the original rd.memdbug code? > > One concern it in this patch only parse the trace data at pre-pivot > hook, so if oom happens earlier than that we will get nothing. > > Original rd.memdebug being added to several different point, if we > integrate this to the framework then we can get the out put early. Yes, they can be integrated together. My original thought is that to just output the memory information for all the modules at one stage where all modules are supposed to be loaded, as such can avoid printing redundant/partial information if at multiple stages(hooks). If OOM happens (mainly for kdump cases), we can enlarge the reserved crash memory and try again, because this feature itself might consume large trace buffer memory. But I can do in that way if you think it is fine. >> + >> + TRACE_BASE="/sys/kernel/debug" >> + # trace access through debugfs would be obsolete if "/sys/kernel/tracing" is available. >> + if [[ -d "/sys/kernel/tracing" ]]; then >> + TRACE_BASE="/sys/kernel" >> + fi >> + >> + if ! [[ -f "$TRACE_BASE/tracing/trace" ]]; then >> + warn "Trace inactive, skipping kernel module memory analyzing." >> + return 1 >> + fi > This function is only for checking if tracing is needed by checking > cmdline, so getting the TRACE_BASE logic can be moved to > parse_trace_data so that it can be a local variable.. It is actually in a function, will change it to be a local definition. > >> + >> + return 0 >> +} >> + >> +parse_trace_data() { >> + # Indexed by task pid. >> + declare -A current_module >> + # Indexed by module name. >> + declare -A module_loaded >> + declare -A nr_alloc_pages >> + >> + # For large trace data, parsing "tracing/trace" directly turns out to be time-consuming, >> + # hence copy it out first, then parse the copy file instead. > Have you get any data about the time used via traceing/trace parsing, if > the time is still acceptable then it will be better to avoid the tmp > file. For large trace data(tested on rhel7, the filter doesn't work on rhel7, and will produce huge trace data), the time consumption is huge, I am afraid in minutes because I once suspected the script was in some dead loop when parsing "tracing/trace" directly. It is the same situation when turning off tracing_on and try again. > >> + TMPFILE=/tmp/kdump.trace.tmp.$$$$ >> + cp $TRACE_BASE/tracing/trace $TMPFILE -f >> + while read pid cpu flags ts function ; >> + do >> + # Skip comment lines >> + if [[ $pid = "#" ]]; then >> + continue >> + fi >> + >> + if [[ $function = module_load* ]]; then >> + # One module is being loaded, save the task pid for tracking. >> + module_name=${function#*: } >> + # Remove the trailing after whitespace, there may be the module flags. >> + module_name=${module_name%% *} >> + module_names+=" $module_name" >> + current_module[$pid]="$module_name" >> + [[ ${module_loaded[$module_name]} ]] && warn "\"$module_name\" was loaded multiple times!" >> + unset module_loaded[$module_name] >> + nr_alloc_pages[$module_name]=0 >> + fi >> + >> + if ! [[ ${current_module[$pid]} ]]; then >> + continue >> + fi >> + >> + if [[ $function = module_put* ]]; then >> + # Mark the module as loaded when the first module_put event happens after module_load. >> + module_loaded[${current_module[$pid]}]=1 >> + # Module has been loaded when module_put is called, untrack the task >> + unset current_module[$pid] >> + continue >> + fi >> + >> + # Once we get here, the task is being tracked(is loading a module). >> + # Get the module name. >> + module_name=${current_module[$pid]} >> + >> + if [[ $function = mm_page_alloc* ]]; then >> + order=$(echo $function | sed -e 's/.*order=\([0-9]*\) .*/\1/') >> + nr_alloc_pages[$module_name]=$((${nr_alloc_pages[$module_name]}+$((2 ** $order)))) >> + fi >> + done < $TMPFILE >> + >> + # Print the result out. >> + echo "== debug_mem for kernel modules during loading begin ==" >&2 >> + for i in $module_names; do >> + status="" >> + if ! [[ ${module_loaded[$i]} ]]; then >> + status="[still loading]" >> + fi >> + echo "${nr_alloc_pages[$i]} pages consumed by \"$i\" $status" >&2 >> + done >> + echo "== debug_mem for kernel modules during loading end ==" >&2 >> + >> + unset module_names >> + unset module_loaded > unset nr_alloc_pages? nr_alloc_pages will be set to be 0 when module_load hits, but I will update or try to change them to be local. > >> + rm $TMPFILE -f >> +} >> + >> +if is_trace_data_available ; then >> + parse_trace_data >> +fi >> diff --git a/modules.d/99memdebug-ko/memdebug-ko-prepare.sh b/modules.d/99memdebug-ko/memdebug-ko-prepare.sh >> new file mode 100755 >> index 0000000..134821a >> --- /dev/null >> +++ b/modules.d/99memdebug-ko/memdebug-ko-prepare.sh >> @@ -0,0 +1,60 @@ >> +# Try to find out kernel modules with large total memory allocation during loading. >> +# For large slab allocation, it will fall into buddy, thus tracing "mm_page_alloc" >> +# alone should be enough for the purpose. >> +# >> +# Prepare for the trace setup. >> + >> +prepare_trace() { >> + # Only "rd.memdebug=4" is for the purpose. >> + if [[ $(getargnum 0 0 4 rd.memdebug) != 4 ]]; then >> + return 0 >> + fi >> + >> + TRACE_BASE="/sys/kernel/debug" >> + # trace access through debugfs would be obsolete if "/sys/kernel/tracing" is available. >> + if [[ -d "/sys/kernel/tracing" ]]; then >> + TRACE_BASE="/sys/kernel" >> + fi > Move TRACE_BASE as a local variable, maybe a function.. It is in a function, will change it to be local. > >> + >> + # old debugfs case. >> + if ! [[ -d "$TRACE_BASE/tracing" ]]; then >> + mount none -t debugfs $TRACE_BASE >> + # new tracefs case. >> + elif ! [[ -f "$TRACE_BASE/tracing/trace" ]]; then >> + mount none -t tracefs "$TRACE_BASE/tracing" >> + fi >> + >> + if ! [[ -f "$TRACE_BASE/tracing/trace" ]]; then >> + warn "Mount trace failed for kernel module memory analyzing." >> + return 1 >> + fi >> + >> + MATCH_EVENTS="module:module_put module:module_load kmem:mm_page_alloc" > Do we have module_init event? what is the difference between module_load > and module_init? module_load event is in some path of finit_module/init_module syscall, quite similar to module_init, after the event there is do_init_module()->do_one_initcall() to invoke the module init function, then module_put() immediately once finished which produces module_put event. So between the two events, there are all the needed page allocation events for us. I haven't see other better events. Regards, Xunlei > >> + SET_EVENTS=$(echo $(cat $TRACE_BASE/tracing/set_event)) >> + # Check if trace was properly setup, prepare it if not. >> + if [[ $(cat $TRACE_BASE/tracing/tracing_on) != 1 ]] || \ >> + [[ "$SET_EVENTS" != "$MATCH_EVENTS" ]]; then >> + # Set our trace events. >> + echo $MATCH_EVENTS > $TRACE_BASE/tracing/set_event >> + >> + # There are three kinds of known applications for module loading: >> + # "systemd-udevd", "modprobe" and "insmod". >> + # Set them to the mm_page_alloc event filter. >> + # NOTE: Some kernel may not support this format of filter, anyway >> + # the operation will fail and it doesn't matter. >> + page_alloc_filter="comm == systemd-udevd || comm == modprobe || comm == insmod" >> + echo $page_alloc_filter > $TRACE_BASE/tracing/events/kmem/mm_page_alloc/filter >> + >> + # Set the number of comm-pid. >> + # Thanks to filters, 4096 is big enough(also well supported). >> + echo 4096 > $TRACE_BASE/tracing/saved_cmdlines_size >> + >> + # Enable and clear trace data for the first time. >> + echo 1 > $TRACE_BASE/tracing/tracing_on >> + echo > $TRACE_BASE/tracing/trace >> + fi >> + >> + return 0 >> +} >> + >> +prepare_trace >> diff --git a/modules.d/99memdebug-ko/module-setup.sh b/modules.d/99memdebug-ko/module-setup.sh >> new file mode 100755 >> index 0000000..d6a7f0c >> --- /dev/null >> +++ b/modules.d/99memdebug-ko/module-setup.sh >> @@ -0,0 +1,25 @@ >> +#!/bin/bash >> + >> +# called by dracut >> +check() { >> + # this is very small, always add it. >> + # It has no effect if no "rd.memdebug=4" is specified, also >> + # this stays consistent with "rd.memdebug=[0-3]". >> + return 0 >> +} >> + >> +# called by dracut >> +depends() { >> + return 0 >> +} >> + >> +# called by dracut >> +install() { >> + # Prepare trace at the early stage before any ko modules is loaded. >> + inst_hook cmdline 00 "$moddir/memdebug-ko-prepare.sh" >> + >> + # Before switching root to the real root, rootfs should be mounted >> + # and all dependent ko modules are expected to be loaded. >> + # We can parse trace data at this stage. >> + inst_hook pre-pivot 99 "$moddir/memdebug-ko-parse.sh" >> +} >> -- >> 1.8.3.1 >> >> -- >> To unsubscribe from this list: send the line "unsubscribe initramfs" in >> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html > -- > To unsubscribe from this list: send the line "unsubscribe initramfs" in > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 14+ messages in thread
[parent not found: <581AB9D7.2010905-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [RFC PATCH 1/2] add 99memdebug-ko dracut module [not found] ` <581AB9D7.2010905-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2016-11-03 4:45 ` Xunlei Pang 2016-11-03 7:28 ` Xunlei Pang 1 sibling, 0 replies; 14+ messages in thread From: Xunlei Pang @ 2016-11-03 4:45 UTC (permalink / raw) To: Dave Young, Xunlei Pang Cc: Harald Hoyer, initramfs-u79uwXL29TY76Z2rM5mHXA, Pratyush Anand Ccing Pratyush who is familiar with tracing On 2016/11/03 at 12:15, Xunlei Pang wrote: > On 2016/11/03 at 11:01, Dave Young wrote: >> Hi Xunlei, >> On 11/02/16 at 05:03pm, Xunlei Pang wrote: >>> The current method for memory debug is to use "rd.memdebug=[0-3]", >>> it is not enough for debugging kernel modules. For example, when we >>> want to find out which kernel module consumes a large amount of memory, >>> "rd.memdebug=[0-3]" won't help too much. >>> >>> A better way is needed to achieve this requirement, this is useful for >>> kdump OOM debugging. >>> >>> The principle of this patch is to use kernel trace to track slab and >>> buddy allocation calls during kernel module loading(module_init), thus >>> we can analyze all the trace data and get the total memory consumption. >>> As for large slab allocation, it will probably fall into buddy allocation, >>> thus tracing "mm_page_alloc" alone should be enough for the purpose(this >>> saves quite some trace buffer memory, also large free is quite unlikey >>> during module loading, we neglect those memory free events). >>> >>> The trace events include memory calls under "tracing/events/": >>> kmem/mm_page_alloc >>> >>> We also inpect the following events to detect the module loading: >>> module/module_load >>> module/module_put >>> >>> Since we use filters to trace events, the final trace data size won't >>> be too big. Users can adjust the trace buffer size via "trace_buf_size" >>> kernel boot command line as needed. >>> >>> We can get the module name and task pid from "module_load" event which >>> also mark the beginning of the loading, and module_put called by the >>> same task pid implies the end of the loading. So the memory events >>> recorded in between by the same task pid are consumed by this module >>> during loading(i.e. modprobe or module_init()). >>> >>> With these information, we can record the total memory(the larger, the >>> more precise the result will be) consumption involved by each kernel >>> module loading. >>> >>> Thus we introduce this dracut module to find out which kernel module >>> consumes a large amount of memory during loading. It uses "rd.memdebug=4" >>> as the tigger. >>> >>> After applying is patch and specifying "rd.memdebug=4", during booting >>> it will print out something below: >>> == debug_mem for kernel modules during loading begin == >>> 0 pages consumed by "pata_acpi" >>> 0 pages consumed by "ata_generic" >>> 1 pages consumed by "drm" >>> 0 pages consumed by "ttm" >>> 0 pages consumed by "drm_kms_helper" >>> 835 pages consumed by "qxl" >>> 0 pages consumed by "mii" >>> 6 pages consumed by "8139cp" >>> 0 pages consumed by "virtio" >>> 0 pages consumed by "virtio_ring" >>> 9 pages consumed by "virtio_pci" >>> 1 pages consumed by "8139too" >>> 0 pages consumed by "serio_raw" >>> 0 pages consumed by "crc32c_intel" >>> 199 pages consumed by "virtio_console" >>> 0 pages consumed by "libcrc32c" >>> 9 pages consumed by "xfs" >>> == debug_mem for kernel modules during loading end == >>> >>> From the print, we see clearly that "qxl" consumed lots of memory. >>> >>> Signed-off-by: Xunlei Pang <xlpang-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> >>> --- >>> dracut.spec | 1 + >>> modules.d/99memdebug-ko/memdebug-ko-parse.sh | 93 ++++++++++++++++++++++++++ >>> modules.d/99memdebug-ko/memdebug-ko-prepare.sh | 60 +++++++++++++++++ >>> modules.d/99memdebug-ko/module-setup.sh | 25 +++++++ >>> 4 files changed, 179 insertions(+) >>> create mode 100755 modules.d/99memdebug-ko/memdebug-ko-parse.sh >>> create mode 100755 modules.d/99memdebug-ko/memdebug-ko-prepare.sh >>> create mode 100755 modules.d/99memdebug-ko/module-setup.sh >>> >>> diff --git a/dracut.spec b/dracut.spec >>> index 7b9d675..3907b17 100644 >>> --- a/dracut.spec >>> +++ b/dracut.spec >>> @@ -407,6 +407,7 @@ rm -rf -- $RPM_BUILD_ROOT >>> %{dracutlibdir}/modules.d/98usrmount >>> %{dracutlibdir}/modules.d/99base >>> %{dracutlibdir}/modules.d/99fs-lib >>> +%{dracutlibdir}/modules.d/99memdebug-ko >>> %{dracutlibdir}/modules.d/99shutdown >>> %attr(0644,root,root) %ghost %config(missingok,noreplace) %{_localstatedir}/log/dracut.log >>> %dir %{_sharedstatedir}/initramfs >>> diff --git a/modules.d/99memdebug-ko/memdebug-ko-parse.sh b/modules.d/99memdebug-ko/memdebug-ko-parse.sh >>> new file mode 100755 >>> index 0000000..1542658 >>> --- /dev/null >>> +++ b/modules.d/99memdebug-ko/memdebug-ko-parse.sh >>> @@ -0,0 +1,93 @@ >>> +# Parse the trace data to find out the modules consuming large amount of memory. >>> + >>> +is_trace_data_available() { >>> + # Only "rd.memdebug=4" is for the purpose. >>> + if [[ $(getargnum 0 0 4 rd.memdebug) != 4 ]]; then >>> + return 1 >>> + fi > Hi Dave, > > Firstly, thanks for your valuable comments. > >> The usage of rd.memdebug is different in other place, can we integrate >> it to the original rd.memdbug code? >> >> One concern it in this patch only parse the trace data at pre-pivot >> hook, so if oom happens earlier than that we will get nothing. >> >> Original rd.memdebug being added to several different point, if we >> integrate this to the framework then we can get the out put early. > Yes, they can be integrated together. My original thought is that to > just output the memory information for all the modules at one stage > where all modules are supposed to be loaded, as such can avoid printing > redundant/partial information if at multiple stages(hooks). > > If OOM happens (mainly for kdump cases), we can enlarge the reserved > crash memory and try again, because this feature itself might consume > large trace buffer memory. > > But I can do in that way if you think it is fine. > >>> + >>> + TRACE_BASE="/sys/kernel/debug" >>> + # trace access through debugfs would be obsolete if "/sys/kernel/tracing" is available. >>> + if [[ -d "/sys/kernel/tracing" ]]; then >>> + TRACE_BASE="/sys/kernel" >>> + fi >>> + >>> + if ! [[ -f "$TRACE_BASE/tracing/trace" ]]; then >>> + warn "Trace inactive, skipping kernel module memory analyzing." >>> + return 1 >>> + fi >> This function is only for checking if tracing is needed by checking >> cmdline, so getting the TRACE_BASE logic can be moved to >> parse_trace_data so that it can be a local variable.. > It is actually in a function, will change it to be a local definition. > >>> + >>> + return 0 >>> +} >>> + >>> +parse_trace_data() { >>> + # Indexed by task pid. >>> + declare -A current_module >>> + # Indexed by module name. >>> + declare -A module_loaded >>> + declare -A nr_alloc_pages >>> + >>> + # For large trace data, parsing "tracing/trace" directly turns out to be time-consuming, >>> + # hence copy it out first, then parse the copy file instead. >> Have you get any data about the time used via traceing/trace parsing, if >> the time is still acceptable then it will be better to avoid the tmp >> file. > For large trace data(tested on rhel7, the filter doesn't work on rhel7, and will produce huge trace data), > the time consumption is huge, I am afraid in minutes because I once suspected the script was in some > dead loop when parsing "tracing/trace" directly. It is the same situation when turning off tracing_on and > try again. > >>> + TMPFILE=/tmp/kdump.trace.tmp.$$$$ >>> + cp $TRACE_BASE/tracing/trace $TMPFILE -f >>> + while read pid cpu flags ts function ; >>> + do >>> + # Skip comment lines >>> + if [[ $pid = "#" ]]; then >>> + continue >>> + fi >>> + >>> + if [[ $function = module_load* ]]; then >>> + # One module is being loaded, save the task pid for tracking. >>> + module_name=${function#*: } >>> + # Remove the trailing after whitespace, there may be the module flags. >>> + module_name=${module_name%% *} >>> + module_names+=" $module_name" >>> + current_module[$pid]="$module_name" >>> + [[ ${module_loaded[$module_name]} ]] && warn "\"$module_name\" was loaded multiple times!" >>> + unset module_loaded[$module_name] >>> + nr_alloc_pages[$module_name]=0 >>> + fi >>> + >>> + if ! [[ ${current_module[$pid]} ]]; then >>> + continue >>> + fi >>> + >>> + if [[ $function = module_put* ]]; then >>> + # Mark the module as loaded when the first module_put event happens after module_load. >>> + module_loaded[${current_module[$pid]}]=1 >>> + # Module has been loaded when module_put is called, untrack the task >>> + unset current_module[$pid] >>> + continue >>> + fi >>> + >>> + # Once we get here, the task is being tracked(is loading a module). >>> + # Get the module name. >>> + module_name=${current_module[$pid]} >>> + >>> + if [[ $function = mm_page_alloc* ]]; then >>> + order=$(echo $function | sed -e 's/.*order=\([0-9]*\) .*/\1/') >>> + nr_alloc_pages[$module_name]=$((${nr_alloc_pages[$module_name]}+$((2 ** $order)))) >>> + fi >>> + done < $TMPFILE >>> + >>> + # Print the result out. >>> + echo "== debug_mem for kernel modules during loading begin ==" >&2 >>> + for i in $module_names; do >>> + status="" >>> + if ! [[ ${module_loaded[$i]} ]]; then >>> + status="[still loading]" >>> + fi >>> + echo "${nr_alloc_pages[$i]} pages consumed by \"$i\" $status" >&2 >>> + done >>> + echo "== debug_mem for kernel modules during loading end ==" >&2 >>> + >>> + unset module_names >>> + unset module_loaded >> unset nr_alloc_pages? > nr_alloc_pages will be set to be 0 when module_load hits, but I will update or try to change them to be local. > > >>> + rm $TMPFILE -f >>> +} >>> + >>> +if is_trace_data_available ; then >>> + parse_trace_data >>> +fi >>> diff --git a/modules.d/99memdebug-ko/memdebug-ko-prepare.sh b/modules.d/99memdebug-ko/memdebug-ko-prepare.sh >>> new file mode 100755 >>> index 0000000..134821a >>> --- /dev/null >>> +++ b/modules.d/99memdebug-ko/memdebug-ko-prepare.sh >>> @@ -0,0 +1,60 @@ >>> +# Try to find out kernel modules with large total memory allocation during loading. >>> +# For large slab allocation, it will fall into buddy, thus tracing "mm_page_alloc" >>> +# alone should be enough for the purpose. >>> +# >>> +# Prepare for the trace setup. >>> + >>> +prepare_trace() { >>> + # Only "rd.memdebug=4" is for the purpose. >>> + if [[ $(getargnum 0 0 4 rd.memdebug) != 4 ]]; then >>> + return 0 >>> + fi >>> + >>> + TRACE_BASE="/sys/kernel/debug" >>> + # trace access through debugfs would be obsolete if "/sys/kernel/tracing" is available. >>> + if [[ -d "/sys/kernel/tracing" ]]; then >>> + TRACE_BASE="/sys/kernel" >>> + fi >> Move TRACE_BASE as a local variable, maybe a function.. > It is in a function, will change it to be local. > >>> + >>> + # old debugfs case. >>> + if ! [[ -d "$TRACE_BASE/tracing" ]]; then >>> + mount none -t debugfs $TRACE_BASE >>> + # new tracefs case. >>> + elif ! [[ -f "$TRACE_BASE/tracing/trace" ]]; then >>> + mount none -t tracefs "$TRACE_BASE/tracing" >>> + fi >>> + >>> + if ! [[ -f "$TRACE_BASE/tracing/trace" ]]; then >>> + warn "Mount trace failed for kernel module memory analyzing." >>> + return 1 >>> + fi >>> + >>> + MATCH_EVENTS="module:module_put module:module_load kmem:mm_page_alloc" >> Do we have module_init event? what is the difference between module_load >> and module_init? > module_load event is in some path of finit_module/init_module syscall, quite similar to > module_init, after the event there is do_init_module()->do_one_initcall() to invoke the > module init function, then module_put() immediately once finished which produces > module_put event. So between the two events, there are all the needed page allocation > events for us. I haven't see other better events. > > Regards, > Xunlei > >>> + SET_EVENTS=$(echo $(cat $TRACE_BASE/tracing/set_event)) >>> + # Check if trace was properly setup, prepare it if not. >>> + if [[ $(cat $TRACE_BASE/tracing/tracing_on) != 1 ]] || \ >>> + [[ "$SET_EVENTS" != "$MATCH_EVENTS" ]]; then >>> + # Set our trace events. >>> + echo $MATCH_EVENTS > $TRACE_BASE/tracing/set_event >>> + >>> + # There are three kinds of known applications for module loading: >>> + # "systemd-udevd", "modprobe" and "insmod". >>> + # Set them to the mm_page_alloc event filter. >>> + # NOTE: Some kernel may not support this format of filter, anyway >>> + # the operation will fail and it doesn't matter. >>> + page_alloc_filter="comm == systemd-udevd || comm == modprobe || comm == insmod" >>> + echo $page_alloc_filter > $TRACE_BASE/tracing/events/kmem/mm_page_alloc/filter >>> + >>> + # Set the number of comm-pid. >>> + # Thanks to filters, 4096 is big enough(also well supported). >>> + echo 4096 > $TRACE_BASE/tracing/saved_cmdlines_size >>> + >>> + # Enable and clear trace data for the first time. >>> + echo 1 > $TRACE_BASE/tracing/tracing_on >>> + echo > $TRACE_BASE/tracing/trace >>> + fi >>> + >>> + return 0 >>> +} >>> + >>> +prepare_trace >>> diff --git a/modules.d/99memdebug-ko/module-setup.sh b/modules.d/99memdebug-ko/module-setup.sh >>> new file mode 100755 >>> index 0000000..d6a7f0c >>> --- /dev/null >>> +++ b/modules.d/99memdebug-ko/module-setup.sh >>> @@ -0,0 +1,25 @@ >>> +#!/bin/bash >>> + >>> +# called by dracut >>> +check() { >>> + # this is very small, always add it. >>> + # It has no effect if no "rd.memdebug=4" is specified, also >>> + # this stays consistent with "rd.memdebug=[0-3]". >>> + return 0 >>> +} >>> + >>> +# called by dracut >>> +depends() { >>> + return 0 >>> +} >>> + >>> +# called by dracut >>> +install() { >>> + # Prepare trace at the early stage before any ko modules is loaded. >>> + inst_hook cmdline 00 "$moddir/memdebug-ko-prepare.sh" >>> + >>> + # Before switching root to the real root, rootfs should be mounted >>> + # and all dependent ko modules are expected to be loaded. >>> + # We can parse trace data at this stage. >>> + inst_hook pre-pivot 99 "$moddir/memdebug-ko-parse.sh" >>> +} >>> -- >>> 1.8.3.1 >>> >>> -- >>> To unsubscribe from this list: send the line "unsubscribe initramfs" in >>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >>> More majordomo info at http://vger.kernel.org/majordomo-info.html >> -- >> To unsubscribe from this list: send the line "unsubscribe initramfs" in >> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html > -- > To unsubscribe from this list: send the line "unsubscribe initramfs" in > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC PATCH 1/2] add 99memdebug-ko dracut module [not found] ` <581AB9D7.2010905-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2016-11-03 4:45 ` Xunlei Pang @ 2016-11-03 7:28 ` Xunlei Pang [not found] ` <581AE707.9050606-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 1 sibling, 1 reply; 14+ messages in thread From: Xunlei Pang @ 2016-11-03 7:28 UTC (permalink / raw) To: Dave Young, Xunlei Pang Cc: Harald Hoyer, initramfs-u79uwXL29TY76Z2rM5mHXA, Pratyush Anand On 2016/11/03 at 12:15, Xunlei Pang wrote: > On 2016/11/03 at 11:01, Dave Young wrote: >> Hi Xunlei, >> On 11/02/16 at 05:03pm, Xunlei Pang wrote: >>> The current method for memory debug is to use "rd.memdebug=[0-3]", >>> it is not enough for debugging kernel modules. For example, when we >>> want to find out which kernel module consumes a large amount of memory, >>> "rd.memdebug=[0-3]" won't help too much. >>> >>> A better way is needed to achieve this requirement, this is useful for >>> kdump OOM debugging. >>> >>> The principle of this patch is to use kernel trace to track slab and >>> buddy allocation calls during kernel module loading(module_init), thus >>> we can analyze all the trace data and get the total memory consumption. >>> As for large slab allocation, it will probably fall into buddy allocation, >>> thus tracing "mm_page_alloc" alone should be enough for the purpose(this >>> saves quite some trace buffer memory, also large free is quite unlikey >>> during module loading, we neglect those memory free events). >>> >>> The trace events include memory calls under "tracing/events/": >>> kmem/mm_page_alloc >>> >>> We also inpect the following events to detect the module loading: >>> module/module_load >>> module/module_put >>> >>> Since we use filters to trace events, the final trace data size won't >>> be too big. Users can adjust the trace buffer size via "trace_buf_size" >>> kernel boot command line as needed. >>> >>> We can get the module name and task pid from "module_load" event which >>> also mark the beginning of the loading, and module_put called by the >>> same task pid implies the end of the loading. So the memory events >>> recorded in between by the same task pid are consumed by this module >>> during loading(i.e. modprobe or module_init()). >>> >>> With these information, we can record the total memory(the larger, the >>> more precise the result will be) consumption involved by each kernel >>> module loading. >>> >>> Thus we introduce this dracut module to find out which kernel module >>> consumes a large amount of memory during loading. It uses "rd.memdebug=4" >>> as the tigger. >>> >>> After applying is patch and specifying "rd.memdebug=4", during booting >>> it will print out something below: >>> == debug_mem for kernel modules during loading begin == >>> 0 pages consumed by "pata_acpi" >>> 0 pages consumed by "ata_generic" >>> 1 pages consumed by "drm" >>> 0 pages consumed by "ttm" >>> 0 pages consumed by "drm_kms_helper" >>> 835 pages consumed by "qxl" >>> 0 pages consumed by "mii" >>> 6 pages consumed by "8139cp" >>> 0 pages consumed by "virtio" >>> 0 pages consumed by "virtio_ring" >>> 9 pages consumed by "virtio_pci" >>> 1 pages consumed by "8139too" >>> 0 pages consumed by "serio_raw" >>> 0 pages consumed by "crc32c_intel" >>> 199 pages consumed by "virtio_console" >>> 0 pages consumed by "libcrc32c" >>> 9 pages consumed by "xfs" >>> == debug_mem for kernel modules during loading end == >>> >>> From the print, we see clearly that "qxl" consumed lots of memory. >>> >>> Signed-off-by: Xunlei Pang <xlpang-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> >>> --- >>> dracut.spec | 1 + >>> modules.d/99memdebug-ko/memdebug-ko-parse.sh | 93 ++++++++++++++++++++++++++ >>> modules.d/99memdebug-ko/memdebug-ko-prepare.sh | 60 +++++++++++++++++ >>> modules.d/99memdebug-ko/module-setup.sh | 25 +++++++ >>> 4 files changed, 179 insertions(+) >>> create mode 100755 modules.d/99memdebug-ko/memdebug-ko-parse.sh >>> create mode 100755 modules.d/99memdebug-ko/memdebug-ko-prepare.sh >>> create mode 100755 modules.d/99memdebug-ko/module-setup.sh >>> >>> diff --git a/dracut.spec b/dracut.spec >>> index 7b9d675..3907b17 100644 >>> --- a/dracut.spec >>> +++ b/dracut.spec >>> @@ -407,6 +407,7 @@ rm -rf -- $RPM_BUILD_ROOT >>> %{dracutlibdir}/modules.d/98usrmount >>> %{dracutlibdir}/modules.d/99base >>> %{dracutlibdir}/modules.d/99fs-lib >>> +%{dracutlibdir}/modules.d/99memdebug-ko >>> %{dracutlibdir}/modules.d/99shutdown >>> %attr(0644,root,root) %ghost %config(missingok,noreplace) %{_localstatedir}/log/dracut.log >>> %dir %{_sharedstatedir}/initramfs >>> diff --git a/modules.d/99memdebug-ko/memdebug-ko-parse.sh b/modules.d/99memdebug-ko/memdebug-ko-parse.sh >>> new file mode 100755 >>> index 0000000..1542658 >>> --- /dev/null >>> +++ b/modules.d/99memdebug-ko/memdebug-ko-parse.sh >>> @@ -0,0 +1,93 @@ >>> +# Parse the trace data to find out the modules consuming large amount of memory. >>> + >>> +is_trace_data_available() { >>> + # Only "rd.memdebug=4" is for the purpose. >>> + if [[ $(getargnum 0 0 4 rd.memdebug) != 4 ]]; then >>> + return 1 >>> + fi > Hi Dave, > > Firstly, thanks for your valuable comments. > >> The usage of rd.memdebug is different in other place, can we integrate >> it to the original rd.memdbug code? >> >> One concern it in this patch only parse the trace data at pre-pivot >> hook, so if oom happens earlier than that we will get nothing. >> >> Original rd.memdebug being added to several different point, if we >> integrate this to the framework then we can get the out put early. > Yes, they can be integrated together. My original thought is that to > just output the memory information for all the modules at one stage > where all modules are supposed to be loaded, as such can avoid printing > redundant/partial information if at multiple stages(hooks). > > If OOM happens (mainly for kdump cases), we can enlarge the reserved > crash memory and try again, because this feature itself might consume > large trace buffer memory. > > But I can do in that way if you think it is fine. > >>> + >>> + TRACE_BASE="/sys/kernel/debug" >>> + # trace access through debugfs would be obsolete if "/sys/kernel/tracing" is available. >>> + if [[ -d "/sys/kernel/tracing" ]]; then >>> + TRACE_BASE="/sys/kernel" >>> + fi >>> + >>> + if ! [[ -f "$TRACE_BASE/tracing/trace" ]]; then >>> + warn "Trace inactive, skipping kernel module memory analyzing." >>> + return 1 >>> + fi >> This function is only for checking if tracing is needed by checking >> cmdline, so getting the TRACE_BASE logic can be moved to >> parse_trace_data so that it can be a local variable.. > It is actually in a function, will change it to be a local definition. > >>> + >>> + return 0 >>> +} >>> + >>> +parse_trace_data() { >>> + # Indexed by task pid. >>> + declare -A current_module >>> + # Indexed by module name. >>> + declare -A module_loaded >>> + declare -A nr_alloc_pages >>> + >>> + # For large trace data, parsing "tracing/trace" directly turns out to be time-consuming, >>> + # hence copy it out first, then parse the copy file instead. >> Have you get any data about the time used via traceing/trace parsing, if >> the time is still acceptable then it will be better to avoid the tmp >> file. > For large trace data(tested on rhel7, the filter doesn't work on rhel7, and will produce huge trace data), > the time consumption is huge, I am afraid in minutes because I once suspected the script was in some > dead loop when parsing "tracing/trace" directly. It is the same situation when turning off tracing_on and > try again. Although I don't know why, after I replaced the following scripts 1) while read pid cpu flags ts function do ... ... done < "$TRACE_BASE/tracing/trace" with 2) cat "$TRACE_BASE/tracing/trace" | while read pid cpu flags ts function do ... ... done 2) became not time-consuming just like parsing the copied filename in 1) ... Regards, Xunlei > >>> + TMPFILE=/tmp/kdump.trace.tmp.$$$$ >>> + cp $TRACE_BASE/tracing/trace $TMPFILE -f >>> + while read pid cpu flags ts function ; >>> + do >>> + # Skip comment lines >>> + if [[ $pid = "#" ]]; then >>> + continue >>> + fi >>> + >>> + if [[ $function = module_load* ]]; then >>> + # One module is being loaded, save the task pid for tracking. >>> + module_name=${function#*: } >>> + # Remove the trailing after whitespace, there may be the module flags. >>> + module_name=${module_name%% *} >>> + module_names+=" $module_name" >>> + current_module[$pid]="$module_name" >>> + [[ ${module_loaded[$module_name]} ]] && warn "\"$module_name\" was loaded multiple times!" >>> + unset module_loaded[$module_name] >>> + nr_alloc_pages[$module_name]=0 >>> + fi >>> + >>> + if ! [[ ${current_module[$pid]} ]]; then >>> + continue >>> + fi >>> + >>> + if [[ $function = module_put* ]]; then >>> + # Mark the module as loaded when the first module_put event happens after module_load. >>> + module_loaded[${current_module[$pid]}]=1 >>> + # Module has been loaded when module_put is called, untrack the task >>> + unset current_module[$pid] >>> + continue >>> + fi >>> + >>> + # Once we get here, the task is being tracked(is loading a module). >>> + # Get the module name. >>> + module_name=${current_module[$pid]} >>> + >>> + if [[ $function = mm_page_alloc* ]]; then >>> + order=$(echo $function | sed -e 's/.*order=\([0-9]*\) .*/\1/') >>> + nr_alloc_pages[$module_name]=$((${nr_alloc_pages[$module_name]}+$((2 ** $order)))) >>> + fi >>> + done < $TMPFILE >>> + >>> + # Print the result out. >>> + echo "== debug_mem for kernel modules during loading begin ==" >&2 >>> + for i in $module_names; do >>> + status="" >>> + if ! [[ ${module_loaded[$i]} ]]; then >>> + status="[still loading]" >>> + fi >>> + echo "${nr_alloc_pages[$i]} pages consumed by \"$i\" $status" >&2 >>> + done >>> + echo "== debug_mem for kernel modules during loading end ==" >&2 >>> + >>> + unset module_names >>> + unset module_loaded >> unset nr_alloc_pages? > nr_alloc_pages will be set to be 0 when module_load hits, but I will update or try to change them to be local. > > >>> + rm $TMPFILE -f >>> +} >>> + >>> +if is_trace_data_available ; then >>> + parse_trace_data >>> +fi >>> diff --git a/modules.d/99memdebug-ko/memdebug-ko-prepare.sh b/modules.d/99memdebug-ko/memdebug-ko-prepare.sh >>> new file mode 100755 >>> index 0000000..134821a >>> --- /dev/null >>> +++ b/modules.d/99memdebug-ko/memdebug-ko-prepare.sh >>> @@ -0,0 +1,60 @@ >>> +# Try to find out kernel modules with large total memory allocation during loading. >>> +# For large slab allocation, it will fall into buddy, thus tracing "mm_page_alloc" >>> +# alone should be enough for the purpose. >>> +# >>> +# Prepare for the trace setup. >>> + >>> +prepare_trace() { >>> + # Only "rd.memdebug=4" is for the purpose. >>> + if [[ $(getargnum 0 0 4 rd.memdebug) != 4 ]]; then >>> + return 0 >>> + fi >>> + >>> + TRACE_BASE="/sys/kernel/debug" >>> + # trace access through debugfs would be obsolete if "/sys/kernel/tracing" is available. >>> + if [[ -d "/sys/kernel/tracing" ]]; then >>> + TRACE_BASE="/sys/kernel" >>> + fi >> Move TRACE_BASE as a local variable, maybe a function.. > It is in a function, will change it to be local. > >>> + >>> + # old debugfs case. >>> + if ! [[ -d "$TRACE_BASE/tracing" ]]; then >>> + mount none -t debugfs $TRACE_BASE >>> + # new tracefs case. >>> + elif ! [[ -f "$TRACE_BASE/tracing/trace" ]]; then >>> + mount none -t tracefs "$TRACE_BASE/tracing" >>> + fi >>> + >>> + if ! [[ -f "$TRACE_BASE/tracing/trace" ]]; then >>> + warn "Mount trace failed for kernel module memory analyzing." >>> + return 1 >>> + fi >>> + >>> + MATCH_EVENTS="module:module_put module:module_load kmem:mm_page_alloc" >> Do we have module_init event? what is the difference between module_load >> and module_init? > module_load event is in some path of finit_module/init_module syscall, quite similar to > module_init, after the event there is do_init_module()->do_one_initcall() to invoke the > module init function, then module_put() immediately once finished which produces > module_put event. So between the two events, there are all the needed page allocation > events for us. I haven't see other better events. > > Regards, > Xunlei > >>> + SET_EVENTS=$(echo $(cat $TRACE_BASE/tracing/set_event)) >>> + # Check if trace was properly setup, prepare it if not. >>> + if [[ $(cat $TRACE_BASE/tracing/tracing_on) != 1 ]] || \ >>> + [[ "$SET_EVENTS" != "$MATCH_EVENTS" ]]; then >>> + # Set our trace events. >>> + echo $MATCH_EVENTS > $TRACE_BASE/tracing/set_event >>> + >>> + # There are three kinds of known applications for module loading: >>> + # "systemd-udevd", "modprobe" and "insmod". >>> + # Set them to the mm_page_alloc event filter. >>> + # NOTE: Some kernel may not support this format of filter, anyway >>> + # the operation will fail and it doesn't matter. >>> + page_alloc_filter="comm == systemd-udevd || comm == modprobe || comm == insmod" >>> + echo $page_alloc_filter > $TRACE_BASE/tracing/events/kmem/mm_page_alloc/filter >>> + >>> + # Set the number of comm-pid. >>> + # Thanks to filters, 4096 is big enough(also well supported). >>> + echo 4096 > $TRACE_BASE/tracing/saved_cmdlines_size >>> + >>> + # Enable and clear trace data for the first time. >>> + echo 1 > $TRACE_BASE/tracing/tracing_on >>> + echo > $TRACE_BASE/tracing/trace >>> + fi >>> + >>> + return 0 >>> +} >>> + >>> +prepare_trace >>> diff --git a/modules.d/99memdebug-ko/module-setup.sh b/modules.d/99memdebug-ko/module-setup.sh >>> new file mode 100755 >>> index 0000000..d6a7f0c >>> --- /dev/null >>> +++ b/modules.d/99memdebug-ko/module-setup.sh >>> @@ -0,0 +1,25 @@ >>> +#!/bin/bash >>> + >>> +# called by dracut >>> +check() { >>> + # this is very small, always add it. >>> + # It has no effect if no "rd.memdebug=4" is specified, also >>> + # this stays consistent with "rd.memdebug=[0-3]". >>> + return 0 >>> +} >>> + >>> +# called by dracut >>> +depends() { >>> + return 0 >>> +} >>> + >>> +# called by dracut >>> +install() { >>> + # Prepare trace at the early stage before any ko modules is loaded. >>> + inst_hook cmdline 00 "$moddir/memdebug-ko-prepare.sh" >>> + >>> + # Before switching root to the real root, rootfs should be mounted >>> + # and all dependent ko modules are expected to be loaded. >>> + # We can parse trace data at this stage. >>> + inst_hook pre-pivot 99 "$moddir/memdebug-ko-parse.sh" >>> +} >>> -- >>> 1.8.3.1 >>> >>> -- >>> To unsubscribe from this list: send the line "unsubscribe initramfs" in >>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >>> More majordomo info at http://vger.kernel.org/majordomo-info.html >> -- >> To unsubscribe from this list: send the line "unsubscribe initramfs" in >> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html > -- > To unsubscribe from this list: send the line "unsubscribe initramfs" in > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 14+ messages in thread
[parent not found: <581AE707.9050606-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [RFC PATCH 1/2] add 99memdebug-ko dracut module [not found] ` <581AE707.9050606-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2016-11-03 8:38 ` Dave Young [not found] ` <20161103083826.GA15431-0VdLhd/A9Pl+NNSt+8eSiB/sF2h8X+2i0E9HWUfgJXw@public.gmane.org> 0 siblings, 1 reply; 14+ messages in thread From: Dave Young @ 2016-11-03 8:38 UTC (permalink / raw) To: xlpang-H+wXaHxf7aLQT0dZR+AlfA Cc: Harald Hoyer, initramfs-u79uwXL29TY76Z2rM5mHXA, Pratyush Anand On 11/03/16 at 03:28pm, Xunlei Pang wrote: [snip] > > For large trace data(tested on rhel7, the filter doesn't work on rhel7, and will produce huge trace data), > > the time consumption is huge, I am afraid in minutes because I once suspected the script was in some > > dead loop when parsing "tracing/trace" directly. It is the same situation when turning off tracing_on and > > try again. > > Although I don't know why, after I replaced the following scripts > 1) > while read pid cpu flags ts function > do > ... ... > done < "$TRACE_BASE/tracing/trace" > > with > > 2) > cat "$TRACE_BASE/tracing/trace" | while read pid cpu flags ts function > do > ... ... > done > > 2) became not time-consuming just like parsing the copied filename in 1) ... Maybe 1) read the sysfs file a lot of times, but 2) only once then parsing them in pipe which is quiker. It should be fine if 2) is acceptable, but if the data is very large it may worth to use some external program like awk which will be faster. Thanks Dave ^ permalink raw reply [flat|nested] 14+ messages in thread
[parent not found: <20161103083826.GA15431-0VdLhd/A9Pl+NNSt+8eSiB/sF2h8X+2i0E9HWUfgJXw@public.gmane.org>]
* Re: [RFC PATCH 1/2] add 99memdebug-ko dracut module [not found] ` <20161103083826.GA15431-0VdLhd/A9Pl+NNSt+8eSiB/sF2h8X+2i0E9HWUfgJXw@public.gmane.org> @ 2016-11-03 11:52 ` Xunlei Pang [not found] ` <581B2518.6060503-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 14+ messages in thread From: Xunlei Pang @ 2016-11-03 11:52 UTC (permalink / raw) To: Dave Young, xlpang-H+wXaHxf7aLQT0dZR+AlfA Cc: Harald Hoyer, initramfs-u79uwXL29TY76Z2rM5mHXA, Pratyush Anand On 2016/11/03 at 16:38, Dave Young wrote: > On 11/03/16 at 03:28pm, Xunlei Pang wrote: > [snip] >>> For large trace data(tested on rhel7, the filter doesn't work on rhel7, and will produce huge trace data), >>> the time consumption is huge, I am afraid in minutes because I once suspected the script was in some >>> dead loop when parsing "tracing/trace" directly. It is the same situation when turning off tracing_on and >>> try again. >> Although I don't know why, after I replaced the following scripts >> 1) >> while read pid cpu flags ts function >> do >> ... ... >> done < "$TRACE_BASE/tracing/trace" >> >> with >> >> 2) >> cat "$TRACE_BASE/tracing/trace" | while read pid cpu flags ts function >> do >> ... ... >> done >> >> 2) became not time-consuming just like parsing the copied filename in 1) ... > Maybe 1) read the sysfs file a lot of times, but 2) only once then > parsing them in pipe which is quiker. > > It should be fine if 2) is acceptable, but if the data is very large it > may worth to use some external program like awk which will be faster. Hi Dave, What do you think the following approach? ============== [PATCH 1/2] ================ --- modules.d/99base/memdebug-ko.sh | 119 ++++++++++++++++++++++++++++++++++++++++ 1 file changed, 119 insertions(+) create mode 100755 modules.d/99base/memdebug-ko.sh diff --git a/modules.d/99base/memdebug-ko.sh b/modules.d/99base/memdebug-ko.sh new file mode 100755 index 0000000..2839966 --- /dev/null +++ b/modules.d/99base/memdebug-ko.sh @@ -0,0 +1,119 @@ +# Try to find out kernel modules with large total memory allocation during loading. +# For large slab allocation, it will fall into buddy, thus tracing "mm_page_alloc" +# alone should be enough for the purpose. + +# "sys/kernel/tracing" has the priority if exists. +get_trace_base() { + # trace access through debugfs would be obsolete if "/sys/kernel/tracing" is available. + if [[ -d "/sys/kernel/tracing" ]]; then + echo "/sys/kernel" + else + echo "/sys/kernel/debug" + fi +} + +is_trace_data_prepared() { + local trace_base + + trace_base=$(get_trace_base) + # old debugfs interface case. + if ! [[ -d "$trace_base/tracing" ]]; then + mount none -t debugfs $trace_base + # new tracefs interface case. + elif ! [[ -f "$trace_base/tracing/trace" ]]; then + mount none -t tracefs "$trace_base/tracing" + fi + + if ! [[ -f "$trace_base/tracing/trace" ]]; then + echo "WARN: Mount trace failed for kernel module memory analyzing." + return 1 + fi + + MATCH_EVENTS="module:module_put module:module_load kmem:mm_page_alloc" + SET_EVENTS=$(echo $(cat $trace_base/tracing/set_event)) + # Check if trace was properly setup, prepare it if not. + if [[ $(cat $trace_base/tracing/tracing_on) != 1 ]] || \ + [[ "$SET_EVENTS" != "$MATCH_EVENTS" ]]; then + # Set our trace events. + echo $MATCH_EVENTS > $trace_base/tracing/set_event + + # There are three kinds of known applications for module loading: + # "systemd-udevd", "modprobe" and "insmod". + # Set them to the mm_page_alloc event filter. + # NOTE: Some kernel may not support this format of filter, anyway + # the operation will fail and it doesn't matter. + page_alloc_filter="comm == systemd-udevd || comm == modprobe || comm == insmod" + echo $page_alloc_filter > $trace_base/tracing/events/kmem/mm_page_alloc/filter + + # Set the number of comm-pid if supported. + if [[ -f "$trace_base/tracing/saved_cmdlines_size" ]]; then + # Thanks to filters, 4096 is big enough(also well supported). + echo 4096 > $trace_base/tracing/saved_cmdlines_size + fi + + # Enable and clear trace data for the first time. + echo 1 > $trace_base/tracing/tracing_on + echo > $trace_base/tracing/trace + echo "Prepare trace success." + return 1 + fi + + return 0 +} + +parse_trace_data() { + local module_name + # Indexed by task pid. + local -A current_module + # Indexed by module name. + local -A module_loaded + local -A nr_alloc_pages + + cat "$(get_trace_base)/tracing/trace" | while read pid cpu flags ts function + do + # Skip comment lines + if [[ $pid = "#" ]]; then + continue + fi + + if [[ $function = module_load* ]]; then + # One module is being loaded, save the task pid for tracking. + module_name=${function#*: } + # Remove the trailing after whitespace, there may be the module flags. + module_name=${module_name%% *} + # Mark current_module to track the task. + current_module[$pid]="$module_name" + [[ ${module_loaded[$module_name]} ]] && echo "WARN: \"$module_name\" was loaded multiple times!" + unset module_loaded[$module_name] + nr_alloc_pages[$module_name]=0 + continue + fi + + if ! [[ ${current_module[$pid]} ]]; then + continue + fi + + # Once we get here, the task is being tracked(is loading a module). + # Get the module name. + module_name=${current_module[$pid]} + + if [[ $function = module_put* ]]; then + # Mark the module as loaded when the first module_put event happens after module_load. + echo "${nr_alloc_pages[$module_name]} pages consumed by \"$module_name\"" + module_loaded[$module_name]=1 + # Module loading finished, so untrack the task. + unset current_module[$pid] + continue + fi + + if [[ $function = mm_page_alloc* ]]; then + order=$(echo $function | sed -e 's/.*order=\([0-9]*\) .*/\1/') + nr_alloc_pages[$module_name]=$((${nr_alloc_pages[$module_name]}+$((2 ** $order)))) + fi + done +} + +if is_trace_data_prepared ; then + echo "showkomem - memory consumption of loading kernel modules(the larger, the more precise)" + parse_trace_data +fi -- 1.8.3.1 ============== [PATCH 2/2] ================ --- modules.d/98dracut-systemd/dracut-cmdline.sh | 2 +- modules.d/98dracut-systemd/dracut-mount.sh | 2 +- modules.d/98dracut-systemd/dracut-pre-mount.sh | 2 +- modules.d/98dracut-systemd/dracut-pre-pivot.sh | 2 +- modules.d/98dracut-systemd/dracut-pre-trigger.sh | 2 +- modules.d/98dracut-systemd/dracut-pre-udev.sh | 2 +- modules.d/99base/dracut-lib.sh | 5 ++++- modules.d/99base/init.sh | 10 +++++----- modules.d/99base/module-setup.sh | 1 + 9 files changed, 16 insertions(+), 12 deletions(-) diff --git a/modules.d/98dracut-systemd/dracut-cmdline.sh b/modules.d/98dracut-systemd/dracut-cmdline.sh index 6c6ee02..bff9435 100755 --- a/modules.d/98dracut-systemd/dracut-cmdline.sh +++ b/modules.d/98dracut-systemd/dracut-cmdline.sh @@ -42,7 +42,7 @@ export root export rflags export fstype -make_trace_mem "hook cmdline" '1+:mem' '1+:iomem' '3+:slab' +make_trace_mem "hook cmdline" '1+:mem' '1+:iomem' '3+:slab' '4+:komem' # run scriptlets to parse the command line getarg 'rd.break=cmdline' -d 'rdbreak=cmdline' && emergency_shell -n cmdline "Break before cmdline" source_hook cmdline diff --git a/modules.d/98dracut-systemd/dracut-mount.sh b/modules.d/98dracut-systemd/dracut-mount.sh index c4febfe..89ebc31 100755 --- a/modules.d/98dracut-systemd/dracut-mount.sh +++ b/modules.d/98dracut-systemd/dracut-mount.sh @@ -7,7 +7,7 @@ type getarg >/dev/null 2>&1 || . /lib/dracut-lib.sh source_conf /etc/conf.d -make_trace_mem "hook mount" '1:shortmem' '2+:mem' '3+:slab' +make_trace_mem "hook mount" '1:shortmem' '2+:mem' '3+:slab' '4+:komem' getarg 'rd.break=mount' -d 'rdbreak=mount' && emergency_shell -n mount "Break mount" # mount scripts actually try to mount the root filesystem, and may diff --git a/modules.d/98dracut-systemd/dracut-pre-mount.sh b/modules.d/98dracut-systemd/dracut-pre-mount.sh index ae51128..a3b9d29 100755 --- a/modules.d/98dracut-systemd/dracut-pre-mount.sh +++ b/modules.d/98dracut-systemd/dracut-pre-mount.sh @@ -8,7 +8,7 @@ type getarg >/dev/null 2>&1 || . /lib/dracut-lib.sh source_conf /etc/conf.d -make_trace_mem "hook pre-mount" '1:shortmem' '2+:mem' '3+:slab' +make_trace_mem "hook pre-mount" '1:shortmem' '2+:mem' '3+:slab' '4+:komem' # pre pivot scripts are sourced just before we doing cleanup and switch over # to the new root. getarg 'rd.break=pre-mount' 'rdbreak=pre-mount' && emergency_shell -n pre-mount "Break pre-mount" diff --git a/modules.d/98dracut-systemd/dracut-pre-pivot.sh b/modules.d/98dracut-systemd/dracut-pre-pivot.sh index cc70e3c..dfd328c 100755 --- a/modules.d/98dracut-systemd/dracut-pre-pivot.sh +++ b/modules.d/98dracut-systemd/dracut-pre-pivot.sh @@ -8,7 +8,7 @@ type getarg >/dev/null 2>&1 || . /lib/dracut-lib.sh source_conf /etc/conf.d -make_trace_mem "hook pre-pivot" '1:shortmem' '2+:mem' '3+:slab' +make_trace_mem "hook pre-pivot" '1:shortmem' '2+:mem' '3+:slab' '4+:komem' # pre pivot scripts are sourced just before we doing cleanup and switch over # to the new root. getarg 'rd.break=pre-pivot' 'rdbreak=pre-pivot' && emergency_shell -n pre-pivot "Break pre-pivot" diff --git a/modules.d/98dracut-systemd/dracut-pre-trigger.sh b/modules.d/98dracut-systemd/dracut-pre-trigger.sh index ac1ec36..7cd821e 100755 --- a/modules.d/98dracut-systemd/dracut-pre-trigger.sh +++ b/modules.d/98dracut-systemd/dracut-pre-trigger.sh @@ -8,7 +8,7 @@ type getarg >/dev/null 2>&1 || . /lib/dracut-lib.sh source_conf /etc/conf.d -make_trace_mem "hook pre-trigger" "1:shortmem" "2+:mem" "3+:slab" +make_trace_mem "hook pre-trigger" '1:shortmem' '2+:mem' '3+:slab' '4+:komem' source_hook pre-trigger diff --git a/modules.d/98dracut-systemd/dracut-pre-udev.sh b/modules.d/98dracut-systemd/dracut-pre-udev.sh index ca13048..17268a1 100755 --- a/modules.d/98dracut-systemd/dracut-pre-udev.sh +++ b/modules.d/98dracut-systemd/dracut-pre-udev.sh @@ -7,7 +7,7 @@ type getarg >/dev/null 2>&1 || . /lib/dracut-lib.sh source_conf /etc/conf.d -make_trace_mem "hook pre-udev" '1:shortmem' '2+:mem' '3+:slab' +make_trace_mem "hook pre-udev" '1:shortmem' '2+:mem' '3+:slab' '4+:komem' # pre pivot scripts are sourced just before we doing cleanup and switch over # to the new root. getarg 'rd.break=pre-udev' 'rdbreak=pre-udev' && emergency_shell -n pre-udev "Break pre-udev" diff --git a/modules.d/99base/dracut-lib.sh b/modules.d/99base/dracut-lib.sh index 060b3fe..833ed5f 100755 --- a/modules.d/99base/dracut-lib.sh +++ b/modules.d/99base/dracut-lib.sh @@ -1206,7 +1206,7 @@ are_lists_eq() { setmemdebug() { if [ -z "$DEBUG_MEM_LEVEL" ]; then - export DEBUG_MEM_LEVEL=$(getargnum 0 0 3 rd.memdebug) + export DEBUG_MEM_LEVEL=$(getargnum 0 0 4 rd.memdebug) fi } @@ -1296,6 +1296,9 @@ show_memstats() iomem) cat /proc/iomem ;; + komem) + showkomem + ;; esac } diff --git a/modules.d/99base/init.sh b/modules.d/99base/init.sh index a563393..f0195d8 100755 --- a/modules.d/99base/init.sh +++ b/modules.d/99base/init.sh @@ -131,7 +131,7 @@ if ! getargbool 1 'rd.hostonly'; then fi # run scriptlets to parse the command line -make_trace_mem "hook cmdline" '1+:mem' '1+:iomem' '3+:slab' +make_trace_mem "hook cmdline" '1+:mem' '1+:iomem' '3+:slab' '4+:komem' getarg 'rd.break=cmdline' -d 'rdbreak=cmdline' && emergency_shell -n cmdline "Break before cmdline" source_hook cmdline @@ -141,7 +141,7 @@ source_hook cmdline export root rflags fstype netroot NEWROOT # pre-udev scripts run before udev starts, and are run only once. -make_trace_mem "hook pre-udev" '1:shortmem' '2+:mem' '3+:slab' +make_trace_mem "hook pre-udev" '1:shortmem' '2+:mem' '3+:slab' '4+:komem' getarg 'rd.break=pre-udev' -d 'rdbreak=pre-udev' && emergency_shell -n pre-udev "Break before pre-udev" source_hook pre-udev @@ -160,7 +160,7 @@ fi udevproperty "hookdir=$hookdir" -make_trace_mem "hook pre-trigger" '1:shortmem' '2+:mem' '3+:slab' +make_trace_mem "hook pre-trigger" '1:shortmem' '2+:mem' '3+:slab' '4+:komem' getarg 'rd.break=pre-trigger' -d 'rdbreak=pre-trigger' && emergency_shell -n pre-trigger "Break before pre-trigger" source_hook pre-trigger @@ -230,7 +230,7 @@ unset RDRETRY # pre-mount happens before we try to mount the root filesystem, # and happens once. -make_trace_mem "hook pre-mount" '1:shortmem' '2+:mem' '3+:slab' +make_trace_mem "hook pre-mount" '1:shortmem' '2+:mem' '3+:slab' '4+:komem' getarg 'rd.break=pre-mount' -d 'rdbreak=pre-mount' && emergency_shell -n pre-mount "Break pre-mount" source_hook pre-mount @@ -266,7 +266,7 @@ done # pre pivot scripts are sourced just before we doing cleanup and switch over # to the new root. -make_trace_mem "hook pre-pivot" '1:shortmem' '2+:mem' '3+:slab' +make_trace_mem "hook pre-pivot" '1:shortmem' '2+:mem' '3+:slab' '4+:komem' getarg 'rd.break=pre-pivot' -d 'rdbreak=pre-pivot' && emergency_shell -n pre-pivot "Break pre-pivot" source_hook pre-pivot diff --git a/modules.d/99base/module-setup.sh b/modules.d/99base/module-setup.sh index b03772e..13019f0 100755 --- a/modules.d/99base/module-setup.sh +++ b/modules.d/99base/module-setup.sh @@ -35,6 +35,7 @@ install() { inst_script "$moddir/initqueue.sh" "/sbin/initqueue" inst_script "$moddir/loginit.sh" "/sbin/loginit" inst_script "$moddir/rdsosreport.sh" "/sbin/rdsosreport" + inst_script "$moddir/memdebug-ko.sh" "/sbin/showkomem" [ -e "${initdir}/lib" ] || mkdir -m 0755 -p ${initdir}/lib mkdir -m 0755 -p ${initdir}/lib/dracut -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 14+ messages in thread
[parent not found: <581B2518.6060503-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [RFC PATCH 1/2] add 99memdebug-ko dracut module [not found] ` <581B2518.6060503-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2016-11-04 5:50 ` Dave Young [not found] ` <20161104055026.GB2889-0VdLhd/A9Pl+NNSt+8eSiB/sF2h8X+2i0E9HWUfgJXw@public.gmane.org> 0 siblings, 1 reply; 14+ messages in thread From: Dave Young @ 2016-11-04 5:50 UTC (permalink / raw) To: xlpang-H+wXaHxf7aLQT0dZR+AlfA Cc: Harald Hoyer, initramfs-u79uwXL29TY76Z2rM5mHXA, Pratyush Anand On 11/03/16 at 07:52pm, Xunlei Pang wrote: > On 2016/11/03 at 16:38, Dave Young wrote: > > On 11/03/16 at 03:28pm, Xunlei Pang wrote: > > [snip] > >>> For large trace data(tested on rhel7, the filter doesn't work on rhel7, and will produce huge trace data), > >>> the time consumption is huge, I am afraid in minutes because I once suspected the script was in some > >>> dead loop when parsing "tracing/trace" directly. It is the same situation when turning off tracing_on and > >>> try again. > >> Although I don't know why, after I replaced the following scripts > >> 1) > >> while read pid cpu flags ts function > >> do > >> ... ... > >> done < "$TRACE_BASE/tracing/trace" > >> > >> with > >> > >> 2) > >> cat "$TRACE_BASE/tracing/trace" | while read pid cpu flags ts function > >> do > >> ... ... > >> done > >> > >> 2) became not time-consuming just like parsing the copied filename in 1) ... > > Maybe 1) read the sysfs file a lot of times, but 2) only once then > > parsing them in pipe which is quiker. > > > > It should be fine if 2) is acceptable, but if the data is very large it > > may worth to use some external program like awk which will be faster. > > Hi Dave, > > What do you think the following approach? > > ============== [PATCH 1/2] ================ > --- > modules.d/99base/memdebug-ko.sh | 119 ++++++++++++++++++++++++++++++++++++++++ memtrace-ko.sh sounds better > 1 file changed, 119 insertions(+) > create mode 100755 modules.d/99base/memdebug-ko.sh > > diff --git a/modules.d/99base/memdebug-ko.sh b/modules.d/99base/memdebug-ko.sh > new file mode 100755 > index 0000000..2839966 > --- /dev/null > +++ b/modules.d/99base/memdebug-ko.sh > @@ -0,0 +1,119 @@ > +# Try to find out kernel modules with large total memory allocation during loading. > +# For large slab allocation, it will fall into buddy, thus tracing "mm_page_alloc" > +# alone should be enough for the purpose. > + > +# "sys/kernel/tracing" has the priority if exists. > +get_trace_base() { > + # trace access through debugfs would be obsolete if "/sys/kernel/tracing" is available. > + if [[ -d "/sys/kernel/tracing" ]]; then > + echo "/sys/kernel" > + else > + echo "/sys/kernel/debug" > + fi > +} > + > +is_trace_data_prepared() { > + local trace_base > + > + trace_base=$(get_trace_base) > + # old debugfs interface case. > + if ! [[ -d "$trace_base/tracing" ]]; then > + mount none -t debugfs $trace_base > + # new tracefs interface case. > + elif ! [[ -f "$trace_base/tracing/trace" ]]; then > + mount none -t tracefs "$trace_base/tracing" > + fi > + > + if ! [[ -f "$trace_base/tracing/trace" ]]; then > + echo "WARN: Mount trace failed for kernel module memory analyzing." > + return 1 > + fi > + > + MATCH_EVENTS="module:module_put module:module_load kmem:mm_page_alloc" > + SET_EVENTS=$(echo $(cat $trace_base/tracing/set_event)) > + # Check if trace was properly setup, prepare it if not. > + if [[ $(cat $trace_base/tracing/tracing_on) != 1 ]] || \ How about return in case tracing_on == 1, then no need indent later. > + [[ "$SET_EVENTS" != "$MATCH_EVENTS" ]]; then > + # Set our trace events. > + echo $MATCH_EVENTS > $trace_base/tracing/set_event > + > + # There are three kinds of known applications for module loading: > + # "systemd-udevd", "modprobe" and "insmod". > + # Set them to the mm_page_alloc event filter. > + # NOTE: Some kernel may not support this format of filter, anyway > + # the operation will fail and it doesn't matter. > + page_alloc_filter="comm == systemd-udevd || comm == modprobe || comm == insmod" > + echo $page_alloc_filter > $trace_base/tracing/events/kmem/mm_page_alloc/filter > + > + # Set the number of comm-pid if supported. > + if [[ -f "$trace_base/tracing/saved_cmdlines_size" ]]; then > + # Thanks to filters, 4096 is big enough(also well supported). > + echo 4096 > $trace_base/tracing/saved_cmdlines_size > + fi > + > + # Enable and clear trace data for the first time. > + echo 1 > $trace_base/tracing/tracing_on > + echo > $trace_base/tracing/trace > + echo "Prepare trace success." > + return 1 Should return 0? > + fi > + > + return 0 > +} > + > +parse_trace_data() { > + local module_name > + # Indexed by task pid. > + local -A current_module > + # Indexed by module name. > + local -A module_loaded > + local -A nr_alloc_pages > + > + cat "$(get_trace_base)/tracing/trace" | while read pid cpu flags ts function > + do > + # Skip comment lines > + if [[ $pid = "#" ]]; then > + continue > + fi > + > + if [[ $function = module_load* ]]; then > + # One module is being loaded, save the task pid for tracking. > + module_name=${function#*: } > + # Remove the trailing after whitespace, there may be the module flags. > + module_name=${module_name%% *} > + # Mark current_module to track the task. > + current_module[$pid]="$module_name" > + [[ ${module_loaded[$module_name]} ]] && echo "WARN: \"$module_name\" was loaded multiple times!" > + unset module_loaded[$module_name] > + nr_alloc_pages[$module_name]=0 > + continue > + fi > + > + if ! [[ ${current_module[$pid]} ]]; then > + continue > + fi > + > + # Once we get here, the task is being tracked(is loading a module). > + # Get the module name. > + module_name=${current_module[$pid]} > + > + if [[ $function = module_put* ]]; then > + # Mark the module as loaded when the first module_put event happens after module_load. > + echo "${nr_alloc_pages[$module_name]} pages consumed by \"$module_name\"" > + module_loaded[$module_name]=1 > + # Module loading finished, so untrack the task. > + unset current_module[$pid] > + continue > + fi > + > + if [[ $function = mm_page_alloc* ]]; then > + order=$(echo $function | sed -e 's/.*order=\([0-9]*\) .*/\1/') > + nr_alloc_pages[$module_name]=$((${nr_alloc_pages[$module_name]}+$((2 ** $order)))) > + fi > + done > +} > + > +if is_trace_data_prepared ; then > + echo "showkomem - memory consumption of loading kernel modules(the larger, the more precise)" > + parse_trace_data > +fi I feel the original design is better, prepare and enalbe trace at the very early of cmdline hook. But below will be useless for komem trace: +make_trace_mem "hook cmdline" '1+:mem' '1+:iomem' '3+:slab' '4+:komem' So in cmdline hook we can just prepare and enable tracing without parsing. Later when we parse it just check if the trace is on or off, if it is off then just do nothing. And we should disable tracing before pivot root.. > -- > 1.8.3.1 > > ============== [PATCH 2/2] ================ > --- > modules.d/98dracut-systemd/dracut-cmdline.sh | 2 +- > modules.d/98dracut-systemd/dracut-mount.sh | 2 +- > modules.d/98dracut-systemd/dracut-pre-mount.sh | 2 +- > modules.d/98dracut-systemd/dracut-pre-pivot.sh | 2 +- > modules.d/98dracut-systemd/dracut-pre-trigger.sh | 2 +- > modules.d/98dracut-systemd/dracut-pre-udev.sh | 2 +- > modules.d/99base/dracut-lib.sh | 5 ++++- > modules.d/99base/init.sh | 10 +++++----- > modules.d/99base/module-setup.sh | 1 + > 9 files changed, 16 insertions(+), 12 deletions(-) > > diff --git a/modules.d/98dracut-systemd/dracut-cmdline.sh b/modules.d/98dracut-systemd/dracut-cmdline.sh > index 6c6ee02..bff9435 100755 > --- a/modules.d/98dracut-systemd/dracut-cmdline.sh > +++ b/modules.d/98dracut-systemd/dracut-cmdline.sh > @@ -42,7 +42,7 @@ export root > export rflags > export fstype > > -make_trace_mem "hook cmdline" '1+:mem' '1+:iomem' '3+:slab' > +make_trace_mem "hook cmdline" '1+:mem' '1+:iomem' '3+:slab' '4+:komem' > # run scriptlets to parse the command line > getarg 'rd.break=cmdline' -d 'rdbreak=cmdline' && emergency_shell -n cmdline "Break before cmdline" > source_hook cmdline > diff --git a/modules.d/98dracut-systemd/dracut-mount.sh b/modules.d/98dracut-systemd/dracut-mount.sh > index c4febfe..89ebc31 100755 > --- a/modules.d/98dracut-systemd/dracut-mount.sh > +++ b/modules.d/98dracut-systemd/dracut-mount.sh > @@ -7,7 +7,7 @@ type getarg >/dev/null 2>&1 || . /lib/dracut-lib.sh > > source_conf /etc/conf.d > > -make_trace_mem "hook mount" '1:shortmem' '2+:mem' '3+:slab' > +make_trace_mem "hook mount" '1:shortmem' '2+:mem' '3+:slab' '4+:komem' > > getarg 'rd.break=mount' -d 'rdbreak=mount' && emergency_shell -n mount "Break mount" > # mount scripts actually try to mount the root filesystem, and may > diff --git a/modules.d/98dracut-systemd/dracut-pre-mount.sh b/modules.d/98dracut-systemd/dracut-pre-mount.sh > index ae51128..a3b9d29 100755 > --- a/modules.d/98dracut-systemd/dracut-pre-mount.sh > +++ b/modules.d/98dracut-systemd/dracut-pre-mount.sh > @@ -8,7 +8,7 @@ type getarg >/dev/null 2>&1 || . /lib/dracut-lib.sh > > source_conf /etc/conf.d > > -make_trace_mem "hook pre-mount" '1:shortmem' '2+:mem' '3+:slab' > +make_trace_mem "hook pre-mount" '1:shortmem' '2+:mem' '3+:slab' '4+:komem' > # pre pivot scripts are sourced just before we doing cleanup and switch over > # to the new root. > getarg 'rd.break=pre-mount' 'rdbreak=pre-mount' && emergency_shell -n pre-mount "Break pre-mount" > diff --git a/modules.d/98dracut-systemd/dracut-pre-pivot.sh b/modules.d/98dracut-systemd/dracut-pre-pivot.sh > index cc70e3c..dfd328c 100755 > --- a/modules.d/98dracut-systemd/dracut-pre-pivot.sh > +++ b/modules.d/98dracut-systemd/dracut-pre-pivot.sh > @@ -8,7 +8,7 @@ type getarg >/dev/null 2>&1 || . /lib/dracut-lib.sh > > source_conf /etc/conf.d > > -make_trace_mem "hook pre-pivot" '1:shortmem' '2+:mem' '3+:slab' > +make_trace_mem "hook pre-pivot" '1:shortmem' '2+:mem' '3+:slab' '4+:komem' > # pre pivot scripts are sourced just before we doing cleanup and switch over > # to the new root. > getarg 'rd.break=pre-pivot' 'rdbreak=pre-pivot' && emergency_shell -n pre-pivot "Break pre-pivot" > diff --git a/modules.d/98dracut-systemd/dracut-pre-trigger.sh b/modules.d/98dracut-systemd/dracut-pre-trigger.sh > index ac1ec36..7cd821e 100755 > --- a/modules.d/98dracut-systemd/dracut-pre-trigger.sh > +++ b/modules.d/98dracut-systemd/dracut-pre-trigger.sh > @@ -8,7 +8,7 @@ type getarg >/dev/null 2>&1 || . /lib/dracut-lib.sh > > source_conf /etc/conf.d > > -make_trace_mem "hook pre-trigger" "1:shortmem" "2+:mem" "3+:slab" > +make_trace_mem "hook pre-trigger" '1:shortmem' '2+:mem' '3+:slab' '4+:komem' > > source_hook pre-trigger > > diff --git a/modules.d/98dracut-systemd/dracut-pre-udev.sh b/modules.d/98dracut-systemd/dracut-pre-udev.sh > index ca13048..17268a1 100755 > --- a/modules.d/98dracut-systemd/dracut-pre-udev.sh > +++ b/modules.d/98dracut-systemd/dracut-pre-udev.sh > @@ -7,7 +7,7 @@ type getarg >/dev/null 2>&1 || . /lib/dracut-lib.sh > > source_conf /etc/conf.d > > -make_trace_mem "hook pre-udev" '1:shortmem' '2+:mem' '3+:slab' > +make_trace_mem "hook pre-udev" '1:shortmem' '2+:mem' '3+:slab' '4+:komem' > # pre pivot scripts are sourced just before we doing cleanup and switch over > # to the new root. > getarg 'rd.break=pre-udev' 'rdbreak=pre-udev' && emergency_shell -n pre-udev "Break pre-udev" > diff --git a/modules.d/99base/dracut-lib.sh b/modules.d/99base/dracut-lib.sh > index 060b3fe..833ed5f 100755 > --- a/modules.d/99base/dracut-lib.sh > +++ b/modules.d/99base/dracut-lib.sh > @@ -1206,7 +1206,7 @@ are_lists_eq() { > > setmemdebug() { > if [ -z "$DEBUG_MEM_LEVEL" ]; then > - export DEBUG_MEM_LEVEL=$(getargnum 0 0 3 rd.memdebug) > + export DEBUG_MEM_LEVEL=$(getargnum 0 0 4 rd.memdebug) > fi > } > > @@ -1296,6 +1296,9 @@ show_memstats() > iomem) > cat /proc/iomem > ;; > + komem) > + showkomem > + ;; > esac > } > > diff --git a/modules.d/99base/init.sh b/modules.d/99base/init.sh > index a563393..f0195d8 100755 > --- a/modules.d/99base/init.sh > +++ b/modules.d/99base/init.sh > @@ -131,7 +131,7 @@ if ! getargbool 1 'rd.hostonly'; then > fi > > # run scriptlets to parse the command line > -make_trace_mem "hook cmdline" '1+:mem' '1+:iomem' '3+:slab' > +make_trace_mem "hook cmdline" '1+:mem' '1+:iomem' '3+:slab' '4+:komem' > getarg 'rd.break=cmdline' -d 'rdbreak=cmdline' && emergency_shell -n cmdline "Break before cmdline" > source_hook cmdline > > @@ -141,7 +141,7 @@ source_hook cmdline > export root rflags fstype netroot NEWROOT > > # pre-udev scripts run before udev starts, and are run only once. > -make_trace_mem "hook pre-udev" '1:shortmem' '2+:mem' '3+:slab' > +make_trace_mem "hook pre-udev" '1:shortmem' '2+:mem' '3+:slab' '4+:komem' > getarg 'rd.break=pre-udev' -d 'rdbreak=pre-udev' && emergency_shell -n pre-udev "Break before pre-udev" > source_hook pre-udev > > @@ -160,7 +160,7 @@ fi > > udevproperty "hookdir=$hookdir" > > -make_trace_mem "hook pre-trigger" '1:shortmem' '2+:mem' '3+:slab' > +make_trace_mem "hook pre-trigger" '1:shortmem' '2+:mem' '3+:slab' '4+:komem' > getarg 'rd.break=pre-trigger' -d 'rdbreak=pre-trigger' && emergency_shell -n pre-trigger "Break before pre-trigger" > source_hook pre-trigger > > @@ -230,7 +230,7 @@ unset RDRETRY > > # pre-mount happens before we try to mount the root filesystem, > # and happens once. > -make_trace_mem "hook pre-mount" '1:shortmem' '2+:mem' '3+:slab' > +make_trace_mem "hook pre-mount" '1:shortmem' '2+:mem' '3+:slab' '4+:komem' > getarg 'rd.break=pre-mount' -d 'rdbreak=pre-mount' && emergency_shell -n pre-mount "Break pre-mount" > source_hook pre-mount > > @@ -266,7 +266,7 @@ done > > # pre pivot scripts are sourced just before we doing cleanup and switch over > # to the new root. > -make_trace_mem "hook pre-pivot" '1:shortmem' '2+:mem' '3+:slab' > +make_trace_mem "hook pre-pivot" '1:shortmem' '2+:mem' '3+:slab' '4+:komem' > getarg 'rd.break=pre-pivot' -d 'rdbreak=pre-pivot' && emergency_shell -n pre-pivot "Break pre-pivot" > source_hook pre-pivot > > diff --git a/modules.d/99base/module-setup.sh b/modules.d/99base/module-setup.sh > index b03772e..13019f0 100755 > --- a/modules.d/99base/module-setup.sh > +++ b/modules.d/99base/module-setup.sh > @@ -35,6 +35,7 @@ install() { > inst_script "$moddir/initqueue.sh" "/sbin/initqueue" > inst_script "$moddir/loginit.sh" "/sbin/loginit" > inst_script "$moddir/rdsosreport.sh" "/sbin/rdsosreport" > + inst_script "$moddir/memdebug-ko.sh" "/sbin/showkomem" > > [ -e "${initdir}/lib" ] || mkdir -m 0755 -p ${initdir}/lib > mkdir -m 0755 -p ${initdir}/lib/dracut > -- > 1.8.3.1 > > Thanks Dave ^ permalink raw reply [flat|nested] 14+ messages in thread
[parent not found: <20161104055026.GB2889-0VdLhd/A9Pl+NNSt+8eSiB/sF2h8X+2i0E9HWUfgJXw@public.gmane.org>]
* Re: [RFC PATCH 1/2] add 99memdebug-ko dracut module [not found] ` <20161104055026.GB2889-0VdLhd/A9Pl+NNSt+8eSiB/sF2h8X+2i0E9HWUfgJXw@public.gmane.org> @ 2016-11-04 6:35 ` Xunlei Pang [not found] ` <581C2C30.8070403-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 14+ messages in thread From: Xunlei Pang @ 2016-11-04 6:35 UTC (permalink / raw) To: Dave Young, xlpang-H+wXaHxf7aLQT0dZR+AlfA Cc: Harald Hoyer, initramfs-u79uwXL29TY76Z2rM5mHXA, Pratyush Anand On 2016/11/04 at 13:50, Dave Young wrote: > On 11/03/16 at 07:52pm, Xunlei Pang wrote: >> On 2016/11/03 at 16:38, Dave Young wrote: >>> On 11/03/16 at 03:28pm, Xunlei Pang wrote: >>> [snip] >>>>> For large trace data(tested on rhel7, the filter doesn't work on rhel7, and will produce huge trace data), >>>>> the time consumption is huge, I am afraid in minutes because I once suspected the script was in some >>>>> dead loop when parsing "tracing/trace" directly. It is the same situation when turning off tracing_on and >>>>> try again. >>>> Although I don't know why, after I replaced the following scripts >>>> 1) >>>> while read pid cpu flags ts function >>>> do >>>> ... ... >>>> done < "$TRACE_BASE/tracing/trace" >>>> >>>> with >>>> >>>> 2) >>>> cat "$TRACE_BASE/tracing/trace" | while read pid cpu flags ts function >>>> do >>>> ... ... >>>> done >>>> >>>> 2) became not time-consuming just like parsing the copied filename in 1) ... >>> Maybe 1) read the sysfs file a lot of times, but 2) only once then >>> parsing them in pipe which is quiker. >>> >>> It should be fine if 2) is acceptable, but if the data is very large it >>> may worth to use some external program like awk which will be faster. >> Hi Dave, >> >> What do you think the following approach? >> >> ============== [PATCH 1/2] ================ >> --- >> modules.d/99base/memdebug-ko.sh | 119 ++++++++++++++++++++++++++++++++++++++++ > memtrace-ko.sh sounds better I choose this because of "rd.memdebug", anyway will rename :-) > >> 1 file changed, 119 insertions(+) >> create mode 100755 modules.d/99base/memdebug-ko.sh >> >> diff --git a/modules.d/99base/memdebug-ko.sh b/modules.d/99base/memdebug-ko.sh >> new file mode 100755 >> index 0000000..2839966 >> --- /dev/null >> +++ b/modules.d/99base/memdebug-ko.sh >> @@ -0,0 +1,119 @@ >> +# Try to find out kernel modules with large total memory allocation during loading. >> +# For large slab allocation, it will fall into buddy, thus tracing "mm_page_alloc" >> +# alone should be enough for the purpose. >> + >> +# "sys/kernel/tracing" has the priority if exists. >> +get_trace_base() { >> + # trace access through debugfs would be obsolete if "/sys/kernel/tracing" is available. >> + if [[ -d "/sys/kernel/tracing" ]]; then >> + echo "/sys/kernel" >> + else >> + echo "/sys/kernel/debug" >> + fi >> +} >> + >> +is_trace_data_prepared() { >> + local trace_base >> + >> + trace_base=$(get_trace_base) >> + # old debugfs interface case. >> + if ! [[ -d "$trace_base/tracing" ]]; then >> + mount none -t debugfs $trace_base >> + # new tracefs interface case. >> + elif ! [[ -f "$trace_base/tracing/trace" ]]; then >> + mount none -t tracefs "$trace_base/tracing" >> + fi >> + >> + if ! [[ -f "$trace_base/tracing/trace" ]]; then >> + echo "WARN: Mount trace failed for kernel module memory analyzing." >> + return 1 >> + fi >> + >> + MATCH_EVENTS="module:module_put module:module_load kmem:mm_page_alloc" >> + SET_EVENTS=$(echo $(cat $trace_base/tracing/set_event)) >> + # Check if trace was properly setup, prepare it if not. >> + if [[ $(cat $trace_base/tracing/tracing_on) != 1 ]] || \ > How about return in case tracing_on == 1, then no need indent later. We still better to make sure the events are the ones we really need, as others may utilize tracing to do things they want to. > >> + [[ "$SET_EVENTS" != "$MATCH_EVENTS" ]]; then >> + # Set our trace events. >> + echo $MATCH_EVENTS > $trace_base/tracing/set_event >> + >> + # There are three kinds of known applications for module loading: >> + # "systemd-udevd", "modprobe" and "insmod". >> + # Set them to the mm_page_alloc event filter. >> + # NOTE: Some kernel may not support this format of filter, anyway >> + # the operation will fail and it doesn't matter. >> + page_alloc_filter="comm == systemd-udevd || comm == modprobe || comm == insmod" >> + echo $page_alloc_filter > $trace_base/tracing/events/kmem/mm_page_alloc/filter >> + >> + # Set the number of comm-pid if supported. >> + if [[ -f "$trace_base/tracing/saved_cmdlines_size" ]]; then >> + # Thanks to filters, 4096 is big enough(also well supported). >> + echo 4096 > $trace_base/tracing/saved_cmdlines_size >> + fi >> + >> + # Enable and clear trace data for the first time. >> + echo 1 > $trace_base/tracing/tracing_on >> + echo > $trace_base/tracing/trace >> + echo "Prepare trace success." >> + return 1 > Should return 0? This is the first time we turn off trace and the trace data is cleared, so return 1 on purpose. > >> + fi >> + >> + return 0 >> +} >> + >> +parse_trace_data() { >> + local module_name >> + # Indexed by task pid. >> + local -A current_module >> + # Indexed by module name. >> + local -A module_loaded >> + local -A nr_alloc_pages >> + >> + cat "$(get_trace_base)/tracing/trace" | while read pid cpu flags ts function >> + do >> + # Skip comment lines >> + if [[ $pid = "#" ]]; then >> + continue >> + fi >> + >> + if [[ $function = module_load* ]]; then >> + # One module is being loaded, save the task pid for tracking. >> + module_name=${function#*: } >> + # Remove the trailing after whitespace, there may be the module flags. >> + module_name=${module_name%% *} >> + # Mark current_module to track the task. >> + current_module[$pid]="$module_name" >> + [[ ${module_loaded[$module_name]} ]] && echo "WARN: \"$module_name\" was loaded multiple times!" >> + unset module_loaded[$module_name] >> + nr_alloc_pages[$module_name]=0 >> + continue >> + fi >> + >> + if ! [[ ${current_module[$pid]} ]]; then >> + continue >> + fi >> + >> + # Once we get here, the task is being tracked(is loading a module). >> + # Get the module name. >> + module_name=${current_module[$pid]} >> + >> + if [[ $function = module_put* ]]; then >> + # Mark the module as loaded when the first module_put event happens after module_load. >> + echo "${nr_alloc_pages[$module_name]} pages consumed by \"$module_name\"" >> + module_loaded[$module_name]=1 >> + # Module loading finished, so untrack the task. >> + unset current_module[$pid] >> + continue >> + fi >> + >> + if [[ $function = mm_page_alloc* ]]; then >> + order=$(echo $function | sed -e 's/.*order=\([0-9]*\) .*/\1/') >> + nr_alloc_pages[$module_name]=$((${nr_alloc_pages[$module_name]}+$((2 ** $order)))) >> + fi >> + done >> +} >> + >> +if is_trace_data_prepared ; then >> + echo "showkomem - memory consumption of loading kernel modules(the larger, the more precise)" >> + parse_trace_data >> +fi > I feel the original design is better, prepare and enalbe trace at the > very early of cmdline hook. But below will be useless for komem trace: > +make_trace_mem "hook cmdline" '1+:mem' '1+:iomem' '3+:slab' '4+:komem' > > So in cmdline hook we can just prepare and enable tracing without > parsing. > > Later when we parse it just check if the trace is on or off, if it is > off then just do nothing. I can think of one advantage of the design, that is it doesn't rely on "rd.memdebug", therefore "/sbin/showkomem" can be used as a independent binary(command) to debug your own modules, just run it two times if tracing is not prepared: the first time performs setup, then show the result. Do you mean the way below? 1) Split the script into two parts like before in a new dracut module. Install the preparing part as one cmdline hook in case of "rd.memdebug=4", then the parsing part as "/sbin/showkomem". 2) Then use it in 99base show_memstats() and make_trace_mem() like the way in this approach. But in the way above-mentioned, "/sbin/showkomem" only works if there is "rd.memdebug=4" or require some manual trace configuration before using. > > And we should disable tracing before pivot root.. I think we can just leave the trace there, it does little harm. Also others may still utilize "showkomem" to show extra information, for example if there is some script (like pre-kdump script) modprobe some special modules, we can simply call showkomem to show them out. Regards, Xunlei > >> -- >> 1.8.3.1 >> >> ============== [PATCH 2/2] ================ >> --- >> modules.d/98dracut-systemd/dracut-cmdline.sh | 2 +- >> modules.d/98dracut-systemd/dracut-mount.sh | 2 +- >> modules.d/98dracut-systemd/dracut-pre-mount.sh | 2 +- >> modules.d/98dracut-systemd/dracut-pre-pivot.sh | 2 +- >> modules.d/98dracut-systemd/dracut-pre-trigger.sh | 2 +- >> modules.d/98dracut-systemd/dracut-pre-udev.sh | 2 +- >> modules.d/99base/dracut-lib.sh | 5 ++++- >> modules.d/99base/init.sh | 10 +++++----- >> modules.d/99base/module-setup.sh | 1 + >> 9 files changed, 16 insertions(+), 12 deletions(-) >> >> diff --git a/modules.d/98dracut-systemd/dracut-cmdline.sh b/modules.d/98dracut-systemd/dracut-cmdline.sh >> index 6c6ee02..bff9435 100755 >> --- a/modules.d/98dracut-systemd/dracut-cmdline.sh >> +++ b/modules.d/98dracut-systemd/dracut-cmdline.sh >> @@ -42,7 +42,7 @@ export root >> export rflags >> export fstype >> >> -make_trace_mem "hook cmdline" '1+:mem' '1+:iomem' '3+:slab' >> +make_trace_mem "hook cmdline" '1+:mem' '1+:iomem' '3+:slab' '4+:komem' >> # run scriptlets to parse the command line >> getarg 'rd.break=cmdline' -d 'rdbreak=cmdline' && emergency_shell -n cmdline "Break before cmdline" >> source_hook cmdline >> diff --git a/modules.d/98dracut-systemd/dracut-mount.sh b/modules.d/98dracut-systemd/dracut-mount.sh >> index c4febfe..89ebc31 100755 >> --- a/modules.d/98dracut-systemd/dracut-mount.sh >> +++ b/modules.d/98dracut-systemd/dracut-mount.sh >> @@ -7,7 +7,7 @@ type getarg >/dev/null 2>&1 || . /lib/dracut-lib.sh >> >> source_conf /etc/conf.d >> >> -make_trace_mem "hook mount" '1:shortmem' '2+:mem' '3+:slab' >> +make_trace_mem "hook mount" '1:shortmem' '2+:mem' '3+:slab' '4+:komem' >> >> getarg 'rd.break=mount' -d 'rdbreak=mount' && emergency_shell -n mount "Break mount" >> # mount scripts actually try to mount the root filesystem, and may >> diff --git a/modules.d/98dracut-systemd/dracut-pre-mount.sh b/modules.d/98dracut-systemd/dracut-pre-mount.sh >> index ae51128..a3b9d29 100755 >> --- a/modules.d/98dracut-systemd/dracut-pre-mount.sh >> +++ b/modules.d/98dracut-systemd/dracut-pre-mount.sh >> @@ -8,7 +8,7 @@ type getarg >/dev/null 2>&1 || . /lib/dracut-lib.sh >> >> source_conf /etc/conf.d >> >> -make_trace_mem "hook pre-mount" '1:shortmem' '2+:mem' '3+:slab' >> +make_trace_mem "hook pre-mount" '1:shortmem' '2+:mem' '3+:slab' '4+:komem' >> # pre pivot scripts are sourced just before we doing cleanup and switch over >> # to the new root. >> getarg 'rd.break=pre-mount' 'rdbreak=pre-mount' && emergency_shell -n pre-mount "Break pre-mount" >> diff --git a/modules.d/98dracut-systemd/dracut-pre-pivot.sh b/modules.d/98dracut-systemd/dracut-pre-pivot.sh >> index cc70e3c..dfd328c 100755 >> --- a/modules.d/98dracut-systemd/dracut-pre-pivot.sh >> +++ b/modules.d/98dracut-systemd/dracut-pre-pivot.sh >> @@ -8,7 +8,7 @@ type getarg >/dev/null 2>&1 || . /lib/dracut-lib.sh >> >> source_conf /etc/conf.d >> >> -make_trace_mem "hook pre-pivot" '1:shortmem' '2+:mem' '3+:slab' >> +make_trace_mem "hook pre-pivot" '1:shortmem' '2+:mem' '3+:slab' '4+:komem' >> # pre pivot scripts are sourced just before we doing cleanup and switch over >> # to the new root. >> getarg 'rd.break=pre-pivot' 'rdbreak=pre-pivot' && emergency_shell -n pre-pivot "Break pre-pivot" >> diff --git a/modules.d/98dracut-systemd/dracut-pre-trigger.sh b/modules.d/98dracut-systemd/dracut-pre-trigger.sh >> index ac1ec36..7cd821e 100755 >> --- a/modules.d/98dracut-systemd/dracut-pre-trigger.sh >> +++ b/modules.d/98dracut-systemd/dracut-pre-trigger.sh >> @@ -8,7 +8,7 @@ type getarg >/dev/null 2>&1 || . /lib/dracut-lib.sh >> >> source_conf /etc/conf.d >> >> -make_trace_mem "hook pre-trigger" "1:shortmem" "2+:mem" "3+:slab" >> +make_trace_mem "hook pre-trigger" '1:shortmem' '2+:mem' '3+:slab' '4+:komem' >> >> source_hook pre-trigger >> >> diff --git a/modules.d/98dracut-systemd/dracut-pre-udev.sh b/modules.d/98dracut-systemd/dracut-pre-udev.sh >> index ca13048..17268a1 100755 >> --- a/modules.d/98dracut-systemd/dracut-pre-udev.sh >> +++ b/modules.d/98dracut-systemd/dracut-pre-udev.sh >> @@ -7,7 +7,7 @@ type getarg >/dev/null 2>&1 || . /lib/dracut-lib.sh >> >> source_conf /etc/conf.d >> >> -make_trace_mem "hook pre-udev" '1:shortmem' '2+:mem' '3+:slab' >> +make_trace_mem "hook pre-udev" '1:shortmem' '2+:mem' '3+:slab' '4+:komem' >> # pre pivot scripts are sourced just before we doing cleanup and switch over >> # to the new root. >> getarg 'rd.break=pre-udev' 'rdbreak=pre-udev' && emergency_shell -n pre-udev "Break pre-udev" >> diff --git a/modules.d/99base/dracut-lib.sh b/modules.d/99base/dracut-lib.sh >> index 060b3fe..833ed5f 100755 >> --- a/modules.d/99base/dracut-lib.sh >> +++ b/modules.d/99base/dracut-lib.sh >> @@ -1206,7 +1206,7 @@ are_lists_eq() { >> >> setmemdebug() { >> if [ -z "$DEBUG_MEM_LEVEL" ]; then >> - export DEBUG_MEM_LEVEL=$(getargnum 0 0 3 rd.memdebug) >> + export DEBUG_MEM_LEVEL=$(getargnum 0 0 4 rd.memdebug) >> fi >> } >> >> @@ -1296,6 +1296,9 @@ show_memstats() >> iomem) >> cat /proc/iomem >> ;; >> + komem) >> + showkomem >> + ;; >> esac >> } >> >> diff --git a/modules.d/99base/init.sh b/modules.d/99base/init.sh >> index a563393..f0195d8 100755 >> --- a/modules.d/99base/init.sh >> +++ b/modules.d/99base/init.sh >> @@ -131,7 +131,7 @@ if ! getargbool 1 'rd.hostonly'; then >> fi >> >> # run scriptlets to parse the command line >> -make_trace_mem "hook cmdline" '1+:mem' '1+:iomem' '3+:slab' >> +make_trace_mem "hook cmdline" '1+:mem' '1+:iomem' '3+:slab' '4+:komem' >> getarg 'rd.break=cmdline' -d 'rdbreak=cmdline' && emergency_shell -n cmdline "Break before cmdline" >> source_hook cmdline >> >> @@ -141,7 +141,7 @@ source_hook cmdline >> export root rflags fstype netroot NEWROOT >> >> # pre-udev scripts run before udev starts, and are run only once. >> -make_trace_mem "hook pre-udev" '1:shortmem' '2+:mem' '3+:slab' >> +make_trace_mem "hook pre-udev" '1:shortmem' '2+:mem' '3+:slab' '4+:komem' >> getarg 'rd.break=pre-udev' -d 'rdbreak=pre-udev' && emergency_shell -n pre-udev "Break before pre-udev" >> source_hook pre-udev >> >> @@ -160,7 +160,7 @@ fi >> >> udevproperty "hookdir=$hookdir" >> >> -make_trace_mem "hook pre-trigger" '1:shortmem' '2+:mem' '3+:slab' >> +make_trace_mem "hook pre-trigger" '1:shortmem' '2+:mem' '3+:slab' '4+:komem' >> getarg 'rd.break=pre-trigger' -d 'rdbreak=pre-trigger' && emergency_shell -n pre-trigger "Break before pre-trigger" >> source_hook pre-trigger >> >> @@ -230,7 +230,7 @@ unset RDRETRY >> >> # pre-mount happens before we try to mount the root filesystem, >> # and happens once. >> -make_trace_mem "hook pre-mount" '1:shortmem' '2+:mem' '3+:slab' >> +make_trace_mem "hook pre-mount" '1:shortmem' '2+:mem' '3+:slab' '4+:komem' >> getarg 'rd.break=pre-mount' -d 'rdbreak=pre-mount' && emergency_shell -n pre-mount "Break pre-mount" >> source_hook pre-mount >> >> @@ -266,7 +266,7 @@ done >> >> # pre pivot scripts are sourced just before we doing cleanup and switch over >> # to the new root. >> -make_trace_mem "hook pre-pivot" '1:shortmem' '2+:mem' '3+:slab' >> +make_trace_mem "hook pre-pivot" '1:shortmem' '2+:mem' '3+:slab' '4+:komem' >> getarg 'rd.break=pre-pivot' -d 'rdbreak=pre-pivot' && emergency_shell -n pre-pivot "Break pre-pivot" >> source_hook pre-pivot >> >> diff --git a/modules.d/99base/module-setup.sh b/modules.d/99base/module-setup.sh >> index b03772e..13019f0 100755 >> --- a/modules.d/99base/module-setup.sh >> +++ b/modules.d/99base/module-setup.sh >> @@ -35,6 +35,7 @@ install() { >> inst_script "$moddir/initqueue.sh" "/sbin/initqueue" >> inst_script "$moddir/loginit.sh" "/sbin/loginit" >> inst_script "$moddir/rdsosreport.sh" "/sbin/rdsosreport" >> + inst_script "$moddir/memdebug-ko.sh" "/sbin/showkomem" >> >> [ -e "${initdir}/lib" ] || mkdir -m 0755 -p ${initdir}/lib >> mkdir -m 0755 -p ${initdir}/lib/dracut >> -- >> 1.8.3.1 >> >> > Thanks > Dave > -- > To unsubscribe from this list: send the line "unsubscribe initramfs" in > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 14+ messages in thread
[parent not found: <581C2C30.8070403-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [RFC PATCH 1/2] add 99memdebug-ko dracut module [not found] ` <581C2C30.8070403-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2016-11-04 7:06 ` Dave Young [not found] ` <20161104070629.GA17145-0VdLhd/A9Pl+NNSt+8eSiB/sF2h8X+2i0E9HWUfgJXw@public.gmane.org> 0 siblings, 1 reply; 14+ messages in thread From: Dave Young @ 2016-11-04 7:06 UTC (permalink / raw) To: xlpang-H+wXaHxf7aLQT0dZR+AlfA Cc: Harald Hoyer, initramfs-u79uwXL29TY76Z2rM5mHXA, Pratyush Anand On 11/04/16 at 02:35pm, Xunlei Pang wrote: > On 2016/11/04 at 13:50, Dave Young wrote: > > On 11/03/16 at 07:52pm, Xunlei Pang wrote: > >> On 2016/11/03 at 16:38, Dave Young wrote: > >>> On 11/03/16 at 03:28pm, Xunlei Pang wrote: > >>> [snip] > >>>>> For large trace data(tested on rhel7, the filter doesn't work on rhel7, and will produce huge trace data), > >>>>> the time consumption is huge, I am afraid in minutes because I once suspected the script was in some > >>>>> dead loop when parsing "tracing/trace" directly. It is the same situation when turning off tracing_on and > >>>>> try again. > >>>> Although I don't know why, after I replaced the following scripts > >>>> 1) > >>>> while read pid cpu flags ts function > >>>> do > >>>> ... ... > >>>> done < "$TRACE_BASE/tracing/trace" > >>>> > >>>> with > >>>> > >>>> 2) > >>>> cat "$TRACE_BASE/tracing/trace" | while read pid cpu flags ts function > >>>> do > >>>> ... ... > >>>> done > >>>> > >>>> 2) became not time-consuming just like parsing the copied filename in 1) ... > >>> Maybe 1) read the sysfs file a lot of times, but 2) only once then > >>> parsing them in pipe which is quiker. > >>> > >>> It should be fine if 2) is acceptable, but if the data is very large it > >>> may worth to use some external program like awk which will be faster. > >> Hi Dave, > >> > >> What do you think the following approach? > >> > >> ============== [PATCH 1/2] ================ > >> --- > >> modules.d/99base/memdebug-ko.sh | 119 ++++++++++++++++++++++++++++++++++++++++ > > memtrace-ko.sh sounds better > > I choose this because of "rd.memdebug", anyway will rename :-) > > > > >> 1 file changed, 119 insertions(+) > >> create mode 100755 modules.d/99base/memdebug-ko.sh > >> > >> diff --git a/modules.d/99base/memdebug-ko.sh b/modules.d/99base/memdebug-ko.sh > >> new file mode 100755 > >> index 0000000..2839966 > >> --- /dev/null > >> +++ b/modules.d/99base/memdebug-ko.sh > >> @@ -0,0 +1,119 @@ > >> +# Try to find out kernel modules with large total memory allocation during loading. > >> +# For large slab allocation, it will fall into buddy, thus tracing "mm_page_alloc" > >> +# alone should be enough for the purpose. > >> + > >> +# "sys/kernel/tracing" has the priority if exists. > >> +get_trace_base() { > >> + # trace access through debugfs would be obsolete if "/sys/kernel/tracing" is available. > >> + if [[ -d "/sys/kernel/tracing" ]]; then > >> + echo "/sys/kernel" > >> + else > >> + echo "/sys/kernel/debug" > >> + fi > >> +} > >> + > >> +is_trace_data_prepared() { > >> + local trace_base > >> + > >> + trace_base=$(get_trace_base) > >> + # old debugfs interface case. > >> + if ! [[ -d "$trace_base/tracing" ]]; then > >> + mount none -t debugfs $trace_base > >> + # new tracefs interface case. > >> + elif ! [[ -f "$trace_base/tracing/trace" ]]; then > >> + mount none -t tracefs "$trace_base/tracing" > >> + fi > >> + > >> + if ! [[ -f "$trace_base/tracing/trace" ]]; then > >> + echo "WARN: Mount trace failed for kernel module memory analyzing." > >> + return 1 > >> + fi > >> + > >> + MATCH_EVENTS="module:module_put module:module_load kmem:mm_page_alloc" > >> + SET_EVENTS=$(echo $(cat $trace_base/tracing/set_event)) > >> + # Check if trace was properly setup, prepare it if not. > >> + if [[ $(cat $trace_base/tracing/tracing_on) != 1 ]] || \ > > How about return in case tracing_on == 1, then no need indent later. > > We still better to make sure the events are the ones we really need, as others may utilize tracing > to do things they want to. > Yes, I missed it.. So like below? if traccing == on && events-matched; then return 0; fi [...] > > > >> + [[ "$SET_EVENTS" != "$MATCH_EVENTS" ]]; then > >> + # Set our trace events. > >> + echo $MATCH_EVENTS > $trace_base/tracing/set_event > >> + > >> + # There are three kinds of known applications for module loading: > >> + # "systemd-udevd", "modprobe" and "insmod". > >> + # Set them to the mm_page_alloc event filter. > >> + # NOTE: Some kernel may not support this format of filter, anyway > >> + # the operation will fail and it doesn't matter. > >> + page_alloc_filter="comm == systemd-udevd || comm == modprobe || comm == insmod" > >> + echo $page_alloc_filter > $trace_base/tracing/events/kmem/mm_page_alloc/filter > >> + > >> + # Set the number of comm-pid if supported. > >> + if [[ -f "$trace_base/tracing/saved_cmdlines_size" ]]; then > >> + # Thanks to filters, 4096 is big enough(also well supported). > >> + echo 4096 > $trace_base/tracing/saved_cmdlines_size > >> + fi > >> + > >> + # Enable and clear trace data for the first time. > >> + echo 1 > $trace_base/tracing/tracing_on > >> + echo > $trace_base/tracing/trace > >> + echo "Prepare trace success." > >> + return 1 > > Should return 0? > > This is the first time we turn off trace and the trace data is cleared, so return 1 on purpose. It is not easy to get and conflicts with the fuction name ;) If change the logic a bit, it will be clear: is_trace_prepared() { this function only check if tracing is enabled and events matched. } prepare_trace() { enable and prepare trace } parse_trace_data() { } if is_trace_prepared == true; then parse_trace_data else prepare_trace fi So in cmdline hook is_trace_prepared is false so the script only prepare trace and enable tracing, in latter hooks it will parse trace data. But there should be another function to disable trace before pivot root. > > > > >> + fi > >> + > >> + return 0 > >> +} > >> + > >> +parse_trace_data() { > >> + local module_name > >> + # Indexed by task pid. > >> + local -A current_module > >> + # Indexed by module name. > >> + local -A module_loaded > >> + local -A nr_alloc_pages > >> + > >> + cat "$(get_trace_base)/tracing/trace" | while read pid cpu flags ts function > >> + do > >> + # Skip comment lines > >> + if [[ $pid = "#" ]]; then > >> + continue > >> + fi > >> + > >> + if [[ $function = module_load* ]]; then > >> + # One module is being loaded, save the task pid for tracking. > >> + module_name=${function#*: } > >> + # Remove the trailing after whitespace, there may be the module flags. > >> + module_name=${module_name%% *} > >> + # Mark current_module to track the task. > >> + current_module[$pid]="$module_name" > >> + [[ ${module_loaded[$module_name]} ]] && echo "WARN: \"$module_name\" was loaded multiple times!" > >> + unset module_loaded[$module_name] > >> + nr_alloc_pages[$module_name]=0 > >> + continue > >> + fi > >> + > >> + if ! [[ ${current_module[$pid]} ]]; then > >> + continue > >> + fi > >> + > >> + # Once we get here, the task is being tracked(is loading a module). > >> + # Get the module name. > >> + module_name=${current_module[$pid]} > >> + > >> + if [[ $function = module_put* ]]; then > >> + # Mark the module as loaded when the first module_put event happens after module_load. > >> + echo "${nr_alloc_pages[$module_name]} pages consumed by \"$module_name\"" > >> + module_loaded[$module_name]=1 > >> + # Module loading finished, so untrack the task. > >> + unset current_module[$pid] > >> + continue > >> + fi > >> + > >> + if [[ $function = mm_page_alloc* ]]; then > >> + order=$(echo $function | sed -e 's/.*order=\([0-9]*\) .*/\1/') > >> + nr_alloc_pages[$module_name]=$((${nr_alloc_pages[$module_name]}+$((2 ** $order)))) > >> + fi > >> + done > >> +} > >> + > >> +if is_trace_data_prepared ; then > >> + echo "showkomem - memory consumption of loading kernel modules(the larger, the more precise)" > >> + parse_trace_data > >> +fi > > I feel the original design is better, prepare and enalbe trace at the > > very early of cmdline hook. But below will be useless for komem trace: > > +make_trace_mem "hook cmdline" '1+:mem' '1+:iomem' '3+:slab' '4+:komem' > > > > So in cmdline hook we can just prepare and enable tracing without > > parsing. > > > > Later when we parse it just check if the trace is on or off, if it is > > off then just do nothing. > > I can think of one advantage of the design, that is it doesn't rely on "rd.memdebug", therefore > "/sbin/showkomem" can be used as a independent binary(command) to debug your own modules, > just run it two times if tracing is not prepared: the first time performs setup, then show the result. I did not notice it, agreed current way in this patch is better... > > Do you mean the way below? > 1) Split the script into two parts like before in a new dracut module. Install the preparing part > as one cmdline hook in case of "rd.memdebug=4", then the parsing part as "/sbin/showkomem". > > 2) Then use it in 99base show_memstats() and make_trace_mem() like the way in this approach. > > But in the way above-mentioned, "/sbin/showkomem" only works if there is "rd.memdebug=4" or > require some manual trace configuration before using. > > > > > And we should disable tracing before pivot root.. > > I think we can just leave the trace there, it does little harm. > > Also others may still utilize "showkomem" to show extra information, for example if there is some > script (like pre-kdump script) modprobe some special modules, we can simply call showkomem to > show them out. Yes, it makes sense for user like kdump, in Fedora kdump add a pre-pivot hook to capture vmcore. But we can still disable trace after all hooks finishing (include kdump though it will never return) in dracut code. Thanks Dave ^ permalink raw reply [flat|nested] 14+ messages in thread
[parent not found: <20161104070629.GA17145-0VdLhd/A9Pl+NNSt+8eSiB/sF2h8X+2i0E9HWUfgJXw@public.gmane.org>]
* Re: [RFC PATCH 1/2] add 99memdebug-ko dracut module [not found] ` <20161104070629.GA17145-0VdLhd/A9Pl+NNSt+8eSiB/sF2h8X+2i0E9HWUfgJXw@public.gmane.org> @ 2016-11-04 7:55 ` Xunlei Pang [not found] ` <581C3EDE.9080601-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 14+ messages in thread From: Xunlei Pang @ 2016-11-04 7:55 UTC (permalink / raw) To: Dave Young, xlpang-H+wXaHxf7aLQT0dZR+AlfA Cc: Harald Hoyer, initramfs-u79uwXL29TY76Z2rM5mHXA, Pratyush Anand On 2016/11/04 at 15:06, Dave Young wrote: > On 11/04/16 at 02:35pm, Xunlei Pang wrote: >> On 2016/11/04 at 13:50, Dave Young wrote: >>> On 11/03/16 at 07:52pm, Xunlei Pang wrote: >>>> On 2016/11/03 at 16:38, Dave Young wrote: >>>>> On 11/03/16 at 03:28pm, Xunlei Pang wrote: >>>>> [snip] >>>>>>> For large trace data(tested on rhel7, the filter doesn't work on rhel7, and will produce huge trace data), >>>>>>> the time consumption is huge, I am afraid in minutes because I once suspected the script was in some >>>>>>> dead loop when parsing "tracing/trace" directly. It is the same situation when turning off tracing_on and >>>>>>> try again. >>>>>> Although I don't know why, after I replaced the following scripts >>>>>> 1) >>>>>> while read pid cpu flags ts function >>>>>> do >>>>>> ... ... >>>>>> done < "$TRACE_BASE/tracing/trace" >>>>>> >>>>>> with >>>>>> >>>>>> 2) >>>>>> cat "$TRACE_BASE/tracing/trace" | while read pid cpu flags ts function >>>>>> do >>>>>> ... ... >>>>>> done >>>>>> >>>>>> 2) became not time-consuming just like parsing the copied filename in 1) ... >>>>> Maybe 1) read the sysfs file a lot of times, but 2) only once then >>>>> parsing them in pipe which is quiker. >>>>> >>>>> It should be fine if 2) is acceptable, but if the data is very large it >>>>> may worth to use some external program like awk which will be faster. >>>> Hi Dave, >>>> >>>> What do you think the following approach? >>>> >>>> ============== [PATCH 1/2] ================ >>>> --- >>>> modules.d/99base/memdebug-ko.sh | 119 ++++++++++++++++++++++++++++++++++++++++ >>> memtrace-ko.sh sounds better >> I choose this because of "rd.memdebug", anyway will rename :-) >> >>>> 1 file changed, 119 insertions(+) >>>> create mode 100755 modules.d/99base/memdebug-ko.sh >>>> >>>> diff --git a/modules.d/99base/memdebug-ko.sh b/modules.d/99base/memdebug-ko.sh >>>> new file mode 100755 >>>> index 0000000..2839966 >>>> --- /dev/null >>>> +++ b/modules.d/99base/memdebug-ko.sh >>>> @@ -0,0 +1,119 @@ >>>> +# Try to find out kernel modules with large total memory allocation during loading. >>>> +# For large slab allocation, it will fall into buddy, thus tracing "mm_page_alloc" >>>> +# alone should be enough for the purpose. >>>> + >>>> +# "sys/kernel/tracing" has the priority if exists. >>>> +get_trace_base() { >>>> + # trace access through debugfs would be obsolete if "/sys/kernel/tracing" is available. >>>> + if [[ -d "/sys/kernel/tracing" ]]; then >>>> + echo "/sys/kernel" >>>> + else >>>> + echo "/sys/kernel/debug" >>>> + fi >>>> +} >>>> + >>>> +is_trace_data_prepared() { >>>> + local trace_base >>>> + >>>> + trace_base=$(get_trace_base) >>>> + # old debugfs interface case. >>>> + if ! [[ -d "$trace_base/tracing" ]]; then >>>> + mount none -t debugfs $trace_base >>>> + # new tracefs interface case. >>>> + elif ! [[ -f "$trace_base/tracing/trace" ]]; then >>>> + mount none -t tracefs "$trace_base/tracing" >>>> + fi >>>> + >>>> + if ! [[ -f "$trace_base/tracing/trace" ]]; then >>>> + echo "WARN: Mount trace failed for kernel module memory analyzing." >>>> + return 1 >>>> + fi >>>> + >>>> + MATCH_EVENTS="module:module_put module:module_load kmem:mm_page_alloc" >>>> + SET_EVENTS=$(echo $(cat $trace_base/tracing/set_event)) >>>> + # Check if trace was properly setup, prepare it if not. >>>> + if [[ $(cat $trace_base/tracing/tracing_on) != 1 ]] || \ >>> How about return in case tracing_on == 1, then no need indent later. >> We still better to make sure the events are the ones we really need, as others may utilize tracing >> to do things they want to. >> > Yes, I missed it.. So like below? > > if traccing == on && events-matched; then > return 0; > fi > > [...] This is better, can avoid the indentation, will update. > >>>> + [[ "$SET_EVENTS" != "$MATCH_EVENTS" ]]; then >>>> + # Set our trace events. >>>> + echo $MATCH_EVENTS > $trace_base/tracing/set_event >>>> + >>>> + # There are three kinds of known applications for module loading: >>>> + # "systemd-udevd", "modprobe" and "insmod". >>>> + # Set them to the mm_page_alloc event filter. >>>> + # NOTE: Some kernel may not support this format of filter, anyway >>>> + # the operation will fail and it doesn't matter. >>>> + page_alloc_filter="comm == systemd-udevd || comm == modprobe || comm == insmod" >>>> + echo $page_alloc_filter > $trace_base/tracing/events/kmem/mm_page_alloc/filter >>>> + >>>> + # Set the number of comm-pid if supported. >>>> + if [[ -f "$trace_base/tracing/saved_cmdlines_size" ]]; then >>>> + # Thanks to filters, 4096 is big enough(also well supported). >>>> + echo 4096 > $trace_base/tracing/saved_cmdlines_size >>>> + fi >>>> + >>>> + # Enable and clear trace data for the first time. >>>> + echo 1 > $trace_base/tracing/tracing_on >>>> + echo > $trace_base/tracing/trace >>>> + echo "Prepare trace success." >>>> + return 1 >>> Should return 0? >> This is the first time we turn off trace and the trace data is cleared, so return 1 on purpose. > It is not easy to get and conflicts with the fuction name ;) > > If change the logic a bit, it will be clear: > > is_trace_prepared() > { > this function only check if tracing is enabled and events matched. > } > > prepare_trace() > { > enable and prepare trace > } > > parse_trace_data() > { > > } > > if is_trace_prepared == true; then > parse_trace_data > else > prepare_trace > fi > > So in cmdline hook is_trace_prepared is false so the script only prepare > trace and enable tracing, in latter hooks it will parse trace data. > > But there should be another function to disable trace before pivot root. OK We should find a neat way to disable the trace function, what do you think the following way? Change show_memstats() and make_trace_mem() as follows: make_trace_mem "hook cmdline" '1+:mem' '1+:iomem' '3+:slab' '4+:komem' ... ... make_trace_mem "hook pre-pivot" '1+:mem' '1+:iomem' '3+:slab' '4+:komem' make_trace_mem "hook pre-pivot" '4+:komemfinish' # cleanup trace used by showkomem after use. show_memstats() ... ... + komem) + showkomem + ;; + komemfinish) + # disable trace after use. + showkomem finish + ;; esac Regards, Xunlei >>>> + fi >>>> + >>>> + return 0 >>>> +} >>>> + >>>> +parse_trace_data() { >>>> + local module_name >>>> + # Indexed by task pid. >>>> + local -A current_module >>>> + # Indexed by module name. >>>> + local -A module_loaded >>>> + local -A nr_alloc_pages >>>> + >>>> + cat "$(get_trace_base)/tracing/trace" | while read pid cpu flags ts function >>>> + do >>>> + # Skip comment lines >>>> + if [[ $pid = "#" ]]; then >>>> + continue >>>> + fi >>>> + >>>> + if [[ $function = module_load* ]]; then >>>> + # One module is being loaded, save the task pid for tracking. >>>> + module_name=${function#*: } >>>> + # Remove the trailing after whitespace, there may be the module flags. >>>> + module_name=${module_name%% *} >>>> + # Mark current_module to track the task. >>>> + current_module[$pid]="$module_name" >>>> + [[ ${module_loaded[$module_name]} ]] && echo "WARN: \"$module_name\" was loaded multiple times!" >>>> + unset module_loaded[$module_name] >>>> + nr_alloc_pages[$module_name]=0 >>>> + continue >>>> + fi >>>> + >>>> + if ! [[ ${current_module[$pid]} ]]; then >>>> + continue >>>> + fi >>>> + >>>> + # Once we get here, the task is being tracked(is loading a module). >>>> + # Get the module name. >>>> + module_name=${current_module[$pid]} >>>> + >>>> + if [[ $function = module_put* ]]; then >>>> + # Mark the module as loaded when the first module_put event happens after module_load. >>>> + echo "${nr_alloc_pages[$module_name]} pages consumed by \"$module_name\"" >>>> + module_loaded[$module_name]=1 >>>> + # Module loading finished, so untrack the task. >>>> + unset current_module[$pid] >>>> + continue >>>> + fi >>>> + >>>> + if [[ $function = mm_page_alloc* ]]; then >>>> + order=$(echo $function | sed -e 's/.*order=\([0-9]*\) .*/\1/') >>>> + nr_alloc_pages[$module_name]=$((${nr_alloc_pages[$module_name]}+$((2 ** $order)))) >>>> + fi >>>> + done >>>> +} >>>> + >>>> +if is_trace_data_prepared ; then >>>> + echo "showkomem - memory consumption of loading kernel modules(the larger, the more precise)" >>>> + parse_trace_data >>>> +fi >>> I feel the original design is better, prepare and enalbe trace at the >>> very early of cmdline hook. But below will be useless for komem trace: >>> +make_trace_mem "hook cmdline" '1+:mem' '1+:iomem' '3+:slab' '4+:komem' >>> >>> So in cmdline hook we can just prepare and enable tracing without >>> parsing. >>> >>> Later when we parse it just check if the trace is on or off, if it is >>> off then just do nothing. >> I can think of one advantage of the design, that is it doesn't rely on "rd.memdebug", therefore >> "/sbin/showkomem" can be used as a independent binary(command) to debug your own modules, >> just run it two times if tracing is not prepared: the first time performs setup, then show the result. > I did not notice it, agreed current way in this patch is better... > >> Do you mean the way below? >> 1) Split the script into two parts like before in a new dracut module. Install the preparing part >> as one cmdline hook in case of "rd.memdebug=4", then the parsing part as "/sbin/showkomem". >> >> 2) Then use it in 99base show_memstats() and make_trace_mem() like the way in this approach. >> >> But in the way above-mentioned, "/sbin/showkomem" only works if there is "rd.memdebug=4" or >> require some manual trace configuration before using. >> >>> And we should disable tracing before pivot root.. >> I think we can just leave the trace there, it does little harm. >> >> Also others may still utilize "showkomem" to show extra information, for example if there is some >> script (like pre-kdump script) modprobe some special modules, we can simply call showkomem to >> show them out. > Yes, it makes sense for user like kdump, in Fedora kdump add a pre-pivot > hook to capture vmcore. But we can still disable trace after all hooks > finishing (include kdump though it will never return) in dracut code. > > Thanks > Dave > -- > To unsubscribe from this list: send the line "unsubscribe initramfs" in > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 14+ messages in thread
[parent not found: <581C3EDE.9080601-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [RFC PATCH 1/2] add 99memdebug-ko dracut module [not found] ` <581C3EDE.9080601-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2016-11-04 8:23 ` Dave Young [not found] ` <20161104082328.GA19275-0VdLhd/A9Pl+NNSt+8eSiB/sF2h8X+2i0E9HWUfgJXw@public.gmane.org> 0 siblings, 1 reply; 14+ messages in thread From: Dave Young @ 2016-11-04 8:23 UTC (permalink / raw) To: xlpang-H+wXaHxf7aLQT0dZR+AlfA Cc: Harald Hoyer, initramfs-u79uwXL29TY76Z2rM5mHXA, Pratyush Anand Hi Xunlei, > > If change the logic a bit, it will be clear: > > > > is_trace_prepared() > > { > > this function only check if tracing is enabled and events matched. > > } > > > > prepare_trace() > > { > > enable and prepare trace > > } > > > > parse_trace_data() > > { > > > > } > > > > if is_trace_prepared == true; then > > parse_trace_data > > else > > prepare_trace > > fi > > > > So in cmdline hook is_trace_prepared is false so the script only prepare > > trace and enable tracing, in latter hooks it will parse trace data. > > > > But there should be another function to disable trace before pivot root. > > OK > > We should find a neat way to disable the trace function, what do you think the following way? > Change show_memstats() and make_trace_mem() as follows: > make_trace_mem "hook cmdline" '1+:mem' '1+:iomem' '3+:slab' '4+:komem' > ... ... > make_trace_mem "hook pre-pivot" '1+:mem' '1+:iomem' '3+:slab' '4+:komem' > make_trace_mem "hook pre-pivot" '4+:komemfinish' # cleanup trace used by showkomem after use. > > show_memstats() > ... ... > > + komem) > + showkomem > + ;; > + komemfinish) > + # disable trace after use. > + showkomem finish > + ;; > esac > Hmm, I'm not sure it is worth, maybe add a function like cleanup_trace_mem() is enough and we can add it just before cleanup hook which is designed for cleanup purpose, cleanup hook is sourced after pre-pivot so that it will still work for adding trace in any pre-pivot scripts. Thanks Dave ^ permalink raw reply [flat|nested] 14+ messages in thread
[parent not found: <20161104082328.GA19275-0VdLhd/A9Pl+NNSt+8eSiB/sF2h8X+2i0E9HWUfgJXw@public.gmane.org>]
* Re: [RFC PATCH 1/2] add 99memdebug-ko dracut module [not found] ` <20161104082328.GA19275-0VdLhd/A9Pl+NNSt+8eSiB/sF2h8X+2i0E9HWUfgJXw@public.gmane.org> @ 2016-11-04 9:12 ` Xunlei Pang 0 siblings, 0 replies; 14+ messages in thread From: Xunlei Pang @ 2016-11-04 9:12 UTC (permalink / raw) To: Dave Young, xlpang-H+wXaHxf7aLQT0dZR+AlfA Cc: Harald Hoyer, initramfs-u79uwXL29TY76Z2rM5mHXA, Pratyush Anand On 2016/11/04 at 16:23, Dave Young wrote: > Hi Xunlei, > >>> If change the logic a bit, it will be clear: >>> >>> is_trace_prepared() >>> { >>> this function only check if tracing is enabled and events matched. >>> } >>> >>> prepare_trace() >>> { >>> enable and prepare trace >>> } >>> >>> parse_trace_data() >>> { >>> >>> } >>> >>> if is_trace_prepared == true; then >>> parse_trace_data >>> else >>> prepare_trace >>> fi >>> >>> So in cmdline hook is_trace_prepared is false so the script only prepare >>> trace and enable tracing, in latter hooks it will parse trace data. >>> >>> But there should be another function to disable trace before pivot root. >> OK >> >> We should find a neat way to disable the trace function, what do you think the following way? >> Change show_memstats() and make_trace_mem() as follows: >> make_trace_mem "hook cmdline" '1+:mem' '1+:iomem' '3+:slab' '4+:komem' >> ... ... >> make_trace_mem "hook pre-pivot" '1+:mem' '1+:iomem' '3+:slab' '4+:komem' >> make_trace_mem "hook pre-pivot" '4+:komemfinish' # cleanup trace used by showkomem after use. >> >> show_memstats() >> ... ... >> >> + komem) >> + showkomem >> + ;; >> + komemfinish) >> + # disable trace after use. >> + showkomem finish >> + ;; >> esac >> > Hmm, I'm not sure it is worth, maybe add a function like > cleanup_trace_mem() is enough and we can add it just before cleanup > hook which is designed for cleanup purpose, cleanup hook is sourced > after pre-pivot so that it will still work for adding trace in any > pre-pivot scripts. OK I'll send v2 later, thanks for all the valuable suggestions. Regards, Xunlei ^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2016-11-04 9:12 UTC | newest]
Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-11-02 9:03 [RFC PATCH 1/2] add 99memdebug-ko dracut module Xunlei Pang
[not found] ` <1478077386-30039-1-git-send-email-xlpang-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-11-02 9:03 ` [RFC PATCH 2/2] dracut.cmdline.7.asc: update document for rd.memdebug=4 Xunlei Pang
2016-11-03 3:01 ` [RFC PATCH 1/2] add 99memdebug-ko dracut module Dave Young
[not found] ` <20161103030142.GA3201-0VdLhd/A9Pl+NNSt+8eSiB/sF2h8X+2i0E9HWUfgJXw@public.gmane.org>
2016-11-03 4:15 ` Xunlei Pang
[not found] ` <581AB9D7.2010905-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-11-03 4:45 ` Xunlei Pang
2016-11-03 7:28 ` Xunlei Pang
[not found] ` <581AE707.9050606-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-11-03 8:38 ` Dave Young
[not found] ` <20161103083826.GA15431-0VdLhd/A9Pl+NNSt+8eSiB/sF2h8X+2i0E9HWUfgJXw@public.gmane.org>
2016-11-03 11:52 ` Xunlei Pang
[not found] ` <581B2518.6060503-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-11-04 5:50 ` Dave Young
[not found] ` <20161104055026.GB2889-0VdLhd/A9Pl+NNSt+8eSiB/sF2h8X+2i0E9HWUfgJXw@public.gmane.org>
2016-11-04 6:35 ` Xunlei Pang
[not found] ` <581C2C30.8070403-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-11-04 7:06 ` Dave Young
[not found] ` <20161104070629.GA17145-0VdLhd/A9Pl+NNSt+8eSiB/sF2h8X+2i0E9HWUfgJXw@public.gmane.org>
2016-11-04 7:55 ` Xunlei Pang
[not found] ` <581C3EDE.9080601-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-11-04 8:23 ` Dave Young
[not found] ` <20161104082328.GA19275-0VdLhd/A9Pl+NNSt+8eSiB/sF2h8X+2i0E9HWUfgJXw@public.gmane.org>
2016-11-04 9:12 ` Xunlei Pang
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox