Re: [PATCH resend V2] Eliminate task stack trace duplication.

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Ying Han <yinghan@google.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
	Minchan Kim <minchan.kim@gmail.com>,
	Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>,
	Balbir Singh <balbir@linux.vnet.ibm.com>,
	Tejun Heo <tj@kernel.org>, Pavel Emelyanov <xemul@openvz.org>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
	Li Zefan <lizf@cn.fujitsu.com>, Mel Gorman <mel@csn.ul.ie>,
	Christoph Lameter <cl@linux.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Rik van Riel <riel@redhat.com>, Hugh Dickins <hughd@google.com>,
	Michal Hocko <mhocko@suse.cz>,
	Dave Hansen <dave@linux.vnet.ibm.com>,
	Zhu Yanhai <zhu.yanhai@gmail.com>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>
Subject: Re: [PATCH resend V2] Eliminate task stack trace duplication.
Date: Tue, 14 Jun 2011 22:25:39 -0700	[thread overview]
Message-ID: <BANLkTi=c6YUHVJtdPZs3prXMqQtrsjsCvg@mail.gmail.com> (raw)
In-Reply-To: <20110614142347.8f9634a9.akpm@linux-foundation.org>

On Tue, Jun 14, 2011 at 2:23 PM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Thu,  2 Jun 2011 22:32:09 -0700 Ying Han <yinghan@google.com> wrote:
>
>> The problem with small dmesg ring buffer like 512k is that only limited number
>> of task traces will be logged. Sometimes we lose important information only
>> because of too many duplicated stack traces.

Thank you Andrew reviewing the patch !

> The description would be improved if it were to point out that this
> problem occurs when dumping lots of stacks in a single operation, such
> as sysrq-T.

I will add the description on the next post.

>
>> This patch tries to reduce the duplication of task stack trace in the dump
>> message by hashing the task stack. The hashtable is a 32k pre-allocated buffer
>> during bootup. Then we hash the task stack with stack_depth 32 for each stack
>> entry. Each time if we find the identical task trace in the task stack, we dump
>> only the pid of the task which has the task trace dumped. So it is easy to back
>> track to the full stack with the pid.
>>
>> [   58.469730] kworker/0:0     S 0000000000000000     0     4      2 0x00000000
>> [   58.469735]  ffff88082fcfde80 0000000000000046 ffff88082e9d8000 ffff88082fcfc010
>> [   58.469739]  ffff88082fce9860 0000000000011440 ffff88082fcfdfd8 ffff88082fcfdfd8
>> [   58.469743]  0000000000011440 0000000000000000 ffff88082fcee180 ffff88082fce9860
>> [   58.469747] Call Trace:
>> [   58.469751]  [<ffffffff8108525a>] worker_thread+0x24b/0x250
>> [   58.469754]  [<ffffffff8108500f>] ? manage_workers+0x192/0x192
>> [   58.469757]  [<ffffffff810885bd>] kthread+0x82/0x8a
>> [   58.469760]  [<ffffffff8141aed4>] kernel_thread_helper+0x4/0x10
>> [   58.469763]  [<ffffffff8108853b>] ? kthread_worker_fn+0x112/0x112
>> [   58.469765]  [<ffffffff8141aed0>] ? gs_change+0xb/0xb
>> [   58.469768] kworker/u:0     S 0000000000000004     0     5      2 0x00000000
>> [   58.469773]  ffff88082fcffe80 0000000000000046 ffff880800000000 ffff88082fcfe010
>> [   58.469777]  ffff88082fcea080 0000000000011440 ffff88082fcfffd8 ffff88082fcfffd8
>> [   58.469781]  0000000000011440 0000000000000000 ffff88082fd4e9a0 ffff88082fcea080
>> [   58.469785] Call Trace:
>> [   58.469786] <Same stack as pid 4>
>> [   58.470235] kworker/0:1     S 0000000000000000     0    13      2 0x00000000
>> [   58.470255]  ffff88082fd3fe80 0000000000000046 ffff880800000000 ffff88082fd3e010
>> [   58.470279]  ffff88082fcee180 0000000000011440 ffff88082fd3ffd8 ffff88082fd3ffd8
>> [   58.470301]  0000000000011440 0000000000000000 ffffffff8180b020 ffff88082fcee180
>> [   58.470325] Call Trace:
>> [   58.470332] <Same stack as pid 4>
>
> That looks good to me.  Not only does it save space, it also makes the
> human processing of these traces more efficient.
>
> Are these pids unique?  What happens if I have a pid 4 in two pid
> namespaces?

I know that we might have different process sharing the same PID
within different namespace. How that is handled on the original stack
trace w/o the dedup? Hmm, I need to look closely into the pid
namespace.

If that's a problem then we could use the task_struct* as
> a key or something.  Perhaps add a new "stack trace number" field to
> each trace and increment/display that as the dump proceeds.


>>
>> ...
>>
>>  void
>>  show_trace_log_lvl(struct task_struct *task, struct pt_regs *regs,
>> -             unsigned long *stack, unsigned long bp, char *log_lvl)
>> +             unsigned long *stack, unsigned long bp, char *log_lvl,
>> +             int index)
>
> The `index' arg is a bit mysterious, especially as it has such a bland name.

>

 Please document it somewhere (perhaps here).  Include a description of
> the magical value 0.

ok, will make better documentation.

>
>>  {
>> -     printk("%sCall Trace:\n", log_lvl);
>> -     dump_trace(task, regs, stack, bp, &print_trace_ops, log_lvl);
>> +     if (index) {
>> +             printk("%sCall Trace:\n", log_lvl);
>> +             printk("<Same stack as pid %d>\n\n", index);
>
> So it's a pid.  Perhaps it should have type pid_t and have "pid" in its
> name.

will include the change.
>
>> +     } else {
>> +             printk("%sCall Trace:\n", log_lvl);
>> +             dump_trace(task, regs, stack, bp, &print_trace_ops, log_lvl);
>> +     }
>>  }
>>
>>
>> ...
>>
>> @@ -94,6 +95,117 @@ void save_stack_trace_tsk(struct task_struct *tsk, struct stack_trace *trace)
>>  }
>>  EXPORT_SYMBOL_GPL(save_stack_trace_tsk);
>
> Some nice comments describing what we're doing in this file would be good.

ok, will add comments.

>
> It's regrettable that this code is available only on x86.  Fixable?

Hmm, i can take a look on other architectures. Not sure how much
changes are involved. I might go ahead send out the next patch w/ x86
only and other arch support comes with separate patch.

>
>> +#define DEDUP_MAX_STACK_DEPTH 32
>> +#define DEDUP_STACK_HASH 32768
>> +#define DEDUP_STACK_ENTRY (DEDUP_STACK_HASH/sizeof(struct task_stack) - 1)
>> +
>> +struct task_stack {
>> +     pid_t pid;
>> +     unsigned long entries[DEDUP_MAX_STACK_DEPTH];
>> +};
>> +
>> +struct task_stack *stack_hash_table;
>> +static struct task_stack *cur_stack;
>> +__cacheline_aligned_in_smp DEFINE_SPINLOCK(stack_hash_lock);
>> +
>> +void __init stack_trace_hash_init(void)
>> +{
>> +     stack_hash_table = vmalloc(DEDUP_STACK_HASH);
>> +     cur_stack = stack_hash_table + DEDUP_STACK_ENTRY;
>> +}
>
> Why vmalloc?
>
> Why not allocate it at compile time?

Hmm, sounds good to me. I will make the change.

>
>> +void stack_trace_hash_clean(void)
>> +{
>> +     memset(stack_hash_table, 0, DEDUP_STACK_HASH);
>> +}
>> +
>> +static inline u32 task_stack_hash(struct task_stack *stack, int len)
>> +{
>> +     u32 index = jhash(stack->entries, len * sizeof(unsigned long), 0);
>> +
>> +     return index;
>> +}
>> +
>> +static unsigned int stack_trace_lookup(int len)
>> +{
>> +     int j;
>> +     int index = 0;
>> +     unsigned int ret = 0;
>> +     struct task_stack *stack;
>> +
>> +     index = task_stack_hash(cur_stack, len) % DEDUP_STACK_ENTRY;
>> +
>> +     for (j = 0; j < 10; j++) {
>> +             stack = stack_hash_table + (index + (1 << j)) %
>> +                                             DEDUP_STACK_ENTRY;
>> +             if (stack->entries[0] == 0x0) {
>
> Good place for a comment describing why we got here.

Ok.
>
>> +                     memcpy(stack, cur_stack, sizeof(*cur_stack));
>> +                     ret = 0;
>> +                     break;
>> +             } else {
>
> Ditto.
>
>> +                     if (memcmp(stack->entries, cur_stack->entries,
>> +                                             sizeof(stack->entries)) == 0) {
>> +                             ret = stack->pid;
>> +                             break;
>> +                     }
>> +             }
>> +     }
>> +     memset(cur_stack, 0, sizeof(struct task_stack));
>> +
>> +     return ret;
>> +}
>
> Using memcmp() is pretty weak - the elimination of duplicates would
> work better if this code was integrated with the stack unwinding
> machinery, so we're not comparing random garbage non-return-address
> stack slots.

I can look into that.

>
>>
>> ...
>>
>> --- a/kernel/sched.c
>> +++ b/kernel/sched.c
>> @@ -5727,10 +5727,11 @@ out_unlock:
>>
>>  static const char stat_nam[] = TASK_STATE_TO_CHAR_STR;
>>
>> -void sched_show_task(struct task_struct *p)
>> +void _sched_show_task(struct task_struct *p, int dedup)
>>  {
>>       unsigned long free = 0;
>>       unsigned state;
>> +     int index = 0;
>>
>>       state = p->state ? __ffs(p->state) + 1 : 0;
>>       printk(KERN_INFO "%-15.15s %c", p->comm,
>> @@ -5753,7 +5754,19 @@ void sched_show_task(struct task_struct *p)
>>               task_pid_nr(p), task_pid_nr(p->real_parent),
>>               (unsigned long)task_thread_info(p)->flags);
>>
>> -     show_stack(p, NULL);
>> +     if (dedup && stack_hash_table)
>> +             index = save_dup_stack_trace(p);
>> +     show_stack(p, NULL, index);
>> +}
>> +
>> +void sched_show_task(struct task_struct *p)
>> +{
>> +     _sched_show_task(p, 0);
>> +}
>> +
>> +void sched_show_task_dedup(struct task_struct *p)
>> +{
>> +     _sched_show_task(p, 1);
>>  }
>
> stack_hash_table only exists on x86.  Did everything else just get broken?

I will look into that.
>
>

Thank you

--Ying

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

     prev parent reply	other threads:[~2011-06-15  5:25 UTC|newest]

Thread overview: 3+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-06-03  5:32 [PATCH resend V2] Eliminate task stack trace duplication Ying Han
2011-06-14 21:23 ` Andrew Morton
2011-06-15  5:25   ` Ying Han [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='BANLkTi=c6YUHVJtdPZs3prXMqQtrsjsCvg@mail.gmail.com' \
    --to=yinghan@google.com \
    --cc=akpm@linux-foundation.org \
    --cc=balbir@linux.vnet.ibm.com \
    --cc=cl@linux.com \
    --cc=dave@linux.vnet.ibm.com \
    --cc=hannes@cmpxchg.org \
    --cc=hughd@google.com \
    --cc=kamezawa.hiroyu@jp.fujitsu.com \
    --cc=kosaki.motohiro@jp.fujitsu.com \
    --cc=linux-mm@kvack.org \
    --cc=lizf@cn.fujitsu.com \
    --cc=mel@csn.ul.ie \
    --cc=mhocko@suse.cz \
    --cc=minchan.kim@gmail.com \
    --cc=nishimura@mxp.nes.nec.co.jp \
    --cc=riel@redhat.com \
    --cc=tj@kernel.org \
    --cc=xemul@openvz.org \
    --cc=zhu.yanhai@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).