Re: [PATCH v3] tracing: Function stack size and its name mismatch in arm64

From: Joel Fernandes <joel@joelfernandes.org>
To: Steven Rostedt <rostedt@goodmis.org>
Cc: Jiping Ma <jiping.ma2@windriver.com>,
	catalin.marinas@arm.com, will.deacon@arm.com,
	linux-kernel@vger.kernel.org, mingo@redhat.com,
	linux-arm-kernel@lists.infradead.org
Subject: Re: [PATCH v3] tracing: Function stack size and its name mismatch in arm64
Date: Tue, 6 Aug 2019 13:25:19 -0400	[thread overview]
Message-ID: <20190806172519.GD39951@google.com> (raw)
In-Reply-To: <20190806123455.487ac02b@gandalf.local.home>

On Tue, Aug 06, 2019 at 12:34:55PM -0400, Steven Rostedt wrote:
> On Tue, 6 Aug 2019 11:48:11 -0400
> Joel Fernandes <joel@joelfernandes.org> wrote:
> 
> 
> > > diff --git a/arch/arm64/include/asm/ftrace.h b/arch/arm64/include/asm/ftrace.h
> > > index 5ab5200b2bdc..13a4832cfb00 100644
> > > --- a/arch/arm64/include/asm/ftrace.h
> > > +++ b/arch/arm64/include/asm/ftrace.h
> > > @@ -13,6 +13,7 @@
> > >  #define HAVE_FUNCTION_GRAPH_FP_TEST
> > >  #define MCOUNT_ADDR		((unsigned long)_mcount)
> > >  #define MCOUNT_INSN_SIZE	AARCH64_INSN_SIZE
> > > +#define ARCH_RET_ADDR_AFTER_LOCAL_VARS 1
> > >  
> > >  #ifndef __ASSEMBLY__
> > >  #include <linux/compat.h>
> > > diff --git a/kernel/trace/trace_stack.c b/kernel/trace/trace_stack.c
> > > index 5d16f73898db..050c6bd9beac 100644
> > > --- a/kernel/trace/trace_stack.c
> > > +++ b/kernel/trace/trace_stack.c
> > > @@ -158,6 +158,18 @@ static void check_stack(unsigned long ip, unsigned long *stack)
> > >  			i++;
> > >  	}
> > >  
> > > +#ifdef ARCH_RET_ADDR_AFTER_LOCAL_VARS
> > > +	/*
> > > +	 * Most archs store the return address before storing the
> > > +	 * function's local variables. But some archs do this backwards.
> > > +	 */
> > > +	if (x > 1) {
> > > +		memmove(&stack_trace_index[0], &stack_trace_index[1],
> > > +			sizeof(stack_trace_index[0]) * (x - 1));
> > > +		x--;
> > > +	}
> > > +#endif
> > > +
> > >  	stack_trace_nr_entries = x;
> > >  
> > >  	if (task_stack_end_corrupted(current)) {  
> > 
> > 
> > I am not fully understanding the fix :(. If the positions of the data and
> > FP/LR are swapped, then there should be a loop of some sort where the FP/LR
> > are copied repeatedly to undo the mess we are discussing. But in this patch
> > I see only one copy happening. May be I just don't understand this code well
> > enough. Are there any more clues for helping understand the fix?
> 
> Here's the best way to explain this. The code is using the stack trace
> to figure out which function is the stack hog. Or perhaps a serious of
> stack hogs. On x86, a call stores the return address as it calls the
> next function. Then that function allocates its stack frame for its
> local variables and saving of registers.

This makes perfect sense, (probably also makes sense to push this whole
explanation into either the changelog or the kernel documentation)

Thanks a lot, Steve!

 - Joel

> on x86:
> 
> [ top of stack ]
>  0: sys call entry frame
> 10: return addr to entry code
> 11: start of sys_foo frame
> 20: return addr to sys_foo
> 21: start of kernel_func_bar frame
> 30: return addr to kernel_func_bar
> 31: [ do trace stack here ]
> 
> 
> Then we do a save_stack_trace which returns the addresses of the
> functions it finds. Which would be (from the bottom of the stack to the
> top)
> 
>   return addr to kernel_func_bar
>   return addr to sys_foo
>   return addr to entry code
> 
> What we do here is try to figure out how much stack each of theses
> functions have. So we loop through the stack looking for the addresses
> returned by the save_stack trace, and see where on the stack this is.
> This gives us:
> 
>   return addr to kernel_func_bar [ 30 ]
>   return addr to sys_foo         [ 20 ]
>   return addr to entry frame     [ 10 ]
> 
> From this, we can conclude (on x86) that the size of the stack used for
> kernel_func_bar is 30 - 20 = 10. Because on the stack we have:
> 
> 20: return addr to sys_foo
> 21: start of kernel_func_bar frame  <<-- kernel_func_bar stack frame
> 30: return addr to kernel_func_bar
> 
> 
> Now, what Jiping reported, is that on arm64, it saves the link register
> (the return address) when it is needed to, which is after the stack
> frame for the current function has been saved. That means we have
> something that looks like this:
> 
> on arm64:
> 
> [ top of stack ]
>  0: sys call entry frame
> 10: start of sys_foo_frame
> 19: return addr to entry code << lr saved before calling kern_func_bar
> 20: start of kernel_func_bar frame
> 29: return addr to sys_foo_frame << lr saved before calling next function
> 30: [ do trace stack here ]
> 
> Now, I have a question. To call the mcount code (ftrace and the stack
> tracing), you need to save the return address of kern_func_bar
> somewhere, otherwise the call to mcount will overwrite the lr. But
> let's say it does and then forgets it, so we have:
> 
> 30: return addr of kernel_func_bar frame
> 31: [ do trace stack here ]
> 
> Now save_stack_trace gives us the same result:
> 
>  return addr to kernel_func bar
>  return addr to sys_foo
>  return addr to entry frame
> 
> But we get a different result when finding them in the location of the
> stack.
> 
>  return addr to kernel_func_bar [ 30 ]
>  return addr to sys foo         [ 29 ]
>  return addr to entry frame     [ 19 ]
> 
> The simple subtractions will be off:
> 
> kernel_func_bar stack size = 30 - 29 = 1
> Or even, sys_foo 29 - 19 = 10, but if we look at the stack:
> 
> 10: start of sys_foo_frame
> 19: return addr to entry_code
> 20: start of kernel_func_bar frame
> 29: return addr to sys_foo
> 
> We are measuring the kernel_func_bar frame for sys_foo!
> 
> We are off by one here.
> 
> stack_trace_index[] is an array of the offsets mapping to the function
> return addresses found. If we shift it by one, then we then sync the
> functions found with their frames:
> 
> stack_trace_index[0] = 30
> stack_trace_index[1] = 29
> stack_trace_index[2] = 19
> 
> 		memmove((&stack_trace_index[0], &stack_trace_index[1],
> 			sizeof(stack_trace_index[0]) * (x - 1));
> 
> Makes that:
> 
> stack_trace_index[0] = 29
> stack_trace_index[1] = 19
> 
> And we do x-- to lose the last frame.
> 
> With the stack_dump_trace being:
> 
> stack_dump_trace[0] = return addr kernel_func_bar
> stack_dump_trace[1] = return addr sys_foo
> 
> we then match which frame size belongs to which function better.
> 
> 
> > 
> > Also, this stack trace loop (original code) is a bit hairy :) It appears
> > there is a call to stack_trace_save() followed by another loop that goes
> > through the returned entries from there and tries to generate a set of
> > indexes. Isn't the real issue that the entries returned by stack_trace_save()
> > are a out of whack? I am curious also if other users of stack_trace_save()
> > will suffer from the same issue.
> 
> No, the order is fine. The issue is that we are using the location of
> the return address in the stack to find out what function has the
> biggest stack usage, and our assumption for arm64 is incorrect in that
> location.
> 
> -- Steve

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel