* [RFC] New codectl(2) system call for sframe registration
From: Mathieu Desnoyers @ 2025-07-21 15:20 UTC
To: rostedt
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Josh Poimboeuf, Peter Zijlstra, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton, Jens Axboe, Florian Weimer, Sam James,
Brian Robbins, Elena Zannoni
Hi!
I've written up an RFC for a new system call to handle sframe registration
for shared libraries. There has been interest in covering sframe in the
short term and JIT use-cases in the long term, so I'm covering both in
this RFC to provide the full context. Implementation-wise, we could start
by covering only the sframe use-case.
I've called it "codectl(2)" for now, but I'm of course open to feedback.
For ELF, I'm including the optional pathname, build id, and debug link
information which are really useful to translate from instruction pointers
to executable/library name, symbol, offset, source file, line number.
This is what we are using in LTTng-UST and in the Babeltrace debug-info
filter plugin [1], and I think this would be relevant for kernel tracers
as well so they can make the resulting stack traces meaningful to users.
sys_codectl(2)
=================
* arg0: unsigned int @option:
/* Additional labels can be added to enum code_opt, for extensibility. */
enum code_opt {
CODE_REGISTER_ELF,
CODE_REGISTER_JIT,
CODE_UNREGISTER,
};
* arg1: void * @info
/* if (@option == CODE_REGISTER_ELF) */
/*
* text_start, text_end, sframe_start, sframe_end allow unwinding of the
* call stack.
*
* elf_start, elf_end, pathname, and either build_id or debug_link allow
* mapping instruction pointers to file, symbol, offset, and source file
* location.
*/
struct code_elf_info {
__u64 elf_start;
__u64 elf_end;
__u64 text_start;
__u64 text_end;
__u64 sframe_start;
__u64 sframe_end;
__u64 pathname; /* char *, NULL if unavailable. */
__u64 build_id; /* char *, NULL if unavailable. */
__u64 debug_link_pathname; /* char *, NULL if unavailable. */
__u32 build_id_len;
__u32 debug_link_crc;
};
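As a rough illustration of how a dynamic loader might fill this structure,
here is a minimal userspace sketch. It is not part of the proposal:
uint64_t/uint32_t stand in for __u64/__u32, and code_elf_info_init() and
code_elf_info_sane() are hypothetical helpers. The containment checks are
an assumption; the RFC does not state that the text range must lie within
[elf_start, elf_end).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Mirror of the proposed layout; fixed-width types stand in for __u64/__u32. */
struct code_elf_info {
	uint64_t elf_start;
	uint64_t elf_end;
	uint64_t text_start;
	uint64_t text_end;
	uint64_t sframe_start;
	uint64_t sframe_end;
	uint64_t pathname;		/* char *, 0 if not provided */
	uint64_t build_id;		/* char *, 0 if not provided */
	uint64_t debug_link_pathname;	/* char *, 0 if not provided */
	uint32_t build_id_len;
	uint32_t debug_link_crc;
};

/* Hypothetical helper: fill the mandatory ranges, leave optional fields 0. */
static void code_elf_info_init(struct code_elf_info *info,
			       uint64_t elf_start, uint64_t elf_end,
			       uint64_t text_start, uint64_t text_end,
			       uint64_t sframe_start, uint64_t sframe_end,
			       const char *pathname)
{
	assert(info);
	memset(info, 0, sizeof(*info));
	info->elf_start = elf_start;
	info->elf_end = elf_end;
	info->text_start = text_start;
	info->text_end = text_end;
	info->sframe_start = sframe_start;
	info->sframe_end = sframe_end;
	info->pathname = (uint64_t)(uintptr_t)pathname;
}

/* Hypothetical sanity check (assumed rules, not stated in the RFC). */
static bool code_elf_info_sane(const struct code_elf_info *info)
{
	return info->elf_start < info->elf_end &&
	       info->text_start < info->text_end &&
	       info->text_start >= info->elf_start &&
	       info->text_end <= info->elf_end &&
	       info->sframe_start <= info->sframe_end;
}
```

With this layout the structure is 80 bytes, which is the value a caller
would pass as info_size.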
/* if (@option == CODE_REGISTER_JIT) */
/*
* Registration of sorted JIT unwind table: The reserved memory area is
* of size reserved_len. Userspace increases used_len as new code is
* populated between text_start and text_end. This area is populated in
* increasing address order, and its ABI requires that no fre entries
* overlap. This fits the common use-case where JITs populate code into
* a given memory area by increasing address order. The sorted unwind
* tables can be chained with a singly-linked list as they become full.
* Consecutive chained tables are also in sorted text address order.
*
* Note: if there is an eventual use-case for unsorted jit unwind table,
* this would be introduced as a new "code option".
*/
struct code_jit_info {
__u64 text_start; /* text_start <= addr */
__u64 text_end; /* addr < text_end */
__u64 unwind_head; /* struct code_jit_unwind_table * */
};
struct code_jit_unwind_fre {
/*
* Contains info similar to sframe, allowing unwind for a given
* code address range.
*/
__u32 size;
__u32 ip_off; /* offset from text_start */
__s32 cfa_off;
__s32 ra_off;
__s32 fp_off;
__u8 info;
};
struct code_jit_unwind_table {
__u64 reserved_len;
__u64 used_len; /*
* Incremented by userspace (store-release), read by
* the kernel (load-acquire).
*/
__u64 next; /* Chain with next struct code_jit_unwind_table. */
struct code_jit_unwind_fre fre[];
};
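The publication protocol on used_len can be sketched with C11 atomics.
This is only an illustration under stated assumptions: both sides run in
userspace here (in the proposal the reader is the kernel), reserved_len and
used_len are assumed to count bytes of fre[] storage, "size" is assumed to
be the length of the covered code range, and jit_table_alloc(),
jit_table_append() and jit_table_lookup() are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct code_jit_unwind_fre {
	uint32_t size;		/* assumed: length of the covered code range */
	uint32_t ip_off;	/* offset from text_start */
	int32_t cfa_off;
	int32_t ra_off;
	int32_t fp_off;
	uint8_t info;
};

struct code_jit_unwind_table {
	uint64_t reserved_len;		/* assumed: bytes available in fre[] */
	_Atomic uint64_t used_len;	/* assumed: bytes of fre[] published */
	uint64_t next;
	struct code_jit_unwind_fre fre[];
};

static struct code_jit_unwind_table *jit_table_alloc(size_t nr_fre)
{
	struct code_jit_unwind_table *t;

	t = calloc(1, sizeof(*t) + nr_fre * sizeof(t->fre[0]));
	if (t)
		t->reserved_len = nr_fre * sizeof(t->fre[0]);
	return t;
}

/* Writer: fill the entry first, then publish it with a store-release. */
static int jit_table_append(struct code_jit_unwind_table *t,
			    const struct code_jit_unwind_fre *fre)
{
	uint64_t used;

	assert(t && fre);
	used = atomic_load_explicit(&t->used_len, memory_order_relaxed);
	if (used + sizeof(*fre) > t->reserved_len)
		return -1;	/* full: the caller would chain a new table */
	t->fre[used / sizeof(*fre)] = *fre;
	atomic_store_explicit(&t->used_len, used + sizeof(*fre),
			      memory_order_release);
	return 0;
}

/* Reader: load-acquire used_len, then binary-search the sorted entries. */
static struct code_jit_unwind_fre *
jit_table_lookup(struct code_jit_unwind_table *t, uint32_t ip_off)
{
	uint64_t used = atomic_load_explicit(&t->used_len,
					     memory_order_acquire);
	size_t lo = 0, hi = used / sizeof(t->fre[0]);

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		struct code_jit_unwind_fre *e = &t->fre[mid];

		if (ip_off < e->ip_off)
			hi = mid;
		else if (ip_off - e->ip_off >= e->size)
			lo = mid + 1;
		else
			return e;
	}
	return NULL;
}
```

Because entries are sorted and non-overlapping, the reader only ever
inspects entries below the used_len value it observed, so it never sees a
partially written entry.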
/* if (@option == CODE_UNREGISTER) */
void *info
* arg2: size_t info_size
/*
* Size of @info structure, allowing extensibility. See
* copy_struct_from_user().
*/
* arg3: unsigned int flags (0)
/* Flags for extensibility. */
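For readers unfamiliar with the info_size convention, the following
userspace sketch emulates the documented behaviour of
copy_struct_from_user(): a shorter caller structure is zero-extended, and a
longer one is accepted only if its unknown tail is all zeroes. The function
copy_struct_sketch() and the info_v1/info_v2 structures are hypothetical
and exist only for this illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Userspace emulation of the size-extensible copy: ksize is what the
 * "kernel" side knows about, usize is what the caller passed as info_size.
 */
static int copy_struct_sketch(void *dst, size_t ksize,
			      const void *src, size_t usize)
{
	size_t size = usize < ksize ? usize : ksize;
	const unsigned char *bytes = src;

	assert(dst && src);
	/* Newer caller: any field we do not know about must be zero. */
	for (size_t i = size; i < usize; i++)
		if (bytes[i])
			return -E2BIG;
	/* Older caller: fields it does not know about read as zero. */
	memset(dst, 0, ksize);
	memcpy(dst, src, size);
	return 0;
}

/* Hypothetical first and extended revisions of an info structure. */
struct info_v1 {
	uint64_t start;
};

struct info_v2 {
	uint64_t start;
	uint64_t extra;
};
```

This is why new fields can only be appended, and why zero must always mean
"not provided" for any field added later.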
Your feedback is welcome,
Thanks,
Mathieu
[1] https://babeltrace.org/docs/v2.0/man7/babeltrace2-filter.lttng-utils.debug-info.7/
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [RFC] New codectl(2) system call for sframe registration
From: Steven Rostedt @ 2025-07-21 18:53 UTC
To: Mathieu Desnoyers
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Josh Poimboeuf, Peter Zijlstra, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton, Jens Axboe, Florian Weimer, Sam James,
Brian Robbins, Elena Zannoni
On Mon, 21 Jul 2025 11:20:34 -0400
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> Hi!
>
> I've written up an RFC for a new system call to handle sframe registration
> for shared libraries. There has been interest to cover both sframe in
> the short term, but also JIT use-cases in the long term, so I'm
> covering both here in this RFC to provide the full context. Implementation
> wise we could start by only covering the sframe use-case.
>
> I've called it "codectl(2)" for now, but I'm of course open to feedback.
Hmm, I guess I'm OK with that name. I can't really think of anything that
would be better. But kernel developers are notorious for sucking at coming
up with decent names ;-)
>
> For ELF, I'm including the optional pathname, build id, and debug link
> information which are really useful to translate from instruction pointers
> to executable/library name, symbol, offset, source file, line number.
> This is what we are using in LTTng-UST and Babeltrace debug-info filter
> plugin [1], and I think this would be relevant for kernel tracers as well
> so they can make the resulting stack traces meaningful to users.
Honestly, I'm not sure it needs to be an ELF file. Just a file that has an
sframe section in it.
>
> sys_codectl(2)
> =================
>
> * arg0: unsigned int @option:
>
> /* Additional labels can be added to enum code_opt, for extensibility. */
>
> enum code_opt {
> CODE_REGISTER_ELF,
Perhaps the above should be: CODE_REGISTER_SFRAME,
as currently SFrame is read only via files.
> CODE_REGISTER_JIT,
From our other conversations, JIT will likely be a completely different
format than SFRAME, so calling it just JIT should be fine.
> CODE_UNREGISTER,
I wonder if this should be the first enum. That is, "0" is to unregister.
That way, all non-zero options will be for what is being registered, and
"0" is for unregistering any of them.
> };
>
> * arg1: void * @info
>
> /* if (@option == CODE_REGISTER_ELF) */
>
> /*
> * text_start, text_end, sframe_start, sframe_end allow unwinding of the
> * call stack.
> *
> * elf_start, elf_end, pathname, and either build_id or debug_link allows
> * mapping instruction pointers to file, symbol, offset, and source file
> * location.
> */
> struct code_elf_info {
> : __u64 elf_start;
> __u64 elf_end;
Perhaps:
__u64 file_start;
__u64 file_end;
?
And call it "struct code_sframe_info"
> __u64 text_start;
> __u64 text_end;
> __u64 sframe_start;
> __u64 sframe_end;
What is the above "sframe" for?
> __u64 pathname; /* char *, NULL if unavailable. */
>
> __u64 build_id; /* char *, NULL if unavailable. */
> __u64 debug_link_pathname; /* char *, NULL if unavailable. */
Maybe just list the above three as "optional" ?
It may be available, but the implementer just doesn't want to implement it.
> __u32 build_id_len;
> __u32 debug_link_crc;
> };
>
>
> /* if (@option == CODE_REGISTER_JIT) */
>
> /*
> * Registration of sorted JIT unwind table: The reserved memory area is
> * of size reserved_len. Userspace increases used_len as new code is
> * populated between text_start and text_end. This area is populated in
> * increasing address order, and its ABI requires to have no overlapping
> * fre. This fits the common use-case where JITs populate code into
> * a given memory area by increasing address order. The sorted unwind
> * tables can be chained with a singly-linked list as they become full.
> * Consecutive chained tables are also in sorted text address order.
> *
> * Note: if there is an eventual use-case for unsorted jit unwind table,
> * this would be introduced as a new "code option".
> */
>
> struct code_jit_info {
> __u64 text_start; /* text_start >= addr */
> __u64 text_end; /* addr < text_end */
> __u64 unwind_head; /* struct code_jit_unwind_table * */
> };
>
> struct code_jit_unwind_fre {
> /*
> * Contains info similar to sframe, allowing unwind for a given
> * code address range.
> */
> __u32 size;
> __u32 ip_off; /* offset from text_start */
> __s32 cfa_off;
> __s32 ra_off;
> __s32 fp_off;
> __u8 info;
> };
>
> struct code_jit_unwind_table {
> __u64 reserved_len;
> __u64 used_len; /*
> * Incremented by userspace (store-release), read by
> * the kernel (load-acquire).
> */
> __u64 next; /* Chain with next struct code_jit_unwind_table. */
> struct code_jit_unwind_fre fre[];
> };
I wonder if we should avoid the "jit" portion completely for now until we
know what exactly we need.
Thanks,
-- Steve
* Re: [RFC] New codectl(2) system call for sframe registration
From: Mathieu Desnoyers @ 2025-07-21 20:58 UTC
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Josh Poimboeuf, Peter Zijlstra, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton, Jens Axboe, Florian Weimer, Sam James,
Brian Robbins, Elena Zannoni
On 2025-07-21 14:53, Steven Rostedt wrote:
> On Mon, 21 Jul 2025 11:20:34 -0400
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>
>> Hi!
>>
>> I've written up an RFC for a new system call to handle sframe registration
>> for shared libraries. There has been interest to cover both sframe in
>> the short term, but also JIT use-cases in the long term, so I'm
>> covering both here in this RFC to provide the full context. Implementation
>> wise we could start by only covering the sframe use-case.
>>
>> I've called it "codectl(2)" for now, but I'm of course open to feedback.
>
> Hmm, I guess I'm OK with that name. I can't really think of anything that
> would be better. But kernel developers are notorious for sucking at coming
> up with decent names ;-)
I agree wholeheartedly. ;)
>
>>
>> For ELF, I'm including the optional pathname, build id, and debug link
>> information which are really useful to translate from instruction pointers
>> to executable/library name, symbol, offset, source file, line number.
>> This is what we are using in LTTng-UST and Babeltrace debug-info filter
>> plugin [1], and I think this would be relevant for kernel tracers as well
>> so they can make the resulting stack traces meaningful to users.
>
> Honestly, I'm not sure it needs to be an ELF file. Just a file that has an
> sframe section in it.
Indu told me on IRC that for GNU/Linux, SFrame will be an
allocated,loaded section in elf files.
I'm planning to add optional fields (build id, debug link) that are
ELF-specific. I therefore think it's best that we keep this specific as
registration of an elf file.
If there are other file types in the future that happen to contain an
sframe section (but are not ELF), then we can simply add a new label to
enum code_opt.
>
>>
>> sys_codectl(2)
>> =================
>>
>> * arg0: unsigned int @option:
>>
>> /* Additional labels can be added to enum code_opt, for extensibility. */
>>
>> enum code_opt {
>> CODE_REGISTER_ELF,
>
> Perhaps the above should be: CODE_REGISTER_SFRAME,
>
> as currently SFrame is read only via files.
As I pointed out above, on GNU/Linux, sframe is always an allocated,loaded
ELF section. AFAIU, your comment implies that we'd want to support other scenarios
where the sframe is in files outside of elf binary sframe sections. Can you
expand on the use-case you have for this, or is it just for future-proofing ?
>
>> CODE_REGISTER_JIT,
>
> From our other conversations, JIT will likely be a completely different
> format than SFRAME, so calling it just JIT should be fine.
OK
>
>
>> CODE_UNREGISTER,
>
> I wonder if this should be the first enum. That is, "0" is to unregister.
>
> That way, all non-zero options will be for what is being registered, and
> "0" is for unregistering any of them.
Good idea, I'll do that.
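A sketch of what the reordered enum could look like, assuming the label
names stay as in the RFC. The static_assert and the
code_opt_is_register() helper are illustrative additions, not part of the
proposal.

```c
#include <assert.h>
#include <stdbool.h>

enum code_opt {
	CODE_UNREGISTER = 0,	/* 0 unregisters any kind of registration */
	CODE_REGISTER_ELF,
	CODE_REGISTER_JIT,
};

/* Pin the ABI value so that reordering the enum later breaks the build. */
static_assert(CODE_UNREGISTER == 0, "CODE_UNREGISTER must stay 0");

/* Hypothetical helper: true for every known registration option. */
static bool code_opt_is_register(unsigned int option)
{
	return option == CODE_REGISTER_ELF || option == CODE_REGISTER_JIT;
}
```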
>
>
>> };
>>
>> * arg1: void * @info
>>
>> /* if (@option == CODE_REGISTER_ELF) */
>>
>> /*
>> * text_start, text_end, sframe_start, sframe_end allow unwinding of the
>> * call stack.
>> *
>> * elf_start, elf_end, pathname, and either build_id or debug_link allows
>> * mapping instruction pointers to file, symbol, offset, and source file
>> * location.
>> */
>> struct code_elf_info {
>> : __u64 elf_start;
>> __u64 elf_end;
>
> Perhaps:
>
> __u64 file_start;
> __u64 file_end;
>
> ?
>
> And call it "struct code_sframe_info"
>
>> __u64 text_start;
>> __u64 text_end;
>
>> __u64 sframe_start;
>> __u64 sframe_end;
>
> What is the above "sframe" for?
>
>> __u64 pathname; /* char *, NULL if unavailable. */
>>
>> __u64 build_id; /* char *, NULL if unavailable. */
>> __u64 debug_link_pathname; /* char *, NULL if unavailable. */
>
> Maybe just list the above three as "optional" ?
This is what I had in mind with "NULL if unavailable", but I can clarify
them as being "optional" in the comment.
Do you envision that the sizeof(struct code_elf_info) could be smaller
and not include the optional fields, or just specifying them as NULL if
unavailable is enough ?
>
> It may be available, but the implementer just doesn't want to implement it.
>
>> __u32 build_id_len;
>> __u32 debug_link_crc;
>> };
>>
>>
>> /* if (@option == CODE_REGISTER_JIT) */
>>
>> /*
>> * Registration of sorted JIT unwind table: The reserved memory area is
>> * of size reserved_len. Userspace increases used_len as new code is
>> * populated between text_start and text_end. This area is populated in
>> * increasing address order, and its ABI requires to have no overlapping
>> * fre. This fits the common use-case where JITs populate code into
>> * a given memory area by increasing address order. The sorted unwind
>> * tables can be chained with a singly-linked list as they become full.
>> * Consecutive chained tables are also in sorted text address order.
>> *
>> * Note: if there is an eventual use-case for unsorted jit unwind table,
>> * this would be introduced as a new "code option".
>> */
>>
>> struct code_jit_info {
>> __u64 text_start; /* text_start >= addr */
>> __u64 text_end; /* addr < text_end */
>> __u64 unwind_head; /* struct code_jit_unwind_table * */
>> };
>>
>> struct code_jit_unwind_fre {
>> /*
>> * Contains info similar to sframe, allowing unwind for a given
>> * code address range.
>> */
>> __u32 size;
>> __u32 ip_off; /* offset from text_start */
>> __s32 cfa_off;
>> __s32 ra_off;
>> __s32 fp_off;
>> __u8 info;
>> };
>>
>> struct code_jit_unwind_table {
>> __u64 reserved_len;
>> __u64 used_len; /*
>> * Incremented by userspace (store-release), read by
>> * the kernel (load-acquire).
>> */
>> __u64 next; /* Chain with next struct code_jit_unwind_table. */
>> struct code_jit_unwind_fre fre[];
>> };
>
> I wonder if we should avoid the "jit" portion completely for now until we
> know what exactly we need.
I don't want to spend too much discussion time on the jit portion at this stage,
but I think it's good to keep this in mind so we come up with an ABI that will
naturally extend to cover that use case. I favor keeping the JIT portion in these
discussions but not implementing it initially.
Thanks Steven!
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [RFC] New codectl(2) system call for sframe registration
From: Steven Rostedt @ 2025-07-21 21:15 UTC
To: Mathieu Desnoyers
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Josh Poimboeuf, Peter Zijlstra, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton, Jens Axboe, Florian Weimer, Sam James,
Brian Robbins, Elena Zannoni
On Mon, 21 Jul 2025 16:58:43 -0400
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> > Honestly, I'm not sure it needs to be an ELF file. Just a file that has an
> > sframe section in it.
>
> Indu told me on IRC that for GNU/Linux, SFrame will be an
> allocated,loaded section in elf files.
Yes it is, but is that a requirement for this interface? I just don't want
to add requirements based on how things currently work if they are not
needed.
>
> I'm planning to add optional fields (build id, debug link) that are
> ELF-specific. I therefore think it's best that we keep this specific as
> registration of an elf file.
Here's a hypothetical: what if, for some reason (say, having the sframe
sections outside of the ELF file), the linker shares that?
For instance, if the sframe sections are downloaded separately as a
separate package for a given executable (to make it not mandatory for an
install), the linker could be smart enough to see that they exist in some
special location and then pass that to the kernel. In other words, this
option is specific to sframe and not ELF. I'd rather call it that.
>
> If there are other file types in the future that happen to contain an
> sframe section (but are not ELF), then we can simply add a new label to
> enum code_opt.
>
> >
> >>
> >> sys_codectl(2)
> >> =================
> >>
> >> * arg0: unsigned int @option:
> >>
> >> /* Additional labels can be added to enum code_opt, for extensibility. */
> >>
> >> enum code_opt {
> >> CODE_REGISTER_ELF,
> >
> > Perhaps the above should be: CODE_REGISTER_SFRAME,
> >
> > as currently SFrame is read only via files.
>
> As I pointed out above, on GNU/Linux, sframe is always an allocated,loaded
> ELF section. AFAIU, your comment implies that we'd want to support other scenarios
> where the sframe is in files outside of elf binary sframe sections. Can you
> expand on the use-case you have for this, or is it just for future-proofing ?
Heh, I just did above (before reading this). But yeah, it could be. As I
mentioned above, this is not about ELF files. Sframes just happen to be in
an ELF file. CODE_REGISTER_ELF sounds like this is for doing special
actions to an ELF file, when in reality it is doing special actions to tell
the kernel this is an sframe table. It just happens that sframes are in
ELF. Let's call it for what it is used for.
>
> >
> >> CODE_REGISTER_JIT,
> >
> > From our other conversations, JIT will likely be a completely different
> > format than SFRAME, so calling it just JIT should be fine.
>
> OK
>
> >
> >
> >> CODE_UNREGISTER,
> >
> > I wonder if this should be the first enum. That is, "0" is to unregister.
> >
> > That way, all non-zero options will be for what is being registered, and
> > "0" is for unregistering any of them.
>
> Good idea, I'll do that.
>
> >
> >
> >> };
> >>
> >> * arg1: void * @info
> >>
> >> /* if (@option == CODE_REGISTER_ELF) */
> >>
> >> /*
> >> * text_start, text_end, sframe_start, sframe_end allow unwinding of the
> >> * call stack.
> >> *
> >> * elf_start, elf_end, pathname, and either build_id or debug_link allows
> >> * mapping instruction pointers to file, symbol, offset, and source file
> >> * location.
> >> */
> >> struct code_elf_info {
> >> : __u64 elf_start;
> >> __u64 elf_end;
> >
> > Perhaps:
> >
> > __u64 file_start;
> > __u64 file_end;
> >
> > ?
> >
> > And call it "struct code_sframe_info"
> >
> >> __u64 text_start;
> >> __u64 text_end;
> >
> >> __u64 sframe_start;
> >> __u64 sframe_end;
> >
> > What is the above "sframe" for?
Still wondering what the above is for.
> >
> >> __u64 pathname; /* char *, NULL if unavailable. */
> >>
> >> __u64 build_id; /* char *, NULL if unavailable. */
> >> __u64 debug_link_pathname; /* char *, NULL if unavailable. */
> >
> > Maybe just list the above three as "optional" ?
>
> This is what I had in mind with "NULL if unavailable", but I can clarify
> them as being "optional" in the comment.
>
> Do you envision that the sizeof(struct code_elf_info) could be smaller
> and not include the optional fields, or just specifying them as NULL if
> unavailable is enough ?
Hmm, are we going to allow this structure to expand? Should we give it a
size? Or just state that different options could have different sizes (and
make this more of a union than a structure)?
>
> >
> > It may be available, but the implementer just doesn't want to implement it.
> >
> >> __u32 build_id_len;
> >> __u32 debug_link_crc;
> >> };
> >>
> >>
> >> /* if (@option == CODE_REGISTER_JIT) */
> >>
> >> /*
> >> * Registration of sorted JIT unwind table: The reserved memory area is
> >> * of size reserved_len. Userspace increases used_len as new code is
> >> * populated between text_start and text_end. This area is populated in
> >> * increasing address order, and its ABI requires to have no overlapping
> >> * fre. This fits the common use-case where JITs populate code into
> >> * a given memory area by increasing address order. The sorted unwind
> >> * tables can be chained with a singly-linked list as they become full.
> >> * Consecutive chained tables are also in sorted text address order.
> >> *
> >> * Note: if there is an eventual use-case for unsorted jit unwind table,
> >> * this would be introduced as a new "code option".
> >> */
> >>
> >> struct code_jit_info {
> >> __u64 text_start; /* text_start >= addr */
> >> __u64 text_end; /* addr < text_end */
> >> __u64 unwind_head; /* struct code_jit_unwind_table * */
> >> };
> >>
> >> struct code_jit_unwind_fre {
> >> /*
> >> * Contains info similar to sframe, allowing unwind for a given
> >> * code address range.
> >> */
> >> __u32 size;
> >> __u32 ip_off; /* offset from text_start */
> >> __s32 cfa_off;
> >> __s32 ra_off;
> >> __s32 fp_off;
> >> __u8 info;
> >> };
> >>
> >> struct code_jit_unwind_table {
> >> __u64 reserved_len;
> >> __u64 used_len; /*
> >> * Incremented by userspace (store-release), read by
> >> * the kernel (load-acquire).
> >> */
> >> __u64 next; /* Chain with next struct code_jit_unwind_table. */
> >> struct code_jit_unwind_fre fre[];
> >> };
> >
> > I wonder if we should avoid the "jit" portion completely for now until we
> > know what exactly we need.
>
> I don't want to spend too much discussion time on the jit portion at this stage,
> but I think it's good to keep this in mind so we come up with an ABI that will
> naturally extend to cover that use case. I favor keeping the JIT portion in these
> discussions but not implement it initially.
As long as the structure is flexible enough to handle this. We could even add the
JIT enum, but return -EINVAL (or whatever) if it is used to state that it's
not currently implemented.
-- Steve
* Re: [RFC] New codectl(2) system call for sframe registration
From: Mathieu Desnoyers @ 2025-07-22 13:51 UTC
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Josh Poimboeuf, Peter Zijlstra, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton, Jens Axboe, Florian Weimer, Sam James,
Brian Robbins, Elena Zannoni
On 2025-07-21 17:15, Steven Rostedt wrote:
> On Mon, 21 Jul 2025 16:58:43 -0400
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>
>>> Honestly, I'm not sure it needs to be an ELF file. Just a file that has an
>>> sframe section in it.
>>
>> Indu told me on IRC that for GNU/Linux, SFrame will be an
>> allocated,loaded section in elf files.
>
> Yes it is, but is that a requirement for this interface? I just don't want
> to add requirements based on how thing currently work if they are not
> needed.
The ELF-specific optional fields I am suggesting (pathname, build id,
debug info) are useful for tooling. AFAIR this is what gdb uses to
find the debug info associated with each binary file.
>
>>
>> I'm planning to add optional fields (build id, debug link) that are
>> ELF-specific. I therefore think it's best that we keep this specific as
>> registration of an elf file.
>
> Here's a hypothetical, what if for some reason (say having the sframe
> sections outside of the elf file) that the linker shares that?
So your hypothetical scenario is having sframe provided as a separate
file. This sframe file (or part of it) would still describe how to
unwind a given elf file text range. So I would argue that this would
still fit into the model of CODE_REGISTER_ELF, it's just that the
address range from sframe_start to sframe_end would be mapped from
a different file. This is entirely up to the dynamic loader and should
not impact the kernel ABI.
AFAIK a.out binary support was deprecated in Linux kernel v5.1, so
being ELF-specific is not an issue.
And if for some reason we end up inventing a new model to hand over the
sframe information in the future, for instance if we choose not to map
the sframe information in userspace and hand over a sframe-file pathname
and offset instead, we'll just extend the code_opt enum with a new
label.
>
> For instance, if the sframe sections are downloaded separately as a
> separate package for a given executable (to make it not mandatory for an
> install), the linker could be smart enough to see that they exist in some
> special location and then pass that to the kernel. In other words, this is
> option is specific for sframe and not ELF. I rather call it by that.
As I explained above, if the dynamic loader populates the sframe section
in userspace memory, this fits within the CODE_REGISTER_ELF ABI. If we
eventually choose not to map the sframe section into userspace memory
(even though this is not an envisioned use-case at the moment), we can
just extend enum code_opt with a new label.
>
>>
>> If there are other file types in the future that happen to contain an
>> sframe section (but are not ELF), then we can simply add a new label to
>> enum code_opt.
>>
>>>
>>>>
>>>> sys_codectl(2)
>>>> =================
>>>>
>>>> * arg0: unsigned int @option:
>>>>
>>>> /* Additional labels can be added to enum code_opt, for extensibility. */
>>>>
>>>> enum code_opt {
>>>> CODE_REGISTER_ELF,
>>>
>>> Perhaps the above should be: CODE_REGISTER_SFRAME,
>>>
>>> as currently SFrame is read only via files.
>>
>> As I pointed out above, on GNU/Linux, sframe is always an allocated,loaded
>> ELF section. AFAIU, your comment implies that we'd want to support other scenarios
>> where the sframe is in files outside of elf binary sframe sections. Can you
>> expand on the use-case you have for this, or is it just for future-proofing ?
>
> Heh, I just did above (before reading this). But yeah, it could be. As I
> mentioned above, this is not about ELF files. Sframes just happen to be in
> an ELF file. CODE_REGISTER_ELF sounds like this is for doing special
> actions to an ELF file, when in reality it is doing special actions to tell
> the kernel this is an sframe table. It just happens that sframes are in
> ELF. Let's call it for what it is used for.
I see sframe as one "aspect" of an ELF file. Sure, we could do one
system call for every aspect of an ELF file that we want to register,
but that would require many round trips from userspace to the kernel
every time a library is loaded. In my opinion it makes sense to combine
all aspects of an elf file that we want the kernel to know about into
one registration system call. In that sense, we're not registering just
sframe, but the various aspects of an ELF file, which include sframe.
By the way, the sframe section is optional as well. If we allow
sframe_start and sframe_end to be NULL, this would let libc register
an sframe-less ELF file with its pathname, build-id, and debug info
to the kernel. This would be immediately useful on its own for
distributions that have frame pointers enabled even without sframe
section.
[...]
>>>
>>>> };
>>>>
>>>> * arg1: void * @info
>>>>
>>>> /* if (@option == CODE_REGISTER_ELF) */
>>>>
>>>> /*
>>>> * text_start, text_end, sframe_start, sframe_end allow unwinding of the
>>>> * call stack.
>>>> *
>>>> * elf_start, elf_end, pathname, and either build_id or debug_link allows
>>>> * mapping instruction pointers to file, symbol, offset, and source file
>>>> * location.
>>>> */
>>>> struct code_elf_info {
>>>> : __u64 elf_start;
>>>> __u64 elf_end;
>>>
>>> Perhaps:
>>>
>>> __u64 file_start;
>>> __u64 file_end;
>>>
>>> ?
>>>
>>> And call it "struct code_sframe_info"
>>>
>>>> __u64 text_start;
>>>> __u64 text_end;
>>>
>>>> __u64 sframe_start;
>>>> __u64 sframe_end;
>>>
>>> What is the above "sframe" for?
>
> Still wondering what the above is for.
Well we have an sframe section which is mapped into userspace memory
from sframe_start to sframe_end, which contains the unwind information
that covers the code from text_start to text_end.
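To make the relationship between the two ranges concrete, here is a small
sketch of how an unwinder might use them: the instruction pointer selects a
registration through [text_start, text_end), and that registration names
the sframe mapping to consult. struct code_region, code_region_find() and
code_region_has_sframe() are hypothetical, and treating a zero sframe range
as "no sframe, fall back to frame pointers" is an assumption based on the
optional-sframe idea mentioned earlier in this message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subset of the registration data kept per ELF object. */
struct code_region {
	uint64_t text_start;
	uint64_t text_end;
	uint64_t sframe_start;
	uint64_t sframe_end;
};

/* Find the registration whose text range contains ip (linear scan). */
static const struct code_region *
code_region_find(const struct code_region *regions, size_t nr, uint64_t ip)
{
	for (size_t i = 0; i < nr; i++)
		if (ip >= regions[i].text_start && ip < regions[i].text_end)
			return &regions[i];
	return NULL;
}

/* Assumed convention: a zero or empty sframe range means "no sframe". */
static bool code_region_has_sframe(const struct code_region *r)
{
	assert(r);
	return r->sframe_start != 0 && r->sframe_end > r->sframe_start;
}
```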
Am I unknowingly adding some kind of redundancy here ?
>
>>>
>>>> __u64 pathname; /* char *, NULL if unavailable. */
>>>>
>>>> __u64 build_id; /* char *, NULL if unavailable. */
>>>> __u64 debug_link_pathname; /* char *, NULL if unavailable. */
>>>
>>> Maybe just list the above three as "optional" ?
>>
>> This is what I had in mind with "NULL if unavailable", but I can clarify
>> them as being "optional" in the comment.
>>
>> Do you envision that the sizeof(struct code_elf_info) could be smaller
>> and not include the optional fields, or just specifying them as NULL if
>> unavailable is enough ?
>
> Hmm, are we going to allow this structure to expand? Should we give it a
> size. Or just state that different options could have different sizes (and
> make this more of a union than a structure).
This is extensible. The size of this structure is expected in
arg2: size_t info_size.
Each "@option" label selects which structure is expected, and each of
those structures is extensible, with its size expected as arg2.
>
>>
>>>
>>> It may be available, but the implementer just doesn't want to implement it.
>>>
>>>> __u32 build_id_len;
>>>> __u32 debug_link_crc;
>>>> };
>>>>
>>>>
>>>> /* if (@option == CODE_REGISTER_JIT) */
>>>>
>>>> /*
>>>> * Registration of sorted JIT unwind table: The reserved memory area is
>>>> * of size reserved_len. Userspace increases used_len as new code is
>>>> * populated between text_start and text_end. This area is populated in
>>>> * increasing address order, and its ABI requires to have no overlapping
>>>> * fre. This fits the common use-case where JITs populate code into
>>>> * a given memory area by increasing address order. The sorted unwind
>>>> * tables can be chained with a singly-linked list as they become full.
>>>> * Consecutive chained tables are also in sorted text address order.
>>>> *
>>>> * Note: if there is an eventual use-case for unsorted jit unwind table,
>>>> * this would be introduced as a new "code option".
>>>> */
>>>>
>>>> struct code_jit_info {
>>>> __u64 text_start; /* text_start >= addr */
>>>> __u64 text_end; /* addr < text_end */
>>>> __u64 unwind_head; /* struct code_jit_unwind_table * */
>>>> };
>>>>
>>>> struct code_jit_unwind_fre {
>>>> /*
>>>> * Contains info similar to sframe, allowing unwind for a given
>>>> * code address range.
>>>> */
>>>> __u32 size;
>>>> __u32 ip_off; /* offset from text_start */
>>>> __s32 cfa_off;
>>>> __s32 ra_off;
>>>> __s32 fp_off;
>>>> __u8 info;
>>>> };
>>>>
>>>> struct code_jit_unwind_table {
>>>> __u64 reserved_len;
>>>> __u64 used_len; /*
>>>> * Incremented by userspace (store-release), read by
>>>> * the kernel (load-acquire).
>>>> */
>>>> __u64 next; /* Chain with next struct code_jit_unwind_table. */
>>>> struct code_jit_unwind_fre fre[];
>>>> };
>>>
>>> I wonder if we should avoid the "jit" portion completely for now until we
>>> know what exactly we need.
>>
>> I don't want to spend too much discussion time on the jit portion at this stage,
>> but I think it's good to keep this in mind so we come up with an ABI that will
>> naturally extend to cover that use case. I favor keeping the JIT portion in these
>> discussions but not implement it initially.
>
> As long as the structure is flexible to handle this. We could even add the
> JIT enum, but return -EINVAL (or whatever) if it is used to state that it's
> not currently implemented.
Sure, we can indeed initially have the JIT label, and return -EINVAL for
now.
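Something along these lines, as a userspace model of the dispatch only.
codectl_model() and the two stub handlers are hypothetical, and rejecting
non-zero flags is an assumption based on "flags (0)" in the RFC.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum code_opt {
	CODE_REGISTER_ELF,
	CODE_REGISTER_JIT,
	CODE_UNREGISTER,
};

/* Hypothetical stand-ins for the real handlers; they only report success. */
static long code_register_elf(const void *info, size_t info_size)
{
	(void)info;
	(void)info_size;
	return 0;
}

static long code_unregister(const void *info, size_t info_size)
{
	(void)info;
	(void)info_size;
	return 0;
}

static long codectl_model(unsigned int option, const void *info,
			  size_t info_size, unsigned int flags)
{
	if (flags)
		return -EINVAL;	/* assumption: no flags are defined yet */

	switch (option) {
	case CODE_REGISTER_ELF:
		return code_register_elf(info, info_size);
	case CODE_UNREGISTER:
		return code_unregister(info, info_size);
	case CODE_REGISTER_JIT:	/* label reserved, not implemented yet */
	default:
		return -EINVAL;
	}
}
```

Userspace can then probe for JIT support by checking for -EINVAL, and the
label value stays stable once the JIT side is eventually implemented.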
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [RFC] New codectl(2) system call for sframe registration
2025-07-22 13:51 ` Mathieu Desnoyers
@ 2025-07-22 16:25 ` Steven Rostedt
2025-07-22 18:26 ` Mathieu Desnoyers
2025-07-22 18:56 ` Jose E. Marchesi
0 siblings, 2 replies; 22+ messages in thread
From: Steven Rostedt @ 2025-07-22 16:25 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Josh Poimboeuf, Peter Zijlstra, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton, Jens Axboe, Florian Weimer, Sam James,
Brian Robbins, Elena Zannoni
On Tue, 22 Jul 2025 09:51:22 -0400
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> > Here's a hypothetical, what if for some reason (say having the sframe
> > sections outside of the elf file) that the linker shares that?
>
> So your hypothetical scenario is having sframe provided as a separate
> file. This sframe file (or part of it) would still describe how to
> unwind a given elf file text range. So I would argue that this would
No. It should describe how to get access to an sframe section for some
text that has already been loaded in memory.
I'm looking for a mapping from text that is already loaded in memory to
the sframe data, somewhere on disk, that describes how to unwind it.
> still fit into the model of CODE_REGISTER_ELF, it's just that the
> address range from sframe_start to sframe_end would be mapped from
> a different file. This is entirely up to the dynamic loader and should
> not impact the kernel ABI.
>
> AFAIK the a.out binary support was deprecated in Linux kernel v5.1. So
> being elf specific is not an issue.
Yes, but we are not registering ELF. We are registering how to unwind
something with sframes. If it's not sframes we are registering, what is
it?
>
> And if for some reason we end up inventing a new model to hand over the
> sframe information in the future, for instance if we choose not to map
> the sframe information in userspace and hand over a sframe-file pathname
> and offset instead, we'll just extend the code_opt enum with a new
> label.
This is not a new model. We could likely do it today without much
effort. We are handing over sframe data regardless of whether it's in an
ELF file or not.
The system call is for the dynamic linker to tell the kernel where it can
find the sframes for newly loaded text.
>
> >
> > For instance, if the sframe sections are downloaded separately as a
> > separate package for a given executable (to make it not mandatory for an
> > install), the linker could be smart enough to see that they exist in some
> > special location and then pass that to the kernel. In other words, this is
> > option is specific for sframe and not ELF. I rather call it by that.
>
> As I explained above, if the dynamic loader populates the sframe section
> in userspace memory, this fits within the CODE_REGISTER_ELF ABI. If we
But this isn't about ELF! It's about sframes! Why not name it that?
> eventually choose not to map the sframe section into userspace memory
> (even though this is not an envisioned use-case at the moment), we can
> just extend enum code_opt with a new label.
Why call this at all if you don't plan on mapping sframes?
>
> >
> >>
> >> If there are other file types in the future that happen to contain an
> >> sframe section (but are not ELF), then we can simply add a new label to
> >> enum code_opt.
> >>
> >>>
> >>>>
> >>>> sys_codectl(2)
> >>>> =================
> >>>>
> >>>> * arg0: unsigned int @option:
> >>>>
> >>>> /* Additional labels can be added to enum code_opt, for extensibility. */
> >>>>
> >>>> enum code_opt {
> >>>> CODE_REGISTER_ELF,
> >>>
> >>> Perhaps the above should be: CODE_REGISTER_SFRAME,
> >>>
> >>> as currently SFrame is read only via files.
> >>
> >> As I pointed out above, on GNU/Linux, sframe is always an allocated,loaded
> >> ELF section. AFAIU, your comment implies that we'd want to support other scenarios
> >> where the sframe is in files outside of elf binary sframe sections. Can you
> >> expand on the use-case you have for this, or is it just for future-proofing ?
> >
> > Heh, I just did above (before reading this). But yeah, it could be. As I
> > mentioned above, this is not about ELF files. Sframes just happen to be in
> > an ELF file. CODE_REGISTER_ELF sounds like this is for doing special
> > actions to an ELF file, when in reality it is doing special actions to tell
> > the kernel this is an sframe table. It just happens that sframes are in
> > ELF. Let's call it for what it is used for.
>
> I see sframe as one "aspect" of an ELF file. Sure, we could do one
> system call for every aspect of an ELF file that we want to register,
> but that would require many round trips from userspace to the kernel
> every time a library is loaded. In my opinion it makes sense to combine
> all aspects of an elf file that we want the kernel to know about into
> one registration system call. In that sense, we're not registering just
> sframe, but the various aspects of an ELF file, which include sframe.
So you are making this a generic ELF function? What other functions do
you plan to do with this system call?
>
> By the way, the sframe section is optional as well. If we allow
> sframe_start and sframe_end to be NULL, this would let libc register
> an sframe-less ELF file with its pathname, build-id, and debug info
> to the kernel. This would be immediately useful on its own for
> distributions that have frame pointers enabled even without sframe
> section.
The above is called mission creep. Looks to me that you are using this
as a way to have LTTng get easier access to build ids and such. We can
add *that* later if needed, as a separate option. This has nothing to
do with the current requirements.
> >>>
> >>> And call it "struct code_sframe_info"
> >>>
> >>>> __u64 text_start;
> >>>> __u64 text_end;
> >>>
> >>>> __u64 sframe_start;
> >>>> __u64 sframe_end;
> >>>
> >>> What is the above "sframe" for?
> >
> > Still wondering what the above is for.
>
> Well we have an sframe section which is mapped into userspace memory
> from sframe_start to sframe_end, which contains the unwind information
> that covers the code from text_start to text_end.
Actually, the sframe section shouldn't be mapped into user space
memory. The kernel will be doing that, not the linker. I would say that
the system call can give a hint of where it would like it mapped, but
it should allow the kernel to decide where to map it as the user space
code doesn't care where it gets mapped.
In the future, if we want to compress the sframe section, it will not
even be a loadable ELF section. But the system call can tell the
kernel: "there's a sframe compressed section at this offset/size in
this file" for this text address range and then the kernel will do the
rest.
>
> Am I unknowingly adding some kind of redundancy here ?
>
Maybe. This system call was to add unwinding information for the kernel.
It looks like you are having it be much more than that. I'm not against
that, but that should only be for extensions, and currently, this is
supposed to only make sframes work.
-- Steve
* Re: [RFC] New codectl(2) system call for sframe registration
2025-07-21 15:20 [RFC] New codectl(2) system call for sframe registration Mathieu Desnoyers
2025-07-21 18:53 ` Steven Rostedt
@ 2025-07-22 18:21 ` Indu Bhagat
2025-07-22 18:49 ` Mathieu Desnoyers
2025-07-23 0:26 ` Masami Hiramatsu
2 siblings, 1 reply; 22+ messages in thread
From: Indu Bhagat @ 2025-07-22 18:21 UTC (permalink / raw)
To: Mathieu Desnoyers, rostedt
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Josh Poimboeuf, Peter Zijlstra, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
Jens Axboe, Florian Weimer, Sam James, Brian Robbins,
Elena Zannoni
On 7/21/25 8:20 AM, Mathieu Desnoyers wrote:
> Hi!
>
> I've written up an RFC for a new system call to handle sframe registration
> for shared libraries. There has been interest to cover both sframe in
> the short term, but also JIT use-cases in the long term, so I'm
> covering both here in this RFC to provide the full context. Implementation
> wise we could start by only covering the sframe use-case.
>
> I've called it "codectl(2)" for now, but I'm of course open to feedback.
>
> For ELF, I'm including the optional pathname, build id, and debug link
> information which are really useful to translate from instruction pointers
> to executable/library name, symbol, offset, source file, line number.
> This is what we are using in LTTng-UST and Babeltrace debug-info filter
> plugin [1], and I think this would be relevant for kernel tracers as well
> so they can make the resulting stack traces meaningful to users.
>
> sys_codectl(2)
> =================
>
> * arg0: unsigned int @option:
>
> /* Additional labels can be added to enum code_opt, for extensibility. */
>
> enum code_opt {
> CODE_REGISTER_ELF,
> CODE_REGISTER_JIT,
> CODE_UNREGISTER,
> };
>
> * arg1: void * @info
>
> /* if (@option == CODE_REGISTER_ELF) */
>
> /*
> * text_start, text_end, sframe_start, sframe_end allow unwinding of the
> * call stack.
> *
> * elf_start, elf_end, pathname, and either build_id or debug_link allows
> * mapping instruction pointers to file, symbol, offset, and source file
> * location.
> */
> struct code_elf_info {
> : __u64 elf_start;
> __u64 elf_end;
What are elf_start and elf_end intended for?
> __u64 text_start;
> __u64 text_end;
> __u64 sframe_start;
> __u64 sframe_end;
> __u64 pathname; /* char *, NULL if unavailable. */
>
> __u64 build_id; /* char *, NULL if unavailable. */
> __u64 debug_link_pathname; /* char *, NULL if unavailable. */
> __u32 build_id_len;
> __u32 debug_link_crc;
> };
>
>
> /* if (@option == CODE_REGISTER_JIT) */
>
> /*
> * Registration of sorted JIT unwind table: The reserved memory area is
> * of size reserved_len. Userspace increases used_len as new code is
> * populated between text_start and text_end. This area is populated in
> * increasing address order, and its ABI requires to have no overlapping
> * fre. This fits the common use-case where JITs populate code into
> * a given memory area by increasing address order. The sorted unwind
> * tables can be chained with a singly-linked list as they become full.
> * Consecutive chained tables are also in sorted text address order.
> *
> * Note: if there is an eventual use-case for unsorted jit unwind table,
> * this would be introduced as a new "code option".
> */
>
> struct code_jit_info {
> __u64 text_start; /* text_start >= addr */
> __u64 text_end; /* addr < text_end */
> __u64 unwind_head; /* struct code_jit_unwind_table * */
> };
>
I see the discussion has evolved here with the general sentiment that
the JIT part needs to be kept in mind for a rough sketch but cannot be
designed at this time. But two comments (if we keep JIT part in the
discussion):
- I think __u64 unwind_head should not be a pointer to a defined
structure (struct code_jit_unwind_table * above), but some opaque
type, like we have for the SFrame case.
- The reserved_len should ideally be a part of code_jit_info, so the
length can be known without parsing the contents.
> struct code_jit_unwind_fre {
> /*
> * Contains info similar to sframe, allowing unwind for a given
> * code address range.
> */
> __u32 size;
> __u32 ip_off; /* offset from text_start */
> __s32 cfa_off;
> __s32 ra_off;
> __s32 fp_off;
> __u8 info;
> };
>
> struct code_jit_unwind_table {
> __u64 reserved_len;
> __u64 used_len; /*
> * Incremented by userspace (store-release), read by
> * the kernel (load-acquire).
> */
> __u64 next; /* Chain with next struct code_jit_unwind_table. */
> struct code_jit_unwind_fre fre[];
> };
>
> /* if (@option == CODE_UNREGISTER) */
>
> void *info
>
> * arg2: size_t info_size
>
> /*
> * Size of @info structure, allowing extensibility. See
> * copy_struct_from_user().
> */
>
> * arg3: unsigned int flags (0)
>
> /* Flags for extensibility. */
>
> Your feedback is welcome,
>
> Thanks,
>
> Mathieu
>
> [1] https://babeltrace.org/docs/v2.0/man7/babeltrace2-filter.lttng-utils.debug-info.7/
>
* Re: [RFC] New codectl(2) system call for sframe registration
2025-07-22 16:25 ` Steven Rostedt
@ 2025-07-22 18:26 ` Mathieu Desnoyers
2025-07-22 19:11 ` Steven Rostedt
2025-07-22 18:56 ` Jose E. Marchesi
1 sibling, 1 reply; 22+ messages in thread
From: Mathieu Desnoyers @ 2025-07-22 18:26 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Josh Poimboeuf, Peter Zijlstra, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton, Jens Axboe, Florian Weimer, Sam James,
Brian Robbins, Elena Zannoni
On 2025-07-22 12:25, Steven Rostedt wrote:
> On Tue, 22 Jul 2025 09:51:22 -0400
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>
>>> Here's a hypothetical, what if for some reason (say having the sframe
>>> sections outside of the elf file) that the linker shares that?
>>
>> So your hypothetical scenario is having sframe provided as a separate
>> file. This sframe file (or part of it) would still describe how to
>> unwind a given elf file text range. So I would argue that this would
>
> No. It should describe how to get access to an sframe section for some
> text that has already been loaded in memory.
>
> I'm looking for a mapping between already loaded text memory to how to
> unwind it that will be in an sframe format somewhere on disk.
OK, so what you have in mind is the compressed sframe use-case.
Ideally, for the compressed sframe use-case I suspect we'd want to do
lazy on demand decompression which could decompress only the parts that
are needed for the unwind, rather than expand everything in memory.
Pointing the kernel to a file/offset on disk is rather different from
the current ELF sframe section scenario, where it is allocated and loaded
into the process's address space. I suspect we would want to cover this
with a future new code_opt enum label.
>
>> still fit into the model of CODE_REGISTER_ELF, it's just that the
>> address range from sframe_start to sframe_end would be mapped from
>> a different file. This is entirely up to the dynamic loader and should
>> not impact the kernel ABI.
>>
>> AFAIK the a.out binary support was deprecated in Linux kernel v5.1. So
>> being elf specific is not an issue.
>
> Yes, but we are not registering ELF. We are registering how to unwind
> something with sframes. If it's not sframes we are registering, what is
> it?
I am thinking of sframes as one of the properties of an ELF executable.
So from my perspective we are registering an ELF file with various
properties, one of which is its sframe section.
But I think I see what you are getting at: if we define the sframe
registration for ELF as sframe_start, sframe_end, then it forgoes
approaches where sframe is provided through other means, such as
pathname and offset, which would be useful for the compressed sframe
use-case.
If system call overhead is not too much of an issue at library load,
then we could break this down into multiple system calls, e.g.
eventually:
codectl(CODE_REGISTER_SFRAME, /* provide sframe start + end */ )
codectl(CODE_REGISTER_ELF, /* provide elf-specific info such as build id */ )
>
>>
>> And if for some reason we end up inventing a new model to hand over the
>> sframe information in the future, for instance if we choose not to map
>> the sframe information in userspace and hand over a sframe-file pathname
>> and offset instead, we'll just extend the code_opt enum with a new
>> label.
>
> This is not a new model. We could likely do it today without much
> effort. We are handing over sframe data regardless if it's in an ELF
> file or not.
>
> The systemcall is to let the dynamic linker know where the kernel can
> find the sframes for newly loaded text.
I am saying this is a "new" model because the current sframe section is
allocated,loaded, which means it is present in userspace memory, so it
seems rather logical to delimit this area with pointers to the start/end
of that range.
>
>>
>>>
>>> For instance, if the sframe sections are downloaded separately as a
>>> separate package for a given executable (to make it not mandatory for an
>>> install), the linker could be smart enough to see that they exist in some
>>> special location and then pass that to the kernel. In other words, this is
>>> option is specific for sframe and not ELF. I rather call it by that.
>>
>> As I explained above, if the dynamic loader populates the sframe section
>> in userspace memory, this fits within the CODE_REGISTER_ELF ABI. If we
>
> But this isn't about ELF! It's about sframes! Why not name it that?
I understand your position in wanting other "types" of sframe registration
in the future that would cover compressed sframe files. Because of this,
it makes sense that the registration becomes specific to sframe, because
we would not want to tie all "elf" registrations to a specific sframe
ABI (mapped in userspace memory, within a given address range vs pathname
and offset).
>
>> eventually choose not to map the sframe section into userspace memory
>> (even though this is not an envisioned use-case at the moment), we can
>> just extend enum code_opt with a new label.
>
> Why call this at all if you don't plan on mapping sframes?
If we split this into separate registrations (sframe vs elf), then it
would be fine: registering an elf binary (in the future) could be done
to explicitly register pathname, build-id and debug link. And this is
independent of sframe. This could come as a future new code_opt label, no
need to do it now.
>
>>
>>>
>>>>
>>>> If there are other file types in the future that happen to contain an
>>>> sframe section (but are not ELF), then we can simply add a new label to
>>>> enum code_opt.
>>>>
>>>>>
>>>>>>
>>>>>> sys_codectl(2)
>>>>>> =================
>>>>>>
>>>>>> * arg0: unsigned int @option:
>>>>>>
>>>>>> /* Additional labels can be added to enum code_opt, for extensibility. */
>>>>>>
>>>>>> enum code_opt {
>>>>>> CODE_REGISTER_ELF,
>>>>>
>>>>> Perhaps the above should be: CODE_REGISTER_SFRAME,
>>>>>
>>>>> as currently SFrame is read only via files.
>>>>
>>>> As I pointed out above, on GNU/Linux, sframe is always an allocated,loaded
>>>> ELF section. AFAIU, your comment implies that we'd want to support other scenarios
>>>> where the sframe is in files outside of elf binary sframe sections. Can you
>>>> expand on the use-case you have for this, or is it just for future-proofing ?
>>>
>>> Heh, I just did above (before reading this). But yeah, it could be. As I
>>> mentioned above, this is not about ELF files. Sframes just happen to be in
>>> an ELF file. CODE_REGISTER_ELF sounds like this is for doing special
>>> actions to an ELF file, when in reality it is doing special actions to tell
>>> the kernel this is an sframe table. It just happens that sframes are in
>>> ELF. Let's call it for what it is used for.
>>
>> I see sframe as one "aspect" of an ELF file. Sure, we could do one
>> system call for every aspect of an ELF file that we want to register,
>> but that would require many round trips from userspace to the kernel
>> every time a library is loaded. In my opinion it makes sense to combine
>> all aspects of an elf file that we want the kernel to know about into
>> one registration system call. In that sense, we're not registering just
>> sframe, but the various aspects of an ELF file, which include sframe.
>
> So you are making this a generic ELF function? What other functions do
> you plan to do with this system call?
All those I have in mind are part of this RFC.
>
>>
>> By the way, the sframe section is optional as well. If we allow
>> sframe_start and sframe_end to be NULL, this would let libc register
>> an sframe-less ELF file with its pathname, build-id, and debug info
>> to the kernel. This would be immediately useful on its own for
>> distributions that have frame pointers enabled even without sframe
>> section.
>
> The above is called mission creep. Looks to me that you are using this
> as a way to have LTTng get easier access to build ids and such. We can
> add *that* later if needed, as a separate option. This has nothing to
> do with the current requirements.
I agree on the mission creep argument. I disagree on the stated intent though.
For LTTng, I'm happy to grab this information from userspace. I already have
it and I don't need it from the kernel. I figured it would be most useful for
perf and ftrace if you guys can directly get that information without relying
on a userspace tracer.
So considering the fact that you'll want to introduce new sframe registration
methods in the future, then indeed it makes sense to make the registration
sframe-specific.
>
>
>>>>>
>>>>> And call it "struct code_sframe_info"
>>>>>
>>>>>> __u64 text_start;
>>>>>> __u64 text_end;
>>>>>
>>>>>> __u64 sframe_start;
>>>>>> __u64 sframe_end;
>>>>>
>>>>> What is the above "sframe" for?
>>>
>>> Still wondering what the above is for.
>>
>> Well we have an sframe section which is mapped into userspace memory
>> from sframe_start to sframe_end, which contains the unwind information
>> that covers the code from text_start to text_end.
>
> Actually, the sframe section shouldn't be mapped into user space
> memory. The kernel will be doing that, not the linker.
AFAIU, that's not how the sframe section works today. It's allocated,loaded.
So userspace maps the section into its address space, and the kernel takes
the page faults when it needs to load its content.
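For illustration, here is a sketch of how userspace could find where that
section ended up in its own address space, using dl_iterate_phdr(3). It
assumes the toolchain emits a PT_GNU_SFRAME program header for the section.
The fallback value for PT_GNU_SFRAME and the names struct sframe_range and
find_main_sframe() are assumptions of this sketch, not part of the RFC.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <link.h>
#include <stddef.h>
#include <stdint.h>

#ifndef PT_GNU_SFRAME
#define PT_GNU_SFRAME 0x6474e554	/* assumed fallback for older headers */
#endif

struct sframe_range {
	uint64_t sframe_start;
	uint64_t sframe_end;
	int found;
};

/* Record the sframe range of the first object visited (the main program). */
static int find_sframe_cb(struct dl_phdr_info *info, size_t size, void *data)
{
	struct sframe_range *r = data;
	size_t i;

	(void)size;
	for (i = 0; i < info->dlpi_phnum; i++) {
		const ElfW(Phdr) *ph = &info->dlpi_phdr[i];

		if (ph->p_type != PT_GNU_SFRAME)
			continue;
		/* Segment addresses are relative to the load bias. */
		r->sframe_start = info->dlpi_addr + ph->p_vaddr;
		r->sframe_end = r->sframe_start + ph->p_memsz;
		r->found = 1;
		break;
	}
	return 1;	/* non-zero stops the walk after the first object */
}

static struct sframe_range find_main_sframe(void)
{
	struct sframe_range r = { 0, 0, 0 };

	dl_iterate_phdr(find_sframe_cb, &r);
	return r;
}
```

On a binary built without sframe, found stays 0 and there is simply no
range to register.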
> I would say that
> the system call can give a hint of where it would like it mapped, but
> it should allow the kernel to decide where to map it as the user space
> code doesn't care where it gets mapped.
AFAIU currently the dynamic loader maps the section, not the kernel.
>
> In the future, if we wants to compress the sframe section, it will not
> even be a loadable ELF section. But the system call can tell the
> kernel: "there's a sframe compressed section at this offset/size in
> this file" for this text address range and then the kernel will do the
> rest.
I would see this compressed side-file handled entirely from the kernel
(not mapped in userspace) as a new enum code_opt option.
>
>>
>> Am I unknowingly adding some kind of redundancy here ?
>>
>
> Maybe. This systemcall was to add unwinding information for the kernel.
> It looks like you are having it be much more than that. I'm not against
> that, but that should only be for extensions, and currently, this is
> supposed to only make sframes work.
I agree that if we state that "elf" registration has sframe_start/end
as a means to express sframe, then we are stuck with a model where userspace
needs to map the section in its memory. Considering that you want to
express different models where a filename and offset is provided to the
kernel instead, then it makes sense to make the registration more specific.
The downside would be that we may have to do more than one system call if we
want to register more than one "aspect", e.g. sframe vs elf build-id.
I think the overhead of a single vs a few system calls is an important
aspect to consider. If the overhead of a few more system calls at library
load does not matter too much, then we should go for the more specific
registration. I have no clue whether that overhead matters in practice though.
Thoughts ?
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [RFC] New codectl(2) system call for sframe registration
2025-07-22 18:21 ` Indu Bhagat
@ 2025-07-22 18:49 ` Mathieu Desnoyers
2025-07-23 8:16 ` Indu Bhagat
0 siblings, 1 reply; 22+ messages in thread
From: Mathieu Desnoyers @ 2025-07-22 18:49 UTC (permalink / raw)
To: Indu Bhagat, rostedt
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Josh Poimboeuf, Peter Zijlstra, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
Jens Axboe, Florian Weimer, Sam James, Brian Robbins,
Elena Zannoni
On 2025-07-22 14:21, Indu Bhagat wrote:
> On 7/21/25 8:20 AM, Mathieu Desnoyers wrote:
>> Hi!
>>
>> I've written up an RFC for a new system call to handle sframe registration
>> for shared libraries. There has been interest to cover both sframe in
>> the short term, but also JIT use-cases in the long term, so I'm
>> covering both here in this RFC to provide the full context. Implementation
>> wise we could start by only covering the sframe use-case.
>>
>> I've called it "codectl(2)" for now, but I'm of course open to feedback.
>>
>> For ELF, I'm including the optional pathname, build id, and debug link
>> information which are really useful to translate from instruction pointers
>> to executable/library name, symbol, offset, source file, line number.
>> This is what we are using in LTTng-UST and Babeltrace debug-info filter
>> plugin [1], and I think this would be relevant for kernel tracers as well
>> so they can make the resulting stack traces meaningful to users.
>>
>> sys_codectl(2)
>> =================
>>
>> * arg0: unsigned int @option:
>>
>> /* Additional labels can be added to enum code_opt, for extensibility. */
>>
>> enum code_opt {
>> CODE_REGISTER_ELF,
>> CODE_REGISTER_JIT,
>> CODE_UNREGISTER,
>> };
>>
>> * arg1: void * @info
>>
>> /* if (@option == CODE_REGISTER_ELF) */
>>
>> /*
>> * text_start, text_end, sframe_start, sframe_end allow unwinding of the
>> * call stack.
>> *
>> * elf_start, elf_end, pathname, and either build_id or debug_link allows
>> * mapping instruction pointers to file, symbol, offset, and source file
>> * location.
>> */
>> struct code_elf_info {
>> : __u64 elf_start;
>> __u64 elf_end;
>
> What are the elf_start , elf_end intended for ?
The intent is to know at which address the first loadable segment of
the shared object is mapped (elf_start), and the size of the shared
object mapping, which is the sum of the size of its PT_LOAD segments.
This allows tooling to easily look up which addresses belong to that
shared object, for any loaded segment, whether it's code or data.
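As a sketch of that lookup from the userspace side, the range of a loaded
object can be derived from its PT_LOAD segments with dl_iterate_phdr(3).
Note the assumption here: elf_end is treated as the end address of the
highest PT_LOAD segment, so [elf_start, elf_end) spans the whole mapping.
struct elf_range and elf_range_of() are names made up for this example.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <link.h>
#include <stddef.h>
#include <stdint.h>

struct elf_range {
	uint64_t addr;		/* in: address to look up */
	uint64_t elf_start;	/* out: lowest PT_LOAD address */
	uint64_t elf_end;	/* out: end of the highest PT_LOAD segment */
	int found;
};

static int elf_range_cb(struct dl_phdr_info *info, size_t size, void *data)
{
	struct elf_range *r = data;
	uint64_t lo = UINT64_MAX, hi = 0;
	size_t i;

	(void)size;
	for (i = 0; i < info->dlpi_phnum; i++) {
		const ElfW(Phdr) *ph = &info->dlpi_phdr[i];
		uint64_t start, end;

		if (ph->p_type != PT_LOAD)
			continue;
		start = info->dlpi_addr + ph->p_vaddr;
		end = start + ph->p_memsz;
		if (start < lo)
			lo = start;
		if (end > hi)
			hi = end;
	}
	if (lo > r->addr || r->addr >= hi)
		return 0;	/* not this object, keep iterating */
	r->elf_start = lo;
	r->elf_end = hi;
	r->found = 1;
	return 1;		/* stop: the owning object was found */
}

/* Find the loaded object whose PT_LOAD span contains addr. */
static struct elf_range elf_range_of(const void *addr)
{
	struct elf_range r = { (uint64_t)(uintptr_t)addr, 0, 0, 0 };

	dl_iterate_phdr(elf_range_cb, &r);
	return r;
}
```

A tracer given only [elf_start, elf_end) can do the same containment test
for any sampled address, code or data, without walking program headers.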
>
>> __u64 text_start;
>> __u64 text_end;
>> __u64 sframe_start;
>> __u64 sframe_end;
>> __u64 pathname; /* char *, NULL if unavailable. */
>>
>> __u64 build_id; /* char *, NULL if unavailable. */
>> __u64 debug_link_pathname; /* char *, NULL if unavailable. */
>> __u32 build_id_len;
>> __u32 debug_link_crc;
>> };
>>
>>
>> /* if (@option == CODE_REGISTER_JIT) */
>>
>> /*
>> * Registration of sorted JIT unwind table: The reserved memory area is
>> * of size reserved_len. Userspace increases used_len as new code is
>> * populated between text_start and text_end. This area is populated in
>> * increasing address order, and its ABI requires to have no overlapping
>> * fre. This fits the common use-case where JITs populate code into
>> * a given memory area by increasing address order. The sorted unwind
>> * tables can be chained with a singly-linked list as they become full.
>> * Consecutive chained tables are also in sorted text address order.
>> *
>> * Note: if there is an eventual use-case for unsorted jit unwind table,
>> * this would be introduced as a new "code option".
>> */
>>
>> struct code_jit_info {
>> __u64 text_start; /* text_start >= addr */
>> __u64 text_end; /* addr < text_end */
>> __u64 unwind_head; /* struct code_jit_unwind_table * */
>> };
>>
>
> I see the discussion has evolved here with the general sentiment that
> the JIT part needs to be kept in mind for a rough sketch but cannot be
> designed at this time. But two comments (if we keep JIT part in the
> discussion):
> - I think we need to keep __u64 unwind_head not a pointer to a
> defined structure (struct code_jit_unwind_table * above), but some
> opaque type like we have for SFrame case.
What is the reason for making this an opaque type for sframe ?
> - The reserved_len should ideally be a part of code_jit_info, so the
> length can be known without parsing the contents.
I've placed reserved_len within the unwind table because I planned to
have the jit information for a given range of text be a linked list of
tables. Therefore, if one table fills up, then another table can be
chained at the tail. Having the reserved_len part of each table makes
things easier to combine into a linked list.
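To make the chaining concrete, here is a userspace sketch of the producer
side and of a reader walking the chain. It makes several assumptions that
the RFC does not state: reserved_len and used_len are counted in bytes,
entries are fixed-size, there is a single writer, and the helper names
(jit_table_alloc, jit_table_append, jit_chain_count) are invented here.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct code_jit_unwind_fre {
	uint32_t size;
	uint32_t ip_off;	/* offset from text_start */
	int32_t cfa_off;
	int32_t ra_off;
	int32_t fp_off;
	uint8_t info;
};

struct code_jit_unwind_table {
	uint64_t reserved_len;	/* bytes reserved for fre[] (assumed unit) */
	uint64_t used_len;	/* bytes of fre[] published so far */
	uint64_t next;		/* next table in the chain, 0 if none */
	struct code_jit_unwind_fre fre[];
};

static struct code_jit_unwind_table *jit_table_alloc(uint64_t nr_fre)
{
	uint64_t len = nr_fre * sizeof(struct code_jit_unwind_fre);
	struct code_jit_unwind_table *t = calloc(1, sizeof(*t) + len);

	if (t)
		t->reserved_len = len;
	return t;
}

/*
 * Append one entry at the tail, chaining a new table of nr_fre entries
 * when the tail is full. Returns the (possibly new) tail, or NULL if the
 * allocation fails.
 */
static struct code_jit_unwind_table *
jit_table_append(struct code_jit_unwind_table *tail,
		 const struct code_jit_unwind_fre *fre, uint64_t nr_fre)
{
	uint64_t used = tail->used_len;	/* single writer: plain read */

	assert(nr_fre > 0);
	if (used + sizeof(*fre) > tail->reserved_len) {
		struct code_jit_unwind_table *t = jit_table_alloc(nr_fre);

		if (!t)
			return NULL;
		/* Publish the link only once the new table is initialized. */
		__atomic_store_n(&tail->next, (uint64_t)(uintptr_t)t,
				 __ATOMIC_RELEASE);
		tail = t;
		used = 0;
	}
	tail->fre[used / sizeof(*fre)] = *fre;
	/* Store-release: the entry is written before used_len covers it. */
	__atomic_store_n(&tail->used_len, used + sizeof(*fre),
			 __ATOMIC_RELEASE);
	return tail;
}

/* Reader side: count published entries across the whole chain. */
static uint64_t jit_chain_count(struct code_jit_unwind_table *t)
{
	uint64_t n = 0;

	while (t) {
		n += __atomic_load_n(&t->used_len, __ATOMIC_ACQUIRE) /
		     sizeof(t->fre[0]);
		t = (struct code_jit_unwind_table *)(uintptr_t)
			__atomic_load_n(&t->next, __ATOMIC_ACQUIRE);
	}
	return n;
}
```

Since each table carries its own reserved_len, a reader needs nothing
beyond the head pointer to bound every table in the chain.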
Thanks for your feedback !
Mathieu
>
>> struct code_jit_unwind_fre {
>> /*
>> * Contains info similar to sframe, allowing unwind for a given
>> * code address range.
>> */
>> __u32 size;
>> __u32 ip_off; /* offset from text_start */
>> __s32 cfa_off;
>> __s32 ra_off;
>> __s32 fp_off;
>> __u8 info;
>> };
>>
>> struct code_jit_unwind_table {
>> __u64 reserved_len;
>> __u64 used_len; /*
>> * Incremented by userspace (store-release), read by
>> * the kernel (load-acquire).
>> */
>> __u64 next; /* Chain with next struct code_jit_unwind_table. */
>> struct code_jit_unwind_fre fre[];
>> };
>>
>> /* if (@option == CODE_UNREGISTER) */
>>
>> void *info
>>
>> * arg2: size_t info_size
>>
>> /*
>> * Size of @info structure, allowing extensibility. See
>> * copy_struct_from_user().
>> */
>>
>> * arg3: unsigned int flags (0)
>>
>> /* Flags for extensibility. */
>>
>> Your feedback is welcome,
>>
>> Thanks,
>>
>> Mathieu
>>
>> [1] https://babeltrace.org/docs/v2.0/man7/babeltrace2-filter.lttng-utils.debug-info.7/
>>
>
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [RFC] New codectl(2) system call for sframe registration
2025-07-22 16:25 ` Steven Rostedt
2025-07-22 18:26 ` Mathieu Desnoyers
@ 2025-07-22 18:56 ` Jose E. Marchesi
2025-07-22 19:17 ` Steven Rostedt
1 sibling, 1 reply; 22+ messages in thread
From: Jose E. Marchesi @ 2025-07-22 18:56 UTC (permalink / raw)
To: Steven Rostedt
Cc: Mathieu Desnoyers, linux-kernel, linux-trace-kernel, bpf, x86,
Masami Hiramatsu, Josh Poimboeuf, Peter Zijlstra, Ingo Molnar,
Jiri Olsa, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
Indu Bhagat, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton, Jens Axboe, Florian Weimer, Sam James,
Brian Robbins, Elena Zannoni
> On Tue, 22 Jul 2025 09:51:22 -0400
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>
>> > Here's a hypothetical, what if for some reason (say having the sframe
>> > sections outside of the elf file) that the linker shares that?
>>
>> So your hypothetical scenario is having sframe provided as a separate
>> file. This sframe file (or part of it) would still describe how to
>> unwind a given elf file text range. So I would argue that this would
>
> No. It should describe how to get access to an sframe section for some
> text that has already been loaded in memory.
>
> I'm looking for a mapping between already loaded text memory to how to
> unwind it that will be in an sframe format somewhere on disk.
>
>> still fit into the model of CODE_REGISTER_ELF, it's just that the
>> address range from sframe_start to sframe_end would be mapped from
>> a different file. This is entirely up to the dynamic loader and should
>> not impact the kernel ABI.
>>
>> AFAIK the a.out binary support was deprecated in Linux kernel v5.1. So
>> being elf specific is not an issue.
>
> Yes, but we are not registering ELF. We are registering how to unwind
> something with sframes. If it's not sframes we are registering, what is
> it?
>
>>
>> And if for some reason we end up inventing a new model to hand over the
>> sframe information in the future, for instance if we choose not to map
>> the sframe information in userspace and hand over a sframe-file pathname
>> and offset instead, we'll just extend the code_opt enum with a new
>> label.
>
> This is not a new model. We could likely do it today without much
> effort. We are handing over sframe data regardless if it's in an ELF
> file or not.
>
> The systemcall is to let the dynamic linker know where the kernel can
> find the sframes for newly loaded text.
>
>>
>> >
>> > For instance, if the sframe sections are downloaded separately as a
>> > separate package for a given executable (to make it not mandatory for an
>> > install), the linker could be smart enough to see that they exist in some
>> > special location and then pass that to the kernel. In other words, this is
>> > option is specific for sframe and not ELF. I rather call it by that.
>>
>> As I explained above, if the dynamic loader populates the sframe section
>> in userspace memory, this fits within the CODE_REGISTER_ELF ABI. If we
>
> But this isn't about ELF! It's about sframes! Why not name it that?
>
>> eventually choose not to map the sframe section into userspace memory
>> (even though this is not an envisioned use-case at the moment), we can
>> just extend enum code_opt with a new label.
>
> Why call this at all if you don't plan on mapping sframes?
>
>>
>> >
>> >>
>> >> If there are other file types in the future that happen to contain an
>> >> sframe section (but are not ELF), then we can simply add a new label to
>> >> enum code_opt.
>> >>
>> >>>
>> >>>>
>> >>>> sys_codectl(2)
>> >>>> =================
>> >>>>
>> >>>> * arg0: unsigned int @option:
>> >>>>
>> >>>> /* Additional labels can be added to enum code_opt, for extensibility. */
>> >>>>
>> >>>> enum code_opt {
>> >>>> CODE_REGISTER_ELF,
>> >>>
>> >>> Perhaps the above should be: CODE_REGISTER_SFRAME,
>> >>>
>> >>> as currently SFrame is read only via files.
>> >>
>> >> As I pointed out above, on GNU/Linux, sframe is always an allocated,loaded
>> >> ELF section. AFAIU, your comment implies that we'd want to support other scenarios
>> >> where the sframe is in files outside of elf binary sframe sections. Can you
>> >> expand on the use-case you have for this, or is it just for future-proofing ?
>> >
>> > Heh, I just did above (before reading this). But yeah, it could be. As I
>> > mentioned above, this is not about ELF files. Sframes just happen to be in
>> > an ELF file. CODE_REGISTER_ELF sounds like this is for doing special
>> > actions to an ELF file, when in reality it is doing special actions to tell
>> > the kernel this is an sframe table. It just happens that sframes are in
>> > ELF. Let's call it for what it is used for.
>>
>> I see sframe as one "aspect" of an ELF file. Sure, we could do one
>> system call for every aspect of an ELF file that we want to register,
>> but that would require many round trips from userspace to the kernel
>> every time a library is loaded. In my opinion it makes sense to combine
>> all aspects of an elf file that we want the kernel to know about into
>> one registration system call. In that sense, we're not registering just
>> sframe, but the various aspects of an ELF file, which include sframe.
>
> So you are making this a generic ELF function? What other functions do
> you plan to do with this system call?
>
>>
>> By the way, the sframe section is optional as well. If we allow
>> sframe_start and sframe_end to be NULL, this would let libc register
>> an sframe-less ELF file with its pathname, build-id, and debug info
>> to the kernel. This would be immediately useful on its own for
>> distributions that have frame pointers enabled even without sframe
>> section.
>
> The above is called mission creep. Looks to me that you are using this
> as a way to have LTTng get easier access to build ids and such. We can
> add *that* later if needed, as a separate option. This has nothing to
> do with the current requirements.
I also think that involving ELF in the interface at this point without
having a good reason for doing so is probably not a good idea. Linked
and loaded SFrame data is pretty much self-contained and can be used as
such, but only if it gets loaded along with the text segments it refers
to. See below...
>
>> >>>
>> >>> And call it "struct code_sframe_info"
>> >>>
>> >>>> __u64 text_start;
>> >>>> __u64 text_end;
>> >>>
>> >>>> __u64 sframe_start;
>> >>>> __u64 sframe_end;
>> >>>
>> >>> What is the above "sframe" for?
>> >
>> > Still wondering what the above is for.
>>
>> Well we have an sframe section which is mapped into userspace memory
>> from sframe_start to sframe_end, which contains the unwind information
>> that covers the code from text_start to text_end.
>
> Actually, the sframe section shouldn't be mapped into user space
> memory. The kernel will be doing that, not the linker. I would say that
> the system call can give a hint of where it would like it mapped, but
> it should allow the kernel to decide where to map it as the user space
> code doesn't care where it gets mapped.
The SFrame data actually lives in a loadable segment in the ELF file
(and actually not alone, it has flatmates). This is important. The FDEs
in the SFrame section refer to the start of the functions they apply to.
These references take the form of an offset between the VM address where
the FDE itself gets loaded and the VM address of the beginning of the
function, which are also valid as file offsets, since all segments get
loaded together always keeping the same order. That's how dynamic
relocs are avoided.
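As a small illustration of why such self-relative offsets survive loading at any base address, here is a sketch in C. The helper names and the addresses are invented for the example, and it assumes a signed 32-bit offset measured from the address of the field that stores it.

```c
#include <assert.h>
#include <stdint.h>

/* Offset the static linker would store: function start minus field location. */
static int32_t link_time_offset(uint64_t func_vaddr, uint64_t field_vaddr)
{
	return (int32_t)((int64_t)func_vaddr - (int64_t)field_vaddr);
}

/* Resolve the stored offset at run time, given where the field got loaded. */
static uint64_t pcrel_target(uint64_t field_addr, int32_t offset)
{
	return field_addr + (uint64_t)(int64_t)offset;
}
```

With a function at link-time address 0x1000 and the field at 0x5000, the stored offset is -0x4000. If the whole object is then loaded at some bias B, the field sits at 0x5000 + B and resolving it yields 0x1000 + B, so nothing in the SFrame data had to be patched.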
This makes SFrame pretty self-contained, provided it gets loaded/mapped
along with the other loadable segments in the ELF file it comes from.
The loader will assure that.
I think glibc could "register" loaded SFrame data by just pointing the
kernel to the VM address where it got loaded, "you got some SFrame
there". Starting from that address it is then possible to find the
referred code locations just by applying the offsets, without needing
any additional information nor ELF foobar...
Or that's how I understand it. Indu will undoubtedly correct me if I am
wrong 8-)
> In the future, if we wants to compress the sframe section, it will not
> even be a loadable ELF section. But the system call can tell the
> kernel: "there's a sframe compressed section at this offset/size in
> this file" for this text address range and then the kernel will do the
> rest.
I think supporting compressed SFrame will probably require some sort of
relocation of the offsets in the uncompressed data, depending on where
the uncompressed data eventually gets loaded.
>>
>> Am I unknowingly adding some kind of redundancy here ?
>>
>
> Maybe. This systemcall was to add unwinding information for the kernel.
> It looks like you are having it be much more than that. I'm not against
> that, but that should only be for extensions, and currently, this is
> supposed to only make sframes work.
>
> -- Steve
* Re: [RFC] New codectl(2) system call for sframe registration
2025-07-22 18:26 ` Mathieu Desnoyers
@ 2025-07-22 19:11 ` Steven Rostedt
0 siblings, 0 replies; 22+ messages in thread
From: Steven Rostedt @ 2025-07-22 19:11 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Josh Poimboeuf, Peter Zijlstra, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton, Jens Axboe, Florian Weimer, Sam James,
Brian Robbins, Elena Zannoni
Florian, you may want to read this email as there are some questions about
dynamic linking.
On Tue, 22 Jul 2025 14:26:44 -0400
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> >
> > I'm looking for a mapping between already loaded text memory to how to
> > unwind it that will be in an sframe format somewhere on disk.
>
> OK, so what you have in mind is the compressed sframe use-case.
>
> Ideally, for the compressed sframe use-case I suspect we'd want to do
> lazy on demand decompression which could decompress only the parts that
> are needed for the unwind, rather than expand everything in memory.
>
> Pointing the kernel to a file/offset on disk is rather different than
> the current ELF sframe section scenario, where is it allocated,loaded
> into the process' address space. I suspect we would want to cover this
> with a future new code_opt enum label.
The sframe program header is of type PT_GNU_SFRAME and not PT_LOAD so
the linker will not be loading it. The code in the kernel has to do
something special with this section. It's not automatic.
So yes, I never had any expectation that the dynamic linker would even
load sframes into memory. It would simply tell the kernel where to find
them and the kernel would load them.
> >
> > Yes, but we are not registering ELF. We are registering how to unwind
> > something with sframes. If it's not sframes we are registering, what is
> > it?
>
> I am thinking of sframes as one of the properties of an ELF executable.
> So from my perspective we are registering an ELF file with various
> properties, one of which is its sframe section.
That wasn't what I was thinking.
>
> But I think I get where you are getting at: if we define the sframe
> registration for ELF as sframe_start, sframe_end, then it forgoes
> approaches where sframe is provided through other means, such as
> pathname and offset, which would be useful for the compressed sframe
> use-case.
>
> If system call overhead is not too much of an issue at library load,
> then we could break this down into multiple system calls, e.g.
> eventually:
>
> codectl(CODE_REGISTER_SFRAME, /* provide sframe start + end */ )
> codectl(CODE_REGISTER_ELF, /* provide elf-specific info such as build id */ )
IIRC, and Florian (who has been Cc'd) can correct me if I'm wrong,
dynamic file loading is quite a slow process and a few extra system
calls aren't going to show up outside the noise.
> > The systemcall is to let the dynamic linker know where the kernel can
> > find the sframes for newly loaded text.
>
> I am saying this is a "new" model because the current sframe section is
> allocated,loaded, which means it is present in userspace memory, so it
> seems rather logical to delimit this area with pointers to the start/end
> of that range.
But it's the kernel that maps it into memory. I was expecting that the
kernel would map it again into memory just like it does with the ELF
file. I wasn't expecting the dynamic linker to.
> >
> > Actually, the sframe section shouldn't be mapped into user space
> > memory. The kernel will be doing that, not the linker.
>
> AFAIU, that's not how the sframe section works today. It's allocated,loaded.
> So userspace maps the section into its address space, and the kernel takes
> the page faults when it needs to load its content.
Yes, but the kernel maps it. I wasn't expecting the user space dynamic
linker to map it. I was expecting the system call to simply say "here's
where the sframe section is in this file" and the kernel would take
care of the rest.
>
>
> > I would say that
> > the system call can give a hint of where it would like it mapped, but
> > it should allow the kernel to decide where to map it as the user space
> > code doesn't care where it gets mapped.
>
> AFAIU currently the dynamic loader maps the section, not the kernel.
You mean the prctl()?
I haven't looked too deep into that system call. It may do that
currently. I'm just thinking about what the best way to do this is. I
guess we should ask Florian which is best for the dynamic linker:
whether it should map it in, or whether the kernel should, keeping a
compressed format in mind as well.
>
> >
> > In the future, if we wants to compress the sframe section, it will not
> > even be a loadable ELF section. But the system call can tell the
> > kernel: "there's a sframe compressed section at this offset/size in
> > this file" for this text address range and then the kernel will do the
> > rest.
>
> I would see this compressed side-file handled entirely from the kernel
> (not mapped in userspace) as a new enum code_opt option.
Yes, it would likely be a new enum.
But if the dynamic linker has already mapped the sframe into memory and
is giving it to the kernel, then it is even less an "elf" file. It's
simply mapping a sframe section in memory with some text in memory. The
way the dynamic linker mapped it will still do everything as normal.
>
> >
> >>
> >> Am I unknowingly adding some kind of redundancy here ?
> >>
> >
> > Maybe. This systemcall was to add unwinding information for the kernel.
> > It looks like you are having it be much more than that. I'm not against
> > that, but that should only be for extensions, and currently, this is
> > supposed to only make sframes work.
>
> I agree that if we state that "elf" registration has sframe_start/end
> as a mean to express sframe, then we are stuck with a model where userspace
> needs to map the section in its memory. Considering that you want to
> express different models where a filename and offset is provided to the
> kernel instead, then it makes sense to make the registration more specific.
>
> The downside would be that we may have to do more than one system call if we
> want to register more than one "aspect", e.g. sframe vs elf build-id.
>
> I think the overhead of a single vs a few system calls is an important
> aspect to consider. If the overhead of a few more system calls at library
> load does not matter too much, then we should go for the more specific
> registration. I have no clue whether that overhead matters in practice though.
If the linker needs to map it, it is already doing lots of systemcalls
to accomplish that ;-)
-- Steve
* Re: [RFC] New codectl(2) system call for sframe registration
2025-07-22 18:56 ` Jose E. Marchesi
@ 2025-07-22 19:17 ` Steven Rostedt
2025-07-22 21:04 ` Indu Bhagat
0 siblings, 1 reply; 22+ messages in thread
From: Steven Rostedt @ 2025-07-22 19:17 UTC (permalink / raw)
To: Jose E. Marchesi
Cc: Mathieu Desnoyers, linux-kernel, linux-trace-kernel, bpf, x86,
Masami Hiramatsu, Josh Poimboeuf, Peter Zijlstra, Ingo Molnar,
Jiri Olsa, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
Indu Bhagat, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton, Jens Axboe, Florian Weimer, Sam James,
Brian Robbins, Elena Zannoni
On Tue, 22 Jul 2025 20:56:47 +0200
"Jose E. Marchesi" <jemarch@gnu.org> wrote:
> I think glibc could "register" loaded SFrame data by just pointing the
> kernel to the VM address where it got loaded, "you got some SFrame
> there". Starting from that address it is then possible to find the
> referred code locations just by applying the offsets, without needing
> any additional information nor ELF foobar...
>
> Or thats how I understand it. Indu will undoubtly correct me if I am
> wrong 8-)
Maybe I'm wrong, but if you know where the text is loaded (the final
location it is in memory), it is possible to figure out the relocations
in the sframe section.
>
> > In the future, if we wants to compress the sframe section, it will not
> > even be a loadable ELF section. But the system call can tell the
> > kernel: "there's a sframe compressed section at this offset/size in
> > this file" for this text address range and then the kernel will do the
> > rest.
>
> I think supporting compressed SFrame will probably require to do some
> sort of relocation of the offsets in the uncompressed data, depending on
> where the uncompressed data will get eventually loaded.
Assuming that all the text is at a given offset, would that be enough
to fill in the blanks?
As the text would have already been linked into memory before the
system call is made. If this is not the case, then we definitely need
the linker to load the sframe into memory before it does the system
call, and just give the kernel that address.
-- Steve
* Re: [RFC] New codectl(2) system call for sframe registration
2025-07-22 19:17 ` Steven Rostedt
@ 2025-07-22 21:04 ` Indu Bhagat
2025-07-22 21:13 ` Steven Rostedt
2025-07-23 15:07 ` Mathieu Desnoyers
0 siblings, 2 replies; 22+ messages in thread
From: Indu Bhagat @ 2025-07-22 21:04 UTC (permalink / raw)
To: Steven Rostedt, Jose E. Marchesi
Cc: Mathieu Desnoyers, linux-kernel, linux-trace-kernel, bpf, x86,
Masami Hiramatsu, Josh Poimboeuf, Peter Zijlstra, Ingo Molnar,
Jiri Olsa, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
Jens Axboe, Florian Weimer, Sam James, Brian Robbins,
Elena Zannoni
On 7/22/25 12:17 PM, Steven Rostedt wrote:
> On Tue, 22 Jul 2025 20:56:47 +0200
> "Jose E. Marchesi" <jemarch@gnu.org> wrote:
>
>> I think glibc could "register" loaded SFrame data by just pointing the
>> kernel to the VM address where it got loaded, "you got some SFrame
>> there". Starting from that address it is then possible to find the
>> referred code locations just by applying the offsets, without needing
>> any additional information nor ELF foobar...
>>
>> Or thats how I understand it. Indu will undoubtly correct me if I am
>> wrong 8-)
>
> Maybe I'm wrong, but if you know where the text is loaded (the final
> location it is in memory), it is possible to figure out the relocations
> in the sframe section.
>
(FWIW, What Jose wrote is correct.)
Some details which may help clear up some confusion here. The SFrame
sections are of type SHT_GNU_SFRAME and currently have
SEC_ALLOC|SEC_LOAD flags set. This means that they are allocated memory
and loaded at application start up time. These sections appear in a
PT_LOAD segment in the linked binaries.
Then there is a PT_GNU_SFRAME, which is a new program header type for
SFrame. PT_GNU_SFRAME by itself does not trigger the loading of SFrame
sections. But the .sframe sections being present in the PT_LOAD segment
does.
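To illustrate the distinction, here is a hedged sketch that checks whether the range described by a PT_GNU_SFRAME program header falls inside one of the PT_LOAD segments. The function name is made up, it only handles 64-bit headers, and the fallback define uses the binutils value (PT_LOOS + 0x474e554) in case the installed <elf.h> predates it.

```c
#include <assert.h>
#include <elf.h>
#include <stddef.h>

#ifndef PT_GNU_SFRAME
#define PT_GNU_SFRAME 0x6474e554	/* PT_LOOS + 0x474e554, as in binutils */
#endif

/*
 * Return the index of the PT_LOAD segment whose memory range contains the
 * PT_GNU_SFRAME range, -1 if there is no PT_GNU_SFRAME header, or -2 if
 * one is present but no PT_LOAD covers it.
 */
static int sframe_covering_load(const Elf64_Phdr *ph, size_t n)
{
	const Elf64_Phdr *sf = NULL;

	for (size_t i = 0; i < n; i++)
		if (ph[i].p_type == PT_GNU_SFRAME)
			sf = &ph[i];
	if (!sf)
		return -1;

	for (size_t i = 0; i < n; i++) {
		if (ph[i].p_type != PT_LOAD)
			continue;
		if (sf->p_vaddr >= ph[i].p_vaddr &&
		    sf->p_vaddr + sf->p_memsz <= ph[i].p_vaddr + ph[i].p_memsz)
			return (int)i;
	}
	return -2;
}
```

In other words, PT_GNU_SFRAME acts as a locator (much like PT_GNU_EH_FRAME does for .eh_frame_hdr), while the bytes themselves arrive in memory because a PT_LOAD covers them.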
>>
>>> In the future, if we wants to compress the sframe section, it will not
>>> even be a loadable ELF section. But the system call can tell the
>>> kernel: "there's a sframe compressed section at this offset/size in
>>> this file" for this text address range and then the kernel will do the
>>> rest.
>>
>> I think supporting compressed SFrame will probably require to do some
>> sort of relocation of the offsets in the uncompressed data, depending on
>> where the uncompressed data will get eventually loaded.
>
> Assuming that all the text is at a given offset, would that be enough
> to fill in the blanks?
>
Yes and No. The offset at which the text is loaded is _one_ part of the
information to "fill in the blanks". The other part is what to do with
that information (text_vma) or how to relocate the SFrame section itself
a.k.a. the relocation entries. To know the relocations, one will need
to get access to the respective relocation section, and hence access to
the ELF section headers.
> As the text would have already been linked into memory before the
> system call is made. If this is not the case, then we definitely need
> the linker to load the sframe into memory before it does the system
> call, and just give the kernel that address.
>
* Re: [RFC] New codectl(2) system call for sframe registration
2025-07-22 21:04 ` Indu Bhagat
@ 2025-07-22 21:13 ` Steven Rostedt
2025-07-22 21:57 ` Indu Bhagat
2025-07-23 15:09 ` Mathieu Desnoyers
2025-07-23 15:07 ` Mathieu Desnoyers
1 sibling, 2 replies; 22+ messages in thread
From: Steven Rostedt @ 2025-07-22 21:13 UTC (permalink / raw)
To: Indu Bhagat
Cc: Jose E. Marchesi, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel, bpf, x86, Masami Hiramatsu, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Beau Belgrave, Jens Remus,
Linus Torvalds, Andrew Morton, Jens Axboe, Florian Weimer,
Sam James, Brian Robbins, Elena Zannoni
On Tue, 22 Jul 2025 14:04:37 -0700
Indu Bhagat <indu.bhagat@oracle.com> wrote:
> Yes and No. The offset at which the text is loaded is _one_ part of the
> information to "fill in the blanks". The other part is what to do with
> that information (text_vma) or how to relocate the SFrame section itself
> a.k.a. the relocation entries. To know the relocations, one will need
> to get access to the respective relocation section, and hence access to
> the ELF section headers.
You mean to find where in the sframe section itself things need to be updated?
OK, that makes sense. So sframes do still need to be in an ELF file for
their own relocations and such.
It will be interesting to see how to do compression and on-demand page loading.
There would need to be a table as well that denotes where in the
decompressed pages relocations need to be performed.
-- Steve
* Re: [RFC] New codectl(2) system call for sframe registration
2025-07-22 21:13 ` Steven Rostedt
@ 2025-07-22 21:57 ` Indu Bhagat
2025-07-23 15:09 ` Mathieu Desnoyers
1 sibling, 0 replies; 22+ messages in thread
From: Indu Bhagat @ 2025-07-22 21:57 UTC (permalink / raw)
To: Steven Rostedt
Cc: Jose E. Marchesi, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel, bpf, x86, Masami Hiramatsu, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Beau Belgrave, Jens Remus,
Linus Torvalds, Andrew Morton, Jens Axboe, Florian Weimer,
Sam James, Brian Robbins, Elena Zannoni
On 7/22/25 2:13 PM, Steven Rostedt wrote:
> On Tue, 22 Jul 2025 14:04:37 -0700
> Indu Bhagat <indu.bhagat@oracle.com> wrote:
>
>> Yes and No. The offset at which the text is loaded is _one_ part of the
>> information to "fill in the blanks". The other part is what to do with
>> that information (text_vma) or how to relocate the SFrame section itself
>> a.k.a. the relocation entries. To know the relocations, one will need
>> to get access to the respective relocation section, and hence access to
>> the ELF section headers.
>
> You mean to find where in the sframe section itself that needs to be update?
>
Correct. Each relocation entry carries pieces of information like: what
is the location to update, how many bytes to update, and what
calculation to use, i.e. the relocation type.
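As a sketch of what applying one such entry involves, the following handles a single x86-64 PC-relative 32-bit relocation (R_X86_64_PC32, computed as S + A - P). The function name is invented, whether a given SFrame section uses this relocation type is an assumption here, and a real linker would also check for 32-bit overflow.

```c
#include <assert.h>
#include <elf.h>
#include <stdint.h>
#include <string.h>

/*
 * Apply one R_X86_64_PC32 relocation to a section image.
 *   sec, sec_len: bytes of the section being patched
 *   sec_addr:     address where sec[0] is (or will be) loaded
 *   sym_addr:     final address of the referenced symbol (S)
 * Returns 0 on success, -1 for another relocation type or a bad offset.
 */
static int apply_pc32(uint8_t *sec, size_t sec_len, uint64_t sec_addr,
		      const Elf64_Rela *rela, uint64_t sym_addr)
{
	if (ELF64_R_TYPE(rela->r_info) != R_X86_64_PC32)
		return -1;
	if (rela->r_offset > sec_len || sec_len - rela->r_offset < 4)
		return -1;

	/* P is the location to update; the type says: 4 bytes, S + A - P. */
	uint64_t place = sec_addr + rela->r_offset;
	uint32_t value = (uint32_t)(sym_addr + (uint64_t)rela->r_addend - place);

	memcpy(sec + rela->r_offset, &value, sizeof(value));
	return 0;
}
```

The three pieces map directly onto the entry: r_offset gives the location, the relocation type fixes both the width and the formula, and r_addend feeds the formula.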
> OK, that makes sense. So sframes does need to still be in an ELF file for
> its own relocations and such.
>
> It will be interesting on how to do compression and on-demand page loading.
>
Right, it's an open item.
Compression (SHF_COMPRESSED) for non-SHF_ALLOC sections is doable. In
fact, debug sections use it already.
The tricky part is SHF_ALLOC combined with SHF_COMPRESSED, which is what
SFrame may need. This is currently not allowed in ELF. There is some previous
discussion here: https://groups.google.com/g/generic-abi/c/HUVhliUrTG0.
Not sure if things have evolved since.
> There would need to be a table as well that will denote where in the
> decompressed pages that relocations need to be performed.
>
> -- Steve
* Re: [RFC] New codectl(2) system call for sframe registration
2025-07-21 15:20 [RFC] New codectl(2) system call for sframe registration Mathieu Desnoyers
2025-07-21 18:53 ` Steven Rostedt
2025-07-22 18:21 ` Indu Bhagat
@ 2025-07-23 0:26 ` Masami Hiramatsu
2025-07-23 15:15 ` Mathieu Desnoyers
2 siblings, 1 reply; 22+ messages in thread
From: Masami Hiramatsu @ 2025-07-23 0:26 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: rostedt, linux-kernel, linux-trace-kernel, bpf, x86,
Masami Hiramatsu, Josh Poimboeuf, Peter Zijlstra, Ingo Molnar,
Jiri Olsa, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
Indu Bhagat, Jose E. Marchesi, Beau Belgrave, Jens Remus,
Linus Torvalds, Andrew Morton, Jens Axboe, Florian Weimer,
Sam James, Brian Robbins, Elena Zannoni
Hi Mathieu,
On Mon, 21 Jul 2025 11:20:34 -0400
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> Hi!
>
> I've written up an RFC for a new system call to handle sframe registration
> for shared libraries. There has been interest to cover both sframe in
> the short term, but also JIT use-cases in the long term, so I'm
> covering both here in this RFC to provide the full context. Implementation
> wise we could start by only covering the sframe use-case.
>
> I've called it "codectl(2)" for now, but I'm of course open to feedback.
Nice idea for JIT, but I doubt we need this for ELF.
>
> For ELF, I'm including the optional pathname, build id, and debug link
> information which are really useful to translate from instruction pointers
> to executable/library name, symbol, offset, source file, line number.
For ELF files, does the kernel already know how to parse the ELF header?
I just wonder what happens if the user sends different information to the
kernel.
> This is what we are using in LTTng-UST and Babeltrace debug-info filter
> plugin [1], and I think this would be relevant for kernel tracers as well
> so they can make the resulting stack traces meaningful to users.
>
> sys_codectl(2)
> =================
>
> * arg0: unsigned int @option:
>
> /* Additional labels can be added to enum code_opt, for extensibility. */
>
> enum code_opt {
> CODE_REGISTER_ELF,
> CODE_REGISTER_JIT,
> CODE_UNREGISTER,
> };
>
> * arg1: void * @info
>
> /* if (@option == CODE_REGISTER_ELF) */
>
> /*
> * text_start, text_end, sframe_start, sframe_end allow unwinding of the
> * call stack.
> *
> * elf_start, elf_end, pathname, and either build_id or debug_link allows
> * mapping instruction pointers to file, symbol, offset, and source file
> * location.
> */
> struct code_elf_info {
> : __u64 elf_start;
> __u64 elf_end;
> __u64 text_start;
> __u64 text_end;
What happens if there are multiple .text.* sections?
Or is it used for each text section?
> __u64 sframe_start;
> __u64 sframe_end;
> __u64 pathname; /* char *, NULL if unavailable. */
>
> __u64 build_id; /* char *, NULL if unavailable. */
> __u64 debug_link_pathname; /* char *, NULL if unavailable. */
> __u32 build_id_len;
> __u32 debug_link_crc;
> };
>
>
> /* if (@option == CODE_REGISTER_JIT) */
>
> /*
> * Registration of sorted JIT unwind table: The reserved memory area is
> * of size reserved_len. Userspace increases used_len as new code is
> * populated between text_start and text_end. This area is populated in
> * increasing address order, and its ABI requires to have no overlapping
> * fre. This fits the common use-case where JITs populate code into
> * a given memory area by increasing address order. The sorted unwind
> * tables can be chained with a singly-linked list as they become full.
> * Consecutive chained tables are also in sorted text address order.
> *
> * Note: if there is an eventual use-case for unsorted jit unwind table,
> * this would be introduced as a new "code option".
> */
>
> struct code_jit_info {
> __u64 text_start; /* text_start >= addr */
> __u64 text_end; /* addr < text_end */
> __u64 unwind_head; /* struct code_jit_unwind_table * */
> };
>
> struct code_jit_unwind_fre {
> /*
> * Contains info similar to sframe, allowing unwind for a given
Hmm, why not just use sframe?
(Is there any library to generate sframe online for JIT?)
Thank you,
> * code address range.
> */
> __u32 size;
> __u32 ip_off; /* offset from text_start */
> __s32 cfa_off;
> __s32 ra_off;
> __s32 fp_off;
> __u8 info;
> };
>
> struct code_jit_unwind_table {
> __u64 reserved_len;
> __u64 used_len; /*
> * Incremented by userspace (store-release), read by
> * the kernel (load-acquire).
> */
> __u64 next; /* Chain with next struct code_jit_unwind_table. */
> struct code_jit_unwind_fre fre[];
> };
>
> /* if (@option == CODE_UNREGISTER) */
>
> void *info
>
> * arg2: size_t info_size
>
> /*
> * Size of @info structure, allowing extensibility. See
> * copy_struct_from_user().
> */
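For readers not familiar with that helper, here is a userspace model of the size negotiation copy_struct_from_user() performs, as I understand it. The function name is made up and the real helper also deals with user-memory faults, which this sketch ignores.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Model of the extensible-struct rule: dst/ksize is the struct the
 * "kernel" knows about, src/usize is what the caller passed.
 */
static int copy_struct_model(void *dst, size_t ksize,
			     const void *src, size_t usize)
{
	size_t size = usize < ksize ? usize : ksize;

	if (usize > ksize) {
		/* Newer caller: fields unknown to us must all be zero. */
		const unsigned char *p = (const unsigned char *)src + ksize;

		for (size_t i = 0; i < usize - ksize; i++)
			if (p[i])
				return -E2BIG;
	} else if (usize < ksize) {
		/* Older caller: fields it does not know about read as zero. */
		memset((char *)dst + usize, 0, ksize - usize);
	}
	memcpy(dst, src, size);
	return 0;
}
```

So an old binary passing a shorter info structure keeps working on a newer kernel, and a new binary on an old kernel only fails if it actually sets a field the kernel does not know.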
>
> * arg3: unsigned int flags (0)
>
> /* Flags for extensibility. */
>
> Your feedback is welcome,
>
> Thanks,
>
> Mathieu
>
> [1] https://babeltrace.org/docs/v2.0/man7/babeltrace2-filter.lttng-utils.debug-info.7/
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
>
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
* Re: [RFC] New codectl(2) system call for sframe registration
2025-07-22 18:49 ` Mathieu Desnoyers
@ 2025-07-23 8:16 ` Indu Bhagat
2025-07-23 14:32 ` Mathieu Desnoyers
0 siblings, 1 reply; 22+ messages in thread
From: Indu Bhagat @ 2025-07-23 8:16 UTC (permalink / raw)
To: Mathieu Desnoyers, rostedt
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Josh Poimboeuf, Peter Zijlstra, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
Jens Axboe, Florian Weimer, Sam James, Brian Robbins,
Elena Zannoni
On 7/22/25 11:49 AM, Mathieu Desnoyers wrote:
> On 2025-07-22 14:21, Indu Bhagat wrote:
>> On 7/21/25 8:20 AM, Mathieu Desnoyers wrote:
>>> Hi!
>>>
>>> I've written up an RFC for a new system call to handle sframe
>>> registration
>>> for shared libraries. There has been interest to cover both sframe in
>>> the short term, but also JIT use-cases in the long term, so I'm
>>> covering both here in this RFC to provide the full context.
>>> Implementation
>>> wise we could start by only covering the sframe use-case.
>>>
>>> I've called it "codectl(2)" for now, but I'm of course open to feedback.
>>>
>>> For ELF, I'm including the optional pathname, build id, and debug link
>>> information which are really useful to translate from instruction
>>> pointers
>>> to executable/library name, symbol, offset, source file, line number.
>>> This is what we are using in LTTng-UST and Babeltrace debug-info filter
>>> plugin [1], and I think this would be relevant for kernel tracers as
>>> well
>>> so they can make the resulting stack traces meaningful to users.
>>>
>>> sys_codectl(2)
>>> =================
>>>
>>> * arg0: unsigned int @option:
>>>
>>> /* Additional labels can be added to enum code_opt, for
>>> extensibility. */
>>>
>>> enum code_opt {
>>> CODE_REGISTER_ELF,
>>> CODE_REGISTER_JIT,
>>> CODE_UNREGISTER,
>>> };
>>>
>>> * arg1: void * @info
>>>
>>> /* if (@option == CODE_REGISTER_ELF) */
>>>
>>> /*
>>> * text_start, text_end, sframe_start, sframe_end allow unwinding of
>>> the
>>> * call stack.
>>> *
>>> * elf_start, elf_end, pathname, and either build_id or debug_link
>>> allows
>>> * mapping instruction pointers to file, symbol, offset, and source
>>> file
>>> * location.
>>> */
>>> struct code_elf_info {
>>> : __u64 elf_start;
>>> __u64 elf_end;
>>
>> What are the elf_start , elf_end intended for ?
>
> The intent is to know at which address the first loadable segment of
> the shared object is mapped (elf_start), and the size of the shared
> object mapping, which is the sum of the size of its PT_LOAD segments.
>
> This allows tooling to easily lookup which addresses belong to that
> shared object, for any loaded segment, whether it's code or data.
>
>>
>>> __u64 text_start;
>>> __u64 text_end;
>>> __u64 sframe_start;
>>> __u64 sframe_end;
>>> __u64 pathname; /* char *, NULL if unavailable. */
>>>
>>> __u64 build_id; /* char *, NULL if unavailable. */
>>> __u64 debug_link_pathname; /* char *, NULL if unavailable. */
>>> __u32 build_id_len;
>>> __u32 debug_link_crc;
>>> };
>>>
>>>
>>> /* if (@option == CODE_REGISTER_JIT) */
>>>
>>> /*
>>> * Registration of sorted JIT unwind table: The reserved memory area is
>>> * of size reserved_len. Userspace increases used_len as new code is
>>> * populated between text_start and text_end. This area is populated in
>>> * increasing address order, and its ABI requires to have no
>>> overlapping
>>> * fre. This fits the common use-case where JITs populate code into
>>> * a given memory area by increasing address order. The sorted unwind
>>> * tables can be chained with a singly-linked list as they become full.
>>> * Consecutive chained tables are also in sorted text address order.
>>> *
>>> * Note: if there is an eventual use-case for unsorted jit unwind
>>> table,
>>> * this would be introduced as a new "code option".
>>> */
>>>
>>> struct code_jit_info {
>>> __u64 text_start; /* text_start >= addr */
>>> __u64 text_end; /* addr < text_end */
>>> __u64 unwind_head; /* struct code_jit_unwind_table * */
>>> };
>>>
>>
>> I see the discussion has evolved here with the general sentiment that
>> the JIT part needs to be kept in mind for a rough sketch but cannot be
>> designed at this time. But two comments (if we keep JIT part in the
>> discussion):
>> - I think we need to keep __u64 unwind_head not a pointer to a
>> defined structure (struct code_jit_unwind_table * above), but some
>> opaque type like we have for SFrame case.
>
> What is the reason for making this an opaque type for sframe ?
>
So that the system call only does the work of registering memory of a
specific size as stack trace data for the address range (text_start,
text_end). IIUC, in the current proposal, the format of the stack trace
information is exposed in the arg. So when the format evolves, will this
mean additional management via some flags?
>
>> - The reserved_len should ideally be a part of code_jit_info, so
>> the length can be known without parsing the contents.
>
> I've placed reserved_len within the unwind table because I planned to
> have the jit information for a given range of text be a linked list of
> tables. Therefore, if one table fills up, then another table can be
> chained at the tail. Having the reserved_len part of each table makes
> things easier to combine into a linked list.
>
> Thanks for your feedback !
>
> Mathieu
>
>>
>>> struct code_jit_unwind_fre {
>>> /*
>>> * Contains info similar to sframe, allowing unwind for a given
>>> * code address range.
>>> */
>>> __u32 size;
>>> __u32 ip_off; /* offset from text_start */
>>> __s32 cfa_off;
>>> __s32 ra_off;
>>> __s32 fp_off;
>>> __u8 info;
>>> };
>>>
>>> struct code_jit_unwind_table {
>>> __u64 reserved_len;
>>> __u64 used_len; /*
>>> * Incremented by userspace (store-release),
>>> read by
>>> * the kernel (load-acquire).
>>> */
>>> __u64 next; /* Chain with next struct code_jit_unwind_table. */
>>> struct code_jit_unwind_fre fre[];
>>> };
>>>
>>> /* if (@option == CODE_UNREGISTER) */
>>>
>>> void *info
>>>
>>> * arg2: size_t info_size
>>>
>>> /*
>>> * Size of @info structure, allowing extensibility. See
>>> * copy_struct_from_user().
>>> */
>>>
>>> * arg3: unsigned int flags (0)
>>>
>>> /* Flags for extensibility. */
>>>
>>> Your feedback is welcome,
>>>
>>> Thanks,
>>>
>>> Mathieu
>>>
>>> [1] https://babeltrace.org/docs/v2.0/man7/babeltrace2-filter.lttng-
>>> utils.debug-info.7/
>>>
>>
>
>
* Re: [RFC] New codectl(2) system call for sframe registration
2025-07-23 8:16 ` Indu Bhagat
@ 2025-07-23 14:32 ` Mathieu Desnoyers
0 siblings, 0 replies; 22+ messages in thread
From: Mathieu Desnoyers @ 2025-07-23 14:32 UTC (permalink / raw)
To: Indu Bhagat, rostedt
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Josh Poimboeuf, Peter Zijlstra, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
Jens Axboe, Florian Weimer, Sam James, Brian Robbins,
Elena Zannoni
On 2025-07-23 04:16, Indu Bhagat wrote:
> On 7/22/25 11:49 AM, Mathieu Desnoyers wrote:
>> On 2025-07-22 14:21, Indu Bhagat wrote:
>>> On 7/21/25 8:20 AM, Mathieu Desnoyers wrote:
>>>> Hi!
>>>>
>>>> I've written up an RFC for a new system call to handle sframe
>>>> registration
>>>> for shared libraries. There has been interest to cover both sframe in
>>>> the short term, but also JIT use-cases in the long term, so I'm
>>>> covering both here in this RFC to provide the full context.
>>>> Implementation
>>>> wise we could start by only covering the sframe use-case.
>>>>
>>>> I've called it "codectl(2)" for now, but I'm of course open to
>>>> feedback.
>>>>
>>>> For ELF, I'm including the optional pathname, build id, and debug link
>>>> information which are really useful to translate from instruction
>>>> pointers
>>>> to executable/library name, symbol, offset, source file, line number.
>>>> This is what we are using in LTTng-UST and Babeltrace debug-info filter
>>>> plugin [1], and I think this would be relevant for kernel tracers as
>>>> well
>>>> so they can make the resulting stack traces meaningful to users.
>>>>
>>>> sys_codectl(2)
>>>> =================
>>>>
>>>> * arg0: unsigned int @option:
>>>>
>>>> /* Additional labels can be added to enum code_opt, for
>>>> extensibility. */
>>>>
>>>> enum code_opt {
>>>> CODE_REGISTER_ELF,
>>>> CODE_REGISTER_JIT,
>>>> CODE_UNREGISTER,
>>>> };
>>>>
>>>> * arg1: void * @info
>>>>
>>>> /* if (@option == CODE_REGISTER_ELF) */
>>>>
>>>> /*
>>>> * text_start, text_end, sframe_start, sframe_end allow unwinding
>>>> of the
>>>> * call stack.
>>>> *
>>>> * elf_start, elf_end, pathname, and either build_id or debug_link
>>>> allows
>>>> * mapping instruction pointers to file, symbol, offset, and source
>>>> file
>>>> * location.
>>>> */
>>>> struct code_elf_info {
>>>> : __u64 elf_start;
>>>> __u64 elf_end;
>>>
>>> What are the elf_start , elf_end intended for ?
>>
>> The intent is to know at which address the first loadable segment of
>> the shared object is mapped (elf_start), and the size of the shared
>> object mapping, which is the sum of the size of its PT_LOAD segments.
>>
>> This allows tooling to easily lookup which addresses belong to that
>> shared object, for any loaded segment, whether it's code or data.
>>
>>>
>>>> __u64 text_start;
>>>> __u64 text_end;
>>>> __u64 sframe_start;
>>>> __u64 sframe_end;
>>>> __u64 pathname; /* char *, NULL if unavailable. */
>>>>
>>>> __u64 build_id; /* char *, NULL if unavailable. */
>>>> __u64 debug_link_pathname; /* char *, NULL if unavailable. */
>>>> __u32 build_id_len;
>>>> __u32 debug_link_crc;
>>>> };
>>>>
>>>>
>>>> /* if (@option == CODE_REGISTER_JIT) */
>>>>
>>>> /*
>>>> * Registration of sorted JIT unwind table: The reserved memory
>>>> area is
>>>> * of size reserved_len. Userspace increases used_len as new code is
>>>> * populated between text_start and text_end. This area is
>>>> populated in
>>>> * increasing address order, and its ABI requires to have no
>>>> overlapping
>>>> * fre. This fits the common use-case where JITs populate code into
>>>> * a given memory area by increasing address order. The sorted unwind
>>>> * tables can be chained with a singly-linked list as they become
>>>> full.
>>>> * Consecutive chained tables are also in sorted text address order.
>>>> *
>>>> * Note: if there is an eventual use-case for unsorted jit unwind
>>>> table,
>>>> * this would be introduced as a new "code option".
>>>> */
>>>>
>>>> struct code_jit_info {
>>>> __u64 text_start; /* text_start >= addr */
>>>> __u64 text_end; /* addr < text_end */
>>>> __u64 unwind_head; /* struct code_jit_unwind_table * */
>>>> };
>>>>
>>>
>>> I see the discussion has evolved here with the general sentiment that
>>> the JIT part needs to be kept in mind for a rough sketch but cannot
>>> be designed at this time. But two comments (if we keep JIT part in
>>> the discussion):
>>> - I think we need to keep __u64 unwind_head not a pointer to a
>>> defined structure (struct code_jit_unwind_table * above), but some
>>> opaque type like we have for SFrame case.
>>
>> What is the reason for making this an opaque type for sframe ?
>>
>
> So that the system call only does the work of registering the memory of
> specific size as stack trace data for addr range (text_start, text_end).
> IIUC, in the current proposal, the format of the stack trace
> information is exposed in the arg. So when the format evolves, this will
> mean additional management via some flags?
There are various ways to handle extensions here. The simplest
would be to add a new label to enum code_opt and register the extended
JIT ABI as a new option. But this would likely involve a lot of
duplication if the goal is just to add one more field to struct
code_jit_unwind_fre.

I suspect that the unwind table and the linked list of unwind tables are
something we won't want to change for this ABI. What I see as
a relevant extension point is struct code_jit_unwind_fre, but given
that it will be placed into an array, making it extensible requires
some care: we'd need to keep track of its stride. We could do it like
this:

    struct code_jit_unwind_table {
        __u64 reserved_len;
        __u64 used_len;   /*
                           * Incremented by userspace (store-release), read by
                           * the kernel (load-acquire).
                           */
        __u64 next;       /* Chain with next struct code_jit_unwind_table. */
        __u32 fre_stride; /* Stride of fre array (includes padding). */
        __u32 fre_size;   /* Offset at end of last used field. */
        char fre[];
    };

So extending struct code_jit_unwind_fre could be done by adding fields
at the end, thus potentially increasing its size or turning padding into
used fields. fre_size would keep track of the "used" fields.

I'm open to extending by size (with fre_size) or using flags, whatever fits
best.
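
For illustration only, a reader of such a table could walk the fre array
roughly as follows. This is a sketch under assumptions: the helper names
(fre_count, fre_at), the "v1" record layout, and the idea that used_len
counts bytes in the fre area are hypothetical and not part of the proposal.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Table header as sketched above, with fixed-width userspace types. */
struct code_jit_unwind_table {
	uint64_t reserved_len;
	uint64_t used_len;	/* Assumed here: bytes used in fre[]. */
	uint64_t next;
	uint32_t fre_stride;	/* Distance between consecutive records. */
	uint32_t fre_size;	/* Offset at end of last used field. */
	char fre[];
};

/* Hypothetical "version 1" record layout known to this reader. */
struct fre_v1 {
	uint32_t size;
	uint32_t ip_off;
	int32_t cfa_off;
	int32_t ra_off;
	int32_t fp_off;
	uint8_t info;
};

/* Number of complete records currently published in the table. */
static size_t fre_count(const struct code_jit_unwind_table *t)
{
	if (!t->fre_stride || t->used_len > t->reserved_len)
		return 0;
	return (size_t)(t->used_len / t->fre_stride);
}

/*
 * Copy record @idx into @out. Fields the producer did not provide
 * (fre_size smaller than our struct) read as zero; fields this reader
 * does not know about (fre_size larger) are ignored.
 */
static int fre_at(const struct code_jit_unwind_table *t, size_t idx,
		  struct fre_v1 *out)
{
	size_t n;

	if (idx >= fre_count(t) || t->fre_size > t->fre_stride)
		return -1;
	n = t->fre_size < sizeof(*out) ? t->fre_size : sizeof(*out);
	memset(out, 0, sizeof(*out));
	memcpy(out, t->fre + idx * t->fre_stride, n);
	return 0;
}
```

The point of the design is that the reader indexes by fre_stride and
copies at most fre_size bytes, so an older reader keeps working when the
producer appends fields to each record.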
Thanks,
Mathieu
>
>>
>>> - The reserved_len should ideally be a part of code_jit_info, so
>>> the length can be known without parsing the contents.
>>
>> I've placed reserved_len within the unwind table because I planned to
>> have the jit information for a given range of text be a linked list of
>> tables. Therefore, if one table fills up, then another table can be
>> chained at the tail. Having the reserved_len part of each table makes
>> things easier to combine into a linked list.
>>
>> Thanks for your feedback !
>>
>> Mathieu
>>
>>>
>>>> struct code_jit_unwind_fre {
>>>> /*
>>>> * Contains info similar to sframe, allowing unwind for a given
>>>> * code address range.
>>>> */
>>>> __u32 size;
>>>> __u32 ip_off; /* offset from text_start */
>>>> __s32 cfa_off;
>>>> __s32 ra_off;
>>>> __s32 fp_off;
>>>> __u8 info;
>>>> };
>>>>
>>>> struct code_jit_unwind_table {
>>>> __u64 reserved_len;
>>>> __u64 used_len; /*
>>>> * Incremented by userspace (store-release),
>>>> read by
>>>> * the kernel (load-acquire).
>>>> */
>>>> __u64 next; /* Chain with next struct
>>>> code_jit_unwind_table. */
>>>> struct code_jit_unwind_fre fre[];
>>>> };
>>>>
>>>> /* if (@option == CODE_UNREGISTER) */
>>>>
>>>> void *info
>>>>
>>>> * arg2: size_t info_size
>>>>
>>>> /*
>>>> * Size of @info structure, allowing extensibility. See
>>>> * copy_struct_from_user().
>>>> */
>>>>
>>>> * arg3: unsigned int flags (0)
>>>>
>>>> /* Flags for extensibility. */
>>>>
>>>> Your feedback is welcome,
>>>>
>>>> Thanks,
>>>>
>>>> Mathieu
>>>>
>>>> [1] https://babeltrace.org/docs/v2.0/man7/babeltrace2-filter.lttng-
>>>> utils.debug-info.7/
>>>>
>>>
>>
>>
>
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [RFC] New codectl(2) system call for sframe registration
2025-07-22 21:04 ` Indu Bhagat
2025-07-22 21:13 ` Steven Rostedt
@ 2025-07-23 15:07 ` Mathieu Desnoyers
1 sibling, 0 replies; 22+ messages in thread
From: Mathieu Desnoyers @ 2025-07-23 15:07 UTC (permalink / raw)
To: Indu Bhagat, Steven Rostedt, Jose E. Marchesi
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Josh Poimboeuf, Peter Zijlstra, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Beau Belgrave,
Jens Remus, Linus Torvalds, Andrew Morton, Jens Axboe,
Florian Weimer, Sam James, Brian Robbins, Elena Zannoni
On 2025-07-22 17:04, Indu Bhagat wrote:
> On 7/22/25 12:17 PM, Steven Rostedt wrote:
>> On Tue, 22 Jul 2025 20:56:47 +0200
>> "Jose E. Marchesi" <jemarch@gnu.org> wrote:
>>
>>> I think glibc could "register" loaded SFrame data by just pointing the
>>> kernel to the VM address where it got loaded, "you got some SFrame
>>> there". Starting from that address it is then possible to find the
>>> referred code locations just by applying the offsets, without needing
>>> any additional information nor ELF foobar...
>>>
>>> Or thats how I understand it. Indu will undoubtly correct me if I am
>>> wrong 8-)
>>
>> Maybe I'm wrong, but if you know where the text is loaded (the final
>> location it is in memory), it is possible to figure out the relocations
>> in the sframe section.
>>
>
> (FWIW, What Jose wrote is correct.)
>
> Some details which may help clear up some confusion here. The SFrame
> sections are of type SHT_GNU_SFRAME and currently have SEC_ALLOC|
> SEC_LOAD flags set. This means that they are allocated memory and
> loaded at application start up time. These sections appear in a PT_LOAD
> segment in the linked binaries.
>
> Then there is a PT_GNU_SFRAME, which is a new program header type for
> SFrame. PT_GNU_SFRAME by itself does not trigger the loading of SFrame
> sections. But the .sframe sections being present in the PT_LOAD segment
> does.
>
>>>
>>>> In the future, if we wants to compress the sframe section, it will not
>>>> even be a loadable ELF section. But the system call can tell the
>>>> kernel: "there's a sframe compressed section at this offset/size in
>>>> this file" for this text address range and then the kernel will do the
>>>> rest.
>>>
>>> I think supporting compressed SFrame will probably require to do some
>>> sort of relocation of the offsets in the uncompressed data, depending on
>>> where the uncompressed data will get eventually loaded.
>>
>> Assuming that all the text is at a given offset, would that be enough
>> to fill in the blanks?
>>
>
> Yes and No. The offset at which the text is loaded is _one_ part of the
> information to "fill in the blanks". The other part is what to do with
> that information (text_vma) or how to relocate the SFrame section itself
> a.k.a. the relocation entries. To know the relocations, one will need
> to get access to the respective relocation section, and hence access to
> the ELF section headers.
So AFAIU we have three main scenarios:

1) The dynamic loader allocates the sframe section, and possibly applies
   relocations before passing pointers to the start/end of that section
   to the kernel.

2) The dynamic loader only maps memory for the sframe section, without
   actually populating its content. It would register the sframe section
   with the kernel by providing a pathname and offset allowing the kernel
   to find the sframe information and populate it via the page fault
   handler. In that scenario, the kernel would be responsible for
   performing the relocations. Ideally the sframe layout would always
   contain offsets that are relative to the text_vma base, so the kernel
   could easily do the relocs. Is that the case?

3) Variation on scenario 2: the sframe data is compressed in the file.
   The page fault handler is responsible for decompressing it and
   applying relocs if need be.

Am I missing something ?
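
As a rough illustration of the userspace side of scenario 1, the loader
could gather the ranges it would hand to the kernel from the program
headers of each loaded object. This is only a sketch: it uses
dl_iterate_phdr(3) on already-loaded objects, struct sframe_range and
collect_sframe_ranges() are hypothetical names, the PT_GNU_SFRAME
fallback value is an assumption taken from binutils, only the first
executable PT_LOAD segment is recorded, and the registration call itself
is left out since codectl(2) does not exist.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* For dl_iterate_phdr() in <link.h>. */
#endif
#include <assert.h>
#include <link.h>
#include <stddef.h>
#include <stdint.h>

#ifndef PT_GNU_SFRAME
#define PT_GNU_SFRAME 0x6474e554	/* Assumed: PT_LOOS + 0x474e554. */
#endif

/* Hypothetical record of what a loader would pass to the kernel. */
struct sframe_range {
	uint64_t text_start, text_end;		/* First executable PT_LOAD. */
	uint64_t sframe_start, sframe_end;	/* Zero if no SFrame segment. */
};

struct collect_ctx {
	struct sframe_range *out;
	size_t cap, nr;
};

/* Called once per loaded object; records its text and SFrame ranges. */
static int collect_cb(struct dl_phdr_info *info, size_t size, void *data)
{
	struct collect_ctx *ctx = data;
	struct sframe_range r = { 0 };

	(void)size;
	for (int i = 0; i < info->dlpi_phnum; i++) {
		const ElfW(Phdr) *ph = &info->dlpi_phdr[i];
		uint64_t start = info->dlpi_addr + ph->p_vaddr;

		if (ph->p_type == PT_LOAD && (ph->p_flags & PF_X) &&
		    !r.text_end) {
			r.text_start = start;
			r.text_end = start + ph->p_memsz;
		} else if (ph->p_type == PT_GNU_SFRAME) {
			r.sframe_start = start;
			r.sframe_end = start + ph->p_memsz;
		}
	}
	if (r.text_end && ctx->nr < ctx->cap)
		ctx->out[ctx->nr++] = r;
	return 0;	/* Keep iterating over loaded objects. */
}

/* Fill @out with one entry per loaded object that has executable text. */
static size_t collect_sframe_ranges(struct sframe_range *out, size_t cap)
{
	struct collect_ctx ctx = { .out = out, .cap = cap, .nr = 0 };

	dl_iterate_phdr(collect_cb, &ctx);
	return ctx.nr;
}
```

Objects built without SFrame simply report an empty sframe range, which
a loader would skip rather than register.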
Thanks,
Mathieu
>
>> As the text would have already been linked into memory before the
>> system call is made. If this is not the case, then we definitely need
>> the linker to load the sframe into memory before it does the system
>> call, and just give the kernel that address.
>>
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [RFC] New codectl(2) system call for sframe registration
2025-07-22 21:13 ` Steven Rostedt
2025-07-22 21:57 ` Indu Bhagat
@ 2025-07-23 15:09 ` Mathieu Desnoyers
2025-07-23 16:29 ` Indu Bhagat
1 sibling, 1 reply; 22+ messages in thread
From: Mathieu Desnoyers @ 2025-07-23 15:09 UTC (permalink / raw)
To: Steven Rostedt, Indu Bhagat
Cc: Jose E. Marchesi, linux-kernel, linux-trace-kernel, bpf, x86,
Masami Hiramatsu, Josh Poimboeuf, Peter Zijlstra, Ingo Molnar,
Jiri Olsa, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
Jens Axboe, Florian Weimer, Sam James, Brian Robbins,
Elena Zannoni
On 2025-07-22 17:13, Steven Rostedt wrote:
> On Tue, 22 Jul 2025 14:04:37 -0700
> Indu Bhagat <indu.bhagat@oracle.com> wrote:
>
>> Yes and No. The offset at which the text is loaded is _one_ part of the
>> information to "fill in the blanks". The other part is what to do with
>> that information (text_vma) or how to relocate the SFrame section itself
>> a.k.a. the relocation entries. To know the relocations, one will need
>> to get access to the respective relocation section, and hence access to
>> the ELF section headers.
>
> You mean to find where in the sframe section itself that needs to be update?
>
> OK, that makes sense. So sframes does need to still be in an ELF file for
> its own relocations and such.
>
> It will be interesting on how to do compression and on-demand page loading.
>
> There would need to be a table as well that will denote where in the
> decompressed pages that relocations need to be performed.
If we can find a way to express all sframe "pointers" as offsets from a
text_vma base, then there is no need for relocations. This would
minimize complexity.
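
To make the idea concrete, here is a small sketch of what a lookup
against base-relative entries could look like. The struct rel_entry
layout and the entry_covers() helper are hypothetical and are not the
SFrame format; they only show that the same table bytes remain valid
wherever the text ends up being mapped.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical entry: function start stored relative to the text base. */
struct rel_entry {
	uint32_t start_off;	/* Offset of function start from text_vma. */
	uint32_t len;		/* Function length in bytes. */
};

/* Return 1 if @ip falls inside @e for an object whose text is at @text_vma. */
static int entry_covers(const struct rel_entry *e, uint64_t text_vma,
			uint64_t ip)
{
	uint64_t start = text_vma + e->start_off;

	return ip >= start && ip - start < e->len;
}
```

The only per-mapping input is text_vma, which the registration would
already provide, so nothing inside the table needs patching.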
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [RFC] New codectl(2) system call for sframe registration
2025-07-23 0:26 ` Masami Hiramatsu
@ 2025-07-23 15:15 ` Mathieu Desnoyers
0 siblings, 0 replies; 22+ messages in thread
From: Mathieu Desnoyers @ 2025-07-23 15:15 UTC (permalink / raw)
To: Masami Hiramatsu (Google)
Cc: rostedt, linux-kernel, linux-trace-kernel, bpf, x86,
Josh Poimboeuf, Peter Zijlstra, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton, Jens Axboe, Florian Weimer, Sam James,
Brian Robbins, Elena Zannoni
On 2025-07-22 20:26, Masami Hiramatsu (Google) wrote:
> Hi Mathieu,
>
> On Mon, 21 Jul 2025 11:20:34 -0400
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>
>> Hi!
>>
>> I've written up an RFC for a new system call to handle sframe registration
>> for shared libraries. There has been interest to cover both sframe in
>> the short term, but also JIT use-cases in the long term, so I'm
>> covering both here in this RFC to provide the full context. Implementation
>> wise we could start by only covering the sframe use-case.
>>
>> I've called it "codectl(2)" for now, but I'm of course open to feedback.
>
> Nice idea for JIT, but I doubt we need this for ELF.
>
>>
>> For ELF, I'm including the optional pathname, build id, and debug link
>> information which are really useful to translate from instruction pointers
>> to executable/library name, symbol, offset, source file, line number.
>
> For ELF file, does the kernel already know how to parse the elf header?
> I just wonder what happen if user sends different information to the
> kernel.
AFAIU, the kernel has an ELF parser that it uses on execve when it
executes a program, but the dynamic linking use-case all happens in
userspace. The kernel only maps memory and currently does not know that
it contains an ELF file.

The objective here is to allow registration of shared libraries' sframe
sections from the dynamic linker.
>
>> This is what we are using in LTTng-UST and Babeltrace debug-info filter
>> plugin [1], and I think this would be relevant for kernel tracers as well
>> so they can make the resulting stack traces meaningful to users.
>>
>> sys_codectl(2)
>> =================
>>
>> * arg0: unsigned int @option:
>>
>> /* Additional labels can be added to enum code_opt, for extensibility. */
>>
>> enum code_opt {
>> CODE_REGISTER_ELF,
>> CODE_REGISTER_JIT,
>> CODE_UNREGISTER,
>> };
>>
>> * arg1: void * @info
>>
>> /* if (@option == CODE_REGISTER_ELF) */
>>
>> /*
>> * text_start, text_end, sframe_start, sframe_end allow unwinding of the
>> * call stack.
>> *
>> * elf_start, elf_end, pathname, and either build_id or debug_link allows
>> * mapping instruction pointers to file, symbol, offset, and source file
>> * location.
>> */
>> struct code_elf_info {
>> : __u64 elf_start;
>> __u64 elf_end;
>> __u64 text_start;
>> __u64 text_end;
>
> What happen if there are multiple .text.* sections?
> Or, does it used for each text section?
That's a good point. I guess we could theoretically have a shared
object that has more than one text range, in which case we'd want to
register one sframe section for each of the text ranges. (Let me know
if I'm misunderstanding something here.)

This is an additional argument for having an sframe-specific
registration rather than an "elf" registration for the sframe
use-case.
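
A sketch of what per-text-range registration could look like from the
consumer side follows. Everything in it is hypothetical: struct
code_sframe_info and sframe_lookup() are made-up names used to show one
registration per text range, and the half-open range convention is an
assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sframe-specific registration: one per text range. */
struct code_sframe_info {
	uint64_t text_start;	/* Inclusive. */
	uint64_t text_end;	/* Exclusive. */
	uint64_t sframe_start;
	uint64_t sframe_end;
};

/* Find the registration whose text range contains @ip, or NULL. */
static const struct code_sframe_info *
sframe_lookup(const struct code_sframe_info *regs, size_t nr, uint64_t ip)
{
	for (size_t i = 0; i < nr; i++)
		if (ip >= regs[i].text_start && ip < regs[i].text_end)
			return &regs[i];
	return NULL;
}
```

A shared object with two text ranges would then simply show up as two
entries, each pointing at its own sframe data.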
>
>> __u64 sframe_start;
>> __u64 sframe_end;
>> __u64 pathname; /* char *, NULL if unavailable. */
>>
>> __u64 build_id; /* char *, NULL if unavailable. */
>> __u64 debug_link_pathname; /* char *, NULL if unavailable. */
>> __u32 build_id_len;
>> __u32 debug_link_crc;
>> };
>>
>>
>> /* if (@option == CODE_REGISTER_JIT) */
>>
>> /*
>> * Registration of sorted JIT unwind table: The reserved memory area is
>> * of size reserved_len. Userspace increases used_len as new code is
>> * populated between text_start and text_end. This area is populated in
>> * increasing address order, and its ABI requires to have no overlapping
>> * fre. This fits the common use-case where JITs populate code into
>> * a given memory area by increasing address order. The sorted unwind
>> * tables can be chained with a singly-linked list as they become full.
>> * Consecutive chained tables are also in sorted text address order.
>> *
>> * Note: if there is an eventual use-case for unsorted jit unwind table,
>> * this would be introduced as a new "code option".
>> */
>>
>> struct code_jit_info {
>> __u64 text_start; /* text_start >= addr */
>> __u64 text_end; /* addr < text_end */
>> __u64 unwind_head; /* struct code_jit_unwind_table * */
>> };
>>
>> struct code_jit_unwind_fre {
>> /*
>> * Contains info similar to sframe, allowing unwind for a given
>
> Hmm, why not just the sframe?
> (Is there any library to generate sframe online for JIT?)
The layout and size of the sframe section are fixed after it's been
registered. The jit unwind tables are meant to dynamically
grow as the JIT populates additional code. The goal here is to make sure
JITs don't have to issue a system call every time they add a few
functions, otherwise the overhead becomes a significant bottleneck.
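
As a sketch of how a table can grow without a system call, the JIT
writes a new entry and then publishes it with a store-release on
used_len, and the reader pairs that with a load-acquire. The code below
uses C11 atomics and makes several assumptions that are not in the RFC:
a fixed-capacity struct jit_table stands in for the flexible-array
table, used_len and reserved_len count entries rather than bytes, and
there is a single producer.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

struct code_jit_unwind_fre {
	uint32_t size;
	uint32_t ip_off;	/* Offset from text_start. */
	int32_t cfa_off;
	int32_t ra_off;
	int32_t fp_off;
	uint8_t info;
};

#define JIT_TABLE_CAP 4		/* Arbitrary capacity for this sketch. */

/* Fixed-capacity stand-in for the proposed table (lengths in entries). */
struct jit_table {
	uint64_t reserved_len;
	_Atomic uint64_t used_len;
	uint64_t next;
	struct code_jit_unwind_fre fre[JIT_TABLE_CAP];
};

/* Producer (JIT): write the entry first, then publish it. */
static int jit_table_append(struct jit_table *t,
			    const struct code_jit_unwind_fre *fre)
{
	uint64_t used = atomic_load_explicit(&t->used_len,
					     memory_order_relaxed);

	if (used >= t->reserved_len)
		return -1;	/* Full: caller would chain a new table. */
	t->fre[used] = *fre;
	atomic_store_explicit(&t->used_len, used + 1, memory_order_release);
	return 0;
}

/* Consumer: after the acquire load, fre[0..n-1] are fully written. */
static uint64_t jit_table_snapshot(struct jit_table *t)
{
	return atomic_load_explicit(&t->used_len, memory_order_acquire);
}
```

When jit_table_append() reports a full table, the JIT would allocate
another table and link it through next, which is the chaining described
in the RFC.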
Thanks,
Mathieu
>
> Thank you,
>
>> * code address range.
>> */
>> __u32 size;
>> __u32 ip_off; /* offset from text_start */
>> __s32 cfa_off;
>> __s32 ra_off;
>> __s32 fp_off;
>> __u8 info;
>> };
>>
>> struct code_jit_unwind_table {
>> __u64 reserved_len;
>> __u64 used_len; /*
>> * Incremented by userspace (store-release), read by
>> * the kernel (load-acquire).
>> */
>> __u64 next; /* Chain with next struct code_jit_unwind_table. */
>> struct code_jit_unwind_fre fre[];
>> };
>>
>> /* if (@option == CODE_UNREGISTER) */
>>
>> void *info
>>
>> * arg2: size_t info_size
>>
>> /*
>> * Size of @info structure, allowing extensibility. See
>> * copy_struct_from_user().
>> */
>>
>> * arg3: unsigned int flags (0)
>>
>> /* Flags for extensibility. */
>>
>> Your feedback is welcome,
>>
>> Thanks,
>>
>> Mathieu
>>
>> [1] https://babeltrace.org/docs/v2.0/man7/babeltrace2-filter.lttng-utils.debug-info.7/
>>
>> --
>> Mathieu Desnoyers
>> EfficiOS Inc.
>> https://www.efficios.com
>>
>
>
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [RFC] New codectl(2) system call for sframe registration
2025-07-23 15:09 ` Mathieu Desnoyers
@ 2025-07-23 16:29 ` Indu Bhagat
0 siblings, 0 replies; 22+ messages in thread
From: Indu Bhagat @ 2025-07-23 16:29 UTC (permalink / raw)
To: Mathieu Desnoyers, Steven Rostedt
Cc: Jose E. Marchesi, linux-kernel, linux-trace-kernel, bpf, x86,
Masami Hiramatsu, Josh Poimboeuf, Peter Zijlstra, Ingo Molnar,
Jiri Olsa, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
Jens Axboe, Florian Weimer, Sam James, Brian Robbins,
Elena Zannoni
On 7/23/25 8:09 AM, Mathieu Desnoyers wrote:
> On 2025-07-22 17:13, Steven Rostedt wrote:
>> On Tue, 22 Jul 2025 14:04:37 -0700
>> Indu Bhagat <indu.bhagat@oracle.com> wrote:
>>
>>> Yes and No. The offset at which the text is loaded is _one_ part of the
>>> information to "fill in the blanks". The other part is what to do with
>>> that information (text_vma) or how to relocate the SFrame section itself
>>> a.k.a. the relocation entries. To know the relocations, one will need
>>> to get access to the respective relocation section, and hence access to
>>> the ELF section headers.
>>
>> You mean to find where in the sframe section itself that needs to be
>> update?
>>
>> OK, that makes sense. So sframes does need to still be in an ELF file for
>> its own relocations and such.
>>
>> It will be interesting on how to do compression and on-demand page
>> loading.
>>
>> There would need to be a table as well that will denote where in the
>> decompressed pages that relocations need to be performed.
>
> If we can find a way to express all sframe "pointers" as offsets from a
> text_vma base, then there is no need for relocations. This would
> minimize complexity.
>
(So we got into the topic of relocation in the context of compressing
SFrame sections, the details of which are not chalked out yet. As I
mentioned, the SHF_ALLOC|SHF_COMPRESSED use case needs to be discussed
further. But stating the following in case there has been a
misunderstanding.)

The SFrame FDE func start addr field is indeed an offset from the field
itself to the start PC of the function in the text section. So in
relocatable files (object files, or ld -r i.e. ET_REL), we see the
relocations.

These relocations can be resolved at link time by the linker. So for
shared libraries and executables (ET_DYN, ET_EXEC), once the SFrame
section is placed in the PT_LOAD segment, there are no relocations for
readers/consumers. These relocations remain for relocatable objects.

A consumer of a relocatable object, e.g., in the case of kernel modules,
will need to take care of the relocations when adding the module.
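
To illustrate why a field-relative offset needs no load-time fixup, here
is a toy model. The struct image layout and the helper names are
hypothetical and much simpler than a real SFrame FDE; the "link" step
stands in for what the static linker resolves.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified image: some "text" followed by one self-relative field. */
struct image {
	unsigned char text[64];
	int32_t func_start_off;	/* Offset from this field to the function. */
};

/* Link-time step: store the distance from the field to text[func_idx]. */
static void image_link(struct image *img, size_t func_idx)
{
	img->func_start_off = (int32_t)((intptr_t)&img->text[func_idx] -
					(intptr_t)&img->func_start_off);
}

/* Consumer: recover the absolute address from the field's own address. */
static uintptr_t image_func_start(const struct image *img)
{
	return (uintptr_t)((intptr_t)&img->func_start_off +
			   img->func_start_off);
}
```

Copying the whole image to another address leaves func_start_off
unchanged and image_func_start() still lands on the same function,
because the distance between the field and the text is preserved.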
Thread overview: 22+ messages
2025-07-21 15:20 [RFC] New codectl(2) system call for sframe registration Mathieu Desnoyers
2025-07-21 18:53 ` Steven Rostedt
2025-07-21 20:58 ` Mathieu Desnoyers
2025-07-21 21:15 ` Steven Rostedt
2025-07-22 13:51 ` Mathieu Desnoyers
2025-07-22 16:25 ` Steven Rostedt
2025-07-22 18:26 ` Mathieu Desnoyers
2025-07-22 19:11 ` Steven Rostedt
2025-07-22 18:56 ` Jose E. Marchesi
2025-07-22 19:17 ` Steven Rostedt
2025-07-22 21:04 ` Indu Bhagat
2025-07-22 21:13 ` Steven Rostedt
2025-07-22 21:57 ` Indu Bhagat
2025-07-23 15:09 ` Mathieu Desnoyers
2025-07-23 16:29 ` Indu Bhagat
2025-07-23 15:07 ` Mathieu Desnoyers
2025-07-22 18:21 ` Indu Bhagat
2025-07-22 18:49 ` Mathieu Desnoyers
2025-07-23 8:16 ` Indu Bhagat
2025-07-23 14:32 ` Mathieu Desnoyers
2025-07-23 0:26 ` Masami Hiramatsu
2025-07-23 15:15 ` Mathieu Desnoyers