* [PATCH dwarves] dwarves: set cu->obstack chunk size to 128Kb
@ 2024-12-21 3:04 Ihor Solodrai
2024-12-21 3:08 ` Ihor Solodrai
2025-01-08 13:55 ` Alan Maguire
0 siblings, 2 replies; 6+ messages in thread
From: Ihor Solodrai @ 2024-12-21 3:04 UTC (permalink / raw)
To: dwarves; +Cc: acme, alan.maguire, eddyz87, andrii, mykolal, bpf
In dwarf_loader with growing nr_jobs the wall-clock time of BTF
encoding starts worsening after a certain point [1].
While some overhead of additional threads is expected, it's not
supposed to be noticeable unless nr_jobs is set to an unreasonably big
value.
It turns out when there are "too many" threads decoding DWARF, they
start competing for memory allocation: significant number of cycles is
spent in osq_lock - in the depth of malloc called within
cu__zalloc. Which suggests that many threads are trying to allocate
memory at the same time.
See an example on a perf flamegraph for run with -j240 [2]. This is
12-core machine, so the effect is small. On machines with more cores
this problem is worse.
Increasing the chunk size of obstacks associated with CUs helps to
reduce the performance penalty caused by this race condition.
[1] https://lore.kernel.org/dwarves/C82bYTvJaV4bfT15o25EsBiUvFsj5eTlm17933Hvva76CXjIcu3gvpaOCWPgeZ8g3cZ-RMa8Vp0y1o_QMR2LhPB-LEUYfZCGuCfR_HvkIP8=@pm.me/
[2] https://gist.github.com/theihor/926af22417a78605fec8d85e1338920e
Signed-off-by: Ihor Solodrai <ihor.solodrai@pm.me>
---
dwarves.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/dwarves.c b/dwarves.c
index 7c3e878..105f81a 100644
--- a/dwarves.c
+++ b/dwarves.c
@@ -722,6 +722,8 @@ int cu__fprintf_ptr_table_stats_csv(struct cu *cu, FILE *fp)
return printed;
}
+#define OBSTACK_CHUNK_SIZE (128*1024)
+
struct cu *cu__new(const char *name, uint8_t addr_size,
const unsigned char *build_id, int build_id_len,
const char *filename, bool use_obstack)
@@ -733,7 +735,7 @@ struct cu *cu__new(const char *name, uint8_t addr_size,
cu->use_obstack = use_obstack;
if (cu->use_obstack)
- obstack_init(&cu->obstack);
+ obstack_begin(&cu->obstack, OBSTACK_CHUNK_SIZE);
if (name == NULL || filename == NULL)
goto out_free;
--
2.47.1
^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [PATCH dwarves] dwarves: set cu->obstack chunk size to 128Kb
2024-12-21 3:04 [PATCH dwarves] dwarves: set cu->obstack chunk size to 128Kb Ihor Solodrai
@ 2024-12-21 3:08 ` Ihor Solodrai
2025-01-08 13:55 ` Alan Maguire
1 sibling, 0 replies; 6+ messages in thread
From: Ihor Solodrai @ 2024-12-21 3:08 UTC (permalink / raw)
To: dwarves; +Cc: acme, alan.maguire, eddyz87, andrii, mykolal, bpf
This patch was supposed to be a part of [1], but I missed it when
preparing a patchset. However it is independent of the other changes
there, so I'm sending it separately.
[1] https://lore.kernel.org/dwarves/20241221012245.243845-9-ihor.solodrai@pm.me/
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH dwarves] dwarves: set cu->obstack chunk size to 128Kb
2024-12-21 3:04 [PATCH dwarves] dwarves: set cu->obstack chunk size to 128Kb Ihor Solodrai
2024-12-21 3:08 ` Ihor Solodrai
@ 2025-01-08 13:55 ` Alan Maguire
2025-01-08 16:38 ` Ihor Solodrai
1 sibling, 1 reply; 6+ messages in thread
From: Alan Maguire @ 2025-01-08 13:55 UTC (permalink / raw)
To: Ihor Solodrai, dwarves; +Cc: acme, eddyz87, andrii, mykolal, bpf
On 21/12/2024 03:04, Ihor Solodrai wrote:
> In dwarf_loader with growing nr_jobs the wall-clock time of BTF
> encoding starts worsening after a certain point [1].
>
> While some overhead of additional threads is expected, it's not
> supposed to be noticeable unless nr_jobs is set to an unreasonably big
> value.
>
> It turns out when there are "too many" threads decoding DWARF, they
> start competing for memory allocation: significant number of cycles is
> spent in osq_lock - in the depth of malloc called within
> cu__zalloc. Which suggests that many threads are trying to allocate
> memory at the same time.
>
> See an example on a perf flamegraph for run with -j240 [2]. This is
> 12-core machine, so the effect is small. On machines with more cores
> this problem is worse.
>
> Increasing the chunk size of obstacks associated with CUs helps to
> reduce the performance penalty caused by this race condition.
>
Is this because starting with a larger obstack size means we don't have
to keep reallocating as the obstack grows?
Thanks!
Alan
> [1] https://lore.kernel.org/dwarves/C82bYTvJaV4bfT15o25EsBiUvFsj5eTlm17933Hvva76CXjIcu3gvpaOCWPgeZ8g3cZ-RMa8Vp0y1o_QMR2LhPB-LEUYfZCGuCfR_HvkIP8=@pm.me/
> [2] https://gist.github.com/theihor/926af22417a78605fec8d85e1338920e
>
> Signed-off-by: Ihor Solodrai <ihor.solodrai@pm.me>
> ---
> dwarves.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/dwarves.c b/dwarves.c
> index 7c3e878..105f81a 100644
> --- a/dwarves.c
> +++ b/dwarves.c
> @@ -722,6 +722,8 @@ int cu__fprintf_ptr_table_stats_csv(struct cu *cu, FILE *fp)
> return printed;
> }
>
> +#define OBSTACK_CHUNK_SIZE (128*1024)
> +
> struct cu *cu__new(const char *name, uint8_t addr_size,
> const unsigned char *build_id, int build_id_len,
> const char *filename, bool use_obstack)
> @@ -733,7 +735,7 @@ struct cu *cu__new(const char *name, uint8_t addr_size,
>
> cu->use_obstack = use_obstack;
> if (cu->use_obstack)
> - obstack_init(&cu->obstack);
> + obstack_begin(&cu->obstack, OBSTACK_CHUNK_SIZE);
>
> if (name == NULL || filename == NULL)
> goto out_free;
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH dwarves] dwarves: set cu->obstack chunk size to 128Kb
2025-01-08 13:55 ` Alan Maguire
@ 2025-01-08 16:38 ` Ihor Solodrai
2025-01-08 17:22 ` Alan Maguire
0 siblings, 1 reply; 6+ messages in thread
From: Ihor Solodrai @ 2025-01-08 16:38 UTC (permalink / raw)
To: Alan Maguire; +Cc: dwarves, acme, eddyz87, andrii, mykolal, bpf
On Wednesday, January 8th, 2025 at 5:55 AM, Alan Maguire <alan.maguire@oracle.com> wrote:
>
>
> On 21/12/2024 03:04, Ihor Solodrai wrote:
>
> > In dwarf_loader with growing nr_jobs the wall-clock time of BTF
> > encoding starts worsening after a certain point [1].
> >
> > While some overhead of additional threads is expected, it's not
> > supposed to be noticeable unless nr_jobs is set to an unreasonably big
> > value.
> >
> > It turns out when there are "too many" threads decoding DWARF, they
> > start competing for memory allocation: significant number of cycles is
> > spent in osq_lock - in the depth of malloc called within
> > cu__zalloc. Which suggests that many threads are trying to allocate
> > memory at the same time.
> >
> > See an example on a perf flamegraph for run with -j240 [2]. This is
> > 12-core machine, so the effect is small. On machines with more cores
> > this problem is worse.
> >
> > Increasing the chunk size of obstacks associated with CUs helps to
> > reduce the performance penalty caused by this race condition.
>
>
> Is this because starting with a larger obstack size means we don't have
> to keep reallocating as the obstack grows?
Yes. Bigger obstack size leads to lower number of malloc calls. The
mallocs tend to happen at the same time between threads in the case of
DWARF decoding.
Curiously, setting a higher obstack chunk size (like 1Mb), does not
improve the overall wall-clock time, and can even make it worse.
This happens because the kernel takes a different code path to allocate
bigger chunks of memory. And also most CUs are not big (at least in case
of vmlinux), so a bigger chunk size probably increases wasted memory.
128Kb seems to be close to a sweet spot for the vmlinux.
The default is 4Kb.
>
> Thanks!
>
> Alan
>
> > [1] https://lore.kernel.org/dwarves/C82bYTvJaV4bfT15o25EsBiUvFsj5eTlm17933Hvva76CXjIcu3gvpaOCWPgeZ8g3cZ-RMa8Vp0y1o_QMR2LhPB-LEUYfZCGuCfR_HvkIP8=@pm.me/
> > [2] https://gist.github.com/theihor/926af22417a78605fec8d85e1338920e
> >
> > Signed-off-by: Ihor Solodrai ihor.solodrai@pm.me
> > ---
> > dwarves.c | 4 +++-
> > 1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > diff --git a/dwarves.c b/dwarves.c
> > index 7c3e878..105f81a 100644
> > --- a/dwarves.c
> > +++ b/dwarves.c
> > @@ -722,6 +722,8 @@ int cu__fprintf_ptr_table_stats_csv(struct cu *cu, FILE *fp)
> > return printed;
> > }
> >
> > +#define OBSTACK_CHUNK_SIZE (128*1024)
> > +
> > struct cu *cu__new(const char *name, uint8_t addr_size,
> > const unsigned char *build_id, int build_id_len,
> > const char *filename, bool use_obstack)
> > @@ -733,7 +735,7 @@ struct cu *cu__new(const char *name, uint8_t addr_size,
> >
> > cu->use_obstack = use_obstack;
> > if (cu->use_obstack)
> > - obstack_init(&cu->obstack);
> > + obstack_begin(&cu->obstack, OBSTACK_CHUNK_SIZE);
> >
> > if (name == NULL || filename == NULL)
> > goto out_free;
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH dwarves] dwarves: set cu->obstack chunk size to 128Kb
2025-01-08 16:38 ` Ihor Solodrai
@ 2025-01-08 17:22 ` Alan Maguire
2025-01-09 16:03 ` Arnaldo Carvalho de Melo
0 siblings, 1 reply; 6+ messages in thread
From: Alan Maguire @ 2025-01-08 17:22 UTC (permalink / raw)
To: Ihor Solodrai; +Cc: dwarves, acme, eddyz87, andrii, mykolal, bpf
On 08/01/2025 16:38, Ihor Solodrai wrote:
> On Wednesday, January 8th, 2025 at 5:55 AM, Alan Maguire <alan.maguire@oracle.com> wrote:
>
>>
>>
>> On 21/12/2024 03:04, Ihor Solodrai wrote:
>>
>>> In dwarf_loader with growing nr_jobs the wall-clock time of BTF
>>> encoding starts worsening after a certain point [1].
>>>
>>> While some overhead of additional threads is expected, it's not
>>> supposed to be noticeable unless nr_jobs is set to an unreasonably big
>>> value.
>>>
>>> It turns out when there are "too many" threads decoding DWARF, they
>>> start competing for memory allocation: significant number of cycles is
>>> spent in osq_lock - in the depth of malloc called within
>>> cu__zalloc. Which suggests that many threads are trying to allocate
>>> memory at the same time.
>>>
>>> See an example on a perf flamegraph for run with -j240 [2]. This is
>>> 12-core machine, so the effect is small. On machines with more cores
>>> this problem is worse.
>>>
>>> Increasing the chunk size of obstacks associated with CUs helps to
>>> reduce the performance penalty caused by this race condition.
>>
>>
>> Is this because starting with a larger obstack size means we don't have
>> to keep reallocating as the obstack grows?
>
> Yes. Bigger obstack size leads to lower number of malloc calls. The
> mallocs tend to happen at the same time between threads in the case of
> DWARF decoding.
>
> Curiously, setting a higher obstack chunk size (like 1Mb), does not
> improve the overall wall-clock time, and can even make it worse.
> This happens because the kernel takes a different code path to allocate
> bigger chunks of memory. And also most CUs are not big (at least in case
> of vmlinux), so a bigger chunk size probably increases wasted memory.
>
> 128Kb seems to be close to a sweet spot for the vmlinux.
> The default is 4Kb.
>
Thanks for the additional details!
Reviewed-by: Alan Maguire <alan.maguire@oracle.com>
>>
>> Thanks!
>>
>> Alan
>>
>>> [1] https://lore.kernel.org/dwarves/C82bYTvJaV4bfT15o25EsBiUvFsj5eTlm17933Hvva76CXjIcu3gvpaOCWPgeZ8g3cZ-RMa8Vp0y1o_QMR2LhPB-LEUYfZCGuCfR_HvkIP8=@pm.me/
>>> [2] https://gist.github.com/theihor/926af22417a78605fec8d85e1338920e
>>>
>>> Signed-off-by: Ihor Solodrai ihor.solodrai@pm.me
>>> ---
>>> dwarves.c | 4 +++-
>>> 1 file changed, 3 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/dwarves.c b/dwarves.c
>>> index 7c3e878..105f81a 100644
>>> --- a/dwarves.c
>>> +++ b/dwarves.c
>>> @@ -722,6 +722,8 @@ int cu__fprintf_ptr_table_stats_csv(struct cu *cu, FILE *fp)
>>> return printed;
>>> }
>>>
>>> +#define OBSTACK_CHUNK_SIZE (128*1024)
>>> +
>>> struct cu *cu__new(const char *name, uint8_t addr_size,
>>> const unsigned char *build_id, int build_id_len,
>>> const char *filename, bool use_obstack)
>>> @@ -733,7 +735,7 @@ struct cu *cu__new(const char *name, uint8_t addr_size,
>>>
>>> cu->use_obstack = use_obstack;
>>> if (cu->use_obstack)
>>> - obstack_init(&cu->obstack);
>>> + obstack_begin(&cu->obstack, OBSTACK_CHUNK_SIZE);
>>>
>>> if (name == NULL || filename == NULL)
>>> goto out_free;
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH dwarves] dwarves: set cu->obstack chunk size to 128Kb
2025-01-08 17:22 ` Alan Maguire
@ 2025-01-09 16:03 ` Arnaldo Carvalho de Melo
0 siblings, 0 replies; 6+ messages in thread
From: Arnaldo Carvalho de Melo @ 2025-01-09 16:03 UTC (permalink / raw)
To: Alan Maguire; +Cc: Ihor Solodrai, dwarves, eddyz87, andrii, mykolal, bpf
On Wed, Jan 08, 2025 at 05:22:56PM +0000, Alan Maguire wrote:
> On 08/01/2025 16:38, Ihor Solodrai wrote:
> > On Wednesday, January 8th, 2025 at 5:55 AM, Alan Maguire <alan.maguire@oracle.com> wrote:
> >
> >>
> >>
> >> On 21/12/2024 03:04, Ihor Solodrai wrote:
> >>
> >>> In dwarf_loader with growing nr_jobs the wall-clock time of BTF
> >>> encoding starts worsening after a certain point [1].
> >>>
> >>> While some overhead of additional threads is expected, it's not
> >>> supposed to be noticeable unless nr_jobs is set to an unreasonably big
> >>> value.
> >>>
> >>> It turns out when there are "too many" threads decoding DWARF, they
> >>> start competing for memory allocation: significant number of cycles is
> >>> spent in osq_lock - in the depth of malloc called within
> >>> cu__zalloc. Which suggests that many threads are trying to allocate
> >>> memory at the same time.
> >>>
> >>> See an example on a perf flamegraph for run with -j240 [2]. This is
> >>> 12-core machine, so the effect is small. On machines with more cores
> >>> this problem is worse.
> >>>
> >>> Increasing the chunk size of obstacks associated with CUs helps to
> >>> reduce the performance penalty caused by this race condition.
> >>
> >>
> >> Is this because starting with a larger obstack size means we don't have
> >> to keep reallocating as the obstack grows?
> >
> > Yes. Bigger obstack size leads to lower number of malloc calls. The
> > mallocs tend to happen at the same time between threads in the case of
> > DWARF decoding.
> >
> > Curiously, setting a higher obstack chunk size (like 1Mb), does not
> > improve the overall wall-clock time, and can even make it worse.
> > This happens because the kernel takes a different code path to allocate
> > bigger chunks of memory. And also most CUs are not big (at least in case
> > of vmlinux), so a bigger chunk size probably increases wasted memory.
> >
> > 128Kb seems to be close to a sweet spot for the vmlinux.
> > The default is 4Kb.
> >
>
> Thanks for the additional details!
>
> Reviewed-by: Alan Maguire <alan.maguire@oracle.com>
I'm adding these details and your reviewed-by tag to that cset.
Thanks!
- Arnaldo
^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2025-01-09 16:03 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-12-21 3:04 [PATCH dwarves] dwarves: set cu->obstack chunk size to 128Kb Ihor Solodrai
2024-12-21 3:08 ` Ihor Solodrai
2025-01-08 13:55 ` Alan Maguire
2025-01-08 16:38 ` Ihor Solodrai
2025-01-08 17:22 ` Alan Maguire
2025-01-09 16:03 ` Arnaldo Carvalho de Melo
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox