* [PATCH 1/2] media: rkvdec: reduce excessive stack usage in assemble_hw_pps()
@ 2026-02-02 9:47 Arnd Bergmann
2026-02-02 9:47 ` [PATCH 2/2] media: rkvdec: reduce stack usage in rkvdec_init_v4l2_vp9_count_tbl() Arnd Bergmann
2026-02-02 13:42 ` [PATCH 1/2] media: rkvdec: reduce excessive stack usage in assemble_hw_pps() Nicolas Dufresne
0 siblings, 2 replies; 9+ messages in thread
From: Arnd Bergmann @ 2026-02-02 9:47 UTC (permalink / raw)
To: Detlev Casanova, Ezequiel Garcia, Mauro Carvalho Chehab,
Heiko Stuebner, Nathan Chancellor, Nicolas Dufresne, Hans Verkuil
Cc: Arnd Bergmann, Nick Desaulniers, Bill Wendling, Justin Stitt,
linux-media, linux-rockchip, linux-arm-kernel, linux-kernel, llvm
From: Arnd Bergmann <arnd@arndb.de>
The rkvdec_pps had a large set of bitfields, all of which
as misaligned. This causes clang-21 and likely other versions to
produce absolutely awful object code and a warning about very
large stack usage, on targets without unaligned access:
drivers/media/platform/rockchip/rkvdec/rkvdec-vp9.c:966:12: error: stack frame size (1472) exceeds limit (1280) in 'rkvdec_vp9_start' [-Werror,-Wframe-larger-than]
Part of the problem here is how all the bitfield accesses are
inlined into a function that already has large structures on
the stack.
Mark set_field_order_cnt() as noinline_for_stack, and split out
the following accesses in assemble_hw_pps() into another noinline
function, both of which now using around 800 bytes of stack in the
same configuration.
There is clearly still something wrong with clang here, but
splitting it into multiple functions reduces the risk of stack
overflow.
Fixes: fde24907570d ("media: rkvdec: Add H264 support for the VDPU383 variant")
Link: https://godbolt.org/z/acP1eKeq9
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
.../rockchip/rkvdec/rkvdec-vdpu383-h264.c | 50 ++++++++++---------
1 file changed, 27 insertions(+), 23 deletions(-)
diff --git a/drivers/media/platform/rockchip/rkvdec/rkvdec-vdpu383-h264.c b/drivers/media/platform/rockchip/rkvdec/rkvdec-vdpu383-h264.c
index 6ab3167addc8..ef69f2a36478 100644
--- a/drivers/media/platform/rockchip/rkvdec/rkvdec-vdpu383-h264.c
+++ b/drivers/media/platform/rockchip/rkvdec/rkvdec-vdpu383-h264.c
@@ -130,7 +130,7 @@ struct rkvdec_h264_ctx {
struct vdpu383_regs_h26x regs;
};
-static void set_field_order_cnt(struct rkvdec_pps *pps, const struct v4l2_h264_dpb_entry *dpb)
+static noinline_for_stack void set_field_order_cnt(struct rkvdec_pps *pps, const struct v4l2_h264_dpb_entry *dpb)
{
pps->top_field_order_cnt0 = dpb[0].top_field_order_cnt;
pps->bot_field_order_cnt0 = dpb[0].bottom_field_order_cnt;
@@ -166,6 +166,31 @@ static void set_field_order_cnt(struct rkvdec_pps *pps, const struct v4l2_h264_d
pps->bot_field_order_cnt15 = dpb[15].bottom_field_order_cnt;
}
+static noinline_for_stack void set_dec_params(struct rkvdec_pps *pps, const struct v4l2_ctrl_h264_decode_params *dec_params)
+{
+ const struct v4l2_h264_dpb_entry *dpb = dec_params->dpb;
+
+ for (int i = 0; i < ARRAY_SIZE(dec_params->dpb); i++) {
+ if (dpb[i].flags & V4L2_H264_DPB_ENTRY_FLAG_LONG_TERM)
+ pps->is_longterm |= (1 << i);
+ pps->ref_field_flags |=
+ (!!(dpb[i].flags & V4L2_H264_DPB_ENTRY_FLAG_FIELD)) << i;
+ pps->ref_colmv_use_flag |=
+ (!!(dpb[i].flags & V4L2_H264_DPB_ENTRY_FLAG_ACTIVE)) << i;
+ pps->ref_topfield_used |=
+ (!!(dpb[i].fields & V4L2_H264_TOP_FIELD_REF)) << i;
+ pps->ref_botfield_used |=
+ (!!(dpb[i].fields & V4L2_H264_BOTTOM_FIELD_REF)) << i;
+ }
+ pps->pic_field_flag =
+ !!(dec_params->flags & V4L2_H264_DECODE_PARAM_FLAG_FIELD_PIC);
+ pps->pic_associated_flag =
+ !!(dec_params->flags & V4L2_H264_DECODE_PARAM_FLAG_BOTTOM_FIELD);
+
+ pps->cur_top_field = dec_params->top_field_order_cnt;
+ pps->cur_bot_field = dec_params->bottom_field_order_cnt;
+}
+
static void assemble_hw_pps(struct rkvdec_ctx *ctx,
struct rkvdec_h264_run *run)
{
@@ -177,7 +202,6 @@ static void assemble_hw_pps(struct rkvdec_ctx *ctx,
struct rkvdec_h264_priv_tbl *priv_tbl = h264_ctx->priv_tbl.cpu;
struct rkvdec_sps_pps *hw_ps;
u32 pic_width, pic_height;
- u32 i;
/*
* HW read the SPS/PPS information from PPS packet index by PPS id.
@@ -261,28 +285,8 @@ static void assemble_hw_pps(struct rkvdec_ctx *ctx,
!!(pps->flags & V4L2_H264_PPS_FLAG_SCALING_MATRIX_PRESENT);
set_field_order_cnt(&hw_ps->pps, dpb);
+ set_dec_params(&hw_ps->pps, dec_params);
- for (i = 0; i < ARRAY_SIZE(dec_params->dpb); i++) {
- if (dpb[i].flags & V4L2_H264_DPB_ENTRY_FLAG_LONG_TERM)
- hw_ps->pps.is_longterm |= (1 << i);
-
- hw_ps->pps.ref_field_flags |=
- (!!(dpb[i].flags & V4L2_H264_DPB_ENTRY_FLAG_FIELD)) << i;
- hw_ps->pps.ref_colmv_use_flag |=
- (!!(dpb[i].flags & V4L2_H264_DPB_ENTRY_FLAG_ACTIVE)) << i;
- hw_ps->pps.ref_topfield_used |=
- (!!(dpb[i].fields & V4L2_H264_TOP_FIELD_REF)) << i;
- hw_ps->pps.ref_botfield_used |=
- (!!(dpb[i].fields & V4L2_H264_BOTTOM_FIELD_REF)) << i;
- }
-
- hw_ps->pps.pic_field_flag =
- !!(dec_params->flags & V4L2_H264_DECODE_PARAM_FLAG_FIELD_PIC);
- hw_ps->pps.pic_associated_flag =
- !!(dec_params->flags & V4L2_H264_DECODE_PARAM_FLAG_BOTTOM_FIELD);
-
- hw_ps->pps.cur_top_field = dec_params->top_field_order_cnt;
- hw_ps->pps.cur_bot_field = dec_params->bottom_field_order_cnt;
}
static void rkvdec_write_regs(struct rkvdec_ctx *ctx)
--
2.39.5
^ permalink raw reply related [flat|nested] 9+ messages in thread* [PATCH 2/2] media: rkvdec: reduce stack usage in rkvdec_init_v4l2_vp9_count_tbl() 2026-02-02 9:47 [PATCH 1/2] media: rkvdec: reduce excessive stack usage in assemble_hw_pps() Arnd Bergmann @ 2026-02-02 9:47 ` Arnd Bergmann 2026-02-02 16:32 ` Nicolas Dufresne 2026-02-02 13:42 ` [PATCH 1/2] media: rkvdec: reduce excessive stack usage in assemble_hw_pps() Nicolas Dufresne 1 sibling, 1 reply; 9+ messages in thread From: Arnd Bergmann @ 2026-02-02 9:47 UTC (permalink / raw) To: Detlev Casanova, Ezequiel Garcia, Mauro Carvalho Chehab, Heiko Stuebner, Nathan Chancellor Cc: Arnd Bergmann, Nick Desaulniers, Bill Wendling, Justin Stitt, Nicolas Dufresne, Hans Verkuil, linux-media, linux-rockchip, linux-arm-kernel, linux-kernel, llvm, Jonas Karlman, Alex Bee From: Arnd Bergmann <arnd@arndb.de> The deeply nested loop in rkvdec_init_v4l2_vp9_count_tbl() needs a lot of registers, so when the clang register allocator runs out, it ends up spilling countless temporaries to the stack: drivers/media/platform/rockchip/rkvdec/rkvdec-vp9.c:966:12: error: stack frame size (1472) exceeds limit (1280) in 'rkvdec_vp9_start' [-Werror,-Wframe-larger-than] Marking this function as noinline_for_stack keeps it out of rkvdec_vp9_start(), giving the compiler more room for optimization. The resulting code is good enough that both the total stack usage and the loop get enough better to stay under the warning limit, though it's still slow, and would need a larger rework if this function ends up being called in a fast path. Signed-off-by: Arnd Bergmann <arnd@arndb.de> --- drivers/media/platform/rockchip/rkvdec/rkvdec-vp9.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/media/platform/rockchip/rkvdec/rkvdec-vp9.c b/drivers/media/platform/rockchip/rkvdec/rkvdec-vp9.c index ba51a7c2fe55..1c875d5a2bac 100644 --- a/drivers/media/platform/rockchip/rkvdec/rkvdec-vp9.c +++ b/drivers/media/platform/rockchip/rkvdec/rkvdec-vp9.c @@ -893,7 +893,8 @@ static void rkvdec_vp9_done(struct rkvdec_ctx *ctx, update_ctx_last_info(vp9_ctx); } -static void rkvdec_init_v4l2_vp9_count_tbl(struct rkvdec_ctx *ctx) +static noinline_for_stack void +rkvdec_init_v4l2_vp9_count_tbl(struct rkvdec_ctx *ctx) { struct rkvdec_vp9_ctx *vp9_ctx = ctx->priv; struct rkvdec_vp9_intra_frame_symbol_counts *intra_cnts = vp9_ctx->count_tbl.cpu; -- 2.39.5 ^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH 2/2] media: rkvdec: reduce stack usage in rkvdec_init_v4l2_vp9_count_tbl() 2026-02-02 9:47 ` [PATCH 2/2] media: rkvdec: reduce stack usage in rkvdec_init_v4l2_vp9_count_tbl() Arnd Bergmann @ 2026-02-02 16:32 ` Nicolas Dufresne 0 siblings, 0 replies; 9+ messages in thread From: Nicolas Dufresne @ 2026-02-02 16:32 UTC (permalink / raw) To: Arnd Bergmann, Detlev Casanova, Ezequiel Garcia, Mauro Carvalho Chehab, Heiko Stuebner, Nathan Chancellor Cc: Arnd Bergmann, Nick Desaulniers, Bill Wendling, Justin Stitt, Hans Verkuil, linux-media, linux-rockchip, linux-arm-kernel, linux-kernel, llvm, Jonas Karlman, Alex Bee [-- Attachment #1: Type: text/plain, Size: 1843 bytes --] Le lundi 02 février 2026 à 10:47 +0100, Arnd Bergmann a écrit : > From: Arnd Bergmann <arnd@arndb.de> > > The deeply nested loop in rkvdec_init_v4l2_vp9_count_tbl() needs a lot > of registers, so when the clang register allocator runs out, it ends up > spilling countless temporaries to the stack: > > drivers/media/platform/rockchip/rkvdec/rkvdec-vp9.c:966:12: error: stack frame size (1472) exceeds limit (1280) in 'rkvdec_vp9_start' [-Werror,-Wframe-larger-than] > > Marking this function as noinline_for_stack keeps it out of > rkvdec_vp9_start(), giving the compiler more room for optimization. > > The resulting code is good enough that both the total stack usage > and the loop get enough better to stay under the warning limit, > though it's still slow, and would need a larger rework if this > function ends up being called in a fast path. > > Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Nicolas Dufresne <nicolas.dufresne@collabora.com> > --- > drivers/media/platform/rockchip/rkvdec/rkvdec-vp9.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/drivers/media/platform/rockchip/rkvdec/rkvdec-vp9.c b/drivers/media/platform/rockchip/rkvdec/rkvdec-vp9.c > index ba51a7c2fe55..1c875d5a2bac 100644 > --- a/drivers/media/platform/rockchip/rkvdec/rkvdec-vp9.c > +++ b/drivers/media/platform/rockchip/rkvdec/rkvdec-vp9.c > @@ -893,7 +893,8 @@ static void rkvdec_vp9_done(struct rkvdec_ctx *ctx, > update_ctx_last_info(vp9_ctx); > } > > -static void rkvdec_init_v4l2_vp9_count_tbl(struct rkvdec_ctx *ctx) > +static noinline_for_stack void > +rkvdec_init_v4l2_vp9_count_tbl(struct rkvdec_ctx *ctx) > { > struct rkvdec_vp9_ctx *vp9_ctx = ctx->priv; > struct rkvdec_vp9_intra_frame_symbol_counts *intra_cnts = vp9_ctx->count_tbl.cpu; [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 228 bytes --] ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 1/2] media: rkvdec: reduce excessive stack usage in assemble_hw_pps() 2026-02-02 9:47 [PATCH 1/2] media: rkvdec: reduce excessive stack usage in assemble_hw_pps() Arnd Bergmann 2026-02-02 9:47 ` [PATCH 2/2] media: rkvdec: reduce stack usage in rkvdec_init_v4l2_vp9_count_tbl() Arnd Bergmann @ 2026-02-02 13:42 ` Nicolas Dufresne 2026-02-02 14:09 ` Arnd Bergmann 1 sibling, 1 reply; 9+ messages in thread From: Nicolas Dufresne @ 2026-02-02 13:42 UTC (permalink / raw) To: Arnd Bergmann, Detlev Casanova, Ezequiel Garcia, Mauro Carvalho Chehab, Heiko Stuebner, Nathan Chancellor, Hans Verkuil Cc: Arnd Bergmann, Nick Desaulniers, Bill Wendling, Justin Stitt, linux-media, linux-rockchip, linux-arm-kernel, linux-kernel, llvm [-- Attachment #1: Type: text/plain, Size: 6137 bytes --] Hi Arnd, Le lundi 02 février 2026 à 10:47 +0100, Arnd Bergmann a écrit : > From: Arnd Bergmann <arnd@arndb.de> > > The rkvdec_pps had a large set of bitfields, all of which > as misaligned. This causes clang-21 and likely other versions to > produce absolutely awful object code and a warning about very > large stack usage, on targets without unaligned access: > > drivers/media/platform/rockchip/rkvdec/rkvdec-vp9.c:966:12: error: stack frame size (1472) exceeds limit (1280) in 'rkvdec_vp9_start' [-Werror,-Wframe-larger-than] We had already addressed and validated that on clang-21, which indicates me that we likely are missing an architecture (or a config) in our CI. Can you document which architecture, configuration and flags was affected so we can add it on our side ? Our media pipeline before sending to Linus and the clang builds trace are in the following link, in case it matters. https://gitlab.freedesktop.org/linux-media/media-committers/-/pipelines/1588731 https://gitlab.freedesktop.org/linux-media/media-committers/-/jobs/91604655 > > Part of the problem here is how all the bitfield accesses are > inlined into a function that already has large structures on > the stack. Another observation is that you had to enable ASAN to make it miss-behave on for loop unrolling (with complex bitfield writes). All I've obtained by visiting the Link: is that its armv7-a architecture. > > Mark set_field_order_cnt() as noinline_for_stack, and split out > the following accesses in assemble_hw_pps() into another noinline > function, both of which now using around 800 bytes of stack in the > same configuration. > > There is clearly still something wrong with clang here, but > splitting it into multiple functions reduces the risk of stack > overflow. We've tried really hard to avoid this noninline_for_stack just because compilers are buggy. I'll have a look again in case I find some ideas, but meanwhile, with failing architecture in the commit message: Reviewed-by: Nicolas Dufresne <nicolas.dufresne@collabora.com> > > Fixes: fde24907570d ("media: rkvdec: Add H264 support for the VDPU383 variant") > Link: https://godbolt.org/z/acP1eKeq9 > Signed-off-by: Arnd Bergmann <arnd@arndb.de> > --- > .../rockchip/rkvdec/rkvdec-vdpu383-h264.c | 50 ++++++++++--------- > 1 file changed, 27 insertions(+), 23 deletions(-) > > diff --git a/drivers/media/platform/rockchip/rkvdec/rkvdec-vdpu383-h264.c b/drivers/media/platform/rockchip/rkvdec/rkvdec-vdpu383-h264.c > index 6ab3167addc8..ef69f2a36478 100644 > --- a/drivers/media/platform/rockchip/rkvdec/rkvdec-vdpu383-h264.c > +++ b/drivers/media/platform/rockchip/rkvdec/rkvdec-vdpu383-h264.c > @@ -130,7 +130,7 @@ struct rkvdec_h264_ctx { > struct vdpu383_regs_h26x regs; > }; > > -static void set_field_order_cnt(struct rkvdec_pps *pps, const struct v4l2_h264_dpb_entry *dpb) > +static noinline_for_stack void set_field_order_cnt(struct rkvdec_pps *pps, const struct v4l2_h264_dpb_entry *dpb) > { > pps->top_field_order_cnt0 = dpb[0].top_field_order_cnt; > pps->bot_field_order_cnt0 = dpb[0].bottom_field_order_cnt; > @@ -166,6 +166,31 @@ static void set_field_order_cnt(struct rkvdec_pps *pps, const struct v4l2_h264_d > pps->bot_field_order_cnt15 = dpb[15].bottom_field_order_cnt; > } > > +static noinline_for_stack void set_dec_params(struct rkvdec_pps *pps, const struct v4l2_ctrl_h264_decode_params *dec_params) > +{ > + const struct v4l2_h264_dpb_entry *dpb = dec_params->dpb; > + > + for (int i = 0; i < ARRAY_SIZE(dec_params->dpb); i++) { > + if (dpb[i].flags & V4L2_H264_DPB_ENTRY_FLAG_LONG_TERM) > + pps->is_longterm |= (1 << i); > + pps->ref_field_flags |= > + (!!(dpb[i].flags & V4L2_H264_DPB_ENTRY_FLAG_FIELD)) << i; > + pps->ref_colmv_use_flag |= > + (!!(dpb[i].flags & V4L2_H264_DPB_ENTRY_FLAG_ACTIVE)) << i; > + pps->ref_topfield_used |= > + (!!(dpb[i].fields & V4L2_H264_TOP_FIELD_REF)) << i; > + pps->ref_botfield_used |= > + (!!(dpb[i].fields & V4L2_H264_BOTTOM_FIELD_REF)) << i; > + } > + pps->pic_field_flag = > + !!(dec_params->flags & V4L2_H264_DECODE_PARAM_FLAG_FIELD_PIC); > + pps->pic_associated_flag = > + !!(dec_params->flags & V4L2_H264_DECODE_PARAM_FLAG_BOTTOM_FIELD); > + > + pps->cur_top_field = dec_params->top_field_order_cnt; > + pps->cur_bot_field = dec_params->bottom_field_order_cnt; > +} > + > static void assemble_hw_pps(struct rkvdec_ctx *ctx, > struct rkvdec_h264_run *run) > { > @@ -177,7 +202,6 @@ static void assemble_hw_pps(struct rkvdec_ctx *ctx, > struct rkvdec_h264_priv_tbl *priv_tbl = h264_ctx->priv_tbl.cpu; > struct rkvdec_sps_pps *hw_ps; > u32 pic_width, pic_height; > - u32 i; > > /* > * HW read the SPS/PPS information from PPS packet index by PPS id. > @@ -261,28 +285,8 @@ static void assemble_hw_pps(struct rkvdec_ctx *ctx, > !!(pps->flags & V4L2_H264_PPS_FLAG_SCALING_MATRIX_PRESENT); > > set_field_order_cnt(&hw_ps->pps, dpb); > + set_dec_params(&hw_ps->pps, dec_params); > > - for (i = 0; i < ARRAY_SIZE(dec_params->dpb); i++) { > - if (dpb[i].flags & V4L2_H264_DPB_ENTRY_FLAG_LONG_TERM) > - hw_ps->pps.is_longterm |= (1 << i); > - > - hw_ps->pps.ref_field_flags |= > - (!!(dpb[i].flags & V4L2_H264_DPB_ENTRY_FLAG_FIELD)) << i; > - hw_ps->pps.ref_colmv_use_flag |= > - (!!(dpb[i].flags & V4L2_H264_DPB_ENTRY_FLAG_ACTIVE)) << i; > - hw_ps->pps.ref_topfield_used |= > - (!!(dpb[i].fields & V4L2_H264_TOP_FIELD_REF)) << i; > - hw_ps->pps.ref_botfield_used |= > - (!!(dpb[i].fields & V4L2_H264_BOTTOM_FIELD_REF)) << i; > - } > - > - hw_ps->pps.pic_field_flag = > - !!(dec_params->flags & V4L2_H264_DECODE_PARAM_FLAG_FIELD_PIC); > - hw_ps->pps.pic_associated_flag = > - !!(dec_params->flags & V4L2_H264_DECODE_PARAM_FLAG_BOTTOM_FIELD); > - > - hw_ps->pps.cur_top_field = dec_params->top_field_order_cnt; > - hw_ps->pps.cur_bot_field = dec_params->bottom_field_order_cnt; > } > > static void rkvdec_write_regs(struct rkvdec_ctx *ctx) [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 228 bytes --] ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 1/2] media: rkvdec: reduce excessive stack usage in assemble_hw_pps() 2026-02-02 13:42 ` [PATCH 1/2] media: rkvdec: reduce excessive stack usage in assemble_hw_pps() Nicolas Dufresne @ 2026-02-02 14:09 ` Arnd Bergmann 2026-02-02 15:12 ` Nicolas Dufresne 0 siblings, 1 reply; 9+ messages in thread From: Arnd Bergmann @ 2026-02-02 14:09 UTC (permalink / raw) To: Nicolas Dufresne, Arnd Bergmann, Detlev Casanova, Ezequiel Garcia, Mauro Carvalho Chehab, Heiko Stübner, Nathan Chancellor, Hans Verkuil Cc: Nick Desaulniers, Bill Wendling, Justin Stitt, linux-media, linux-rockchip, linux-arm-kernel, linux-kernel, llvm On Mon, Feb 2, 2026, at 14:42, Nicolas Dufresne wrote: > Le lundi 02 février 2026 à 10:47 +0100, Arnd Bergmann a écrit : >> From: Arnd Bergmann <arnd@arndb.de> >> >> The rkvdec_pps had a large set of bitfields, all of which >> as misaligned. This causes clang-21 and likely other versions to >> produce absolutely awful object code and a warning about very >> large stack usage, on targets without unaligned access: >> >> drivers/media/platform/rockchip/rkvdec/rkvdec-vp9.c:966:12: error: stack frame size (1472) exceeds limit (1280) in 'rkvdec_vp9_start' [-Werror,-Wframe-larger-than] > > We had already addressed and validated that on clang-21, which indicates me that > we likely are missing an architecture (or a config) in our CI. Can you document > which architecture, configuration and flags was affected so we can add it on our > side ? > > Our media pipeline before sending to Linus and the clang builds trace are in the > following link, in case it matters. > > https://gitlab.freedesktop.org/linux-media/media-committers/-/pipelines/1588731 > https://gitlab.freedesktop.org/linux-media/media-committers/-/jobs/91604655 The configuration that hit this for me was an ARMv7-M NOMMU build. I'm doing 'randconfig' builds here, so I inevitably hit some corner cases that all deterministic CI systems miss. I don't think that you should add ARMv7-M here, since that would take up useful build resources from something more important. There are no drviers/media/ actual users on ARMv7-M, and next time it is going to be something else. >> Part of the problem here is how all the bitfield accesses are >> inlined into a function that already has large structures on >> the stack. > > Another observation is that you had to enable ASAN to make it miss-behave on for > loop unrolling (with complex bitfield writes). All I've obtained by visiting > the Link: is that its armv7-a architecture. Right, this randconfig build likely got closer to the warning limit because of the inherent overhead in KASAN, but the problem with the unaligned bitfields was something that I could later reproduce without KASAN, on ARMv5 and MIPS32r2. This is something we should fix in clang. >> Mark set_field_order_cnt() as noinline_for_stack, and split out >> the following accesses in assemble_hw_pps() into another noinline >> function, both of which now using around 800 bytes of stack in the >> same configuration. >> >> There is clearly still something wrong with clang here, but >> splitting it into multiple functions reduces the risk of stack >> overflow. > > We've tried really hard to avoid this noninline_for_stack just because compilers > are buggy. I'll have a look again in case I find some ideas, but meanwhile, with > failing architecture in the commit message: > > Reviewed-by: Nicolas Dufresne <nicolas.dufresne@collabora.com> Thanks! Arnd ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 1/2] media: rkvdec: reduce excessive stack usage in assemble_hw_pps() 2026-02-02 14:09 ` Arnd Bergmann @ 2026-02-02 15:12 ` Nicolas Dufresne 2026-02-02 15:59 ` Arnd Bergmann 0 siblings, 1 reply; 9+ messages in thread From: Nicolas Dufresne @ 2026-02-02 15:12 UTC (permalink / raw) To: Arnd Bergmann, Arnd Bergmann, Detlev Casanova, Ezequiel Garcia, Mauro Carvalho Chehab, Heiko Stübner, Nathan Chancellor, Hans Verkuil Cc: Nick Desaulniers, Bill Wendling, Justin Stitt, linux-media, linux-rockchip, linux-arm-kernel, linux-kernel, llvm [-- Attachment #1: Type: text/plain, Size: 4529 bytes --] Hi Arnd, Le lundi 02 février 2026 à 15:09 +0100, Arnd Bergmann a écrit : > On Mon, Feb 2, 2026, at 14:42, Nicolas Dufresne wrote: > > Le lundi 02 février 2026 à 10:47 +0100, Arnd Bergmann a écrit : > > > From: Arnd Bergmann <arnd@arndb.de> > > > > > > The rkvdec_pps had a large set of bitfields, all of which > > > as misaligned. This causes clang-21 and likely other versions to > > > produce absolutely awful object code and a warning about very > > > large stack usage, on targets without unaligned access: > > > > > > drivers/media/platform/rockchip/rkvdec/rkvdec-vp9.c:966:12: error: stack frame size (1472) exceeds limit (1280) in 'rkvdec_vp9_start' [-Werror,-Wframe-larger-than] > > > > We had already addressed and validated that on clang-21, which indicates me that > > we likely are missing an architecture (or a config) in our CI. Can you document > > which architecture, configuration and flags was affected so we can add it on our > > side ? > > > > Our media pipeline before sending to Linus and the clang builds trace are in the > > following link, in case it matters. > > > > https://gitlab.freedesktop.org/linux-media/media-committers/-/pipelines/1588731 > > https://gitlab.freedesktop.org/linux-media/media-committers/-/jobs/91604655 > > The configuration that hit this for me was an ARMv7-M NOMMU build. I'm > doing 'randconfig' builds here, so I inevitably hit some corner cases > that all deterministic CI systems miss. I don't think that you should > add ARMv7-M here, since that would take up useful build resources > from something more important. There are no drviers/media/ actual > users on ARMv7-M, and next time it is going to be something else. > > > > Part of the problem here is how all the bitfield accesses are > > > inlined into a function that already has large structures on > > > the stack. > > > > Another observation is that you had to enable ASAN to make it miss-behave on for > > loop unrolling (with complex bitfield writes). All I've obtained by visiting > > the Link: is that its armv7-a architecture. > > Right, this randconfig build likely got closer to the warning > limit because of the inherent overhead in KASAN, but the problem > with the unaligned bitfields was something that I could later > reproduce without KASAN, on ARMv5 and MIPS32r2. > > This is something we should fix in clang. All fair comments. I plan to take this into fixes (no changes needed), hopefully for rc-2. Performance wise, this code is to replace read/mask/write into hardware registers which was significantly slower for this amount of registers (~200 32bit integers) and this type of IP (its not sram). This is run once per frame. In practice, if we hand code the read/mask/write, the performance should eventually converge to using bitfield and letting the compiler do this masking, I was being optimistic on how the compiler would behave. If performance of that is truly a problem, we can always just prepare the ram register ahead of the operation queue (instead of doing it in the executor). One thing to remind, you can't optimize the data structure layout, since they need to match the register layout. But while fixing some of the stack report previously, I did endup up moving few things out of loops (which is not clearly feasible in this patch). I did not checked all the code (only the failing one). One of the bad pattern which costed stack (and overhead probably) was the use of switch() statement to pick one of the unaligned register location, with that switch being part of an unrolled loop. If you ever spot these, and have time, please just manually unroll the switch out of the loop (its actually less code). > > > > Mark set_field_order_cnt() as noinline_for_stack, and split out > > > the following accesses in assemble_hw_pps() into another noinline > > > function, both of which now using around 800 bytes of stack in the > > > same configuration. > > > > > > There is clearly still something wrong with clang here, but > > > splitting it into multiple functions reduces the risk of stack > > > overflow. > > > > We've tried really hard to avoid this noninline_for_stack just because compilers > > are buggy. I'll have a look again in case I find some ideas, but meanwhile, with > > failing architecture in the commit message: > > > > Reviewed-by: Nicolas Dufresne <nicolas.dufresne@collabora.com> > > Thanks! > > Arnd thanks to you, Nicolas [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 228 bytes --] ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 1/2] media: rkvdec: reduce excessive stack usage in assemble_hw_pps() 2026-02-02 15:12 ` Nicolas Dufresne @ 2026-02-02 15:59 ` Arnd Bergmann 2026-02-02 16:31 ` Nicolas Dufresne 2026-02-02 22:45 ` David Laight 0 siblings, 2 replies; 9+ messages in thread From: Arnd Bergmann @ 2026-02-02 15:59 UTC (permalink / raw) To: Nicolas Dufresne, Arnd Bergmann, Detlev Casanova, Ezequiel Garcia, Mauro Carvalho Chehab, Heiko Stübner, Nathan Chancellor, Hans Verkuil Cc: Nick Desaulniers, Bill Wendling, Justin Stitt, linux-media, linux-rockchip, linux-arm-kernel, linux-kernel, llvm On Mon, Feb 2, 2026, at 16:12, Nicolas Dufresne wrote: > Le lundi 02 février 2026 à 15:09 +0100, Arnd Bergmann a écrit : >> On Mon, Feb 2, 2026, at 14:42, Nicolas Dufresne wrote: >> Right, this randconfig build likely got closer to the warning >> limit because of the inherent overhead in KASAN, but the problem >> with the unaligned bitfields was something that I could later >> reproduce without KASAN, on ARMv5 and MIPS32r2. >> >> This is something we should fix in clang. > > All fair comments. I plan to take this into fixes (no changes needed), hopefully > for rc-2. > > Performance wise, this code is to replace read/mask/write into hardware > registers which was significantly slower for this amount of registers (~200 > 32bit integers) and this type of IP (its not sram). This is run once per frame. > In practice, if we hand code the read/mask/write, the performance should > eventually converge to using bitfield and letting the compiler do this masking, > I was being optimistic on how the compiler would behave. If performance of that > is truly a problem, we can always just prepare the ram register ahead of the > operation queue (instead of doing it in the executor). I think there are multiple things going on here, some of which are more relevant than others: - The problem I'm addressing with my patch is purely a clang issue for CPU architectures with high register pressure when assembling the structure in memory. As a first-order approximation, you can see the lines in the output being 12.000 with clang, but only 600 with gcc in the godbolt.org output. The gcc version isn't that great either, but it is orders of magnitude fewer instructions. - MMIO reads are clearly a performance killer, so assembling the structure in memory and using memcpy_toio() to access the registers as you appear to be doing is the right idea. - using bitfields for hardware structures is non-portable. In particular, the order of the fields within a word depends on byteorder (CONFIG_CPU_BIG_ENDIAN), and the alignment depends on the architecture, e.g. 'struct { u32 a:16: u32 b: 32; u32 c:16}; has the second member cross a u32 boundary, which leads to padding between a and b, as well as after c on some architectures but not others. I would always recommend splitting up bitfields on word boundaries and adding explicit padding where necessary. - Since most of the fields are exactly 6 bits offset from a word boundary, you can try assembling all the *_field_order_cnt* fields in an array first that has all the bits in the correct order, but then shift the entire array six bits. Arnd ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 1/2] media: rkvdec: reduce excessive stack usage in assemble_hw_pps() 2026-02-02 15:59 ` Arnd Bergmann @ 2026-02-02 16:31 ` Nicolas Dufresne 2026-02-02 22:45 ` David Laight 1 sibling, 0 replies; 9+ messages in thread From: Nicolas Dufresne @ 2026-02-02 16:31 UTC (permalink / raw) To: Arnd Bergmann, Arnd Bergmann, Detlev Casanova, Ezequiel Garcia, Mauro Carvalho Chehab, Heiko Stübner, Nathan Chancellor, Hans Verkuil Cc: Nick Desaulniers, Bill Wendling, Justin Stitt, linux-media, linux-rockchip, linux-arm-kernel, linux-kernel, llvm [-- Attachment #1: Type: text/plain, Size: 3634 bytes --] Hi, Le lundi 02 février 2026 à 16:59 +0100, Arnd Bergmann a écrit : > On Mon, Feb 2, 2026, at 16:12, Nicolas Dufresne wrote: > > Le lundi 02 février 2026 à 15:09 +0100, Arnd Bergmann a écrit : > > > On Mon, Feb 2, 2026, at 14:42, Nicolas Dufresne wrote: > > > > Right, this randconfig build likely got closer to the warning > > > limit because of the inherent overhead in KASAN, but the problem > > > with the unaligned bitfields was something that I could later > > > reproduce without KASAN, on ARMv5 and MIPS32r2. > > > > > > This is something we should fix in clang. > > > > All fair comments. I plan to take this into fixes (no changes needed), hopefully > > for rc-2. > > > > Performance wise, this code is to replace read/mask/write into hardware > > registers which was significantly slower for this amount of registers (~200 > > 32bit integers) and this type of IP (its not sram). This is run once per frame. > > In practice, if we hand code the read/mask/write, the performance should > > eventually converge to using bitfield and letting the compiler do this masking, > > I was being optimistic on how the compiler would behave. If performance of that > > is truly a problem, we can always just prepare the ram register ahead of the > > operation queue (instead of doing it in the executor). > > I think there are multiple things going on here, some of which are > more relevant than others: > > - The problem I'm addressing with my patch is purely a clang issue > for CPU architectures with high register pressure when assembling > the structure in memory. As a first-order approximation, you can > see the lines in the output being 12.000 with clang, but only > 600 with gcc in the godbolt.org output. The gcc version isn't that > great either, but it is orders of magnitude fewer instructions. > > - MMIO reads are clearly a performance killer, so assembling the > structure in memory and using memcpy_toio() to access the > registers as you appear to be doing is the right idea. > > - using bitfields for hardware structures is non-portable. In > particular, the order of the fields within a word depends on > byteorder (CONFIG_CPU_BIG_ENDIAN), and the alignment depends > on the architecture, e.g. 'struct { u32 a:16: u32 b: 32; u32 c:16}; > has the second member cross a u32 boundary, which leads to > padding between a and b, as well as after c on some architectures > but not others. I would always recommend splitting up bitfields > on word boundaries and adding explicit padding where necessary. Ok, got it, clearly the registers bitfield (which is a set of 32bit bitfield) is fine (appart from endian, but this is deliberatly ignored). These are the one I had mind, and are optimized with copy_toio. For the SPS/PPS bistream, which is shared memory with the IP, I tend to agree this might not have been the ideal choice, though the author did verify everything with pahole for the relevant architectures (in practice only two ARM64 SoC use this bitstream format). I'm happy to revisit this eventually. And would not hurt to share a common bitstream writer, that works with both endian in V4L2 (or use one from the core if that already exist). Nicolas > > - Since most of the fields are exactly 6 bits offset from a word > boundary, you can try assembling all the *_field_order_cnt* > fields in an array first that has all the bits in the correct > order, but then shift the entire array six bits. > > Arnd [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 228 bytes --] ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 1/2] media: rkvdec: reduce excessive stack usage in assemble_hw_pps() 2026-02-02 15:59 ` Arnd Bergmann 2026-02-02 16:31 ` Nicolas Dufresne @ 2026-02-02 22:45 ` David Laight 1 sibling, 0 replies; 9+ messages in thread From: David Laight @ 2026-02-02 22:45 UTC (permalink / raw) To: Arnd Bergmann Cc: Nicolas Dufresne, Arnd Bergmann, Detlev Casanova, Ezequiel Garcia, Mauro Carvalho Chehab, Heiko Stübner, Nathan Chancellor, Hans Verkuil, Nick Desaulniers, Bill Wendling, Justin Stitt, linux-media, linux-rockchip, linux-arm-kernel, linux-kernel, llvm On Mon, 02 Feb 2026 16:59:05 +0100 "Arnd Bergmann" <arnd@arndb.de> wrote: ... > - Since most of the fields are exactly 6 bits offset from a word > boundary, you can try assembling all the *_field_order_cnt* > fields in an array first that has all the bits in the correct > order, but then shift the entire array six bits. How are they being written to the hardware? It would be very unusual for hardware not to have things 'reasonably aligned'. This makes me think that the data buffer is actually being 'bit-bashed' down some serial data interface. In which case the simple solution is to give the function that writes the data a 'bit offset' for the first word. And to re-iterate C bit-fields are completely non portable and entirely inappropriate for mapping onto device registers. You can use C structures (arranged with everything on its natural boundary so there are no holes) with members declared with the correct endianness. Indeed that is actually preferable to using numeric constants as the offsets are tied to the correct structure. David > > Arnd > ^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2026-02-02 22:45 UTC | newest] Thread overview: 9+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2026-02-02 9:47 [PATCH 1/2] media: rkvdec: reduce excessive stack usage in assemble_hw_pps() Arnd Bergmann 2026-02-02 9:47 ` [PATCH 2/2] media: rkvdec: reduce stack usage in rkvdec_init_v4l2_vp9_count_tbl() Arnd Bergmann 2026-02-02 16:32 ` Nicolas Dufresne 2026-02-02 13:42 ` [PATCH 1/2] media: rkvdec: reduce excessive stack usage in assemble_hw_pps() Nicolas Dufresne 2026-02-02 14:09 ` Arnd Bergmann 2026-02-02 15:12 ` Nicolas Dufresne 2026-02-02 15:59 ` Arnd Bergmann 2026-02-02 16:31 ` Nicolas Dufresne 2026-02-02 22:45 ` David Laight
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox