* [PATCH RFC 0/2] kernel-doc: better handle data prototypes
From: Mauro Carvalho Chehab @ 2026-03-20 9:46 UTC
To: Linux Doc Mailing List, Jonathan Corbet
Cc: Mauro Carvalho Chehab, Mauro Carvalho Chehab, linux-kernel
Hi Jon,
Don't merge this series. It is just a heads-up about what I'm
working on right now.
This is basically a proof of concept, not yet integrated with
kernel-doc. It helps to show that investing in a tokenizer
was a good idea.
I'm still testing the code.
Right now, the kernel-doc logic to handle data types is very
complex, and the code is split into dump_<type> functions, which
in turn call several ancillary routines. The most complex ones
are related to handling structs, which involves converting inner
structs/unions into members of the main struct.
By using this new code, all elements from most data types can
be parsed by a single code path.
Please note that the code was designed to pick a single
declaration, as this is how kdoc_parser will use it.
If you try to parse multiple ones, the output won't be right,
as it will pick the first declaration name and create a single
item with all data declarations in it.
As it is not based on regexes, it can properly handle some
problematic cases, like having:
{};
and:
;;;;;
in the middle of a struct/union.
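As a rough illustration of why a token walk copes with these cases, here
is a minimal sketch. It is not the kernel's CTokenizer: the names
tokenize() and member_statements() are made up for this example, and it
only handles a flat struct body without nested blocks.

```python
import re

# Toy tokenizer for this demo only: whitespace, /* comments */, words, or
# any other single character. It is NOT the kernel's CTokenizer.
TOKEN_RE = re.compile(r"\s+|/\*.*?\*/|\w+|.", re.S)


def tokenize(source):
    """Return the tokens of a C snippet, dropping blanks and comments."""
    return [t for t in TOKEN_RE.findall(source)
            if not t.isspace() and not t.startswith("/*")]


def member_statements(body):
    """Split a flat struct body into statements, skipping empty ones."""
    statements = []
    current = []
    for tok in tokenize(body):
        if tok != ";":
            current.append(tok)
            continue
        # A statement is kept only if it has at least one identifier, so
        # a bare ";" or "{};" contributes nothing.
        if any(t[0].isalpha() or t[0] == "_" for t in current):
            statements.append(" ".join(current))
        current = []
    return statements


print(member_statements("int a; {}; ;;;;; char *b[12]; /* c */"))
```

Since statements are delimited by tokens rather than matched by a
pattern, the empty ones simply fall out of the loop.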
For enums, if one has values inside the declaration, like:
enum { FOO, BAR } type;
It picks the right data type. Kernel-doc maps this currently as:
enum type
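One way to read the above: the type is the whole "enum { ... }" part and
the identifier after the closing brace is the name. The sketch below
shows that split under this assumption; split_enum_decl() is a
hypothetical helper, not part of the patch.

```python
import re

IDENT_RE = re.compile(r"[A-Za-z_]\w*")


def split_enum_decl(decl):
    """Split 'enum [tag] { ... } name;' into (type_text, name)."""
    # Words, or any single non-blank punctuation character.
    tokens = re.findall(r"\w+|[^\s\w]", decl)

    depth = 0
    close = None
    for i, tok in enumerate(tokens):
        if tok == "{":
            depth += 1
        elif tok == "}":
            depth -= 1
            if depth == 0:
                close = i
                break

    if close is None:
        raise ValueError("no enum body found")

    # The declared name, if any, is the identifier right after "}".
    after = tokens[close + 1:]
    name = after[0] if after and IDENT_RE.fullmatch(after[0]) else None

    return " ".join(tokens[:close + 1]), name


print(split_enum_decl("enum { FOO, BAR } type;"))
```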
My plan is to integrate it into kernel-doc and see how it goes.
It will likely raise some corner cases, but, once we get it right,
this will likely reduce the size and complexity of kdoc_parser.
If you want to test, you can run:
./parse_c.py
to use the example hardcoded in it, or have it read from a file with:
$ ./parse_c.py x.h
CDataItem(decl_type=None, decl_name=None, parameterlist=['u16_data'], parametertypes={'u16_data': 'u16 u16_data[sizeof(u64) / sizeof(u16)]'})
None None
parameterlist:
- u16_data
parametertypes:
- u16_data: u16 u16_data[sizeof(u64) / sizeof(u16)]
(in this example, x.h has just:
u16 u16_data[sizeof(u64) / sizeof(u16)];
)
The logic stores decl_type and decl_name when the data is a
struct/union/enum. If the data is just a declaration, it fills
only one element in parameterlist and in parametertypes.
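For the plain-declaration case, the result can be sketched as below.
This is a standalone illustration, not the CDataParser code: Item and
parse_plain_declaration() are hypothetical names, and the only rule
borrowed from the patch is that the last identifier outside any
bracket is taken as the name.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional

IDENT_RE = re.compile(r"[A-Za-z_]\w*")


@dataclass
class Item:
    """Stand-in for CDataItem, with the same four fields."""
    decl_type: Optional[str] = None
    decl_name: Optional[str] = None
    parameterlist: List[str] = field(default_factory=list)
    parametertypes: Dict[str, str] = field(default_factory=dict)


def parse_plain_declaration(decl):
    """Fill a single parameterlist/parametertypes entry for one declaration."""
    item = Item()
    text = decl.strip().rstrip(";").strip()

    depth = 0
    name = None
    for tok in re.findall(r"[A-Za-z_]\w*|\d+|\S", text):
        if tok in ("(", "[", "{"):
            depth += 1
        elif tok in (")", "]", "}"):
            depth -= 1
        elif depth == 0 and IDENT_RE.fullmatch(tok):
            # Keep overwriting: the last depth-0 identifier wins.
            name = tok

    if name is not None:
        item.parameterlist.append(name)
        item.parametertypes[name] = text

    return item


item = parse_plain_declaration("u16 u16_data[sizeof(u64) / sizeof(u16)];")
print(item.parameterlist, item.parametertypes)
```

decl_type and decl_name stay None here, as in the x.h output above.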
Mauro Carvalho Chehab (2):
docs: kdoc: add a class to parse data items
HACK: add a parse_c.py file to test CDataParser
parse_c.py | 87 +++++++++++
tools/lib/python/kdoc/data_parser.py | 211 +++++++++++++++++++++++++++
2 files changed, 298 insertions(+)
create mode 100755 parse_c.py
create mode 100644 tools/lib/python/kdoc/data_parser.py
--
2.53.0
* [PATCH RFC 1/2] docs: kdoc: add a class to parse data items
From: Mauro Carvalho Chehab @ 2026-03-20 9:46 UTC
To: Jonathan Corbet, Linux Doc Mailing List
Cc: Mauro Carvalho Chehab, linux-kernel, Mauro Carvalho Chehab
Instead of using very complex regular expressions and hamming
inner structs/unions, use CTokenizer to handle data types.
Note that this doesn't handle "typedef".
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
tools/lib/python/kdoc/data_parser.py | 211 +++++++++++++++++++++++++++
1 file changed, 211 insertions(+)
create mode 100644 tools/lib/python/kdoc/data_parser.py
diff --git a/tools/lib/python/kdoc/data_parser.py b/tools/lib/python/kdoc/data_parser.py
new file mode 100644
index 000000000000..f04915b67d6b
--- /dev/null
+++ b/tools/lib/python/kdoc/data_parser.py
@@ -0,0 +1,211 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+# Copyright(c) 2025: Mauro Carvalho Chehab <mchehab@kernel.org>.
+
+"""
+C lexical parser for variables.
+"""
+
+import logging
+import re
+
+from .c_lex import CTokenizer, CToken
+
+class CDataItem:
+ """
+ Represent a data declaration.
+ """
+ def __init__(self):
+ self.decl_name = None
+ self.decl_type = None
+ self.parameterlist = []
+ self.parametertypes = {}
+
+ def __repr__(self) -> str:
+ """
+ Return contents of the CDataItem.
+ Useful for debugging purposes.
+ """
+ return (f"CDataItem(decl_type={self.decl_type!r}, "
+ f"decl_name={self.decl_name!r}, "
+ f"parameterlist={self.parameterlist!r}, "
+ f"parametertypes={self.parametertypes!r})")
+
+class CDataParser:
+ """
+ Handles a C data prototype, converting it into a data element
+ describing it.
+ """
+
+ IGNORE_TOKENS = [CToken.SPACE, CToken.COMMENT]
+
+ def __init__(self, source):
+ self.source = source
+ self.item = CDataItem()
+
+ self._parse()
+
+ def _push_struct(self, tokens, stack, prev_kind, i):
+ """
+ Handles structs and unions, picking the identifier just after
+ ``struct`` or ``union``.
+ """
+
+ if prev_kind is not None:
+ j = prev_kind + 1
+ while j < len(tokens) and tokens[j].kind in self.IGNORE_TOKENS:
+ j += 1
+
+ if j < len(tokens) and tokens[j].kind == CToken.NAME:
+ stack.append(tokens[j].value)
+ return
+
+ name = "{unnamed " + tokens[prev_kind].value + "}"
+ stack.append(name)
+ self.item.parameterlist.append(name)
+ return
+
+ #
+ # Empty block. We still need to append for stack levels to match
+ #
+ stack.append(None)
+
+ def _parse(self):
+ """
+ Core algorithm: walk the token list once, tracking the nesting
+ level, and fill ``self.item`` with the declaration data.
+ """
+ tokens = CTokenizer(self.source).tokens
+
+ stack = []
+ current_type = []
+ parameters = []
+ types = {}
+
+ prev_kind = None
+ get_id = False
+ level = 0
+
+ for i in range(0, len(tokens)):
+ tok = tokens[i]
+ if tok.kind == CToken.COMMENT:
+ continue
+
+ if tok.kind in [CToken.STRUCT, CToken.UNION, CToken.ENUM]:
+ prev_kind = i
+
+ if tok.kind == CToken.BEGIN:
+ if tok.value == "{":
+ if (prev_kind is not None and
+ tokens[prev_kind].kind in [CToken.STRUCT, CToken.UNION]):
+
+ self._push_struct(tokens, stack, prev_kind, i)
+ if not self.item.decl_name:
+ self.item.decl_name = stack[0]
+ else:
+ stack.append(None)
+
+ #
+ # Add previous tokens
+ #
+ if prev_kind is not None:
+ get_id = True
+
+ if not self.item.decl_type:
+ self.item.decl_type = tokens[prev_kind].value
+
+ current_type = []
+
+ continue
+
+ level += 1
+
+ if tok.kind == CToken.END:
+ if tok.value == "}":
+ if stack:
+ stack.pop()
+
+ if get_id and prev_kind is not None:
+ current_type = []
+ for j in range(prev_kind, i + 1):
+ current_type.append((level, tokens[j]))
+ if tok.kind == CToken.BEGIN:
+ break
+
+ while j < len(tokens):
+ if tokens[j].kind not in self.IGNORE_TOKENS:
+ break
+ j += 1
+
+ name = None
+
+ if tokens[j].kind == CToken.NAME:
+ name = tokens[j].value
+
+ if not self.item.decl_type and len(stack) == 1:
+ self.item.decl_name = stack[0]
+
+ self.item.parameterlist.append(name)
+ current_type.append((level, tok))
+
+ get_id = False
+ prev_kind = None
+ continue
+
+ level -= 1
+
+ if tok.kind != CToken.ENDSTMT:
+ current_type.append((level, tok))
+ continue
+
+ #
+ # End of a statement. Parse it if tokens are present
+ #
+
+ if not current_type:
+ current_type = []
+ continue
+
+ #
+ # the last NAME token with level 0 is the field name
+ #
+ name_token = None
+ for pos, t in enumerate(reversed(current_type)):
+ cur_level, cur_tok = t
+ if not cur_level and cur_tok.kind == CToken.NAME:
+ name_token = cur_tok.value
+ break
+
+ #
+ # TODO: we should likely emit a Warning here
+ #
+
+ if not name_token:
+ current_type = []
+ continue
+
+ #
+ # As we used reversed, we need to adjust pos here
+ #
+ pos = len(current_type) - pos - 1
+
+ #
+ # For the type, pick everything but the name
+ #
+
+ out = ""
+ for l, t in current_type:
+ out += t.value
+
+ names = []
+ for n in stack[1:] + [name_token]:
+ if n:
+ if "{unnamed" not in n:
+ names.append(n)
+
+ full_name = ".".join(names)
+
+ self.item.parameterlist.append(full_name)
+ self.item.parametertypes[full_name] = out.strip()
+
+ current_type = []
--
2.53.0
* [PATCH RFC 2/2] HACK: add a parse_c.py file to test CDataParser
From: Mauro Carvalho Chehab @ 2026-03-20 9:46 UTC
To: Jonathan Corbet, Linux Doc Mailing List
Cc: Mauro Carvalho Chehab, linux-kernel, Mauro Carvalho Chehab
This patch should not be merged. It is a quick tool to test
CDataParser.
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
parse_c.py | 87 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 87 insertions(+)
create mode 100755 parse_c.py
diff --git a/parse_c.py b/parse_c.py
new file mode 100755
index 000000000000..740445998965
--- /dev/null
+++ b/parse_c.py
@@ -0,0 +1,87 @@
+#!/usr/bin/env python3
+# parse_c.py
+"""
+Run a quick demo on a real C source file.
+
+Usage
+-----
+ ./parse_c.py [path/to/c/file]
+"""
+
+import argparse
+
+from tools.lib.python.kdoc.data_parser import CDataParser
+
+TEST = """
+struct property_entry {
+ const char *name;
+ size_t length;
+ bool is_inline; /* TEST */
+ struct foo {
+ char *bar[12];
+ struct foobar {
+ enum enum_type my_enum; /* TEST 2 */
+ struct {
+ uint_t test; /* TEST 3 */
+ static const int anonymous;
+ };
+ };
+ ;; /* This is valid, but should not occur in practice */
+ {}; /* Same here */
+ };
+ enum dev_prop_type type;
+ enum {
+ EXPRESSION_LITERAL,
+ EXPRESSION_BINARY,
+ EXPRESSION_UNARY,
+ EXPRESSION_FUNCTION,
+ EXPRESSION_ARRAY
+ } literal;
+
+ union {
+ const void *pointer;
+ union {
+ u8 boou8_data[sizeof(u64) / sizeof(u8)];
+ u16 u16_data[sizeof(u64) / sizeof(u16)];
+ u32 u32_data[sizeof(u64) / sizeof(u32)];
+ u64 u64_data[sizeof(u64) / sizeof(u64)];
+ const char *str[sizeof(u64) / sizeof(char *)];
+ };
+ };
+ char *prop_name;
+};
+"""
+
+
+def main():
+ p = argparse.ArgumentParser(description="Parse a C struct/union/enum definition.")
+
+ p.add_argument("fname", nargs="?", help="C source file to parse")
+
+ args = p.parse_args()
+
+ if args.fname:
+ with open(args.fname, "r", encoding="utf-8") as f:
+ source = f.read()
+ else:
+ source = TEST
+
+ parser = CDataParser(source)
+
+ item = parser.item
+
+ print(repr(item))
+
+ print(f"{item.decl_type} {item.decl_name}\n")
+
+ print("parameterlist:")
+ for p in item.parameterlist:
+ print(f" - {p}")
+
+ print("\nparametertypes:")
+ for k, v in item.parametertypes.items():
+ print(f" - {k}: {v}")
+
+
+if __name__ == "__main__":
+ main()
--
2.53.0
* Re: [PATCH RFC 0/2] kernel-doc: better handle data prototypes
From: Mauro Carvalho Chehab @ 2026-03-24 15:33 UTC
To: Linux Doc Mailing List, Jonathan Corbet
Cc: Mauro Carvalho Chehab, linux-kernel
On Fri, 20 Mar 2026 10:46:39 +0100
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> Hi Jon,
>
> Don't merge this series. It is just a heads on about what I'm
> working right now.
>
> This is basically a proof of concept, not yet integrated with
> kernel-doc. It helps to show that investing on a tokenizer
> was a good idea.
>
> I'm still testing the code.
Heh, getting it working is hard, but I ended up with something that
should work with a somewhat complex scenario.
The new version is at my scratch repository at:
https://github.com/mchehab/linux PR_CDataParser-v2
I'm expecting that this parser should be able to handle:
- typedef (for data types);
- struct
- union
- enum
- var
So, once properly integrated(*), it should simplify the code
inside kdoc_parser a lot.
(*) right now, it is minimally integrated, handling just
struct/unions.
My current plan is to test it more with real-case scenarios,
aiming to submit it after 7.1-rc1, as it seems too late to
submit a change like this now.
IMO the newer code should be more reliable than the current
approach and should produce a better output once done.
--
Thanks,
Mauro
For this input:
<snip>
/**
* struct property_entry - property entry
*
* @name: name description
* @length: length description
* @is_inline: is_inline description
* @bar: bar description
* @my_enum: my_enum description
* @test: test description
* @anonymous: anon description
* @type: type description
* @literal: literal description
* @pointer: pointer description
* @value: value description
* @boou8_data: boou8_data description
* @u16_data: u16_data description
* @u32_data: u32_data description
* @u64_data: u64_data description
* @str: str description
* @prop_name: prop name description
*/
struct property_entry {
const char *name;
size_t length;
bool is_inline; /* TEST */
struct foo {
char *bar[12];
struct {
enum enum_type my_enum; /* TEST 2 */
struct {
uint_t test; /* TEST 3 */
static const int anonymous;
};
} foobar ;
;;
{};
};
enum dev_prop_type type;
enum {
EXPRESSION_LITERAL,
EXPRESSION_BINARY,
EXPRESSION_UNARY,
EXPRESSION_FUNCTION,
EXPRESSION_ARRAY
} literal;
union {
const void *pointer;
union {
u8 boou8_data[sizeof(u64) / sizeof(u8)];
u16 u16_data[sizeof(u64) / sizeof(u16)];
u32 u32_data[sizeof(u64) / sizeof(u32)];
u64 u64_data[sizeof(u64) / sizeof(u64)];
const char *str[sizeof(u64) / sizeof(char *)];
};
};
char *prop_name;
};
</snip>
Kernel-doc produces a proper result:
<snip>
Ignoring foobar
.. c:struct:: property_entry
property entry
.. container:: kernelindent
**Definition**::
struct property_entry {
const char *name;
size_t length;
bool is_inline;
struct foo {
char *bar[12];
struct {
enum enum_type my_enum;
struct {
uint_t test;
static const int anonymous;
};
} foobar;
{
};
};
enum dev_prop_type type;
enum {
EXPRESSION_LITERAL,
EXPRESSION_BINARY,
EXPRESSION_UNARY,
EXPRESSION_FUNCTION,
EXPRESSION_ARRAY } literal;
union {
const void *pointer;
union {
u8 boou8_data[sizeof(u64) / sizeof(u8)];
u16 u16_data[sizeof(u64) / sizeof(u16)];
u32 u32_data[sizeof(u64) / sizeof(u32)];
u64 u64_data[sizeof(u64) / sizeof(u64)];
const char *str[sizeof(u64) / sizeof(char *)];
};
};
char *prop_name;
}
};
**Members**
``{unnamed_struct}``
anonymous
``name``
name description
``length``
length description
``is_inline``
is_inline description
``bar``
bar description
``my_enum``
my_enum description
``{unnamed_struct}``
anonymous
``test``
test description
``anonymous``
anon description
``type``
type description
``literal``
literal description
``{unnamed_union}``
anonymous
``pointer``
pointer description
``boou8_data``
boou8_data description
``u16_data``
u16_data description
``u32_data``
u32_data description
``u64_data``
u64_data description
``str``
str description
``prop_name``
prop name description
</snip>