[PATCH RFC 0/2] kernel-doc: better handle data prototypes

public inbox for linux-kernel@vger.kernel.org

* [PATCH RFC 0/2] kernel-doc: better handle data prototypes
@ 2026-03-20  9:46 Mauro Carvalho Chehab
  2026-03-20  9:46 ` [PATCH RFC 1/2] docs: kdoc: add a class to parse data items Mauro Carvalho Chehab
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-20  9:46 UTC (permalink / raw)
  To: Linux Doc Mailing List, Jonathan Corbet
  Cc: Mauro Carvalho Chehab, Mauro Carvalho Chehab, linux-kernel

Hi Jon,

Don't merge this series. It is just a heads-up about what I'm
working on right now.

This is basically a proof of concept, not yet integrated with
kernel-doc. It helps to show that investing in a tokenizer
was a good idea.

I'm still testing the code.

Right now, the kernel-doc logic to handle data types is very
complex, and the code is split into dump_<type> functions, which
in turn call several ancillary routines. The most complex ones
are related to handling structs, which involves converting inner
structs/unions into members of the main struct.

By using this new code, all elements from most data types can
be parsed by a single code path.

Please note that the code was designed to pick a single
declaration, as this is how kdoc_parser will use it.
If you try to parse multiple ones, the output won't be right,
as it will pick the first declaration name and create a single
item with all data declarations in it.
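
As an illustration of the "one declaration at a time" constraint, here
is a minimal sketch of how a caller could split a source into top-level
declarations before handing each one to a parser. split_declarations()
is a hypothetical helper, not part of this series, and it assumes that
no comment or string literal contains braces or semicolons:

```python
def split_declarations(source):
    """
    Split C source into top-level declarations (hypothetical helper).

    A declaration ends at a ';' found outside any '{ ... }' block.
    Assumption: comments and string literals contain no braces or
    semicolons, as they are not tokenized here.
    """
    decls = []
    depth = 0       # current '{' nesting level
    start = 0       # offset where the current declaration begins

    for pos, char in enumerate(source):
        if char == "{":
            depth += 1
        elif char == "}":
            depth -= 1
        elif char == ";" and depth == 0:
            chunk = source[start:pos + 1].strip()
            if chunk != ";":            # skip empty statements
                decls.append(chunk)
            start = pos + 1

    return decls


if __name__ == "__main__":
    for decl in split_declarations("struct a { int x; };\nint y;\n"):
        print(repr(decl))
```

Each returned string could then be parsed on its own, so that no item
mixes members from unrelated declarations.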

As it is not based on regexes, it can properly handle some
problematic cases, like having:

    {};

and:

    ;;;;;

in the middle of a struct/union.
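
To illustrate why a token walk copes with such input, here is a small
self-contained sketch. It is not the code from this series: TOKEN_RE and
flatten_members() are hypothetical names, and the regex is used only to
split the text into tokens, standing in for CTokenizer. It assumes a
single struct/union whose inner blocks are either named right after the
keyword or anonymous:

```python
import re

# Hypothetical mini-tokenizer: comments, blanks, identifiers, numbers,
# then any other single character.
TOKEN_RE = re.compile(r"/\*.*?\*/|\s+|[A-Za-z_]\w*|\d+|.", re.S)
IDENT_RE = re.compile(r"[A-Za-z_]\w*")


def flatten_members(source):
    """Map each member (dotted by enclosing named blocks) to its text."""
    members = {}
    stack = []      # names of the open blocks; None for anonymous ones
    stmt = []       # tokens of the statement being collected

    for tok in TOKEN_RE.findall(source):
        if tok.startswith("/*"):
            continue
        if tok == "{":
            words = [t for t in stmt if not t.isspace()]
            named = len(words) >= 2 and words[-2] in ("struct", "union")
            stack.append(words[-1] if named else None)
            stmt = []
        elif tok == "}":
            if stack:
                stack.pop()
            stmt = []
        elif tok == ";":
            # The member name is the last identifier outside () and [].
            name, depth = None, 0
            for t in stmt:
                if t in ("(", "["):
                    depth += 1
                elif t in (")", "]"):
                    depth -= 1
                elif depth == 0 and IDENT_RE.fullmatch(t):
                    name = t
            if name:    # empty statements such as ";;" have no name
                path = [n for n in stack[1:] if n] + [name]
                members[".".join(path)] = "".join(stmt).strip()
            stmt = []
        else:
            stmt.append(tok)

    return members


SOURCE = """
struct property_entry {
    const char *name;
    struct foo {
        char *bar[12];
        ;;
        {};
    };
    u16 u16_data[sizeof(u64) / sizeof(u16)];
};
"""

if __name__ == "__main__":
    for key, value in flatten_members(SOURCE).items():
        print(f"{key}: {value}")
```

As the empty block and the stray semicolons never produce a name, they
are skipped without any special casing.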

For enums, if one has values inside the declaration, like:

    enum { FOO, BAR } type;

it picks the right data type. Kernel-doc currently maps this as:

    enum type
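
The difference between the enum tag and the name declared after the
closing brace can be sketched as follows. describe_enum() and TOKEN_RE
are hypothetical names used only for this illustration, and the sketch
assumes a single "enum [tag] { ... } [name];" declaration with no commas
inside the value expressions:

```python
import re

# Hypothetical mini-tokenizer: identifiers, numbers, or any other
# single non-blank character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def describe_enum(source):
    """Split an enum declaration into tag, values and declared name."""
    toks = TOKEN_RE.findall(source)
    if toks[:1] != ["enum"] or "{" not in toks or "}" not in toks:
        raise ValueError("expected 'enum [tag] { ... } [name];'")

    begin, end = toks.index("{"), toks.index("}")

    # A tag, when present, sits between 'enum' and '{'.
    tag = toks[1] if begin == 2 else None

    # The first token after '{' or after each ',' is a value name.
    values = []
    expect_name = True
    for tok in toks[begin + 1:end]:
        if tok == ",":
            expect_name = True
        elif expect_name:
            values.append(tok)
            expect_name = False

    # Whatever follows '}' (minus the ';') declares the variable.
    after = [t for t in toks[end + 1:] if t != ";"]
    declarator = after[-1] if after else None

    return {"tag": tag, "values": values, "declarator": declarator}


if __name__ == "__main__":
    print(describe_enum("enum { FOO, BAR } type;"))
```

Here "type" is reported as the declared name of an untagged enum, rather
than as the tag of an "enum type".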

My plan is to integrate it into kernel-doc and see how it goes.
It will likely raise some corner cases, but, once we get it right,
this will likely reduce the size and complexity of kdoc_parser.

If you want to test, you can use:

    ./parse_c.py

to parse an example hardcoded in it, or pass a file name to read from:

    $ ./parse_c.py x.h
    CDataItem(decl_type=None, decl_name=None, parameterlist=['u16_data'], parametertypes={'u16_data': 'u16 u16_data[sizeof(u64) / sizeof(u16)]'})
    None None

    parameterlist:
      - u16_data

    parametertypes:
      - u16_data: u16 u16_data[sizeof(u64) / sizeof(u16)]

   (in this example, x.h contains just:
    u16 u16_data[sizeof(u64) / sizeof(u16)];
   )

The logic stores decl_type and decl_name when the data is a
struct/union/enum. If the data is just a declaration, it fills
only one element in parameterlist and in parametertypes.
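
A consumer could tell the two cases apart along these lines. This is
only a sketch: DataItem is a hypothetical stand-in that mirrors the
fields of CDataItem, describe() is an assumed helper, and the struct
values below are written by hand rather than produced by the parser:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DataItem:
    """Hypothetical stand-in mirroring the fields of CDataItem."""
    decl_type: Optional[str] = None
    decl_name: Optional[str] = None
    parameterlist: list = field(default_factory=list)
    parametertypes: dict = field(default_factory=dict)


def describe(item):
    """Summarize an item, depending on whether decl_type was filled."""
    if item.decl_type:
        count = len(item.parameterlist)
        return f"{item.decl_type} {item.decl_name} with {count} member(s)"

    # Plain declaration: a single entry carries the whole statement.
    name = item.parameterlist[0]
    return f"plain declaration: {item.parametertypes[name]}"


# Values taken from the x.h example above.
plain = DataItem(
    parameterlist=["u16_data"],
    parametertypes={"u16_data": "u16 u16_data[sizeof(u64) / sizeof(u16)]"},
)

# Hand-written values for "struct foo { int a; };" (an assumption).
compound = DataItem(
    decl_type="struct",
    decl_name="foo",
    parameterlist=["a"],
    parametertypes={"a": "int a"},
)

if __name__ == "__main__":
    print(describe(plain))
    print(describe(compound))
```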

Mauro Carvalho Chehab (2):
  docs: kdoc: add a class to parse data items
  HACK: add a parse_c.py file to test CDataParser

 parse_c.py                           |  87 +++++++++++
 tools/lib/python/kdoc/data_parser.py | 211 +++++++++++++++++++++++++++
 2 files changed, 298 insertions(+)
 create mode 100755 parse_c.py
 create mode 100644 tools/lib/python/kdoc/data_parser.py

-- 
2.53.0


* [PATCH RFC 1/2] docs: kdoc: add a class to parse data items
  2026-03-20  9:46 [PATCH RFC 0/2] kernel-doc: better handle data prototypes Mauro Carvalho Chehab
@ 2026-03-20  9:46 ` Mauro Carvalho Chehab
  2026-03-20  9:46 ` [PATCH RFC 2/2] HACK: add a parse_c.py file to test CDataParser Mauro Carvalho Chehab
  2026-03-24 15:33 ` [PATCH RFC 0/2] kernel-doc: better handle data prototypes Mauro Carvalho Chehab
  2 siblings, 0 replies; 4+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-20  9:46 UTC (permalink / raw)
  To: Jonathan Corbet, Linux Doc Mailing List
  Cc: Mauro Carvalho Chehab, linux-kernel, Mauro Carvalho Chehab

Instead of using very complex regular expressions and flattening
inner structs/unions, use CTokenizer to handle data types.

Note that this doesn't handle "typedef".

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
 tools/lib/python/kdoc/data_parser.py | 211 +++++++++++++++++++++++++++
 1 file changed, 211 insertions(+)
 create mode 100644 tools/lib/python/kdoc/data_parser.py

diff --git a/tools/lib/python/kdoc/data_parser.py b/tools/lib/python/kdoc/data_parser.py
new file mode 100644
index 000000000000..f04915b67d6b
--- /dev/null
+++ b/tools/lib/python/kdoc/data_parser.py
@@ -0,0 +1,211 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+# Copyright(c) 2025: Mauro Carvalho Chehab <mchehab@kernel.org>.
+
+"""
+C lexical parser for variables.
+"""
+
+import logging
+import re
+
+from .c_lex import CTokenizer, CToken
+
+class CDataItem:
+    """
+    Represent a data declaration.
+    """
+    def __init__(self):
+        self.decl_name = None
+        self.decl_type = None
+        self.parameterlist = []
+        self.parametertypes = {}
+
+    def __repr__(self) -> str:
+        """
+        Return contents of the CDataItem.
+        Useful for debugging purposes.
+        """
+        return (f"CDataItem(decl_type={self.decl_type!r}, "
+                f"decl_name={self.decl_name!r}, "
+                f"parameterlist={self.parameterlist!r}, "
+                f"parametertypes={self.parametertypes!r})")
+
+class CDataParser:
+    """
+    Handles a C data prototype, converting it into a data element
+    describing it.
+    """
+
+    IGNORE_TOKENS = [CToken.SPACE, CToken.COMMENT]
+
+    def __init__(self, source):
+        self.source = source
+        self.item = CDataItem()
+
+        self._parse()
+
+    def _push_struct(self, tokens, stack, prev_kind, i):
+        """
+        Handles structs and unions, picking the identifier just after
+        ``struct`` or ``union``.
+        """
+
+        if prev_kind:
+            j = prev_kind + 1
+            while j < len(tokens) and tokens[j].kind in self.IGNORE_TOKENS:
+                j += 1
+
+            if j < len(tokens) and tokens[j].kind == CToken.NAME:
+                stack.append(tokens[j].value)
+                return
+
+            name = "{unnamed " + tokens[prev_kind].value + "}"
+            stack.append(name)
+            self.item.parameterlist.append(name)
+            return
+
+        #
+        # Empty block. We still need to append for stack levels to match
+        #
+        stack.append(None)
+
+    def _parse(self):
+        """
+        Core algorithm: walk the tokens, tracking nesting, and store
+        each statement as one entry of the parsed item.
+        """
+        tokens = CTokenizer(self.source).tokens
+
+        stack = []
+        current_type = []
+        parameters = []
+        types = {}
+
+        prev_kind = None
+        get_id = False
+        level = 0
+
+        for i in range(0, len(tokens)):
+            tok = tokens[i]
+            if tok.kind == CToken.COMMENT:
+                continue
+
+            if tok.kind in [CToken.STRUCT, CToken.UNION, CToken.ENUM]:
+                prev_kind = i
+
+            if tok.kind == CToken.BEGIN:
+                if tok.value == "{":
+                    if (prev_kind and
+                        tokens[prev_kind].kind in [CToken.STRUCT, CToken.UNION]):
+
+                        self._push_struct(tokens, stack, prev_kind, i)
+                        if not self.item.decl_name:
+                            self.item.decl_name = stack[0]
+                    else:
+                        stack.append(None)
+
+                        #
+                        # Add previous tokens
+                        #
+                        if prev_kind:
+                            get_id = True
+
+                    if not self.item.decl_type:
+                        self.item.decl_type = tokens[prev_kind].value
+
+                    current_type = []
+
+                    continue
+
+                level += 1
+
+            if tok.kind == CToken.END:
+                if tok.value == "}":
+                    if stack:
+                        stack.pop()
+
+                    if get_id and prev_kind:
+                        current_type = []
+                        for j in range(prev_kind, i + 1):
+                            current_type.append((level, tokens[j]))
+                            if tok.kind == CToken.BEGIN:
+                                break
+
+                        while j < len(tokens):
+                            if tokens[j].kind not in self.IGNORE_TOKENS:
+                                break
+                            j += 1
+
+                        name = None
+
+                        if tokens[j].kind == CToken.NAME:
+                            name = tokens[j].value
+
+                        if not self.item.decl_type and len(stack) == 1:
+                            self.item.decl_name = stack[0]
+
+                            self.item.parameterlist.append(name)
+                            current_type.append((level, tok))
+
+                    get_id = False
+                    prev_kind = None
+                    continue
+
+                level -= 1
+
+            if tok.kind != CToken.ENDSTMT:
+                current_type.append((level, tok))
+                continue
+
+            #
+            # End of a statement. Parse it if tokens are present
+            #
+
+            if not current_type:
+                current_type = []
+                continue
+
+            #
+            # the last NAME token with level 0 is the field name
+            #
+            name_token = None
+            for pos, t in enumerate(reversed(current_type)):
+                cur_level, cur_tok = t
+                if not cur_level and cur_tok.kind == CToken.NAME:
+                    name_token = cur_tok.value
+                    break
+
+            #
+            # TODO: we should likely emit a Warning here
+            #
+
+            if not name_token:
+                current_type = []
+                continue
+
+            #
+            # As we used reversed, we need to adjust pos here
+            #
+            pos = len(current_type) - pos - 1
+
+            #
+            # For the type, pick everything but the name
+            #
+
+            out = ""
+            for l, t in current_type:
+                out += t.value
+
+            names = []
+            for n in stack[1:] + [name_token]:
+                if n:
+                    if "{unnamed" not in n:
+                        names.append(n)
+
+            full_name = ".".join(names)
+
+            self.item.parameterlist.append(full_name)
+            self.item.parametertypes[full_name] = out.strip()
+
+            current_type = []
-- 
2.53.0



* [PATCH RFC 2/2] HACK: add a parse_c.py file to test CDataParser
  2026-03-20  9:46 [PATCH RFC 0/2] kernel-doc: better handle data prototypes Mauro Carvalho Chehab
  2026-03-20  9:46 ` [PATCH RFC 1/2] docs: kdoc: add a class to parse data items Mauro Carvalho Chehab
@ 2026-03-20  9:46 ` Mauro Carvalho Chehab
  2026-03-24 15:33 ` [PATCH RFC 0/2] kernel-doc: better handle data prototypes Mauro Carvalho Chehab
  2 siblings, 0 replies; 4+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-20  9:46 UTC (permalink / raw)
  To: Jonathan Corbet, Linux Doc Mailing List
  Cc: Mauro Carvalho Chehab, linux-kernel, Mauro Carvalho Chehab

This patch should not be merged. It is a quick tool to test
CDataParser.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
 parse_c.py | 87 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 87 insertions(+)
 create mode 100755 parse_c.py

diff --git a/parse_c.py b/parse_c.py
new file mode 100755
index 000000000000..740445998965
--- /dev/null
+++ b/parse_c.py
@@ -0,0 +1,87 @@
+#!/usr/bin/env python3
+# parse_c.py
+"""
+Run a quick demo on a real C source file.
+
+Usage
+-----
+    ./parse_c.py [path/to/c/file]
+"""
+
+import argparse
+
+from tools.lib.python.kdoc.data_parser import CDataParser
+
+TEST = """
+struct property_entry {
+	const char *name;
+	size_t length;
+	bool is_inline;   /* TEST */
+	struct foo {
+		char *bar[12];
+		struct foobar {
+			enum enum_type my_enum; /* TEST	2 */
+			struct {
+				uint_t test; /* TEST 3 */
+				static const int anonymous;
+			};
+		};
+		;;        /* This is valid, but should not occur in practice */
+		{};       /* Same here */
+	};
+	enum dev_prop_type type;
+		enum {
+			EXPRESSION_LITERAL,
+			EXPRESSION_BINARY,
+			EXPRESSION_UNARY,
+			EXPRESSION_FUNCTION,
+			EXPRESSION_ARRAY
+		} literal;
+
+	union {
+		const void *pointer;
+		union {
+			u8 boou8_data[sizeof(u64) / sizeof(u8)];
+			u16 u16_data[sizeof(u64) / sizeof(u16)];
+			u32 u32_data[sizeof(u64) / sizeof(u32)];
+			u64 u64_data[sizeof(u64) / sizeof(u64)];
+			const char *str[sizeof(u64) / sizeof(char *)];
+		};
+	};
+	char *prop_name;
+};
+"""
+
+
+def main():
+    p = argparse.ArgumentParser(description="Parse a C struct/union/enum definition.")
+
+    p.add_argument("fname", nargs="?", help="C source file to parse")
+
+    args = p.parse_args()
+
+    if args.fname:
+        with open(args.fname, "r", encoding="utf-8") as f:
+            source = f.read()
+    else:
+        source = TEST
+
+    parser = CDataParser(source)
+
+    item = parser.item
+
+    print(repr(item))
+
+    print(f"{item.decl_type} {item.decl_name}\n")
+
+    print("parameterlist:")
+    for p in item.parameterlist:
+        print(f"  - {p}")
+
+    print("\nparametertypes:")
+    for k, v in item.parametertypes.items():
+        print(f"  - {k}: {v}")
+
+
+if __name__ == "__main__":
+    main()
-- 
2.53.0



* Re: [PATCH RFC 0/2] kernel-doc: better handle data prototypes
  2026-03-20  9:46 [PATCH RFC 0/2] kernel-doc: better handle data prototypes Mauro Carvalho Chehab
  2026-03-20  9:46 ` [PATCH RFC 1/2] docs: kdoc: add a class to parse data items Mauro Carvalho Chehab
  2026-03-20  9:46 ` [PATCH RFC 2/2] HACK: add a parse_c.py file to test CDataParser Mauro Carvalho Chehab
@ 2026-03-24 15:33 ` Mauro Carvalho Chehab
  2 siblings, 0 replies; 4+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-24 15:33 UTC (permalink / raw)
  To: Linux Doc Mailing List, Jonathan Corbet
  Cc: Mauro Carvalho Chehab, linux-kernel

On Fri, 20 Mar 2026 10:46:39 +0100
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:

> Hi Jon,
> 
> Don't merge this series. It is just a heads-up about what I'm
> working on right now.
> 
> This is basically a proof of concept, not yet integrated with
> kernel-doc. It helps to show that investing in a tokenizer
> was a good idea.
> 
> I'm still testing the code.

Heh, getting it working is hard, but I ended up with something that
should work with a somewhat complex scenario.

The new version is at my scratch repository at:

	https://github.com/mchehab/linux PR_CDataParser-v2

I'm expecting that this parser should be able to handle:

	- typedef (for data types);
	- struct
	- union
	- enum
	- var

So, once properly integrated(*), it should greatly simplify the
code inside kdoc_parser.

(*) right now, it is minimally integrated, handling just
    struct/unions.
	
My current plan is to test it more with real-world scenarios,
aiming to submit it after 7.1-rc1, as it seems to be too late
to submit a change like this now.

IMO the newer code should be more reliable than the current
approach and should produce a better output once done.


-- 
Thanks,
Mauro

For this input:

<snip>
/**
 * struct property_entry - property entry
 *
 * @name: name description
 * @length: length description
 * @is_inline: is_inline description
 * @bar: bar description
 * @my_enum: my_enum description
 * @test: test description
 * @anonymous: anon description
 * @type: type description
 * @literal: literal description
 * @pointer: pointer description
 * @value: value description
 * @boou8_data: boou8_data description
 * @u16_data: u16_data description
 * @u32_data: u32_data description
 * @u64_data: u64_data description
 * @str: str description
 * @prop_name: prop name description
 */

struct property_entry {
	const char *name;
	size_t length;
	bool is_inline;   /* TEST */
	struct foo {
		char *bar[12];
		struct {
			enum enum_type my_enum; /* TEST	2 */
			struct {
				uint_t test; /* TEST 3 */
				static const int anonymous;
			};
		} foobar ;
		;;
		{};
	};
	enum dev_prop_type type;

	enum {
		EXPRESSION_LITERAL,
		EXPRESSION_BINARY,
		EXPRESSION_UNARY,
		EXPRESSION_FUNCTION,
		EXPRESSION_ARRAY
	} literal;

	union {
		const void *pointer;
		union {
			u8 boou8_data[sizeof(u64) / sizeof(u8)];
			u16 u16_data[sizeof(u64) / sizeof(u16)];
			u32 u32_data[sizeof(u64) / sizeof(u32)];
			u64 u64_data[sizeof(u64) / sizeof(u64)];
			const char *str[sizeof(u64) / sizeof(char *)];
		};
	};
	char *prop_name;
};
</snip>

Kernel-doc produces a proper result:

<snip>
Ignoring foobar


.. c:struct:: property_entry

  property entry

.. container:: kernelindent

  **Definition**::

    struct property_entry {
        const char *name;
        size_t length;
        bool is_inline;
        struct foo {
            char *bar[12];
            struct {
                enum enum_type my_enum;
                struct {
                    uint_t test;
                    static const int anonymous;
                };
            } foobar;
            {
            };
        };
        enum dev_prop_type type;
        enum {
            EXPRESSION_LITERAL,
            EXPRESSION_BINARY,
            EXPRESSION_UNARY,
            EXPRESSION_FUNCTION,
        EXPRESSION_ARRAY } literal;
        union {
            const void *pointer;
            union {
                u8 boou8_data[sizeof(u64) / sizeof(u8)];
                u16 u16_data[sizeof(u64) / sizeof(u16)];
                u32 u32_data[sizeof(u64) / sizeof(u32)];
                u64 u64_data[sizeof(u64) / sizeof(u64)];
                const char *str[sizeof(u64) / sizeof(char *)];
            };
        };
        char *prop_name;
        }
    };

  **Members**

  ``{unnamed_struct}``
    anonymous

  ``name``
    name description

  ``length``
    length description

  ``is_inline``
    is_inline description

  ``bar``
    bar description

  ``my_enum``
    my_enum description

  ``{unnamed_struct}``
    anonymous

  ``test``
    test description

  ``anonymous``
    anon description

  ``type``
    type description

  ``literal``
    literal description

  ``{unnamed_union}``
    anonymous

  ``pointer``
    pointer description

  ``boou8_data``
    boou8_data description

  ``u16_data``
    u16_data description

  ``u32_data``
    u32_data description

  ``u64_data``
    u64_data description

  ``str``
    str description

  ``prop_name``
    prop name description
</snip>




