[PATCH RFC 0/2] kernel-doc: better handle data prototypes

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH RFC 0/2] kernel-doc: better handle data prototypes
@ 2026-03-20  9:46 Mauro Carvalho Chehab
  2026-03-20  9:46 ` [PATCH RFC 1/2] docs: kdoc: add a class to parse data items Mauro Carvalho Chehab
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-20  9:46 UTC (permalink / raw)
  To: Linux Doc Mailing List, Jonathan Corbet
  Cc: Mauro Carvalho Chehab, Mauro Carvalho Chehab, linux-kernel

Hi Jon,

Don't merge this series. It is just a heads on about what I'm
working right now.

This is basically a proof of concept, not yet integrated with
kernel-doc. It helps to show that investing on a tokenizer
was a good idea.

I'm still testing the code.

Right now, kernel-doc logic to handle data types is very
complex, and the code is split into dump_<type> functions, which
in turn calls several ancillary routines. The most complex ones
are related to handling struct, with involves converting inner
struct/unions into members of the main struct.

By using this new code, all elements from most data types can
be parsed with a single code.

Please notice that the code was designed to pick a single
declaration, as this is how kdoc_parser will use it.
If you try to parse multiple ones, the output won't be right,
as it will pick the first declaration name and create a single
item with all data declarations on it.

As it is not based on regexes, it can properly handle some
problematic cases, like having:

    {};

and:
    ;;;;;

in the middle of a struct/union.

For enums, if one has values inside the declaration, like:

    enum { FOO, BAR } type;

It picks the right data type. Kernel-doc maps this currently as:
    enum type

My plan is to integrate it at Kernel-doc and see how it goes.
It will likely rise some corner cases, but, once we get it right,
this will likely reduce the size and complexity of kdoc_parser.

If you want to test, you can use:

    ./parse_c.py

to use an example hardcoded on it, or it reads from a fname with:

    $ ./parse_c.py x.h
    CDataItem(decl_type=None, decl_name=None, parameterlist=['u16_data'], parametertypes={'u16_data': 'u16 u16_data[sizeof(u64) / sizeof(u16)]'})
    None None

    parameterlist:
      - u16_data

    parametertypes:
      - u16_data: u16 u16_data[sizeof(u64) / sizeof(u16)]

   (on this example, x.h has just:
    u16 u16_data[sizeof(u64) / sizeof(u16)];
   )

The logic stores decl_type and decl_name when the data is
struct/union/enum. If the data is just a declaration, it fills
only one element at parameterlist and at parametertypes.

Mauro Carvalho Chehab (2):
  docs: kdoc: add a class to parse data items
  HACK: add a parse_c.py file to test CDataParser

 parse_c.py                           |  87 +++++++++++
 tools/lib/python/kdoc/data_parser.py | 211 +++++++++++++++++++++++++++
 2 files changed, 298 insertions(+)
 create mode 100755 parse_c.py
 create mode 100644 tools/lib/python/kdoc/data_parser.py

-- 
2.53.0

^ permalink raw reply	[flat|nested] 4+ messages in thread

* [PATCH RFC 1/2] docs: kdoc: add a class to parse data items
  2026-03-20  9:46 [PATCH RFC 0/2] kernel-doc: better handle data prototypes Mauro Carvalho Chehab
@ 2026-03-20  9:46 ` Mauro Carvalho Chehab
  2026-03-20  9:46 ` [PATCH RFC 2/2] HACK: add a parse_c.py file to test CDataParser Mauro Carvalho Chehab
  2026-03-24 15:33 ` [PATCH RFC 0/2] kernel-doc: better handle data prototypes Mauro Carvalho Chehab
  2 siblings, 0 replies; 4+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-20  9:46 UTC (permalink / raw)
  To: Jonathan Corbet, Linux Doc Mailing List
  Cc: Mauro Carvalho Chehab, linux-kernel, Mauro Carvalho Chehab

Instead of using very complex regular expressions and hamming
inner structs/unions, use CTokenizer to handle data types.

It should be noticed that this doesn't handle "typedef".

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
 tools/lib/python/kdoc/data_parser.py | 211 +++++++++++++++++++++++++++
 1 file changed, 211 insertions(+)
 create mode 100644 tools/lib/python/kdoc/data_parser.py

diff --git a/tools/lib/python/kdoc/data_parser.py b/tools/lib/python/kdoc/data_parser.py
new file mode 100644
index 000000000000..f04915b67d6b
--- /dev/null
+++ b/tools/lib/python/kdoc/data_parser.py
@@ -0,0 +1,211 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+# Copyright(c) 2025: Mauro Carvalho Chehab <mchehab@kernel.org>.
+
+"""
+C lexical parser for variables.
+"""
+
+import logging
+import re
+
+from .c_lex import CTokenizer, CToken
+
+class CDataItem:
+    """
+    Represent a data declaration.
+    """
+    def __init__(self):
+        self.decl_name = None
+        self.decl_type = None
+        self.parameterlist = []
+        self.parametertypes = {}
+
+    def __repr__(self) -> str:
+        """
+        Return contents of the CDataItem.
+        Useful for debugging purposes.
+        """
+        return (f"CDataItem(decl_type={self.decl_type!r}, "
+                f"decl_name={self.decl_name!r}, "
+                f"parameterlist={self.parameterlist!r}, "
+                f"parametertypes={self.parametertypes!r})")
+
+class CDataParser:
+    """
+    Handles a C data prototype, converting it into a data element
+    describing it.
+    """
+
+    IGNORE_TOKENS = [CToken.SPACE, CToken.COMMENT]
+
+    def __init__(self, source):
+        self.source = source
+        self.item = CDataItem()
+
+        self._parse()
+
+    def _push_struct(self, tokens, stack, prev_kind, i):
+        """
+        Handles Structs and enums, picking the identifier just after
+        ``struct`` or ``union``.
+        """
+
+        if prev_kind:
+            j = prev_kind + 1
+            while j < len(tokens) and tokens[j].kind in self.IGNORE_TOKENS:
+                j += 1
+
+            if j < len(tokens) and tokens[j].kind == CToken.NAME:
+                stack.append(tokens[j].value)
+                return
+
+            name = "{unnamed " + tokens[prev_kind].value + "}"
+            stack.append(name)
+            self.item.parameterlist.append(name)
+            return
+
+        #
+        # Empty block. We still need to append for stack levels to match
+        #
+        stack.append(None)
+
+    def _parse(self):
+        """
+        Core algorithm  it is a lightweight rewrite of the
+        walk-the-tokens logic we sketched in the previous answer.
+        """
+        tokens = CTokenizer(self.source).tokens
+
+        stack= []
+        current_type = []
+        parameters = []
+        types = {}
+
+        prev_kind = None
+        get_id = False
+        level = 0
+
+        for i in range(0, len(tokens)):
+            tok = tokens[i]
+            if tok.kind == CToken.COMMENT:
+                continue
+
+            if tok.kind in [CToken.STRUCT, CToken.UNION, CToken.ENUM]:
+                prev_kind = i
+
+            if tok.kind == CToken.BEGIN:
+                if tok.value == "{":
+                    if (prev_kind and
+                        tokens[prev_kind].kind in [CToken.STRUCT, CToken.UNION]):
+
+                        self._push_struct(tokens, stack, prev_kind, i)
+                        if not self.item.decl_name:
+                            self.item.decl_name = stack[0]
+                    else:
+                        stack.append(None)
+
+                        #
+                        # Add previous tokens
+                        #
+                        if prev_kind:
+                            get_id = True
+
+                    if not self.item.decl_type:
+                        self.item.decl_type = tokens[prev_kind].value
+
+                    current_type = []
+
+                    continue
+
+                level += 1
+
+            if tok.kind == CToken.END:
+                if tok.value == "}":
+                    if stack:
+                        stack.pop()
+
+                    if get_id and prev_kind:
+                        current_type = []
+                        for j in range(prev_kind, i + 1):
+                            current_type.append((level, tokens[j]))
+                            if tok.kind == CToken.BEGIN:
+                                break
+
+                        while j < len(tokens):
+                            if tokens[j].kind not in self.IGNORE_TOKENS:
+                                break
+                            j += 1
+
+                        name = None
+
+                        if tokens[j].kind == CToken.NAME:
+                            name = tokens[j].value
+
+                        if not self.item.decl_type and len(stack) ==  1:
+                            self.item.decl_name = stack[0]
+
+                            self.item.parameterlist.append(name)
+                            current_type.append((level, tok))
+
+                    get_id = False
+                    prev_kind = None
+                    continue
+
+                level -= 1
+
+            if tok.kind != CToken.ENDSTMT:
+                current_type.append((level, tok))
+                continue
+
+            #
+            # End of an statement. Parse it if tokens are present
+            #
+
+            if not current_type:
+                current_type = []
+                continue
+
+            #
+            # the last NAME token with level 0 is the field name
+            #
+            name_token = None
+            for pos, t in enumerate(reversed(current_type)):
+                cur_level, cur_tok = t
+                if not cur_level and cur_tok.kind == CToken.NAME:
+                    name_token = cur_tok. value
+                    break
+
+            #
+            # TODO: we should likely emit a Warning here
+            #
+
+            if not name_token:
+                current_type = []
+                continue
+
+            #
+            # As we used reversed, we need to adjust pos here
+            #
+            pos = len(current_type) - pos - 1
+
+            #
+            # For the type, pick everything but the name
+            #
+
+            out = ""
+            for l, t in current_type:
+                out += t.value
+
+            names = []
+            for n in stack[1:] + [name_token]:
+                if n:
+                    if not "{unnamed" in n:
+                        names.append(n)
+
+            full_name = ".".join(names)
+
+            self.item.parameterlist.append(full_name)
+            self.item.parametertypes[full_name] = out.strip()
+
+            current_type = []
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* [PATCH RFC 2/2] HACK: add a parse_c.py file to test CDataParser
  2026-03-20  9:46 [PATCH RFC 0/2] kernel-doc: better handle data prototypes Mauro Carvalho Chehab
  2026-03-20  9:46 ` [PATCH RFC 1/2] docs: kdoc: add a class to parse data items Mauro Carvalho Chehab
@ 2026-03-20  9:46 ` Mauro Carvalho Chehab
  2026-03-24 15:33 ` [PATCH RFC 0/2] kernel-doc: better handle data prototypes Mauro Carvalho Chehab
  2 siblings, 0 replies; 4+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-20  9:46 UTC (permalink / raw)
  To: Jonathan Corbet, Linux Doc Mailing List
  Cc: Mauro Carvalho Chehab, linux-kernel, Mauro Carvalho Chehab

This patch should not be merged. It is a quick tool to test
CDataParser.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
 parse_c.py | 87 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 87 insertions(+)
 create mode 100755 parse_c.py

diff --git a/parse_c.py b/parse_c.py
new file mode 100755
index 000000000000..740445998965
--- /dev/null
+++ b/parse_c.py
@@ -0,0 +1,87 @@
+#!/usr/bin/env python3
+# example.py
+"""
+Run a quick demo on a real C source file.
+
+Usage
+-----
+    python -m c_struct_parser.example <path/to/c/file.c>
+"""
+
+import argparse
+
+from tools.lib.python.kdoc.data_parser import CDataParser
+
+TEST = """
+struct property_entry {
+	const char *name;
+	size_t length;
+	bool is_inline;   /* TEST */
+	struct foo {
+		char *bar[12];
+		struct foobar {
+			enum enum_type my_enum; /* TEST	2 */
+			struct {
+				uint_t test; /* TEST 3 */
+				static const int anonymous;
+			};
+		};
+		;;        /* This is valid, but should not occur in practice */
+		{};       /* Same here */
+	};
+	enum dev_prop_type type;
+		enum {
+			EXPRESSION_LITERAL,
+			EXPRESSION_BINARY,
+			EXPRESSION_UNARY,
+			EXPRESSION_FUNCTION,
+			EXPRESSION_ARRAY
+		} literal;
+
+	union {
+		const void *pointer;
+		union {
+			u8 boou8_data[sizeof(u64) / sizeof(u8)];
+			u16 u16_data[sizeof(u64) / sizeof(u16)];
+			u32 u32_data[sizeof(u64) / sizeof(u32)];
+			u64 u64_data[sizeof(u64) / sizeof(u64)];
+			const char *str[sizeof(u64) / sizeof(char *)];
+		};
+	};
+	char *prop_name;
+};
+"""
+
+
+def main():
+    p = argparse.ArgumentParser(description="Parse a C struct/union/enum definition.")
+
+    p.add_argument("fname", nargs="?", help="C source file to parse")
+
+    args = p.parse_args()
+
+    if args.fname:
+        with open(args.fname, "r", encoding="utf-8") as f:
+            source = f.read()
+    else:
+        source = TEST
+
+    parser = CDataParser(source)
+
+    item = parser.item
+
+    print(repr(item))
+
+    print(f"{item.decl_type} {item.decl_name}\n")
+
+    print("parameterlist:")
+    for p in item.parameterlist:
+        print(f"  - {p}")
+
+    print("\nparametertypes:")
+    for k, v in item.parametertypes.items():
+        print(f"  - {k}: {v}")
+
+
+if __name__ == "__main__":
+    main()
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH RFC 0/2] kernel-doc: better handle data prototypes
  2026-03-20  9:46 [PATCH RFC 0/2] kernel-doc: better handle data prototypes Mauro Carvalho Chehab
  2026-03-20  9:46 ` [PATCH RFC 1/2] docs: kdoc: add a class to parse data items Mauro Carvalho Chehab
  2026-03-20  9:46 ` [PATCH RFC 2/2] HACK: add a parse_c.py file to test CDataParser Mauro Carvalho Chehab
@ 2026-03-24 15:33 ` Mauro Carvalho Chehab
  2 siblings, 0 replies; 4+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-24 15:33 UTC (permalink / raw)
  To: Linux Doc Mailing List, Jonathan Corbet
  Cc: Mauro Carvalho Chehab, linux-kernel

On Fri, 20 Mar 2026 10:46:39 +0100
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:

> Hi Jon,
> 
> Don't merge this series. It is just a heads on about what I'm
> working right now.
> 
> This is basically a proof of concept, not yet integrated with
> kernel-doc. It helps to show that investing on a tokenizer
> was a good idea.
> 
> I'm still testing the code.

Heh, getting it working is hard, but I ended with something that
should work with a somewhat complex scenario.

The new version is at my scratch repository at:

	https://github.com/mchehab/linux PR_CDataParser-v2

I'm expecting that this parser should be able to handle:

	- typedef (for data types);
	- struct
	- union
	- enum
	- var

So, after properly integrated(*), it should simplify a lot the
code inside kdoc_parser.

(*) right now, it is minimally integrated, handling just
    struct/unions.
	
My current plan is to test it more with real-case scenarios,
aiming to submit it after 7.1-rc1, as it sounds to be that a
change like that is too late to be submitted like that.

IMO the newer code should be more reliable than the current
approach and should produce a better output once done.


-- 
Thanks,
Mauro

For this input:

<snip>
/**
 * struct property_entry - property entry
 *
 * @name: name description
 * @length: length description
 * @is_inline: is_inline description
 * @bar: bar description
 * @my_enum: my_enum description
 * @test: test description
 * @anonymous: anon description
 * @type: type description
 * @literal: literal description
 * @pointer: pointer description
 * @value: value description
 * @boou8_data: boou8_data description
 * @u16_data: u16_data description
 * @u32_data: u32_data description
 * @u64_data: u64_data description
 * @str: str description
 * @prop_name: prop name description
 */

struct property_entry {
	const char *name;
	size_t length;
	bool is_inline;   /* TEST */
	struct foo {
		char *bar[12];
		struct {
			enum enum_type my_enum; /* TEST	2 */
			struct {
				uint_t test; /* TEST 3 */
				static const int anonymous;
			};
		} foobar ;
		;;
		{};
	};
	enum dev_prop_type type;

	enum {
		EXPRESSION_LITERAL,
		EXPRESSION_BINARY,
		EXPRESSION_UNARY,
		EXPRESSION_FUNCTION,
		EXPRESSION_ARRAY
	} literal;

	union {
		const void *pointer;
		union {
			u8 boou8_data[sizeof(u64) / sizeof(u8)];
			u16 u16_data[sizeof(u64) / sizeof(u16)];
			u32 u32_data[sizeof(u64) / sizeof(u32)];
			u64 u64_data[sizeof(u64) / sizeof(u64)];
			const char *str[sizeof(u64) / sizeof(char *)];
		};
	};
	char *prop_name;
};
</snip>

Kernel-doc produces a proper result:

<snip>
Ignoring foobar


.. c:struct:: property_entry

  property entry

.. container:: kernelindent

  **Definition**::

    struct property_entry {
        const char *name;
        size_t length;
        bool is_inline;
        struct foo {
            char *bar[12];
            struct {
                enum enum_type my_enum;
                struct {
                    uint_t test;
                    static const int anonymous;
                };
            } foobar;
            {
            };
        };
        enum dev_prop_type type;
        enum {
            EXPRESSION_LITERAL,
            EXPRESSION_BINARY,
            EXPRESSION_UNARY,
            EXPRESSION_FUNCTION,
        EXPRESSION_ARRAY } literal;
        union {
            const void *pointer;
            union {
                u8 boou8_data[sizeof(u64) / sizeof(u8)];
                u16 u16_data[sizeof(u64) / sizeof(u16)];
                u32 u32_data[sizeof(u64) / sizeof(u32)];
                u64 u64_data[sizeof(u64) / sizeof(u64)];
                const char *str[sizeof(u64) / sizeof(char *)];
            };
        };
        char *prop_name;
        }
    };

  **Members**

  ``{unnamed_struct}``
    anonymous

  ``name``
    name description

  ``length``
    length description

  ``is_inline``
    is_inline description

  ``bar``
    bar description

  ``my_enum``
    my_enum description

  ``{unnamed_struct}``
    anonymous

  ``test``
    test description

  ``anonymous``
    anon description

  ``type``
    type description

  ``literal``
    literal description

  ``{unnamed_union}``
    anonymous

  ``pointer``
    pointer description

  ``boou8_data``
    boou8_data description

  ``u16_data``
    u16_data description

  ``u32_data``
    u32_data description

  ``u64_data``
    u64_data description

  ``str``
    str description

  ``prop_name``
    prop name description
</snip>




^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2026-03-24 15:33 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-03-20  9:46 [PATCH RFC 0/2] kernel-doc: better handle data prototypes Mauro Carvalho Chehab
2026-03-20  9:46 ` [PATCH RFC 1/2] docs: kdoc: add a class to parse data items Mauro Carvalho Chehab
2026-03-20  9:46 ` [PATCH RFC 2/2] HACK: add a parse_c.py file to test CDataParser Mauro Carvalho Chehab
2026-03-24 15:33 ` [PATCH RFC 0/2] kernel-doc: better handle data prototypes Mauro Carvalho Chehab

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.