* [PATCH v3 00/22] kernel-doc: use a C lexical tokenizer for transforms
@ 2026-03-17 18:09 Mauro Carvalho Chehab
2026-03-17 18:09 ` [PATCH v3 01/22] docs: python: add helpers to run unit tests Mauro Carvalho Chehab
` (22 more replies)
0 siblings, 23 replies; 28+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-17 18:09 UTC (permalink / raw)
To: Jonathan Corbet, Kees Cook, Mauro Carvalho Chehab
Cc: Mauro Carvalho Chehab, linux-doc, linux-hardening, linux-kernel,
Gustavo A. R. Silva, Aleksandr Loktionov, Randy Dunlap,
Shuah Khan, Vincent Mailhol
Hi Jon,
Here is v3 of the series, where I basically addressed the
points you mentioned. Besides that, I fixed the CMatch
group(0) logic, and opted to create a special token for ";",
as it simplifies the code a little bit and will likely help to
simplify future changes.
-
This patch series changes how the kdoc parser handles macro
replacements. Instead of heavily relying on regular expressions,
which can sometimes be very complex, it uses a C lexical tokenizer.
This ensures that BEGIN/END blocks on functions and structs are
properly handled, even when nested.
Comparing the output before and after the patch series, for both
man pages and ReST, the only differences were:
- whitespace differences;
- struct_group macros are now shown as inner anonymous structs,
  as they should be.
Also, I didn't notice any relevant change in the documentation build
time. In that regard, right now, every time a CMatch replacement
rule takes place, it does:
for each transform:
- tokenize the source code;
- handle CMatch;
- convert tokens back to a string.
A possible optimization would be to do, instead:
- tokenize the source code;
- for each transform, handle CMatch;
- convert tokens back to a string.
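Just to make the idea concrete, the optimized flow could look like this
rough sketch (tokenize(), apply_cmatch() and detokenize() here are
placeholder names for illustration, not the actual kdoc/c_lex API):

```python
# Hypothetical sketch of the single-tokenization flow. The helpers
# below are stand-ins: the real tokenizer and CMatch logic live in
# c_lex.py and are considerably more elaborate.

def tokenize(source):
    # Stand-in for the real C tokenizer: one token per word.
    return source.split()

def apply_cmatch(tokens, old, new):
    # Stand-in for a CMatch replacement rule.
    return [new if t == old else t for t in tokens]

def detokenize(tokens):
    # Convert the token list back into a string.
    return " ".join(tokens)

def run_transforms(source, transforms):
    # Tokenize once, apply every transform on the token list,
    # and convert back to a string only at the very end.
    tokens = tokenize(source)
    for old, new in transforms:
        tokens = apply_cmatch(tokens, old, new)
    return detokenize(tokens)
```

The point is that tokenize() and detokenize() run once per source,
instead of once per transform.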
For now, I opted not to do it, because:
- it would be too many changes at once;
- the docs build time is ~3:30 minutes, which is
  about the same as it took before the changes;
- there is a very dirty hack inside function_xforms:
  (KernRe(r"_noprof"), ""). This is meant to change
  function prototypes instead of function arguments.
So, if it is OK for you, I would prefer to merge this one first. We can
later optimize kdoc_parser to avoid multiple token <-> string conversions.
-
One important aspect of this series is that it introduces unit tests
for kernel-doc. I used them a lot during the development of this series,
to ensure that the changes I was making produced the expected
results. Tests are in two separate files that can be executed directly.
Alternatively, there is a run.py script that runs all of them (and
any other Python script named tools/unittests/test_*.py):
$ tools/unittests/run.py
test_cmatch:
TestSearch:
test_search_acquires_multiple: OK
test_search_acquires_nested_paren: OK
test_search_acquires_simple: OK
test_search_must_hold: OK
test_search_must_hold_shared: OK
test_search_no_false_positive: OK
test_search_no_function: OK
test_search_no_macro_remains: OK
TestSubMultipleMacros:
test_acquires_multiple: OK
test_acquires_nested_paren: OK
test_acquires_simple: OK
test_mixed_macros: OK
test_must_hold: OK
test_must_hold_shared: OK
test_no_false_positive: OK
test_no_function: OK
test_no_macro_remains: OK
TestSubSimple:
test_rise_early_greedy: OK
test_rise_multiple_greedy: OK
test_strip_multiple_acquires: OK
test_sub_count_parameter: OK
test_sub_mixed_placeholders: OK
test_sub_multiple_placeholders: OK
test_sub_no_placeholder: OK
test_sub_single_placeholder: OK
test_sub_with_capture: OK
test_sub_zero_placeholder: OK
TestSubWithLocalXforms:
test_functions_with_acquires_and_releases: OK
test_raw_struct_group: OK
test_raw_struct_group_tagged: OK
test_struct_group: OK
test_struct_group_attr: OK
test_struct_group_tagged_with_private: OK
test_struct_kcov: OK
test_vars_stackdepot: OK
test_tokenizer:
TestPublicPrivate:
test_balanced_inner_private: OK
test_balanced_non_greddy_private: OK
test_balanced_private: OK
test_no private: OK
test_unbalanced_inner_private: OK
test_unbalanced_private: OK
test_unbalanced_struct_group_tagged_with_private: OK
test_unbalanced_two_struct_group_tagged_first_with_private: OK
test_unbalanced_without_end_of_line: OK
TestTokenizer:
test_basic_tokens: OK
test_depth_counters: OK
test_mismatch_error: OK
Ran 47 tests
---
v3:
- Avoided code addition/removal by applying the changes directly
at the new kdoc/c_lex.py file;
- ";" has now its own token (ENDSTMT). That simplifies the code
a little bit and will help further improvements;
- renamed TOKEN_LIST to RE_SCANNER_LIST;
- simplified regular expressions where possible;
- added some comments for some weird stuff like \s\S regex;
- CTokenizer __init__() method moved to the beginning of the class;
- fixed a logic parsing CToken.BEGIN when picking group(0);
- fixed two typos.
v2:
- Added 8 more patches fixing several bugs and modifying unittests
accordingly:
- don't raise exceptions when not needed;
- don't report errors reporting lack of END if there's no BEGIN
at the last replacement string;
- document private scope propagation;
- some changes at unittests to reflect current status;
- addition of two unittests to check error raise logic at c_lex.
Mauro Carvalho Chehab (22):
docs: python: add helpers to run unit tests
unittests: add a testbench to check public/private kdoc comments
docs: kdoc: don't add broken comments inside prototypes
docs: kdoc: properly handle empty enum arguments
docs: add a C tokenizer to be used by kernel-doc
docs: kdoc: use tokenizer to handle comments on structs
unittests: test_private: modify it to use CTokenizer directly
unittests: test_tokenizer: check if the tokenizer works
unittests: add a runner to execute all unittests
docs: kdoc: create a CMatch to match nested C blocks
tools: unittests: add tests for CMatch
docs: c_lex: properly implement a sub() method for CMatch
unittests: test_cmatch: add tests for sub()
docs: kdoc: replace NestedMatch with CMatch
docs: kdoc_re: get rid of NestedMatch class
docs: xforms_lists: handle struct_group directly
docs: xforms_lists: better evaluate struct_group macros
docs: c_lex: setup a logger to report tokenizer issues
docs: kernel-doc.rst: document private: scope propagation
docs: kdoc: ensure that comments are dropped before calling
split_struct_proto()
docs: kdoc_parser: avoid tokenizing structs everytime
docs: xforms_lists: use CMatch for all identifiers
Documentation/doc-guide/kernel-doc.rst | 6 +
Documentation/tools/python.rst | 2 +
Documentation/tools/unittest.rst | 24 +
tools/lib/python/kdoc/c_lex.py | 655 ++++++++++++++++++++
tools/lib/python/kdoc/kdoc_parser.py | 35 +-
tools/lib/python/kdoc/kdoc_re.py | 201 ------
tools/lib/python/kdoc/xforms_lists.py | 237 ++++---
tools/lib/python/unittest_helper.py | 353 +++++++++++
tools/unittests/run.py | 17 +
tools/unittests/test_cmatch.py | 821 +++++++++++++++++++++++++
tools/unittests/test_tokenizer.py | 462 ++++++++++++++
11 files changed, 2470 insertions(+), 343 deletions(-)
create mode 100644 Documentation/tools/unittest.rst
create mode 100644 tools/lib/python/kdoc/c_lex.py
create mode 100755 tools/lib/python/unittest_helper.py
create mode 100755 tools/unittests/run.py
create mode 100755 tools/unittests/test_cmatch.py
create mode 100755 tools/unittests/test_tokenizer.py
--
2.52.0
^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH v3 01/22] docs: python: add helpers to run unit tests
2026-03-17 18:09 [PATCH v3 00/22] kernel-doc: use a C lexical tokenizer for transforms Mauro Carvalho Chehab
@ 2026-03-17 18:09 ` Mauro Carvalho Chehab
2026-03-17 18:09 ` [PATCH v3 02/22] unittests: add a testbench to check public/private kdoc comments Mauro Carvalho Chehab
` (21 subsequent siblings)
22 siblings, 0 replies; 28+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-17 18:09 UTC (permalink / raw)
To: Jonathan Corbet, Linux Doc Mailing List
Cc: Mauro Carvalho Chehab, linux-hardening, linux-kernel,
Mauro Carvalho Chehab, Shuah Khan
While the Python standard library has support for unit tests, its
output is not nice. Add a helper module to improve it.
I wrote this module last year while testing some scripts I used
internally. The initial skeleton was generated with the help of
LLM tools, but it was highly modified to ensure that it works
as I would expect.
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Message-ID: <37999041f616ddef41e84cf2686c0264d1a51dc9.1773074166.git.mchehab+huawei@kernel.org>
---
Documentation/tools/python.rst | 2 +
Documentation/tools/unittest.rst | 24 ++
tools/lib/python/unittest_helper.py | 353 ++++++++++++++++++++++++++++
3 files changed, 379 insertions(+)
create mode 100644 Documentation/tools/unittest.rst
create mode 100755 tools/lib/python/unittest_helper.py
diff --git a/Documentation/tools/python.rst b/Documentation/tools/python.rst
index 1444c1816735..3b7299161f20 100644
--- a/Documentation/tools/python.rst
+++ b/Documentation/tools/python.rst
@@ -11,3 +11,5 @@ Python libraries
feat
kdoc
kabi
+
+ unittest
diff --git a/Documentation/tools/unittest.rst b/Documentation/tools/unittest.rst
new file mode 100644
index 000000000000..14a2b2a65236
--- /dev/null
+++ b/Documentation/tools/unittest.rst
@@ -0,0 +1,24 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============
+Python unittest
+===============
+
+Checking the consistency of Python modules can be complex. Sometimes, it
+is useful to define a set of unit tests to help check them.
+
+While the actual test implementation is use-case dependent, Python already
+provides a standard way to add unit tests by using ``import unittest``.
+
+Using such a class requires setting up a test suite. Also, the default
+format is a little bit awkward. To improve it and provide a more uniform
+way to report errors, some unittest classes and functions are defined.
+
+
+Unittest helper module
+======================
+
+.. automodule:: lib.python.unittest_helper
+ :members:
+ :show-inheritance:
+ :undoc-members:
diff --git a/tools/lib/python/unittest_helper.py b/tools/lib/python/unittest_helper.py
new file mode 100755
index 000000000000..55d444cd73d4
--- /dev/null
+++ b/tools/lib/python/unittest_helper.py
@@ -0,0 +1,353 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+# Copyright(c) 2025-2026: Mauro Carvalho Chehab <mchehab@kernel.org>.
+#
+# pylint: disable=C0103,R0912,R0914,E1101
+
+"""
+Provides helper functions and classes to execute Python unit tests.
+
+These helper functions provide a nice colored output summary of each
+executed test and, when a test fails, show the difference in diff
+format when running in verbose mode, like::
+
+ $ tools/unittests/nested_match.py -v
+ ...
+ Traceback (most recent call last):
+ File "/new_devel/docs/tools/unittests/nested_match.py", line 69, in test_count_limit
+ self.assertEqual(replaced, "bar(a); bar(b); foo(c)")
+ ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ AssertionError: 'bar(a) foo(b); foo(c)' != 'bar(a); bar(b); foo(c)'
+ - bar(a) foo(b); foo(c)
+ ? ^^^^
+ + bar(a); bar(b); foo(c)
+ ? ^^^^^
+ ...
+
+It also allows filtering what tests will be executed via ``-k`` parameter.
+
+Typical usage is to do::
+
+ from unittest_helper import run_unittest
+ ...
+
+ if __name__ == "__main__":
+ run_unittest(__file__)
+
+If passing arguments is needed, in a more complex scenario, it can be
+used as in this example::
+
+ from unittest_helper import TestUnits, run_unittest
+ ...
+ env = {'sudo': ""}
+ ...
+ if __name__ == "__main__":
+ runner = TestUnits()
+ base_parser = runner.parse_args()
+ base_parser.add_argument('--sudo', action='store_true',
+ help='Enable tests requiring sudo privileges')
+
+ args = base_parser.parse_args()
+
+ # Update module-level flag
+ if args.sudo:
+ env['sudo'] = "1"
+
+ # Run tests with customized arguments
+ runner.run(__file__, parser=base_parser, args=args, env=env)
+"""
+
+import argparse
+import atexit
+import os
+import re
+import unittest
+import sys
+
+from unittest.mock import patch
+
+
+class Summary(unittest.TestResult):
+ """
+ Overrides ``unittest.TestResult`` class to provide a nice colored
+ summary. When in verbose mode, displays actual/expected difference in
+ unified diff format.
+ """
+ def __init__(self, *args, **kwargs):
+ super().__init__(*args, **kwargs)
+
+ #: Dictionary to store organized test results.
+ self.test_results = {}
+
+ #: max length of the test names.
+ self.max_name_length = 0
+
+ def startTest(self, test):
+ super().startTest(test)
+ test_id = test.id()
+ parts = test_id.split(".")
+
+ # Extract module, class, and method names
+ if len(parts) >= 3:
+ module_name = parts[-3]
+ else:
+ module_name = ""
+ if len(parts) >= 2:
+ class_name = parts[-2]
+ else:
+ class_name = ""
+
+ method_name = parts[-1]
+
+ # Build the hierarchical structure
+ if module_name not in self.test_results:
+ self.test_results[module_name] = {}
+
+ if class_name not in self.test_results[module_name]:
+ self.test_results[module_name][class_name] = []
+
+ # Track maximum test name length for alignment
+ display_name = f"{method_name}:"
+
+ self.max_name_length = max(len(display_name), self.max_name_length)
+
+ def _record_test(self, test, status):
+ test_id = test.id()
+ parts = test_id.split(".")
+ if len(parts) >= 3:
+ module_name = parts[-3]
+ else:
+ module_name = ""
+ if len(parts) >= 2:
+ class_name = parts[-2]
+ else:
+ class_name = ""
+ method_name = parts[-1]
+ self.test_results[module_name][class_name].append((method_name, status))
+
+ def addSuccess(self, test):
+ super().addSuccess(test)
+ self._record_test(test, "OK")
+
+ def addFailure(self, test, err):
+ super().addFailure(test, err)
+ self._record_test(test, "FAIL")
+
+ def addError(self, test, err):
+ super().addError(test, err)
+ self._record_test(test, "ERROR")
+
+ def addSkip(self, test, reason):
+ super().addSkip(test, reason)
+ self._record_test(test, f"SKIP ({reason})")
+
+ def printResults(self):
+ """
+ Print results using colors if tty.
+ """
+ # Check for ANSI color support
+ use_color = sys.stdout.isatty()
+ COLORS = {
+ "OK": "\033[32m", # Green
+ "FAIL": "\033[31m", # Red
+ "SKIP": "\033[1;33m", # Yellow
+ "PARTIAL": "\033[33m", # Orange
+ "EXPECTED_FAIL": "\033[36m", # Cyan
+ "reset": "\033[0m", # Reset to default terminal color
+ }
+ if not use_color:
+ for c in COLORS:
+ COLORS[c] = ""
+
+ # Calculate maximum test name length
+ if not self.test_results:
+ return
+ try:
+ lengths = []
+ for module in self.test_results.values():
+ for tests in module.values():
+ for test_name, _ in tests:
+ lengths.append(len(test_name) + 1) # +1 for colon
+ max_length = max(lengths) + 2 # Additional padding
+ except ValueError:
+ sys.exit("Test list is empty")
+
+ # Print results
+ for module_name, classes in self.test_results.items():
+ print(f"{module_name}:")
+ for class_name, tests in classes.items():
+ print(f" {class_name}:")
+ for test_name, status in tests:
+ # Get base status without reason for SKIP
+ if status.startswith("SKIP"):
+ status_code = status.split()[0]
+ else:
+ status_code = status
+ color = COLORS.get(status_code, "")
+ print(
+ f" {test_name + ':':<{max_length}}{color}{status}{COLORS['reset']}"
+ )
+ print()
+
+ # Print summary
+ print(f"\nRan {self.testsRun} tests", end="")
+ if hasattr(self, "timeTaken"):
+ print(f" in {self.timeTaken:.3f}s", end="")
+ print()
+
+ if not self.wasSuccessful():
+ print(f"\n{COLORS['FAIL']}FAILED (", end="")
+ failures = getattr(self, "failures", [])
+ errors = getattr(self, "errors", [])
+ if failures:
+ print(f"failures={len(failures)}", end="")
+ if errors:
+ if failures:
+ print(", ", end="")
+ print(f"errors={len(errors)}", end="")
+ print(f"){COLORS['reset']}")
+
+
+def flatten_suite(suite):
+ """Flatten test suite hierarchy."""
+ tests = []
+ for item in suite:
+ if isinstance(item, unittest.TestSuite):
+ tests.extend(flatten_suite(item))
+ else:
+ tests.append(item)
+ return tests
+
+
+class TestUnits:
+ """
+ Helper class to set verbosity level.
+
+ This class discovers test files, imports their unittest classes, and
+ executes the tests in them.
+ """
+ def parse_args(self):
+ """Returns a parser for command line arguments."""
+ parser = argparse.ArgumentParser(description="Test runner with regex filtering")
+ parser.add_argument("-v", "--verbose", action="count", default=1)
+ parser.add_argument("-f", "--failfast", action="store_true")
+ parser.add_argument("-k", "--keyword",
+ help="Regex pattern to filter test methods")
+ return parser
+
+ def run(self, caller_file=None, pattern=None,
+ suite=None, parser=None, args=None, env=None):
+ """
+ Execute all tests from the unit test file.
+
+ It contains several optional parameters:
+
+ ``caller_file``:
+ - name of the file that contains test.
+
+ typical usage is to place __file__ at the caller test, e.g.::
+
+ if __name__ == "__main__":
+ TestUnits().run(__file__)
+
+ ``pattern``:
+ - optional pattern to match multiple file names. Defaults
+ to basename of ``caller_file``.
+
+ ``suite``:
+ - an unittest suite initialized by the caller using
+ ``unittest.TestLoader().discover()``.
+
+ ``parser``:
+ - an argparse parser. If not defined, this helper will create
+ one.
+
+ ``args``:
+ - an ``argparse.Namespace`` data filled by the caller.
+
+ ``env``:
+ - environment variables that will be passed to the test suite
+
+ At least ``caller_file`` or ``suite`` must be used, otherwise a
+ ``TypeError`` will be raised.
+ """
+ if not args:
+ if not parser:
+ parser = self.parse_args()
+ args = parser.parse_args()
+
+ if not caller_file and not suite:
+ raise TypeError("Either caller_file or suite is needed at TestUnits")
+
+ verbose = args.verbose
+
+ if not env:
+ env = os.environ.copy()
+
+ env["VERBOSE"] = f"{verbose}"
+
+ patcher = patch.dict(os.environ, env)
+ patcher.start()
+ # ensure it gets stopped after
+ atexit.register(patcher.stop)
+
+
+ if verbose >= 2:
+ unittest.TextTestRunner(verbosity=verbose).run = lambda suite: suite
+
+ # Load ONLY tests from the calling file
+ if not suite:
+ if not pattern:
+ pattern = caller_file
+
+ loader = unittest.TestLoader()
+ suite = loader.discover(start_dir=os.path.dirname(caller_file),
+ pattern=os.path.basename(caller_file))
+
+ # Flatten the suite for environment injection
+ tests_to_inject = flatten_suite(suite)
+
+ # Filter tests by method name if -k specified
+ if args.keyword:
+ try:
+ pattern = re.compile(args.keyword)
+ filtered_suite = unittest.TestSuite()
+ for test in tests_to_inject: # Use the pre-flattened list
+ method_name = test.id().split(".")[-1]
+ if pattern.search(method_name):
+ filtered_suite.addTest(test)
+ suite = filtered_suite
+ except re.error as e:
+ sys.stderr.write(f"Invalid regex pattern: {e}\n")
+ sys.exit(1)
+ else:
+ # Maintain original suite structure if no keyword filtering
+ suite = unittest.TestSuite(tests_to_inject)
+
+ if verbose >= 2:
+ resultclass = None
+ else:
+ resultclass = Summary
+
+ runner = unittest.TextTestRunner(verbosity=args.verbose,
+ resultclass=resultclass,
+ failfast=args.failfast)
+ result = runner.run(suite)
+ if resultclass:
+ result.printResults()
+
+ sys.exit(not result.wasSuccessful())
+
+
+def run_unittest(fname):
+ """
+ Basic usage of TestUnits class.
+
+ Use it when there's no need to pass any extra arguments to the
+ tests. The recommended way is to place this at the end of each
+ unittest module::
+
+ if __name__ == "__main__":
+ run_unittest(__file__)
+ """
+ TestUnits().run(fname)
--
2.52.0
* [PATCH v3 02/22] unittests: add a testbench to check public/private kdoc comments
2026-03-17 18:09 [PATCH v3 00/22] kernel-doc: use a C lexical tokenizer for transforms Mauro Carvalho Chehab
2026-03-17 18:09 ` [PATCH v3 01/22] docs: python: add helpers to run unit tests Mauro Carvalho Chehab
@ 2026-03-17 18:09 ` Mauro Carvalho Chehab
2026-03-17 18:09 ` [PATCH v3 03/22] docs: kdoc: don't add broken comments inside prototypes Mauro Carvalho Chehab
` (20 subsequent siblings)
22 siblings, 0 replies; 28+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-17 18:09 UTC (permalink / raw)
To: Jonathan Corbet, Linux Doc Mailing List
Cc: Mauro Carvalho Chehab, linux-hardening, linux-kernel
Add unit tests to check whether the public/private handling and
comment stripping work properly.
Running them shows that, in several cases, public/private is not
doing what is expected:
test_private:
TestPublicPrivate:
test balanced_inner_private: OK
test balanced_non_greddy_private: OK
test balanced_private: OK
test no private: OK
test unbalanced_inner_private: FAIL
test unbalanced_private: FAIL
test unbalanced_struct_group_tagged_with_private: FAIL
test unbalanced_two_struct_group_tagged_first_with_private: FAIL
test unbalanced_without_end_of_line: FAIL
Ran 9 tests
FAILED (failures=5)
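For illustration, the private:/public: trimming semantics being tested
can be sketched with a much-simplified, regex-only trimmer. This is
hypothetical code: the real trim_private_members() in kdoc_parser is
brace-aware and also handles the nested and struct_group cases above.

```python
import re

def trim_private_simple(source):
    # Naive sketch: drop everything between a "/* private: */" comment
    # and the next "/* public: */" comment (balanced case), then drop
    # any remaining private: section up to the block's closing "};"
    # (unbalanced case). Unlike the real parser, this does not track
    # brace depth, so nested structs are not handled correctly.
    source = re.sub(r"/\*\s*private:[^*]*\*/.*?/\*\s*public:[^*]*\*/",
                    "", source, flags=re.DOTALL)
    source = re.sub(r"/\*\s*private:[^*]*\*/.*?(?=\n\s*};)",
                    "", source, flags=re.DOTALL)
    return source
```

Normalizing whitespace before comparing, as the real tests do, makes
the expected output easy to express.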
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Message-ID: <144f4952e0cb74fe9c9adc117e9a21ec8aa1cc10.1773074166.git.mchehab+huawei@kernel.org>
---
tools/unittests/test_private.py | 331 ++++++++++++++++++++++++++++++++
1 file changed, 331 insertions(+)
create mode 100755 tools/unittests/test_private.py
diff --git a/tools/unittests/test_private.py b/tools/unittests/test_private.py
new file mode 100755
index 000000000000..eae245ae8a12
--- /dev/null
+++ b/tools/unittests/test_private.py
@@ -0,0 +1,331 @@
+#!/usr/bin/env python3
+
+"""
+Unit tests for struct/union member extractor class.
+"""
+
+
+import os
+import re
+import unittest
+import sys
+
+from unittest.mock import MagicMock
+
+SRC_DIR = os.path.dirname(os.path.realpath(__file__))
+sys.path.insert(0, os.path.join(SRC_DIR, "../lib/python"))
+
+from kdoc.kdoc_parser import trim_private_members
+from unittest_helper import run_unittest
+
+#
+# List of tests.
+#
+# The code will dynamically generate one test for each key on this dictionary.
+#
+
+#: Tests to check if CTokenizer is properly handling public/private comments.
+TESTS_PRIVATE = {
+ #
+ # Simplest case: no private. Ensure that trimming won't affect struct
+ #
+ "no private": {
+ "source": """
+ struct foo {
+ int a;
+ int b;
+ int c;
+ };
+ """,
+ "trimmed": """
+ struct foo {
+ int a;
+ int b;
+ int c;
+ };
+ """,
+ },
+
+ #
+ # Play "by the books" by always having a public in place
+ #
+
+ "balanced_private": {
+ "source": """
+ struct foo {
+ int a;
+ /* private: */
+ int b;
+ /* public: */
+ int c;
+ };
+ """,
+ "trimmed": """
+ struct foo {
+ int a;
+ int c;
+ };
+ """,
+ },
+
+ "balanced_non_greddy_private": {
+ "source": """
+ struct foo {
+ int a;
+ /* private: */
+ int b;
+ /* public: */
+ int c;
+ /* private: */
+ int d;
+ /* public: */
+ int e;
+
+ };
+ """,
+ "trimmed": """
+ struct foo {
+ int a;
+ int c;
+ int e;
+ };
+ """,
+ },
+
+ "balanced_inner_private": {
+ "source": """
+ struct foo {
+ struct {
+ int a;
+ /* private: ignore below */
+ int b;
+ /* public: but this should not be ignored */
+ };
+ int b;
+ };
+ """,
+ "trimmed": """
+ struct foo {
+ struct {
+ int a;
+ };
+ int b;
+ };
+ """,
+ },
+
+ #
+ # Test what happens if there's no public: after a private: marker
+ #
+
+ "unbalanced_private": {
+ "source": """
+ struct foo {
+ int a;
+ /* private: */
+ int b;
+ int c;
+ };
+ """,
+ "trimmed": """
+ struct foo {
+ int a;
+ };
+ """,
+ },
+
+ "unbalanced_inner_private": {
+ "source": """
+ struct foo {
+ struct {
+ int a;
+ /* private: ignore below */
+ int b;
+ /* but this should not be ignored */
+ };
+ int b;
+ };
+ """,
+ "trimmed": """
+ struct foo {
+ struct {
+ int a;
+ };
+ int b;
+ };
+ """,
+ },
+
+ "unbalanced_struct_group_tagged_with_private": {
+ "source": """
+ struct page_pool_params {
+ struct_group_tagged(page_pool_params_fast, fast,
+ unsigned int order;
+ unsigned int pool_size;
+ int nid;
+ struct device *dev;
+ struct napi_struct *napi;
+ enum dma_data_direction dma_dir;
+ unsigned int max_len;
+ unsigned int offset;
+ };
+ struct_group_tagged(page_pool_params_slow, slow,
+ struct net_device *netdev;
+ unsigned int queue_idx;
+ unsigned int flags;
+ /* private: used by test code only */
+ void (*init_callback)(netmem_ref netmem, void *arg);
+ void *init_arg;
+ };
+ };
+ """,
+ "trimmed": """
+ struct page_pool_params {
+ struct_group_tagged(page_pool_params_fast, fast,
+ unsigned int order;
+ unsigned int pool_size;
+ int nid;
+ struct device *dev;
+ struct napi_struct *napi;
+ enum dma_data_direction dma_dir;
+ unsigned int max_len;
+ unsigned int offset;
+ };
+ struct_group_tagged(page_pool_params_slow, slow,
+ struct net_device *netdev;
+ unsigned int queue_idx;
+ unsigned int flags;
+ };
+ };
+ """,
+ },
+
+ "unbalanced_two_struct_group_tagged_first_with_private": {
+ "source": """
+ struct page_pool_params {
+ struct_group_tagged(page_pool_params_slow, slow,
+ struct net_device *netdev;
+ unsigned int queue_idx;
+ unsigned int flags;
+ /* private: used by test code only */
+ void (*init_callback)(netmem_ref netmem, void *arg);
+ void *init_arg;
+ };
+ struct_group_tagged(page_pool_params_fast, fast,
+ unsigned int order;
+ unsigned int pool_size;
+ int nid;
+ struct device *dev;
+ struct napi_struct *napi;
+ enum dma_data_direction dma_dir;
+ unsigned int max_len;
+ unsigned int offset;
+ };
+ };
+ """,
+ "trimmed": """
+ struct page_pool_params {
+ struct_group_tagged(page_pool_params_slow, slow,
+ struct net_device *netdev;
+ unsigned int queue_idx;
+ unsigned int flags;
+ };
+ struct_group_tagged(page_pool_params_fast, fast,
+ unsigned int order;
+ unsigned int pool_size;
+ int nid;
+ struct device *dev;
+ struct napi_struct *napi;
+ enum dma_data_direction dma_dir;
+ unsigned int max_len;
+ unsigned int offset;
+ };
+ };
+ """,
+ },
+ "unbalanced_without_end_of_line": {
+ "source": """ \
+ struct page_pool_params { \
+ struct_group_tagged(page_pool_params_slow, slow, \
+ struct net_device *netdev; \
+ unsigned int queue_idx; \
+ unsigned int flags;
+ /* private: used by test code only */
+ void (*init_callback)(netmem_ref netmem, void *arg); \
+ void *init_arg; \
+ }; \
+ struct_group_tagged(page_pool_params_fast, fast, \
+ unsigned int order; \
+ unsigned int pool_size; \
+ int nid; \
+ struct device *dev; \
+ struct napi_struct *napi; \
+ enum dma_data_direction dma_dir; \
+ unsigned int max_len; \
+ unsigned int offset; \
+ }; \
+ };
+ """,
+ "trimmed": """
+ struct page_pool_params {
+ struct_group_tagged(page_pool_params_slow, slow,
+ struct net_device *netdev;
+ unsigned int queue_idx;
+ unsigned int flags;
+ };
+ struct_group_tagged(page_pool_params_fast, fast,
+ unsigned int order;
+ unsigned int pool_size;
+ int nid;
+ struct device *dev;
+ struct napi_struct *napi;
+ enum dma_data_direction dma_dir;
+ unsigned int max_len;
+ unsigned int offset;
+ };
+ };
+ """,
+ },
+}
+
+
+class TestPublicPrivate(unittest.TestCase):
+ """
+ Main test class. Populated dynamically at runtime.
+ """
+
+ def setUp(self):
+ self.maxDiff = None
+
+ def add_test(cls, name, source, trimmed):
+ """
+ Dynamically add a test to the class
+ """
+ def test(cls):
+ result = trim_private_members(source)
+
+ result = re.sub(r"\s++", " ", result).strip()
+ expected = re.sub(r"\s++", " ", trimmed).strip()
+
+ msg = "failed when parsing this source:\n" + source
+
+ cls.assertEqual(result, expected, msg=msg)
+
+ test.__name__ = f'test {name}'
+
+ setattr(TestPublicPrivate, test.__name__, test)
+
+
+#
+# Populate TestPublicPrivate class
+#
+test_class = TestPublicPrivate()
+for name, test in TESTS_PRIVATE.items():
+ test_class.add_test(name, test["source"], test["trimmed"])
+
+
+#
+# main
+#
+if __name__ == "__main__":
+ run_unittest(__file__)
--
2.52.0
* [PATCH v3 03/22] docs: kdoc: don't add broken comments inside prototypes
2026-03-17 18:09 [PATCH v3 00/22] kernel-doc: use a C lexical tokenizer for transforms Mauro Carvalho Chehab
2026-03-17 18:09 ` [PATCH v3 01/22] docs: python: add helpers to run unit tests Mauro Carvalho Chehab
2026-03-17 18:09 ` [PATCH v3 02/22] unittests: add a testbench to check public/private kdoc comments Mauro Carvalho Chehab
@ 2026-03-17 18:09 ` Mauro Carvalho Chehab
2026-03-18 7:47 ` Loktionov, Aleksandr
2026-03-17 18:09 ` [PATCH v3 04/22] docs: kdoc: properly handle empty enum arguments Mauro Carvalho Chehab
` (19 subsequent siblings)
22 siblings, 1 reply; 28+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-17 18:09 UTC (permalink / raw)
To: Jonathan Corbet, Linux Doc Mailing List
Cc: Mauro Carvalho Chehab, linux-hardening, linux-kernel,
Aleksandr Loktionov, Randy Dunlap
Parsing a file like drivers/scsi/isci/host.h, which contains
broken kernel-doc markups, makes kernel-doc create a prototype
that contains unmatched end-of-comment markers.
That causes, for instance, struct sci_power_control to be shown
with this prototype:
struct sci_power_control {
* it is not. */ bool timer_started;
*/ struct sci_timer timer;
* requesters field. */ u8 phys_waiting;
*/ u8 phys_granted_power;
* mapped into requesters via struct sci_phy.phy_index */ struct isci_phy *requesters[SCI_MAX_PHYS];
};
as the comments no longer start with "/*".
Fix the logic to detect such cases and keep the comments complete
inside the prototype, so they can be properly removed later.
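The failure mode is easy to reproduce with a plain comment-stripping
regex. This is only an illustration of the principle; the actual kdoc
comment-removal logic is different:

```python
import re

# A line that lost its opening "/**" because the kernel-doc markup
# was broken: only the tail of the comment remains.
line = " * it is not. */ bool timer_started;"

# A typical "remove complete comments" regex finds nothing to strip,
# since there is no matching "/*" opener...
stripped = re.sub(r"/\*.*?\*/", "", line)
assert stripped == line  # the stale "*/" leaks into the prototype

# ...but once the missing opener is restored, similar to what the fix
# does with process_proto_type(ln, "/**\n" + line), stripping works:
repaired = re.sub(r"/\*.*?\*/", "", "/**\n" + line, flags=re.DOTALL)
```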
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Message-ID: <18e577dbbd538dcc22945ff139fe3638344e14f0.1773074166.git.mchehab+huawei@kernel.org>
---
tools/lib/python/kdoc/kdoc_parser.py | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/tools/lib/python/kdoc/kdoc_parser.py b/tools/lib/python/kdoc/kdoc_parser.py
index edf70ba139a5..086579d00b5c 100644
--- a/tools/lib/python/kdoc/kdoc_parser.py
+++ b/tools/lib/python/kdoc/kdoc_parser.py
@@ -1355,6 +1355,12 @@ class KernelDoc:
elif doc_content.search(line):
self.emit_msg(ln, f"Incorrect use of kernel-doc format: {line}")
self.state = state.PROTO
+
+ #
+ # Don't let it add partial comments to the code, as that breaks the
+ # logic meant to remove comments from prototypes.
+ #
+ self.process_proto_type(ln, "/**\n" + line)
# else ... ??
def process_inline_text(self, ln, line):
--
2.52.0
* [PATCH v3 04/22] docs: kdoc: properly handle empty enum arguments
2026-03-17 18:09 [PATCH v3 00/22] kernel-doc: use a C lexical tokenizer for transforms Mauro Carvalho Chehab
` (2 preceding siblings ...)
2026-03-17 18:09 ` [PATCH v3 03/22] docs: kdoc: don't add broken comments inside prototypes Mauro Carvalho Chehab
@ 2026-03-17 18:09 ` Mauro Carvalho Chehab
2026-03-17 18:09 ` [PATCH v3 05/22] docs: add a C tokenizer to be used by kernel-doc Mauro Carvalho Chehab
` (18 subsequent siblings)
22 siblings, 0 replies; 28+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-17 18:09 UTC (permalink / raw)
To: Jonathan Corbet, Linux Doc Mailing List
Cc: Mauro Carvalho Chehab, linux-hardening, linux-kernel,
Aleksandr Loktionov, Randy Dunlap
Depending on how the enum prototype is written, a trailing comma
may incorrectly make kernel-doc parse an argument like " ".
Strip spaces before checking whether the argument is empty.
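The case is easy to see with a trailing comma (a minimal illustration
of the check, not the full parser logic):

```python
members = "FOO_A, FOO_B, "

# Splitting on "," yields a last element of " ", which is truthy,
# so a bare "if not arg:" check lets the bogus argument through:
args = members.split(",")
assert args[-1] == " "
assert bool(args[-1])  # non-empty string, passes "if not arg"

# Stripping first, as the fix does, filters it out:
params = [a.strip() for a in members.split(",") if a.strip()]
assert params == ["FOO_A", "FOO_B"]
```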
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Message-ID: <4182bfb7e5f5b4bbaf05cee1bede691e56247eaf.1773074166.git.mchehab+huawei@kernel.org>
---
tools/lib/python/kdoc/kdoc_parser.py | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/tools/lib/python/kdoc/kdoc_parser.py b/tools/lib/python/kdoc/kdoc_parser.py
index 086579d00b5c..4b3c555e6c8e 100644
--- a/tools/lib/python/kdoc/kdoc_parser.py
+++ b/tools/lib/python/kdoc/kdoc_parser.py
@@ -810,9 +810,10 @@ class KernelDoc:
member_set = set()
members = KernRe(r'\([^;)]*\)').sub('', members)
for arg in members.split(','):
- if not arg:
- continue
arg = KernRe(r'^\s*(\w+).*').sub(r'\1', arg)
+ if not arg.strip():
+ continue
+
self.entry.parameterlist.append(arg)
if arg not in self.entry.parameterdescs:
self.entry.parameterdescs[arg] = self.undescribed
--
2.52.0
* [PATCH v3 05/22] docs: add a C tokenizer to be used by kernel-doc
2026-03-17 18:09 [PATCH v3 00/22] kernel-doc: use a C lexical tokenizer for transforms Mauro Carvalho Chehab
` (3 preceding siblings ...)
2026-03-17 18:09 ` [PATCH v3 04/22] docs: kdoc: properly handle empty enum arguments Mauro Carvalho Chehab
@ 2026-03-17 18:09 ` Mauro Carvalho Chehab
2026-03-17 18:09 ` [PATCH v3 06/22] docs: kdoc: use tokenizer to handle comments on structs Mauro Carvalho Chehab
` (17 subsequent siblings)
22 siblings, 0 replies; 28+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-17 18:09 UTC (permalink / raw)
To: Jonathan Corbet, Linux Doc Mailing List
Cc: Mauro Carvalho Chehab, linux-hardening, linux-kernel
Handling C code purely using regular expressions doesn't work well.
Add a C tokenizer to help do it the right way.
The tokenizer was written using, as a basis, the tokenizer
example from the Python re documentation:
https://docs.python.org/3/library/re.html#writing-a-tokenizer
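The technique from the re documentation combines all token patterns into one alternation of named groups, then uses match.lastgroup to recover the token kind. A stripped-down sketch of that idea (not the kernel-doc implementation; the token names here are illustrative):

```python
import re

# Each token kind gets a named group; the first matching
# alternative wins, so order matters.
token_spec = [
    ("NUMBER", r"\d+"),
    ("NAME",   r"[A-Za-z_]\w*"),
    ("OP",     r"[=+\-*/]"),
    ("SKIP",   r"\s+"),
]
tok_regex = "|".join(f"(?P<{name}>{pat})" for name, pat in token_spec)

def tokenize(code):
    for m in re.finditer(tok_regex, code):
        if m.lastgroup != "SKIP":
            yield (m.lastgroup, m.group())

print(list(tokenize("x = 42")))
# [('NAME', 'x'), ('OP', '='), ('NUMBER', '42')]
```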
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
tools/lib/python/kdoc/c_lex.py | 292 +++++++++++++++++++++++++++++++++
1 file changed, 292 insertions(+)
create mode 100644 tools/lib/python/kdoc/c_lex.py
diff --git a/tools/lib/python/kdoc/c_lex.py b/tools/lib/python/kdoc/c_lex.py
new file mode 100644
index 000000000000..9d726f821f3f
--- /dev/null
+++ b/tools/lib/python/kdoc/c_lex.py
@@ -0,0 +1,292 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+# Copyright(c) 2025: Mauro Carvalho Chehab <mchehab@kernel.org>.
+
+"""
+C lexical tokenizer ancillary classes.
+
+These help tokenizing C code for kernel-doc.
+
+Please notice that the code here may raise exceptions to indicate bad
+usage inside kdoc, i.e. problems in the replace pattern.
+
+Other errors are logged via the log instance.
+"""
+
+import logging
+import re
+
+from .kdoc_re import KernRe
+
+log = logging.getLogger(__name__)
+
+
+class CToken():
+ """
+ Data class to define a C token.
+ """
+
+ # Tokens that can be used by the parser. Works like a C enum.
+
+ COMMENT = 0 #: A standard C or C99 comment, including delimiter.
+ STRING = 1 #: A string, including quotation marks.
+ CHAR = 2 #: A character, including apostrophes.
+ NUMBER = 3 #: A number.
+ PUNC = 4 #: A punctuation mark: ``,`` / ``.``.
+ BEGIN = 5 #: A begin character: ``{`` / ``[`` / ``(``.
+ END = 6 #: An end character: ``}`` / ``]`` / ``)``.
+ CPP = 7 #: A preprocessor macro.
+ HASH = 8 #: The hash character - useful to handle other macros.
+ OP = 9 #: A C operator (add, subtract, ...).
+ STRUCT = 10 #: A ``struct`` keyword.
+ UNION = 11 #: A ``union`` keyword.
+ ENUM = 12 #: An ``enum`` keyword.
+ TYPEDEF = 13 #: A ``typedef`` keyword.
+ NAME = 14 #: A name. Can be an ID or a type.
+ SPACE = 15 #: Any space characters, including newlines.
+ ENDSTMT = 16 #: End of a statement (``;``).
+
+ BACKREF = 17 #: Not a valid C sequence, but used in sub regex patterns.
+
+ MISMATCH = 255 #: An error indicator; should never happen in practice.
+
+ # Dict to convert from an enum integer into a string.
+ _name_by_val = {v: k for k, v in dict(vars()).items() if isinstance(v, int)}
+
+ # Dict to convert from string to an enum-like integer value.
+ _name_to_val = {k: v for v, k in _name_by_val.items()}
+
+ @staticmethod
+ def to_name(val):
+ """Convert from an integer value from CToken enum into a string"""
+
+ return CToken._name_by_val.get(val, f"UNKNOWN({val})")
+
+ @staticmethod
+ def from_name(name):
+ """Convert a string into a CToken enum value"""
+ if name in CToken._name_to_val:
+ return CToken._name_to_val[name]
+
+ return CToken.MISMATCH
+
+
+ def __init__(self, kind, value=None, pos=0,
+ brace_level=0, paren_level=0, bracket_level=0):
+ self.kind = kind
+ self.value = value
+ self.pos = pos
+ self.level = (bracket_level, paren_level, brace_level)
+
+ def __repr__(self):
+ name = self.to_name(self.kind)
+ if isinstance(self.value, str):
+ value = '"' + self.value + '"'
+ else:
+ value = self.value
+
+ return f"CToken(CToken.{name}, {value}, {self.pos}, {self.level})"
+
+#: Regexes to parse C code, transforming it into tokens.
+RE_SCANNER_LIST = [
+ #
+ # Note that \s\S is different than .*, as it also catches \n
+ #
+ (CToken.COMMENT, r"//[^\n]*|/\*[\s\S]*?\*/"),
+
+ (CToken.STRING, r'"(?:\\.|[^"\\])*"'),
+ (CToken.CHAR, r"'(?:\\.|[^'\\])'"),
+
+ (CToken.NUMBER, r"0[xX][\da-fA-F]+[uUlL]*|0[0-7]+[uUlL]*|"
+ r"\d+(?:\.\d*)?(?:[eE][+-]?\d+)?[fFlL]*"),
+
+ (CToken.ENDSTMT, r"(?:\s+;|;)"),
+
+ (CToken.PUNC, r"[,\.]"),
+
+ (CToken.BEGIN, r"[\[\(\{]"),
+
+ (CToken.END, r"[\]\)\}]"),
+
+ (CToken.CPP, r"#\s*(?:define|include|ifdef|ifndef|if|else|elif|endif|undef|pragma)\b"),
+
+ (CToken.HASH, r"#"),
+
+ (CToken.OP, r"\+\+|\-\-|\->|==|\!=|<=|>=|&&|\|\||<<|>>|\+=|\-=|\*=|/=|%="
+ r"|&=|\|=|\^=|[=\+\-\*/%<>&\|\^~!\?\:]"),
+
+ (CToken.STRUCT, r"\bstruct\b"),
+ (CToken.UNION, r"\bunion\b"),
+ (CToken.ENUM, r"\benum\b"),
+ (CToken.TYPEDEF, r"\btypedef\b"),
+
+ (CToken.NAME, r"[A-Za-z_]\w*"),
+
+ (CToken.SPACE, r"\s+"),
+
+ (CToken.BACKREF, r"\\\d+"),
+
+ (CToken.MISMATCH,r"."),
+]
+
+def fill_re_scanner(token_list):
+ """Ancillary routine to convert RE_SCANNER_LIST into a finditer regex"""
+ re_tokens = []
+
+ for kind, pattern in token_list:
+ name = CToken.to_name(kind)
+ re_tokens.append(f"(?P<{name}>{pattern})")
+
+ return KernRe("|".join(re_tokens), re.MULTILINE | re.DOTALL)
+
+#: Handle C continuation lines.
+RE_CONT = KernRe(r"\\\n")
+
+RE_COMMENT_START = KernRe(r'/\*\s*')
+
+#: Tokenizer regex, built from RE_SCANNER_LIST when the module is loaded.
+RE_SCANNER = fill_re_scanner(RE_SCANNER_LIST)
+
+
+class CTokenizer():
+ """
+ Scan C statements and definitions and produce tokens.
+
+ When converted to a string, it drops comments and handles
+ public:/private: markers, respecting depth.
+ """
+
+ # This class is inspired and follows the basic concepts of:
+ # https://docs.python.org/3/library/re.html#writing-a-tokenizer
+
+ def __init__(self, source=None, log=None):
+ """
+ Create a regular expression to handle RE_SCANNER_LIST.
+
+ While I generally don't like using regex group naming via:
+ (?P<name>...)
+
+ in this particular case, it makes sense, as we can pick the name
+ when matching a code via RE_SCANNER.
+ """
+
+ self.tokens = []
+
+ if not source:
+ return
+
+ if isinstance(source, list):
+ self.tokens = source
+ return
+
+ #
+ # While we could just use _tokenize directly via an iterator,
+ # we'll need to use the tokenizer several times inside kernel-doc
+ # to handle macro transforms. So, cache the results on a list, as
+ # re-using it is cheaper than having to parse every time.
+ #
+ for tok in self._tokenize(source):
+ self.tokens.append(tok)
+
+ def _tokenize(self, source):
+ """
+ Iterator that parses ``source``, splitting it into tokens, as defined
+ by ``RE_SCANNER_LIST``.
+
+ The iterator yields CToken objects.
+ """
+
+ # Handle continuation lines. Note that kdoc_parser already has
+ # logic to do that. Still, let's keep it for completeness, as we might
+ # end up re-using this tokenizer outside kernel-doc some day - or we may
+ # eventually remove it from there as a future cleanup.
+ source = RE_CONT.sub("", source)
+
+ brace_level = 0
+ paren_level = 0
+ bracket_level = 0
+
+ for match in RE_SCANNER.finditer(source):
+ kind = CToken.from_name(match.lastgroup)
+ pos = match.start()
+ value = match.group()
+
+ if kind == CToken.MISMATCH:
+ log.error(f"Unexpected token '{value}' on pos {pos}:\n\t'{source}'")
+ elif kind == CToken.BEGIN:
+ if value == '(':
+ paren_level += 1
+ elif value == '[':
+ bracket_level += 1
+ else: # value == '{'
+ brace_level += 1
+
+ elif kind == CToken.END:
+ if value == ')' and paren_level > 0:
+ paren_level -= 1
+ elif value == ']' and bracket_level > 0:
+ bracket_level -= 1
+ elif brace_level > 0: # value == '}'
+ brace_level -= 1
+
+ yield CToken(kind, value, pos,
+ brace_level, paren_level, bracket_level)
+
+ def __str__(self):
+ out = ""
+ show_stack = [True]
+
+ for i, tok in enumerate(self.tokens):
+ if tok.kind == CToken.BEGIN:
+ show_stack.append(show_stack[-1])
+
+ elif tok.kind == CToken.END:
+ prev = show_stack[-1]
+ if len(show_stack) > 1:
+ show_stack.pop()
+
+ if not prev and show_stack[-1]:
+ #
+ # Try to preserve indent
+ #
+ out += "\t" * (len(show_stack) - 1)
+
+ out += str(tok.value)
+ continue
+
+ elif tok.kind == CToken.COMMENT:
+ comment = RE_COMMENT_START.sub("", tok.value)
+
+ if comment.startswith("private:"):
+ show_stack[-1] = False
+ show = False
+ elif comment.startswith("public:"):
+ show_stack[-1] = True
+
+ continue
+
+ if not show_stack[-1]:
+ continue
+
+ if i < len(self.tokens) - 1:
+ next_tok = self.tokens[i + 1]
+
+ # Do some cleanups before ";"
+
+ if (tok.kind == CToken.SPACE and
+ next_tok.kind == CToken.PUNC and
+ next_tok.value == ";"):
+
+ continue
+
+ if (tok.kind == CToken.PUNC and
+ next_tok.kind == CToken.PUNC and
+ tok.value == ";" and
+ next_tok.kind == CToken.PUNC and
+ next_tok.value == ";"):
+
+ continue
+
+ out += str(tok.value)
+
+ return out
--
2.52.0
^ permalink raw reply related [flat|nested] 28+ messages in thread
* [PATCH v3 06/22] docs: kdoc: use tokenizer to handle comments on structs
2026-03-17 18:09 [PATCH v3 00/22] kernel-doc: use a C lexical tokenizer for transforms Mauro Carvalho Chehab
` (4 preceding siblings ...)
2026-03-17 18:09 ` [PATCH v3 05/22] docs: add a C tokenizer to be used by kernel-doc Mauro Carvalho Chehab
@ 2026-03-17 18:09 ` Mauro Carvalho Chehab
2026-03-18 7:48 ` Loktionov, Aleksandr
2026-03-17 18:09 ` [PATCH v3 07/22] unittests: test_private: modify it to use CTokenizer directly Mauro Carvalho Chehab
` (16 subsequent siblings)
22 siblings, 1 reply; 28+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-17 18:09 UTC (permalink / raw)
To: Jonathan Corbet, Linux Doc Mailing List
Cc: Mauro Carvalho Chehab, linux-hardening, linux-kernel,
Aleksandr Loktionov, Randy Dunlap
Better handle comments inside structs. After those changes,
all unittests now pass:
test_private:
TestPublicPrivate:
test balanced_inner_private: OK
test balanced_non_greddy_private: OK
test balanced_private: OK
test no private: OK
test unbalanced_inner_private: OK
test unbalanced_private: OK
test unbalanced_struct_group_tagged_with_private: OK
test unbalanced_two_struct_group_tagged_first_with_private: OK
test unbalanced_without_end_of_line: OK
Ran 9 tests
This also solves a bug when handling STRUCT_GROUP() with a private
comment on it:
@@ -397134,7 +397134,7 @@ basic V4L2 device-level support.
unsigned int max_len;
unsigned int offset;
struct page_pool_params_slow slow;
- STRUCT_GROUP( struct net_device *netdev;
+ struct net_device *netdev;
unsigned int queue_idx;
unsigned int flags;
};
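The semantics being preserved: everything between a "private:" comment and either a "public:" comment or the end of the struct must be dropped. A standalone regex sketch of those semantics (similar in spirit to the code being removed here; the tokenizer handles the nested cases this kind of approach misses):

```python
import re

def trim_private(text):
    # Drop a private region terminated by a "public:" comment...
    text = re.sub(r'/\*\s*private:.*?/\*\s*public:.*?\*/', '', text,
                  flags=re.S)
    # ...then a private region running to the end of the text.
    text = re.sub(r'/\*\s*private:.*', '', text, flags=re.S)
    # Finally, strip the remaining comments and normalize whitespace.
    text = re.sub(r'/\*.*?\*/', '', text, flags=re.S)
    return " ".join(text.split())

src = "int a; /* private: */ int b; /* public: */ int c;"
print(trim_private(src))   # int a; int c;
```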
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Message-ID: <f83ee9e8c38407eaab6ad10d4ccf155fb36683cc.1773074166.git.mchehab+huawei@kernel.org>
---
tools/lib/python/kdoc/kdoc_parser.py | 13 ++++---------
1 file changed, 4 insertions(+), 9 deletions(-)
diff --git a/tools/lib/python/kdoc/kdoc_parser.py b/tools/lib/python/kdoc/kdoc_parser.py
index 4b3c555e6c8e..62d8030cf532 100644
--- a/tools/lib/python/kdoc/kdoc_parser.py
+++ b/tools/lib/python/kdoc/kdoc_parser.py
@@ -13,6 +13,7 @@ import sys
import re
from pprint import pformat
+from kdoc.c_lex import CTokenizer
from kdoc.kdoc_re import NestedMatch, KernRe
from kdoc.kdoc_item import KdocItem
@@ -84,15 +85,9 @@ def trim_private_members(text):
"""
Remove ``struct``/``enum`` members that have been marked "private".
"""
- # First look for a "public:" block that ends a private region, then
- # handle the "private until the end" case.
- #
- text = KernRe(r'/\*\s*private:.*?/\*\s*public:.*?\*/', flags=re.S).sub('', text)
- text = KernRe(r'/\*\s*private:.*', flags=re.S).sub('', text)
- #
- # We needed the comments to do the above, but now we can take them out.
- #
- return KernRe(r'\s*/\*.*?\*/\s*', flags=re.S).sub('', text).strip()
+
+ tokens = CTokenizer(text)
+ return str(tokens)
class state:
"""
--
2.52.0
^ permalink raw reply related [flat|nested] 28+ messages in thread
* [PATCH v3 07/22] unittests: test_private: modify it to use CTokenizer directly
2026-03-17 18:09 [PATCH v3 00/22] kernel-doc: use a C lexical tokenizer for transforms Mauro Carvalho Chehab
` (5 preceding siblings ...)
2026-03-17 18:09 ` [PATCH v3 06/22] docs: kdoc: use tokenizer to handle comments on structs Mauro Carvalho Chehab
@ 2026-03-17 18:09 ` Mauro Carvalho Chehab
2026-03-17 18:09 ` [PATCH v3 08/22] unittests: test_tokenizer: check if the tokenizer works Mauro Carvalho Chehab
` (15 subsequent siblings)
22 siblings, 0 replies; 28+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-17 18:09 UTC (permalink / raw)
To: Jonathan Corbet, Linux Doc Mailing List
Cc: Mauro Carvalho Chehab, linux-hardening, linux-kernel
Change the logic to use the tokenizer directly. This allows
adding more unit tests to check the validity of the tokenizer
itself.
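The test-generation pattern this patch adopts — building a TestCase subclass at runtime with type() — can be sketched as follows (the table and names here are hypothetical, not the kernel-doc test data):

```python
import unittest

# Hypothetical table of cases: one generated test per key.
CASES = {
    "upper": {"source": "abc", "expected": "ABC"},
    "empty": {"source": "",    "expected": ""},
}

def make_test(name, data):
    def test(self):
        self.assertEqual(data["source"].upper(), data["expected"],
                         msg=f"case {name} failed")
    return test

# Create the class with type() instead of mutating an instance,
# so unittest discovery sees the generated test methods.
class_dict = {f"test_{name}": make_test(name, data)
              for name, data in CASES.items()}
TestUpper = type("TestUpper", (unittest.TestCase,), class_dict)
```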
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Message-ID: <2672257233ff73a9464c09b50924be51e25d4f59.1773074166.git.mchehab+huawei@kernel.org>
---
.../{test_private.py => test_tokenizer.py} | 75 +++++++++++++------
1 file changed, 51 insertions(+), 24 deletions(-)
rename tools/unittests/{test_private.py => test_tokenizer.py} (85%)
diff --git a/tools/unittests/test_private.py b/tools/unittests/test_tokenizer.py
similarity index 85%
rename from tools/unittests/test_private.py
rename to tools/unittests/test_tokenizer.py
index eae245ae8a12..3b1d0b5bd311 100755
--- a/tools/unittests/test_private.py
+++ b/tools/unittests/test_tokenizer.py
@@ -15,20 +15,43 @@ from unittest.mock import MagicMock
SRC_DIR = os.path.dirname(os.path.realpath(__file__))
sys.path.insert(0, os.path.join(SRC_DIR, "../lib/python"))
-from kdoc.kdoc_parser import trim_private_members
+from kdoc.c_lex import CTokenizer
from unittest_helper import run_unittest
+
#
# List of tests.
#
# The code will dynamically generate one test for each key on this dictionary.
#
+def make_private_test(name, data):
+ """
+ Create a test named ``name`` using parameters given by ``data`` dict.
+ """
+
+ def test(self):
+ """In-lined lambda-like function to run the test"""
+ tokens = CTokenizer(data["source"])
+ result = str(tokens)
+
+ #
+ # Avoid whitespace false positives
+ #
+ result = re.sub(r"\s++", " ", result).strip()
+ expected = re.sub(r"\s++", " ", data["trimmed"]).strip()
+
+ msg = f"failed when parsing this source:\n{data['source']}"
+ self.assertEqual(result, expected, msg=msg)
+
+ return test
+
#: Tests to check if CTokenizer is handling properly public/private comments.
TESTS_PRIVATE = {
#
# Simplest case: no private. Ensure that trimming won't affect struct
#
+ "__run__": make_private_test,
"no private": {
"source": """
struct foo {
@@ -288,41 +311,45 @@ TESTS_PRIVATE = {
},
}
+#: Dict containing all test groups for CTokenizer
+TESTS = {
+ "TestPublicPrivate": TESTS_PRIVATE,
+}
-class TestPublicPrivate(unittest.TestCase):
- """
- Main test class. Populated dynamically at runtime.
- """
+def setUp(self):
+ self.maxDiff = None
- def setUp(self):
- self.maxDiff = None
+def build_test_class(group_name, table):
+ """
+ Dynamically creates a new class using type() as a generator
+ for a class derived from unittest.TestCase.
- def add_test(cls, name, source, trimmed):
- """
- Dynamically add a test to the class
- """
- def test(cls):
- result = trim_private_members(source)
+ We're opting to do it inside a function to avoid the risk of
+ changing the globals() dictionary.
+ """
- result = re.sub(r"\s++", " ", result).strip()
- expected = re.sub(r"\s++", " ", trimmed).strip()
+ class_dict = {
+ "setUp": setUp
+ }
- msg = f"failed when parsing this source:\n" + source
+ run = table["__run__"]
- cls.assertEqual(result, expected, msg=msg)
+ for test_name, data in table.items():
+ if test_name == "__run__":
+ continue
- test.__name__ = f'test {name}'
+ class_dict[f"test_{test_name}"] = run(test_name, data)
- setattr(TestPublicPrivate, test.__name__, test)
+ cls = type(group_name, (unittest.TestCase,), class_dict)
+ return cls.__name__, cls
#
-# Populate TestPublicPrivate class
+# Create classes and add them to the global dictionary
#
-test_class = TestPublicPrivate()
-for name, test in TESTS_PRIVATE.items():
- test_class.add_test(name, test["source"], test["trimmed"])
-
+for group, table in TESTS.items():
+ t = build_test_class(group, table)
+ globals()[t[0]] = t[1]
#
# main
--
2.52.0
^ permalink raw reply related [flat|nested] 28+ messages in thread
* [PATCH v3 08/22] unittests: test_tokenizer: check if the tokenizer works
2026-03-17 18:09 [PATCH v3 00/22] kernel-doc: use a C lexical tokenizer for transforms Mauro Carvalho Chehab
` (6 preceding siblings ...)
2026-03-17 18:09 ` [PATCH v3 07/22] unittests: test_private: modify it to use CTokenizer directly Mauro Carvalho Chehab
@ 2026-03-17 18:09 ` Mauro Carvalho Chehab
2026-03-17 18:09 ` [PATCH v3 09/22] unittests: add a runner to execute all unittests Mauro Carvalho Chehab
` (14 subsequent siblings)
22 siblings, 0 replies; 28+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-17 18:09 UTC (permalink / raw)
To: Jonathan Corbet, Linux Doc Mailing List
Cc: Mauro Carvalho Chehab, linux-hardening, linux-kernel
Add extra tests to check if the tokenizer is working properly.
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
tools/unittests/test_tokenizer.py | 108 +++++++++++++++++++++++++++++-
1 file changed, 106 insertions(+), 2 deletions(-)
diff --git a/tools/unittests/test_tokenizer.py b/tools/unittests/test_tokenizer.py
index 3b1d0b5bd311..5634b4a7283e 100755
--- a/tools/unittests/test_tokenizer.py
+++ b/tools/unittests/test_tokenizer.py
@@ -15,15 +15,118 @@ from unittest.mock import MagicMock
SRC_DIR = os.path.dirname(os.path.realpath(__file__))
sys.path.insert(0, os.path.join(SRC_DIR, "../lib/python"))
-from kdoc.c_lex import CTokenizer
+from kdoc.c_lex import CToken, CTokenizer
from unittest_helper import run_unittest
-
#
# List of tests.
#
# The code will dynamically generate one test for each key on this dictionary.
#
+def tokens_to_list(tokens):
+ tuples = []
+
+ for tok in tokens:
+ if tok.kind == CToken.SPACE:
+ continue
+
+ tuples += [(tok.kind, tok.value, tok.level)]
+
+ return tuples
+
+
+def make_tokenizer_test(name, data):
+ """
+ Create a test named ``name`` using parameters given by ``data`` dict.
+ """
+
+ def test(self):
+ """In-lined lambda-like function to run the test"""
+
+ #
+ # Check if logger is working
+ #
+ if "log_level" in data:
+ with self.assertLogs('kdoc.c_lex', level='ERROR') as cm:
+ tokenizer = CTokenizer(data["source"])
+
+ return
+
+ #
+ # Check if tokenizer is producing expected results
+ #
+ tokens = CTokenizer(data["source"]).tokens
+
+ result = tokens_to_list(tokens)
+ expected = tokens_to_list(data["expected"])
+
+ self.assertEqual(result, expected, msg=f"{name}")
+
+ return test
+
+#: Tokenizer tests.
+TESTS_TOKENIZER = {
+ "__run__": make_tokenizer_test,
+
+ "basic_tokens": {
+ "source": """
+ int a; // comment
+ float b = 1.23;
+ """,
+ "expected": [
+ CToken(CToken.NAME, "int"),
+ CToken(CToken.NAME, "a"),
+ CToken(CToken.ENDSTMT, ";"),
+ CToken(CToken.COMMENT, "// comment"),
+ CToken(CToken.NAME, "float"),
+ CToken(CToken.NAME, "b"),
+ CToken(CToken.OP, "="),
+ CToken(CToken.NUMBER, "1.23"),
+ CToken(CToken.ENDSTMT, ";"),
+ ],
+ },
+
+ "depth_counters": {
+ "source": """
+ struct X {
+ int arr[10];
+ func(a[0], (b + c));
+ }
+ """,
+ "expected": [
+ CToken(CToken.STRUCT, "struct"),
+ CToken(CToken.NAME, "X"),
+ CToken(CToken.BEGIN, "{", brace_level=1),
+
+ CToken(CToken.NAME, "int", brace_level=1),
+ CToken(CToken.NAME, "arr", brace_level=1),
+ CToken(CToken.BEGIN, "[", brace_level=1, bracket_level=1),
+ CToken(CToken.NUMBER, "10", brace_level=1, bracket_level=1),
+ CToken(CToken.END, "]", brace_level=1),
+ CToken(CToken.ENDSTMT, ";", brace_level=1),
+ CToken(CToken.NAME, "func", brace_level=1),
+ CToken(CToken.BEGIN, "(", brace_level=1, paren_level=1),
+ CToken(CToken.NAME, "a", brace_level=1, paren_level=1),
+ CToken(CToken.BEGIN, "[", brace_level=1, paren_level=1, bracket_level=1),
+ CToken(CToken.NUMBER, "0", brace_level=1, paren_level=1, bracket_level=1),
+ CToken(CToken.END, "]", brace_level=1, paren_level=1),
+ CToken(CToken.PUNC, ",", brace_level=1, paren_level=1),
+ CToken(CToken.BEGIN, "(", brace_level=1, paren_level=2),
+ CToken(CToken.NAME, "b", brace_level=1, paren_level=2),
+ CToken(CToken.OP, "+", brace_level=1, paren_level=2),
+ CToken(CToken.NAME, "c", brace_level=1, paren_level=2),
+ CToken(CToken.END, ")", brace_level=1, paren_level=1),
+ CToken(CToken.END, ")", brace_level=1),
+ CToken(CToken.ENDSTMT, ";", brace_level=1),
+ CToken(CToken.END, "}"),
+ ],
+ },
+
+ "mismatch_error": {
+ "source": "int a$ = 5;", # $ is illegal
+ "log_level": "ERROR",
+ },
+}
def make_private_test(name, data):
"""
@@ -314,6 +417,7 @@ TESTS_PRIVATE = {
#: Dict containing all test groups fror CTokenizer
TESTS = {
"TestPublicPrivate": TESTS_PRIVATE,
+ "TestTokenizer": TESTS_TOKENIZER,
}
def setUp(self):
--
2.52.0
^ permalink raw reply related [flat|nested] 28+ messages in thread
* [PATCH v3 09/22] unittests: add a runner to execute all unittests
2026-03-17 18:09 [PATCH v3 00/22] kernel-doc: use a C lexical tokenizer for transforms Mauro Carvalho Chehab
` (7 preceding siblings ...)
2026-03-17 18:09 ` [PATCH v3 08/22] unittests: test_tokenizer: check if the tokenizer works Mauro Carvalho Chehab
@ 2026-03-17 18:09 ` Mauro Carvalho Chehab
2026-03-17 18:09 ` [PATCH v3 10/22] docs: kdoc: create a CMatch to match nested C blocks Mauro Carvalho Chehab
` (13 subsequent siblings)
22 siblings, 0 replies; 28+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-17 18:09 UTC (permalink / raw)
To: Jonathan Corbet, Linux Doc Mailing List
Cc: Mauro Carvalho Chehab, linux-hardening, linux-kernel
We'll soon have multiple unit tests. Add a runner that will
discover and execute all of them.
We opted to discover only files that start with "test", so
that unittest discovery won't try adding libraries or other
files that might not contain unittest classes.
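unittest's discovery takes a filename pattern, so restricting it to "test*.py" keeps helper modules out of the suite. A minimal sketch, using an empty temporary directory so nothing matches:

```python
import tempfile
import unittest

# Discover only files matching "test*.py"; helper libraries in
# the same tree are never imported by the loader.
with tempfile.TemporaryDirectory() as tmp:
    loader = unittest.TestLoader()
    suite = loader.discover(start_dir=tmp, pattern="test*.py")
    print(suite.countTestCases())   # 0: nothing matches yet
```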
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
tools/unittests/run.py | 17 +++++++++++++++++
1 file changed, 17 insertions(+)
create mode 100755 tools/unittests/run.py
diff --git a/tools/unittests/run.py b/tools/unittests/run.py
new file mode 100755
index 000000000000..8c19036d43a1
--- /dev/null
+++ b/tools/unittests/run.py
@@ -0,0 +1,17 @@
+#!/usr/bin/env python3
+import os
+import unittest
+import sys
+
+TOOLS_DIR=os.path.join(os.path.dirname(os.path.realpath(__file__)), "..")
+sys.path.insert(0, TOOLS_DIR)
+
+from lib.python.unittest_helper import TestUnits
+
+if __name__ == "__main__":
+ loader = unittest.TestLoader()
+
+ suite = loader.discover(start_dir=os.path.join(TOOLS_DIR, "unittests"),
+ pattern="test*.py")
+
+ TestUnits().run("", suite=suite)
--
2.52.0
^ permalink raw reply related [flat|nested] 28+ messages in thread
* [PATCH v3 10/22] docs: kdoc: create a CMatch to match nested C blocks
2026-03-17 18:09 [PATCH v3 00/22] kernel-doc: use a C lexical tokenizer for transforms Mauro Carvalho Chehab
` (8 preceding siblings ...)
2026-03-17 18:09 ` [PATCH v3 09/22] unittests: add a runner to execute all unittests Mauro Carvalho Chehab
@ 2026-03-17 18:09 ` Mauro Carvalho Chehab
2026-03-17 18:09 ` [PATCH v3 11/22] tools: unittests: add tests for CMatch Mauro Carvalho Chehab
` (12 subsequent siblings)
22 siblings, 0 replies; 28+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-17 18:09 UTC (permalink / raw)
To: Jonathan Corbet, Linux Doc Mailing List
Cc: Mauro Carvalho Chehab, linux-hardening, linux-kernel
The NestedMatch code is complex, and will become even more complex
if we add support for arguments there.
Now that we have a tokenizer, we can use a better solution that
is easier to understand.
Yet, to improve performance, it is better to make it use
previously tokenized code, changing its API.
So, reimplement NestedMatch using the CTokenizer class. Once this
is done, we can drop the old implementation.
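The delimiter-pairing idea is a standard stack-style scan. A self-contained sketch over a raw string (the real code works on tokens and nesting levels, and the helper name here is hypothetical):

```python
def find_macro_span(text, name):
    """Return name(...) with balanced parentheses, or None if
    the macro is absent or its delimiters are unbalanced."""
    start = text.find(name)
    if start < 0:
        return None
    depth = 0
    for i in range(start + len(name), len(text)):
        c = text[i]
        if c == "(":
            depth += 1
        elif c == ")":
            depth -= 1
            if depth == 0:
                # All opened parentheses were closed
                return text[start:i + 1]
    return None          # unbalanced: ignore the match

print(find_macro_span("__acquires((a, b)) foo();", "__acquires"))
# __acquires((a, b))
```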
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
tools/lib/python/kdoc/c_lex.py | 121 ++++++++++++++++++++++++++++++---
1 file changed, 111 insertions(+), 10 deletions(-)
diff --git a/tools/lib/python/kdoc/c_lex.py b/tools/lib/python/kdoc/c_lex.py
index 9d726f821f3f..5da472734ff7 100644
--- a/tools/lib/python/kdoc/c_lex.py
+++ b/tools/lib/python/kdoc/c_lex.py
@@ -273,20 +273,121 @@ class CTokenizer():
# Do some cleanups before ";"
- if (tok.kind == CToken.SPACE and
- next_tok.kind == CToken.PUNC and
- next_tok.value == ";"):
-
+ if tok.kind == CToken.SPACE and next_tok.kind == CToken.ENDSTMT:
continue
- if (tok.kind == CToken.PUNC and
- next_tok.kind == CToken.PUNC and
- tok.value == ";" and
- next_tok.kind == CToken.PUNC and
- next_tok.value == ";"):
-
+ if tok.kind == CToken.ENDSTMT and next_tok.kind == tok.kind:
continue
out += str(tok.value)
return out
+
+
+class CMatch:
+ """
+ Finding nested delimiters is hard with regular expressions. It is
+ even harder in Python with its standard re module, as several
+ advanced regular expression features are missing.
+
+ This is the case of this pattern::
+
+ '\\bSTRUCT_GROUP(\\(((?:(?>[^)(]+)|(?1))*)\\))[^;]*;'
+
+ which is used to properly match the open/close parentheses when
+ searching for STRUCT_GROUP().
+
+ Add a class that counts pairs of delimiters, using it to match and
+ replace nested expressions.
+
+ The original approach was suggested by:
+
+ https://stackoverflow.com/questions/5454322/python-how-to-match-nested-parentheses-with-regex
+
+ although I re-implemented it to make it more generic and to match 3
+ types of delimiters. The logic checks if delimiters are paired; if
+ not, it will ignore the search string.
+ """
+
+ # TODO: add a sub method
+
+ def __init__(self, regex):
+ self.regex = KernRe(regex)
+
+ def _search(self, tokenizer):
+ """
+ Finds paired blocks for a regex that ends with a delimiter.
+
+ The suggestion of using finditer to match pairs came from:
+ https://stackoverflow.com/questions/5454322/python-how-to-match-nested-parentheses-with-regex
+ but I ended using a different implementation to align all three types
+ of delimiters and seek for an initial regular expression.
+
+ The algorithm seeks for open/close paired delimiters and places them
+ into a stack, yielding a start/stop position of each match when the
+ stack is zeroed.
+
+ The algorithm should work fine for properly paired lines, but will
+ silently ignore end delimiters that precede a start delimiter.
+ This should be OK for the kernel-doc parser, as unbalanced delimiters
+ would cause compilation errors. So, we don't need to raise exceptions
+ to cover such issues.
+ """
+
+ start = None
+ offset = -1
+ started = False
+
+ import sys
+
+ stack = []
+
+ for i, tok in enumerate(tokenizer.tokens):
+ if start is None:
+ if tok.kind == CToken.NAME and self.regex.match(tok.value):
+ start = i
+ stack.append((start, tok.level))
+ started = False
+
+ continue
+
+ if not started and tok.kind == CToken.BEGIN:
+ started = True
+ continue
+
+ if tok.kind == CToken.END and tok.level == stack[-1][1]:
+ start, level = stack.pop()
+ offset = i
+
+ yield CTokenizer(tokenizer.tokens[start:offset + 1])
+ start = None
+
+ #
+ # If no END token brings the level back to zero, return the
+ # remaining tokens. This is meant to handle cases where the
+ # caller logic might be picking an incomplete block.
+ #
+ if start and offset < 0:
+ print("WARNING: can't find an end", file=sys.stderr)
+ yield CTokenizer(tokenizer.tokens[start:])
+
+ def search(self, source):
+ """
+ This is similar to re.search:
+
+ It matches a regex that is followed by a delimiter,
+ returning occurrences only if all delimiters are paired.
+ """
+
+ if isinstance(source, CTokenizer):
+ tokenizer = source
+ is_token = True
+ else:
+ tokenizer = CTokenizer(source)
+ is_token = False
+
+ for new_tokenizer in self._search(tokenizer):
+ if is_token:
+ yield new_tokenizer
+ else:
+ yield str(new_tokenizer)
--
2.52.0
^ permalink raw reply related [flat|nested] 28+ messages in thread
* [PATCH v3 11/22] tools: unittests: add tests for CMatch
2026-03-17 18:09 [PATCH v3 00/22] kernel-doc: use a C lexical tokenizer for transforms Mauro Carvalho Chehab
` (9 preceding siblings ...)
2026-03-17 18:09 ` [PATCH v3 10/22] docs: kdoc: create a CMatch to match nested C blocks Mauro Carvalho Chehab
@ 2026-03-17 18:09 ` Mauro Carvalho Chehab
2026-03-17 18:09 ` [PATCH v3 12/22] docs: c_lex: properly implement a sub() method " Mauro Carvalho Chehab
` (11 subsequent siblings)
22 siblings, 0 replies; 28+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-17 18:09 UTC (permalink / raw)
To: Jonathan Corbet, Linux Doc Mailing List
Cc: Mauro Carvalho Chehab, linux-hardening, linux-kernel
The CMatch logic is complex enough to justify tests to ensure
that it is doing its job.
Add unittests to check the functionality provided by CMatch
by replicating expected patterns.
The CMatch class handles complex macros. Add a unittest to
check that it's doing the right thing and to detect eventual
regressions as we improve its code.
The initial version was generated using the gpt-oss:latest LLM
on my local GPU, as LLMs aren't bad at transforming patterns
into unittests.
Yet, the current version contains only the skeleton of what the
LLM produced, as I ended up heavily changing its content to be
more representative and to cover real-case scenarios.
The kdoc_xforms test suite contains 3 test groups. Two of
them test the basic functionality of CMatch pattern
replacement.
The last one (TestRealUsecases) contains real code snippets
from the kernel, with some cleanups to better fit in 80 columns,
and uses the same transforms as kernel-doc, thus allowing us
to test the logic used inside kdoc_parser to transform
functions, structs and variable patterns.
Its output is like this:
$ tools/unittests/kdoc_xforms.py
Ran 25 tests in 0.003s
OK
test_cmatch:
TestSearch:
test_search_acquires_multiple: OK
test_search_acquires_nested_paren: OK
test_search_acquires_simple: OK
test_search_must_hold: OK
test_search_must_hold_shared: OK
test_search_no_false_positive: OK
test_search_no_function: OK
test_search_no_macro_remains: OK
Ran 8 tests
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
tools/unittests/test_cmatch.py | 95 ++++++++++++++++++++++++++++++++++
1 file changed, 95 insertions(+)
create mode 100755 tools/unittests/test_cmatch.py
diff --git a/tools/unittests/test_cmatch.py b/tools/unittests/test_cmatch.py
new file mode 100755
index 000000000000..53b25aa4dc4a
--- /dev/null
+++ b/tools/unittests/test_cmatch.py
@@ -0,0 +1,95 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+# Copyright(c) 2026: Mauro Carvalho Chehab <mchehab@kernel.org>.
+#
+# pylint: disable=C0413,R0904
+
+
+"""
+Unit tests for kernel-doc CMatch.
+"""
+
+import os
+import re
+import sys
+import unittest
+
+
+# Import Python modules
+
+SRC_DIR = os.path.dirname(os.path.realpath(__file__))
+sys.path.insert(0, os.path.join(SRC_DIR, "../lib/python"))
+
+from kdoc.c_lex import CMatch
+from kdoc.xforms_lists import CTransforms
+from unittest_helper import run_unittest
+
+#
+# Override unittest.TestCase to better compare diffs ignoring whitespaces
+#
+class TestCaseDiff(unittest.TestCase):
+ """
+ Disable maximum limit on diffs and add a method to better
+ handle diffs with whitespace differences.
+ """
+
+ @classmethod
+ def setUpClass(cls):
+ """Ensure that there won't be limit for diffs"""
+ cls.maxDiff = None
+
+
+#
+# Tests doing with different macros
+#
+
+class TestSearch(TestCaseDiff):
+ """
+ Test search mechanism
+ """
+
+ def test_search_acquires_simple(self):
+ line = "__acquires(ctx) foo();"
+ result = ", ".join(CMatch("__acquires").search(line))
+ self.assertEqual(result, "__acquires(ctx)")
+
+ def test_search_acquires_multiple(self):
+ line = "__acquires(ctx) __acquires(other) bar();"
+ result = ", ".join(CMatch("__acquires").search(line))
+ self.assertEqual(result, "__acquires(ctx), __acquires(other)")
+
+ def test_search_acquires_nested_paren(self):
+ line = "__acquires((ctx1, ctx2)) baz();"
+ result = ", ".join(CMatch("__acquires").search(line))
+ self.assertEqual(result, "__acquires((ctx1, ctx2))")
+
+ def test_search_must_hold(self):
+ line = "__must_hold(&lock) do_something();"
+ result = ", ".join(CMatch("__must_hold").search(line))
+ self.assertEqual(result, "__must_hold(&lock)")
+
+ def test_search_must_hold_shared(self):
+ line = "__must_hold_shared(RCU) other();"
+ result = ", ".join(CMatch("__must_hold_shared").search(line))
+ self.assertEqual(result, "__must_hold_shared(RCU)")
+
+ def test_search_no_false_positive(self):
+ line = "call__acquires(foo); // should stay intact"
+ result = ", ".join(CMatch(r"\b__acquires").search(line))
+ self.assertEqual(result, "")
+
+ def test_search_no_macro_remains(self):
+ line = "do_something_else();"
+ result = ", ".join(CMatch("__acquires").search(line))
+ self.assertEqual(result, "")
+
+ def test_search_no_function(self):
+ line = "something"
+ result = ", ".join(CMatch(line).search(line))
+ self.assertEqual(result, "")
+
+#
+# Run all tests
+#
+if __name__ == "__main__":
+ run_unittest(__file__)
--
2.52.0
^ permalink raw reply related [flat|nested] 28+ messages in thread
* [PATCH v3 12/22] docs: c_lex: properly implement a sub() method for CMatch
2026-03-17 18:09 [PATCH v3 00/22] kernel-doc: use a C lexical tokenizer for transforms Mauro Carvalho Chehab
` (10 preceding siblings ...)
2026-03-17 18:09 ` [PATCH v3 11/22] tools: unittests: add tests for CMatch Mauro Carvalho Chehab
@ 2026-03-17 18:09 ` Mauro Carvalho Chehab
2026-03-17 18:09 ` [PATCH v3 13/22] unittests: test_cmatch: add tests for sub() Mauro Carvalho Chehab
` (10 subsequent siblings)
22 siblings, 0 replies; 28+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-17 18:09 UTC (permalink / raw)
To: Jonathan Corbet, Linux Doc Mailing List
Cc: Mauro Carvalho Chehab, linux-hardening, linux-kernel
Implement a sub() method that does what is expected, parsing
backref arguments like \0, \1, \2, ...
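As a rough illustration of the intended semantics, here is a hypothetical,
character-level sketch (the helper name is made up; the real implementation
works on C tokens via CTokenizer, not on raw characters):

```python
import re

def balanced_sub(macro, repl, text):
    # Replace each macro(...) occurrence, pairing parentheses by
    # depth; "\0" in repl is filled with the paired contents.
    # Character-level sketch only: CMatch operates on C tokens.
    out, pos = [], 0
    pattern = re.compile(r"\b" + macro + r"\s*\(")
    while (m := pattern.search(text, pos)):
        out.append(text[pos:m.start()])
        depth, i = 1, m.end()
        while i < len(text) and depth:
            depth += {"(": 1, ")": -1}.get(text[i], 0)
            i += 1
        out.append(repl.replace(r"\0", text[m.end():i - 1]))
        pos = i
    out.append(text[pos:])
    return "".join(out)

print(balanced_sub("__acquires", r"HELD(\0)", "__acquires((a, b)) foo();"))
# -> HELD((a, b)) foo();
```

Unlike a plain re.sub(), the nested parentheses are matched as a whole
before the replacement is applied.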
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
tools/lib/python/kdoc/c_lex.py | 272 +++++++++++++++++++++++++++++++--
1 file changed, 259 insertions(+), 13 deletions(-)
diff --git a/tools/lib/python/kdoc/c_lex.py b/tools/lib/python/kdoc/c_lex.py
index 5da472734ff7..20e50ff0ecd5 100644
--- a/tools/lib/python/kdoc/c_lex.py
+++ b/tools/lib/python/kdoc/c_lex.py
@@ -16,6 +16,8 @@ Other errors are logged via log instance.
import logging
import re
+from copy import copy
+
from .kdoc_re import KernRe
log = logging.getLogger(__name__)
@@ -284,6 +286,172 @@ class CTokenizer():
return out
+class CTokenArgs:
+ """
+ Ancillary class to help using backrefs from sub matches.
+
+    If the highest backref ends with a "+", the logic will be
+    greedy, picking up all remaining delimited fields.
+
+    This is needed to parse struct_group macros which end with ``MEMBERS...``.
+ """
+ def __init__(self, sub_str):
+ self.sub_groups = set()
+ self.max_group = -1
+ self.greedy = None
+
+ for m in KernRe(r'\\(\d+)([+]?)').finditer(sub_str):
+ group = int(m.group(1))
+ if m.group(2) == "+":
+ if self.greedy and self.greedy != group:
+ raise ValueError("There are multiple greedy patterns!")
+ self.greedy = group
+
+ self.sub_groups.add(group)
+ self.max_group = max(self.max_group, group)
+
+ if self.greedy:
+ if self.greedy != self.max_group:
+ raise ValueError("Greedy pattern is not the last one!")
+
+ sub_str = KernRe(r'(\\\d+)[+]').sub(r"\1", sub_str)
+
+ self.sub_str = sub_str
+ self.sub_tokeninzer = CTokenizer(sub_str)
+
+ def groups(self, new_tokenizer):
+ """
+ Create replacement arguments for backrefs like:
+
+ ``\0``, ``\1``, ``\2``, ...``\n``
+
+        It also accepts a ``+`` character appended to the highest backref.
+        When used, it makes the match greedy, ignoring delimiters after it.
+
+ The logic is smart enough to only go up to the maximum required
+ argument, even if there are more.
+
+ If there is a backref for an argument above the limit, it will
+        raise an exception. Please notice that, in C, square brackets
+        don't have any separator inside them; trying to use ``\1``..``\n``
+        for brackets also raises an exception.
+ """
+
+ level = (0, 0, 0)
+
+ if self.max_group < 0:
+ return level, []
+
+ tokens = new_tokenizer.tokens
+
+ #
+ # Fill \0 with the full token contents
+ #
+ groups_list = [ [] ]
+
+ if 0 in self.sub_groups:
+ inner_level = 0
+
+ for i in range(0, len(tokens)):
+ tok = tokens[i]
+
+ if tok.kind == CToken.BEGIN:
+ inner_level += 1
+
+ #
+ # Discard first begin
+ #
+ if not groups_list[0]:
+ continue
+ elif tok.kind == CToken.END:
+ inner_level -= 1
+ if inner_level < 0:
+ break
+
+ if inner_level:
+ groups_list[0].append(tok)
+
+ if not self.max_group:
+ return level, groups_list
+
+ delim = None
+
+ #
+ # Ignore everything before BEGIN. The value of begin gives the
+ # delimiter to be used for the matches
+ #
+ for i in range(0, len(tokens)):
+ tok = tokens[i]
+ if tok.kind == CToken.BEGIN:
+ if tok.value == "{":
+ delim = ";"
+ elif tok.value == "(":
+ delim = ","
+ else:
+                log.error(fr"Can't handle \1..\n on {self.sub_str}")
+
+ level = tok.level
+ break
+
+ pos = 1
+ groups_list.append([])
+
+ inner_level = 0
+ for i in range(i + 1, len(tokens)):
+ tok = tokens[i]
+
+ if tok.kind == CToken.BEGIN:
+ inner_level += 1
+ if tok.kind == CToken.END:
+ inner_level -= 1
+ if inner_level < 0:
+ break
+
+ if tok.kind in [CToken.PUNC, CToken.ENDSTMT] and delim == tok.value:
+ pos += 1
+ if self.greedy and pos > self.max_group:
+ pos -= 1
+ else:
+ groups_list.append([])
+
+ if pos > self.max_group:
+ break
+
+ continue
+
+ groups_list[pos].append(tok)
+
+ if pos < self.max_group:
+ log.error(fr"{self.sub_str} groups are up to {pos} instead of {self.max_group}")
+
+ return level, groups_list
+
+ def tokens(self, new_tokenizer):
+ level, groups = self.groups(new_tokenizer)
+
+ new = CTokenizer()
+
+ for tok in self.sub_tokeninzer.tokens:
+ if tok.kind == CToken.BACKREF:
+ group = int(tok.value[1:])
+
+ for group_tok in groups[group]:
+ new_tok = copy(group_tok)
+
+ new_level = [0, 0, 0]
+
+ for i in range(0, len(level)):
+ new_level[i] = new_tok.level[i] + level[i]
+
+ new_tok.level = tuple(new_level)
+
+ new.tokens += [ new_tok ]
+ else:
+ new.tokens += [ tok ]
+
+ return new.tokens
+
+
class CMatch:
"""
Finding nested delimiters is hard with regular expressions. It is
@@ -309,10 +477,10 @@ class CMatch:
will ignore the search string.
"""
- # TODO: add a sub method
- def __init__(self, regex):
- self.regex = KernRe(regex)
+ def __init__(self, regex, delim="("):
+ self.regex = KernRe("^" + regex + r"\b")
+ self.start_delim = delim
def _search(self, tokenizer):
"""
@@ -335,7 +503,6 @@ class CMatch:
"""
start = None
- offset = -1
started = False
import sys
@@ -351,15 +518,24 @@ class CMatch:
continue
- if not started and tok.kind == CToken.BEGIN:
- started = True
- continue
+ if not started:
+ if tok.kind == CToken.SPACE:
+ continue
+
+ if tok.kind == CToken.BEGIN and tok.value == self.start_delim:
+ started = True
+ continue
+
+ # Name only token without BEGIN/END
+ if i > start:
+ i -= 1
+ yield start, i
+ start = None
if tok.kind == CToken.END and tok.level == stack[-1][1]:
start, level = stack.pop()
- offset = i
- yield CTokenizer(tokenizer.tokens[start:offset + 1])
+ yield start, i
start = None
#
@@ -367,9 +543,12 @@ class CMatch:
# This is meant to solve cases where the caller logic might be
# picking an incomplete block.
#
- if start and offset < 0:
- print("WARNING: can't find an end", file=sys.stderr)
- yield CTokenizer(tokenizer.tokens[start:])
+ if start and stack:
+ if started:
+ s = str(tokenizer)
+ log.warning(f"can't find a final end at {s}")
+
+ yield start, len(tokenizer.tokens)
def search(self, source):
"""
@@ -386,8 +565,75 @@ class CMatch:
tokenizer = CTokenizer(source)
is_token = False
- for new_tokenizer in self._search(tokenizer):
+ for start, end in self._search(tokenizer):
+ new_tokenizer = CTokenizer(tokenizer.tokens[start:end + 1])
+
if is_token:
yield new_tokenizer
else:
yield str(new_tokenizer)
+
+ def sub(self, sub_str, source, count=0):
+ """
+ This is similar to re.sub:
+
+        It matches a regex that is followed by a delimiter,
+ replacing occurrences only if all delimiters are paired.
+
+        If the sub argument contains::
+
+ r'\0'
+
+ it will work just like re: it places there the matched paired data
+ with the delimiter stripped.
+
+        If count is nonzero, it will replace at most count
+        occurrences.
+ """
+ if isinstance(source, CTokenizer):
+ is_token = True
+ tokenizer = source
+ else:
+ is_token = False
+ tokenizer = CTokenizer(source)
+
+ # Detect if sub_str contains sub arguments
+
+ args_match = CTokenArgs(sub_str)
+
+ new_tokenizer = CTokenizer()
+ pos = 0
+ n = 0
+
+ #
+        # NOTE: the code below doesn't consider overlapping matches at sub.
+ # We may need to add some extra unit tests to check if those
+ # would cause problems. When replacing by "", this should not
+ # be a problem, but other transformations could be problematic
+ #
+ for start, end in self._search(tokenizer):
+ new_tokenizer.tokens += tokenizer.tokens[pos:start]
+
+ new = CTokenizer(tokenizer.tokens[start:end + 1])
+
+ new_tokenizer.tokens += args_match.tokens(new)
+
+ pos = end + 1
+
+ n += 1
+ if count and n >= count:
+ break
+
+ new_tokenizer.tokens += tokenizer.tokens[pos:]
+
+ if not is_token:
+ return str(new_tokenizer)
+
+ return new_tokenizer
+
+ def __repr__(self):
+ """
+ Returns a displayable version of the class init.
+ """
+
+ return f'CMatch("{self.regex.regex.pattern}")'
--
2.52.0
^ permalink raw reply related [flat|nested] 28+ messages in thread
* [PATCH v3 13/22] unittests: test_cmatch: add tests for sub()
2026-03-17 18:09 [PATCH v3 00/22] kernel-doc: use a C lexical tokenizer for transforms Mauro Carvalho Chehab
` (11 preceding siblings ...)
2026-03-17 18:09 ` [PATCH v3 12/22] docs: c_lex: properly implement a sub() method " Mauro Carvalho Chehab
@ 2026-03-17 18:09 ` Mauro Carvalho Chehab
2026-03-17 18:09 ` [PATCH v3 14/22] docs: kdoc: replace NestedMatch with CMatch Mauro Carvalho Chehab
` (9 subsequent siblings)
22 siblings, 0 replies; 28+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-17 18:09 UTC (permalink / raw)
To: Jonathan Corbet, Kees Cook, Linux Doc Mailing List
Cc: Mauro Carvalho Chehab, linux-hardening, linux-kernel,
Gustavo A. R. Silva
Now that we have code for sub(), test it.
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
tools/unittests/test_cmatch.py | 730 ++++++++++++++++++++++++++++++++-
1 file changed, 728 insertions(+), 2 deletions(-)
diff --git a/tools/unittests/test_cmatch.py b/tools/unittests/test_cmatch.py
index 53b25aa4dc4a..7b996f83784d 100755
--- a/tools/unittests/test_cmatch.py
+++ b/tools/unittests/test_cmatch.py
@@ -21,7 +21,7 @@ SRC_DIR = os.path.dirname(os.path.realpath(__file__))
sys.path.insert(0, os.path.join(SRC_DIR, "../lib/python"))
from kdoc.c_lex import CMatch
-from kdoc.xforms_lists import CTransforms
+from kdoc.kdoc_re import KernRe
from unittest_helper import run_unittest
#
@@ -75,7 +75,7 @@ class TestSearch(TestCaseDiff):
def test_search_no_false_positive(self):
line = "call__acquires(foo); // should stay intact"
- result = ", ".join(CMatch(r"\b__acquires").search(line))
+ result = ", ".join(CMatch(r"__acquires").search(line))
self.assertEqual(result, "")
def test_search_no_macro_remains(self):
@@ -88,6 +88,732 @@ class TestSearch(TestCaseDiff):
result = ", ".join(CMatch(line).search(line))
self.assertEqual(result, "")
+#
+# Override unittest.TestCase to better compare diffs ignoring whitespaces
+#
+class TestCaseDiff(unittest.TestCase):
+ """
+ Disable maximum limit on diffs and add a method to better
+ handle diffs with whitespace differences.
+ """
+
+ @classmethod
+ def setUpClass(cls):
+ """Ensure that there won't be limit for diffs"""
+ cls.maxDiff = None
+
+ def assertLogicallyEqual(self, a, b):
+ """
+ Compare two results ignoring multiple whitespace differences.
+
+ This is useful to check more complex matches picked from examples.
+        On the plus side, we also don't need to use dedent.
+        Please notice that line breaks still need to match. We might
+        remove them in the regex, but this way, checking the diff is easier.
+ """
+ a = re.sub(r"[\t ]+", " ", a.strip())
+ b = re.sub(r"[\t ]+", " ", b.strip())
+
+ a = re.sub(r"\s+\n", "\n", a)
+ b = re.sub(r"\s+\n", "\n", b)
+
+ a = re.sub(" ;", ";", a)
+ b = re.sub(" ;", ";", b)
+
+ self.assertEqual(a, b)
+
+#
+# Tests using different macros
+#
+
+class TestSubMultipleMacros(TestCaseDiff):
+ """
+    Tests using different macros.
+
+    Here, we won't use assertLogicallyEqual. Instead, we'll check that
+    each of the expected patterns is present in the answer.
+ """
+
+ def test_acquires_simple(self):
+ """Simple replacement test with __acquires"""
+ line = "__acquires(ctx) foo();"
+ result = CMatch(r"__acquires").sub("REPLACED", line)
+
+ self.assertEqual("REPLACED foo();", result)
+
+ def test_acquires_multiple(self):
+ """Multiple __acquires"""
+ line = "__acquires(ctx) __acquires(other) bar();"
+ result = CMatch(r"__acquires").sub("REPLACED", line)
+
+ self.assertEqual("REPLACED REPLACED bar();", result)
+
+ def test_acquires_nested_paren(self):
+ """__acquires with nested pattern"""
+ line = "__acquires((ctx1, ctx2)) baz();"
+ result = CMatch(r"__acquires").sub("REPLACED", line)
+
+ self.assertEqual("REPLACED baz();", result)
+
+ def test_must_hold(self):
+ """__must_hold with a pointer"""
+ line = "__must_hold(&lock) do_something();"
+ result = CMatch(r"__must_hold").sub("REPLACED", line)
+
+ self.assertNotIn("__must_hold(", result)
+ self.assertIn("do_something();", result)
+
+ def test_must_hold_shared(self):
+ """__must_hold with an upercase defined value"""
+ line = "__must_hold_shared(RCU) other();"
+ result = CMatch(r"__must_hold_shared").sub("REPLACED", line)
+
+ self.assertNotIn("__must_hold_shared(", result)
+ self.assertIn("other();", result)
+
+ def test_no_false_positive(self):
+ """
+ Ensure that unrelated text containing similar patterns is preserved
+ """
+ line = "call__acquires(foo); // should stay intact"
+ result = CMatch(r"\b__acquires").sub("REPLACED", line)
+
+ self.assertLogicallyEqual(result, "call__acquires(foo);")
+
+ def test_mixed_macros(self):
+ """Add a mix of macros"""
+ line = "__acquires(ctx) __releases(ctx) __must_hold(&lock) foo();"
+
+ result = CMatch(r"__acquires").sub("REPLACED", line)
+ result = CMatch(r"__releases").sub("REPLACED", result)
+ result = CMatch(r"__must_hold").sub("REPLACED", result)
+
+ self.assertNotIn("__acquires(", result)
+ self.assertNotIn("__releases(", result)
+ self.assertNotIn("__must_hold(", result)
+
+ self.assertIn("foo();", result)
+
+ def test_no_macro_remains(self):
+ """Ensures that unmatched macros are untouched"""
+ line = "do_something_else();"
+ result = CMatch(r"__acquires").sub("REPLACED", line)
+
+ self.assertEqual(result, line)
+
+ def test_no_function(self):
+ """Ensures that no functions will remain untouched"""
+ line = "something"
+ result = CMatch(line).sub("REPLACED", line)
+
+ self.assertEqual(result, line)
+
+#
+# Check if the diff is logically equivalent. To simplify, the tests here
+# use a single macro name for all replacements.
+#
+
+class TestSubSimple(TestCaseDiff):
+ """
+ Test argument replacements.
+
+    Here, the function name can be anything. So, we picked __attribute__()
+    to mimic a macro found in the kernel, but none of the replacements here
+    has any relationship with the kernel usage.
+ """
+
+ MACRO = "__attribute__"
+
+ @classmethod
+ def setUpClass(cls):
+ """Define a CMatch to be used for all tests"""
+ cls.matcher = CMatch(cls.MACRO)
+
+ def test_sub_with_capture(self):
+ """Test all arguments replacement with a single arg"""
+ line = f"{self.MACRO}(&ctx)\nfoo();"
+
+ result = self.matcher.sub(r"ACQUIRED(\0)", line)
+
+ self.assertLogicallyEqual("ACQUIRED(&ctx)\nfoo();", result)
+
+ def test_sub_zero_placeholder(self):
+ """Test all arguments replacement with a multiple args"""
+ line = f"{self.MACRO}(arg1, arg2)\nbar();"
+
+ result = self.matcher.sub(r"REPLACED(\0)", line)
+
+ self.assertLogicallyEqual("REPLACED(arg1, arg2)\nbar();", result)
+
+ def test_sub_single_placeholder(self):
+ """Single replacement rule for \1"""
+ line = f"{self.MACRO}(ctx, boo)\nfoo();"
+ result = self.matcher.sub(r"ACQUIRED(\1)", line)
+
+ self.assertLogicallyEqual("ACQUIRED(ctx)\nfoo();", result)
+
+ def test_sub_multiple_placeholders(self):
+ """Replacement rule for both \1 and \2"""
+ line = f"{self.MACRO}(arg1, arg2)\nbar();"
+ result = self.matcher.sub(r"REPLACE(\1, \2)", line)
+
+ self.assertLogicallyEqual("REPLACE(arg1, arg2)\nbar();", result)
+
+ def test_sub_mixed_placeholders(self):
+ """Replacement rule for \0, \1 and additional text"""
+ line = f"{self.MACRO}(foo, bar)\nbaz();"
+ result = self.matcher.sub(r"ALL(\0) FIRST(\1)", line)
+
+ self.assertLogicallyEqual("ALL(foo, bar) FIRST(foo)\nbaz();", result)
+
+ def test_sub_no_placeholder(self):
+ """Replacement without placeholders"""
+ line = f"{self.MACRO}(arg)\nfoo();"
+ result = self.matcher.sub(r"NO_BACKREFS()", line)
+
+ self.assertLogicallyEqual("NO_BACKREFS()\nfoo();", result)
+
+ def test_sub_count_parameter(self):
+ """Verify that the algorithm stops after the requested count"""
+ line = f"{self.MACRO}(a1) x();\n{self.MACRO}(a2) y();"
+ result = self.matcher.sub(r"ONLY_FIRST(\1) ", line, count=1)
+
+ self.assertLogicallyEqual(f"ONLY_FIRST(a1) x();\n{self.MACRO}(a2) y();",
+ result)
+
+ def test_strip_multiple_acquires(self):
+ """Check if spaces between removed delimiters will be dropped"""
+ line = f"int {self.MACRO}(1) {self.MACRO}(2 ) {self.MACRO}(3) foo;"
+ result = self.matcher.sub("", line)
+
+ self.assertLogicallyEqual(result, "int foo;")
+
+    def test_raise_early_greedy(self):
+        """Ensure that a greedy backref that is not the last one raises"""
+        line = f"{self.MACRO}(a, b, c, d);"
+        sub = r"\1, \2+, \3"
+
+        with self.assertRaises(ValueError):
+            self.matcher.sub(sub, line)
+
+    def test_raise_multiple_greedy(self):
+        """Ensure that multiple greedy backrefs raise"""
+        line = f"{self.MACRO}(a, b, c, d);"
+        sub = r"\1, \2+, \3+"
+
+        with self.assertRaises(ValueError):
+            self.matcher.sub(sub, line)
+
+#
+# Test replacements with backrefs
+#
+
+
+class TestSubWithLocalXforms(TestCaseDiff):
+ """
+    Test different use case patterns found in the kernel.
+
+    Here, replacements using both CMatch and KernRe can be tested,
+    using a local copy of the replacement rules used by kernel-doc.
+ """
+
+ struct_xforms = [
+ (CMatch("__attribute__"), ' '),
+ (CMatch('__aligned'), ' '),
+ (CMatch('__counted_by'), ' '),
+ (CMatch('__counted_by_(le|be)'), ' '),
+ (CMatch('__guarded_by'), ' '),
+ (CMatch('__pt_guarded_by'), ' '),
+
+ (CMatch('__cacheline_group_(begin|end)'), ''),
+
+ (CMatch('struct_group'), r'\2'),
+ (CMatch('struct_group_attr'), r'\3'),
+ (CMatch('struct_group_tagged'), r'struct \1 { \3+ } \2;'),
+ (CMatch('__struct_group'), r'\4'),
+
+ (CMatch('__ETHTOOL_DECLARE_LINK_MODE_MASK'), r'DECLARE_BITMAP(\1, __ETHTOOL_LINK_MODE_MASK_NBITS)'),
+ (CMatch('DECLARE_PHY_INTERFACE_MASK',), r'DECLARE_BITMAP(\1, PHY_INTERFACE_MODE_MAX)'),
+ (CMatch('DECLARE_BITMAP'), r'unsigned long \1[BITS_TO_LONGS(\2)]'),
+
+ (CMatch('DECLARE_HASHTABLE'), r'unsigned long \1[1 << ((\2) - 1)]'),
+ (CMatch('DECLARE_KFIFO'), r'\2 *\1'),
+ (CMatch('DECLARE_KFIFO_PTR'), r'\2 *\1'),
+ (CMatch('(?:__)?DECLARE_FLEX_ARRAY'), r'\1 \2[]'),
+ (CMatch('DEFINE_DMA_UNMAP_ADDR'), r'dma_addr_t \1'),
+ (CMatch('DEFINE_DMA_UNMAP_LEN'), r'__u32 \1'),
+ (CMatch('VIRTIO_DECLARE_FEATURES'), r'union { u64 \1; u64 \1_array[VIRTIO_FEATURES_U64S]; }'),
+ ]
+
+ function_xforms = [
+ (CMatch('__printf'), ""),
+ (CMatch('__(?:re)?alloc_size'), ""),
+ (CMatch("__diagnose_as"), ""),
+ (CMatch("DECL_BUCKET_PARAMS"), r"\1, \2"),
+
+ (CMatch("__cond_acquires"), ""),
+ (CMatch("__cond_releases"), ""),
+ (CMatch("__acquires"), ""),
+ (CMatch("__releases"), ""),
+ (CMatch("__must_hold"), ""),
+ (CMatch("__must_not_hold"), ""),
+ (CMatch("__must_hold_shared"), ""),
+ (CMatch("__cond_acquires_shared"), ""),
+ (CMatch("__acquires_shared"), ""),
+ (CMatch("__releases_shared"), ""),
+ (CMatch("__attribute__"), ""),
+ ]
+
+ var_xforms = [
+ (CMatch('__guarded_by'), ""),
+ (CMatch('__pt_guarded_by'), ""),
+ (CMatch("LIST_HEAD"), r"struct list_head \1"),
+ ]
+
+ #: Transforms main dictionary used at apply_transforms().
+ xforms = {
+ "struct": struct_xforms,
+ "func": function_xforms,
+ "var": var_xforms,
+ }
+
+ @classmethod
+ def apply_transforms(cls, xform_type, text):
+ """
+        Mimic the behavior of the kdoc_parser.apply_transforms() method.
+
+        For each element of the selected transform list, apply sub().
+
+ There are two parameters:
+
+ - ``xform_type``
+ Can be ``func``, ``struct`` or ``var``;
+ - ``text``
+ The text where the sub patterns from CTransforms will be applied.
+ """
+ for search, subst in cls.xforms.get(xform_type):
+ text = search.sub(subst, text)
+
+ return text.strip()
+
+
+ def test_struct_group(self):
+ """
+ Test struct_group using a pattern from
+ drivers/net/ethernet/asix/ax88796c_main.h.
+ """
+ line = """
+ struct tx_pkt_info {
+ struct_group(tx_overhead,
+ struct tx_sop_header sop;
+ struct tx_segment_header seg;
+ );
+ struct tx_eop_header eop;
+ u16 pkt_len;
+ u16 seq_num;
+ };
+ """
+ expected = """
+ struct tx_pkt_info {
+ struct tx_sop_header sop;
+ struct tx_segment_header seg;
+ struct tx_eop_header eop;
+ u16 pkt_len;
+ u16 seq_num;
+ };
+ """
+
+ result = self.apply_transforms("struct", line)
+ self.assertLogicallyEqual(result, expected)
+
+ def test_struct_group_attr(self):
+ """
+ Test two struct_group_attr using patterns from fs/smb/client/cifspdu.h.
+ """
+ line = """
+ typedef struct smb_com_open_rsp {
+ struct smb_hdr hdr; /* wct = 34 BB */
+ __u8 AndXCommand;
+ __u8 AndXReserved;
+ __le16 AndXOffset;
+ __u8 OplockLevel;
+ __u16 Fid;
+ __le32 CreateAction;
+ struct_group_attr(common_attributes,,
+ __le64 CreationTime;
+ __le64 LastAccessTime;
+ __le64 LastWriteTime;
+ __le64 ChangeTime;
+ __le32 FileAttributes;
+ );
+ __le64 AllocationSize;
+ __le64 EndOfFile;
+ __le16 FileType;
+ __le16 DeviceState;
+ __u8 DirectoryFlag;
+ __u16 ByteCount; /* bct = 0 */
+ } OPEN_RSP;
+ typedef struct {
+ struct_group_attr(common_attributes,,
+ __le64 CreationTime;
+ __le64 LastAccessTime;
+ __le64 LastWriteTime;
+ __le64 ChangeTime;
+ __le32 Attributes;
+ );
+ __u32 Pad1;
+ __le64 AllocationSize;
+ __le64 EndOfFile;
+ __le32 NumberOfLinks;
+ __u8 DeletePending;
+ __u8 Directory;
+ __u16 Pad2;
+ __le32 EASize;
+ __le32 FileNameLength;
+ union {
+ char __pad;
+ DECLARE_FLEX_ARRAY(char, FileName);
+ };
+ } FILE_ALL_INFO; /* level 0x107 QPathInfo */
+ """
+ expected = """
+ typedef struct smb_com_open_rsp {
+ struct smb_hdr hdr;
+ __u8 AndXCommand;
+ __u8 AndXReserved;
+ __le16 AndXOffset;
+ __u8 OplockLevel;
+ __u16 Fid;
+ __le32 CreateAction;
+ __le64 CreationTime;
+ __le64 LastAccessTime;
+ __le64 LastWriteTime;
+ __le64 ChangeTime;
+ __le32 FileAttributes;
+ __le64 AllocationSize;
+ __le64 EndOfFile;
+ __le16 FileType;
+ __le16 DeviceState;
+ __u8 DirectoryFlag;
+ __u16 ByteCount;
+ } OPEN_RSP;
+ typedef struct {
+ __le64 CreationTime;
+ __le64 LastAccessTime;
+ __le64 LastWriteTime;
+ __le64 ChangeTime;
+ __le32 Attributes;
+ __u32 Pad1;
+ __le64 AllocationSize;
+ __le64 EndOfFile;
+ __le32 NumberOfLinks;
+ __u8 DeletePending;
+ __u8 Directory;
+ __u16 Pad2;
+ __le32 EASize;
+ __le32 FileNameLength;
+ union {
+ char __pad;
+ char FileName[];
+ };
+ } FILE_ALL_INFO;
+ """
+
+ result = self.apply_transforms("struct", line)
+ self.assertLogicallyEqual(result, expected)
+
+ def test_raw_struct_group(self):
+ """
+ Test a __struct_group pattern from include/uapi/cxl/features.h.
+ """
+ line = """
+ struct cxl_mbox_get_sup_feats_out {
+ __struct_group(cxl_mbox_get_sup_feats_out_hdr, hdr, /* empty */,
+ __le16 num_entries;
+ __le16 supported_feats;
+ __u8 reserved[4];
+ );
+ struct cxl_feat_entry ents[] __counted_by_le(num_entries);
+ } __attribute__ ((__packed__));
+ """
+ expected = """
+ struct cxl_mbox_get_sup_feats_out {
+ __le16 num_entries;
+ __le16 supported_feats;
+ __u8 reserved[4];
+ struct cxl_feat_entry ents[];
+ };
+ """
+
+ result = self.apply_transforms("struct", line)
+ self.assertLogicallyEqual(result, expected)
+
+ def test_raw_struct_group_tagged(self):
+ r"""
+ Test cxl_regs with struct_group_tagged patterns from drivers/cxl/cxl.h.
+
+ NOTE:
+
+ This one has actually a violation from what kernel-doc would
+ expect: Kernel-doc regex expects only 3 members, but this is
+ actually defined as::
+
+ #define struct_group_tagged(TAG, NAME, MEMBERS...)
+
+ The replace expression there is::
+
+ struct \1 { \3 } \2;
+
+ but it should be really something like::
+
+ struct \1 { \3 \4 \5 \6 \7 \8 ... } \2;
+
+        A later fix would be needed to address it.
+
+ """
+ line = """
+ struct cxl_regs {
+ struct_group_tagged(cxl_component_regs, component,
+ void __iomem *hdm_decoder;
+ void __iomem *ras;
+ );
+
+
+            /* This is actually a violation: too many commas */
+ struct_group_tagged(cxl_device_regs, device_regs,
+ void __iomem *status, *mbox, *memdev;
+ );
+
+ struct_group_tagged(cxl_pmu_regs, pmu_regs,
+ void __iomem *pmu;
+ );
+
+ struct_group_tagged(cxl_rch_regs, rch_regs,
+ void __iomem *dport_aer;
+ );
+
+ struct_group_tagged(cxl_rcd_regs, rcd_regs,
+ void __iomem *rcd_pcie_cap;
+ );
+ };
+ """
+ expected = """
+ struct cxl_regs {
+ struct cxl_component_regs {
+ void __iomem *hdm_decoder;
+ void __iomem *ras;
+ } component;
+
+ struct cxl_device_regs {
+ void __iomem *status, *mbox, *memdev;
+ } device_regs;
+
+ struct cxl_pmu_regs {
+ void __iomem *pmu;
+ } pmu_regs;
+
+ struct cxl_rch_regs {
+ void __iomem *dport_aer;
+ } rch_regs;
+
+ struct cxl_rcd_regs {
+ void __iomem *rcd_pcie_cap;
+ } rcd_regs;
+ };
+ """
+
+ result = self.apply_transforms("struct", line)
+ self.assertLogicallyEqual(result, expected)
+
+ def test_struct_group_tagged_with_private(self):
+ """
+        Replace struct_group_tagged with a private section, using the same
+        regex for the replacement as what happens in xforms_lists.py.
+
+        As the private removal happens outside the CMatch class, we manually
+        drop the remaining part of the struct, to simulate what happens
+        in kdoc_parser.
+
+ Taken from include/net/page_pool/types.h
+ """
+ line = """
+ struct page_pool_params {
+ struct_group_tagged(page_pool_params_slow, slow,
+ struct net_device *netdev;
+ unsigned int queue_idx;
+ unsigned int flags;
+ /* private: only under "slow" struct */
+ unsigned int ignored;
+ );
+ /* Struct below shall not be ignored */
+ struct_group_tagged(page_pool_params_fast, fast,
+ unsigned int order;
+ unsigned int pool_size;
+ int nid;
+ struct device *dev;
+ struct napi_struct *napi;
+ enum dma_data_direction dma_dir;
+ unsigned int max_len;
+ unsigned int offset;
+ );
+ };
+ """
+ expected = """
+ struct page_pool_params {
+ struct page_pool_params_slow {
+ struct net_device *netdev;
+ unsigned int queue_idx;
+ unsigned int flags;
+ } slow;
+ struct page_pool_params_fast {
+ unsigned int order;
+ unsigned int pool_size;
+ int nid;
+ struct device *dev;
+ struct napi_struct *napi;
+ enum dma_data_direction dma_dir;
+ unsigned int max_len;
+ unsigned int offset;
+ } fast;
+ };
+ """
+
+ result = self.apply_transforms("struct", line)
+ self.assertLogicallyEqual(result, expected)
+
+ def test_struct_kcov(self):
+ """
+ Test a struct from kernel/kcov.c.
+ """
+ line = """
+ struct kcov {
+ refcount_t refcount;
+ spinlock_t lock;
+ enum kcov_mode mode __guarded_by(&lock);
+ unsigned int size __guarded_by(&lock);
+ void *area __guarded_by(&lock);
+ struct task_struct *t __guarded_by(&lock);
+ bool remote;
+ unsigned int remote_size;
+ int sequence;
+ };
+ """
+ expected = """
+ struct kcov {
+ refcount_t refcount;
+ spinlock_t lock;
+ enum kcov_mode mode;
+ unsigned int size;
+ void *area;
+ struct task_struct *t;
+ bool remote;
+ unsigned int remote_size;
+ int sequence;
+ };
+ """
+
+ result = self.apply_transforms("struct", line)
+ self.assertLogicallyEqual(result, expected)
+
+ def test_vars_stackdepot(self):
+ """
+ Test guarded_by on vars from lib/stackdepot.c.
+ """
+ line = """
+ size_t pool_offset __guarded_by(&pool_lock) = DEPOT_POOL_SIZE;
+ __guarded_by(&pool_lock) LIST_HEAD(free_stacks);
+ void **stack_pools __pt_guarded_by(&pool_lock);
+ """
+ expected = """
+ size_t pool_offset = DEPOT_POOL_SIZE;
+ struct list_head free_stacks;
+ void **stack_pools;
+ """
+
+ result = self.apply_transforms("var", line)
+ self.assertLogicallyEqual(result, expected)
+
+ def test_functions_with_acquires_and_releases(self):
+ """
+        Test function annotations like __acquires/__releases/__cond_acquires.
+ """
+ line = """
+ bool prepare_report_consumer(unsigned long *flags,
+ const struct access_info *ai,
+ struct other_info *other_info) \
+ __cond_acquires(true, &report_lock);
+
+ int tcp_sigpool_start(unsigned int id, struct tcp_sigpool *c) \
+ __cond_acquires(0, RCU_BH);
+
+ bool undo_report_consumer(unsigned long *flags,
+ const struct access_info *ai,
+ struct other_info *other_info) \
+ __cond_releases(true, &report_lock);
+
+ void debugfs_enter_cancellation(struct file *file,
+ struct debugfs_cancellation *c) \
+ __acquires(cancellation);
+
+ void debugfs_leave_cancellation(struct file *file,
+ struct debugfs_cancellation *c) \
+ __releases(cancellation);
+
+ acpi_cpu_flags acpi_os_acquire_lock(acpi_spinlock lockp) \
+ __acquires(lockp);
+
+ void acpi_os_release_lock(acpi_spinlock lockp,
+ acpi_cpu_flags not_used) \
+ __releases(lockp)
+ """
+ expected = """
+ bool prepare_report_consumer(unsigned long *flags,
+ const struct access_info *ai,
+ struct other_info *other_info);
+
+ int tcp_sigpool_start(unsigned int id, struct tcp_sigpool *c);
+
+ bool undo_report_consumer(unsigned long *flags,
+ const struct access_info *ai,
+ struct other_info *other_info);
+
+ void debugfs_enter_cancellation(struct file *file,
+ struct debugfs_cancellation *c);
+
+ void debugfs_leave_cancellation(struct file *file,
+ struct debugfs_cancellation *c);
+
+ acpi_cpu_flags acpi_os_acquire_lock(acpi_spinlock lockp);
+
+ void acpi_os_release_lock(acpi_spinlock lockp,
+ acpi_cpu_flags not_used)
+ """
+
+ result = self.apply_transforms("func", line)
+ self.assertLogicallyEqual(result, expected)
+
#
# Run all tests
#
--
2.52.0
^ permalink raw reply related [flat|nested] 28+ messages in thread
* [PATCH v3 14/22] docs: kdoc: replace NestedMatch with CMatch
2026-03-17 18:09 [PATCH v3 00/22] kernel-doc: use a C lexical tokenizer for transforms Mauro Carvalho Chehab
` (12 preceding siblings ...)
2026-03-17 18:09 ` [PATCH v3 13/22] unittests: test_cmatch: add tests for sub() Mauro Carvalho Chehab
@ 2026-03-17 18:09 ` Mauro Carvalho Chehab
2026-03-17 18:09 ` [PATCH v3 15/22] docs: kdoc_re: get rid of NestedMatch class Mauro Carvalho Chehab
` (8 subsequent siblings)
22 siblings, 0 replies; 28+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-17 18:09 UTC (permalink / raw)
To: Jonathan Corbet, Linux Doc Mailing List
Cc: Mauro Carvalho Chehab, linux-hardening, linux-kernel,
Aleksandr Loktionov, Randy Dunlap
Our previous approach to solving nested structs was to use
NestedMatch. It works well, but adding support to parse delimiters
is very complex.
Instead, use CMatch, which uses a C tokenizer, making the code more
reliable and simpler.
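For illustration, a hypothetical snippet (not kernel code) showing the
failure mode this avoids: a non-nesting regex truncates at the first
closing parenthesis, leaving a stray one behind:

```python
import re

# A naive, non-nesting regex stops at the first ')' -- the case
# CMatch handles by tracking delimiter depth through the tokenizer.
naive = re.compile(r"__acquires\s*\([^)]*\)")

line = "__acquires((ctx1, ctx2)) baz();"
print(naive.sub("", line))  # -> ") baz();"
```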
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
tools/lib/python/kdoc/kdoc_parser.py | 2 +-
tools/lib/python/kdoc/xforms_lists.py | 31 ++++++++++++++-------------
2 files changed, 17 insertions(+), 16 deletions(-)
diff --git a/tools/lib/python/kdoc/kdoc_parser.py b/tools/lib/python/kdoc/kdoc_parser.py
index 62d8030cf532..efd58c88ff31 100644
--- a/tools/lib/python/kdoc/kdoc_parser.py
+++ b/tools/lib/python/kdoc/kdoc_parser.py
@@ -14,7 +14,7 @@ import re
from pprint import pformat
from kdoc.c_lex import CTokenizer
-from kdoc.kdoc_re import NestedMatch, KernRe
+from kdoc.kdoc_re import KernRe
from kdoc.kdoc_item import KdocItem
#
diff --git a/tools/lib/python/kdoc/xforms_lists.py b/tools/lib/python/kdoc/xforms_lists.py
index c07cbe1e6349..7fa7f52cec7b 100644
--- a/tools/lib/python/kdoc/xforms_lists.py
+++ b/tools/lib/python/kdoc/xforms_lists.py
@@ -4,7 +4,8 @@
import re
-from kdoc.kdoc_re import KernRe, NestedMatch
+from kdoc.kdoc_re import KernRe
+from kdoc.c_lex import CMatch
struct_args_pattern = r'([^,)]+)'
@@ -60,7 +61,7 @@ class CTransforms:
#
# As it doesn't properly match the end parenthesis on some cases.
#
- # So, a better solution was crafted: there's now a NestedMatch
+ # So, a better solution was crafted: there's now a CMatch
# class that ensures that delimiters after a search are properly
# matched. So, the implementation to drop STRUCT_GROUP() will be
# handled in separate.
@@ -72,9 +73,9 @@ class CTransforms:
#
# Replace macros
#
- # TODO: use NestedMatch for FOO($1, $2, ...) matches
+ # TODO: use CMatch for FOO($1, $2, ...) matches
#
- # it is better to also move those to the NestedMatch logic,
+ # it is better to also move those to the CMatch logic,
# to ensure that parentheses will be properly matched.
#
(KernRe(r'__ETHTOOL_DECLARE_LINK_MODE_MASK\s*\(([^\)]+)\)', re.S),
@@ -95,17 +96,17 @@ class CTransforms:
(KernRe(r'DEFINE_DMA_UNMAP_LEN\s*\(' + struct_args_pattern + r'\)', re.S), r'__u32 \1'),
(KernRe(r'VIRTIO_DECLARE_FEATURES\(([\w_]+)\)'), r'union { u64 \1; u64 \1_array[VIRTIO_FEATURES_U64S]; }'),
- (NestedMatch(r"__cond_acquires\s*\("), ""),
- (NestedMatch(r"__cond_releases\s*\("), ""),
- (NestedMatch(r"__acquires\s*\("), ""),
- (NestedMatch(r"__releases\s*\("), ""),
- (NestedMatch(r"__must_hold\s*\("), ""),
- (NestedMatch(r"__must_not_hold\s*\("), ""),
- (NestedMatch(r"__must_hold_shared\s*\("), ""),
- (NestedMatch(r"__cond_acquires_shared\s*\("), ""),
- (NestedMatch(r"__acquires_shared\s*\("), ""),
- (NestedMatch(r"__releases_shared\s*\("), ""),
- (NestedMatch(r'\bSTRUCT_GROUP\('), r'\0'),
+ (CMatch(r"__cond_acquires"), ""),
+ (CMatch(r"__cond_releases"), ""),
+ (CMatch(r"__acquires"), ""),
+ (CMatch(r"__releases"), ""),
+ (CMatch(r"__must_hold"), ""),
+ (CMatch(r"__must_not_hold"), ""),
+ (CMatch(r"__must_hold_shared"), ""),
+ (CMatch(r"__cond_acquires_shared"), ""),
+ (CMatch(r"__acquires_shared"), ""),
+ (CMatch(r"__releases_shared"), ""),
+ (CMatch(r"STRUCT_GROUP"), r'\0'),
]
#: Transforms for function prototypes.
--
2.52.0
* [PATCH v3 15/22] docs: kdoc_re: get rid of NestedMatch class
2026-03-17 18:09 [PATCH v3 00/22] kernel-doc: use a C lexical tokenizer for transforms Mauro Carvalho Chehab
` (13 preceding siblings ...)
2026-03-17 18:09 ` [PATCH v3 14/22] docs: kdoc: replace NestedMatch with CMatch Mauro Carvalho Chehab
@ 2026-03-17 18:09 ` Mauro Carvalho Chehab
2026-03-17 18:09 ` [PATCH v3 16/22] docs: xforms_lists: handle struct_group directly Mauro Carvalho Chehab
` (7 subsequent siblings)
22 siblings, 0 replies; 28+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-17 18:09 UTC (permalink / raw)
To: Jonathan Corbet, Linux Doc Mailing List
Cc: Mauro Carvalho Chehab, linux-hardening, linux-kernel,
Aleksandr Loktionov, Randy Dunlap
Now that everything has been converted to CMatch, we can get rid
of the previous NestedMatch implementation.
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
tools/lib/python/kdoc/kdoc_re.py | 201 -------------------------------
1 file changed, 201 deletions(-)
diff --git a/tools/lib/python/kdoc/kdoc_re.py b/tools/lib/python/kdoc/kdoc_re.py
index 085b89a4547c..6f3ae28859ea 100644
--- a/tools/lib/python/kdoc/kdoc_re.py
+++ b/tools/lib/python/kdoc/kdoc_re.py
@@ -140,204 +140,3 @@ class KernRe:
"""
return self.last_match.groups()
-
-#: Nested delimited pairs (brackets and parenthesis)
-DELIMITER_PAIRS = {
- '{': '}',
- '(': ')',
- '[': ']',
-}
-
-#: compiled delimiters
-RE_DELIM = KernRe(r'[\{\}\[\]\(\)]')
-
-
-class NestedMatch:
- """
- Finding nested delimiters is hard with regular expressions. It is
- even harder on Python with its normal re module, as there are several
- advanced regular expressions that are missing.
-
- This is the case of this pattern::
-
- '\\bSTRUCT_GROUP(\\(((?:(?>[^)(]+)|(?1))*)\\))[^;]*;'
-
- which is used to properly match open/close parentheses of the
- string search STRUCT_GROUP(),
-
- Add a class that counts pairs of delimiters, using it to match and
- replace nested expressions.
-
- The original approach was suggested by:
-
- https://stackoverflow.com/questions/5454322/python-how-to-match-nested-parentheses-with-regex
-
- Although I re-implemented it to make it more generic and match 3 types
- of delimiters. The logic checks if delimiters are paired. If not, it
- will ignore the search string.
- """
-
- # TODO: make NestedMatch handle multiple match groups
- #
- # Right now, regular expressions to match it are defined only up to
- # the start delimiter, e.g.:
- #
- # \bSTRUCT_GROUP\(
- #
- # is similar to: STRUCT_GROUP\((.*)\)
- # except that the content inside the match group is delimiter-aligned.
- #
- # The content inside parentheses is converted into a single replace
- # group (e.g. r`\0').
- #
- # It would be nice to change such definition to support multiple
- # match groups, allowing a regex equivalent to:
- #
- # FOO\((.*), (.*), (.*)\)
- #
- # it is probably easier to define it not as a regular expression, but
- # with some lexical definition like:
- #
- # FOO(arg1, arg2, arg3)
-
- def __init__(self, regex):
- self.regex = KernRe(regex)
-
- def _search(self, line):
- """
- Finds paired blocks for a regex that ends with a delimiter.
-
- The suggestion of using finditer to match pairs came from:
- https://stackoverflow.com/questions/5454322/python-how-to-match-nested-parentheses-with-regex
- but I ended using a different implementation to align all three types
- of delimiters and seek for an initial regular expression.
-
- The algorithm seeks for open/close paired delimiters and places them
- into a stack, yielding a start/stop position of each match when the
- stack is zeroed.
-
- The algorithm should work fine for properly paired lines, but will
- silently ignore end delimiters that precede a start delimiter.
- This should be OK for kernel-doc parser, as unaligned delimiters
- would cause compilation errors. So, we don't need to raise exceptions
- to cover such issues.
- """
-
- stack = []
-
- for match_re in self.regex.finditer(line):
- start = match_re.start()
- offset = match_re.end()
- string_char = None
- escape = False
-
- d = line[offset - 1]
- if d not in DELIMITER_PAIRS:
- continue
-
- end = DELIMITER_PAIRS[d]
- stack.append(end)
-
- for match in RE_DELIM.finditer(line[offset:]):
- pos = match.start() + offset
-
- d = line[pos]
-
- if escape:
- escape = False
- continue
-
- if string_char:
- if d == '\\':
- escape = True
- elif d == string_char:
- string_char = None
-
- continue
-
- if d in ('"', "'"):
- string_char = d
- continue
-
- if d in DELIMITER_PAIRS:
- end = DELIMITER_PAIRS[d]
-
- stack.append(end)
- continue
-
- # Does the end delimiter match what is expected?
- if stack and d == stack[-1]:
- stack.pop()
-
- if not stack:
- yield start, offset, pos + 1
- break
-
- def search(self, line):
- """
- This is similar to re.search:
-
- It matches a regex that it is followed by a delimiter,
- returning occurrences only if all delimiters are paired.
- """
-
- for t in self._search(line):
-
- yield line[t[0]:t[2]]
-
- def sub(self, sub, line, count=0):
- """
- This is similar to re.sub:
-
- It matches a regex that it is followed by a delimiter,
- replacing occurrences only if all delimiters are paired.
-
- if the sub argument contains::
-
- r'\0'
-
- it will work just like re: it places there the matched paired data
- with the delimiter stripped.
-
- If count is different than zero, it will replace at most count
- items.
- """
- out = ""
-
- cur_pos = 0
- n = 0
-
- for start, end, pos in self._search(line):
- out += line[cur_pos:start]
-
- # Value, ignoring start/end delimiters
- value = line[end:pos - 1]
-
- # replaces \0 at the sub string, if \0 is used there
- new_sub = sub
- new_sub = new_sub.replace(r'\0', value)
-
- out += new_sub
-
- # Drop end ';' if any
- if pos < len(line) and line[pos] == ';':
- pos += 1
-
- cur_pos = pos
- n += 1
-
- if count and count >= n:
- break
-
- # Append the remaining string
- l = len(line)
- out += line[cur_pos:l]
-
- return out
-
- def __repr__(self):
- """
- Returns a displayable version of the class init.
- """
-
- return f'NestedMatch("{self.regex.regex.pattern}")'
--
2.52.0
* [PATCH v3 16/22] docs: xforms_lists: handle struct_group directly
2026-03-17 18:09 [PATCH v3 00/22] kernel-doc: use a C lexical tokenizer for transforms Mauro Carvalho Chehab
` (14 preceding siblings ...)
2026-03-17 18:09 ` [PATCH v3 15/22] docs: kdoc_re: get rid of NestedMatch class Mauro Carvalho Chehab
@ 2026-03-17 18:09 ` Mauro Carvalho Chehab
2026-03-17 18:09 ` [PATCH v3 17/22] docs: xforms_lists: better evaluate struct_group macros Mauro Carvalho Chehab
` (6 subsequent siblings)
22 siblings, 0 replies; 28+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-17 18:09 UTC (permalink / raw)
To: Jonathan Corbet, Linux Doc Mailing List
Cc: Mauro Carvalho Chehab, linux-hardening, linux-kernel,
Aleksandr Loktionov, Randy Dunlap
The previous logic handled struct_group in two steps.
Remove that approach, as CMatch can do it the right
way in a single step.
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
tools/lib/python/kdoc/xforms_lists.py | 53 +++------------------------
1 file changed, 6 insertions(+), 47 deletions(-)
diff --git a/tools/lib/python/kdoc/xforms_lists.py b/tools/lib/python/kdoc/xforms_lists.py
index 7fa7f52cec7b..98632c50a146 100644
--- a/tools/lib/python/kdoc/xforms_lists.py
+++ b/tools/lib/python/kdoc/xforms_lists.py
@@ -32,52 +32,6 @@ class CTransforms:
(KernRe(r'\s*____cacheline_aligned_in_smp', re.S), ' '),
(KernRe(r'\s*____cacheline_aligned', re.S), ' '),
(KernRe(r'\s*__cacheline_group_(begin|end)\([^\)]+\);'), ''),
- #
- # Unwrap struct_group macros based on this definition:
- # __struct_group(TAG, NAME, ATTRS, MEMBERS...)
- # which has variants like: struct_group(NAME, MEMBERS...)
- # Only MEMBERS arguments require documentation.
- #
- # Parsing them happens on two steps:
- #
- # 1. drop struct group arguments that aren't at MEMBERS,
- # storing them as STRUCT_GROUP(MEMBERS)
- #
- # 2. remove STRUCT_GROUP() ancillary macro.
- #
- # The original logic used to remove STRUCT_GROUP() using an
- # advanced regex:
- #
- # \bSTRUCT_GROUP(\(((?:(?>[^)(]+)|(?1))*)\))[^;]*;
- #
- # with two patterns that are incompatible with
- # Python re module, as it has:
- #
- # - a recursive pattern: (?1)
- # - an atomic grouping: (?>...)
- #
- # I tried a simpler version: but it didn't work either:
- # \bSTRUCT_GROUP\(([^\)]+)\)[^;]*;
- #
- # As it doesn't properly match the end parenthesis on some cases.
- #
- # So, a better solution was crafted: there's now a CMatch
- # class that ensures that delimiters after a search are properly
- # matched. So, the implementation to drop STRUCT_GROUP() will be
- # handled in separate.
- #
- (KernRe(r'\bstruct_group\s*\(([^,]*,)', re.S), r'STRUCT_GROUP('),
- (KernRe(r'\bstruct_group_attr\s*\(([^,]*,){2}', re.S), r'STRUCT_GROUP('),
- (KernRe(r'\bstruct_group_tagged\s*\(([^,]*),([^,]*),', re.S), r'struct \1 \2; STRUCT_GROUP('),
- (KernRe(r'\b__struct_group\s*\(([^,]*,){3}', re.S), r'STRUCT_GROUP('),
- #
- # Replace macros
- #
- # TODO: use CMatch for FOO($1, $2, ...) matches
- #
- # it is better to also move those to the CMatch logic,
- # to ensure that parentheses will be properly matched.
- #
(KernRe(r'__ETHTOOL_DECLARE_LINK_MODE_MASK\s*\(([^\)]+)\)', re.S),
r'DECLARE_BITMAP(\1, __ETHTOOL_LINK_MODE_MASK_NBITS)'),
(KernRe(r'DECLARE_PHY_INTERFACE_MASK\s*\(([^\)]+)\)', re.S),
@@ -106,7 +60,12 @@ class CTransforms:
(CMatch(r"__cond_acquires_shared"), ""),
(CMatch(r"__acquires_shared"), ""),
(CMatch(r"__releases_shared"), ""),
- (CMatch(r"STRUCT_GROUP"), r'\0'),
+
+ (CMatch('struct_group'), r'\2'),
+ (CMatch('struct_group_attr'), r'\3'),
+ (CMatch('struct_group_tagged'), r'struct \1 \2; \3'),
+ (CMatch('__struct_group'), r'\4'),
+
]
#: Transforms for function prototypes.
--
2.52.0
* [PATCH v3 17/22] docs: xforms_lists: better evaluate struct_group macros
2026-03-17 18:09 [PATCH v3 00/22] kernel-doc: use a C lexical tokenizer for transforms Mauro Carvalho Chehab
` (15 preceding siblings ...)
2026-03-17 18:09 ` [PATCH v3 16/22] docs: xforms_lists: handle struct_group directly Mauro Carvalho Chehab
@ 2026-03-17 18:09 ` Mauro Carvalho Chehab
2026-03-18 7:44 ` Loktionov, Aleksandr
2026-03-17 18:09 ` [PATCH v3 18/22] docs: c_lex: setup a logger to report tokenizer issues Mauro Carvalho Chehab
` (5 subsequent siblings)
22 siblings, 1 reply; 28+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-17 18:09 UTC (permalink / raw)
To: Jonathan Corbet, Linux Doc Mailing List
Cc: Mauro Carvalho Chehab, linux-hardening, linux-kernel,
Aleksandr Loktionov, Randy Dunlap
The previous approach was to unwind nested structs/unions.
Now that we have logic that can handle them well, use it to
ensure that struct_group macros will properly reflect the
actual struct.
Note that the replacement logic still simplifies the code
a little bit, as the basic building block for struct_group is:
union { \
struct { MEMBERS } ATTRS; \
struct __struct_group_tag(TAG) { MEMBERS } ATTRS NAME; \
} ATTRS
Here:
- ATTRS is meant to add extra macro attributes like __packed,
which we already discard, as they aren't relevant for
documenting struct members;
- TAG is used only when building with __cplusplus.
So, instead, convert them into just:
struct { MEMBERS };
Note that we're using the greedy version of the backrefs here,
as MEMBERS is actually MEMBERS... on all such macros.
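As a rough sketch of what the replacement achieves, the function below drops the leading non-MEMBERS arguments and rewraps the tail as an anonymous struct. It uses naive comma splitting, so unlike the real tokenizer-based CMatch it would mishandle commas nested inside MEMBERS; the name unwrap_struct_group is illustrative only:

```python
def unwrap_struct_group(args, skip):
    """args: flattened macro argument list as a string;
    skip: number of leading non-MEMBERS arguments to drop
    (1 for struct_group, 3 for __struct_group, etc.)."""
    members = [a.strip() for a in args.split(',')][skip:]
    return 'struct { %s };' % ' '.join(members)
```

So `__struct_group(tag, name, attrs, int a;, int b;)` collapses to `struct { int a; int b; };`, which is all kernel-doc needs to document the members once.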
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
tools/lib/python/kdoc/xforms_lists.py | 14 ++++++++++----
1 file changed, 10 insertions(+), 4 deletions(-)
diff --git a/tools/lib/python/kdoc/xforms_lists.py b/tools/lib/python/kdoc/xforms_lists.py
index 98632c50a146..2056572852fd 100644
--- a/tools/lib/python/kdoc/xforms_lists.py
+++ b/tools/lib/python/kdoc/xforms_lists.py
@@ -61,10 +61,16 @@ class CTransforms:
(CMatch(r"__acquires_shared"), ""),
(CMatch(r"__releases_shared"), ""),
- (CMatch('struct_group'), r'\2'),
- (CMatch('struct_group_attr'), r'\3'),
- (CMatch('struct_group_tagged'), r'struct \1 \2; \3'),
- (CMatch('__struct_group'), r'\4'),
+ #
+ # Macro __struct_group() creates an union with an anonymous
+ # and a non-anonymous struct, depending on the parameters. We only
+ # need one of those at kernel-doc, as we won't be documenting the same
+ # members twice.
+ #
+ (CMatch('struct_group'), r'struct { \2+ };'),
+ (CMatch('struct_group_attr'), r'struct { \3+ };'),
+ (CMatch('struct_group_tagged'), r'struct { \3+ };'),
+ (CMatch('__struct_group'), r'struct { \4+ };'),
]
--
2.52.0
* [PATCH v3 18/22] docs: c_lex: setup a logger to report tokenizer issues
2026-03-17 18:09 [PATCH v3 00/22] kernel-doc: use a C lexical tokenizer for transforms Mauro Carvalho Chehab
` (16 preceding siblings ...)
2026-03-17 18:09 ` [PATCH v3 17/22] docs: xforms_lists: better evaluate struct_group macros Mauro Carvalho Chehab
@ 2026-03-17 18:09 ` Mauro Carvalho Chehab
2026-03-18 7:45 ` Loktionov, Aleksandr
2026-03-17 18:09 ` [PATCH v3 19/22] docs: kernel-doc.rst: document private: scope propagation Mauro Carvalho Chehab
` (4 subsequent siblings)
22 siblings, 1 reply; 28+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-17 18:09 UTC (permalink / raw)
To: Jonathan Corbet, Linux Doc Mailing List
Cc: Mauro Carvalho Chehab, linux-hardening, linux-kernel,
Aleksandr Loktionov, Randy Dunlap
Report the file that has issues detected via CMatch and CTokenizer.
This is done by setting up a logger that will be overridden by
kdoc_parser, when used from it.
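The pattern used here — a LoggerAdapter whose process() prepends a fixed prefix to every message — can be demonstrated standalone. The helper name make_prefixed_logger below is illustrative, not the patch's API:

```python
import logging

class PrefixAdapter(logging.LoggerAdapter):
    """Prepend a fixed prefix (e.g. the file being parsed)
    to every log message passing through this adapter."""
    def process(self, msg, kwargs):
        return f"{self.extra['prefix']}{msg}", kwargs

def make_prefixed_logger(name, prefix):
    # Wrap a named logger so callers need not repeat the prefix
    return PrefixAdapter(logging.getLogger(name), {"prefix": prefix})
```

The tokenizer module can then keep calling `log.warning(...)` unchanged, while messages arrive tagged with the source file name.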
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
tools/lib/python/kdoc/c_lex.py | 16 ++++++++++++++++
tools/lib/python/kdoc/kdoc_parser.py | 4 +++-
2 files changed, 19 insertions(+), 1 deletion(-)
diff --git a/tools/lib/python/kdoc/c_lex.py b/tools/lib/python/kdoc/c_lex.py
index 20e50ff0ecd5..b6d58bd470a9 100644
--- a/tools/lib/python/kdoc/c_lex.py
+++ b/tools/lib/python/kdoc/c_lex.py
@@ -22,6 +22,22 @@ from .kdoc_re import KernRe
log = logging.getLogger(__name__)
+def tokenizer_set_log(logger, prefix = ""):
+ """
+ Replace the module‑level logger with a LoggerAdapter that
+ prepends *prefix* to every message.
+ """
+ global log
+
+ class PrefixAdapter(logging.LoggerAdapter):
+ """
+ Ancillary class to set prefix on all message logs.
+ """
+ def process(self, msg, kwargs):
+ return f"{prefix}{msg}", kwargs
+
+ # Wrap the provided logger in our adapter
+ log = PrefixAdapter(logger, {"prefix": prefix})
class CToken():
"""
diff --git a/tools/lib/python/kdoc/kdoc_parser.py b/tools/lib/python/kdoc/kdoc_parser.py
index efd58c88ff31..f90c6dd0343d 100644
--- a/tools/lib/python/kdoc/kdoc_parser.py
+++ b/tools/lib/python/kdoc/kdoc_parser.py
@@ -13,7 +13,7 @@ import sys
import re
from pprint import pformat
-from kdoc.c_lex import CTokenizer
+from kdoc.c_lex import CTokenizer, tokenizer_set_log
from kdoc.kdoc_re import KernRe
from kdoc.kdoc_item import KdocItem
@@ -253,6 +253,8 @@ class KernelDoc:
self.config = config
self.xforms = xforms
+ tokenizer_set_log(self.config.log, f"{self.fname}: CMatch: ")
+
# Initial state for the state machines
self.state = state.NORMAL
--
2.52.0
* [PATCH v3 19/22] docs: kernel-doc.rst: document private: scope propagation
2026-03-17 18:09 [PATCH v3 00/22] kernel-doc: use a C lexical tokenizer for transforms Mauro Carvalho Chehab
` (17 preceding siblings ...)
2026-03-17 18:09 ` [PATCH v3 18/22] docs: c_lex: setup a logger to report tokenizer issues Mauro Carvalho Chehab
@ 2026-03-17 18:09 ` Mauro Carvalho Chehab
2026-03-17 18:09 ` [PATCH v3 20/22] docs: kdoc: ensure that comments are dropped before calling split_struct_proto() Mauro Carvalho Chehab
` (3 subsequent siblings)
22 siblings, 0 replies; 28+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-17 18:09 UTC (permalink / raw)
To: Jonathan Corbet, Linux Doc Mailing List
Cc: Mauro Carvalho Chehab, linux-hardening, linux-kernel,
Randy Dunlap, Shuah Khan, Vincent Mailhol
This was undefined behavior, but at least one place used private:
inside a nested struct and expected it not to propagate outside it.
Kernel-doc now defines how this is propagated, so document that.
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
Documentation/doc-guide/kernel-doc.rst | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/Documentation/doc-guide/kernel-doc.rst b/Documentation/doc-guide/kernel-doc.rst
index 8d2c09fb36e4..1c148fe8e1f9 100644
--- a/Documentation/doc-guide/kernel-doc.rst
+++ b/Documentation/doc-guide/kernel-doc.rst
@@ -213,6 +213,10 @@ The ``private:`` and ``public:`` tags must begin immediately following a
``/*`` comment marker. They may optionally include comments between the
``:`` and the ending ``*/`` marker.
+When ``private:`` is used on nested structs, it propagates only to inner
+structs/unions.
+
+
Example::
/**
@@ -256,8 +260,10 @@ It is possible to document nested structs and unions, like::
union {
struct {
int memb1;
+ /* private: hides memb2 from documentation */
int memb2;
};
+ /* Everything here is public again, as private scope finished */
struct {
void *memb3;
int memb4;
--
2.52.0
* [PATCH v3 20/22] docs: kdoc: ensure that comments are dropped before calling split_struct_proto()
2026-03-17 18:09 [PATCH v3 00/22] kernel-doc: use a C lexical tokenizer for transforms Mauro Carvalho Chehab
` (18 preceding siblings ...)
2026-03-17 18:09 ` [PATCH v3 19/22] docs: kernel-doc.rst: document private: scope propagation Mauro Carvalho Chehab
@ 2026-03-17 18:09 ` Mauro Carvalho Chehab
2026-03-17 18:09 ` [PATCH v3 21/22] docs: kdoc_parser: avoid tokenizing structs everytime Mauro Carvalho Chehab
` (2 subsequent siblings)
22 siblings, 0 replies; 28+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-17 18:09 UTC (permalink / raw)
To: Jonathan Corbet, Linux Doc Mailing List
Cc: Mauro Carvalho Chehab, linux-hardening, linux-kernel,
Aleksandr Loktionov, Randy Dunlap
Changeset 2b957decdb6c ("docs: kdoc: don't add broken comments inside prototypes")
revealed a hidden bug in split_struct_proto(): some comments may break
its ability to properly identify a struct.
Fixing it is as simple as stripping comments before calling it.
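A minimal sketch of the kind of comment stripping involved (regex-based; the kernel's tokenizer-aware version, unlike this one, would not be fooled by comment markers inside string literals):

```python
import re

def strip_c_comments(code):
    """Remove C block and line comments before structural parsing,
    so a stray '/* ... { ... */' cannot confuse brace matching.
    Block comments become a single space to preserve token breaks."""
    code = re.sub(r'/\*.*?\*/', ' ', code, flags=re.S)
    code = re.sub(r'//[^\n]*', '', code)
    return code
```

After this pass, a brace or semicolon hidden inside a comment can no longer derail the split of a struct prototype.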
Fixes: 2b957decdb6c ("docs: kdoc: don't add broken comments inside prototypes")
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
tools/lib/python/kdoc/kdoc_parser.py | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/tools/lib/python/kdoc/kdoc_parser.py b/tools/lib/python/kdoc/kdoc_parser.py
index f90c6dd0343d..8b2c9d0f0c58 100644
--- a/tools/lib/python/kdoc/kdoc_parser.py
+++ b/tools/lib/python/kdoc/kdoc_parser.py
@@ -723,6 +723,7 @@ class KernelDoc:
#
# Do the basic parse to get the pieces of the declaration.
#
+ proto = trim_private_members(proto)
struct_parts = self.split_struct_proto(proto)
if not struct_parts:
self.emit_msg(ln, f"{proto} error: Cannot parse struct or union!")
@@ -763,6 +764,7 @@ class KernelDoc:
# Strip preprocessor directives. Note that this depends on the
# trailing semicolon we added in process_proto_type().
#
+ proto = trim_private_members(proto)
proto = KernRe(r'#\s*((define|ifdef|if)\s+|endif)[^;]*;', flags=re.S).sub('', proto)
#
# Parse out the name and members of the enum. Typedef form first.
@@ -770,7 +772,7 @@ class KernelDoc:
r = KernRe(r'typedef\s+enum\s*\{(.*)\}\s*(\w*)\s*;')
if r.search(proto):
declaration_name = r.group(2)
- members = trim_private_members(r.group(1))
+ members = r.group(1)
#
# Failing that, look for a straight enum
#
@@ -778,7 +780,7 @@ class KernelDoc:
r = KernRe(r'enum\s+(\w*)\s*\{(.*)\}')
if r.match(proto):
declaration_name = r.group(1)
- members = trim_private_members(r.group(2))
+ members = r.group(2)
#
# OK, this isn't going to work.
#
--
2.52.0
* [PATCH v3 21/22] docs: kdoc_parser: avoid tokenizing structs everytime
2026-03-17 18:09 [PATCH v3 00/22] kernel-doc: use a C lexical tokenizer for transforms Mauro Carvalho Chehab
` (19 preceding siblings ...)
2026-03-17 18:09 ` [PATCH v3 20/22] docs: kdoc: ensure that comments are dropped before calling split_struct_proto() Mauro Carvalho Chehab
@ 2026-03-17 18:09 ` Mauro Carvalho Chehab
2026-03-17 18:09 ` [PATCH v3 22/22] docs: xforms_lists: use CMatch for all identifiers Mauro Carvalho Chehab
2026-03-22 21:08 ` [PATCH v3 00/22] kernel-doc: use a C lexical tokenizer for transforms Jonathan Corbet
22 siblings, 0 replies; 28+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-17 18:09 UTC (permalink / raw)
To: Jonathan Corbet, Linux Doc Mailing List
Cc: Mauro Carvalho Chehab, linux-hardening, linux-kernel,
Aleksandr Loktionov, Randy Dunlap
Most of the rules inside CTransforms are of the type CMatch.
Don't re-parse the source code every time.
Doing this doesn't change the output, but makes kdoc almost
as fast as before the tokenizer patches:
# Before tokenizer patches
$ time ./scripts/kernel-doc . -man >original 2>&1
real 0m42.933s
user 0m36.523s
sys 0m1.145s
# After tokenizer patches
$ time ./scripts/kernel-doc . -man >before 2>&1
real 1m29.853s
user 1m23.974s
sys 0m1.237s
# After this patch
$ time ./scripts/kernel-doc . -man >after 2>&1
real 0m48.579s
user 0m45.938s
sys 0m0.988s
$ diff -s before after
Files before and after are identical
Manually checked the differences between original and after
with:
$ diff -U0 -prBw original after|grep -v Warning|grep -v "@@"|less
They're due to:
- whitespace fixes;
- struct_group are now better handled;
- several badly-generated man pages from broken inline kernel-doc
markups are now fixed.
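The optimization itself can be modeled with a toy pipeline: keep the source tokenized across consecutive token-based rules, and round-trip through a string only when a plain-regex rule requires one. The "token"/"regex" rule kinds and whitespace tokenizer below are illustrative stand-ins for CMatch/KernRe:

```python
def tokenize(s):
    # Toy tokenizer: whitespace split stands in for the C lexer
    return s.split()

def detokenize(tokens):
    return ' '.join(tokens)

def apply_transforms(rules, source):
    """rules: list of (kind, fn). 'token' fns map token lists to
    token lists; 'regex' fns map strings to strings. Conversions
    happen only at kind boundaries, not once per rule."""
    tokens = tokenize(source)
    for kind, fn in rules:
        if kind == 'token':
            tokens = fn(tokens)
        else:
            tokens = tokenize(fn(detokenize(tokens)))
    return detokenize(tokens)
```

With most rules now token-based (and ordered first in the lists), the expensive tokenize/detokenize round-trip runs a handful of times instead of once per rule, matching the timing improvement above.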
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
tools/lib/python/kdoc/kdoc_parser.py | 1 -
tools/lib/python/kdoc/xforms_lists.py | 30 +++++++++++++++++++++------
2 files changed, 24 insertions(+), 7 deletions(-)
diff --git a/tools/lib/python/kdoc/kdoc_parser.py b/tools/lib/python/kdoc/kdoc_parser.py
index 8b2c9d0f0c58..f6c4ee3b18c9 100644
--- a/tools/lib/python/kdoc/kdoc_parser.py
+++ b/tools/lib/python/kdoc/kdoc_parser.py
@@ -737,7 +737,6 @@ class KernelDoc:
#
# Go through the list of members applying all of our transformations.
#
- members = trim_private_members(members)
members = self.xforms.apply("struct", members)
#
diff --git a/tools/lib/python/kdoc/xforms_lists.py b/tools/lib/python/kdoc/xforms_lists.py
index 2056572852fd..5a62d4a450cb 100644
--- a/tools/lib/python/kdoc/xforms_lists.py
+++ b/tools/lib/python/kdoc/xforms_lists.py
@@ -5,7 +5,7 @@
import re
from kdoc.kdoc_re import KernRe
-from kdoc.c_lex import CMatch
+from kdoc.c_lex import CMatch, CTokenizer
struct_args_pattern = r'([^,)]+)'
@@ -16,6 +16,12 @@ class CTransforms:
into something we can parse and generate kdoc for.
"""
+ #
+ # NOTE:
+ # Due to performance reasons, place CMatch rules before KernRe,
+ # as this avoids running the C parser every time.
+ #
+
#: Transforms for structs and unions.
struct_xforms = [
# Strip attributes
@@ -124,13 +130,25 @@ class CTransforms:
"var": var_xforms,
}
- def apply(self, xforms_type, text):
+ def apply(self, xforms_type, source):
"""
- Apply a set of transforms to a block of text.
+ Apply a set of transforms to a block of source.
+
+ As tokenizer is used here, this function also remove comments
+ at the end.
"""
if xforms_type not in self.xforms:
- return text
+ return source
+
+ if isinstance(source, str):
+ source = CTokenizer(source)
for search, subst in self.xforms[xforms_type]:
- text = search.sub(subst, text)
- return text
+ #
+ # KernRe only accept strings.
+ #
+ if isinstance(search, KernRe):
+ source = str(source)
+
+ source = search.sub(subst, source)
+ return str(source)
--
2.52.0
* [PATCH v3 22/22] docs: xforms_lists: use CMatch for all identifiers
2026-03-17 18:09 [PATCH v3 00/22] kernel-doc: use a C lexical tokenizer for transforms Mauro Carvalho Chehab
` (20 preceding siblings ...)
2026-03-17 18:09 ` [PATCH v3 21/22] docs: kdoc_parser: avoid tokenizing structs everytime Mauro Carvalho Chehab
@ 2026-03-17 18:09 ` Mauro Carvalho Chehab
2026-03-22 21:08 ` [PATCH v3 00/22] kernel-doc: use a C lexical tokenizer for transforms Jonathan Corbet
22 siblings, 0 replies; 28+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-17 18:09 UTC (permalink / raw)
To: Jonathan Corbet, Kees Cook, Linux Doc Mailing List
Cc: Mauro Carvalho Chehab, linux-hardening, linux-kernel,
Gustavo A. R. Silva, Aleksandr Loktionov, Randy Dunlap
CMatch is lexically correct and replaces only identifiers,
which is exactly where macro transformations happen.
Use it to make the output safer and ensure that all arguments
will be parsed the right way, even in complex cases.
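The benefit of matching whole identifiers can be shown with a small regex-based stand-in (sub_identifier is a hypothetical helper, not CMatch itself): an identifier-bounded match removes `__acquires` without also mangling `__acquires_shared`, which a bare substring replacement would corrupt:

```python
import re

def sub_identifier(name, repl, code):
    """Replace `name` only where it is a complete identifier,
    i.e. not preceded or followed by an identifier character."""
    return re.sub(rf'(?<![A-Za-z0-9_]){name}(?![A-Za-z0-9_])',
                  repl, code)
```

This is why rule ordering stops mattering for prefix-overlapping annotations once they go through identifier-aware matching.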
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
tools/lib/python/kdoc/xforms_lists.py | 159 +++++++++++++-------------
1 file changed, 79 insertions(+), 80 deletions(-)
diff --git a/tools/lib/python/kdoc/xforms_lists.py b/tools/lib/python/kdoc/xforms_lists.py
index 5a62d4a450cb..f6ea9efb11ae 100644
--- a/tools/lib/python/kdoc/xforms_lists.py
+++ b/tools/lib/python/kdoc/xforms_lists.py
@@ -7,7 +7,8 @@ import re
from kdoc.kdoc_re import KernRe
from kdoc.c_lex import CMatch, CTokenizer
-struct_args_pattern = r'([^,)]+)'
+struct_args_pattern = r"([^,)]+)"
+
class CTransforms:
"""
@@ -24,48 +25,40 @@ class CTransforms:
#: Transforms for structs and unions.
struct_xforms = [
- # Strip attributes
- (KernRe(r"__attribute__\s*\(\([a-z0-9,_\*\s\(\)]*\)\)", flags=re.I | re.S, cache=False), ' '),
- (KernRe(r'\s*__aligned\s*\([^;]*\)', re.S), ' '),
- (KernRe(r'\s*__counted_by\s*\([^;]*\)', re.S), ' '),
- (KernRe(r'\s*__counted_by_(le|be)\s*\([^;]*\)', re.S), ' '),
- (KernRe(r'\s*__guarded_by\s*\([^\)]*\)', re.S), ' '),
- (KernRe(r'\s*__pt_guarded_by\s*\([^\)]*\)', re.S), ' '),
- (KernRe(r'\s*__packed\s*', re.S), ' '),
- (KernRe(r'\s*CRYPTO_MINALIGN_ATTR', re.S), ' '),
- (KernRe(r'\s*__private', re.S), ' '),
- (KernRe(r'\s*__rcu', re.S), ' '),
- (KernRe(r'\s*____cacheline_aligned_in_smp', re.S), ' '),
- (KernRe(r'\s*____cacheline_aligned', re.S), ' '),
- (KernRe(r'\s*__cacheline_group_(begin|end)\([^\)]+\);'), ''),
- (KernRe(r'__ETHTOOL_DECLARE_LINK_MODE_MASK\s*\(([^\)]+)\)', re.S),
- r'DECLARE_BITMAP(\1, __ETHTOOL_LINK_MODE_MASK_NBITS)'),
- (KernRe(r'DECLARE_PHY_INTERFACE_MASK\s*\(([^\)]+)\)', re.S),
- r'DECLARE_BITMAP(\1, PHY_INTERFACE_MODE_MAX)'),
- (KernRe(r'DECLARE_BITMAP\s*\(' + struct_args_pattern + r',\s*' + struct_args_pattern + r'\)',
- re.S), r'unsigned long \1[BITS_TO_LONGS(\2)]'),
- (KernRe(r'DECLARE_HASHTABLE\s*\(' + struct_args_pattern + r',\s*' + struct_args_pattern + r'\)',
- re.S), r'unsigned long \1[1 << ((\2) - 1)]'),
- (KernRe(r'DECLARE_KFIFO\s*\(' + struct_args_pattern + r',\s*' + struct_args_pattern +
- r',\s*' + struct_args_pattern + r'\)', re.S), r'\2 *\1'),
- (KernRe(r'DECLARE_KFIFO_PTR\s*\(' + struct_args_pattern + r',\s*' +
- struct_args_pattern + r'\)', re.S), r'\2 *\1'),
- (KernRe(r'(?:__)?DECLARE_FLEX_ARRAY\s*\(' + struct_args_pattern + r',\s*' +
- struct_args_pattern + r'\)', re.S), r'\1 \2[]'),
- (KernRe(r'DEFINE_DMA_UNMAP_ADDR\s*\(' + struct_args_pattern + r'\)', re.S), r'dma_addr_t \1'),
- (KernRe(r'DEFINE_DMA_UNMAP_LEN\s*\(' + struct_args_pattern + r'\)', re.S), r'__u32 \1'),
- (KernRe(r'VIRTIO_DECLARE_FEATURES\(([\w_]+)\)'), r'union { u64 \1; u64 \1_array[VIRTIO_FEATURES_U64S]; }'),
-
- (CMatch(r"__cond_acquires"), ""),
- (CMatch(r"__cond_releases"), ""),
- (CMatch(r"__acquires"), ""),
- (CMatch(r"__releases"), ""),
- (CMatch(r"__must_hold"), ""),
- (CMatch(r"__must_not_hold"), ""),
- (CMatch(r"__must_hold_shared"), ""),
- (CMatch(r"__cond_acquires_shared"), ""),
- (CMatch(r"__acquires_shared"), ""),
- (CMatch(r"__releases_shared"), ""),
+ (CMatch("__attribute__"), ""),
+ (CMatch("__aligned"), ""),
+ (CMatch("__counted_by"), ""),
+ (CMatch("__counted_by_(le|be)"), ""),
+ (CMatch("__guarded_by"), ""),
+ (CMatch("__pt_guarded_by"), ""),
+ (CMatch("__packed"), ""),
+ (CMatch("CRYPTO_MINALIGN_ATTR"), ""),
+ (CMatch("__private"), ""),
+ (CMatch("__rcu"), ""),
+ (CMatch("____cacheline_aligned_in_smp"), ""),
+ (CMatch("____cacheline_aligned"), ""),
+ (CMatch("__cacheline_group_(?:begin|end)"), ""),
+ (CMatch("__ETHTOOL_DECLARE_LINK_MODE_MASK"), r"DECLARE_BITMAP(\1, __ETHTOOL_LINK_MODE_MASK_NBITS)"),
+ (CMatch("DECLARE_PHY_INTERFACE_MASK",),r"DECLARE_BITMAP(\1, PHY_INTERFACE_MODE_MAX)"),
+ (CMatch("DECLARE_BITMAP"), r"unsigned long \1[BITS_TO_LONGS(\2)]"),
+ (CMatch("DECLARE_HASHTABLE"), r"unsigned long \1[1 << ((\2) - 1)]"),
+ (CMatch("DECLARE_KFIFO"), r"\2 *\1"),
+ (CMatch("DECLARE_KFIFO_PTR"), r"\2 *\1"),
+ (CMatch("(?:__)?DECLARE_FLEX_ARRAY"), r"\1 \2[]"),
+ (CMatch("DEFINE_DMA_UNMAP_ADDR"), r"dma_addr_t \1"),
+ (CMatch("DEFINE_DMA_UNMAP_LEN"), r"__u32 \1"),
+ (CMatch("VIRTIO_DECLARE_FEATURES"), r"union { u64 \1; u64 \1_array[VIRTIO_FEATURES_U64S]; }"),
+ (CMatch("__cond_acquires"), ""),
+ (CMatch("__cond_releases"), ""),
+ (CMatch("__acquires"), ""),
+ (CMatch("__releases"), ""),
+ (CMatch("__must_hold"), ""),
+ (CMatch("__must_not_hold"), ""),
+ (CMatch("__must_hold_shared"), ""),
+ (CMatch("__cond_acquires_shared"), ""),
+ (CMatch("__acquires_shared"), ""),
+ (CMatch("__releases_shared"), ""),
+ (CMatch("__attribute__"), ""),
#
# Macro __struct_group() creates a union with an anonymous
@@ -73,51 +66,57 @@ class CTransforms:
# need one of those at kernel-doc, as we won't be documenting the same
# members twice.
#
- (CMatch('struct_group'), r'struct { \2+ };'),
- (CMatch('struct_group_attr'), r'struct { \3+ };'),
- (CMatch('struct_group_tagged'), r'struct { \3+ };'),
- (CMatch('__struct_group'), r'struct { \4+ };'),
-
+ (CMatch("struct_group"), r"struct { \2+ };"),
+ (CMatch("struct_group_attr"), r"struct { \3+ };"),
+ (CMatch("struct_group_tagged"), r"struct { \3+ };"),
+ (CMatch("__struct_group"), r"struct { \4+ };"),
]
#: Transforms for function prototypes.
function_xforms = [
- (KernRe(r"^static +"), ""),
- (KernRe(r"^extern +"), ""),
- (KernRe(r"^asmlinkage +"), ""),
- (KernRe(r"^inline +"), ""),
- (KernRe(r"^__inline__ +"), ""),
- (KernRe(r"^__inline +"), ""),
- (KernRe(r"^__always_inline +"), ""),
- (KernRe(r"^noinline +"), ""),
- (KernRe(r"^__FORTIFY_INLINE +"), ""),
- (KernRe(r"__init +"), ""),
- (KernRe(r"__init_or_module +"), ""),
- (KernRe(r"__exit +"), ""),
- (KernRe(r"__deprecated +"), ""),
- (KernRe(r"__flatten +"), ""),
- (KernRe(r"__meminit +"), ""),
- (KernRe(r"__must_check +"), ""),
- (KernRe(r"__weak +"), ""),
- (KernRe(r"__sched +"), ""),
- (KernRe(r"_noprof"), ""),
- (KernRe(r"__always_unused *"), ""),
- (KernRe(r"__printf\s*\(\s*\d*\s*,\s*\d*\s*\) +"), ""),
- (KernRe(r"__(?:re)?alloc_size\s*\(\s*\d+\s*(?:,\s*\d+\s*)?\) +"), ""),
- (KernRe(r"__diagnose_as\s*\(\s*\S+\s*(?:,\s*\d+\s*)*\) +"), ""),
- (KernRe(r"DECL_BUCKET_PARAMS\s*\(\s*(\S+)\s*,\s*(\S+)\s*\)"), r"\1, \2"),
- (KernRe(r"__no_context_analysis\s*"), ""),
- (KernRe(r"__attribute_const__ +"), ""),
- (KernRe(r"__attribute__\s*\(\((?:[\w\s]+(?:\([^)]*\))?\s*,?)+\)\)\s+"), ""),
+ (CMatch("static"), ""),
+ (CMatch("extern"), ""),
+ (CMatch("asmlinkage"), ""),
+ (CMatch("inline"), ""),
+ (CMatch("__inline__"), ""),
+ (CMatch("__inline"), ""),
+ (CMatch("__always_inline"), ""),
+ (CMatch("noinline"), ""),
+ (CMatch("__FORTIFY_INLINE"), ""),
+ (CMatch("__init"), ""),
+ (CMatch("__init_or_module"), ""),
+ (CMatch("__exit"), ""),
+ (CMatch("__deprecated"), ""),
+ (CMatch("__flatten"), ""),
+ (CMatch("__meminit"), ""),
+ (CMatch("__must_check"), ""),
+ (CMatch("__weak"), ""),
+ (CMatch("__sched"), ""),
+ (CMatch("__always_unused"), ""),
+ (CMatch("__printf"), ""),
+ (CMatch("__(?:re)?alloc_size"), ""),
+ (CMatch("__diagnose_as"), ""),
+ (CMatch("DECL_BUCKET_PARAMS"), r"\1, \2"),
+ (CMatch("__no_context_analysis"), ""),
+ (CMatch("__attribute_const__"), ""),
+ (CMatch("__attribute__"), ""),
+
+ #
+ # HACK: this is similar to process_export() hack. It is meant to
+ # drop _noprof from function name. See for instance:
+ # ahash_request_alloc kernel-doc declaration at include/crypto/hash.h.
+ #
+ (KernRe("_noprof"), ""),
]
#: Transforms for variable prototypes.
var_xforms = [
- (KernRe(r"__read_mostly"), ""),
- (KernRe(r"__ro_after_init"), ""),
- (KernRe(r'\s*__guarded_by\s*\([^\)]*\)', re.S), ""),
- (KernRe(r'\s*__pt_guarded_by\s*\([^\)]*\)', re.S), ""),
- (KernRe(r"LIST_HEAD\(([\w_]+)\)"), r"struct list_head \1"),
+ (CMatch("__read_mostly"), ""),
+ (CMatch("__ro_after_init"), ""),
+ (CMatch("__guarded_by"), ""),
+ (CMatch("__pt_guarded_by"), ""),
+ (CMatch("LIST_HEAD"), r"struct list_head \1"),
+
(KernRe(r"(?://.*)$"), ""),
(KernRe(r"(?:/\*.*\*/)"), ""),
(KernRe(r";$"), ""),
--
2.52.0
^ permalink raw reply related [flat|nested] 28+ messages in thread
* RE: [PATCH v3 17/22] docs: xforms_lists: better evaluate struct_group macros
2026-03-17 18:09 ` [PATCH v3 17/22] docs: xforms_lists: better evaluate struct_group macros Mauro Carvalho Chehab
@ 2026-03-18 7:44 ` Loktionov, Aleksandr
0 siblings, 0 replies; 28+ messages in thread
From: Loktionov, Aleksandr @ 2026-03-18 7:44 UTC (permalink / raw)
To: Mauro Carvalho Chehab, Jonathan Corbet, Linux Doc Mailing List
Cc: linux-hardening@vger.kernel.org, linux-kernel@vger.kernel.org,
Randy Dunlap
> -----Original Message-----
> From: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> Sent: Tuesday, March 17, 2026 7:10 PM
> To: Jonathan Corbet <corbet@lwn.net>; Linux Doc Mailing List <linux-
> doc@vger.kernel.org>
> Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>; linux-
> hardening@vger.kernel.org; linux-kernel@vger.kernel.org; Loktionov,
> Aleksandr <aleksandr.loktionov@intel.com>; Randy Dunlap
> <rdunlap@infradead.org>
> Subject: [PATCH v3 17/22] docs: xforms_lists: better evaluate
> struct_group macros
>
> The previous approach was to unwind nested structs/unions.
>
> Now that we have a logic that can handle it well, use it to ensure
> that struct_group macros will properly reflect the actual struct.
>
> Note that the replacemend logic still simplifies the code a little
replacemend -> replacement
Otherwise, fine
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
> bit, as the basic build block for struct group is:
>
> union { \
> struct { MEMBERS } ATTRS; \
> struct __struct_group_tag(TAG) { MEMBERS } ATTRS NAME; \
> } ATTRS
>
> There:
>
> - ATTRS is meant to add extra macro attributes like __packed
> which we already discard, as they aren't relevant to
> document struct members;
>
> - TAG is used only when built with __cplusplus.
>
> So, instead, convert them into just:
>
> struct { MEMBERS };
>
> Please notice that here, we're using the greedy version of the
> backrefs, as MEMBERS is actually MEMBERS... on all such macros.
>
> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> ---
> tools/lib/python/kdoc/xforms_lists.py | 14 ++++++++++----
> 1 file changed, 10 insertions(+), 4 deletions(-)
>
> diff --git a/tools/lib/python/kdoc/xforms_lists.py
> b/tools/lib/python/kdoc/xforms_lists.py
> index 98632c50a146..2056572852fd 100644
> --- a/tools/lib/python/kdoc/xforms_lists.py
> +++ b/tools/lib/python/kdoc/xforms_lists.py
> @@ -61,10 +61,16 @@ class CTransforms:
> (CMatch(r"__acquires_shared"), ""),
> (CMatch(r"__releases_shared"), ""),
>
> - (CMatch('struct_group'), r'\2'),
> - (CMatch('struct_group_attr'), r'\3'),
> - (CMatch('struct_group_tagged'), r'struct \1 \2; \3'),
> - (CMatch('__struct_group'), r'\4'),
> + #
> + # Macro __struct_group() creates a union with an anonymous
> + # and a non-anonymous struct, depending on the parameters. We only
> + # need one of those at kernel-doc, as we won't be documenting the same
> + # members twice.
> + #
> + (CMatch('struct_group'), r'struct { \2+ };'),
> + (CMatch('struct_group_attr'), r'struct { \3+ };'),
> + (CMatch('struct_group_tagged'), r'struct { \3+ };'),
> + (CMatch('__struct_group'), r'struct { \4+ };'),
>
> ]
>
> --
> 2.52.0
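The struct_group() rewrite above depends on capturing macro arguments with balanced parentheses, which a single regular expression handles poorly. Below is a minimal, standalone sketch of that capture; the helper names (`split_macro_args`, `expand_struct_group`) are illustrative only and are not the series' CMatch API, which works on a token stream rather than raw text:

```python
def split_macro_args(text, start):
    """Split a macro call's arguments at top-level commas.

    ``start`` indexes the opening parenthesis; returns (args, index
    just past the matching close paren)."""
    depth, args, current, i = 0, [], [], start
    while i < len(text):
        ch = text[i]
        if ch == '(':
            depth += 1
            if depth > 1:          # nested parens belong to the argument
                current.append(ch)
        elif ch == ')':
            depth -= 1
            if depth == 0:
                args.append(''.join(current).strip())
                return args, i + 1
            current.append(ch)
        elif ch == ',' and depth == 1:
            args.append(''.join(current).strip())
            current = []
        else:
            current.append(ch)
        i += 1
    raise ValueError("unbalanced parentheses")

def expand_struct_group(text):
    """Rewrite struct_group(NAME, MEMBERS...) as 'struct { MEMBERS };'."""
    key = 'struct_group('
    pos = text.find(key)
    if pos < 0:
        return text
    args, end = split_macro_args(text, pos + len(key) - 1)
    members = ' '.join(args[1:])   # drop NAME, keep members greedily
    return text[:pos] + 'struct { ' + members + ' };' + text[end:]

src = "struct page_pool { struct_group(frag, long frag_users; int frag_offset;) };"
print(expand_struct_group(src))
```

The greedy join of `args[1:]` mirrors the `\3+`/`\4+` backrefs in the transform table: MEMBERS is really MEMBERS... on all such macros.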
* RE: [PATCH v3 18/22] docs: c_lex: setup a logger to report tokenizer issues
2026-03-17 18:09 ` [PATCH v3 18/22] docs: c_lex: setup a logger to report tokenizer issues Mauro Carvalho Chehab
@ 2026-03-18 7:45 ` Loktionov, Aleksandr
0 siblings, 0 replies; 28+ messages in thread
From: Loktionov, Aleksandr @ 2026-03-18 7:45 UTC (permalink / raw)
To: Mauro Carvalho Chehab, Jonathan Corbet, Linux Doc Mailing List
Cc: linux-hardening@vger.kernel.org, linux-kernel@vger.kernel.org,
Randy Dunlap
> -----Original Message-----
> From: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> Sent: Tuesday, March 17, 2026 7:10 PM
> To: Jonathan Corbet <corbet@lwn.net>; Linux Doc Mailing List <linux-
> doc@vger.kernel.org>
> Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>; linux-
> hardening@vger.kernel.org; linux-kernel@vger.kernel.org; Loktionov,
> Aleksandr <aleksandr.loktionov@intel.com>; Randy Dunlap
> <rdunlap@infradead.org>
> Subject: [PATCH v3 18/22] docs: c_lex: setup a logger to report
> tokenizer issues
>
> Report file that has issues detected via CMatch and CTokenizer.
>
> This is done by setting up a logger that will be overriden by
overriden -> overridden
Otherwise, fine
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
> kdoc_parser, when used on it.
>
> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> ---
> tools/lib/python/kdoc/c_lex.py | 16 ++++++++++++++++
> tools/lib/python/kdoc/kdoc_parser.py | 4 +++-
> 2 files changed, 19 insertions(+), 1 deletion(-)
>
> diff --git a/tools/lib/python/kdoc/c_lex.py
> b/tools/lib/python/kdoc/c_lex.py
> index 20e50ff0ecd5..b6d58bd470a9 100644
> --- a/tools/lib/python/kdoc/c_lex.py
> +++ b/tools/lib/python/kdoc/c_lex.py
> @@ -22,6 +22,22 @@ from .kdoc_re import KernRe
>
> log = logging.getLogger(__name__)
>
> +def tokenizer_set_log(logger, prefix = ""):
> + """
> + Replace the module-level logger with a LoggerAdapter that
> + prepends *prefix* to every message.
> + """
> + global log
> +
> + class PrefixAdapter(logging.LoggerAdapter):
> + """
> + Ancillary class to set prefix on all message logs.
> + """
> + def process(self, msg, kwargs):
> + return f"{prefix}{msg}", kwargs
> +
> + # Wrap the provided logger in our adapter
> + log = PrefixAdapter(logger, {"prefix": prefix})
>
> class CToken():
> """
> diff --git a/tools/lib/python/kdoc/kdoc_parser.py
> b/tools/lib/python/kdoc/kdoc_parser.py
> index efd58c88ff31..f90c6dd0343d 100644
> --- a/tools/lib/python/kdoc/kdoc_parser.py
> +++ b/tools/lib/python/kdoc/kdoc_parser.py
> @@ -13,7 +13,7 @@ import sys
> import re
> from pprint import pformat
>
> -from kdoc.c_lex import CTokenizer
> +from kdoc.c_lex import CTokenizer, tokenizer_set_log
> from kdoc.kdoc_re import KernRe
> from kdoc.kdoc_item import KdocItem
>
> @@ -253,6 +253,8 @@ class KernelDoc:
> self.config = config
> self.xforms = xforms
>
> + tokenizer_set_log(self.config.log, f"{self.fname}: CMatch: ")
> +
> # Initial state for the state machines
> self.state = state.NORMAL
>
> --
> 2.52.0
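The tokenizer_set_log() helper above wraps the module logger in a LoggerAdapter so every tokenizer message carries a per-file prefix. The same pattern can be exercised standalone with the standard logging module; the logger name, prefix string, and in-memory handler below are illustrative only:

```python
import logging

class PrefixAdapter(logging.LoggerAdapter):
    """Prepend a fixed prefix to every log message."""
    def process(self, msg, kwargs):
        return f"{self.extra['prefix']}{msg}", kwargs

# Collect formatted records in memory so the effect is easy to inspect.
records = []

class ListHandler(logging.Handler):
    def emit(self, record):
        records.append(self.format(record))

base = logging.getLogger("kdoc-demo")
base.setLevel(logging.WARNING)
base.addHandler(ListHandler())

# Equivalent of tokenizer_set_log(logger, "file: CMatch: "): every message
# emitted through the adapter is prefixed before reaching the handlers.
log = PrefixAdapter(base, {"prefix": "include/crypto/hash.h: CMatch: "})
log.warning("unbalanced parenthesis in macro call")

print(records[0])
# → include/crypto/hash.h: CMatch: unbalanced parenthesis in macro call
```

The patch stores the prefix in a closure instead of the adapter's `extra` mapping; both are standard uses of `LoggerAdapter.process()`.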
* RE: [PATCH v3 03/22] docs: kdoc: don't add broken comments inside prototypes
2026-03-17 18:09 ` [PATCH v3 03/22] docs: kdoc: don't add broken comments inside prototypes Mauro Carvalho Chehab
@ 2026-03-18 7:47 ` Loktionov, Aleksandr
0 siblings, 0 replies; 28+ messages in thread
From: Loktionov, Aleksandr @ 2026-03-18 7:47 UTC (permalink / raw)
To: Mauro Carvalho Chehab, Jonathan Corbet, Linux Doc Mailing List
Cc: linux-hardening@vger.kernel.org, linux-kernel@vger.kernel.org,
Randy Dunlap
> -----Original Message-----
> From: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> Sent: Tuesday, March 17, 2026 7:09 PM
> To: Jonathan Corbet <corbet@lwn.net>; Linux Doc Mailing List <linux-
> doc@vger.kernel.org>
> Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>; linux-
> hardening@vger.kernel.org; linux-kernel@vger.kernel.org; Loktionov,
> Aleksandr <aleksandr.loktionov@intel.com>; Randy Dunlap
> <rdunlap@infradead.org>
> Subject: [PATCH v3 03/22] docs: kdoc: don't add broken comments inside
> prototypes
>
> Parsing a file like drivers/scsi/isci/host.h, which contains broken
> kernel-doc markups makes it create a prototype that contains unmatched
> end comments.
>
> That causes, for instance, struct sci_power_control to be shown with
> this prototype:
>
> struct sci_power_control {
> * it is not. */ bool timer_started;
> */ struct sci_timer timer;
> * requesters field. */ u8 phys_waiting;
> */ u8 phys_granted_power;
> * mapped into requesters via struct sci_phy.phy_index */
> struct isci_phy *requesters[SCI_MAX_PHYS];
> };
>
> as comments won't start with "/*" anymore.
>
> Fix the logic to detect such cases, and keep adding the comments
> inside it.
>
> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> Message-ID:
> <18e577dbbd538dcc22945ff139fe3638344e14f0.1773074166.git.mchehab+huawei@kernel.org>
> ---
> tools/lib/python/kdoc/kdoc_parser.py | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/tools/lib/python/kdoc/kdoc_parser.py
> b/tools/lib/python/kdoc/kdoc_parser.py
> index edf70ba139a5..086579d00b5c 100644
> --- a/tools/lib/python/kdoc/kdoc_parser.py
> +++ b/tools/lib/python/kdoc/kdoc_parser.py
> @@ -1355,6 +1355,12 @@ class KernelDoc:
> elif doc_content.search(line):
> self.emit_msg(ln, f"Incorrect use of kernel-doc format: {line}")
> self.state = state.PROTO
> +
> + #
> + # Don't let it add partial comments at the code, as it breaks the
> + # logic meant to remove comments from prototypes.
> + #
> + self.process_proto_type(ln, "/**\n" + line)
> # else ... ??
>
> def process_inline_text(self, ln, line):
> --
> 2.52.0
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
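The effect of the `"/**\n" + line` call in the patch above can be seen with a toy comment stripper: a stray `*/` from a broken kernel-doc markup has no matching opener, so comment removal leaves it behind to pollute the prototype; restoring the opener first lets the same stripper remove the fragment cleanly. The regex below is a simplified stand-in, not kernel-doc's actual comment-removal logic:

```python
import re

# The usual shape of a comment stripper: non-greedy /* ... */ across lines.
strip_comments = re.compile(r'/\*.*?\*/', re.S)

line = " * it is not. */ bool timer_started;"

# Without an opener nothing matches, so the stray '*/' leaks through:
print(strip_comments.sub('', line))

# Prepending '/**' (as process_proto_type() now receives) closes the pair,
# and the stale comment text is removed cleanly:
print(strip_comments.sub('', "/**\n" + line))
```

The first call prints the line unchanged; the second prints only ` bool timer_started;`, which is what should survive into the prototype.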
* RE: [PATCH v3 06/22] docs: kdoc: use tokenizer to handle comments on structs
2026-03-17 18:09 ` [PATCH v3 06/22] docs: kdoc: use tokenizer to handle comments on structs Mauro Carvalho Chehab
@ 2026-03-18 7:48 ` Loktionov, Aleksandr
0 siblings, 0 replies; 28+ messages in thread
From: Loktionov, Aleksandr @ 2026-03-18 7:48 UTC (permalink / raw)
To: Mauro Carvalho Chehab, Jonathan Corbet, Linux Doc Mailing List
Cc: linux-hardening@vger.kernel.org, linux-kernel@vger.kernel.org,
Randy Dunlap
> -----Original Message-----
> From: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> Sent: Tuesday, March 17, 2026 7:09 PM
> To: Jonathan Corbet <corbet@lwn.net>; Linux Doc Mailing List <linux-
> doc@vger.kernel.org>
> Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>; linux-
> hardening@vger.kernel.org; linux-kernel@vger.kernel.org; Loktionov,
> Aleksandr <aleksandr.loktionov@intel.com>; Randy Dunlap
> <rdunlap@infradead.org>
> Subject: [PATCH v3 06/22] docs: kdoc: use tokenizer to handle comments
> on structs
>
> Better handle comments inside structs. After those changes, all
> unittests now pass:
>
> test_private:
> TestPublicPrivate:
> test balanced_inner_private:
> OK
> test balanced_non_greddy_private:
> OK
> test balanced_private:
> OK
> test no private:
> OK
> test unbalanced_inner_private:
> OK
> test unbalanced_private:
> OK
> test unbalanced_struct_group_tagged_with_private:
> OK
> test unbalanced_two_struct_group_tagged_first_with_private:
> OK
> test unbalanced_without_end_of_line:
> OK
>
> Ran 9 tests
>
> This also solves a bug when handling STRUCT_GROUP() with a private
> comment on it:
>
> @@ -397134,7 +397134,7 @@ basic V4L2 device-level support.
> unsigned int max_len;
> unsigned int offset;
> struct page_pool_params_slow slow;
> - STRUCT_GROUP( struct net_device *netdev;
> + struct net_device *netdev;
> unsigned int queue_idx;
> unsigned int flags;
> };
>
> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> Message-ID:
> <f83ee9e8c38407eaab6ad10d4ccf155fb36683cc.1773074166.git.mchehab+huawei@kernel.org>
> ---
> tools/lib/python/kdoc/kdoc_parser.py | 13 ++++---------
> 1 file changed, 4 insertions(+), 9 deletions(-)
>
> diff --git a/tools/lib/python/kdoc/kdoc_parser.py
> b/tools/lib/python/kdoc/kdoc_parser.py
> index 4b3c555e6c8e..62d8030cf532 100644
> --- a/tools/lib/python/kdoc/kdoc_parser.py
> +++ b/tools/lib/python/kdoc/kdoc_parser.py
> @@ -13,6 +13,7 @@ import sys
> import re
> from pprint import pformat
>
> +from kdoc.c_lex import CTokenizer
> from kdoc.kdoc_re import NestedMatch, KernRe
> from kdoc.kdoc_item import KdocItem
>
> @@ -84,15 +85,9 @@ def trim_private_members(text):
> """
> Remove ``struct``/``enum`` members that have been marked
> "private".
> """
> - # First look for a "public:" block that ends a private region, then
> - # handle the "private until the end" case.
> - #
> - text = KernRe(r'/\*\s*private:.*?/\*\s*public:.*?\*/', flags=re.S).sub('', text)
> - text = KernRe(r'/\*\s*private:.*', flags=re.S).sub('', text)
> - #
> - # We needed the comments to do the above, but now we can take them out.
> - #
> - return KernRe(r'\s*/\*.*?\*/\s*', flags=re.S).sub('', text).strip()
> +
> + tokens = CTokenizer(text)
> + return str(tokens)
>
> class state:
> """
> --
> 2.52.0
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
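The tokenizer-based trim_private_members() above walks comments as tokens instead of regex-matching across the whole text, which is what keeps nested and unbalanced cases from confusing it. A rough sketch of the scanning idea follows; this is not the CTokenizer API, and the marker handling is deliberately simplified:

```python
import re

def trim_private_members(text):
    """Drop members between a '/* private: */' marker and either a
    '/* public: */' marker or the end of the text, removing all
    comments along the way (as kernel-doc does)."""
    out, private, i = [], False, 0
    while i < len(text):
        if text.startswith('/*', i):
            end = text.find('*/', i + 2)
            end = len(text) if end < 0 else end + 2
            comment = text[i:end]
            if re.search(r'/\*\s*private:', comment):
                private = True
            elif re.search(r'/\*\s*public:', comment):
                private = False
            i = end            # comments themselves are always dropped
        else:
            if not private:
                out.append(text[i])
            i += 1
    return re.sub(r'\s+', ' ', ''.join(out)).strip()

src = "int a; /* private: */ int secret; /* public: */ int b;"
print(trim_private_members(src))
# → int a; int b;
```

Because each comment is consumed as a unit, a `private:` region that never sees a closing `public:` simply runs to the end of the text, matching the "private until the end" case the old regexes handled separately.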
* Re: [PATCH v3 00/22] kernel-doc: use a C lexical tokenizer for transforms
2026-03-17 18:09 [PATCH v3 00/22] kernel-doc: use a C lexical tokenizer for transforms Mauro Carvalho Chehab
` (21 preceding siblings ...)
2026-03-17 18:09 ` [PATCH v3 22/22] docs: xforms_lists: use CMatch for all identifiers Mauro Carvalho Chehab
@ 2026-03-22 21:08 ` Jonathan Corbet
22 siblings, 0 replies; 28+ messages in thread
From: Jonathan Corbet @ 2026-03-22 21:08 UTC (permalink / raw)
To: Mauro Carvalho Chehab, Kees Cook, Mauro Carvalho Chehab
Cc: Mauro Carvalho Chehab, linux-doc, linux-hardening, linux-kernel,
Gustavo A. R. Silva, Aleksandr Loktionov, Randy Dunlap,
Shuah Khan, Vincent Mailhol
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> writes:
> Hi Jon,
>
> It follows v3 of the series. I basically addressed there the
> points you mentioned. Besides that, I did a fix at CMatch
> group(0) logic, and opted to create a special token for ";",
> as it simplifies the code a little bit and will likely help to
> simplify future changes.
OK, I have gone ahead and applied this series...it doesn't seem to
explode for me, anyway :) Let's hope for the best...
Thanks,
jon
Thread overview: 28+ messages
2026-03-17 18:09 [PATCH v3 00/22] kernel-doc: use a C lexical tokenizer for transforms Mauro Carvalho Chehab
2026-03-17 18:09 ` [PATCH v3 01/22] docs: python: add helpers to run unit tests Mauro Carvalho Chehab
2026-03-17 18:09 ` [PATCH v3 02/22] unittests: add a testbench to check public/private kdoc comments Mauro Carvalho Chehab
2026-03-17 18:09 ` [PATCH v3 03/22] docs: kdoc: don't add broken comments inside prototypes Mauro Carvalho Chehab
2026-03-18 7:47 ` Loktionov, Aleksandr
2026-03-17 18:09 ` [PATCH v3 04/22] docs: kdoc: properly handle empty enum arguments Mauro Carvalho Chehab
2026-03-17 18:09 ` [PATCH v3 05/22] docs: add a C tokenizer to be used by kernel-doc Mauro Carvalho Chehab
2026-03-17 18:09 ` [PATCH v3 06/22] docs: kdoc: use tokenizer to handle comments on structs Mauro Carvalho Chehab
2026-03-18 7:48 ` Loktionov, Aleksandr
2026-03-17 18:09 ` [PATCH v3 07/22] unittests: test_private: modify it to use CTokenizer directly Mauro Carvalho Chehab
2026-03-17 18:09 ` [PATCH v3 08/22] unittests: test_tokenizer: check if the tokenizer works Mauro Carvalho Chehab
2026-03-17 18:09 ` [PATCH v3 09/22] unittests: add a runner to execute all unittests Mauro Carvalho Chehab
2026-03-17 18:09 ` [PATCH v3 10/22] docs: kdoc: create a CMatch to match nested C blocks Mauro Carvalho Chehab
2026-03-17 18:09 ` [PATCH v3 11/22] tools: unittests: add tests for CMatch Mauro Carvalho Chehab
2026-03-17 18:09 ` [PATCH v3 12/22] docs: c_lex: properly implement a sub() method " Mauro Carvalho Chehab
2026-03-17 18:09 ` [PATCH v3 13/22] unittests: test_cmatch: add tests for sub() Mauro Carvalho Chehab
2026-03-17 18:09 ` [PATCH v3 14/22] docs: kdoc: replace NestedMatch with CMatch Mauro Carvalho Chehab
2026-03-17 18:09 ` [PATCH v3 15/22] docs: kdoc_re: get rid of NestedMatch class Mauro Carvalho Chehab
2026-03-17 18:09 ` [PATCH v3 16/22] docs: xforms_lists: handle struct_group directly Mauro Carvalho Chehab
2026-03-17 18:09 ` [PATCH v3 17/22] docs: xforms_lists: better evaluate struct_group macros Mauro Carvalho Chehab
2026-03-18 7:44 ` Loktionov, Aleksandr
2026-03-17 18:09 ` [PATCH v3 18/22] docs: c_lex: setup a logger to report tokenizer issues Mauro Carvalho Chehab
2026-03-18 7:45 ` Loktionov, Aleksandr
2026-03-17 18:09 ` [PATCH v3 19/22] docs: kernel-doc.rst: document private: scope propagation Mauro Carvalho Chehab
2026-03-17 18:09 ` [PATCH v3 20/22] docs: kdoc: ensure that comments are dropped before calling split_struct_proto() Mauro Carvalho Chehab
2026-03-17 18:09 ` [PATCH v3 21/22] docs: kdoc_parser: avoid tokenizing structs everytime Mauro Carvalho Chehab
2026-03-17 18:09 ` [PATCH v3 22/22] docs: xforms_lists: use CMatch for all identifiers Mauro Carvalho Chehab
2026-03-22 21:08 ` [PATCH v3 00/22] kernel-doc: use a C lexical tokenizer for transforms Jonathan Corbet