* [PATCH v2 00/28] kernel-doc: use a C lexical tokenizer for transforms
@ 2026-03-12 14:54 Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 01/28] docs: python: add helpers to run unit tests Mauro Carvalho Chehab
` (30 more replies)
0 siblings, 31 replies; 47+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-12 14:54 UTC (permalink / raw)
To: Jonathan Corbet, Kees Cook, Mauro Carvalho Chehab
Cc: Mauro Carvalho Chehab, linux-doc, linux-hardening, linux-kernel,
Gustavo A. R. Silva, Aleksandr Loktionov, Randy Dunlap,
Shuah Khan, Vincent Mailhol
Hi Jon,
Sorry for re-sending this one so quickly. It turns out that v1 had some
bugs causing it to fail in several cases. I opted to add extra
patches at the end. This way, it better integrates with kdoc_re.
As part of that, c_lex now outputs the file name when reporting
errors. In that regard, only more serious errors will raise
an exception; those are meant to indicate problems in kernel-doc
itself. Parsing errors now use the same warning approach
as kdoc_parser.
I also added a filter to the CTokenizer __str__() logic for the
string conversion, to drop some weird whitespace and unneeded
";" characters from the output.
Finally, v2 addresses the undefined behavior regarding private: comment
propagation.
This patch series changes how the kdoc parser handles macro replacements.
Instead of heavily relying on regular expressions that can sometimes
be very complex, it uses a C lexical tokenizer. This ensures that
BEGIN/END blocks in functions and structs are properly handled,
even when nested.
Comparing the output before/after the patch series, both man pages and
ReST only had:
- whitespace differences;
- struct_group macros are now shown as inner anonymous structs,
as they should be.
Also, I didn't notice any relevant change in the documentation build
time. In that regard, right now, every time a CMatch replacement
rule takes place, it does:
for each transform:
- tokenize the source code;
- handle CMatch;
- convert tokens back to a string.
A possible optimization would be to do, instead:
- tokenize the source code;
- for each transform, handle CMatch;
- convert tokens back to a string.
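The difference between the two approaches can be sketched with a toy
tokenizer and toy transforms (names and logic here are illustrative only,
not the real CTokenizer/CMatch API):

```python
import re

# Miniature stand-in for the real C tokenizer: split into words,
# whitespace runs and single punctuation characters.
TOKEN_RE = re.compile(r"\w+|\s+|[^\w\s]")

def tokenize(source):
    return TOKEN_RE.findall(source)

def detokenize(tokens):
    return "".join(tokens)

# Two trivial "transforms" standing in for CMatch replacement rules
def drop_noprof(tokens):
    return [t for t in tokens if t != "_noprof"]

def rename_foo(tokens):
    return ["bar" if t == "foo" else t for t in tokens]

source = "int foo(void);"

# Current approach: tokenize and detokenize once per transform
s = source
for xform in (drop_noprof, rename_foo):
    s = detokenize(xform(tokenize(s)))

# Possible optimization: tokenize once, apply all transforms,
# detokenize once at the end
tokens = tokenize(source)
for xform in (drop_noprof, rename_foo):
    tokens = xform(tokens)
optimized = detokenize(tokens)

assert s == optimized
print(optimized)
```

Both paths produce the same string; the optimization only saves the
repeated string <-> token round trips.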
For now, I opted not to do it, because:
- it would be too many changes in a single step;
- the docs build time is ~3:30 minutes, which is
about the same as it was before the changes;
- there is a very dirty hack inside function_xforms:
(KernRe(r"_noprof"), ""). This is meant to change
function prototypes instead of function arguments.
So, if OK with you, I would prefer to merge this one first. We can later
optimize kdoc_parser to avoid multiple token <-> string conversions.
-
One important aspect of this series is that it introduces unittests
for kernel-doc. I used it a lot during the development of this series,
to ensure that the changes I was doing were producing the expected
results. Tests are in two separate files that can be executed directly.
Alternatively, there is a run.py script that runs all of them (and
any other Python script named tools/unittests/test_*.py):
$ tools/unittests/run.py
test_cmatch:
TestSearch:
test_search_acquires_multiple: OK
test_search_acquires_nested_paren: OK
test_search_acquires_simple: OK
test_search_must_hold: OK
test_search_must_hold_shared: OK
test_search_no_false_positive: OK
test_search_no_function: OK
test_search_no_macro_remains: OK
TestSubMultipleMacros:
test_acquires_multiple: OK
test_acquires_nested_paren: OK
test_acquires_simple: OK
test_mixed_macros: OK
test_must_hold: OK
test_must_hold_shared: OK
test_no_false_positive: OK
test_no_function: OK
test_no_macro_remains: OK
TestSubSimple:
test_rise_early_greedy: OK
test_rise_multiple_greedy: OK
test_strip_multiple_acquires: OK
test_sub_count_parameter: OK
test_sub_mixed_placeholders: OK
test_sub_multiple_placeholders: OK
test_sub_no_placeholder: OK
test_sub_single_placeholder: OK
test_sub_with_capture: OK
test_sub_zero_placeholder: OK
TestSubWithLocalXforms:
test_functions_with_acquires_and_releases: OK
test_raw_struct_group: OK
test_raw_struct_group_tagged: OK
test_struct_group: OK
test_struct_group_attr: OK
test_struct_group_tagged_with_private: OK
test_struct_kcov: OK
test_vars_stackdepot: OK
test_tokenizer:
TestPublicPrivate:
test_balanced_inner_private: OK
test_balanced_non_greddy_private: OK
test_balanced_private: OK
test_no private: OK
test_unbalanced_inner_private: OK
test_unbalanced_private: OK
test_unbalanced_struct_group_tagged_with_private: OK
test_unbalanced_two_struct_group_tagged_first_with_private: OK
test_unbalanced_without_end_of_line: OK
TestTokenizer:
test_basic_tokens: OK
test_depth_counters: OK
test_mismatch_error: OK
Ran 47 tests
PS.: This series contains the contents of the previous /8 series:
https://lore.kernel.org/linux-doc/cover.1773074166.git.mchehab+huawei@kernel.org/
---
v2:
- Added 8 more patches fixing several bugs and modifying unittests
accordingly:
- don't raise exceptions when not needed;
- don't report missing-END errors if there's no BEGIN
in the last replacement string;
- document private scope propagation;
- some changes in unittests to reflect the current status;
- addition of two unittests to check the error-raising logic in c_lex.
Mauro Carvalho Chehab (28):
docs: python: add helpers to run unit tests
unittests: add a testbench to check public/private kdoc comments
docs: kdoc: don't add broken comments inside prototypes
docs: kdoc: properly handle empty enum arguments
docs: kdoc_re: add a C tokenizer
docs: kdoc: use tokenizer to handle comments on structs
docs: kdoc: move C Tokenizer to c_lex module
unittests: test_private: modify it to use CTokenizer directly
unittests: test_tokenizer: check if the tokenizer works
unittests: add a runner to execute all unittests
docs: kdoc: create a CMatch to match nested C blocks
tools: unittests: add tests for CMatch
docs: c_lex: properly implement a sub() method for CMatch
unittests: test_cmatch: add tests for sub()
docs: kdoc: replace NestedMatch with CMatch
docs: kdoc_re: get rid of NestedMatch class
docs: xforms_lists: handle struct_group directly
docs: xforms_lists: better evaluate struct_group macros
docs: c_lex: add support to work with pure name ids
docs: xforms_lists: use CMatch for all identifiers
docs: c_lex: add "@" operator
docs: c_lex: don't exclude an extra token
docs: c_lex: setup a logger to report tokenizer issues
docs: unittests: add and adjust tests to check for errors
docs: c_lex: better handle BEGIN/END at search
docs: kernel-doc.rst: document private: scope propagation
docs: c_lex: produce a cleaner str() representation
unittests: test_cmatch: remove weird stuff from expected results
Documentation/doc-guide/kernel-doc.rst | 6 +
Documentation/tools/python.rst | 2 +
Documentation/tools/unittest.rst | 24 +
tools/lib/python/kdoc/c_lex.py | 645 +++++++++++++++++++
tools/lib/python/kdoc/kdoc_parser.py | 29 +-
tools/lib/python/kdoc/kdoc_re.py | 201 ------
tools/lib/python/kdoc/xforms_lists.py | 209 +++----
tools/lib/python/unittest_helper.py | 353 +++++++++++
tools/unittests/run.py | 17 +
tools/unittests/test_cmatch.py | 821 +++++++++++++++++++++++++
tools/unittests/test_tokenizer.py | 462 ++++++++++++++
11 files changed, 2434 insertions(+), 335 deletions(-)
create mode 100644 Documentation/tools/unittest.rst
create mode 100644 tools/lib/python/kdoc/c_lex.py
create mode 100755 tools/lib/python/unittest_helper.py
create mode 100755 tools/unittests/run.py
create mode 100755 tools/unittests/test_cmatch.py
create mode 100755 tools/unittests/test_tokenizer.py
--
2.52.0
^ permalink raw reply [flat|nested] 47+ messages in thread
* [PATCH v2 01/28] docs: python: add helpers to run unit tests
2026-03-12 14:54 [PATCH v2 00/28] kernel-doc: use a C lexical tokenizer for transforms Mauro Carvalho Chehab
@ 2026-03-12 14:54 ` Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 02/28] unittests: add a testbench to check public/private kdoc comments Mauro Carvalho Chehab
` (29 subsequent siblings)
30 siblings, 0 replies; 47+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-12 14:54 UTC (permalink / raw)
To: Jonathan Corbet, Linux Doc Mailing List
Cc: Mauro Carvalho Chehab, linux-hardening, linux-kernel,
Mauro Carvalho Chehab, Shuah Khan
While Python's internal libraries have support for unit tests, their
output is not nice. Add a helper module to improve it.
I wrote this module last year while testing some scripts I used
internally. The initial skeleton was generated with the help of
LLM tools, but it was highly modified to ensure that it works
as I would expect.
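For comparison, the stock unittest runner output that this helper improves
upon can be reproduced with a minimal, self-contained example (not part of
the patch itself):

```python
import io
import unittest

class TestExample(unittest.TestCase):
    """A trivial test case, just to exercise the default runner."""
    def test_add(self):
        self.assertEqual(2 + 2, 4)

# Run the suite programmatically and capture the default output format
stream = io.StringIO()
suite = unittest.TestLoader().loadTestsFromTestCase(TestExample)
result = unittest.TextTestRunner(stream=stream, verbosity=2).run(suite)
print(stream.getvalue())
```

The plain-text "ok"/"FAIL" lines printed here are what the helper replaces
with a colored, hierarchical summary.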
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Message-ID: <37999041f616ddef41e84cf2686c0264d1a51dc9.1773074166.git.mchehab+huawei@kernel.org>
---
Documentation/tools/python.rst | 2 +
Documentation/tools/unittest.rst | 24 ++
tools/lib/python/unittest_helper.py | 353 ++++++++++++++++++++++++++++
3 files changed, 379 insertions(+)
create mode 100644 Documentation/tools/unittest.rst
create mode 100755 tools/lib/python/unittest_helper.py
diff --git a/Documentation/tools/python.rst b/Documentation/tools/python.rst
index 1444c1816735..3b7299161f20 100644
--- a/Documentation/tools/python.rst
+++ b/Documentation/tools/python.rst
@@ -11,3 +11,5 @@ Python libraries
feat
kdoc
kabi
+
+ unittest
diff --git a/Documentation/tools/unittest.rst b/Documentation/tools/unittest.rst
new file mode 100644
index 000000000000..14a2b2a65236
--- /dev/null
+++ b/Documentation/tools/unittest.rst
@@ -0,0 +1,24 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============
+Python unittest
+===============
+
+Checking consistency of Python modules can be complex. Sometimes, it is
+useful to define a set of unit tests to help check them.
+
+While the actual test implementation is use-case dependent, Python already
+provides a standard way to add unit tests by using ``import unittest``.
+
+Using such a class requires setting up a test suite. Also, the default format
+is a little bit awkward. To improve it and provide a more uniform way to
+report errors, some unittest helper classes and functions are defined.
+
+
+Unittest helper module
+======================
+
+.. automodule:: lib.python.unittest_helper
+ :members:
+ :show-inheritance:
+ :undoc-members:
diff --git a/tools/lib/python/unittest_helper.py b/tools/lib/python/unittest_helper.py
new file mode 100755
index 000000000000..55d444cd73d4
--- /dev/null
+++ b/tools/lib/python/unittest_helper.py
@@ -0,0 +1,353 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+# Copyright(c) 2025-2026: Mauro Carvalho Chehab <mchehab@kernel.org>.
+#
+# pylint: disable=C0103,R0912,R0914,E1101
+
+"""
+Provides helper functions and classes to execute Python unit tests.
+Those helper functions provide a nice colored output summary of each
+executed test and, when a test fails, show the difference in diff
+format when running in verbose mode, like::
+format when running in verbose mode, like::
+
+ $ tools/unittests/nested_match.py -v
+ ...
+ Traceback (most recent call last):
+ File "/new_devel/docs/tools/unittests/nested_match.py", line 69, in test_count_limit
+ self.assertEqual(replaced, "bar(a); bar(b); foo(c)")
+ ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ AssertionError: 'bar(a) foo(b); foo(c)' != 'bar(a); bar(b); foo(c)'
+ - bar(a) foo(b); foo(c)
+ ? ^^^^
+ + bar(a); bar(b); foo(c)
+ ? ^^^^^
+ ...
+
+It also allows filtering what tests will be executed via ``-k`` parameter.
+
+Typical usage is to do::
+
+ from unittest_helper import run_unittest
+ ...
+
+ if __name__ == "__main__":
+ run_unittest(__file__)
+
+If passing arguments is needed, in a more complex scenario, it can be
+used like in this example::
+
+ from unittest_helper import TestUnits, run_unittest
+ ...
+ env = {'sudo': ""}
+ ...
+ if __name__ == "__main__":
+ runner = TestUnits()
+ base_parser = runner.parse_args()
+ base_parser.add_argument('--sudo', action='store_true',
+ help='Enable tests requiring sudo privileges')
+
+ args = base_parser.parse_args()
+
+ # Update module-level flag
+ if args.sudo:
+ env['sudo'] = "1"
+
+ # Run tests with customized arguments
+ runner.run(__file__, parser=base_parser, args=args, env=env)
+"""
+
+import argparse
+import atexit
+import os
+import re
+import unittest
+import sys
+
+from unittest.mock import patch
+
+
+class Summary(unittest.TestResult):
+ """
+ Overrides ``unittest.TestResult`` class to provide a nice colored
+ summary. When in verbose mode, displays actual/expected difference in
+ unified diff format.
+ """
+ def __init__(self, *args, **kwargs):
+ super().__init__(*args, **kwargs)
+
+ #: Dictionary to store organized test results.
+ self.test_results = {}
+
+ #: max length of the test names.
+ self.max_name_length = 0
+
+ def startTest(self, test):
+ super().startTest(test)
+ test_id = test.id()
+ parts = test_id.split(".")
+
+ # Extract module, class, and method names
+ if len(parts) >= 3:
+ module_name = parts[-3]
+ else:
+ module_name = ""
+ if len(parts) >= 2:
+ class_name = parts[-2]
+ else:
+ class_name = ""
+
+ method_name = parts[-1]
+
+ # Build the hierarchical structure
+ if module_name not in self.test_results:
+ self.test_results[module_name] = {}
+
+ if class_name not in self.test_results[module_name]:
+ self.test_results[module_name][class_name] = []
+
+ # Track maximum test name length for alignment
+ display_name = f"{method_name}:"
+
+ self.max_name_length = max(len(display_name), self.max_name_length)
+
+ def _record_test(self, test, status):
+ test_id = test.id()
+ parts = test_id.split(".")
+ if len(parts) >= 3:
+ module_name = parts[-3]
+ else:
+ module_name = ""
+ if len(parts) >= 2:
+ class_name = parts[-2]
+ else:
+ class_name = ""
+ method_name = parts[-1]
+ self.test_results[module_name][class_name].append((method_name, status))
+
+ def addSuccess(self, test):
+ super().addSuccess(test)
+ self._record_test(test, "OK")
+
+ def addFailure(self, test, err):
+ super().addFailure(test, err)
+ self._record_test(test, "FAIL")
+
+ def addError(self, test, err):
+ super().addError(test, err)
+ self._record_test(test, "ERROR")
+
+ def addSkip(self, test, reason):
+ super().addSkip(test, reason)
+ self._record_test(test, f"SKIP ({reason})")
+
+ def printResults(self):
+ """
+ Print results using colors if tty.
+ """
+ # Check for ANSI color support
+ use_color = sys.stdout.isatty()
+ COLORS = {
+ "OK": "\033[32m", # Green
+ "FAIL": "\033[31m", # Red
+ "SKIP": "\033[1;33m", # Yellow
+ "PARTIAL": "\033[33m", # Orange
+ "EXPECTED_FAIL": "\033[36m", # Cyan
+ "reset": "\033[0m", # Reset to default terminal color
+ }
+ if not use_color:
+ for c in COLORS:
+ COLORS[c] = ""
+
+ # Calculate maximum test name length
+ if not self.test_results:
+ return
+ try:
+ lengths = []
+ for module in self.test_results.values():
+ for tests in module.values():
+ for test_name, _ in tests:
+ lengths.append(len(test_name) + 1) # +1 for colon
+ max_length = max(lengths) + 2 # Additional padding
+ except ValueError:
+ sys.exit("Test list is empty")
+
+ # Print results
+ for module_name, classes in self.test_results.items():
+ print(f"{module_name}:")
+ for class_name, tests in classes.items():
+ print(f" {class_name}:")
+ for test_name, status in tests:
+ # Get base status without reason for SKIP
+ if status.startswith("SKIP"):
+ status_code = status.split()[0]
+ else:
+ status_code = status
+ color = COLORS.get(status_code, "")
+ print(
+ f" {test_name + ':':<{max_length}}{color}{status}{COLORS['reset']}"
+ )
+ print()
+
+ # Print summary
+ print(f"\nRan {self.testsRun} tests", end="")
+ if hasattr(self, "timeTaken"):
+ print(f" in {self.timeTaken:.3f}s", end="")
+ print()
+
+ if not self.wasSuccessful():
+ print(f"\n{COLORS['FAIL']}FAILED (", end="")
+ failures = getattr(self, "failures", [])
+ errors = getattr(self, "errors", [])
+ if failures:
+ print(f"failures={len(failures)}", end="")
+ if errors:
+ if failures:
+ print(", ", end="")
+ print(f"errors={len(errors)}", end="")
+ print(f"){COLORS['reset']}")
+
+
+def flatten_suite(suite):
+ """Flatten test suite hierarchy."""
+ tests = []
+ for item in suite:
+ if isinstance(item, unittest.TestSuite):
+ tests.extend(flatten_suite(item))
+ else:
+ tests.append(item)
+ return tests
+
+
+class TestUnits:
+ """
+ Helper class to set verbosity level.
+
+ This class discovers test files, imports their unittest classes and
+ executes the tests in them.
+ """
+ def parse_args(self):
+ """Returns a parser for command line arguments."""
+ parser = argparse.ArgumentParser(description="Test runner with regex filtering")
+ parser.add_argument("-v", "--verbose", action="count", default=1)
+ parser.add_argument("-f", "--failfast", action="store_true")
+ parser.add_argument("-k", "--keyword",
+ help="Regex pattern to filter test methods")
+ return parser
+
+ def run(self, caller_file=None, pattern=None,
+ suite=None, parser=None, args=None, env=None):
+ """
+ Execute all tests from the unit test file.
+
+ It contains several optional parameters:
+
+ ``caller_file``:
+ - name of the file that contains test.
+
+ typical usage is to place __file__ at the caller test, e.g.::
+
+ if __name__ == "__main__":
+ TestUnits().run(__file__)
+
+ ``pattern``:
+ - optional pattern to match multiple file names. Defaults
+ to basename of ``caller_file``.
+
+ ``suite``:
+ - an unittest suite initialized by the caller using
+ ``unittest.TestLoader().discover()``.
+
+ ``parser``:
+ - an argparse parser. If not defined, this helper will create
+ one.
+
+ ``args``:
+ - an ``argparse.Namespace`` data filled by the caller.
+
+ ``env``:
+ - environment variables that will be passed to the test suite
+
+ At least ``caller_file`` or ``suite`` must be used, otherwise a
+ ``TypeError`` will be raised.
+ """
+ if not args:
+ if not parser:
+ parser = self.parse_args()
+ args = parser.parse_args()
+
+ if not caller_file and not suite:
+ raise TypeError("Either caller_file or suite is needed at TestUnits")
+
+ verbose = args.verbose
+
+ if not env:
+ env = os.environ.copy()
+
+ env["VERBOSE"] = f"{verbose}"
+
+ patcher = patch.dict(os.environ, env)
+ patcher.start()
+ # ensure it gets stopped after
+ atexit.register(patcher.stop)
+
+
+ if verbose >= 2:
+ unittest.TextTestRunner(verbosity=verbose).run = lambda suite: suite
+
+ # Load ONLY tests from the calling file
+ if not suite:
+ if not pattern:
+ pattern = caller_file
+
+ loader = unittest.TestLoader()
+ suite = loader.discover(start_dir=os.path.dirname(caller_file),
+ pattern=os.path.basename(caller_file))
+
+ # Flatten the suite for environment injection
+ tests_to_inject = flatten_suite(suite)
+
+ # Filter tests by method name if -k specified
+ if args.keyword:
+ try:
+ pattern = re.compile(args.keyword)
+ filtered_suite = unittest.TestSuite()
+ for test in tests_to_inject: # Use the pre-flattened list
+ method_name = test.id().split(".")[-1]
+ if pattern.search(method_name):
+ filtered_suite.addTest(test)
+ suite = filtered_suite
+ except re.error as e:
+ sys.stderr.write(f"Invalid regex pattern: {e}\n")
+ sys.exit(1)
+ else:
+ # Maintain original suite structure if no keyword filtering
+ suite = unittest.TestSuite(tests_to_inject)
+
+ if verbose >= 2:
+ resultclass = None
+ else:
+ resultclass = Summary
+
+ runner = unittest.TextTestRunner(verbosity=args.verbose,
+ resultclass=resultclass,
+ failfast=args.failfast)
+ result = runner.run(suite)
+ if resultclass:
+ result.printResults()
+
+ sys.exit(not result.wasSuccessful())
+
+
+def run_unittest(fname):
+ """
+ Basic usage of TestUnits class.
+
+ Use it when there's no need to pass any extra arguments to the
+ tests. The recommended way is to place this at the end of each
+ unittest module::
+
+ if __name__ == "__main__":
+ run_unittest(__file__)
+ """
+ TestUnits().run(fname)
--
2.52.0
* [PATCH v2 02/28] unittests: add a testbench to check public/private kdoc comments
2026-03-12 14:54 [PATCH v2 00/28] kernel-doc: use a C lexical tokenizer for transforms Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 01/28] docs: python: add helpers to run unit tests Mauro Carvalho Chehab
@ 2026-03-12 14:54 ` Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 03/28] docs: kdoc: don't add broken comments inside prototypes Mauro Carvalho Chehab
` (28 subsequent siblings)
30 siblings, 0 replies; 47+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-12 14:54 UTC (permalink / raw)
To: Jonathan Corbet, Linux Doc Mailing List
Cc: Mauro Carvalho Chehab, linux-hardening, linux-kernel
Add unit tests to check whether the public/private handling and comment
stripping are working properly.
Running them shows that, in several cases, public/private is not
doing what is expected:
test_private:
TestPublicPrivate:
test balanced_inner_private: OK
test balanced_non_greddy_private: OK
test balanced_private: OK
test no private: OK
test unbalanced_inner_private: FAIL
test unbalanced_private: FAIL
test unbalanced_struct_group_tagged_with_private: FAIL
test unbalanced_two_struct_group_tagged_first_with_private: FAIL
test unbalanced_without_end_of_line: FAIL
Ran 9 tests
FAILED (failures=5)
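For reference, the balanced cases can be approximated with a naive
flat-text sketch (illustrative only; it cannot track brace nesting or the
unbalanced cases, which is precisely what the real trim_private_members
tokenizer-based logic is for):

```python
import re

def naive_trim_private(source):
    # Drop everything from a "/* private: */" comment up to and
    # including the next "/* public: */" comment.  This flat-text
    # approach cannot handle nesting or a missing "public:".
    return re.sub(r"/\*\s*private:[^*]*\*/.*?/\*\s*public:[^*]*\*/",
                  "", source, flags=re.S)

src = """struct foo {
    int a;
    /* private: */
    int b;
    /* public: */
    int c;
};"""

out = naive_trim_private(src)
print(out)
```

The private member disappears while the public ones survive; the
unbalanced cases above are exactly where this naive approach breaks down.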
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Message-ID: <144f4952e0cb74fe9c9adc117e9a21ec8aa1cc10.1773074166.git.mchehab+huawei@kernel.org>
---
tools/unittests/test_private.py | 331 ++++++++++++++++++++++++++++++++
1 file changed, 331 insertions(+)
create mode 100755 tools/unittests/test_private.py
diff --git a/tools/unittests/test_private.py b/tools/unittests/test_private.py
new file mode 100755
index 000000000000..eae245ae8a12
--- /dev/null
+++ b/tools/unittests/test_private.py
@@ -0,0 +1,331 @@
+#!/usr/bin/env python3
+
+"""
+Unit tests for struct/union member extractor class.
+"""
+
+
+import os
+import re
+import unittest
+import sys
+
+from unittest.mock import MagicMock
+
+SRC_DIR = os.path.dirname(os.path.realpath(__file__))
+sys.path.insert(0, os.path.join(SRC_DIR, "../lib/python"))
+
+from kdoc.kdoc_parser import trim_private_members
+from unittest_helper import run_unittest
+
+#
+# List of tests.
+#
+# The code will dynamically generate one test for each key on this dictionary.
+#
+
+#: Tests to check if CTokenizer properly handles public/private comments.
+TESTS_PRIVATE = {
+ #
+ # Simplest case: no private. Ensure that trimming won't affect struct
+ #
+ "no private": {
+ "source": """
+ struct foo {
+ int a;
+ int b;
+ int c;
+ };
+ """,
+ "trimmed": """
+ struct foo {
+ int a;
+ int b;
+ int c;
+ };
+ """,
+ },
+
+ #
+ # Play "by the books" by always having a public in place
+ #
+
+ "balanced_private": {
+ "source": """
+ struct foo {
+ int a;
+ /* private: */
+ int b;
+ /* public: */
+ int c;
+ };
+ """,
+ "trimmed": """
+ struct foo {
+ int a;
+ int c;
+ };
+ """,
+ },
+
+ "balanced_non_greddy_private": {
+ "source": """
+ struct foo {
+ int a;
+ /* private: */
+ int b;
+ /* public: */
+ int c;
+ /* private: */
+ int d;
+ /* public: */
+ int e;
+
+ };
+ """,
+ "trimmed": """
+ struct foo {
+ int a;
+ int c;
+ int e;
+ };
+ """,
+ },
+
+ "balanced_inner_private": {
+ "source": """
+ struct foo {
+ struct {
+ int a;
+ /* private: ignore below */
+ int b;
+ /* public: but this should not be ignored */
+ };
+ int b;
+ };
+ """,
+ "trimmed": """
+ struct foo {
+ struct {
+ int a;
+ };
+ int b;
+ };
+ """,
+ },
+
+ #
+ # Test what happens if there's no public after private place
+ #
+
+ "unbalanced_private": {
+ "source": """
+ struct foo {
+ int a;
+ /* private: */
+ int b;
+ int c;
+ };
+ """,
+ "trimmed": """
+ struct foo {
+ int a;
+ };
+ """,
+ },
+
+ "unbalanced_inner_private": {
+ "source": """
+ struct foo {
+ struct {
+ int a;
+ /* private: ignore below */
+ int b;
+ /* but this should not be ignored */
+ };
+ int b;
+ };
+ """,
+ "trimmed": """
+ struct foo {
+ struct {
+ int a;
+ };
+ int b;
+ };
+ """,
+ },
+
+ "unbalanced_struct_group_tagged_with_private": {
+ "source": """
+ struct page_pool_params {
+ struct_group_tagged(page_pool_params_fast, fast,
+ unsigned int order;
+ unsigned int pool_size;
+ int nid;
+ struct device *dev;
+ struct napi_struct *napi;
+ enum dma_data_direction dma_dir;
+ unsigned int max_len;
+ unsigned int offset;
+ };
+ struct_group_tagged(page_pool_params_slow, slow,
+ struct net_device *netdev;
+ unsigned int queue_idx;
+ unsigned int flags;
+ /* private: used by test code only */
+ void (*init_callback)(netmem_ref netmem, void *arg);
+ void *init_arg;
+ };
+ };
+ """,
+ "trimmed": """
+ struct page_pool_params {
+ struct_group_tagged(page_pool_params_fast, fast,
+ unsigned int order;
+ unsigned int pool_size;
+ int nid;
+ struct device *dev;
+ struct napi_struct *napi;
+ enum dma_data_direction dma_dir;
+ unsigned int max_len;
+ unsigned int offset;
+ };
+ struct_group_tagged(page_pool_params_slow, slow,
+ struct net_device *netdev;
+ unsigned int queue_idx;
+ unsigned int flags;
+ };
+ };
+ """,
+ },
+
+ "unbalanced_two_struct_group_tagged_first_with_private": {
+ "source": """
+ struct page_pool_params {
+ struct_group_tagged(page_pool_params_slow, slow,
+ struct net_device *netdev;
+ unsigned int queue_idx;
+ unsigned int flags;
+ /* private: used by test code only */
+ void (*init_callback)(netmem_ref netmem, void *arg);
+ void *init_arg;
+ };
+ struct_group_tagged(page_pool_params_fast, fast,
+ unsigned int order;
+ unsigned int pool_size;
+ int nid;
+ struct device *dev;
+ struct napi_struct *napi;
+ enum dma_data_direction dma_dir;
+ unsigned int max_len;
+ unsigned int offset;
+ };
+ };
+ """,
+ "trimmed": """
+ struct page_pool_params {
+ struct_group_tagged(page_pool_params_slow, slow,
+ struct net_device *netdev;
+ unsigned int queue_idx;
+ unsigned int flags;
+ };
+ struct_group_tagged(page_pool_params_fast, fast,
+ unsigned int order;
+ unsigned int pool_size;
+ int nid;
+ struct device *dev;
+ struct napi_struct *napi;
+ enum dma_data_direction dma_dir;
+ unsigned int max_len;
+ unsigned int offset;
+ };
+ };
+ """,
+ },
+ "unbalanced_without_end_of_line": {
+ "source": """ \
+ struct page_pool_params { \
+ struct_group_tagged(page_pool_params_slow, slow, \
+ struct net_device *netdev; \
+ unsigned int queue_idx; \
+ unsigned int flags;
+ /* private: used by test code only */
+ void (*init_callback)(netmem_ref netmem, void *arg); \
+ void *init_arg; \
+ }; \
+ struct_group_tagged(page_pool_params_fast, fast, \
+ unsigned int order; \
+ unsigned int pool_size; \
+ int nid; \
+ struct device *dev; \
+ struct napi_struct *napi; \
+ enum dma_data_direction dma_dir; \
+ unsigned int max_len; \
+ unsigned int offset; \
+ }; \
+ };
+ """,
+ "trimmed": """
+ struct page_pool_params {
+ struct_group_tagged(page_pool_params_slow, slow,
+ struct net_device *netdev;
+ unsigned int queue_idx;
+ unsigned int flags;
+ };
+ struct_group_tagged(page_pool_params_fast, fast,
+ unsigned int order;
+ unsigned int pool_size;
+ int nid;
+ struct device *dev;
+ struct napi_struct *napi;
+ enum dma_data_direction dma_dir;
+ unsigned int max_len;
+ unsigned int offset;
+ };
+ };
+ """,
+ },
+}
+
+
+class TestPublicPrivate(unittest.TestCase):
+ """
+ Main test class. Populated dynamically at runtime.
+ """
+
+ def setUp(self):
+ self.maxDiff = None
+
+ def add_test(cls, name, source, trimmed):
+ """
+ Dynamically add a test to the class
+ """
+ def test(cls):
+ result = trim_private_members(source)
+
+ result = re.sub(r"\s++", " ", result).strip()
+ expected = re.sub(r"\s++", " ", trimmed).strip()
+
+ msg = f"failed when parsing this source:\n" + source
+
+ cls.assertEqual(result, expected, msg=msg)
+
+ test.__name__ = f'test {name}'
+
+ setattr(TestPublicPrivate, test.__name__, test)
+
+
+#
+# Populate TestPublicPrivate class
+#
+test_class = TestPublicPrivate()
+for name, test in TESTS_PRIVATE.items():
+ test_class.add_test(name, test["source"], test["trimmed"])
+
+
+#
+# main
+#
+if __name__ == "__main__":
+ run_unittest(__file__)
--
2.52.0
* [PATCH v2 03/28] docs: kdoc: don't add broken comments inside prototypes
2026-03-12 14:54 [PATCH v2 00/28] kernel-doc: use a C lexical tokenizer for transforms Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 01/28] docs: python: add helpers to run unit tests Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 02/28] unittests: add a testbench to check public/private kdoc comments Mauro Carvalho Chehab
@ 2026-03-12 14:54 ` Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 04/28] docs: kdoc: properly handle empty enum arguments Mauro Carvalho Chehab
` (27 subsequent siblings)
30 siblings, 0 replies; 47+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-12 14:54 UTC (permalink / raw)
To: Jonathan Corbet, Linux Doc Mailing List
Cc: Mauro Carvalho Chehab, linux-hardening, linux-kernel,
Aleksandr Loktionov, Randy Dunlap
Parsing a file like drivers/scsi/isci/host.h, which contains
broken kernel-doc markups, makes kernel-doc create a prototype that
contains unmatched end comments.
That causes, for instance, struct sci_power_control to be shown with
this prototype:
struct sci_power_control {
* it is not. */ bool timer_started;
*/ struct sci_timer timer;
* requesters field. */ u8 phys_waiting;
*/ u8 phys_granted_power;
* mapped into requesters via struct sci_phy.phy_index */ struct isci_phy *requesters[SCI_MAX_PHYS];
};
since the comments no longer start with "/*".
Fix the logic to detect such cases, keeping the comments complete
inside the prototype.
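The failure mode is easy to see with a one-line illustration (not the
actual kernel-doc code): once a prototype carries a stray "*/" without its
opening "/*", a balanced comment-removal pattern leaves it untouched:

```python
import re

# A prototype fragment with an unmatched end-of-comment marker
proto = "struct sci_power_control { */ bool timer_started; };"

# Balanced comment removal expects "/* ... */" pairs, so the stray
# "*/" survives the substitution:
cleaned = re.sub(r"/\*.*?\*/", "", proto, flags=re.S)
print(cleaned)
```

Prepending the missing "/**" opener before running such logic keeps the
comment pairs balanced and removable.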
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Message-ID: <18e577dbbd538dcc22945ff139fe3638344e14f0.1773074166.git.mchehab+huawei@kernel.org>
---
tools/lib/python/kdoc/kdoc_parser.py | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/tools/lib/python/kdoc/kdoc_parser.py b/tools/lib/python/kdoc/kdoc_parser.py
index edf70ba139a5..086579d00b5c 100644
--- a/tools/lib/python/kdoc/kdoc_parser.py
+++ b/tools/lib/python/kdoc/kdoc_parser.py
@@ -1355,6 +1355,12 @@ class KernelDoc:
elif doc_content.search(line):
self.emit_msg(ln, f"Incorrect use of kernel-doc format: {line}")
self.state = state.PROTO
+
+ #
+ # Don't let it add partial comments at the code, as breaks the
+ # logic meant to remove comments from prototypes.
+ #
+ self.process_proto_type(ln, "/**\n" + line)
# else ... ??
def process_inline_text(self, ln, line):
--
2.52.0
* [PATCH v2 04/28] docs: kdoc: properly handle empty enum arguments
2026-03-12 14:54 [PATCH v2 00/28] kernel-doc: use a C lexical tokenizer for transforms Mauro Carvalho Chehab
` (2 preceding siblings ...)
2026-03-12 14:54 ` [PATCH v2 03/28] docs: kdoc: don't add broken comments inside prototypes Mauro Carvalho Chehab
@ 2026-03-12 14:54 ` Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 05/28] docs: kdoc_re: add a C tokenizer Mauro Carvalho Chehab
` (26 subsequent siblings)
30 siblings, 0 replies; 47+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-12 14:54 UTC (permalink / raw)
To: Jonathan Corbet, Linux Doc Mailing List
Cc: Mauro Carvalho Chehab, linux-hardening, linux-kernel,
Aleksandr Loktionov, Randy Dunlap
Depending on how the enum prototype is written, a comma at the end
may incorrectly make kernel-doc parse an argument like " ".
Strip spaces before checking whether the argument is empty.
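A minimal stand-alone illustration of the issue (not the actual kernel-doc
code path, but the same split-then-check pattern):

```python
import re

members = "FOO_A, FOO_B, FOO_C, "   # trailing comma in the enum body

params = []
for arg in members.split(","):
    arg = re.sub(r"^\s*(\w+).*", r"\1", arg)
    # The old `if not arg` check missed the final " " element produced
    # by the trailing comma; stripping first catches it:
    if not arg.strip():
        continue
    params.append(arg)

print(params)
```

Without the strip(), the final " " element would be appended as a bogus
parameter named " ".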
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Message-ID: <4182bfb7e5f5b4bbaf05cee1bede691e56247eaf.1773074166.git.mchehab+huawei@kernel.org>
---
tools/lib/python/kdoc/kdoc_parser.py | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/tools/lib/python/kdoc/kdoc_parser.py b/tools/lib/python/kdoc/kdoc_parser.py
index 086579d00b5c..4b3c555e6c8e 100644
--- a/tools/lib/python/kdoc/kdoc_parser.py
+++ b/tools/lib/python/kdoc/kdoc_parser.py
@@ -810,9 +810,10 @@ class KernelDoc:
member_set = set()
members = KernRe(r'\([^;)]*\)').sub('', members)
for arg in members.split(','):
- if not arg:
- continue
arg = KernRe(r'^\s*(\w+).*').sub(r'\1', arg)
+ if not arg.strip():
+ continue
+
self.entry.parameterlist.append(arg)
if arg not in self.entry.parameterdescs:
self.entry.parameterdescs[arg] = self.undescribed
--
2.52.0
* [PATCH v2 05/28] docs: kdoc_re: add a C tokenizer
2026-03-12 14:54 [PATCH v2 00/28] kernel-doc: use a C lexical tokenizer for transforms Mauro Carvalho Chehab
` (3 preceding siblings ...)
2026-03-12 14:54 ` [PATCH v2 04/28] docs: kdoc: properly handle empty enum arguments Mauro Carvalho Chehab
@ 2026-03-12 14:54 ` Mauro Carvalho Chehab
2026-03-16 23:01 ` Jonathan Corbet
2026-03-16 23:03 ` Jonathan Corbet
2026-03-12 14:54 ` [PATCH v2 06/28] docs: kdoc: use tokenizer to handle comments on structs Mauro Carvalho Chehab
` (25 subsequent siblings)
30 siblings, 2 replies; 47+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-12 14:54 UTC (permalink / raw)
To: Jonathan Corbet, Linux Doc Mailing List
Cc: Mauro Carvalho Chehab, linux-hardening, linux-kernel,
Aleksandr Loktionov, Randy Dunlap
Handling C code purely with regular expressions doesn't work well.
Add a C tokenizer to help do it the right way.
The tokenizer is based on the tokenizer example from the Python
re documentation:
https://docs.python.org/3/library/re.html#writing-a-tokenizer
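As a rough idea of that approach, the re-documentation pattern combines one
named group per token kind into a single alternation and dispatches on
``match.lastgroup``. The token set below is a toy subset for illustration,
not the patch's full TOKEN_LIST:

```python
import re

# Minimal named-group scanner in the style of the Python re docs example.
TOKENS = [
    ("NUMBER",   r"\d+"),
    ("NAME",     r"[A-Za-z_]\w*"),
    ("OP",       r"[=+\-]"),
    ("SPACE",    r"\s+"),
    ("MISMATCH", r"."),
]
SCANNER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKENS))

def tokenize(code):
    """Yield (kind, text) pairs, skipping whitespace."""
    for m in SCANNER.finditer(code):
        if m.lastgroup != "SPACE":
            yield (m.lastgroup, m.group())
```

For example, ``list(tokenize("x = 42"))`` produces NAME, OP and NUMBER
tokens in source order.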
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Message-ID: <c63ad36c81fe043e9e33ca55630414893f127413.1773074166.git.mchehab+huawei@kernel.org>
---
tools/lib/python/kdoc/kdoc_re.py | 234 +++++++++++++++++++++++++++++++
1 file changed, 234 insertions(+)
diff --git a/tools/lib/python/kdoc/kdoc_re.py b/tools/lib/python/kdoc/kdoc_re.py
index 085b89a4547c..7bed4e9a8810 100644
--- a/tools/lib/python/kdoc/kdoc_re.py
+++ b/tools/lib/python/kdoc/kdoc_re.py
@@ -141,6 +141,240 @@ class KernRe:
return self.last_match.groups()
+class TokType():
+
+ @staticmethod
+ def __str__(val):
+ """Return the name of an enum value"""
+ return TokType._name_by_val.get(val, f"UNKNOWN({val})")
+
+class CToken():
+ """
+ Data class to define a C token.
+ """
+
+ # Tokens that can be used by the parser. Works like a C enum.
+
+ COMMENT = 0 #: A standard C or C99 comment, including delimiter.
+ STRING = 1 #: A string, including quotation marks.
+ CHAR = 2 #: A character, including apostrophes.
+ NUMBER = 3 #: A number.
+ PUNC = 4 #: A punctuation mark: ``;`` / ``,`` / ``.``.
+ BEGIN = 5 #: A begin character: ``{`` / ``[`` / ``(``.
+ END = 6 #: An end character: ``}`` / ``]`` / ``)``.
+ CPP = 7 #: A preprocessor macro.
+ HASH = 8 #: The hash character - useful to handle other macros.
+ OP = 9 #: A C operator (add, subtract, ...).
+ STRUCT = 10 #: A ``struct`` keyword.
+ UNION = 11 #: A ``union`` keyword.
+ ENUM = 12 #: An ``enum`` keyword.
+ TYPEDEF = 13 #: A ``typedef`` keyword.
+ NAME = 14 #: A name. Can be an ID or a type.
+ SPACE = 15 #: Any space characters, including new lines.
+
+ MISMATCH = 255 #: an error indicator: should never happen in practice.
+
+ # Dict to convert from an enum integer into a string.
+ _name_by_val = {v: k for k, v in dict(vars()).items() if isinstance(v, int)}
+
+ # Dict to convert from string to an enum-like integer value.
+ _name_to_val = {k: v for v, k in _name_by_val.items()}
+
+ @staticmethod
+ def to_name(val):
+ """Convert from an integer value from CToken enum into a string"""
+
+ return CToken._name_by_val.get(val, f"UNKNOWN({val})")
+
+ @staticmethod
+ def from_name(name):
+ """Convert a string into a CToken enum value"""
+ if name in CToken._name_to_val:
+ return CToken._name_to_val[name]
+
+ return CToken.MISMATCH
+
+ def __init__(self, kind, value, pos,
+ brace_level, paren_level, bracket_level):
+ self.kind = kind
+ self.value = value
+ self.pos = pos
+ self.brace_level = brace_level
+ self.paren_level = paren_level
+ self.bracket_level = bracket_level
+
+ def __repr__(self):
+ name = self.to_name(self.kind)
+ if isinstance(self.value, str):
+ value = '"' + self.value + '"'
+ else:
+ value = self.value
+
+ return f"CToken({name}, {value}, {self.pos}, " \
+ f"{self.brace_level}, {self.paren_level}, {self.bracket_level})"
+
+#: Tokens to parse C code.
+TOKEN_LIST = [
+ (CToken.COMMENT, r"//[^\n]*|/\*[\s\S]*?\*/"),
+
+ (CToken.STRING, r'"(?:\\.|[^"\\])*"'),
+ (CToken.CHAR, r"'(?:\\.|[^'\\])'"),
+
+ (CToken.NUMBER, r"0[xX][0-9a-fA-F]+[uUlL]*|0[0-7]+[uUlL]*|"
+ r"[0-9]+(\.[0-9]*)?([eE][+-]?[0-9]+)?[fFlL]*"),
+
+ (CToken.PUNC, r"[;,\.]"),
+
+ (CToken.BEGIN, r"[\[\(\{]"),
+
+ (CToken.END, r"[\]\)\}]"),
+
+ (CToken.CPP, r"#\s*(define|include|ifdef|ifndef|if|else|elif|endif|undef|pragma)\b"),
+
+ (CToken.HASH, r"#"),
+
+ (CToken.OP, r"\+\+|\-\-|\->|==|\!=|<=|>=|&&|\|\||<<|>>|\+=|\-=|\*=|/=|%="
+ r"|&=|\|=|\^=|=|\+|\-|\*|/|%|<|>|&|\||\^|~|!|\?|\:"),
+
+ (CToken.STRUCT, r"\bstruct\b"),
+ (CToken.UNION, r"\bunion\b"),
+ (CToken.ENUM, r"\benum\b"),
+ (CToken.TYPEDEF, r"\btypedef\b"),
+
+ (CToken.NAME, r"[A-Za-z_][A-Za-z0-9_]*"),
+
+ (CToken.SPACE, r"[\s]+"),
+
+ (CToken.MISMATCH,r"."),
+]
+
+#: Handle C continuation lines.
+RE_CONT = KernRe(r"\\\n")
+
+RE_COMMENT_START = KernRe(r'/\*\s*')
+
+#: tokenizer regex. Will be filled at the first CTokenizer usage.
+re_scanner = None
+
+class CTokenizer():
+ """
+ Scan C statements and definitions and produce tokens.
+
+ When converted to a string, it drops comments and handles
+ public:/private: markers, respecting depth.
+ """
+
+ # This class is inspired and follows the basic concepts of:
+ # https://docs.python.org/3/library/re.html#writing-a-tokenizer
+
+ def _tokenize(self, source):
+ """
+ Iterator that parses ``source``, splitting it into tokens, as defined
+ at ``TOKEN_LIST``.
+
+ The iterator yields CToken objects.
+ """
+
+ # Handle continuation lines. Note that kdoc_parser already has
+ # logic to do that. Still, let's keep it here for completeness, as we
+ # might end up reusing this tokenizer outside kernel-doc some day - or
+ # we may eventually remove the kdoc_parser logic as a future cleanup.
+ source = RE_CONT.sub("", source)
+
+ brace_level = 0
+ paren_level = 0
+ bracket_level = 0
+
+ for match in re_scanner.finditer(source):
+ kind = CToken.from_name(match.lastgroup)
+ pos = match.start()
+ value = match.group()
+
+ if kind == CToken.MISMATCH:
+ raise RuntimeError(f"Unexpected token '{value}' on {pos}:\n\t{source}")
+ elif kind == CToken.BEGIN:
+ if value == '(':
+ paren_level += 1
+ elif value == '[':
+ bracket_level += 1
+ else: # value == '{'
+ brace_level += 1
+
+ elif kind == CToken.END:
+ if value == ')' and paren_level > 0:
+ paren_level -= 1
+ elif value == ']' and bracket_level > 0:
+ bracket_level -= 1
+ elif brace_level > 0: # value == '}'
+ brace_level -= 1
+
+ yield CToken(kind, value, pos,
+ brace_level, paren_level, bracket_level)
+
+ def __init__(self, source):
+ """
+ Create a regular expression to handle TOKEN_LIST.
+
+ While I generally don't like using regex group naming via:
+ (?P<name>...)
+
+ in this particular case, it makes sense, as we can pick the name
+ when matching code via re_scanner().
+ """
+ global re_scanner
+
+ if not re_scanner:
+ re_tokens = []
+
+ for kind, pattern in TOKEN_LIST:
+ name = CToken.to_name(kind)
+ re_tokens.append(f"(?P<{name}>{pattern})")
+
+ re_scanner = KernRe("|".join(re_tokens), re.MULTILINE | re.DOTALL)
+
+ self.tokens = []
+ for tok in self._tokenize(source):
+ self.tokens.append(tok)
+
+ def __str__(self):
+ out = ""
+ show_stack = [True]
+
+ for tok in self.tokens:
+ if tok.kind == CToken.BEGIN:
+ show_stack.append(show_stack[-1])
+
+ elif tok.kind == CToken.END:
+ prev = show_stack[-1]
+ if len(show_stack) > 1:
+ show_stack.pop()
+
+ if not prev and show_stack[-1]:
+ #
+ # Try to preserve indent
+ #
+ out += "\t" * (len(show_stack) - 1)
+
+ out += str(tok.value)
+ continue
+
+ elif tok.kind == CToken.COMMENT:
+ comment = RE_COMMENT_START.sub("", tok.value)
+
+ if comment.startswith("private:"):
+ show_stack[-1] = False
+ show = False
+ elif comment.startswith("public:"):
+ show_stack[-1] = True
+
+ continue
+
+ if show_stack[-1]:
+ out += str(tok.value)
+
+ return out
+
+
#: Nested delimited pairs (brackets and parenthesis)
DELIMITER_PAIRS = {
'{': '}',
--
2.52.0
^ permalink raw reply related [flat|nested] 47+ messages in thread
* [PATCH v2 06/28] docs: kdoc: use tokenizer to handle comments on structs
2026-03-12 14:54 [PATCH v2 00/28] kernel-doc: use a C lexical tokenizer for transforms Mauro Carvalho Chehab
` (4 preceding siblings ...)
2026-03-12 14:54 ` [PATCH v2 05/28] docs: kdoc_re: add a C tokenizer Mauro Carvalho Chehab
@ 2026-03-12 14:54 ` Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 07/28] docs: kdoc: move C Tokenizer to c_lex module Mauro Carvalho Chehab
` (24 subsequent siblings)
30 siblings, 0 replies; 47+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-12 14:54 UTC (permalink / raw)
To: Jonathan Corbet, Linux Doc Mailing List
Cc: Mauro Carvalho Chehab, linux-hardening, linux-kernel,
Aleksandr Loktionov, Randy Dunlap
Better handle comments inside structs. After these changes,
all unit tests now pass:
test_private:
TestPublicPrivate:
test balanced_inner_private: OK
test balanced_non_greddy_private: OK
test balanced_private: OK
test no private: OK
test unbalanced_inner_private: OK
test unbalanced_private: OK
test unbalanced_struct_group_tagged_with_private: OK
test unbalanced_two_struct_group_tagged_first_with_private: OK
test unbalanced_without_end_of_line: OK
Ran 9 tests
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Message-ID: <f83ee9e8c38407eaab6ad10d4ccf155fb36683cc.1773074166.git.mchehab+huawei@kernel.org>
---
tools/lib/python/kdoc/kdoc_parser.py | 14 ++++----------
1 file changed, 4 insertions(+), 10 deletions(-)
diff --git a/tools/lib/python/kdoc/kdoc_parser.py b/tools/lib/python/kdoc/kdoc_parser.py
index 4b3c555e6c8e..6b181ead3175 100644
--- a/tools/lib/python/kdoc/kdoc_parser.py
+++ b/tools/lib/python/kdoc/kdoc_parser.py
@@ -13,7 +13,7 @@ import sys
import re
from pprint import pformat
-from kdoc.kdoc_re import NestedMatch, KernRe
+from kdoc.kdoc_re import NestedMatch, KernRe, CTokenizer
from kdoc.kdoc_item import KdocItem
#
@@ -84,15 +84,9 @@ def trim_private_members(text):
"""
Remove ``struct``/``enum`` members that have been marked "private".
"""
- # First look for a "public:" block that ends a private region, then
- # handle the "private until the end" case.
- #
- text = KernRe(r'/\*\s*private:.*?/\*\s*public:.*?\*/', flags=re.S).sub('', text)
- text = KernRe(r'/\*\s*private:.*', flags=re.S).sub('', text)
- #
- # We needed the comments to do the above, but now we can take them out.
- #
- return KernRe(r'\s*/\*.*?\*/\s*', flags=re.S).sub('', text).strip()
+
+ tokens = CTokenizer(text)
+ return str(tokens)
class state:
"""
--
2.52.0
^ permalink raw reply related [flat|nested] 47+ messages in thread
* [PATCH v2 07/28] docs: kdoc: move C Tokenizer to c_lex module
2026-03-12 14:54 [PATCH v2 00/28] kernel-doc: use a C lexical tokenizer for transforms Mauro Carvalho Chehab
` (5 preceding siblings ...)
2026-03-12 14:54 ` [PATCH v2 06/28] docs: kdoc: use tokenizer to handle comments on structs Mauro Carvalho Chehab
@ 2026-03-12 14:54 ` Mauro Carvalho Chehab
2026-03-16 23:30 ` Jonathan Corbet
2026-03-12 14:54 ` [PATCH v2 08/28] unittests: test_private: modify it to use CTokenizer directly Mauro Carvalho Chehab
` (23 subsequent siblings)
30 siblings, 1 reply; 47+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-12 14:54 UTC (permalink / raw)
To: Jonathan Corbet, Linux Doc Mailing List
Cc: Mauro Carvalho Chehab, linux-hardening, linux-kernel,
Aleksandr Loktionov, Randy Dunlap
Move the C tokenizer into a separate module.
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
tools/lib/python/kdoc/c_lex.py | 239 +++++++++++++++++++++++++++
tools/lib/python/kdoc/kdoc_parser.py | 3 +-
tools/lib/python/kdoc/kdoc_re.py | 233 --------------------------
3 files changed, 241 insertions(+), 234 deletions(-)
create mode 100644 tools/lib/python/kdoc/c_lex.py
diff --git a/tools/lib/python/kdoc/c_lex.py b/tools/lib/python/kdoc/c_lex.py
new file mode 100644
index 000000000000..a104c29b63fb
--- /dev/null
+++ b/tools/lib/python/kdoc/c_lex.py
@@ -0,0 +1,239 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+# Copyright(c) 2025: Mauro Carvalho Chehab <mchehab@kernel.org>.
+
+"""
+C lexical analysis helpers for kernel-doc.
+
+This module provides a tokenizer for C statements and definitions.
+"""
+
+import re
+
+from .kdoc_re import KernRe
+
+class CToken():
+ """
+ Data class to define a C token.
+ """
+
+ # Tokens that can be used by the parser. Works like a C enum.
+
+ COMMENT = 0 #: A standard C or C99 comment, including delimiter.
+ STRING = 1 #: A string, including quotation marks.
+ CHAR = 2 #: A character, including apostrophes.
+ NUMBER = 3 #: A number.
+ PUNC = 4 #: A punctuation mark: ``;`` / ``,`` / ``.``.
+ BEGIN = 5 #: A begin character: ``{`` / ``[`` / ``(``.
+ END = 6 #: An end character: ``}`` / ``]`` / ``)``.
+ CPP = 7 #: A preprocessor macro.
+ HASH = 8 #: The hash character - useful to handle other macros.
+ OP = 9 #: A C operator (add, subtract, ...).
+ STRUCT = 10 #: A ``struct`` keyword.
+ UNION = 11 #: A ``union`` keyword.
+ ENUM = 12 #: An ``enum`` keyword.
+ TYPEDEF = 13 #: A ``typedef`` keyword.
+ NAME = 14 #: A name. Can be an ID or a type.
+ SPACE = 15 #: Any space characters, including new lines.
+
+ MISMATCH = 255 #: an error indicator: should never happen in practice.
+
+ # Dict to convert from an enum integer into a string.
+ _name_by_val = {v: k for k, v in dict(vars()).items() if isinstance(v, int)}
+
+ # Dict to convert from string to an enum-like integer value.
+ _name_to_val = {k: v for v, k in _name_by_val.items()}
+
+ @staticmethod
+ def to_name(val):
+ """Convert from an integer value from CToken enum into a string"""
+
+ return CToken._name_by_val.get(val, f"UNKNOWN({val})")
+
+ @staticmethod
+ def from_name(name):
+ """Convert a string into a CToken enum value"""
+ if name in CToken._name_to_val:
+ return CToken._name_to_val[name]
+
+ return CToken.MISMATCH
+
+ def __init__(self, kind, value, pos,
+ brace_level, paren_level, bracket_level):
+ self.kind = kind
+ self.value = value
+ self.pos = pos
+ self.brace_level = brace_level
+ self.paren_level = paren_level
+ self.bracket_level = bracket_level
+
+ def __repr__(self):
+ name = self.to_name(self.kind)
+ if isinstance(self.value, str):
+ value = '"' + self.value + '"'
+ else:
+ value = self.value
+
+ return f"CToken({name}, {value}, {self.pos}, " \
+ f"{self.brace_level}, {self.paren_level}, {self.bracket_level})"
+
+#: Tokens to parse C code.
+TOKEN_LIST = [
+ (CToken.COMMENT, r"//[^\n]*|/\*[\s\S]*?\*/"),
+
+ (CToken.STRING, r'"(?:\\.|[^"\\])*"'),
+ (CToken.CHAR, r"'(?:\\.|[^'\\])'"),
+
+ (CToken.NUMBER, r"0[xX][0-9a-fA-F]+[uUlL]*|0[0-7]+[uUlL]*|"
+ r"[0-9]+(\.[0-9]*)?([eE][+-]?[0-9]+)?[fFlL]*"),
+
+ (CToken.PUNC, r"[;,\.]"),
+
+ (CToken.BEGIN, r"[\[\(\{]"),
+
+ (CToken.END, r"[\]\)\}]"),
+
+ (CToken.CPP, r"#\s*(define|include|ifdef|ifndef|if|else|elif|endif|undef|pragma)\b"),
+
+ (CToken.HASH, r"#"),
+
+ (CToken.OP, r"\+\+|\-\-|\->|==|\!=|<=|>=|&&|\|\||<<|>>|\+=|\-=|\*=|/=|%="
+ r"|&=|\|=|\^=|=|\+|\-|\*|/|%|<|>|&|\||\^|~|!|\?|\:"),
+
+ (CToken.STRUCT, r"\bstruct\b"),
+ (CToken.UNION, r"\bunion\b"),
+ (CToken.ENUM, r"\benum\b"),
+ (CToken.TYPEDEF, r"\btypedef\b"),
+
+ (CToken.NAME, r"[A-Za-z_][A-Za-z0-9_]*"),
+
+ (CToken.SPACE, r"[\s]+"),
+
+ (CToken.MISMATCH,r"."),
+]
+
+#: Handle C continuation lines.
+RE_CONT = KernRe(r"\\\n")
+
+RE_COMMENT_START = KernRe(r'/\*\s*')
+
+#: tokenizer regex. Will be filled at the first CTokenizer usage.
+re_scanner = None
+
+class CTokenizer():
+ """
+ Scan C statements and definitions and produce tokens.
+
+ When converted to a string, it drops comments and handles
+ public:/private: markers, respecting depth.
+ """
+
+ # This class is inspired and follows the basic concepts of:
+ # https://docs.python.org/3/library/re.html#writing-a-tokenizer
+
+ def _tokenize(self, source):
+ """
+ Iterator that parses ``source``, splitting it into tokens, as defined
+ at ``TOKEN_LIST``.
+
+ The iterator yields CToken objects.
+ """
+
+ # Handle continuation lines. Note that kdoc_parser already has
+ # logic to do that. Still, let's keep it here for completeness, as we
+ # might end up reusing this tokenizer outside kernel-doc some day - or
+ # we may eventually remove the kdoc_parser logic as a future cleanup.
+ source = RE_CONT.sub("", source)
+
+ brace_level = 0
+ paren_level = 0
+ bracket_level = 0
+
+ for match in re_scanner.finditer(source):
+ kind = CToken.from_name(match.lastgroup)
+ pos = match.start()
+ value = match.group()
+
+ if kind == CToken.MISMATCH:
+ raise RuntimeError(f"Unexpected token '{value}' on {pos}:\n\t{source}")
+ elif kind == CToken.BEGIN:
+ if value == '(':
+ paren_level += 1
+ elif value == '[':
+ bracket_level += 1
+ else: # value == '{'
+ brace_level += 1
+
+ elif kind == CToken.END:
+ if value == ')' and paren_level > 0:
+ paren_level -= 1
+ elif value == ']' and bracket_level > 0:
+ bracket_level -= 1
+ elif brace_level > 0: # value == '}'
+ brace_level -= 1
+
+ yield CToken(kind, value, pos,
+ brace_level, paren_level, bracket_level)
+
+ def __init__(self, source):
+ """
+ Create a regular expression to handle TOKEN_LIST.
+
+ While I generally don't like using regex group naming via:
+ (?P<name>...)
+
+ in this particular case, it makes sense, as we can pick the name
+ when matching code via re_scanner().
+ """
+ global re_scanner
+
+ if not re_scanner:
+ re_tokens = []
+
+ for kind, pattern in TOKEN_LIST:
+ name = CToken.to_name(kind)
+ re_tokens.append(f"(?P<{name}>{pattern})")
+
+ re_scanner = KernRe("|".join(re_tokens), re.MULTILINE | re.DOTALL)
+
+ self.tokens = []
+ for tok in self._tokenize(source):
+ self.tokens.append(tok)
+
+ def __str__(self):
+ out = ""
+ show_stack = [True]
+
+ for tok in self.tokens:
+ if tok.kind == CToken.BEGIN:
+ show_stack.append(show_stack[-1])
+
+ elif tok.kind == CToken.END:
+ prev = show_stack[-1]
+ if len(show_stack) > 1:
+ show_stack.pop()
+
+ if not prev and show_stack[-1]:
+ #
+ # Try to preserve indent
+ #
+ out += "\t" * (len(show_stack) - 1)
+
+ out += str(tok.value)
+ continue
+
+ elif tok.kind == CToken.COMMENT:
+ comment = RE_COMMENT_START.sub("", tok.value)
+
+ if comment.startswith("private:"):
+ show_stack[-1] = False
+ show = False
+ elif comment.startswith("public:"):
+ show_stack[-1] = True
+
+ continue
+
+ if show_stack[-1]:
+ out += str(tok.value)
+
+ return out
diff --git a/tools/lib/python/kdoc/kdoc_parser.py b/tools/lib/python/kdoc/kdoc_parser.py
index 6b181ead3175..e804e61b09c0 100644
--- a/tools/lib/python/kdoc/kdoc_parser.py
+++ b/tools/lib/python/kdoc/kdoc_parser.py
@@ -13,7 +13,8 @@ import sys
import re
from pprint import pformat
-from kdoc.kdoc_re import NestedMatch, KernRe, CTokenizer
+from kdoc.kdoc_re import NestedMatch, KernRe
+from kdoc.c_lex import CTokenizer
from kdoc.kdoc_item import KdocItem
#
diff --git a/tools/lib/python/kdoc/kdoc_re.py b/tools/lib/python/kdoc/kdoc_re.py
index 7bed4e9a8810..ba601a4f5035 100644
--- a/tools/lib/python/kdoc/kdoc_re.py
+++ b/tools/lib/python/kdoc/kdoc_re.py
@@ -141,239 +141,6 @@ class KernRe:
return self.last_match.groups()
-class TokType():
-
- @staticmethod
- def __str__(val):
- """Return the name of an enum value"""
- return TokType._name_by_val.get(val, f"UNKNOWN({val})")
-
-class CToken():
- """
- Data class to define a C token.
- """
-
- # Tokens that can be used by the parser. Works like a C enum.
-
- COMMENT = 0 #: A standard C or C99 comment, including delimiter.
- STRING = 1 #: A string, including quotation marks.
- CHAR = 2 #: A character, including apostrophes.
- NUMBER = 3 #: A number.
- PUNC = 4 #: A punctuation mark: ``;`` / ``,`` / ``.``.
- BEGIN = 5 #: A begin character: ``{`` / ``[`` / ``(``.
- END = 6 #: An end character: ``}`` / ``]`` / ``)``.
- CPP = 7 #: A preprocessor macro.
- HASH = 8 #: The hash character - useful to handle other macros.
- OP = 9 #: A C operator (add, subtract, ...).
- STRUCT = 10 #: A ``struct`` keyword.
- UNION = 11 #: A ``union`` keyword.
- ENUM = 12 #: An ``enum`` keyword.
- TYPEDEF = 13 #: A ``typedef`` keyword.
- NAME = 14 #: A name. Can be an ID or a type.
- SPACE = 15 #: Any space characters, including new lines.
-
- MISMATCH = 255 #: an error indicator: should never happen in practice.
-
- # Dict to convert from an enum integer into a string.
- _name_by_val = {v: k for k, v in dict(vars()).items() if isinstance(v, int)}
-
- # Dict to convert from string to an enum-like integer value.
- _name_to_val = {k: v for v, k in _name_by_val.items()}
-
- @staticmethod
- def to_name(val):
- """Convert from an integer value from CToken enum into a string"""
-
- return CToken._name_by_val.get(val, f"UNKNOWN({val})")
-
- @staticmethod
- def from_name(name):
- """Convert a string into a CToken enum value"""
- if name in CToken._name_to_val:
- return CToken._name_to_val[name]
-
- return CToken.MISMATCH
-
- def __init__(self, kind, value, pos,
- brace_level, paren_level, bracket_level):
- self.kind = kind
- self.value = value
- self.pos = pos
- self.brace_level = brace_level
- self.paren_level = paren_level
- self.bracket_level = bracket_level
-
- def __repr__(self):
- name = self.to_name(self.kind)
- if isinstance(self.value, str):
- value = '"' + self.value + '"'
- else:
- value = self.value
-
- return f"CToken({name}, {value}, {self.pos}, " \
- f"{self.brace_level}, {self.paren_level}, {self.bracket_level})"
-
-#: Tokens to parse C code.
-TOKEN_LIST = [
- (CToken.COMMENT, r"//[^\n]*|/\*[\s\S]*?\*/"),
-
- (CToken.STRING, r'"(?:\\.|[^"\\])*"'),
- (CToken.CHAR, r"'(?:\\.|[^'\\])'"),
-
- (CToken.NUMBER, r"0[xX][0-9a-fA-F]+[uUlL]*|0[0-7]+[uUlL]*|"
- r"[0-9]+(\.[0-9]*)?([eE][+-]?[0-9]+)?[fFlL]*"),
-
- (CToken.PUNC, r"[;,\.]"),
-
- (CToken.BEGIN, r"[\[\(\{]"),
-
- (CToken.END, r"[\]\)\}]"),
-
- (CToken.CPP, r"#\s*(define|include|ifdef|ifndef|if|else|elif|endif|undef|pragma)\b"),
-
- (CToken.HASH, r"#"),
-
- (CToken.OP, r"\+\+|\-\-|\->|==|\!=|<=|>=|&&|\|\||<<|>>|\+=|\-=|\*=|/=|%="
- r"|&=|\|=|\^=|=|\+|\-|\*|/|%|<|>|&|\||\^|~|!|\?|\:"),
-
- (CToken.STRUCT, r"\bstruct\b"),
- (CToken.UNION, r"\bunion\b"),
- (CToken.ENUM, r"\benum\b"),
- (CToken.TYPEDEF, r"\btypedef\b"),
-
- (CToken.NAME, r"[A-Za-z_][A-Za-z0-9_]*"),
-
- (CToken.SPACE, r"[\s]+"),
-
- (CToken.MISMATCH,r"."),
-]
-
-#: Handle C continuation lines.
-RE_CONT = KernRe(r"\\\n")
-
-RE_COMMENT_START = KernRe(r'/\*\s*')
-
-#: tokenizer regex. Will be filled at the first CTokenizer usage.
-re_scanner = None
-
-class CTokenizer():
- """
- Scan C statements and definitions and produce tokens.
-
- When converted to a string, it drops comments and handles
- public:/private: markers, respecting depth.
- """
-
- # This class is inspired and follows the basic concepts of:
- # https://docs.python.org/3/library/re.html#writing-a-tokenizer
-
- def _tokenize(self, source):
- """
- Iterator that parses ``source``, splitting it into tokens, as defined
- at ``TOKEN_LIST``.
-
- The iterator yields CToken objects.
- """
-
- # Handle continuation lines. Note that kdoc_parser already has
- # logic to do that. Still, let's keep it here for completeness, as we
- # might end up reusing this tokenizer outside kernel-doc some day - or
- # we may eventually remove the kdoc_parser logic as a future cleanup.
- source = RE_CONT.sub("", source)
-
- brace_level = 0
- paren_level = 0
- bracket_level = 0
-
- for match in re_scanner.finditer(source):
- kind = CToken.from_name(match.lastgroup)
- pos = match.start()
- value = match.group()
-
- if kind == CToken.MISMATCH:
- raise RuntimeError(f"Unexpected token '{value}' on {pos}:\n\t{source}")
- elif kind == CToken.BEGIN:
- if value == '(':
- paren_level += 1
- elif value == '[':
- bracket_level += 1
- else: # value == '{'
- brace_level += 1
-
- elif kind == CToken.END:
- if value == ')' and paren_level > 0:
- paren_level -= 1
- elif value == ']' and bracket_level > 0:
- bracket_level -= 1
- elif brace_level > 0: # value == '}'
- brace_level -= 1
-
- yield CToken(kind, value, pos,
- brace_level, paren_level, bracket_level)
-
- def __init__(self, source):
- """
- Create a regular expression to handle TOKEN_LIST.
-
- While I generally don't like using regex group naming via:
- (?P<name>...)
-
- in this particular case, it makes sense, as we can pick the name
- when matching a code via re_scanner().
- """
- global re_scanner
-
- if not re_scanner:
- re_tokens = []
-
- for kind, pattern in TOKEN_LIST:
- name = CToken.to_name(kind)
- re_tokens.append(f"(?P<{name}>{pattern})")
-
- re_scanner = KernRe("|".join(re_tokens), re.MULTILINE | re.DOTALL)
-
- self.tokens = []
- for tok in self._tokenize(source):
- self.tokens.append(tok)
-
- def __str__(self):
- out = ""
- show_stack = [True]
-
- for tok in self.tokens:
- if tok.kind == CToken.BEGIN:
- show_stack.append(show_stack[-1])
-
- elif tok.kind == CToken.END:
- prev = show_stack[-1]
- if len(show_stack) > 1:
- show_stack.pop()
-
- if not prev and show_stack[-1]:
- #
- # Try to preserve indent
- #
- out += "\t" * (len(show_stack) - 1)
-
- out += str(tok.value)
- continue
-
- elif tok.kind == CToken.COMMENT:
- comment = RE_COMMENT_START.sub("", tok.value)
-
- if comment.startswith("private:"):
- show_stack[-1] = False
- show = False
- elif comment.startswith("public:"):
- show_stack[-1] = True
-
- continue
-
- if show_stack[-1]:
- out += str(tok.value)
-
- return out
-
#: Nested delimited pairs (brackets and parenthesis)
DELIMITER_PAIRS = {
--
2.52.0
^ permalink raw reply related [flat|nested] 47+ messages in thread
* [PATCH v2 08/28] unittests: test_private: modify it to use CTokenizer directly
2026-03-12 14:54 [PATCH v2 00/28] kernel-doc: use a C lexical tokenizer for transforms Mauro Carvalho Chehab
` (6 preceding siblings ...)
2026-03-12 14:54 ` [PATCH v2 07/28] docs: kdoc: move C Tokenizer to c_lex module Mauro Carvalho Chehab
@ 2026-03-12 14:54 ` Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 09/28] unittests: test_tokenizer: check if the tokenizer works Mauro Carvalho Chehab
` (22 subsequent siblings)
30 siblings, 0 replies; 47+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-12 14:54 UTC (permalink / raw)
To: Jonathan Corbet, Linux Doc Mailing List
Cc: Mauro Carvalho Chehab, linux-hardening, linux-kernel
Change the logic to use the tokenizer directly. This allows
adding more unit tests to check the validity of the tokenizer
itself.
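The table-driven pattern used here can be sketched in isolation: each test
group carries a ``__run__`` factory, and ``type()`` builds a TestCase
subclass with one generated method per entry. The table contents below are
hypothetical, not the patch's actual tables:

```python
import unittest

def make_test(name, data):
    # Factory returning a test method; 'data' is captured by the closure.
    def test(self):
        self.assertEqual(data["source"].upper(), data["expected"])
    return test

# Hypothetical test table following the "__run__" convention.
TABLE = {
    "__run__": make_test,
    "simple": {"source": "abc", "expected": "ABC"},
}

def build_test_class(group_name, table):
    """Create a unittest.TestCase subclass with one method per table entry."""
    run = table["__run__"]
    class_dict = {f"test_{name}": run(name, data)
                  for name, data in table.items() if name != "__run__"}
    return type(group_name, (unittest.TestCase,), class_dict)
```

Keeping the factory per-group means each group can check a different
property (string output, token lists, raised exceptions) while sharing the
same class-building machinery.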
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Message-ID: <2672257233ff73a9464c09b50924be51e25d4f59.1773074166.git.mchehab+huawei@kernel.org>
---
.../{test_private.py => test_tokenizer.py} | 76 +++++++++++++------
1 file changed, 52 insertions(+), 24 deletions(-)
rename tools/unittests/{test_private.py => test_tokenizer.py} (85%)
diff --git a/tools/unittests/test_private.py b/tools/unittests/test_tokenizer.py
similarity index 85%
rename from tools/unittests/test_private.py
rename to tools/unittests/test_tokenizer.py
index eae245ae8a12..da0f2c4c9e21 100755
--- a/tools/unittests/test_private.py
+++ b/tools/unittests/test_tokenizer.py
@@ -15,20 +15,44 @@ from unittest.mock import MagicMock
SRC_DIR = os.path.dirname(os.path.realpath(__file__))
sys.path.insert(0, os.path.join(SRC_DIR, "../lib/python"))
-from kdoc.kdoc_parser import trim_private_members
+from kdoc.kdoc_re import CTokenizer
from unittest_helper import run_unittest
+
+
#
# List of tests.
#
# The code will dynamically generate one test for each key on this dictionary.
#
+def make_private_test(name, data):
+ """
+ Create a test named ``name`` using parameters given by ``data`` dict.
+ """
+
+ def test(self):
+ """In-lined lambda-like function to run the test"""
+ tokens = CTokenizer(data["source"])
+ result = str(tokens)
+
+ #
+ # Avoid whitespace false positives
+ #
+ result = re.sub(r"\s++", " ", result).strip()
+ expected = re.sub(r"\s++", " ", data["trimmed"]).strip()
+
+ msg = f"failed when parsing this source:\n{data['source']}"
+ self.assertEqual(result, expected, msg=msg)
+
+ return test
+
#: Tests to check if CTokenizer is handling properly public/private comments.
TESTS_PRIVATE = {
#
# Simplest case: no private. Ensure that trimming won't affect struct
#
+ "__run__": make_private_test,
"no private": {
"source": """
struct foo {
@@ -288,41 +312,45 @@ TESTS_PRIVATE = {
},
}
+#: Dict containing all test groups for CTokenizer
+TESTS = {
+ "TestPublicPrivate": TESTS_PRIVATE,
+}
-class TestPublicPrivate(unittest.TestCase):
- """
- Main test class. Populated dynamically at runtime.
- """
+def setUp(self):
+ self.maxDiff = None
- def setUp(self):
- self.maxDiff = None
+def build_test_class(group_name, table):
+ """
+ Dynamically creates a class instance using type() as a generator
+ for a new class derived from unittest.TestCase.
- def add_test(cls, name, source, trimmed):
- """
- Dynamically add a test to the class
- """
- def test(cls):
- result = trim_private_members(source)
+ We're opting to do it inside a function to avoid the risk of
+ changing the globals() dictionary.
+ """
- result = re.sub(r"\s++", " ", result).strip()
- expected = re.sub(r"\s++", " ", trimmed).strip()
+ class_dict = {
+ "setUp": setUp
+ }
- msg = f"failed when parsing this source:\n" + source
+ run = table["__run__"]
- cls.assertEqual(result, expected, msg=msg)
+ for test_name, data in table.items():
+ if test_name == "__run__":
+ continue
- test.__name__ = f'test {name}'
+ class_dict[f"test_{test_name}"] = run(test_name, data)
- setattr(TestPublicPrivate, test.__name__, test)
+ cls = type(group_name, (unittest.TestCase,), class_dict)
+ return cls.__name__, cls
#
-# Populate TestPublicPrivate class
+# Create classes and add them to the global dictionary
#
-test_class = TestPublicPrivate()
-for name, test in TESTS_PRIVATE.items():
- test_class.add_test(name, test["source"], test["trimmed"])
-
+for group, table in TESTS.items():
+ t = build_test_class(group, table)
+ globals()[t[0]] = t[1]
#
# main
--
2.52.0
^ permalink raw reply related [flat|nested] 47+ messages in thread
* [PATCH v2 09/28] unittests: test_tokenizer: check if the tokenizer works
2026-03-12 14:54 [PATCH v2 00/28] kernel-doc: use a C lexical tokenizer for transforms Mauro Carvalho Chehab
` (7 preceding siblings ...)
2026-03-12 14:54 ` [PATCH v2 08/28] unittests: test_private: modify it to use CTokenizer directly Mauro Carvalho Chehab
@ 2026-03-12 14:54 ` Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 10/28] unittests: add a runner to execute all unittests Mauro Carvalho Chehab
` (21 subsequent siblings)
30 siblings, 0 replies; 47+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-12 14:54 UTC (permalink / raw)
To: Jonathan Corbet, Linux Doc Mailing List
Cc: Mauro Carvalho Chehab, linux-hardening, linux-kernel
Add extra tests to check if the tokenizer is working properly.
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
tools/lib/python/kdoc/c_lex.py | 4 +-
tools/unittests/test_tokenizer.py | 109 +++++++++++++++++++++++++++++-
2 files changed, 108 insertions(+), 5 deletions(-)
diff --git a/tools/lib/python/kdoc/c_lex.py b/tools/lib/python/kdoc/c_lex.py
index a104c29b63fb..38f70e836eb8 100644
--- a/tools/lib/python/kdoc/c_lex.py
+++ b/tools/lib/python/kdoc/c_lex.py
@@ -58,8 +58,8 @@ class CToken():
return CToken.MISMATCH
- def __init__(self, kind, value, pos,
- brace_level, paren_level, bracket_level):
+ def __init__(self, kind, value=None, pos=0,
+ brace_level=0, paren_level=0, bracket_level=0):
self.kind = kind
self.value = value
self.pos = pos
diff --git a/tools/unittests/test_tokenizer.py b/tools/unittests/test_tokenizer.py
index da0f2c4c9e21..efb1d1687811 100755
--- a/tools/unittests/test_tokenizer.py
+++ b/tools/unittests/test_tokenizer.py
@@ -15,16 +15,118 @@ from unittest.mock import MagicMock
SRC_DIR = os.path.dirname(os.path.realpath(__file__))
sys.path.insert(0, os.path.join(SRC_DIR, "../lib/python"))
-from kdoc.kdoc_re import CTokenizer
+from kdoc.c_lex import CToken, CTokenizer
from unittest_helper import run_unittest
-
-
#
# List of tests.
#
# The code will dynamically generate one test for each key on this dictionary.
#
+def tokens_to_list(tokens):
+ tuples = []
+
+ for tok in tokens:
+ if tok.kind == CToken.SPACE:
+ continue
+
+ tuples += [(tok.kind, tok.value,
+ tok.brace_level, tok.paren_level, tok.bracket_level)]
+
+ return tuples
+
+
+def make_tokenizer_test(name, data):
+ """
+ Create a test named ``name`` using parameters given by ``data`` dict.
+ """
+
+ def test(self):
+ """In-lined lambda-like function to run the test"""
+
+ #
+ # Check if exceptions are properly handled
+ #
+ if "raises" in data:
+ with self.assertRaises(data["raises"]):
+ CTokenizer(data["source"])
+ return
+
+ #
+ # Check if tokenizer is producing expected results
+ #
+ tokens = CTokenizer(data["source"]).tokens
+
+ result = tokens_to_list(tokens)
+ expected = tokens_to_list(data["expected"])
+
+ self.assertEqual(result, expected, msg=f"{name}")
+
+ return test
+
+#: Tokenizer tests.
+TESTS_TOKENIZER = {
+ "__run__": make_tokenizer_test,
+
+ "basic_tokens": {
+ "source": """
+ int a; // comment
+ float b = 1.23;
+ """,
+ "expected": [
+ CToken(CToken.NAME, "int"),
+ CToken(CToken.NAME, "a"),
+ CToken(CToken.PUNC, ";"),
+ CToken(CToken.COMMENT, "// comment"),
+ CToken(CToken.NAME, "float"),
+ CToken(CToken.NAME, "b"),
+ CToken(CToken.OP, "="),
+ CToken(CToken.NUMBER, "1.23"),
+ CToken(CToken.PUNC, ";"),
+ ],
+ },
+
+ "depth_counters": {
+ "source": """
+ struct X {
+ int arr[10];
+ func(a[0], (b + c));
+ }
+ """,
+ "expected": [
+ CToken(CToken.STRUCT, "struct"),
+ CToken(CToken.NAME, "X"),
+ CToken(CToken.BEGIN, "{", brace_level=1),
+
+ CToken(CToken.NAME, "int", brace_level=1),
+ CToken(CToken.NAME, "arr", brace_level=1),
+ CToken(CToken.BEGIN, "[", brace_level=1, bracket_level=1),
+ CToken(CToken.NUMBER, "10", brace_level=1, bracket_level=1),
+ CToken(CToken.END, "]", brace_level=1),
+ CToken(CToken.PUNC, ";", brace_level=1),
+ CToken(CToken.NAME, "func", brace_level=1),
+ CToken(CToken.BEGIN, "(", brace_level=1, paren_level=1),
+ CToken(CToken.NAME, "a", brace_level=1, paren_level=1),
+ CToken(CToken.BEGIN, "[", brace_level=1, paren_level=1, bracket_level=1),
+ CToken(CToken.NUMBER, "0", brace_level=1, paren_level=1, bracket_level=1),
+ CToken(CToken.END, "]", brace_level=1, paren_level=1),
+ CToken(CToken.PUNC, ",", brace_level=1, paren_level=1),
+ CToken(CToken.BEGIN, "(", brace_level=1, paren_level=2),
+ CToken(CToken.NAME, "b", brace_level=1, paren_level=2),
+ CToken(CToken.OP, "+", brace_level=1, paren_level=2),
+ CToken(CToken.NAME, "c", brace_level=1, paren_level=2),
+ CToken(CToken.END, ")", brace_level=1, paren_level=1),
+ CToken(CToken.END, ")", brace_level=1),
+ CToken(CToken.PUNC, ";", brace_level=1),
+ CToken(CToken.END, "}"),
+ ],
+ },
+
+ "mismatch_error": {
+ "source": "int a$ = 5;", # $ is illegal
+ "raises": RuntimeError,
+ },
+}
def make_private_test(name, data):
"""
@@ -315,6 +417,7 @@ TESTS_PRIVATE = {
#: Dict containing all test groups for CTokenizer
TESTS = {
"TestPublicPrivate": TESTS_PRIVATE,
+ "TestTokenizer": TESTS_TOKENIZER,
}
def setUp(self):
--
2.52.0
^ permalink raw reply related [flat|nested] 47+ messages in thread
* [PATCH v2 10/28] unittests: add a runner to execute all unittests
2026-03-12 14:54 [PATCH v2 00/28] kernel-doc: use a C lexical tokenizer for transforms Mauro Carvalho Chehab
` (8 preceding siblings ...)
2026-03-12 14:54 ` [PATCH v2 09/28] unittests: test_tokenizer: check if the tokenizer works Mauro Carvalho Chehab
@ 2026-03-12 14:54 ` Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 11/28] docs: kdoc: create a CMatch to match nested C blocks Mauro Carvalho Chehab
` (20 subsequent siblings)
30 siblings, 0 replies; 47+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-12 14:54 UTC (permalink / raw)
To: Jonathan Corbet, Linux Doc Mailing List
Cc: Mauro Carvalho Chehab, linux-hardening, linux-kernel
We'll soon have multiple unit tests; add a runner that will
discover and execute all of them.
We opted to discover only files that start with "test", so that
unittest discovery won't try to load libraries or other modules
that don't contain unittest classes.
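Outside the kernel tree, the discovery behavior described above boils down to the following sketch (the start_dir and pattern values are illustrative):

```python
import tempfile
import unittest

# Discover only files starting with "test", so that helper libraries
# living in the same tree are not mistaken for test modules (this
# mirrors the rationale in the commit message).
with tempfile.TemporaryDirectory() as tmp:
    # An empty tree: discovery finds no tests and the run succeeds.
    loader = unittest.TestLoader()
    suite = loader.discover(start_dir=tmp, pattern="test*.py")
    result = unittest.TextTestRunner(verbosity=0).run(suite)
```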
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
tools/unittests/run.py | 17 +++++++++++++++++
1 file changed, 17 insertions(+)
create mode 100755 tools/unittests/run.py
diff --git a/tools/unittests/run.py b/tools/unittests/run.py
new file mode 100755
index 000000000000..8c19036d43a1
--- /dev/null
+++ b/tools/unittests/run.py
@@ -0,0 +1,17 @@
+#!/usr/bin/env python3
+import os
+import unittest
+import sys
+
+TOOLS_DIR=os.path.join(os.path.dirname(os.path.realpath(__file__)), "..")
+sys.path.insert(0, TOOLS_DIR)
+
+from lib.python.unittest_helper import TestUnits
+
+if __name__ == "__main__":
+ loader = unittest.TestLoader()
+
+ suite = loader.discover(start_dir=os.path.join(TOOLS_DIR, "unittests"),
+ pattern="test*.py")
+
+ TestUnits().run("", suite=suite)
--
2.52.0
^ permalink raw reply related [flat|nested] 47+ messages in thread
* [PATCH v2 11/28] docs: kdoc: create a CMatch to match nested C blocks
2026-03-12 14:54 [PATCH v2 00/28] kernel-doc: use a C lexical tokenizer for transforms Mauro Carvalho Chehab
` (9 preceding siblings ...)
2026-03-12 14:54 ` [PATCH v2 10/28] unittests: add a runner to execute all unittests Mauro Carvalho Chehab
@ 2026-03-12 14:54 ` Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 12/28] tools: unittests: add tests for CMatch Mauro Carvalho Chehab
` (19 subsequent siblings)
30 siblings, 0 replies; 47+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-12 14:54 UTC (permalink / raw)
To: Jonathan Corbet, Linux Doc Mailing List
Cc: Mauro Carvalho Chehab, linux-hardening, linux-kernel
The NestedMatch code is complex, and will become even more complex
if we add support for arguments there.
Now that we have a tokenizer, we can use a better solution, one
that is easier to understand.
Yet, to improve performance, it is better to make it use
previously tokenized code, changing its API.
So, reimplement the NestedMatch logic on top of the CTokenizer
class. Once that is done, we can drop NestedMatch.
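The core trick (matching a name followed by a balanced delimiter block by counting levels rather than using recursive regexes) can be illustrated standalone. This is a simplified sketch, not the CMatch implementation itself:

```python
import re

def match_balanced(source, name):
    """Return the text of name(...) with balanced parentheses,
    or None if the delimiters never close."""
    m = re.search(rf"\b{re.escape(name)}\s*\(", source)
    if not m:
        return None
    depth = 0
    # Walk from the opening parenthesis, counting nesting depth;
    # when it returns to zero, the block is complete.
    for i in range(m.end() - 1, len(source)):
        if source[i] == "(":
            depth += 1
        elif source[i] == ")":
            depth -= 1
            if depth == 0:
                return source[m.start():i + 1]
    return None

print(match_balanced("STRUCT_GROUP((a, b), c) x;", "STRUCT_GROUP"))
```

Unbalanced input simply yields None, matching the "silently ignore" policy the docstrings below describe.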
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
tools/lib/python/kdoc/c_lex.py | 222 +++++++++++++++++++++++++++---
tools/unittests/test_tokenizer.py | 3 +-
2 files changed, 203 insertions(+), 22 deletions(-)
diff --git a/tools/lib/python/kdoc/c_lex.py b/tools/lib/python/kdoc/c_lex.py
index 38f70e836eb8..e986a4ad73e3 100644
--- a/tools/lib/python/kdoc/c_lex.py
+++ b/tools/lib/python/kdoc/c_lex.py
@@ -58,14 +58,13 @@ class CToken():
return CToken.MISMATCH
+
def __init__(self, kind, value=None, pos=0,
brace_level=0, paren_level=0, bracket_level=0):
self.kind = kind
self.value = value
self.pos = pos
- self.brace_level = brace_level
- self.paren_level = paren_level
- self.bracket_level = bracket_level
+ self.level = (bracket_level, paren_level, brace_level)
def __repr__(self):
name = self.to_name(self.kind)
@@ -74,8 +73,7 @@ class CToken():
else:
value = self.value
- return f"CToken({name}, {value}, {self.pos}, " \
- f"{self.brace_level}, {self.paren_level}, {self.bracket_level})"
+ return f"CToken(CToken.{name}, {value}, {self.pos}, {self.level})"
#: Tokens to parse C code.
TOKEN_LIST = [
@@ -105,20 +103,30 @@ TOKEN_LIST = [
(CToken.ENUM, r"\benum\b"),
(CToken.TYPEDEF, r"\btypedef\b"),
- (CToken.NAME, r"[A-Za-z_][A-Za-z0-9_]*"),
+ (CToken.NAME, r"[A-Za-z_][A-Za-z0-9_]*"),
(CToken.SPACE, r"[\s]+"),
(CToken.MISMATCH,r"."),
]
+def fill_re_scanner(token_list):
+ """Ancillary routine to convert TOKEN_LIST into a finditer regex"""
+ re_tokens = []
+
+ for kind, pattern in token_list:
+ name = CToken.to_name(kind)
+ re_tokens.append(f"(?P<{name}>{pattern})")
+
+ return KernRe("|".join(re_tokens), re.MULTILINE | re.DOTALL)
+
#: Handle C continuation lines.
RE_CONT = KernRe(r"\\\n")
RE_COMMENT_START = KernRe(r'/\*\s*')
#: tokenizer regex. Will be filled at the first CTokenizer usage.
-re_scanner = None
+RE_SCANNER = fill_re_scanner(TOKEN_LIST)
class CTokenizer():
"""
@@ -149,7 +157,7 @@ class CTokenizer():
paren_level = 0
bracket_level = 0
- for match in re_scanner.finditer(source):
+ for match in RE_SCANNER.finditer(source):
kind = CToken.from_name(match.lastgroup)
pos = match.start()
value = match.group()
@@ -175,7 +183,7 @@ class CTokenizer():
yield CToken(kind, value, pos,
brace_level, paren_level, bracket_level)
- def __init__(self, source):
+ def __init__(self, source=None):
"""
Create a regular expression to handle TOKEN_LIST.
@@ -183,20 +191,18 @@ class CTokenizer():
(?P<name>...)
in this particular case, it makes sense, as we can pick the name
- when matching a code via re_scanner().
+ when matching a code via RE_SCANNER.
"""
- global re_scanner
-
- if not re_scanner:
- re_tokens = []
-
- for kind, pattern in TOKEN_LIST:
- name = CToken.to_name(kind)
- re_tokens.append(f"(?P<{name}>{pattern})")
-
- re_scanner = KernRe("|".join(re_tokens), re.MULTILINE | re.DOTALL)
self.tokens = []
+
+ if not source:
+ return
+
+ if isinstance(source, list):
+ self.tokens = source
+ return
+
for tok in self._tokenize(source):
self.tokens.append(tok)
@@ -237,3 +243,179 @@ class CTokenizer():
out += str(tok.value)
return out
+
+
+class CMatch:
+ """
+ Finding nested delimiters is hard with regular expressions. It is
+ even harder on Python with its normal re module, as there are several
+ advanced regular expressions that are missing.
+
+ This is the case of this pattern::
+
+ '\\bSTRUCT_GROUP(\\(((?:(?>[^)(]+)|(?1))*)\\))[^;]*;'
+
+ which is used to properly match the open/close parentheses when
+ searching for STRUCT_GROUP().
+
+ Add a class that counts pairs of delimiters, using it to match and
+ replace nested expressions.
+
+ The original approach was suggested by:
+
+ https://stackoverflow.com/questions/5454322/python-how-to-match-nested-parentheses-with-regex
+
+ although I re-implemented it to make it more generic and to match all
+ three delimiter types. The logic checks whether delimiters are paired. If not, it
+ will ignore the search string.
+ """
+
+ # TODO: make CMatch handle multiple match groups
+ #
+ # Right now, regular expressions to match it are defined only up to
+ # the start delimiter, e.g.:
+ #
+ # \bSTRUCT_GROUP\(
+ #
+ # is similar to: STRUCT_GROUP\((.*)\)
+ # except that the content inside the match group is delimiter-aligned.
+ #
+ # The content inside parentheses is converted into a single replace
+ # group (e.g. r`\0').
+ #
+ # It would be nice to change such definition to support multiple
+ # match groups, allowing a regex equivalent to:
+ #
+ # FOO\((.*), (.*), (.*)\)
+ #
+ # it is probably easier to define it not as a regular expression, but
+ # with some lexical definition like:
+ #
+ # FOO(arg1, arg2, arg3)
+
+ def __init__(self, regex):
+ self.regex = KernRe(regex)
+
+ def _search(self, tokenizer):
+ """
+ Finds paired blocks for a regex that ends with a delimiter.
+
+ The suggestion of using finditer to match pairs came from:
+ https://stackoverflow.com/questions/5454322/python-how-to-match-nested-parentheses-with-regex
+ but I ended up using a different implementation to align all three types
+ of delimiters and seek for an initial regular expression.
+
+ The algorithm seeks for open/close paired delimiters and places them
+ into a stack, yielding a start/stop position of each match when the
+ stack is zeroed.
+
+ The algorithm should work fine for properly paired lines, but will
+ silently ignore end delimiters that precede a start delimiter.
+ This should be OK for the kernel-doc parser, as unbalanced delimiters
+ would cause compilation errors. So, we don't need to raise exceptions
+ to cover such issues.
+ """
+
+ start = None
+ offset = -1
+ started = False
+
+ import sys
+
+ stack = []
+
+ for i, tok in enumerate(tokenizer.tokens):
+ if start is None:
+ if tok.kind == CToken.NAME and self.regex.match(tok.value):
+ start = i
+ stack.append((start, tok.level))
+ started = False
+
+ continue
+
+ if not started and tok.kind == CToken.BEGIN:
+ started = True
+ continue
+
+ if tok.kind == CToken.END and tok.level == stack[-1][1]:
+ start, level = stack.pop()
+ offset = i
+
+ yield CTokenizer(tokenizer.tokens[start:offset + 1])
+ start = None
+
+ #
+ # If an END zeroing levels is not there, return remaining stuff
+ # This is meant to solve cases where the caller logic might be
+ # picking an incomplete block.
+ #
+ if start and offset < 0:
+ print("WARNING: can't find an end", file=sys.stderr)
+ yield CTokenizer(tokenizer.tokens[start:])
+
+ def search(self, source):
+ """
+ This is similar to re.search:
+
+ It matches a regex that is followed by a delimiter,
+ returning occurrences only if all delimiters are paired.
+ """
+
+ if isinstance(source, CTokenizer):
+ tokenizer = source
+ is_token = True
+ else:
+ tokenizer = CTokenizer(source)
+ is_token = False
+
+ for new_tokenizer in self._search(tokenizer):
+ if is_token:
+ yield new_tokenizer
+ else:
+ yield str(new_tokenizer)
+
+ def sub(self, sub, line, count=0):
+ """
+ This is similar to re.sub:
+
+ It matches a regex that is followed by a delimiter,
+ replacing occurrences only if all delimiters are paired.
+
+ if the sub argument contains::
+
+ r'\0'
+
+ it will work just like re: it places there the matched paired data
+ with the delimiter stripped.
+
+ If count is different from zero, it will replace at most count
+ items.
+ """
+ if isinstance(source, CTokenizer):
+ is_token = True
+ tokenizer = source
+ else:
+ is_token = False
+ tokenizer = CTokenizer(source)
+
+ new_tokenizer = CTokenizer()
+ cur_pos = 0
+ for start, end in self._search(tokenizer):
+ new_tokenizer.tokens += tokenizer.tokens[cur_pos:start]
+# new_tokenizer.tokens += [sub_str]
+
+ cur_pos = end + 1
+
+ if cur_pos:
+ new_tokenizer.tokens += tokenizer.tokens[cur_pos:]
+
+ print(new_tokenizer.tokens)
+
+ return str(new_tokenizer)
+
+ def __repr__(self):
+ """
+ Returns a displayable version of the class init.
+ """
+
+ return f'CMatch("{self.regex.regex.pattern}")'
diff --git a/tools/unittests/test_tokenizer.py b/tools/unittests/test_tokenizer.py
index efb1d1687811..3081f27a7786 100755
--- a/tools/unittests/test_tokenizer.py
+++ b/tools/unittests/test_tokenizer.py
@@ -30,8 +30,7 @@ def tokens_to_list(tokens):
if tok.kind == CToken.SPACE:
continue
- tuples += [(tok.kind, tok.value,
- tok.brace_level, tok.paren_level, tok.bracket_level)]
+ tuples += [(tok.kind, tok.value, tok.level)]
return tuples
--
2.52.0
^ permalink raw reply related [flat|nested] 47+ messages in thread
* [PATCH v2 12/28] tools: unittests: add tests for CMatch
2026-03-12 14:54 [PATCH v2 00/28] kernel-doc: use a C lexical tokenizer for transforms Mauro Carvalho Chehab
` (10 preceding siblings ...)
2026-03-12 14:54 ` [PATCH v2 11/28] docs: kdoc: create a CMatch to match nested C blocks Mauro Carvalho Chehab
@ 2026-03-12 14:54 ` Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 13/28] docs: c_lex: properly implement a sub() method " Mauro Carvalho Chehab
` (18 subsequent siblings)
30 siblings, 0 replies; 47+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-12 14:54 UTC (permalink / raw)
To: Jonathan Corbet, Linux Doc Mailing List
Cc: Mauro Carvalho Chehab, linux-hardening, linux-kernel
The CMatch logic is complex enough to justify tests to ensure
that it is doing its job.
Add unittests to check the functionality provided by CMatch
by replicating expected patterns.
The CMatch class handles complex macros. Add a unittest
to check that it is doing the right thing and to detect
regressions as we improve its code.
The initial version was generated using the gpt-oss:latest LLM
on my local GPU, as LLMs aren't bad at transforming patterns
into unittests.
Yet, the current version contains only the skeleton of what
the LLM produced, as I ended up heavily changing its content
to be more representative and to cover real-world scenarios.
The kdoc_xforms test suite contains 3 test groups. Two of
them test the basic functionality of CMatch pattern
replacement.
The last one (TestRealUsecases) contains real code snippets
from the Kernel, with some cleanups to better fit in 80 columns,
and uses the same transforms as kernel-doc, thus allowing us
to test the logic used inside kdoc_parser to transform
function, struct and variable patterns.
Its output is like this:
$ tools/unittests/kdoc_xforms.py
Ran 25 tests in 0.003s
OK
test_cmatch:
TestSearch:
test_search_acquires_multiple: OK
test_search_acquires_nested_paren: OK
test_search_acquires_simple: OK
test_search_must_hold: OK
test_search_must_hold_shared: OK
test_search_no_false_positive: OK
test_search_no_function: OK
test_search_no_macro_remains: OK
Ran 8 tests
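The nested-parenthesis case exercised by test_search_acquires_nested_paren is exactly where plain, non-recursive Python regexes fall short. A standalone comparison (illustrative only, not kernel code):

```python
import re

line = "__acquires((ctx1, ctx2)) baz();"

# A non-greedy regex stops at the first ')', splitting the nested pair:
naive = re.search(r"__acquires\(.*?\)", line).group()

# Counting delimiter levels yields the full balanced span instead:
depth, end = 0, None
start = line.index("__acquires")
for i in range(line.index("(", start), len(line)):
    if line[i] == "(":
        depth += 1
    elif line[i] == ")":
        depth -= 1
        if depth == 0:
            end = i + 1
            break
balanced = line[start:end]

print(naive)
print(balanced)
```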
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
tools/unittests/test_cmatch.py | 95 ++++++++++++++++++++++++++++++++++
1 file changed, 95 insertions(+)
create mode 100755 tools/unittests/test_cmatch.py
diff --git a/tools/unittests/test_cmatch.py b/tools/unittests/test_cmatch.py
new file mode 100755
index 000000000000..53b25aa4dc4a
--- /dev/null
+++ b/tools/unittests/test_cmatch.py
@@ -0,0 +1,95 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+# Copyright(c) 2026: Mauro Carvalho Chehab <mchehab@kernel.org>.
+#
+# pylint: disable=C0413,R0904
+
+
+"""
+Unit tests for kernel-doc CMatch.
+"""
+
+import os
+import re
+import sys
+import unittest
+
+
+# Import Python modules
+
+SRC_DIR = os.path.dirname(os.path.realpath(__file__))
+sys.path.insert(0, os.path.join(SRC_DIR, "../lib/python"))
+
+from kdoc.c_lex import CMatch
+from kdoc.xforms_lists import CTransforms
+from unittest_helper import run_unittest
+
+#
+# Override unittest.TestCase to better compare diffs ignoring whitespaces
+#
+class TestCaseDiff(unittest.TestCase):
+ """
+ Disable maximum limit on diffs and add a method to better
+ handle diffs with whitespace differences.
+ """
+
+ @classmethod
+ def setUpClass(cls):
+ """Ensure that there won't be limit for diffs"""
+ cls.maxDiff = None
+
+
+#
+# Tests doing with different macros
+#
+
+class TestSearch(TestCaseDiff):
+ """
+ Test search mechanism
+ """
+
+ def test_search_acquires_simple(self):
+ line = "__acquires(ctx) foo();"
+ result = ", ".join(CMatch("__acquires").search(line))
+ self.assertEqual(result, "__acquires(ctx)")
+
+ def test_search_acquires_multiple(self):
+ line = "__acquires(ctx) __acquires(other) bar();"
+ result = ", ".join(CMatch("__acquires").search(line))
+ self.assertEqual(result, "__acquires(ctx), __acquires(other)")
+
+ def test_search_acquires_nested_paren(self):
+ line = "__acquires((ctx1, ctx2)) baz();"
+ result = ", ".join(CMatch("__acquires").search(line))
+ self.assertEqual(result, "__acquires((ctx1, ctx2))")
+
+ def test_search_must_hold(self):
+ line = "__must_hold(&lock) do_something();"
+ result = ", ".join(CMatch("__must_hold").search(line))
+ self.assertEqual(result, "__must_hold(&lock)")
+
+ def test_search_must_hold_shared(self):
+ line = "__must_hold_shared(RCU) other();"
+ result = ", ".join(CMatch("__must_hold_shared").search(line))
+ self.assertEqual(result, "__must_hold_shared(RCU)")
+
+ def test_search_no_false_positive(self):
+ line = "call__acquires(foo); // should stay intact"
+ result = ", ".join(CMatch(r"\b__acquires").search(line))
+ self.assertEqual(result, "")
+
+ def test_search_no_macro_remains(self):
+ line = "do_something_else();"
+ result = ", ".join(CMatch("__acquires").search(line))
+ self.assertEqual(result, "")
+
+ def test_search_no_function(self):
+ line = "something"
+ result = ", ".join(CMatch(line).search(line))
+ self.assertEqual(result, "")
+
+#
+# Run all tests
+#
+if __name__ == "__main__":
+ run_unittest(__file__)
--
2.52.0
^ permalink raw reply related [flat|nested] 47+ messages in thread
* [PATCH v2 13/28] docs: c_lex: properly implement a sub() method for CMatch
2026-03-12 14:54 [PATCH v2 00/28] kernel-doc: use a C lexical tokenizer for transforms Mauro Carvalho Chehab
` (11 preceding siblings ...)
2026-03-12 14:54 ` [PATCH v2 12/28] tools: unittests: add tests for CMatch Mauro Carvalho Chehab
@ 2026-03-12 14:54 ` Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 14/28] unittests: test_cmatch: add tests for sub() Mauro Carvalho Chehab
` (17 subsequent siblings)
30 siblings, 0 replies; 47+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-12 14:54 UTC (permalink / raw)
To: Jonathan Corbet, Linux Doc Mailing List
Cc: Mauro Carvalho Chehab, linux-hardening, linux-kernel
Change the sub() method to do what is expected, parsing
backref arguments like \0, \1, \2, ...
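As a rough model of the backref convention (\0 for the whole delimited content, \1..\n for comma-separated arguments), here is a standalone sketch. It is a simplification: the real CTokenArgs below works on tokens and honors nesting levels when splitting.

```python
import re

def expand_backrefs(sub_str, args):
    """Expand \\0 to the full argument text and \\1..\\n to the
    n-th comma-separated argument. A toy model only: unlike the
    real code, it doesn't track nested delimiters when splitting."""
    def repl(m):
        n = int(m.group(1))
        return args if n == 0 else args.split(",")[n - 1].strip()
    return re.sub(r"\\(\d+)", repl, sub_str)

# Mimics a struct_group()-style rewrite: name first, members second.
print(expand_backrefs(r"struct { \2 } \1;", "name, int a; int b;"))
```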
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
tools/lib/python/kdoc/c_lex.py | 240 +++++++++++++++++++++++++++------
1 file changed, 202 insertions(+), 38 deletions(-)
diff --git a/tools/lib/python/kdoc/c_lex.py b/tools/lib/python/kdoc/c_lex.py
index e986a4ad73e3..98031cb7907c 100644
--- a/tools/lib/python/kdoc/c_lex.py
+++ b/tools/lib/python/kdoc/c_lex.py
@@ -10,6 +10,8 @@ Those help caching regular expressions and do matching for kernel-doc.
import re
+from copy import copy
+
from .kdoc_re import KernRe
class CToken():
@@ -36,6 +38,8 @@ class CToken():
NAME = 14 #: A name. Can be an ID or a type.
SPACE = 15 #: Any space characters, including new lines
+ BACKREF = 16 #: Not a valid C sequence, but used at sub regex patterns.
+
MISMATCH = 255 #: an error indicator: should never happen in practice.
# Dict to convert from an enum interger into a string.
@@ -107,6 +111,8 @@ TOKEN_LIST = [
(CToken.SPACE, r"[\s]+"),
+ (CToken.BACKREF, r"\\\d+"),
+
(CToken.MISMATCH,r"."),
]
@@ -245,6 +251,167 @@ class CTokenizer():
return out
+class CTokenArgs:
+ """
+ Ancillary class to help using backrefs from sub matches.
+
+ If the highest backref contains a "+" as the last element,
+ the logic will be greedy, picking up all remaining delimiters.
+
+ This is needed to parse struct_group macros, which end with ``MEMBERS...``.
+ """
+ def __init__(self, sub_str):
+ self.sub_groups = set()
+ self.max_group = -1
+ self.greedy = None
+
+ for m in KernRe(r'\\(\d+)([+]?)').finditer(sub_str):
+ group = int(m.group(1))
+ if m.group(2) == "+":
+ if self.greedy and self.greedy != group:
+ raise ValueError("There are multiple greedy patterns!")
+ self.greedy = group
+
+ self.sub_groups.add(group)
+ self.max_group = max(self.max_group, group)
+
+ if self.greedy:
+ if self.greedy != self.max_group:
+ raise ValueError("Greedy pattern is not the last one!")
+
+ sub_str = KernRe(r'(\\\d+)[+]').sub(r"\1", sub_str)
+
+ self.sub_str = sub_str
+ self.sub_tokeninzer = CTokenizer(sub_str)
+
+ def groups(self, new_tokenizer):
+ """
+ Create replacement arguments for backrefs like:
+
+ ``\0``, ``\1``, ``\2``, ...``\n``
+
+ It also accepts a ``+`` suffix on the highest backref. When used,
+ it means in practice to ignore delimiters after it, being greedy.
+
+ The logic is smart enough to only go up to the maximum required
+ argument, even if there are more.
+
+ If there is a backref for an argument above the limit, it will
+ raise an exception. Please notice that, on C, square brackets
+ don't have any separator on it. Trying to use ``\1``..``\n`` for
+ brackets also raise an exception.
+ """
+
+ level = (0, 0, 0)
+
+ if self.max_group < 0:
+ return level, []
+
+ tokens = new_tokenizer.tokens
+
+ #
+ # Fill \0 with the full token contents
+ #
+ groups_list = [ [] ]
+
+ if 0 in self.sub_groups:
+ inner_level = 0
+
+ for i in range(0, len(tokens)):
+ tok = tokens[i]
+
+ if tok.kind == CToken.BEGIN:
+ inner_level += 1
+ continue
+
+ if tok.kind == CToken.END:
+ inner_level -= 1
+ if inner_level < 0:
+ break
+
+ if inner_level:
+ groups_list[0].append(tok)
+
+ if not self.max_group:
+ return level, groups_list
+
+ delim = None
+
+ #
+ # Ignore everything before BEGIN. The value of begin gives the
+ # delimiter to be used for the matches
+ #
+ for i in range(0, len(tokens)):
+ tok = tokens[i]
+ if tok.kind == CToken.BEGIN:
+ if tok.value == "{":
+ delim = ";"
+ elif tok.value == "(":
+ delim = ","
+ else:
+ raise ValueError(fr"Can't handle \1..\n on {self.sub_str}")
+
+ level = tok.level
+ break
+
+ pos = 1
+ groups_list.append([])
+
+ inner_level = 0
+ for i in range(i + 1, len(tokens)):
+ tok = tokens[i]
+
+ if tok.kind == CToken.BEGIN:
+ inner_level += 1
+ if tok.kind == CToken.END:
+ inner_level -= 1
+ if inner_level < 0:
+ break
+
+ if tok.kind == CToken.PUNC and delim == tok.value:
+ pos += 1
+ if self.greedy and pos > self.max_group:
+ pos -= 1
+ else:
+ groups_list.append([])
+
+ if pos > self.max_group:
+ break
+
+ continue
+
+ groups_list[pos].append(tok)
+
+ if pos < self.max_group:
+ raise ValueError(fr"{self.sub_str} groups are up to {pos} instead of {self.max_group}")
+
+ return level, groups_list
+
+ def tokens(self, new_tokenizer):
+ level, groups = self.groups(new_tokenizer)
+
+ new = CTokenizer()
+
+ for tok in self.sub_tokeninzer.tokens:
+ if tok.kind == CToken.BACKREF:
+ group = int(tok.value[1:])
+
+ for group_tok in groups[group]:
+ new_tok = copy(group_tok)
+
+ new_level = [0, 0, 0]
+
+ for i in range(0, len(level)):
+ new_level[i] = new_tok.level[i] + level[i]
+
+ new_tok.level = tuple(new_level)
+
+ new.tokens += [ new_tok ]
+ else:
+ new.tokens += [ tok ]
+
+ return new.tokens
+
class CMatch:
"""
Finding nested delimiters is hard with regular expressions. It is
@@ -270,31 +437,9 @@ class CMatch:
will ignore the search string.
"""
- # TODO: make CMatch handle multiple match groups
- #
- # Right now, regular expressions to match it are defined only up to
- # the start delimiter, e.g.:
- #
- # \bSTRUCT_GROUP\(
- #
- # is similar to: STRUCT_GROUP\((.*)\)
- # except that the content inside the match group is delimiter-aligned.
- #
- # The content inside parentheses is converted into a single replace
- # group (e.g. r`\0').
- #
- # It would be nice to change such definition to support multiple
- # match groups, allowing a regex equivalent to:
- #
- # FOO\((.*), (.*), (.*)\)
- #
- # it is probably easier to define it not as a regular expression, but
- # with some lexical definition like:
- #
- # FOO(arg1, arg2, arg3)
def __init__(self, regex):
- self.regex = KernRe(regex)
+ self.regex = KernRe("^" + regex + r"\b")
def _search(self, tokenizer):
"""
@@ -317,7 +462,6 @@ class CMatch:
"""
start = None
- offset = -1
started = False
import sys
@@ -339,9 +483,8 @@ class CMatch:
if tok.kind == CToken.END and tok.level == stack[-1][1]:
start, level = stack.pop()
- offset = i
- yield CTokenizer(tokenizer.tokens[start:offset + 1])
+ yield start, i
start = None
#
@@ -349,9 +492,9 @@ class CMatch:
# This is meant to solve cases where the caller logic might be
# picking an incomplete block.
#
- if start and offset < 0:
+ if start and stack:
print("WARNING: can't find an end", file=sys.stderr)
- yield CTokenizer(tokenizer.tokens[start:])
+ yield start, len(tokenizer.tokens)
def search(self, source):
"""
@@ -368,13 +511,15 @@ class CMatch:
tokenizer = CTokenizer(source)
is_token = False
- for new_tokenizer in self._search(tokenizer):
+ for start, end in self._search(tokenizer):
+ new_tokenizer = CTokenizer(tokenizer.tokens[start:end + 1])
+
if is_token:
yield new_tokenizer
else:
yield str(new_tokenizer)
- def sub(self, sub, line, count=0):
+ def sub(self, sub_str, source, count=0):
"""
This is similar to re.sub:
@@ -398,20 +543,39 @@ class CMatch:
is_token = False
tokenizer = CTokenizer(source)
+ # Detect if sub_str contains sub arguments
+
+ args_match = CTokenArgs(sub_str)
+
new_tokenizer = CTokenizer()
- cur_pos = 0
+ pos = 0
+ n = 0
+
+ #
+ # NOTE: the code below doesn't consider overlapping matches at sub().
+ # We may need to add some extra unit tests to check if those
+ # would cause problems. When replacing by "", this should not
+ # be a problem, but other transformations could be problematic
+ #
for start, end in self._search(tokenizer):
- new_tokenizer.tokens += tokenizer.tokens[cur_pos:start]
-# new_tokenizer.tokens += [sub_str]
+ new_tokenizer.tokens += tokenizer.tokens[pos:start]
- cur_pos = end + 1
+ new = CTokenizer(tokenizer.tokens[start:end + 1])
- if cur_pos:
- new_tokenizer.tokens += tokenizer.tokens[cur_pos:]
+ new_tokenizer.tokens += args_match.tokens(new)
- print(new_tokenizer.tokens)
+ pos = end + 1
- return str(new_tokenizer)
+ n += 1
+ if count and n >= count:
+ break
+
+ new_tokenizer.tokens += tokenizer.tokens[pos:]
+
+ if not is_token:
+ return str(new_tokenizer)
+
+ return new_tokenizer
def __repr__(self):
"""
--
2.52.0
^ permalink raw reply related [flat|nested] 47+ messages in thread
* [PATCH v2 14/28] unittests: test_cmatch: add tests for sub()
2026-03-12 14:54 [PATCH v2 00/28] kernel-doc: use a C lexical tokenizer for transforms Mauro Carvalho Chehab
` (12 preceding siblings ...)
2026-03-12 14:54 ` [PATCH v2 13/28] docs: c_lex: properly implement a sub() method " Mauro Carvalho Chehab
@ 2026-03-12 14:54 ` Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 15/28] docs: kdoc: replace NestedMatch with CMatch Mauro Carvalho Chehab
` (16 subsequent siblings)
30 siblings, 0 replies; 47+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-12 14:54 UTC (permalink / raw)
To: Jonathan Corbet, Kees Cook, Linux Doc Mailing List
Cc: Mauro Carvalho Chehab, linux-hardening, linux-kernel,
Gustavo A. R. Silva
Now that we have code for sub(), test it.
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
tools/unittests/test_cmatch.py | 721 ++++++++++++++++++++++++++++++++-
1 file changed, 719 insertions(+), 2 deletions(-)
diff --git a/tools/unittests/test_cmatch.py b/tools/unittests/test_cmatch.py
index 53b25aa4dc4a..f6ccd2a942f1 100755
--- a/tools/unittests/test_cmatch.py
+++ b/tools/unittests/test_cmatch.py
@@ -21,7 +21,7 @@ SRC_DIR = os.path.dirname(os.path.realpath(__file__))
sys.path.insert(0, os.path.join(SRC_DIR, "../lib/python"))
from kdoc.c_lex import CMatch
-from kdoc.xforms_lists import CTransforms
+from kdoc.kdoc_re import KernRe
from unittest_helper import run_unittest
#
@@ -75,7 +75,7 @@ class TestSearch(TestCaseDiff):
def test_search_no_false_positive(self):
line = "call__acquires(foo); // should stay intact"
- result = ", ".join(CMatch(r"\b__acquires").search(line))
+ result = ", ".join(CMatch(r"__acquires").search(line))
self.assertEqual(result, "")
def test_search_no_macro_remains(self):
@@ -88,6 +88,723 @@ class TestSearch(TestCaseDiff):
result = ", ".join(CMatch(line).search(line))
self.assertEqual(result, "")
+#
+# Override unittest.TestCase to better compare diffs ignoring whitespaces
+#
+class TestCaseDiff(unittest.TestCase):
+ """
+ Disable maximum limit on diffs and add a method to better
+ handle diffs with whitespace differences.
+ """
+
+ @classmethod
+ def setUpClass(cls):
+ """Ensure that there won't be limit for diffs"""
+ cls.maxDiff = None
+
+ def assertLogicallyEqual(self, a, b):
+ """
+ Compare two results ignoring multiple whitespace differences.
+
+ This is useful to check more complex matches picked from examples.
+ On a plus side, we also don't need to use dedent.
+ Please notice that line breaks still need to match. We might
+ remove them in the regex, but this way, checking the diff is easier.
+ """
+ a = re.sub(r"[\t ]+", " ", a.strip())
+ b = re.sub(r"[\t ]+", " ", b.strip())
+
+ a = re.sub(r"\s+\n", "\n", a)
+ b = re.sub(r"\s+\n", "\n", b)
+
+ a = re.sub(" ;", ";", a)
+ b = re.sub(" ;", ";", b)
+
+ self.assertEqual(a, b)
+
+#
+# Tests doing with different macros
+#
+
+class TestSubMultipleMacros(TestCaseDiff):
+ """
+ Tests doing with different macros.
+
+ Here, we won't use assertLogicallyEqual. Instead, we'll check if each
+ of the expected patterns is present in the result.
+ """
+
+ def test_acquires_simple(self):
+ """Simple replacement test with __acquires"""
+ line = "__acquires(ctx) foo();"
+ result = CMatch(r"__acquires").sub("REPLACED", line)
+
+ self.assertEqual("REPLACED foo();", result)
+
+ def test_acquires_multiple(self):
+ """Multiple __acquires"""
+ line = "__acquires(ctx) __acquires(other) bar();"
+ result = CMatch(r"__acquires").sub("REPLACED", line)
+
+ self.assertEqual("REPLACED REPLACED bar();", result)
+
+ def test_acquires_nested_paren(self):
+ """__acquires with nested pattern"""
+ line = "__acquires((ctx1, ctx2)) baz();"
+ result = CMatch(r"__acquires").sub("REPLACED", line)
+
+ self.assertEqual("REPLACED baz();", result)
+
+ def test_must_hold(self):
+ """__must_hold with a pointer"""
+ line = "__must_hold(&lock) do_something();"
+ result = CMatch(r"__must_hold").sub("REPLACED", line)
+
+ self.assertNotIn("__must_hold(", result)
+ self.assertIn("do_something();", result)
+
+ def test_must_hold_shared(self):
+ """__must_hold with an upercase defined value"""
+ line = "__must_hold_shared(RCU) other();"
+ result = CMatch(r"__must_hold_shared").sub("REPLACED", line)
+
+ self.assertNotIn("__must_hold_shared(", result)
+ self.assertIn("other();", result)
+
+ def test_no_false_positive(self):
+ """
+ Ensure that unrelated text containing similar patterns is preserved
+ """
+ line = "call__acquires(foo); // should stay intact"
+ result = CMatch(r"\b__acquires").sub("REPLACED", line)
+
+ self.assertLogicallyEqual(result, "call__acquires(foo);")
+
+ def test_mixed_macros(self):
+ """Add a mix of macros"""
+ line = "__acquires(ctx) __releases(ctx) __must_hold(&lock) foo();"
+
+ result = CMatch(r"__acquires").sub("REPLACED", line)
+ result = CMatch(r"__releases").sub("REPLACED", result)
+ result = CMatch(r"__must_hold").sub("REPLACED", result)
+
+ self.assertNotIn("__acquires(", result)
+ self.assertNotIn("__releases(", result)
+ self.assertNotIn("__must_hold(", result)
+
+ self.assertIn("foo();", result)
+
+ def test_no_macro_remains(self):
+ """Ensures that unmatched macros are untouched"""
+ line = "do_something_else();"
+ result = CMatch(r"__acquires").sub("REPLACED", line)
+
+ self.assertEqual(result, line)
+
+ def test_no_function(self):
+ """Ensures that no functions will remain untouched"""
+ line = "something"
+ result = CMatch(line).sub("REPLACED", line)
+
+ self.assertEqual(result, line)
+
+#
+# Check if the diff is logically equivalent. To simplify, the tests here
+# use a single macro name for all replacements.
+#
+
+class TestSubSimple(TestCaseDiff):
+ """
+ Test argument replacements.
+
+ Here, the macro name can be anything. We picked __attribute__()
+ to mimic a macro found in the kernel, but none of the replacements here
+ have any relationship with the actual kernel usage.
+ """
+
+ MACRO = "__attribute__"
+
+ @classmethod
+ def setUpClass(cls):
+ """Define a CMatch to be used for all tests"""
+ cls.matcher = CMatch(cls.MACRO)
+
+ def test_sub_with_capture(self):
+ """Test all arguments replacement with a single arg"""
+ line = f"{self.MACRO}(&ctx)\nfoo();"
+
+ result = self.matcher.sub(r"ACQUIRED(\0)", line)
+
+ self.assertLogicallyEqual("ACQUIRED(&ctx)\nfoo();", result)
+
+ def test_sub_zero_placeholder(self):
+ """Test all arguments replacement with a multiple args"""
+ line = f"{self.MACRO}(arg1, arg2)\nbar();"
+
+ result = self.matcher.sub(r"REPLACED(\0)", line)
+
+ self.assertLogicallyEqual("REPLACED(arg1, arg2)\nbar();", result)
+
+ def test_sub_single_placeholder(self):
+ """Single replacement rule for \1"""
+ line = f"{self.MACRO}(ctx, boo)\nfoo();"
+ result = self.matcher.sub(r"ACQUIRED(\1)", line)
+
+ self.assertLogicallyEqual("ACQUIRED(ctx)\nfoo();", result)
+
+ def test_sub_multiple_placeholders(self):
+ """Replacement rule for both \1 and \2"""
+ line = f"{self.MACRO}(arg1, arg2)\nbar();"
+ result = self.matcher.sub(r"REPLACE(\1, \2)", line)
+
+ self.assertLogicallyEqual("REPLACE(arg1, arg2)\nbar();", result)
+
+ def test_sub_mixed_placeholders(self):
+ """Replacement rule for \0, \1 and additional text"""
+ line = f"{self.MACRO}(foo, bar)\nbaz();"
+ result = self.matcher.sub(r"ALL(\0) FIRST(\1)", line)
+
+ self.assertLogicallyEqual("ALL(foo, bar) FIRST(foo)\nbaz();", result)
+
+ def test_sub_no_placeholder(self):
+ """Replacement without placeholders"""
+ line = f"{self.MACRO}(arg)\nfoo();"
+ result = self.matcher.sub(r"NO_BACKREFS()", line)
+
+ self.assertLogicallyEqual("NO_BACKREFS()\nfoo();", result)
+
+ def test_sub_count_parameter(self):
+ """Verify that the algorithm stops after the requested count"""
+ line = f"{self.MACRO}(a1) x();\n{self.MACRO}(a2) y();"
+ result = self.matcher.sub(r"ONLY_FIRST(\1) ", line, count=1)
+
+ self.assertLogicallyEqual(f"ONLY_FIRST(a1) x();\n{self.MACRO}(a2) y();",
+ result)
+
+ def test_strip_multiple_acquires(self):
+ """Check if spaces between removed delimiters will be dropped"""
+ line = f"int {self.MACRO}(1) {self.MACRO}(2 ) {self.MACRO}(3) foo;"
+ result = self.matcher.sub("", line)
+
+ self.assertLogicallyEqual(result, "int foo;")
+
+
+#
+# Test replacements with slashrefs
+#
+
+
+class TestSubWithLocalXforms(TestCaseDiff):
+ """
+ Test different use-case patterns found in the kernel.
+
+ Here, replacements using both CMatch and KernRe can be tested,
+ as it carries a local copy of the actual replacement rules used
+ by kernel-doc.
+ """
+
+ struct_xforms = [
+ (CMatch("__attribute__"), ' '),
+ (CMatch('__aligned'), ' '),
+ (CMatch('__counted_by'), ' '),
+ (CMatch('__counted_by_(le|be)'), ' '),
+ (CMatch('__guarded_by'), ' '),
+ (CMatch('__pt_guarded_by'), ' '),
+
+ (CMatch('__cacheline_group_(begin|end)'), ''),
+
+ (CMatch('struct_group'), r'\2'),
+ (CMatch('struct_group_attr'), r'\3'),
+ (CMatch('struct_group_tagged'), r'struct \1 { \3+ } \2;'),
+ (CMatch('__struct_group'), r'\4'),
+
+ (CMatch('__ETHTOOL_DECLARE_LINK_MODE_MASK'), r'DECLARE_BITMAP(\1, __ETHTOOL_LINK_MODE_MASK_NBITS)'),
+ (CMatch('DECLARE_PHY_INTERFACE_MASK'), r'DECLARE_BITMAP(\1, PHY_INTERFACE_MODE_MAX)'),
+ (CMatch('DECLARE_BITMAP'), r'unsigned long \1[BITS_TO_LONGS(\2)]'),
+
+ (CMatch('DECLARE_HASHTABLE'), r'unsigned long \1[1 << ((\2) - 1)]'),
+ (CMatch('DECLARE_KFIFO'), r'\2 *\1'),
+ (CMatch('DECLARE_KFIFO_PTR'), r'\2 *\1'),
+ (CMatch('(?:__)?DECLARE_FLEX_ARRAY'), r'\1 \2[]'),
+ (CMatch('DEFINE_DMA_UNMAP_ADDR'), r'dma_addr_t \1'),
+ (CMatch('DEFINE_DMA_UNMAP_LEN'), r'__u32 \1'),
+ (CMatch('VIRTIO_DECLARE_FEATURES'), r'union { u64 \1; u64 \1_array[VIRTIO_FEATURES_U64S]; }'),
+ ]
+
+ function_xforms = [
+ (CMatch('__printf'), ""),
+ (CMatch('__(?:re)?alloc_size'), ""),
+ (CMatch("__diagnose_as"), ""),
+ (CMatch("DECL_BUCKET_PARAMS"), r"\1, \2"),
+
+ (CMatch("__cond_acquires"), ""),
+ (CMatch("__cond_releases"), ""),
+ (CMatch("__acquires"), ""),
+ (CMatch("__releases"), ""),
+ (CMatch("__must_hold"), ""),
+ (CMatch("__must_not_hold"), ""),
+ (CMatch("__must_hold_shared"), ""),
+ (CMatch("__cond_acquires_shared"), ""),
+ (CMatch("__acquires_shared"), ""),
+ (CMatch("__releases_shared"), ""),
+ (CMatch("__attribute__"), ""),
+ ]
+
+ var_xforms = [
+ (CMatch('__guarded_by'), ""),
+ (CMatch('__pt_guarded_by'), ""),
+ (CMatch("LIST_HEAD"), r"struct list_head \1"),
+ ]
+
+ #: Main transforms dictionary used by apply_transforms().
+ xforms = {
+ "struct": struct_xforms,
+ "func": function_xforms,
+ "var": var_xforms,
+ }
+
+ @classmethod
+ def apply_transforms(cls, xform_type, text):
+ """
+ Mimic the behavior of the kdoc_parser.apply_transforms() method.
+
+ For each (matcher, replacement) pair of the selected transform list,
+ apply its substitution to the text.
+
+ There are two parameters:
+
+ - ``xform_type``
+ Can be ``func``, ``struct`` or ``var``;
+ - ``text``
+ The text to which the substitution patterns will be applied.
+ """
+ for search, subst in cls.xforms.get(xform_type):
+ text = search.sub(subst, text)
+
+ return text.strip()
+
+
+ def test_struct_group(self):
+ """
+ Test struct_group using a pattern from
+ drivers/net/ethernet/asix/ax88796c_main.h.
+ """
+ line = """
+ struct tx_pkt_info {
+ struct_group(tx_overhead,
+ struct tx_sop_header sop;
+ struct tx_segment_header seg;
+ );
+ struct tx_eop_header eop;
+ u16 pkt_len;
+ u16 seq_num;
+ };
+ """
+ expected = """
+ struct tx_pkt_info {
+ struct tx_sop_header sop;
+ struct tx_segment_header seg;
+ ;
+ struct tx_eop_header eop;
+ u16 pkt_len;
+ u16 seq_num;
+ };
+ """
+
+ result = self.apply_transforms("struct", line)
+ self.assertLogicallyEqual(result, expected)
+
+ def test_struct_group_attr(self):
+ """
+ Test two struct_group_attr using patterns from fs/smb/client/cifspdu.h.
+ """
+ line = """
+ typedef struct smb_com_open_rsp {
+ struct smb_hdr hdr; /* wct = 34 BB */
+ __u8 AndXCommand;
+ __u8 AndXReserved;
+ __le16 AndXOffset;
+ __u8 OplockLevel;
+ __u16 Fid;
+ __le32 CreateAction;
+ struct_group_attr(common_attributes,,
+ __le64 CreationTime;
+ __le64 LastAccessTime;
+ __le64 LastWriteTime;
+ __le64 ChangeTime;
+ __le32 FileAttributes;
+ );
+ __le64 AllocationSize;
+ __le64 EndOfFile;
+ __le16 FileType;
+ __le16 DeviceState;
+ __u8 DirectoryFlag;
+ __u16 ByteCount; /* bct = 0 */
+ } OPEN_RSP;
+ typedef struct {
+ struct_group_attr(common_attributes,,
+ __le64 CreationTime;
+ __le64 LastAccessTime;
+ __le64 LastWriteTime;
+ __le64 ChangeTime;
+ __le32 Attributes;
+ );
+ __u32 Pad1;
+ __le64 AllocationSize;
+ __le64 EndOfFile;
+ __le32 NumberOfLinks;
+ __u8 DeletePending;
+ __u8 Directory;
+ __u16 Pad2;
+ __le32 EASize;
+ __le32 FileNameLength;
+ union {
+ char __pad;
+ DECLARE_FLEX_ARRAY(char, FileName);
+ };
+ } FILE_ALL_INFO; /* level 0x107 QPathInfo */
+ """
+ expected = """
+ typedef struct smb_com_open_rsp {
+ struct smb_hdr hdr;
+ __u8 AndXCommand;
+ __u8 AndXReserved;
+ __le16 AndXOffset;
+ __u8 OplockLevel;
+ __u16 Fid;
+ __le32 CreateAction;
+ __le64 CreationTime;
+ __le64 LastAccessTime;
+ __le64 LastWriteTime;
+ __le64 ChangeTime;
+ __le32 FileAttributes;
+ ;
+ __le64 AllocationSize;
+ __le64 EndOfFile;
+ __le16 FileType;
+ __le16 DeviceState;
+ __u8 DirectoryFlag;
+ __u16 ByteCount;
+ } OPEN_RSP;
+ typedef struct {
+ __le64 CreationTime;
+ __le64 LastAccessTime;
+ __le64 LastWriteTime;
+ __le64 ChangeTime;
+ __le32 Attributes;
+ ;
+ __u32 Pad1;
+ __le64 AllocationSize;
+ __le64 EndOfFile;
+ __le32 NumberOfLinks;
+ __u8 DeletePending;
+ __u8 Directory;
+ __u16 Pad2;
+ __le32 EASize;
+ __le32 FileNameLength;
+ union {
+ char __pad;
+ char FileName[];
+ };
+ } FILE_ALL_INFO;
+ """
+
+ result = self.apply_transforms("struct", line)
+ self.assertLogicallyEqual(result, expected)
+
+ def test_raw_struct_group(self):
+ """
+ Test a __struct_group pattern from include/uapi/cxl/features.h.
+ """
+ line = """
+ struct cxl_mbox_get_sup_feats_out {
+ __struct_group(cxl_mbox_get_sup_feats_out_hdr, hdr, /* empty */,
+ __le16 num_entries;
+ __le16 supported_feats;
+ __u8 reserved[4];
+ );
+ struct cxl_feat_entry ents[] __counted_by_le(num_entries);
+ } __attribute__ ((__packed__));
+ """
+ expected = """
+ struct cxl_mbox_get_sup_feats_out {
+ __le16 num_entries;
+ __le16 supported_feats;
+ __u8 reserved[4];
+ ;
+ struct cxl_feat_entry ents[];
+ };
+ """
+
+ result = self.apply_transforms("struct", line)
+ self.assertLogicallyEqual(result, expected)
+
+ def test_raw_struct_group_tagged(self):
+ """
+ Test cxl_regs with struct_group_tagged patterns from drivers/cxl/cxl.h.
+
+ NOTE:
+
+ This one actually violates what kernel-doc expects: the
+ kernel-doc regex expects only 3 arguments, but the macro is
+ actually defined as::
+
+ #define struct_group_tagged(TAG, NAME, MEMBERS...)
+
+ The replace expression there is::
+
+ struct \1 { \3 } \2;
+
+ but it should really be something like::
+
+ struct \1 { \3 \4 \5 \6 \7 \8 ... } \2;
+
+ A later fix will be needed to address it.
+
+ """
+ line = """
+ struct cxl_regs {
+ struct_group_tagged(cxl_component_regs, component,
+ void __iomem *hdm_decoder;
+ void __iomem *ras;
+ );
+
+
+ /* This is actually a violation: too many commas */
+ struct_group_tagged(cxl_device_regs, device_regs,
+ void __iomem *status, *mbox, *memdev;
+ );
+
+ struct_group_tagged(cxl_pmu_regs, pmu_regs,
+ void __iomem *pmu;
+ );
+
+ struct_group_tagged(cxl_rch_regs, rch_regs,
+ void __iomem *dport_aer;
+ );
+
+ struct_group_tagged(cxl_rcd_regs, rcd_regs,
+ void __iomem *rcd_pcie_cap;
+ );
+ };
+ """
+ expected = """
+ struct cxl_regs {
+ struct cxl_component_regs {
+ void __iomem *hdm_decoder;
+ void __iomem *ras;
+ } component;;
+
+ struct cxl_device_regs {
+ void __iomem *status, *mbox, *memdev;
+ } device_regs;;
+
+ struct cxl_pmu_regs {
+ void __iomem *pmu;
+ } pmu_regs;;
+
+ struct cxl_rch_regs {
+ void __iomem *dport_aer;
+ } rch_regs;;
+
+ struct cxl_rcd_regs {
+ void __iomem *rcd_pcie_cap;
+ } rcd_regs;;
+ };
+ """
+
+ result = self.apply_transforms("struct", line)
+ self.assertLogicallyEqual(result, expected)
+
+ def test_struct_group_tagged_with_private(self):
+ """
+ Replace a struct_group_tagged that contains a private: comment, using
+ the same replacement regex as in xforms_lists.py.
+
+ As the private-section removal happens outside the NestedGroup class,
+ we manually dropped the remaining part of the struct, to simulate what
+ happens at kdoc_parser.
+
+ Taken from include/net/page_pool/types.h.
+ """
+ line = """
+ struct page_pool_params {
+ struct_group_tagged(page_pool_params_slow, slow,
+ struct net_device *netdev;
+ unsigned int queue_idx;
+ unsigned int flags;
+ /* private: only under "slow" struct */
+ unsigned int ignored;
+ );
+ /* Struct below shall not be ignored */
+ struct_group_tagged(page_pool_params_fast, fast,
+ unsigned int order;
+ unsigned int pool_size;
+ int nid;
+ struct device *dev;
+ struct napi_struct *napi;
+ enum dma_data_direction dma_dir;
+ unsigned int max_len;
+ unsigned int offset;
+ );
+ };
+ """
+ expected = """
+ struct page_pool_params {
+ struct page_pool_params_slow {
+ struct net_device *netdev;
+ unsigned int queue_idx;
+ unsigned int flags;
+ } slow;;
+ struct page_pool_params_fast {
+ unsigned int order;
+ unsigned int pool_size;
+ int nid;
+ struct device *dev;
+ struct napi_struct *napi;
+ enum dma_data_direction dma_dir;
+ unsigned int max_len;
+ unsigned int offset;
+ } fast;;
+ };
+ """
+
+ result = self.apply_transforms("struct", line)
+ self.assertLogicallyEqual(result, expected)
+
+ def test_struct_kcov(self):
+ """
+ Test a struct from kernel/kcov.c.
+ """
+ line = """
+ struct kcov {
+ refcount_t refcount;
+ spinlock_t lock;
+ enum kcov_mode mode __guarded_by(&lock);
+ unsigned int size __guarded_by(&lock);
+ void *area __guarded_by(&lock);
+ struct task_struct *t __guarded_by(&lock);
+ bool remote;
+ unsigned int remote_size;
+ int sequence;
+ };
+ """
+ expected = """
+ struct kcov {
+ refcount_t refcount;
+ spinlock_t lock;
+ enum kcov_mode mode;
+ unsigned int size;
+ void *area;
+ struct task_struct *t;
+ bool remote;
+ unsigned int remote_size;
+ int sequence;
+ };
+ """
+
+ result = self.apply_transforms("struct", line)
+ self.assertLogicallyEqual(result, expected)
+
+ def test_vars_stackdepot(self):
+ """
+ Test __guarded_by on variables from lib/stackdepot.c.
+ """
+ line = """
+ size_t pool_offset __guarded_by(&pool_lock) = DEPOT_POOL_SIZE;
+ __guarded_by(&pool_lock) LIST_HEAD(free_stacks);
+ void **stack_pools __pt_guarded_by(&pool_lock);
+ """
+ expected = """
+ size_t pool_offset = DEPOT_POOL_SIZE;
+ struct list_head free_stacks;
+ void **stack_pools;
+ """
+
+ result = self.apply_transforms("var", line)
+ self.assertLogicallyEqual(result, expected)
+
+ def test_functions_with_acquires_and_releases(self):
+ """
+ Test __acquires/__releases style annotations on function prototypes.
+ """
+ line = """
+ bool prepare_report_consumer(unsigned long *flags,
+ const struct access_info *ai,
+ struct other_info *other_info) \
+ __cond_acquires(true, &report_lock);
+
+ int tcp_sigpool_start(unsigned int id, struct tcp_sigpool *c) \
+ __cond_acquires(0, RCU_BH);
+
+ bool undo_report_consumer(unsigned long *flags,
+ const struct access_info *ai,
+ struct other_info *other_info) \
+ __cond_releases(true, &report_lock);
+
+ void debugfs_enter_cancellation(struct file *file,
+ struct debugfs_cancellation *c) \
+ __acquires(cancellation);
+
+ void debugfs_leave_cancellation(struct file *file,
+ struct debugfs_cancellation *c) \
+ __releases(cancellation);
+
+ acpi_cpu_flags acpi_os_acquire_lock(acpi_spinlock lockp) \
+ __acquires(lockp);
+
+ void acpi_os_release_lock(acpi_spinlock lockp,
+ acpi_cpu_flags not_used) \
+ __releases(lockp)
+ """
+ expected = """
+ bool prepare_report_consumer(unsigned long *flags,
+ const struct access_info *ai,
+ struct other_info *other_info);
+
+ int tcp_sigpool_start(unsigned int id, struct tcp_sigpool *c);
+
+ bool undo_report_consumer(unsigned long *flags,
+ const struct access_info *ai,
+ struct other_info *other_info);
+
+ void debugfs_enter_cancellation(struct file *file,
+ struct debugfs_cancellation *c);
+
+ void debugfs_leave_cancellation(struct file *file,
+ struct debugfs_cancellation *c);
+
+ acpi_cpu_flags acpi_os_acquire_lock(acpi_spinlock lockp);
+
+ void acpi_os_release_lock(acpi_spinlock lockp,
+ acpi_cpu_flags not_used)
+ """
+
+ result = self.apply_transforms("func", line)
+ self.assertLogicallyEqual(result, expected)
+
#
# Run all tests
#
--
2.52.0
^ permalink raw reply related [flat|nested] 47+ messages in thread
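The three normalization passes that assertLogicallyEqual() performs in the patch above can be exercised standalone. A minimal sketch, outside the test suite and with made-up input:

```python
import re

def normalize(text):
    # Mirror assertLogicallyEqual(): collapse runs of spaces/tabs,
    # drop trailing whitespace before newlines, and drop the space
    # that may be left behind before ";" when a macro is removed.
    text = re.sub(r"[\t ]+", " ", text.strip())
    text = re.sub(r"\s+\n", "\n", text)
    return re.sub(" ;", ";", text)

a = "struct s {\n\tint   a ;\n};"
b = "struct s {\n int a;\n};"
print(normalize(a) == normalize(b))  # prints: True
```

Note that line breaks are intentionally preserved, as the docstring above explains, so diffs between expected and actual output stay readable.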
* [PATCH v2 15/28] docs: kdoc: replace NestedMatch with CMatch
2026-03-12 14:54 [PATCH v2 00/28] kernel-doc: use a C lexical tokenizer for transforms Mauro Carvalho Chehab
` (13 preceding siblings ...)
2026-03-12 14:54 ` [PATCH v2 14/28] unittests: test_cmatch: add tests for sub() Mauro Carvalho Chehab
@ 2026-03-12 14:54 ` Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 16/28] docs: kdoc_re: get rid of NestedMatch class Mauro Carvalho Chehab
` (15 subsequent siblings)
30 siblings, 0 replies; 47+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-12 14:54 UTC (permalink / raw)
To: Jonathan Corbet, Linux Doc Mailing List
Cc: Mauro Carvalho Chehab, linux-hardening, linux-kernel,
Aleksandr Loktionov, Randy Dunlap
Our previous approach to handle nested structs was to use
NestedMatch. It works well, but extending it to parse extra
delimiters is very complex.
Instead, use CMatch, which uses a C tokenizer, making the code more
reliable and simpler.
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
tools/lib/python/kdoc/kdoc_parser.py | 2 +-
tools/lib/python/kdoc/xforms_lists.py | 31 ++++++++++++++-------------
2 files changed, 17 insertions(+), 16 deletions(-)
diff --git a/tools/lib/python/kdoc/kdoc_parser.py b/tools/lib/python/kdoc/kdoc_parser.py
index e804e61b09c0..0da95b090a34 100644
--- a/tools/lib/python/kdoc/kdoc_parser.py
+++ b/tools/lib/python/kdoc/kdoc_parser.py
@@ -13,7 +13,7 @@ import sys
import re
from pprint import pformat
-from kdoc.kdoc_re import NestedMatch, KernRe
+from kdoc.kdoc_re import KernRe
from kdoc.c_lex import CTokenizer
from kdoc.kdoc_item import KdocItem
diff --git a/tools/lib/python/kdoc/xforms_lists.py b/tools/lib/python/kdoc/xforms_lists.py
index c07cbe1e6349..7fa7f52cec7b 100644
--- a/tools/lib/python/kdoc/xforms_lists.py
+++ b/tools/lib/python/kdoc/xforms_lists.py
@@ -4,7 +4,8 @@
import re
-from kdoc.kdoc_re import KernRe, NestedMatch
+from kdoc.kdoc_re import KernRe
+from kdoc.c_lex import CMatch
struct_args_pattern = r'([^,)]+)'
@@ -60,7 +61,7 @@ class CTransforms:
#
# As it doesn't properly match the end parenthesis on some cases.
#
- # So, a better solution was crafted: there's now a NestedMatch
+ # So, a better solution was crafted: there's now a CMatch
# class that ensures that delimiters after a search are properly
# matched. So, the implementation to drop STRUCT_GROUP() will be
# handled in separate.
@@ -72,9 +73,9 @@ class CTransforms:
#
# Replace macros
#
- # TODO: use NestedMatch for FOO($1, $2, ...) matches
+ # TODO: use CMatch for FOO($1, $2, ...) matches
#
- # it is better to also move those to the NestedMatch logic,
+ # it is better to also move those to the CMatch logic,
# to ensure that parentheses will be properly matched.
#
(KernRe(r'__ETHTOOL_DECLARE_LINK_MODE_MASK\s*\(([^\)]+)\)', re.S),
@@ -95,17 +96,17 @@ class CTransforms:
(KernRe(r'DEFINE_DMA_UNMAP_LEN\s*\(' + struct_args_pattern + r'\)', re.S), r'__u32 \1'),
(KernRe(r'VIRTIO_DECLARE_FEATURES\(([\w_]+)\)'), r'union { u64 \1; u64 \1_array[VIRTIO_FEATURES_U64S]; }'),
- (NestedMatch(r"__cond_acquires\s*\("), ""),
- (NestedMatch(r"__cond_releases\s*\("), ""),
- (NestedMatch(r"__acquires\s*\("), ""),
- (NestedMatch(r"__releases\s*\("), ""),
- (NestedMatch(r"__must_hold\s*\("), ""),
- (NestedMatch(r"__must_not_hold\s*\("), ""),
- (NestedMatch(r"__must_hold_shared\s*\("), ""),
- (NestedMatch(r"__cond_acquires_shared\s*\("), ""),
- (NestedMatch(r"__acquires_shared\s*\("), ""),
- (NestedMatch(r"__releases_shared\s*\("), ""),
- (NestedMatch(r'\bSTRUCT_GROUP\('), r'\0'),
+ (CMatch(r"__cond_acquires"), ""),
+ (CMatch(r"__cond_releases"), ""),
+ (CMatch(r"__acquires"), ""),
+ (CMatch(r"__releases"), ""),
+ (CMatch(r"__must_hold"), ""),
+ (CMatch(r"__must_not_hold"), ""),
+ (CMatch(r"__must_hold_shared"), ""),
+ (CMatch(r"__cond_acquires_shared"), ""),
+ (CMatch(r"__acquires_shared"), ""),
+ (CMatch(r"__releases_shared"), ""),
+ (CMatch(r"STRUCT_GROUP"), r'\0'),
]
#: Transforms for function prototypes.
--
2.52.0
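CMatch itself is not part of this excerpt. As a rough, regex-only illustration of the \0/\1 backref semantics the replacement tables above rely on (\0 expands to the full argument list, \1..\N to individual arguments), here is a stand-in that handles only non-nested parentheses, unlike the real tokenizer-based CMatch:

```python
import re

def naive_macro_sub(macro, repl, text):
    # Regex-only stand-in for CMatch.sub(); "naive_macro_sub" is an
    # illustrative name, not part of the kernel-doc API. Cannot cope
    # with nested parentheses inside the macro arguments.
    pattern = re.compile(r"\b" + macro + r"\s*\(([^()]*)\)")

    def expand(m):
        args = [a.strip() for a in m.group(1).split(",")]
        out = repl.replace(r"\0", m.group(1))
        for i, arg in enumerate(args, 1):
            out = out.replace("\\%d" % i, arg)
        return out

    return pattern.sub(expand, text)

print(naive_macro_sub("__acquires", "", "__acquires(ctx) foo();"))
print(naive_macro_sub("DEFINE_DMA_UNMAP_ADDR", r"dma_addr_t \1",
                      "DEFINE_DMA_UNMAP_ADDR(addr);"))  # prints: dma_addr_t addr;
```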
* [PATCH v2 16/28] docs: kdoc_re: get rid of NestedMatch class
2026-03-12 14:54 [PATCH v2 00/28] kernel-doc: use a C lexical tokenizer for transforms Mauro Carvalho Chehab
` (14 preceding siblings ...)
2026-03-12 14:54 ` [PATCH v2 15/28] docs: kdoc: replace NestedMatch with CMatch Mauro Carvalho Chehab
@ 2026-03-12 14:54 ` Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 17/28] docs: xforms_lists: handle struct_group directly Mauro Carvalho Chehab
` (14 subsequent siblings)
30 siblings, 0 replies; 47+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-12 14:54 UTC (permalink / raw)
To: Jonathan Corbet, Linux Doc Mailing List
Cc: Mauro Carvalho Chehab, linux-hardening, linux-kernel,
Aleksandr Loktionov, Randy Dunlap
Now that everything has been converted to CMatch, we can get rid of
the previous NestedMatch implementation.
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
tools/lib/python/kdoc/kdoc_re.py | 202 -------------------------------
1 file changed, 202 deletions(-)
diff --git a/tools/lib/python/kdoc/kdoc_re.py b/tools/lib/python/kdoc/kdoc_re.py
index ba601a4f5035..6f3ae28859ea 100644
--- a/tools/lib/python/kdoc/kdoc_re.py
+++ b/tools/lib/python/kdoc/kdoc_re.py
@@ -140,205 +140,3 @@ class KernRe:
"""
return self.last_match.groups()
-
-
-#: Nested delimited pairs (brackets and parenthesis)
-DELIMITER_PAIRS = {
- '{': '}',
- '(': ')',
- '[': ']',
-}
-
-#: compiled delimiters
-RE_DELIM = KernRe(r'[\{\}\[\]\(\)]')
-
-
-class NestedMatch:
- """
- Finding nested delimiters is hard with regular expressions. It is
- even harder on Python with its normal re module, as there are several
- advanced regular expressions that are missing.
-
- This is the case of this pattern::
-
- '\\bSTRUCT_GROUP(\\(((?:(?>[^)(]+)|(?1))*)\\))[^;]*;'
-
- which is used to properly match open/close parentheses of the
- string search STRUCT_GROUP(),
-
- Add a class that counts pairs of delimiters, using it to match and
- replace nested expressions.
-
- The original approach was suggested by:
-
- https://stackoverflow.com/questions/5454322/python-how-to-match-nested-parentheses-with-regex
-
- Although I re-implemented it to make it more generic and match 3 types
- of delimiters. The logic checks if delimiters are paired. If not, it
- will ignore the search string.
- """
-
- # TODO: make NestedMatch handle multiple match groups
- #
- # Right now, regular expressions to match it are defined only up to
- # the start delimiter, e.g.:
- #
- # \bSTRUCT_GROUP\(
- #
- # is similar to: STRUCT_GROUP\((.*)\)
- # except that the content inside the match group is delimiter-aligned.
- #
- # The content inside parentheses is converted into a single replace
- # group (e.g. r`\0').
- #
- # It would be nice to change such definition to support multiple
- # match groups, allowing a regex equivalent to:
- #
- # FOO\((.*), (.*), (.*)\)
- #
- # it is probably easier to define it not as a regular expression, but
- # with some lexical definition like:
- #
- # FOO(arg1, arg2, arg3)
-
- def __init__(self, regex):
- self.regex = KernRe(regex)
-
- def _search(self, line):
- """
- Finds paired blocks for a regex that ends with a delimiter.
-
- The suggestion of using finditer to match pairs came from:
- https://stackoverflow.com/questions/5454322/python-how-to-match-nested-parentheses-with-regex
- but I ended using a different implementation to align all three types
- of delimiters and seek for an initial regular expression.
-
- The algorithm seeks for open/close paired delimiters and places them
- into a stack, yielding a start/stop position of each match when the
- stack is zeroed.
-
- The algorithm should work fine for properly paired lines, but will
- silently ignore end delimiters that precede a start delimiter.
- This should be OK for kernel-doc parser, as unaligned delimiters
- would cause compilation errors. So, we don't need to raise exceptions
- to cover such issues.
- """
-
- stack = []
-
- for match_re in self.regex.finditer(line):
- start = match_re.start()
- offset = match_re.end()
- string_char = None
- escape = False
-
- d = line[offset - 1]
- if d not in DELIMITER_PAIRS:
- continue
-
- end = DELIMITER_PAIRS[d]
- stack.append(end)
-
- for match in RE_DELIM.finditer(line[offset:]):
- pos = match.start() + offset
-
- d = line[pos]
-
- if escape:
- escape = False
- continue
-
- if string_char:
- if d == '\\':
- escape = True
- elif d == string_char:
- string_char = None
-
- continue
-
- if d in ('"', "'"):
- string_char = d
- continue
-
- if d in DELIMITER_PAIRS:
- end = DELIMITER_PAIRS[d]
-
- stack.append(end)
- continue
-
- # Does the end delimiter match what is expected?
- if stack and d == stack[-1]:
- stack.pop()
-
- if not stack:
- yield start, offset, pos + 1
- break
-
- def search(self, line):
- """
- This is similar to re.search:
-
- It matches a regex that it is followed by a delimiter,
- returning occurrences only if all delimiters are paired.
- """
-
- for t in self._search(line):
-
- yield line[t[0]:t[2]]
-
- def sub(self, sub, line, count=0):
- """
- This is similar to re.sub:
-
- It matches a regex that it is followed by a delimiter,
- replacing occurrences only if all delimiters are paired.
-
- if the sub argument contains::
-
- r'\0'
-
- it will work just like re: it places there the matched paired data
- with the delimiter stripped.
-
- If count is different than zero, it will replace at most count
- items.
- """
- out = ""
-
- cur_pos = 0
- n = 0
-
- for start, end, pos in self._search(line):
- out += line[cur_pos:start]
-
- # Value, ignoring start/end delimiters
- value = line[end:pos - 1]
-
- # replaces \0 at the sub string, if \0 is used there
- new_sub = sub
- new_sub = new_sub.replace(r'\0', value)
-
- out += new_sub
-
- # Drop end ';' if any
- if pos < len(line) and line[pos] == ';':
- pos += 1
-
- cur_pos = pos
- n += 1
-
- if count and count >= n:
- break
-
- # Append the remaining string
- l = len(line)
- out += line[cur_pos:l]
-
- return out
-
- def __repr__(self):
- """
- Returns a displayable version of the class init.
- """
-
- return f'NestedMatch("{self.regex.regex.pattern}")'
--
2.52.0
* [PATCH v2 17/28] docs: xforms_lists: handle struct_group directly
2026-03-12 14:54 [PATCH v2 00/28] kernel-doc: use a C lexical tokenizer for transforms Mauro Carvalho Chehab
` (15 preceding siblings ...)
2026-03-12 14:54 ` [PATCH v2 16/28] docs: kdoc_re: get rid of NestedMatch class Mauro Carvalho Chehab
@ 2026-03-12 14:54 ` Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 18/28] docs: xforms_lists: better evaluate struct_group macros Mauro Carvalho Chehab
` (13 subsequent siblings)
30 siblings, 0 replies; 47+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-12 14:54 UTC (permalink / raw)
To: Jonathan Corbet, Linux Doc Mailing List
Cc: Mauro Carvalho Chehab, linux-hardening, linux-kernel,
Aleksandr Loktionov, Randy Dunlap
The previous logic handled struct_group in two steps.
Remove that approach, as CMatch can do it properly in a
single step.
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
tools/lib/python/kdoc/xforms_lists.py | 53 +++------------------------
1 file changed, 6 insertions(+), 47 deletions(-)
diff --git a/tools/lib/python/kdoc/xforms_lists.py b/tools/lib/python/kdoc/xforms_lists.py
index 7fa7f52cec7b..98632c50a146 100644
--- a/tools/lib/python/kdoc/xforms_lists.py
+++ b/tools/lib/python/kdoc/xforms_lists.py
@@ -32,52 +32,6 @@ class CTransforms:
(KernRe(r'\s*____cacheline_aligned_in_smp', re.S), ' '),
(KernRe(r'\s*____cacheline_aligned', re.S), ' '),
(KernRe(r'\s*__cacheline_group_(begin|end)\([^\)]+\);'), ''),
- #
- # Unwrap struct_group macros based on this definition:
- # __struct_group(TAG, NAME, ATTRS, MEMBERS...)
- # which has variants like: struct_group(NAME, MEMBERS...)
- # Only MEMBERS arguments require documentation.
- #
- # Parsing them happens on two steps:
- #
- # 1. drop struct group arguments that aren't at MEMBERS,
- # storing them as STRUCT_GROUP(MEMBERS)
- #
- # 2. remove STRUCT_GROUP() ancillary macro.
- #
- # The original logic used to remove STRUCT_GROUP() using an
- # advanced regex:
- #
- # \bSTRUCT_GROUP(\(((?:(?>[^)(]+)|(?1))*)\))[^;]*;
- #
- # with two patterns that are incompatible with
- # Python re module, as it has:
- #
- # - a recursive pattern: (?1)
- # - an atomic grouping: (?>...)
- #
- # I tried a simpler version: but it didn't work either:
- # \bSTRUCT_GROUP\(([^\)]+)\)[^;]*;
- #
- # As it doesn't properly match the end parenthesis on some cases.
- #
- # So, a better solution was crafted: there's now a CMatch
- # class that ensures that delimiters after a search are properly
- # matched. So, the implementation to drop STRUCT_GROUP() will be
- # handled in separate.
- #
- (KernRe(r'\bstruct_group\s*\(([^,]*,)', re.S), r'STRUCT_GROUP('),
- (KernRe(r'\bstruct_group_attr\s*\(([^,]*,){2}', re.S), r'STRUCT_GROUP('),
- (KernRe(r'\bstruct_group_tagged\s*\(([^,]*),([^,]*),', re.S), r'struct \1 \2; STRUCT_GROUP('),
- (KernRe(r'\b__struct_group\s*\(([^,]*,){3}', re.S), r'STRUCT_GROUP('),
- #
- # Replace macros
- #
- # TODO: use CMatch for FOO($1, $2, ...) matches
- #
- # it is better to also move those to the CMatch logic,
- # to ensure that parentheses will be properly matched.
- #
(KernRe(r'__ETHTOOL_DECLARE_LINK_MODE_MASK\s*\(([^\)]+)\)', re.S),
r'DECLARE_BITMAP(\1, __ETHTOOL_LINK_MODE_MASK_NBITS)'),
(KernRe(r'DECLARE_PHY_INTERFACE_MASK\s*\(([^\)]+)\)', re.S),
@@ -106,7 +60,12 @@ class CTransforms:
(CMatch(r"__cond_acquires_shared"), ""),
(CMatch(r"__acquires_shared"), ""),
(CMatch(r"__releases_shared"), ""),
- (CMatch(r"STRUCT_GROUP"), r'\0'),
+
+ (CMatch('struct_group'), r'\2'),
+ (CMatch('struct_group_attr'), r'\3'),
+ (CMatch('struct_group_tagged'), r'struct \1 \2; \3'),
+ (CMatch('__struct_group'), r'\4'),
+
]
#: Transforms for function prototypes.
--
2.52.0
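The (matcher, replacement) tables above are consumed in order, one sub() at a time. A standalone sketch of that loop, with plain compiled regexes standing in for the CMatch objects (the two patterns here are simplified illustrations, not the real kernel-doc rules):

```python
import re

# Simplified stand-in for one of the transform tables.
var_xforms = [
    (re.compile(r"__guarded_by\([^()]*\)"), ""),
    (re.compile(r"\bLIST_HEAD\(([^()]*)\)"), r"struct list_head \1"),
]

def apply_transforms(xforms, text):
    # Apply every (matcher, replacement) pair in table order.
    for search, subst in xforms:
        text = search.sub(subst, text)
    return text.strip()

print(apply_transforms(var_xforms,
                       "__guarded_by(&pool_lock) LIST_HEAD(free_stacks);"))
# prints: struct list_head free_stacks;
```

Ordering matters: earlier entries may strip annotations that would otherwise confuse later, more specific rules.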
* [PATCH v2 18/28] docs: xforms_lists: better evaluate struct_group macros
2026-03-12 14:54 [PATCH v2 00/28] kernel-doc: use a C lexical tokenizer for transforms Mauro Carvalho Chehab
` (16 preceding siblings ...)
2026-03-12 14:54 ` [PATCH v2 17/28] docs: xforms_lists: handle struct_group directly Mauro Carvalho Chehab
@ 2026-03-12 14:54 ` Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 19/28] docs: c_lex: add support to work with pure name ids Mauro Carvalho Chehab
` (12 subsequent siblings)
30 siblings, 0 replies; 47+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-12 14:54 UTC (permalink / raw)
To: Jonathan Corbet, Linux Doc Mailing List
Cc: Mauro Carvalho Chehab, linux-hardening, linux-kernel,
Aleksandr Loktionov, Randy Dunlap
The previous approach was to unwind nested structs/unions.
Now that we have logic that can handle them well, use it to
ensure that struct_group macros will properly reflect the
actual struct.
Note that the replacement logic still simplifies the code
a little bit, as the basic building block for struct_group is:
union { \
struct { MEMBERS } ATTRS; \
struct __struct_group_tag(TAG) { MEMBERS } ATTRS NAME; \
} ATTRS
Where:
- ATTRS is meant to add extra macro attributes like __packed,
which we already discard, as they aren't relevant for
documenting struct members;
- TAG is used only when built with __cplusplus.
So, instead, convert them into just:
struct { MEMBERS };
Please note that we're using the greedy version of the backrefs
here, as MEMBERS is actually MEMBERS... (variadic) in all such macros.
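The net effect of the new rules can be sketched with a plain regular
expression. This is a simplification: the real CMatch matches balanced
parentheses lexically, and simplify_struct_group() below is a
hypothetical helper, not part of this series:

```python
import re

def simplify_struct_group(source):
    # Collapse struct_group(NAME, MEMBERS...) into "struct { MEMBERS };".
    # Regex sketch only: it cannot handle nested parentheses inside the
    # member list, which is exactly what CMatch handles lexically.
    return re.sub(
        r'struct_group\s*\(\s*\w+\s*,\s*(.*?)\)\s*;',
        r'struct { \1 };',
        source,
        flags=re.S,
    )

print(simplify_struct_group("struct_group(hdrs, int sop; int seg;);"))
# struct { int sop; int seg; };
```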
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
tools/lib/python/kdoc/xforms_lists.py | 14 ++++++++++----
1 file changed, 10 insertions(+), 4 deletions(-)
diff --git a/tools/lib/python/kdoc/xforms_lists.py b/tools/lib/python/kdoc/xforms_lists.py
index 98632c50a146..2056572852fd 100644
--- a/tools/lib/python/kdoc/xforms_lists.py
+++ b/tools/lib/python/kdoc/xforms_lists.py
@@ -61,10 +61,16 @@ class CTransforms:
(CMatch(r"__acquires_shared"), ""),
(CMatch(r"__releases_shared"), ""),
- (CMatch('struct_group'), r'\2'),
- (CMatch('struct_group_attr'), r'\3'),
- (CMatch('struct_group_tagged'), r'struct \1 \2; \3'),
- (CMatch('__struct_group'), r'\4'),
+ #
+    # Macro __struct_group() creates a union with an anonymous
+    # and a non-anonymous struct, depending on the parameters. We only
+    # need one of those in kernel-doc, as we won't be documenting the same
+ # members twice.
+ #
+ (CMatch('struct_group'), r'struct { \2+ };'),
+ (CMatch('struct_group_attr'), r'struct { \3+ };'),
+ (CMatch('struct_group_tagged'), r'struct { \3+ };'),
+ (CMatch('__struct_group'), r'struct { \4+ };'),
]
--
2.52.0
* [PATCH v2 19/28] docs: c_lex: add support to work with pure name ids
2026-03-12 14:54 [PATCH v2 00/28] kernel-doc: use a C lexical tokenizer for transforms Mauro Carvalho Chehab
` (17 preceding siblings ...)
2026-03-12 14:54 ` [PATCH v2 18/28] docs: xforms_lists: better evaluate struct_group macros Mauro Carvalho Chehab
@ 2026-03-12 14:54 ` Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 20/28] docs: xforms_lists: use CMatch for all identifiers Mauro Carvalho Chehab
` (11 subsequent siblings)
30 siblings, 0 replies; 47+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-12 14:54 UTC (permalink / raw)
To: Jonathan Corbet, Linux Doc Mailing List
Cc: Mauro Carvalho Chehab, linux-hardening, linux-kernel
Most of the CMatch complexity comes from the need to parse macros
with arguments. Still, it is easy enough to also support simple
name identifiers.
Add support for them, as it simplifies the xforms logic.
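Conceptually, the tokenizer only needs to notice that a matched
identifier is not followed by an opening delimiter. A rough standalone
sketch (classify() is a hypothetical helper; the real code walks CToken
objects and tracks nesting levels):

```python
import re

def classify(source, name):
    # Return "call" when the identifier is followed by "(",
    # "name" when it stands alone, None when it isn't present.
    m = re.search(r'\b' + re.escape(name) + r'\b\s*(\()?', source)
    if not m:
        return None
    return "call" if m.group(1) else "name"

print(classify("int x __aligned(8);", "__aligned"))  # call
print(classify("int y __packed;", "__packed"))       # name
```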
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
tools/lib/python/kdoc/c_lex.py | 14 +++++++++++---
1 file changed, 11 insertions(+), 3 deletions(-)
diff --git a/tools/lib/python/kdoc/c_lex.py b/tools/lib/python/kdoc/c_lex.py
index 98031cb7907c..689ad64ecbe4 100644
--- a/tools/lib/python/kdoc/c_lex.py
+++ b/tools/lib/python/kdoc/c_lex.py
@@ -477,9 +477,17 @@ class CMatch:
continue
- if not started and tok.kind == CToken.BEGIN:
- started = True
- continue
+ if not started:
+ if tok.kind == CToken.SPACE:
+ continue
+
+ if tok.kind == CToken.BEGIN:
+ started = True
+ continue
+ else:
+ # Name only token without BEGIN/END
+ yield start, i
+ start = None
if tok.kind == CToken.END and tok.level == stack[-1][1]:
start, level = stack.pop()
--
2.52.0
* [PATCH v2 20/28] docs: xforms_lists: use CMatch for all identifiers
2026-03-12 14:54 [PATCH v2 00/28] kernel-doc: use a C lexical tokenizer for transforms Mauro Carvalho Chehab
` (18 preceding siblings ...)
2026-03-12 14:54 ` [PATCH v2 19/28] docs: c_lex: add support to work with pure name ids Mauro Carvalho Chehab
@ 2026-03-12 14:54 ` Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 21/28] docs: c_lex: add "@" operator Mauro Carvalho Chehab
` (10 subsequent siblings)
30 siblings, 0 replies; 47+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-12 14:54 UTC (permalink / raw)
To: Jonathan Corbet, Kees Cook, Linux Doc Mailing List
Cc: Mauro Carvalho Chehab, linux-hardening, linux-kernel,
Gustavo A. R. Silva, Aleksandr Loktionov, Randy Dunlap
CMatch is lexically correct and replaces only identifiers,
which is exactly where macro transformations happen.
Use it to make the output safer and to ensure that all arguments
are parsed the right way, even in complex cases.
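The difference matters because a plain textual replacement can clip
unrelated identifiers, while identifier-aware matching only touches the
standalone keyword. A small illustration:

```python
import re

proto = "static int static_key_count(struct static_key *key);"

# Naive substring removal corrupts other identifiers:
naive = proto.replace("static", "")

# Identifier-aware removal drops only the standalone keyword:
lexical = re.sub(r'\bstatic\b\s*', '', proto)

print(naive)    # " int _key_count(struct _key *key);" -- corrupted
print(lexical)  # "int static_key_count(struct static_key *key);"
```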
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
tools/lib/python/kdoc/xforms_lists.py | 159 +++++++++++++-------------
1 file changed, 79 insertions(+), 80 deletions(-)
diff --git a/tools/lib/python/kdoc/xforms_lists.py b/tools/lib/python/kdoc/xforms_lists.py
index 2056572852fd..c3c532c45cdc 100644
--- a/tools/lib/python/kdoc/xforms_lists.py
+++ b/tools/lib/python/kdoc/xforms_lists.py
@@ -7,7 +7,8 @@ import re
from kdoc.kdoc_re import KernRe
from kdoc.c_lex import CMatch
-struct_args_pattern = r'([^,)]+)'
+struct_args_pattern = r"([^,)]+)"
+
class CTransforms:
"""
@@ -18,48 +19,40 @@ class CTransforms:
#: Transforms for structs and unions.
struct_xforms = [
- # Strip attributes
- (KernRe(r"__attribute__\s*\(\([a-z0-9,_\*\s\(\)]*\)\)", flags=re.I | re.S, cache=False), ' '),
- (KernRe(r'\s*__aligned\s*\([^;]*\)', re.S), ' '),
- (KernRe(r'\s*__counted_by\s*\([^;]*\)', re.S), ' '),
- (KernRe(r'\s*__counted_by_(le|be)\s*\([^;]*\)', re.S), ' '),
- (KernRe(r'\s*__guarded_by\s*\([^\)]*\)', re.S), ' '),
- (KernRe(r'\s*__pt_guarded_by\s*\([^\)]*\)', re.S), ' '),
- (KernRe(r'\s*__packed\s*', re.S), ' '),
- (KernRe(r'\s*CRYPTO_MINALIGN_ATTR', re.S), ' '),
- (KernRe(r'\s*__private', re.S), ' '),
- (KernRe(r'\s*__rcu', re.S), ' '),
- (KernRe(r'\s*____cacheline_aligned_in_smp', re.S), ' '),
- (KernRe(r'\s*____cacheline_aligned', re.S), ' '),
- (KernRe(r'\s*__cacheline_group_(begin|end)\([^\)]+\);'), ''),
- (KernRe(r'__ETHTOOL_DECLARE_LINK_MODE_MASK\s*\(([^\)]+)\)', re.S),
- r'DECLARE_BITMAP(\1, __ETHTOOL_LINK_MODE_MASK_NBITS)'),
- (KernRe(r'DECLARE_PHY_INTERFACE_MASK\s*\(([^\)]+)\)', re.S),
- r'DECLARE_BITMAP(\1, PHY_INTERFACE_MODE_MAX)'),
- (KernRe(r'DECLARE_BITMAP\s*\(' + struct_args_pattern + r',\s*' + struct_args_pattern + r'\)',
- re.S), r'unsigned long \1[BITS_TO_LONGS(\2)]'),
- (KernRe(r'DECLARE_HASHTABLE\s*\(' + struct_args_pattern + r',\s*' + struct_args_pattern + r'\)',
- re.S), r'unsigned long \1[1 << ((\2) - 1)]'),
- (KernRe(r'DECLARE_KFIFO\s*\(' + struct_args_pattern + r',\s*' + struct_args_pattern +
- r',\s*' + struct_args_pattern + r'\)', re.S), r'\2 *\1'),
- (KernRe(r'DECLARE_KFIFO_PTR\s*\(' + struct_args_pattern + r',\s*' +
- struct_args_pattern + r'\)', re.S), r'\2 *\1'),
- (KernRe(r'(?:__)?DECLARE_FLEX_ARRAY\s*\(' + struct_args_pattern + r',\s*' +
- struct_args_pattern + r'\)', re.S), r'\1 \2[]'),
- (KernRe(r'DEFINE_DMA_UNMAP_ADDR\s*\(' + struct_args_pattern + r'\)', re.S), r'dma_addr_t \1'),
- (KernRe(r'DEFINE_DMA_UNMAP_LEN\s*\(' + struct_args_pattern + r'\)', re.S), r'__u32 \1'),
- (KernRe(r'VIRTIO_DECLARE_FEATURES\(([\w_]+)\)'), r'union { u64 \1; u64 \1_array[VIRTIO_FEATURES_U64S]; }'),
-
- (CMatch(r"__cond_acquires"), ""),
- (CMatch(r"__cond_releases"), ""),
- (CMatch(r"__acquires"), ""),
- (CMatch(r"__releases"), ""),
- (CMatch(r"__must_hold"), ""),
- (CMatch(r"__must_not_hold"), ""),
- (CMatch(r"__must_hold_shared"), ""),
- (CMatch(r"__cond_acquires_shared"), ""),
- (CMatch(r"__acquires_shared"), ""),
- (CMatch(r"__releases_shared"), ""),
+ (CMatch("__attribute__"), ""),
+ (CMatch("__aligned"), ""),
+ (CMatch("__counted_by"), ""),
+ (CMatch("__counted_by_(le|be)"), ""),
+ (CMatch("__guarded_by"), ""),
+ (CMatch("__pt_guarded_by"), ""),
+ (CMatch("__packed"), ""),
+ (CMatch("CRYPTO_MINALIGN_ATTR"), ""),
+ (CMatch("__private"), ""),
+ (CMatch("__rcu"), ""),
+ (CMatch("____cacheline_aligned_in_smp"), ""),
+ (CMatch("____cacheline_aligned"), ""),
+ (CMatch("__cacheline_group_(?:begin|end)"), ""),
+ (CMatch("__ETHTOOL_DECLARE_LINK_MODE_MASK"), r"DECLARE_BITMAP(\1, __ETHTOOL_LINK_MODE_MASK_NBITS)"),
+        (CMatch("DECLARE_PHY_INTERFACE_MASK"), r"DECLARE_BITMAP(\1, PHY_INTERFACE_MODE_MAX)"),
+ (CMatch("DECLARE_BITMAP"), r"unsigned long \1[BITS_TO_LONGS(\2)]"),
+ (CMatch("DECLARE_HASHTABLE"), r"unsigned long \1[1 << ((\2) - 1)]"),
+ (CMatch("DECLARE_KFIFO"), r"\2 *\1"),
+ (CMatch("DECLARE_KFIFO_PTR"), r"\2 *\1"),
+ (CMatch("(?:__)?DECLARE_FLEX_ARRAY"), r"\1 \2[]"),
+ (CMatch("DEFINE_DMA_UNMAP_ADDR"), r"dma_addr_t \1"),
+ (CMatch("DEFINE_DMA_UNMAP_LEN"), r"__u32 \1"),
+ (CMatch("VIRTIO_DECLARE_FEATURES"), r"union { u64 \1; u64 \1_array[VIRTIO_FEATURES_U64S]; }"),
+ (CMatch("__cond_acquires"), ""),
+ (CMatch("__cond_releases"), ""),
+ (CMatch("__acquires"), ""),
+ (CMatch("__releases"), ""),
+ (CMatch("__must_hold"), ""),
+ (CMatch("__must_not_hold"), ""),
+ (CMatch("__must_hold_shared"), ""),
+ (CMatch("__cond_acquires_shared"), ""),
+ (CMatch("__acquires_shared"), ""),
+ (CMatch("__releases_shared"), ""),
#
# Macro __struct_group() creates an union with an anonymous
@@ -67,51 +60,57 @@ class CTransforms:
# need one of those at kernel-doc, as we won't be documenting the same
# members twice.
#
- (CMatch('struct_group'), r'struct { \2+ };'),
- (CMatch('struct_group_attr'), r'struct { \3+ };'),
- (CMatch('struct_group_tagged'), r'struct { \3+ };'),
- (CMatch('__struct_group'), r'struct { \4+ };'),
-
+ (CMatch("struct_group"), r"struct { \2+ };"),
+ (CMatch("struct_group_attr"), r"struct { \3+ };"),
+ (CMatch("struct_group_tagged"), r"struct { \3+ };"),
+ (CMatch("__struct_group"), r"struct { \4+ };"),
]
#: Transforms for function prototypes.
function_xforms = [
- (KernRe(r"^static +"), ""),
- (KernRe(r"^extern +"), ""),
- (KernRe(r"^asmlinkage +"), ""),
- (KernRe(r"^inline +"), ""),
- (KernRe(r"^__inline__ +"), ""),
- (KernRe(r"^__inline +"), ""),
- (KernRe(r"^__always_inline +"), ""),
- (KernRe(r"^noinline +"), ""),
- (KernRe(r"^__FORTIFY_INLINE +"), ""),
- (KernRe(r"__init +"), ""),
- (KernRe(r"__init_or_module +"), ""),
- (KernRe(r"__exit +"), ""),
- (KernRe(r"__deprecated +"), ""),
- (KernRe(r"__flatten +"), ""),
- (KernRe(r"__meminit +"), ""),
- (KernRe(r"__must_check +"), ""),
- (KernRe(r"__weak +"), ""),
- (KernRe(r"__sched +"), ""),
- (KernRe(r"_noprof"), ""),
- (KernRe(r"__always_unused *"), ""),
- (KernRe(r"__printf\s*\(\s*\d*\s*,\s*\d*\s*\) +"), ""),
- (KernRe(r"__(?:re)?alloc_size\s*\(\s*\d+\s*(?:,\s*\d+\s*)?\) +"), ""),
- (KernRe(r"__diagnose_as\s*\(\s*\S+\s*(?:,\s*\d+\s*)*\) +"), ""),
- (KernRe(r"DECL_BUCKET_PARAMS\s*\(\s*(\S+)\s*,\s*(\S+)\s*\)"), r"\1, \2"),
- (KernRe(r"__no_context_analysis\s*"), ""),
- (KernRe(r"__attribute_const__ +"), ""),
- (KernRe(r"__attribute__\s*\(\((?:[\w\s]+(?:\([^)]*\))?\s*,?)+\)\)\s+"), ""),
+ (CMatch("static"), ""),
+ (CMatch("extern"), ""),
+ (CMatch("asmlinkage"), ""),
+ (CMatch("inline"), ""),
+ (CMatch("__inline__"), ""),
+ (CMatch("__inline"), ""),
+ (CMatch("__always_inline"), ""),
+ (CMatch("noinline"), ""),
+ (CMatch("__FORTIFY_INLINE"), ""),
+ (CMatch("__init"), ""),
+ (CMatch("__init_or_module"), ""),
+ (CMatch("__exit"), ""),
+ (CMatch("__deprecated"), ""),
+ (CMatch("__flatten"), ""),
+ (CMatch("__meminit"), ""),
+ (CMatch("__must_check"), ""),
+ (CMatch("__weak"), ""),
+ (CMatch("__sched"), ""),
+ (CMatch("__always_unused"), ""),
+ (CMatch("__printf"), ""),
+ (CMatch("__(?:re)?alloc_size"), ""),
+ (CMatch("__diagnose_as"), ""),
+ (CMatch("DECL_BUCKET_PARAMS"), r"\1, \2"),
+ (CMatch("__no_context_analysis"), ""),
+ (CMatch("__attribute_const__"), ""),
+ (CMatch("__attribute__"), ""),
+
+ #
+ # HACK: this is similar to process_export() hack. It is meant to
+    # drop _noprof from the function name. See for instance:
+ # ahash_request_alloc kernel-doc declaration at include/crypto/hash.h.
+ #
+ (KernRe("_noprof"), ""),
]
#: Transforms for variable prototypes.
var_xforms = [
- (KernRe(r"__read_mostly"), ""),
- (KernRe(r"__ro_after_init"), ""),
- (KernRe(r'\s*__guarded_by\s*\([^\)]*\)', re.S), ""),
- (KernRe(r'\s*__pt_guarded_by\s*\([^\)]*\)', re.S), ""),
- (KernRe(r"LIST_HEAD\(([\w_]+)\)"), r"struct list_head \1"),
+ (CMatch("__read_mostly"), ""),
+ (CMatch("__ro_after_init"), ""),
+ (CMatch("__guarded_by"), ""),
+ (CMatch("__pt_guarded_by"), ""),
+ (CMatch("LIST_HEAD"), r"struct list_head \1"),
+
(KernRe(r"(?://.*)$"), ""),
(KernRe(r"(?:/\*.*\*/)"), ""),
(KernRe(r";$"), ""),
--
2.52.0
* [PATCH v2 21/28] docs: c_lex: add "@" operator
2026-03-12 14:54 [PATCH v2 00/28] kernel-doc: use a C lexical tokenizer for transforms Mauro Carvalho Chehab
` (19 preceding siblings ...)
2026-03-12 14:54 ` [PATCH v2 20/28] docs: xforms_lists: use CMatch for all identifiers Mauro Carvalho Chehab
@ 2026-03-12 14:54 ` Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 22/28] docs: c_lex: don't exclude an extra token Mauro Carvalho Chehab
` (9 subsequent siblings)
30 siblings, 0 replies; 47+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-12 14:54 UTC (permalink / raw)
To: Jonathan Corbet, Linux Doc Mailing List
Cc: Mauro Carvalho Chehab, linux-hardening, linux-kernel
This was missing from the OP regex.
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
tools/lib/python/kdoc/c_lex.py | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/lib/python/kdoc/c_lex.py b/tools/lib/python/kdoc/c_lex.py
index 689ad64ecbe4..a61f5fe88363 100644
--- a/tools/lib/python/kdoc/c_lex.py
+++ b/tools/lib/python/kdoc/c_lex.py
@@ -100,7 +100,7 @@ TOKEN_LIST = [
(CToken.HASH, r"#"),
(CToken.OP, r"\+\+|\-\-|\->|==|\!=|<=|>=|&&|\|\||<<|>>|\+=|\-=|\*=|/=|%="
- r"|&=|\|=|\^=|=|\+|\-|\*|/|%|<|>|&|\||\^|~|!|\?|\:"),
+ r"|&=|\|=|\^=|=|\+|\-|\*|/|%|<|>|&|\||\^|~|!|\?|\:|\@"),
(CToken.STRUCT, r"\bstruct\b"),
(CToken.UNION, r"\bunion\b"),
--
2.52.0
* [PATCH v2 22/28] docs: c_lex: don't exclude an extra token
2026-03-12 14:54 [PATCH v2 00/28] kernel-doc: use a C lexical tokenizer for transforms Mauro Carvalho Chehab
` (20 preceding siblings ...)
2026-03-12 14:54 ` [PATCH v2 21/28] docs: c_lex: add "@" operator Mauro Carvalho Chehab
@ 2026-03-12 14:54 ` Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 23/28] docs: c_lex: setup a logger to report tokenizer issues Mauro Carvalho Chehab
` (8 subsequent siblings)
30 siblings, 0 replies; 47+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-12 14:54 UTC (permalink / raw)
To: Jonathan Corbet, Linux Doc Mailing List
Cc: Mauro Carvalho Chehab, linux-hardening, linux-kernel
On a simple match, replace only the match and the spaces that follow it.
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
tools/lib/python/kdoc/c_lex.py | 2 ++
1 file changed, 2 insertions(+)
diff --git a/tools/lib/python/kdoc/c_lex.py b/tools/lib/python/kdoc/c_lex.py
index a61f5fe88363..bc70b55f0dbe 100644
--- a/tools/lib/python/kdoc/c_lex.py
+++ b/tools/lib/python/kdoc/c_lex.py
@@ -486,6 +486,8 @@ class CMatch:
continue
else:
# Name only token without BEGIN/END
+ if i > start:
+ i -= 1
yield start, i
start = None
--
2.52.0
* [PATCH v2 23/28] docs: c_lex: setup a logger to report tokenizer issues
2026-03-12 14:54 [PATCH v2 00/28] kernel-doc: use a C lexical tokenizer for transforms Mauro Carvalho Chehab
` (21 preceding siblings ...)
2026-03-12 14:54 ` [PATCH v2 22/28] docs: c_lex: don't exclude an extra token Mauro Carvalho Chehab
@ 2026-03-12 14:54 ` Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 24/28] docs: unittests: add and adjust tests to check for errors Mauro Carvalho Chehab
` (7 subsequent siblings)
30 siblings, 0 replies; 47+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-12 14:54 UTC (permalink / raw)
To: Jonathan Corbet, Linux Doc Mailing List
Cc: Mauro Carvalho Chehab, linux-hardening, linux-kernel,
Aleksandr Loktionov, Randy Dunlap
Report the file that has issues detected via CMatch and CTokenizer.
This is done by setting up a logger that will be overridden by
kdoc_parser when used from it.
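The prefixing mechanism is the stock logging.LoggerAdapter pattern. A
minimal standalone version (keeping the prefix in the adapter's extra
dict, while the patch captures it in a closure) behaves like this:

```python
import logging

class PrefixAdapter(logging.LoggerAdapter):
    """Prepend a fixed prefix to every logged message."""
    def process(self, msg, kwargs):
        return f"{self.extra['prefix']}{msg}", kwargs

logger = logging.getLogger("demo")
adapter = PrefixAdapter(logger, {"prefix": "file.c: CMatch: "})

# process() is what the logging machinery calls on every message:
msg, _ = adapter.process("unexpected token", {})
print(msg)  # file.c: CMatch: unexpected token
```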
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
tools/lib/python/kdoc/c_lex.py | 36 ++++++++++++++++++++++++----
tools/lib/python/kdoc/kdoc_parser.py | 3 +++
2 files changed, 34 insertions(+), 5 deletions(-)
diff --git a/tools/lib/python/kdoc/c_lex.py b/tools/lib/python/kdoc/c_lex.py
index bc70b55f0dbe..596510bb4e95 100644
--- a/tools/lib/python/kdoc/c_lex.py
+++ b/tools/lib/python/kdoc/c_lex.py
@@ -6,14 +6,39 @@
Regular expression ancillary classes.
Those help caching regular expressions and do matching for kernel-doc.
+
+Please notice that the code here may raise exceptions to indicate bad
+usage inside kdoc, i.e. problems in the replace pattern.
+
+Other errors are logged via the log instance.
"""
+import logging
import re
from copy import copy
from .kdoc_re import KernRe
+log = logging.getLogger(__name__)
+
+def tokenizer_set_log(logger, prefix = ""):
+ """
+ Replace the module‑level logger with a LoggerAdapter that
+ prepends *prefix* to every message.
+ """
+ global log
+
+ class PrefixAdapter(logging.LoggerAdapter):
+ """
+ Ancillary class to set prefix on all message logs.
+ """
+ def process(self, msg, kwargs):
+ return f"{prefix}{msg}", kwargs
+
+ # Wrap the provided logger in our adapter
+ log = PrefixAdapter(logger, {"prefix": prefix})
+
class CToken():
"""
Data class to define a C token.
@@ -169,7 +194,7 @@ class CTokenizer():
value = match.group()
if kind == CToken.MISMATCH:
- raise RuntimeError(f"Unexpected token '{value}' on {pos}:\n\t{source}")
+ log.error(f"Unexpected token '{value}' on {pos}:\n\t{source}")
elif kind == CToken.BEGIN:
if value == '(':
paren_level += 1
@@ -189,7 +214,7 @@ class CTokenizer():
yield CToken(kind, value, pos,
brace_level, paren_level, bracket_level)
- def __init__(self, source=None):
+ def __init__(self, source=None, log=None):
"""
Create a regular expression to handle TOKEN_LIST.
@@ -349,7 +374,7 @@ class CTokenArgs:
elif tok.value == "(":
delim = ","
else:
- raise ValueError(fr"Can't handle \1..\n on {sub_str}")
+                    log.error(fr"Can't handle \1..\n on {sub_str}")
level = tok.level
break
@@ -383,7 +408,7 @@ class CTokenArgs:
groups_list[pos].append(tok)
if pos < self.max_group:
- raise ValueError(fr"{self.sub_str} groups are up to {pos} instead of {self.max_group}")
+ log.error(fr"{self.sub_str} groups are up to {pos} instead of {self.max_group}")
return level, groups_list
@@ -503,7 +528,8 @@ class CMatch:
# picking an incomplete block.
#
if start and stack:
- print("WARNING: can't find an end", file=sys.stderr)
+ s = str(tokenizer)
+ log.warning(f"can't find a final end at {s}")
yield start, len(tokenizer.tokens)
def search(self, source):
diff --git a/tools/lib/python/kdoc/kdoc_parser.py b/tools/lib/python/kdoc/kdoc_parser.py
index 0da95b090a34..3ff17b07c1c9 100644
--- a/tools/lib/python/kdoc/kdoc_parser.py
+++ b/tools/lib/python/kdoc/kdoc_parser.py
@@ -14,6 +14,7 @@ import re
from pprint import pformat
from kdoc.kdoc_re import KernRe
+from kdoc.c_lex import tokenizer_set_log
from kdoc.c_lex import CTokenizer
from kdoc.kdoc_item import KdocItem
@@ -253,6 +254,8 @@ class KernelDoc:
self.config = config
self.xforms = xforms
+ tokenizer_set_log(self.config.log, f"{self.fname}: CMatch: ")
+
# Initial state for the state machines
self.state = state.NORMAL
--
2.52.0
* [PATCH v2 24/28] docs: unittests: add and adjust tests to check for errors
2026-03-12 14:54 [PATCH v2 00/28] kernel-doc: use a C lexical tokenizer for transforms Mauro Carvalho Chehab
` (22 preceding siblings ...)
2026-03-12 14:54 ` [PATCH v2 23/28] docs: c_lex: setup a logger to report tokenizer issues Mauro Carvalho Chehab
@ 2026-03-12 14:54 ` Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 25/28] docs: c_lex: better handle BEGIN/END at search Mauro Carvalho Chehab
` (6 subsequent siblings)
30 siblings, 0 replies; 47+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-12 14:54 UTC (permalink / raw)
To: Jonathan Corbet, Linux Doc Mailing List
Cc: Mauro Carvalho Chehab, linux-hardening, linux-kernel
Test the errors that are raised and the ones that are logged.
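The two mechanisms under test map to two standard unittest idioms,
roughly as in this toy stand-in (tokenize() below is hypothetical, not
the series' code):

```python
import logging
import unittest

log = logging.getLogger("kdoc.demo")

def tokenize(source):
    # Toy stand-in: parsing problems are logged, while misuse of the
    # API itself still raises an exception.
    if source is None:
        raise ValueError("no source given")
    if "$" in source:
        log.error("Unexpected token '$' on:\n\t%s", source)

class TestErrors(unittest.TestCase):
    def test_logged(self):
        with self.assertLogs("kdoc.demo", level="ERROR"):
            tokenize("int a$ = 5;")

    def test_raised(self):
        with self.assertRaises(ValueError):
            tokenize(None)

suite = unittest.TestLoader().loadTestsFromTestCase(TestErrors)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())  # True
```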
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
tools/lib/python/kdoc/c_lex.py | 2 +-
tools/unittests/test_cmatch.py | 15 ++++++++++++++-
tools/unittests/test_tokenizer.py | 11 ++++++-----
3 files changed, 21 insertions(+), 7 deletions(-)
diff --git a/tools/lib/python/kdoc/c_lex.py b/tools/lib/python/kdoc/c_lex.py
index 596510bb4e95..8beac59166fc 100644
--- a/tools/lib/python/kdoc/c_lex.py
+++ b/tools/lib/python/kdoc/c_lex.py
@@ -194,7 +194,7 @@ class CTokenizer():
value = match.group()
if kind == CToken.MISMATCH:
- log.error(f"Unexpected token '{value}' on {pos}:\n\t{source}")
+ log.error(f"Unexpected token '{value}' on pos {pos}:\n\t'{source}'")
elif kind == CToken.BEGIN:
if value == '(':
paren_level += 1
diff --git a/tools/unittests/test_cmatch.py b/tools/unittests/test_cmatch.py
index f6ccd2a942f1..3fbc5d3bc244 100755
--- a/tools/unittests/test_cmatch.py
+++ b/tools/unittests/test_cmatch.py
@@ -288,6 +288,19 @@ class TestSubSimple(TestCaseDiff):
self.assertLogicallyEqual(result, "int foo;")
+ def test_rise_early_greedy(self):
+ line = f"{self.MACRO}(a, b, c, d);"
+ sub = r"\1, \2+, \3"
+
+ with self.assertRaises(ValueError):
+ result = self.matcher.sub(sub, line)
+
+ def test_rise_multiple_greedy(self):
+ line = f"{self.MACRO}(a, b, c, d);"
+ sub = r"\1, \2+, \3+"
+
+ with self.assertRaises(ValueError):
+ result = self.matcher.sub(sub, line)
#
# Test replacements with slashrefs
@@ -539,7 +552,7 @@ class TestSubWithLocalXforms(TestCaseDiff):
self.assertLogicallyEqual(result, expected)
def test_raw_struct_group_tagged(self):
- """
+ r"""
Test cxl_regs with struct_group_tagged patterns from drivers/cxl/cxl.h.
NOTE:
diff --git a/tools/unittests/test_tokenizer.py b/tools/unittests/test_tokenizer.py
index 3081f27a7786..6a0bd49df72e 100755
--- a/tools/unittests/test_tokenizer.py
+++ b/tools/unittests/test_tokenizer.py
@@ -44,11 +44,12 @@ def make_tokenizer_test(name, data):
"""In-lined lambda-like function to run the test"""
#
- # Check if exceptions are properly handled
+ # Check if logger is working
#
- if "raises" in data:
- with self.assertRaises(data["raises"]):
- CTokenizer(data["source"])
+ if "log_level" in data:
+ with self.assertLogs('kdoc.c_lex', level='ERROR') as cm:
+ tokenizer = CTokenizer(data["source"])
+
return
#
@@ -123,7 +124,7 @@ TESTS_TOKENIZER = {
"mismatch_error": {
"source": "int a$ = 5;", # $ is illegal
- "raises": RuntimeError,
+ "log_level": "ERROR",
},
}
--
2.52.0
* [PATCH v2 25/28] docs: c_lex: better handle BEGIN/END at search
2026-03-12 14:54 [PATCH v2 00/28] kernel-doc: use a C lexical tokenizer for transforms Mauro Carvalho Chehab
` (23 preceding siblings ...)
2026-03-12 14:54 ` [PATCH v2 24/28] docs: unittests: add and adjust tests to check for errors Mauro Carvalho Chehab
@ 2026-03-12 14:54 ` Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 26/28] docs: kernel-doc.rst: document private: scope propagation Mauro Carvalho Chehab
` (5 subsequent siblings)
30 siblings, 0 replies; 47+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-12 14:54 UTC (permalink / raw)
To: Jonathan Corbet, Linux Doc Mailing List
Cc: Mauro Carvalho Chehab, linux-hardening, linux-kernel
Currently, the logic emits warnings after finishing the parsing
of a prototype like:
static inline unsigned \
read_seqretry(const seqlock_t *sl, unsigned start)
__releases_shared(sl) __no_context_analysis
The problem is that the last CMatch there doesn't have BEGIN/END,
which is expected.
Make the logic stricter by:
- ensure that BEGIN/END there refers to function-like calls,
e.g. foo(...);
- only emit a warning after BEGIN is detected.
Instead of hardcoding the "(" delimiter, let the caller specify
a different one when required.
While here, remove an unneeded else branch.
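The distinction being enforced can be illustrated outside the
tokenizer. split_annotations() is a hypothetical regex-based sketch
(the real code walks tokens and tracks nesting levels):

```python
import re

def split_annotations(proto):
    # Find trailing annotations and tell function-like ones
    # (followed by "(...)") from bare names.
    out = []
    for m in re.finditer(r'\b(__\w+)\s*(\([^)]*\))?', proto):
        out.append((m.group(1), "call" if m.group(2) else "name"))
    return out

proto = "unsigned read_seqretry(const seqlock_t *sl, unsigned start) " \
        "__releases_shared(sl) __no_context_analysis"
print(split_annotations(proto))
# [('__releases_shared', 'call'), ('__no_context_analysis', 'name')]
```

Only the "call" form has a BEGIN/END pair to balance; a bare "name" at
the end of a prototype is legitimate and must not trigger a warning.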
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
tools/lib/python/kdoc/c_lex.py | 23 +++++++++++++----------
1 file changed, 13 insertions(+), 10 deletions(-)
diff --git a/tools/lib/python/kdoc/c_lex.py b/tools/lib/python/kdoc/c_lex.py
index 8beac59166fc..e641bace5d69 100644
--- a/tools/lib/python/kdoc/c_lex.py
+++ b/tools/lib/python/kdoc/c_lex.py
@@ -463,8 +463,9 @@ class CMatch:
"""
- def __init__(self, regex):
+ def __init__(self, regex, delim="("):
self.regex = KernRe("^" + regex + r"\b")
+ self.start_delim = delim
def _search(self, tokenizer):
"""
@@ -506,15 +507,15 @@ class CMatch:
if tok.kind == CToken.SPACE:
continue
- if tok.kind == CToken.BEGIN:
+ if tok.kind == CToken.BEGIN and tok.value == self.start_delim:
started = True
continue
- else:
- # Name only token without BEGIN/END
- if i > start:
- i -= 1
- yield start, i
- start = None
+
+ # Name only token without BEGIN/END
+ if i > start:
+ i -= 1
+ yield start, i
+ start = None
if tok.kind == CToken.END and tok.level == stack[-1][1]:
start, level = stack.pop()
@@ -528,8 +529,10 @@ class CMatch:
# picking an incomplete block.
#
if start and stack:
- s = str(tokenizer)
- log.warning(f"can't find a final end at {s}")
+ if started:
+ s = str(tokenizer)
+ log.warning(f"can't find a final end at {s}")
+
yield start, len(tokenizer.tokens)
def search(self, source):
--
2.52.0
* [PATCH v2 26/28] docs: kernel-doc.rst: document private: scope propagation
2026-03-12 14:54 [PATCH v2 00/28] kernel-doc: use a C lexical tokenizer for transforms Mauro Carvalho Chehab
` (24 preceding siblings ...)
2026-03-12 14:54 ` [PATCH v2 25/28] docs: c_lex: better handle BEGIN/END at search Mauro Carvalho Chehab
@ 2026-03-12 14:54 ` Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 27/28] docs: c_lex: produce a cleaner str() representation Mauro Carvalho Chehab
` (4 subsequent siblings)
30 siblings, 0 replies; 47+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-12 14:54 UTC (permalink / raw)
To: Jonathan Corbet, Linux Doc Mailing List
Cc: Mauro Carvalho Chehab, linux-hardening, linux-kernel,
Randy Dunlap, Shuah Khan, Vincent Mailhol
This was undefined behavior, but at least one place used private:
inside a nested struct, expecting it not to propagate outside of it.
Kernel-doc now defines how this is propagated, so document that.
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
Documentation/doc-guide/kernel-doc.rst | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/Documentation/doc-guide/kernel-doc.rst b/Documentation/doc-guide/kernel-doc.rst
index 8d2c09fb36e4..1c148fe8e1f9 100644
--- a/Documentation/doc-guide/kernel-doc.rst
+++ b/Documentation/doc-guide/kernel-doc.rst
@@ -213,6 +213,10 @@ The ``private:`` and ``public:`` tags must begin immediately following a
``/*`` comment marker. They may optionally include comments between the
``:`` and the ending ``*/`` marker.
+When ``private:`` is used on nested structs, it propagates only to inner
+structs/unions.
+
+
Example::
/**
@@ -256,8 +260,10 @@ It is possible to document nested structs and unions, like::
union {
struct {
int memb1;
+ /* private: hides memb2 from documentation */
int memb2;
};
+ /* Everything here is public again, as private scope finished */
struct {
void *memb3;
int memb4;
--
2.52.0
* [PATCH v2 27/28] docs: c_lex: produce a cleaner str() representation
2026-03-12 14:54 [PATCH v2 00/28] kernel-doc: use a C lexical tokenizer for transforms Mauro Carvalho Chehab
` (25 preceding siblings ...)
2026-03-12 14:54 ` [PATCH v2 26/28] docs: kernel-doc.rst: document private: scope propagation Mauro Carvalho Chehab
@ 2026-03-12 14:54 ` Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 28/28] unittests: test_cmatch: remove weird stuff from expected results Mauro Carvalho Chehab
` (3 subsequent siblings)
30 siblings, 0 replies; 47+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-12 14:54 UTC (permalink / raw)
To: Jonathan Corbet, Linux Doc Mailing List
Cc: Mauro Carvalho Chehab, linux-hardening, linux-kernel
Avoid emitting whitespace before ";" and duplicated ";" characters
in the output after converting tokens back to a string.
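A string-level approximation of the token-level cleanup (the real code
inspects CToken kinds rather than raw text; tidy() is a hypothetical
helper):

```python
import re

def tidy(decl):
    # Drop whitespace before ";" and collapse a doubled ";".
    decl = re.sub(r'\s+;', ';', decl)
    decl = re.sub(r';\s*;', ';', decl)
    return decl

print(tidy("} component; ;"))   # "} component;"
print(tidy("int seg ;"))        # "int seg;"
```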
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
tools/lib/python/kdoc/c_lex.py | 27 ++++++++++++++++++++++++---
1 file changed, 24 insertions(+), 3 deletions(-)
diff --git a/tools/lib/python/kdoc/c_lex.py b/tools/lib/python/kdoc/c_lex.py
index e641bace5d69..95c4dd5afe77 100644
--- a/tools/lib/python/kdoc/c_lex.py
+++ b/tools/lib/python/kdoc/c_lex.py
@@ -241,7 +241,7 @@ class CTokenizer():
out=""
show_stack = [True]
- for tok in self.tokens:
+ for i, tok in enumerate(self.tokens):
if tok.kind == CToken.BEGIN:
show_stack.append(show_stack[-1])
@@ -270,8 +270,29 @@ class CTokenizer():
continue
- if show_stack[-1]:
- out += str(tok.value)
+ if not show_stack[-1]:
+ continue
+
+ if i < len(self.tokens) - 1:
+ next_tok = self.tokens[i + 1]
+
+ # Do some cleanups before ";"
+
+ if (tok.kind == CToken.SPACE and
+ next_tok.kind == CToken.PUNC and
+ next_tok.value == ";"):
+
+ continue
+
+            if (tok.kind == CToken.PUNC and
+                tok.value == ";" and
+                next_tok.kind == CToken.PUNC and
+                next_tok.value == ";"):
+
+ continue
+
+ out += str(tok.value)
return out
--
2.52.0
* [PATCH v2 28/28] unittests: test_cmatch: remove weird stuff from expected results
2026-03-12 14:54 [PATCH v2 00/28] kernel-doc: use a C lexical tokenizer for transforms Mauro Carvalho Chehab
` (26 preceding siblings ...)
2026-03-12 14:54 ` [PATCH v2 27/28] docs: c_lex: produce a cleaner str() representation Mauro Carvalho Chehab
@ 2026-03-12 14:54 ` Mauro Carvalho Chehab
2026-03-13 8:34 ` [PATCH v2 29/28] docs: kdoc: ensure that comments are dropped before calling split_struct_proto() Mauro Carvalho Chehab
` (2 subsequent siblings)
30 siblings, 0 replies; 47+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-12 14:54 UTC (permalink / raw)
To: Jonathan Corbet, Linux Doc Mailing List
Cc: Mauro Carvalho Chehab, linux-hardening, linux-kernel
Now that c_lex produces cleaner output, change the expected
results so they no longer have duplicated ";" characters or
whitespace just before them.
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
tools/unittests/test_cmatch.py | 18 +++++++-----------
1 file changed, 7 insertions(+), 11 deletions(-)
diff --git a/tools/unittests/test_cmatch.py b/tools/unittests/test_cmatch.py
index 3fbc5d3bc244..7b996f83784d 100755
--- a/tools/unittests/test_cmatch.py
+++ b/tools/unittests/test_cmatch.py
@@ -416,7 +416,6 @@ class TestSubWithLocalXforms(TestCaseDiff):
struct tx_pkt_info {
struct tx_sop_header sop;
struct tx_segment_header seg;
- ;
struct tx_eop_header eop;
u16 pkt_len;
u16 seq_num;
@@ -490,7 +489,6 @@ class TestSubWithLocalXforms(TestCaseDiff):
__le64 LastWriteTime;
__le64 ChangeTime;
__le32 FileAttributes;
- ;
__le64 AllocationSize;
__le64 EndOfFile;
__le16 FileType;
@@ -504,7 +502,6 @@ class TestSubWithLocalXforms(TestCaseDiff):
__le64 LastWriteTime;
__le64 ChangeTime;
__le32 Attributes;
- ;
__u32 Pad1;
__le64 AllocationSize;
__le64 EndOfFile;
@@ -543,7 +540,6 @@ class TestSubWithLocalXforms(TestCaseDiff):
__le16 num_entries;
__le16 supported_feats;
__u8 reserved[4];
- ;
struct cxl_feat_entry ents[];
};
"""
@@ -605,23 +601,23 @@ class TestSubWithLocalXforms(TestCaseDiff):
struct cxl_component_regs {
void __iomem *hdm_decoder;
void __iomem *ras;
- } component;;
+ } component;
struct cxl_device_regs {
void __iomem *status, *mbox, *memdev;
- } device_regs;;
+ } device_regs;
struct cxl_pmu_regs {
void __iomem *pmu;
- } pmu_regs;;
+ } pmu_regs;
struct cxl_rch_regs {
void __iomem *dport_aer;
- } rch_regs;;
+ } rch_regs;
struct cxl_rcd_regs {
void __iomem *rcd_pcie_cap;
- } rcd_regs;;
+ } rcd_regs;
};
"""
@@ -667,7 +663,7 @@ class TestSubWithLocalXforms(TestCaseDiff):
struct net_device *netdev;
unsigned int queue_idx;
unsigned int flags;
- } slow;;
+ } slow;
struct page_pool_params_fast {
unsigned int order;
unsigned int pool_size;
@@ -677,7 +673,7 @@ class TestSubWithLocalXforms(TestCaseDiff):
enum dma_data_direction dma_dir;
unsigned int max_len;
unsigned int offset;
- } fast;;
+ } fast;
};
"""
--
2.52.0
^ permalink raw reply related [flat|nested] 47+ messages in thread
* [PATCH v2 29/28] docs: kdoc: ensure that comments are dropped before calling split_struct_proto()
2026-03-12 14:54 [PATCH v2 00/28] kernel-doc: use a C lexical tokenizer for transforms Mauro Carvalho Chehab
` (27 preceding siblings ...)
2026-03-12 14:54 ` [PATCH v2 28/28] unittests: test_cmatch: remove weird stuff from expected results Mauro Carvalho Chehab
@ 2026-03-13 8:34 ` Mauro Carvalho Chehab
2026-03-13 8:34 ` [PATCH v2 30/28] docs: kdoc_parser: avoid tokenizing structs every time Mauro Carvalho Chehab
2026-03-13 11:05 ` [PATCH v2 29/28] docs: kdoc: ensure that comments are dropped before calling split_struct_proto() Loktionov, Aleksandr
2026-03-13 9:17 ` [PATCH v2 00/28] kernel-doc: use a C lexical tokenizer for transforms Mauro Carvalho Chehab
2026-03-17 17:12 ` Jonathan Corbet
30 siblings, 2 replies; 47+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-13 8:34 UTC (permalink / raw)
To: Jonathan Corbet, Linux Doc Mailing List
Cc: Mauro Carvalho Chehab, linux-kernel, Aleksandr Loktionov,
Mauro Carvalho Chehab, Randy Dunlap
Changeset 2b957decdb6c ("docs: kdoc: don't add broken comments inside prototypes")
revealed a hidden bug in split_struct_proto(): some comments may break
its ability to properly identify a struct.
Fixing it is as simple as stripping comments before calling it.
Fixes: 2b957decdb6c ("docs: kdoc: don't add broken comments inside prototypes")
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
tools/lib/python/kdoc/kdoc_parser.py | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/tools/lib/python/kdoc/kdoc_parser.py b/tools/lib/python/kdoc/kdoc_parser.py
index 3ff17b07c1c9..ed378edb1e05 100644
--- a/tools/lib/python/kdoc/kdoc_parser.py
+++ b/tools/lib/python/kdoc/kdoc_parser.py
@@ -724,6 +724,7 @@ class KernelDoc:
#
# Do the basic parse to get the pieces of the declaration.
#
+ proto = trim_private_members(proto)
struct_parts = self.split_struct_proto(proto)
if not struct_parts:
self.emit_msg(ln, f"{proto} error: Cannot parse struct or union!")
@@ -764,6 +765,7 @@ class KernelDoc:
# Strip preprocessor directives. Note that this depends on the
# trailing semicolon we added in process_proto_type().
#
+ proto = trim_private_members(proto)
proto = KernRe(r'#\s*((define|ifdef|if)\s+|endif)[^;]*;', flags=re.S).sub('', proto)
#
# Parse out the name and members of the enum. Typedef form first.
@@ -771,7 +773,7 @@ class KernelDoc:
r = KernRe(r'typedef\s+enum\s*\{(.*)\}\s*(\w*)\s*;')
if r.search(proto):
declaration_name = r.group(2)
- members = trim_private_members(r.group(1))
+ members = r.group(1)
#
# Failing that, look for a straight enum
#
@@ -779,7 +781,7 @@ class KernelDoc:
r = KernRe(r'enum\s+(\w*)\s*\{(.*)\}')
if r.match(proto):
declaration_name = r.group(1)
- members = trim_private_members(r.group(2))
+ members = r.group(2)
#
# OK, this isn't going to work.
#
--
2.53.0
^ permalink raw reply related [flat|nested] 47+ messages in thread
* [PATCH v2 30/28] docs: kdoc_parser: avoid tokenizing structs every time
2026-03-13 8:34 ` [PATCH v2 29/28] docs: kdoc: ensure that comments are dropped before calling split_struct_proto() Mauro Carvalho Chehab
@ 2026-03-13 8:34 ` Mauro Carvalho Chehab
2026-03-13 11:05 ` Loktionov, Aleksandr
2026-03-13 11:05 ` [PATCH v2 29/28] docs: kdoc: ensure that comments are dropped before calling split_struct_proto() Loktionov, Aleksandr
1 sibling, 1 reply; 47+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-13 8:34 UTC (permalink / raw)
To: Jonathan Corbet, Linux Doc Mailing List
Cc: Mauro Carvalho Chehab, linux-kernel, Aleksandr Loktionov,
Mauro Carvalho Chehab, Randy Dunlap
Most of the rules inside CTransforms are of the type CMatch.
Don't re-parse the source code every time.
Doing this doesn't change the output, but makes kdoc almost
as fast as before the tokenizer patches:
# Before tokenizer patches
$ time ./scripts/kernel-doc . -man >original 2>&1
real 0m42.933s
user 0m36.523s
sys 0m1.145s
# After tokenizer patches
$ time ./scripts/kernel-doc . -man >before 2>&1
real 1m29.853s
user 1m23.974s
sys 0m1.237s
# After this patch
$ time ./scripts/kernel-doc . -man >after 2>&1
real 0m48.579s
user 0m45.938s
sys 0m0.988s
$ diff -s before after
Files before and after are identical
Manually checked the differences between original and after
with:
$ diff -U0 -prBw original after|grep -v Warning|grep -v "@@"|less
They're due to:
- whitespace fixes;
- struct_group macros are now better handled;
- several badly-generated man pages from broken inline kernel-doc
markups are now fixed.
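The control flow of the optimization can be modeled independently of the kernel tree. In this sketch, TokenizedSource, TokenRule and RegexRule are hypothetical stand-ins for CTokenizer, CMatch and KernRe; the point is only to show the pattern of tokenizing once and flattening back to a string only when a plain-regex rule needs one:

```python
import re

class TokenizedSource:
    """Stand-in for CTokenizer: a trivial whitespace 'lexer'."""
    def __init__(self, text):
        self.tokens = text.split()          # stand-in for real C lexing
    def __str__(self):
        return " ".join(self.tokens)

class TokenRule:
    """Rule that works directly on the token list (like CMatch)."""
    def __init__(self, match, repl):
        self.match, self.repl = match, repl
    def sub(self, source):
        source.tokens = [self.repl if t == self.match else t
                         for t in source.tokens]
        return source

class RegexRule:
    """Rule that only understands strings (like KernRe)."""
    def __init__(self, pattern, repl):
        self.regex, self.repl = re.compile(pattern), repl
    def sub(self, text):
        return self.regex.sub(self.repl, text)

def apply(rules, source):
    source = TokenizedSource(source)        # tokenize once, up front
    for rule in rules:
        if isinstance(rule, RegexRule):     # flatten only when needed
            source = TokenizedSource(rule.sub(str(source)))
        else:
            source = rule.sub(source)
    return str(source)
```

This also shows why the NOTE added in xforms_lists.py suggests placing CMatch rules before KernRe ones: every regex rule forces a flatten/re-tokenize round trip.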
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
tools/lib/python/kdoc/kdoc_parser.py | 1 -
tools/lib/python/kdoc/xforms_lists.py | 30 +++++++++++++++++++++------
2 files changed, 24 insertions(+), 7 deletions(-)
diff --git a/tools/lib/python/kdoc/kdoc_parser.py b/tools/lib/python/kdoc/kdoc_parser.py
index ed378edb1e05..3b99740ebed3 100644
--- a/tools/lib/python/kdoc/kdoc_parser.py
+++ b/tools/lib/python/kdoc/kdoc_parser.py
@@ -738,7 +738,6 @@ class KernelDoc:
#
# Go through the list of members applying all of our transformations.
#
- members = trim_private_members(members)
members = self.xforms.apply("struct", members)
#
diff --git a/tools/lib/python/kdoc/xforms_lists.py b/tools/lib/python/kdoc/xforms_lists.py
index c3c532c45cdc..f6ea9efb11ae 100644
--- a/tools/lib/python/kdoc/xforms_lists.py
+++ b/tools/lib/python/kdoc/xforms_lists.py
@@ -5,7 +5,7 @@
import re
from kdoc.kdoc_re import KernRe
-from kdoc.c_lex import CMatch
+from kdoc.c_lex import CMatch, CTokenizer
struct_args_pattern = r"([^,)]+)"
@@ -17,6 +17,12 @@ class CTransforms:
into something we can parse and generate kdoc for.
"""
+ #
+ # NOTE:
+ # Due to performance reasons, place CMatch rules before KernRe,
+ # as this avoids running the C parser every time.
+ #
+
#: Transforms for structs and unions.
struct_xforms = [
(CMatch("__attribute__"), ""),
@@ -123,13 +129,25 @@ class CTransforms:
"var": var_xforms,
}
- def apply(self, xforms_type, text):
+ def apply(self, xforms_type, source):
"""
- Apply a set of transforms to a block of text.
+ Apply a set of transforms to a block of source.
+
+ As the tokenizer is used here, this function also removes comments
+ at the end.
"""
if xforms_type not in self.xforms:
- return text
+ return source
+
+ if isinstance(source, str):
+ source = CTokenizer(source)
for search, subst in self.xforms[xforms_type]:
- text = search.sub(subst, text)
- return text
+ #
+ # KernRe only accepts strings.
+ #
+ if isinstance(search, KernRe):
+ source = str(source)
+
+ source = search.sub(subst, source)
+ return str(source)
--
2.53.0
^ permalink raw reply related [flat|nested] 47+ messages in thread
* Re: [PATCH v2 00/28] kernel-doc: use a C lexical tokenizer for transforms
2026-03-12 14:54 [PATCH v2 00/28] kernel-doc: use a C lexical tokenizer for transforms Mauro Carvalho Chehab
` (28 preceding siblings ...)
2026-03-13 8:34 ` [PATCH v2 29/28] docs: kdoc: ensure that comments are dropped before calling split_struct_proto() Mauro Carvalho Chehab
@ 2026-03-13 9:17 ` Mauro Carvalho Chehab
2026-03-17 17:12 ` Jonathan Corbet
30 siblings, 0 replies; 47+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-13 9:17 UTC (permalink / raw)
To: Jonathan Corbet, Mauro Carvalho Chehab
Cc: Kees Cook, linux-doc, linux-hardening, linux-kernel,
Gustavo A. R. Silva, Aleksandr Loktionov, Randy Dunlap,
Shuah Khan, Vincent Mailhol
Hi Jon,
On Thu, 12 Mar 2026 15:54:20 +0100
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> Also, I didn't notice any relevant change on the documentation build
> time.
After more tests, I actually noticed an issue after this changeset:
https://lore.kernel.org/linux-doc/2b957decdb6cedab4268f71a166c25b7abdb9a61.1773326442.git.mchehab+huawei@kernel.org/
Basically, a broken kernel-doc like this:
/**
* enum dmub_abm_ace_curve_type - ACE curve type.
*/
enum dmub_abm_ace_curve_type {
/**
* ACE curve as defined by the SW layer.
*/
ABM_ACE_CURVE_TYPE__SW = 0,
/**
* ACE curve as defined by the SW to HW translation interface layer.
*/
ABM_ACE_CURVE_TYPE__SW_IF = 1,
};
where the inlined markups don't have a "@symbol", doesn't parse well.
If you run the current kernel-doc, it produces:
.. c:enum:: dmub_abm_ace_curve_type
ACE curve type.
.. container:: kernelindent
**Constants**
``*/ ABM_ACE_CURVE_TYPE__SW = 0``
*undescribed*
`` */ ABM_ACE_CURVE_TYPE__SW_IF = 1``
*undescribed*
That happens because kernel-doc currently drops the "/**" line. My fix
patch above addresses it, but inlined comments still confuse
enum/struct detection. To avoid that, we need to strip comments
earlier, at dump_struct and dump_enum:
https://lore.kernel.org/linux-doc/d112804ace83e0ad8496f687977596bb7f091560.1773390831.git.mchehab+huawei@kernel.org/T/#u
After such fix, the output is now:
.. c:enum:: dmub_abm_ace_curve_type
ACE curve type.
.. container:: kernelindent
**Constants**
``ABM_ACE_CURVE_TYPE__SW``
*undescribed*
``ABM_ACE_CURVE_TYPE__SW_IF``
*undescribed*
which is the expected result when there are no proper inlined
kernel-doc markups.
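The comment-stripping step itself can be sketched with a plain regex pass. This is illustrative only (the series does it via the tokenizer, which also understands strings and nesting, so comment-lookalikes inside string literals don't get mangled):

```python
import re

# Remove /* ... */ block comments (including kernel-doc /** ... */)
# and // line comments before handing the prototype to the parser.
RE_C_COMMENTS = re.compile(r"/\*.*?\*/|//[^\n]*", re.S)

def strip_comments(proto):
    return RE_C_COMMENTS.sub("", proto)

proto = ("enum dmub_abm_ace_curve_type {\n"
         "\t/**\n\t * ACE curve as defined by the SW layer.\n\t */\n"
         "\tABM_ACE_CURVE_TYPE__SW = 0,\n"
         "};")
```

With the inline comments gone, the enum regexes can match the prototype without the stray "*/" fragments leaking into the constant names.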
Due to this issue, I ended up adding a 29/28 patch to this series.
> With that regards, right now, every time a CMatch replacement
> rule takes in place, it does:
>
> for each transform:
> - tokenizes the source code;
> - handle CMatch;
> - convert tokens back to a string.
>
> A possible optimization would be to do, instead:
>
> - tokenizes source code;
> - for each transform handle CMatch;
> - convert tokens back to a string.
>
> For now, I opted not do do it, because:
>
> - too much changes on a single row;
> - docs build time is taking ~3:30 minutes, which is
> about the same time it ws taken before the changes;
> - there is a very dirty hack inside function_xforms:
> (KernRe(r"_noprof"), ""). This is meant to change
> function prototypes instead of function arguments.
>
> So, if ok for you, I would prefer to merge this one first. We can later
> optimize kdoc_parser to avoid multiple token <-> string conversions.
I did such an optimization and it worked fine. So, I ended up adding
a 30/28 patch at the end. With that, running kernel-doc before/after
the entire series shows no significant performance changes.
# Current approach
$ time ./scripts/kernel-doc . -man >original 2>&1
real 0m37.344s
user 0m36.447s
sys 0m0.712s
# Tokenizer running multiple times (patch 29)
$ time ./scripts/kernel-doc . -man >before 2>&1
real 1m32.427s
user 1m25.377s
sys 0m1.293s
# After optimization (patch 30)
$ time ./scripts/kernel-doc . -man >after 2>&1
real 0m47.094s
user 0m46.106s
sys 0m0.751s
That is 10 seconds slower than before when parsing everything, which
affects make mandocs, but the extra time spent in the kernel-doc parser
during make htmldocs is minimal: it is about ~4 seconds(*):
$ run_kdoc.py -none 2>/dev/null
Checking what files are currently used on documentation...
Running kernel-doc
Elapsed time: 0:00:04.348008
(*) the slowest logic when building docs with Sphinx is inside its
RST parser code.
See the enclosed script for how I measured the parsing time of the
existing ".. kernel-doc::" markups inside Documentation.
Thanks,
Mauro
---
This is the run_kdoc.py script I'm using here to pick the same files
as make htmldocs does:
#!/bin/env python3
import os
import re
import subprocess
import sys
from datetime import datetime
from glob import glob
print("Checking what files are currently used on documentation...")
kdoc_files = set()
re_kernel_doc = re.compile(r"^\.\.\s+kernel-doc::\s*(\S+)")
for fname in glob(os.path.join(".", "**"), recursive=True):
if os.path.isfile(fname) and fname.endswith(".rst"):
with open(fname, "r", encoding="utf-8") as in_fp:
data = in_fp.read()
for line in data.split("\n"):
match = re_kernel_doc.match(line)
if match:
if os.path.isfile(match.group(1)):
kdoc_files.add(match.group(1))
if not kdoc_files:
sys.exit("Directory doesn't contain kernel-doc tags")
cmd = [ "./tools/docs/kernel-doc" ]
cmd += sys.argv[1:]
cmd += sorted(kdoc_files)
print("Running kernel-doc")
start_time = datetime.now()
try:
result = subprocess.run(cmd, check=True)
except subprocess.CalledProcessError as e:
print(f"kernel-doc failed: {repr(e)}")
elapsed = datetime.now() - start_time
print(f"\nElapsed time: {elapsed}")
^ permalink raw reply [flat|nested] 47+ messages in thread
* RE: [PATCH v2 30/28] docs: kdoc_parser: avoid tokenizing structs every time
2026-03-13 8:34 ` [PATCH v2 30/28] docs: kdoc_parser: avoid tokenizing structs every time Mauro Carvalho Chehab
@ 2026-03-13 11:05 ` Loktionov, Aleksandr
0 siblings, 0 replies; 47+ messages in thread
From: Loktionov, Aleksandr @ 2026-03-13 11:05 UTC (permalink / raw)
To: Mauro Carvalho Chehab, Jonathan Corbet, Linux Doc Mailing List
Cc: linux-kernel@vger.kernel.org, Mauro Carvalho Chehab, Randy Dunlap
> -----Original Message-----
> From: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> Sent: Friday, March 13, 2026 9:34 AM
> To: Jonathan Corbet <corbet@lwn.net>; Linux Doc Mailing List <linux-
> doc@vger.kernel.org>
> Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>; linux-
> kernel@vger.kernel.org; Loktionov, Aleksandr
> <aleksandr.loktionov@intel.com>; Mauro Carvalho Chehab
> <mchehab@kernel.org>; Randy Dunlap <rdunlap@infradead.org>
> Subject: [PATCH v2 30/28] docs: kdoc_parser: avoid tokenizing structs
> every time
>
> Most of the rules inside CTransforms are of the type CMatch.
>
> Don't re-parse the source code every time.
>
> Doing this doesn't change the output, but makes kdoc almost as fast as
> before the tokenizer patches:
>
> # Before tokenizer patches
> $ time ./scripts/kernel-doc . -man >original 2>&1
>
> real 0m42.933s
> user 0m36.523s
> sys 0m1.145s
>
> # After tokenizer patches
> $ time ./scripts/kernel-doc . -man >before 2>&1
>
> real 1m29.853s
> user 1m23.974s
> sys 0m1.237s
>
> # After this patch
> $ time ./scripts/kernel-doc . -man >after 2>&1
>
> real 0m48.579s
> user 0m45.938s
> sys 0m0.988s
>
> $ diff -s before after
> Files before and after are identical
>
> Manually checked the differences between original and after
> with:
>
> $ diff -U0 -prBw original after|grep -v Warning|grep -v "@@"|less
>
> They're due to:
> - whitespace fixes;
> - struct_group macros are now better handled;
> - several badly-generated man pages from broken inline kernel-doc
> markups are now fixed.
>
> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> ---
> tools/lib/python/kdoc/kdoc_parser.py | 1 -
> tools/lib/python/kdoc/xforms_lists.py | 30 +++++++++++++++++++++------
> 2 files changed, 24 insertions(+), 7 deletions(-)
>
> diff --git a/tools/lib/python/kdoc/kdoc_parser.py
> b/tools/lib/python/kdoc/kdoc_parser.py
> index ed378edb1e05..3b99740ebed3 100644
> --- a/tools/lib/python/kdoc/kdoc_parser.py
> +++ b/tools/lib/python/kdoc/kdoc_parser.py
> @@ -738,7 +738,6 @@ class KernelDoc:
> #
> # Go through the list of members applying all of our
> transformations.
> #
> - members = trim_private_members(members)
> members = self.xforms.apply("struct", members)
>
> #
> diff --git a/tools/lib/python/kdoc/xforms_lists.py
> b/tools/lib/python/kdoc/xforms_lists.py
> index c3c532c45cdc..f6ea9efb11ae 100644
> --- a/tools/lib/python/kdoc/xforms_lists.py
> +++ b/tools/lib/python/kdoc/xforms_lists.py
> @@ -5,7 +5,7 @@
> import re
>
> from kdoc.kdoc_re import KernRe
> -from kdoc.c_lex import CMatch
> +from kdoc.c_lex import CMatch, CTokenizer
>
> struct_args_pattern = r"([^,)]+)"
>
> @@ -17,6 +17,12 @@ class CTransforms:
> into something we can parse and generate kdoc for.
> """
>
> + #
> + # NOTE:
> + # Due to performance reasons, place CMatch rules before
> KernRe,
> + # as this avoids running the C parser every time.
> + #
> +
> #: Transforms for structs and unions.
> struct_xforms = [
> (CMatch("__attribute__"), ""),
> @@ -123,13 +129,25 @@ class CTransforms:
> "var": var_xforms,
> }
>
> - def apply(self, xforms_type, text):
> + def apply(self, xforms_type, source):
> """
> - Apply a set of transforms to a block of text.
> + Apply a set of transforms to a block of source.
> +
> + As the tokenizer is used here, this function also removes comments
> + at the end.
> """
> if xforms_type not in self.xforms:
> - return text
> + return source
> +
> + if isinstance(source, str):
> + source = CTokenizer(source)
>
> for search, subst in self.xforms[xforms_type]:
> - text = search.sub(subst, text)
> - return text
> + #
> + # KernRe only accepts strings.
> + #
> + if isinstance(search, KernRe):
> + source = str(source)
> +
> + source = search.sub(subst, source)
> + return str(source)
> --
> 2.53.0
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
^ permalink raw reply [flat|nested] 47+ messages in thread
* RE: [PATCH v2 29/28] docs: kdoc: ensure that comments are dropped before calling split_struct_proto()
2026-03-13 8:34 ` [PATCH v2 29/28] docs: kdoc: ensure that comments are dropped before calling split_struct_proto() Mauro Carvalho Chehab
2026-03-13 8:34 ` [PATCH v2 30/28] docs: kdoc_parser: avoid tokenizing structs everytime Mauro Carvalho Chehab
@ 2026-03-13 11:05 ` Loktionov, Aleksandr
1 sibling, 0 replies; 47+ messages in thread
From: Loktionov, Aleksandr @ 2026-03-13 11:05 UTC (permalink / raw)
To: Mauro Carvalho Chehab, Jonathan Corbet, Linux Doc Mailing List
Cc: linux-kernel@vger.kernel.org, Mauro Carvalho Chehab, Randy Dunlap
> -----Original Message-----
> From: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> Sent: Friday, March 13, 2026 9:34 AM
> To: Jonathan Corbet <corbet@lwn.net>; Linux Doc Mailing List <linux-
> doc@vger.kernel.org>
> Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>; linux-
> kernel@vger.kernel.org; Loktionov, Aleksandr
> <aleksandr.loktionov@intel.com>; Mauro Carvalho Chehab
> <mchehab@kernel.org>; Randy Dunlap <rdunlap@infradead.org>
> Subject: [PATCH v2 29/28] docs: kdoc: ensure that comments are dropped
> before calling split_struct_proto()
>
> Changeset 2b957decdb6c ("docs: kdoc: don't add broken comments inside
> prototypes") revealed a hidden bug in split_struct_proto(): some
> comments may break its ability to properly identify a
> struct.
>
> Fixing it is as simple as stripping comments before calling it.
>
> Fixes: 2b957decdb6c ("docs: kdoc: don't add broken comments inside
> prototypes")
> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> ---
> tools/lib/python/kdoc/kdoc_parser.py | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/tools/lib/python/kdoc/kdoc_parser.py
> b/tools/lib/python/kdoc/kdoc_parser.py
> index 3ff17b07c1c9..ed378edb1e05 100644
> --- a/tools/lib/python/kdoc/kdoc_parser.py
> +++ b/tools/lib/python/kdoc/kdoc_parser.py
> @@ -724,6 +724,7 @@ class KernelDoc:
> #
> # Do the basic parse to get the pieces of the declaration.
> #
> + proto = trim_private_members(proto)
> struct_parts = self.split_struct_proto(proto)
> if not struct_parts:
> self.emit_msg(ln, f"{proto} error: Cannot parse struct or
> union!") @@ -764,6 +765,7 @@ class KernelDoc:
> # Strip preprocessor directives. Note that this depends on
> the
> # trailing semicolon we added in process_proto_type().
> #
> + proto = trim_private_members(proto)
> proto = KernRe(r'#\s*((define|ifdef|if)\s+|endif)[^;]*;',
> flags=re.S).sub('', proto)
> #
> # Parse out the name and members of the enum. Typedef form
> first.
> @@ -771,7 +773,7 @@ class KernelDoc:
> r = KernRe(r'typedef\s+enum\s*\{(.*)\}\s*(\w*)\s*;')
> if r.search(proto):
> declaration_name = r.group(2)
> - members = trim_private_members(r.group(1))
> + members = r.group(1)
> #
> # Failing that, look for a straight enum
> #
> @@ -779,7 +781,7 @@ class KernelDoc:
> r = KernRe(r'enum\s+(\w*)\s*\{(.*)\}')
> if r.match(proto):
> declaration_name = r.group(1)
> - members = trim_private_members(r.group(2))
> + members = r.group(2)
> #
> # OK, this isn't going to work.
> #
> --
> 2.53.0
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH v2 05/28] docs: kdoc_re: add a C tokenizer
2026-03-12 14:54 ` [PATCH v2 05/28] docs: kdoc_re: add a C tokenizer Mauro Carvalho Chehab
@ 2026-03-16 23:01 ` Jonathan Corbet
2026-03-17 7:59 ` Mauro Carvalho Chehab
2026-03-16 23:03 ` Jonathan Corbet
1 sibling, 1 reply; 47+ messages in thread
From: Jonathan Corbet @ 2026-03-16 23:01 UTC (permalink / raw)
To: Mauro Carvalho Chehab
Cc: Jonathan Corbet, Linux Doc Mailing List, linux-hardening,
linux-kernel, Aleksandr Loktionov, Randy Dunlap
On Thu, 12 Mar 2026 15:54:25 +0100, Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> Handling C code purely using regular expressions doesn't work well.
>
> Add a C tokenizer to help doing it the right way.
>
> The tokenizer was written using as basis the Python re documentation
> tokenizer example from:
> https://docs.python.org/3/library/re.html#writing-a-tokenizer
>
> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> Message-ID: <c63ad36c81fe043e9e33ca55630414893f127413.1773074166.git.mchehab+huawei@kernel.org>
> Message-ID: <8541ffa469647db1a7154f274fb2d55b4c127dcb.1773326442.git.mchehab+huawei@kernel.org>
This is a combined effort to review this patch and to try out "b4 review",
we'll see how it goes :).
> diff --git a/tools/lib/python/kdoc/kdoc_re.py b/tools/lib/python/kdoc/kdoc_re.py
> index 085b89a4547c0..7bed4e9a88108 100644
> --- a/tools/lib/python/kdoc/kdoc_re.py
> +++ b/tools/lib/python/kdoc/kdoc_re.py
> @@ -141,6 +141,240 @@ class KernRe:
> [ ... skip 4 lines ... ]
> +
> + @staticmethod
> + def __str__(val):
> + """Return the name of an enum value"""
> + return TokType._name_by_val.get(val, f"UNKNOWN({val})")
> +
What is this class supposed to do?
> [ ... skip 27 lines ... ]
> + _name_by_val = {v: k for k, v in dict(vars()).items() if isinstance(v, int)}
> +
> + # Dict to convert from string to an enum-like integer value.
> + _name_to_val = {k: v for v, k in _name_by_val.items()}
> +
> + @staticmethod
This stuff strikes me as a bit overdone; _name_to_val is really just the
variable list for the class, right?
> [ ... skip 30 lines ... ]
> + f"{self.brace_level}, {self.paren_level}, {self.bracket_level})"
> +
> +#: Tokens to parse C code.
> +TOKEN_LIST = [
> + (CToken.COMMENT, r"//[^\n]*|/\*[\s\S]*?\*/"),
> +
So these aren't "tokens", this is a list of regexes; how is it intended
to be used?
> + (CToken.STRING, r'"(?:\\.|[^"\\])*"'),
> + (CToken.CHAR, r"'(?:\\.|[^'\\])'"),
> +
> + (CToken.NUMBER, r"0[xX][0-9a-fA-F]+[uUlL]*|0[0-7]+[uUlL]*|"
How does "[\s\S]*" differ from plain old "*" ?
> [ ... skip 15 lines ... ]
> + (CToken.STRUCT, r"\bstruct\b"),
> + (CToken.UNION, r"\bunion\b"),
> + (CToken.ENUM, r"\benum\b"),
> + (CToken.TYPEDEF, r"\bkinddef\b"),
> +
> + (CToken.NAME, r"[A-Za-z_][A-Za-z0-9_]*"),
"-" and "!" never need to be escaped.
> +
> + (CToken.SPACE, r"[\s]+"),
> +
> + (CToken.MISMATCH,r"."),
> +]
> +
"kinddef" ?
> +#: Handle C continuation lines.
> +RE_CONT = KernRe(r"\\\n")
> +
> +RE_COMMENT_START = KernRe(r'/\*\s*')
> +
Don't need the [brackets] here
> [ ... skip 6 lines ... ]
> +
> + When converted to string, it drops comments and handle public/private
> + values, respecting depth.
> + """
> +
> + # This class is inspired and follows the basic concepts of:
That seems weird, why don't you just initialize it here?
> [ ... skip 14 lines ... ]
> + source = RE_CONT.sub("", source)
> +
> + brace_level = 0
> + paren_level = 0
> + bracket_level = 0
> +
Do you mean "iterator" here?
> [ ... skip 33 lines ... ]
> + in this particular case, it makes sense, as we can pick the name
> + when matching a code via re_scanner().
> + """
> + global re_scanner
> +
> + if not re_scanner:
Putting __init__() first is fairly standard, methinks.
> [ ... skip 15 lines ... ]
> +
> + for tok in self.tokens:
> + if tok.kind == CToken.BEGIN:
> + show_stack.append(show_stack[-1])
> +
> + elif tok.kind == CToken.END:
I still don't understand why you do this here - this is all constant, right?
> + prev = show_stack[-1]
> + if len(show_stack) > 1:
> + show_stack.pop()
> +
> + if not prev and show_stack[-1]:
So you create a nice iterator structure, then just put it all together into a
list anyway?
--
Jonathan Corbet <corbet@lwn.net>
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH v2 05/28] docs: kdoc_re: add a C tokenizer
2026-03-12 14:54 ` [PATCH v2 05/28] docs: kdoc_re: add a C tokenizer Mauro Carvalho Chehab
2026-03-16 23:01 ` Jonathan Corbet
@ 2026-03-16 23:03 ` Jonathan Corbet
2026-03-16 23:29 ` Randy Dunlap
1 sibling, 1 reply; 47+ messages in thread
From: Jonathan Corbet @ 2026-03-16 23:03 UTC (permalink / raw)
To: Mauro Carvalho Chehab
Cc: Jonathan Corbet, Linux Doc Mailing List, linux-hardening,
linux-kernel, Aleksandr Loktionov, Randy Dunlap
On Thu, 12 Mar 2026 15:54:25 +0100, Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> Handling C code purely using regular expressions doesn't work well.
>
> Add a C tokenizer to help doing it the right way.
>
> The tokenizer was written using as basis the Python re documentation
> tokenizer example from:
> https://docs.python.org/3/library/re.html#writing-a-tokenizer
>
> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> Message-ID: <c63ad36c81fe043e9e33ca55630414893f127413.1773074166.git.mchehab+huawei@kernel.org>
> Message-ID: <8541ffa469647db1a7154f274fb2d55b4c127dcb.1773326442.git.mchehab+huawei@kernel.org>
This is a combined effort to review this patch and to try out "b4 review",
we'll see how it goes :).
> diff --git a/tools/lib/python/kdoc/kdoc_re.py b/tools/lib/python/kdoc/kdoc_re.py
> index 085b89a4547c0..7bed4e9a88108 100644
> --- a/tools/lib/python/kdoc/kdoc_re.py
> +++ b/tools/lib/python/kdoc/kdoc_re.py
> @@ -141,6 +141,240 @@ class KernRe:
> [ ... skip 4 lines ... ]
> +
> + @staticmethod
> + def __str__(val):
> + """Return the name of an enum value"""
> + return TokType._name_by_val.get(val, f"UNKNOWN({val})")
> +
What is this class supposed to do?
> [ ... skip 27 lines ... ]
> + _name_by_val = {v: k for k, v in dict(vars()).items() if isinstance(v, int)}
> +
> + # Dict to convert from string to an enum-like integer value.
> + _name_to_val = {k: v for v, k in _name_by_val.items()}
> +
> + @staticmethod
This stuff strikes me as a bit overdone; _name_to_val is really just the
variable list for the class, right?
> [ ... skip 30 lines ... ]
> + f"{self.brace_level}, {self.paren_level}, {self.bracket_level})"
> +
> +#: Tokens to parse C code.
> +TOKEN_LIST = [
> + (CToken.COMMENT, r"//[^\n]*|/\*[\s\S]*?\*/"),
> +
So these aren't "tokens", this is a list of regexes; how is it intended
to be used?
> + (CToken.STRING, r'"(?:\\.|[^"\\])*"'),
> + (CToken.CHAR, r"'(?:\\.|[^'\\])'"),
> +
> + (CToken.NUMBER, r"0[xX][0-9a-fA-F]+[uUlL]*|0[0-7]+[uUlL]*|"
How does "[\s\S]*" differ from plain old "*" ?
> [ ... skip 15 lines ... ]
> + (CToken.STRUCT, r"\bstruct\b"),
> + (CToken.UNION, r"\bunion\b"),
> + (CToken.ENUM, r"\benum\b"),
> + (CToken.TYPEDEF, r"\bkinddef\b"),
> +
> + (CToken.NAME, r"[A-Za-z_][A-Za-z0-9_]*"),
"-" and "!" never need to be escaped.
> +
> + (CToken.SPACE, r"[\s]+"),
> +
> + (CToken.MISMATCH,r"."),
> +]
> +
"kinddef" ?
> +#: Handle C continuation lines.
> +RE_CONT = KernRe(r"\\\n")
> +
> +RE_COMMENT_START = KernRe(r'/\*\s*')
> +
Don't need the [brackets] here
> [ ... skip 6 lines ... ]
> +
> + When converted to string, it drops comments and handle public/private
> + values, respecting depth.
> + """
> +
> + # This class is inspired and follows the basic concepts of:
That seems weird, why don't you just initialize it here?
> [ ... skip 14 lines ... ]
> + source = RE_CONT.sub("", source)
> +
> + brace_level = 0
> + paren_level = 0
> + bracket_level = 0
> +
Do you mean "iterator" here?
> [ ... skip 33 lines ... ]
> + in this particular case, it makes sense, as we can pick the name
> + when matching a code via re_scanner().
> + """
> + global re_scanner
> +
> + if not re_scanner:
Putting __init__() first is fairly standard, methinks.
> [ ... skip 15 lines ... ]
> +
> + for tok in self.tokens:
> + if tok.kind == CToken.BEGIN:
> + show_stack.append(show_stack[-1])
> +
> + elif tok.kind == CToken.END:
I still don't understand why you do this here - this is all constant, right?
> + prev = show_stack[-1]
> + if len(show_stack) > 1:
> + show_stack.pop()
> +
> + if not prev and show_stack[-1]:
So you create a nice iterator structure, then just put it all together into a
list anyway?
--
Jonathan Corbet <corbet@lwn.net>
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH v2 05/28] docs: kdoc_re: add a C tokenizer
2026-03-16 23:03 ` Jonathan Corbet
@ 2026-03-16 23:29 ` Randy Dunlap
2026-03-16 23:40 ` Jonathan Corbet
2026-03-17 7:03 ` Mauro Carvalho Chehab
0 siblings, 2 replies; 47+ messages in thread
From: Randy Dunlap @ 2026-03-16 23:29 UTC (permalink / raw)
To: Jonathan Corbet, Mauro Carvalho Chehab
Cc: Linux Doc Mailing List, linux-hardening, linux-kernel,
Aleksandr Loktionov
Uh, I find this review confusing.
Do your (Jon) comments refer to the code above them?
(more below)
On 3/16/26 4:03 PM, Jonathan Corbet wrote:
> On Thu, 12 Mar 2026 15:54:25 +0100, Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
>> Handling C code purely using regular expressions doesn't work well.
>>
>> Add a C tokenizer to help doing it the right way.
>>
>> The tokenizer was written using as basis the Python re documentation
>> tokenizer example from:
>> https://docs.python.org/3/library/re.html#writing-a-tokenizer
>>
>> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
>> Message-ID: <c63ad36c81fe043e9e33ca55630414893f127413.1773074166.git.mchehab+huawei@kernel.org>
>> Message-ID: <8541ffa469647db1a7154f274fb2d55b4c127dcb.1773326442.git.mchehab+huawei@kernel.org>
>
> This is a combined effort to review this patch and to try out "b4 review",
> we'll see how it goes :).
>
>> diff --git a/tools/lib/python/kdoc/kdoc_re.py b/tools/lib/python/kdoc/kdoc_re.py
>> index 085b89a4547c0..7bed4e9a88108 100644
>> --- a/tools/lib/python/kdoc/kdoc_re.py
>> +++ b/tools/lib/python/kdoc/kdoc_re.py
>> @@ -141,6 +141,240 @@ class KernRe:
>> [ ... skip 4 lines ... ]
>> +
>> + @staticmethod
>> + def __str__(val):
>> + """Return the name of an enum value"""
>> + return TokType._name_by_val.get(val, f"UNKNOWN({val})")
>> +
>
> What is this class supposed to do?
>
>> [ ... skip 27 lines ... ]
>> + _name_by_val = {v: k for k, v in dict(vars()).items() if isinstance(v, int)}
>> +
>> + # Dict to convert from string to an enum-like integer value.
>> + _name_to_val = {k: v for v, k in _name_by_val.items()}
>> +
>> + @staticmethod
>
> This stuff strikes me as a bit overdone; _name_to_val is really just the
> variable list for the class, right?
>
>> [ ... skip 30 lines ... ]
>> + f"{self.brace_level}, {self.paren_level}, {self.bracket_level})"
>> +
>> +#: Tokens to parse C code.
>> +TOKEN_LIST = [
>> + (CToken.COMMENT, r"//[^\n]*|/\*[\s\S]*?\*/"),
>> +
>
> So these aren't "tokens", this is a list of regexes; how is it intended
> to be used?
>
>> + (CToken.STRING, r'"(?:\\.|[^"\\])*"'),
>> + (CToken.CHAR, r"'(?:\\.|[^'\\])'"),
>> +
>> + (CToken.NUMBER, r"0[xX][0-9a-fA-F]+[uUlL]*|0[0-7]+[uUlL]*|"
>
> How does "[\s\S]*" differ from plain old "*" ?
>
>> [ ... skip 15 lines ... ]
>> + (CToken.STRUCT, r"\bstruct\b"),
>> + (CToken.UNION, r"\bunion\b"),
>> + (CToken.ENUM, r"\benum\b"),
>> + (CToken.TYPEDEF, r"\bkinddef\b"),
>> +
>> + (CToken.NAME, r"[A-Za-z_][A-Za-z0-9_]*"),
>
> "-" and "!" never need to be escaped.
>
>> +
>> + (CToken.SPACE, r"[\s]+"),
>> +
>> + (CToken.MISMATCH,r"."),
>> +]
>> +
>
> "kinddef" ?
What does that refer to?
>
>> +#: Handle C continuation lines.
>> +RE_CONT = KernRe(r"\\\n")
>> +
>> +RE_COMMENT_START = KernRe(r'/\*\s*')
>> +
>
> Don't need the [brackets] here
what brackets?
>
>> [ ... skip 6 lines ... ]
>> +
>> + When converted to string, it drops comments and handle public/private
>> + values, respecting depth.
>> + """
>> +
>> + # This class is inspired and follows the basic concepts of:
>
> That seems weird, why don't you just initialize it here?
I can't tell what that comment refers to.
>> [ ... skip 14 lines ... ]
>> + source = RE_CONT.sub("", source)
>> +
>> + brace_level = 0
>> + paren_level = 0
>> + bracket_level = 0
>> +
>
> Do you mean "iterator" here?
Ditto.
>> [ ... skip 33 lines ... ]
>> + in this particular case, it makes sense, as we can pick the name
>> + when matching a code via re_scanner().
>> + """
>> + global re_scanner
>> +
>> + if not re_scanner:
>
> Putting __init__() first is fairly standard, methinks.
>
>> [ ... skip 15 lines ... ]
>> +
>> + for tok in self.tokens:
>> + if tok.kind == CToken.BEGIN:
>> + show_stack.append(show_stack[-1])
>> +
>> + elif tok.kind == CToken.END:
>
> I still don't understand why you do this here - this is all constant, right?
>
>> + prev = show_stack[-1]
>> + if len(show_stack) > 1:
>> + show_stack.pop()
>> +
>> + if not prev and show_stack[-1]:
>
> So you create a nice iterator structure, then just put it all together into a
> list anyway?
>
--
~Randy
* Re: [PATCH v2 07/28] docs: kdoc: move C Tokenizer to c_lex module
2026-03-12 14:54 ` [PATCH v2 07/28] docs: kdoc: move C Tokenizer to c_lex module Mauro Carvalho Chehab
@ 2026-03-16 23:30 ` Jonathan Corbet
2026-03-17 8:02 ` Mauro Carvalho Chehab
0 siblings, 1 reply; 47+ messages in thread
From: Jonathan Corbet @ 2026-03-16 23:30 UTC (permalink / raw)
To: Mauro Carvalho Chehab, Linux Doc Mailing List
Cc: Mauro Carvalho Chehab, linux-hardening, linux-kernel,
Aleksandr Loktionov, Randy Dunlap
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> writes:
> Place the C tokenizer on a different module.
>
> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> ---
> tools/lib/python/kdoc/c_lex.py | 239 +++++++++++++++++++++++++++
> tools/lib/python/kdoc/kdoc_parser.py | 3 +-
> tools/lib/python/kdoc/kdoc_re.py | 233 --------------------------
> 3 files changed, 241 insertions(+), 234 deletions(-)
> create mode 100644 tools/lib/python/kdoc/c_lex.py
One has to ask...why not just put it in its own file in the first place?
Thanks,
jon
* Re: [PATCH v2 05/28] docs: kdoc_re: add a C tokenizer
2026-03-16 23:29 ` Randy Dunlap
@ 2026-03-16 23:40 ` Jonathan Corbet
2026-03-17 8:21 ` Mauro Carvalho Chehab
2026-03-17 7:03 ` Mauro Carvalho Chehab
1 sibling, 1 reply; 47+ messages in thread
From: Jonathan Corbet @ 2026-03-16 23:40 UTC (permalink / raw)
To: Randy Dunlap, Mauro Carvalho Chehab
Cc: Linux Doc Mailing List, linux-hardening, linux-kernel,
Aleksandr Loktionov
Randy Dunlap <rdunlap@infradead.org> writes:
> Uh, I find this review confusing.
> Do your (Jon) comments refer to the code above them?
> (more below)
They do
Or, at least, they did...but they clearly got mixed up in the sending
somewhere. Below is the intended version...
> tools/lib/python/kdoc/kdoc_re.py | 234 +++++++++++++++++++++++++++++++
> 1 file changed, 234 insertions(+)
>
> diff --git a/tools/lib/python/kdoc/kdoc_re.py b/tools/lib/python/kdoc/kdoc_re.py
> index 085b89a4547c..7bed4e9a8810 100644
> --- a/tools/lib/python/kdoc/kdoc_re.py
> +++ b/tools/lib/python/kdoc/kdoc_re.py
> @@ -141,6 +141,240 @@ class KernRe:
>
> return self.last_match.groups()
>
> +class TokType():
> +
> + @staticmethod
> + def __str__(val):
> > + """Return the name of an enum value"""
> + return TokType._name_by_val.get(val, f"UNKNOWN({val})")
What is this class supposed to do?
> +
> +class CToken():
> + ""
> + Data class to define a C token.
> + ""
> +
> + # Tokens that can be used by the parser. Works like an C enum.
> +
> + COMMENT = 0 #: A standard C or C99 comment, including delimiter.
> + STRING = 1 #: A string, including quotation marks.
> > + CHAR = 2 #: A character, including apostrophes.
> > + NUMBER = 3 #: A number.
> > + PUNC = 4 #: A punctuation mark: ``;`` / ``,`` / ``.``.
> > + BEGIN = 5 #: A begin character: ``{`` / ``[`` / ``(``.
> > + END = 6 #: An end character: ``}`` / ``]`` / ``)``.
> > + CPP = 7 #: A preprocessor macro.
> > + HASH = 8 #: The hash character - useful to handle other macros.
> > + OP = 9 #: A C operator (add, subtract, ...).
> > + STRUCT = 10 #: A ``struct`` keyword.
> > + UNION = 11 #: A ``union`` keyword.
> > + ENUM = 12 #: An ``enum`` keyword.
> + TYPEDEF = 13 #: A ``typedef`` keyword.
> + NAME = 14 #: A name. Can be an ID or a type.
> + SPACE = 15 #: Any space characters, including new lines
> +
> + MISMATCH = 255 #: an error indicator: should never happen in practice.
> +
> > + # Dict to convert from an enum integer into a string.
> + _name_by_val = {v: k for k, v in dict(vars()).items() if isinstance(v, int)}
> +
> + # Dict to convert from string to an enum-like integer value.
> + _name_to_val = {k: v for v, k in _name_by_val.items()}
This stuff strikes me as a bit overdone; _name_to_val is really just the
variable list for the class, right?
> +
> + @staticmethod
> + def to_name(val):
> > + """Convert from an integer value from CToken enum into a string"""
> +
> + return CToken._name_by_val.get(val, f"UNKNOWN({val})")
> +
> + @staticmethod
> + def from_name(name):
> > + """Convert a string into a CToken enum value"""
> + if name in CToken._name_to_val:
> + return CToken._name_to_val[name]
> +
> + return CToken.MISMATCH
> +
> + def __init__(self, kind, value, pos,
> + brace_level, paren_level, bracket_level):
> + self.kind = kind
> + self.value = value
> + self.pos = pos
> + self.brace_level = brace_level
> + self.paren_level = paren_level
> + self.bracket_level = bracket_level
> +
> + def __repr__(self):
> + name = self.to_name(self.kind)
> + if isinstance(self.value, str):
> + value = '"' + self.value + '"'
> + else:
> + value = self.value
> +
> + return f"CToken({name}, {value}, {self.pos}, " \
> + f"{self.brace_level}, {self.paren_level}, {self.bracket_level})"
> +
> +#: Tokens to parse C code.
> +TOKEN_LIST = [
So these aren't "tokens", this is a list of regexes; how is it intended
to be used?
> + (CToken.COMMENT, r"//[^\n]*|/\*[\s\S]*?\*/"),
How does "[\s\S]*" differ from plain old "*" ?
> +
> + (CToken.STRING, r'"(?:\\.|[^"\\])*"'),
> + (CToken.CHAR, r"'(?:\\.|[^'\\])'"),
> +
> + (CToken.NUMBER, r"0[xX][0-9a-fA-F]+[uUlL]*|0[0-7]+[uUlL]*|"
> + r"[0-9]+(\.[0-9]*)?([eE][+-]?[0-9]+)?[fFlL]*"),
> +
> + (CToken.PUNC, r"[;,\.]"),
> +
> + (CToken.BEGIN, r"[\[\(\{]"),
> +
> + (CToken.END, r"[\]\)\}]"),
> +
> + (CToken.CPP, r"#\s*(define|include|ifdef|ifndef|if|else|elif|endif|undef|pragma)\b"),
> +
> + (CToken.HASH, r"#"),
> +
> + (CToken.OP, r"\+\+|\-\-|\->|==|\!=|<=|>=|&&|\|\||<<|>>|\+=|\-=|\*=|/=|%="
> + r"|&=|\|=|\^=|=|\+|\-|\*|/|%|<|>|&|\||\^|~|!|\?|\:"),
"-" and "!" never need to be escaped.
> +
> + (CToken.STRUCT, r"\bstruct\b"),
> + (CToken.UNION, r"\bunion\b"),
> + (CToken.ENUM, r"\benum\b"),
> + (CToken.TYPEDEF, r"\bkinddef\b"),
"kinddef" ?
> +
> + (CToken.NAME, r"[A-Za-z_][A-Za-z0-9_]*"),
> +
> + (CToken.SPACE, r"[\s]+"),
Don't need the [brackets] here
> +
> + (CToken.MISMATCH,r"."),
> +]
> +
> +#: Handle C continuation lines.
> +RE_CONT = KernRe(r"\\\n")
> +
> +RE_COMMENT_START = KernRe(r'/\*\s*')
> +
> +#: tokenizer regex. Will be filled at the first CTokenizer usage.
> +re_scanner = None
That seems weird, why don't you just initialize it here?
> +
> +class CTokenizer():
> + ""
> + Scan C statements and definitions and produce tokens.
> +
> + When converted to string, it drops comments and handle public/private
> + values, respecting depth.
> + ""
> +
> + # This class is inspired and follows the basic concepts of:
> + # https://docs.python.org/3/library/re.html#writing-a-tokenizer
> +
> + def _tokenize(self, source):
> + ""
> + Interactor that parses ``source``, splitting it into tokens, as defined
> + at ``self.TOKEN_LIST``.
> +
> + The interactor returns a CToken class object.
> + ""
Do you mean "iterator" here?
> +
> + # Handle continuation lines. Note that kdoc_parser already has a
> + # logic to do that. Still, let's keep it for completeness, as we might
> + # end re-using this tokenizer outsize kernel-doc some day - or we may
> + # eventually remove from there as a future cleanup.
> > + source = RE_CONT.sub("", source)
> +
> + brace_level = 0
> + paren_level = 0
> + bracket_level = 0
> +
> + for match in re_scanner.finditer(source):
> + kind = CToken.from_name(match.lastgroup)
> + pos = match.start()
> + value = match.group()
> +
> + if kind == CToken.MISMATCH:
> + raise RuntimeError(f"Unexpected token '{value}' on {pos}:\n\t{source}")
> + elif kind == CToken.BEGIN:
> + if value == '(':
> + paren_level += 1
> + elif value == '[':
> + bracket_level += 1
> + else: # value == '{'
> + brace_level += 1
> +
> + elif kind == CToken.END:
> + if value == ')' and paren_level > 0:
> + paren_level -= 1
> + elif value == ']' and bracket_level > 0:
> + bracket_level -= 1
> + elif brace_level > 0: # value == '}'
> + brace_level -= 1
> +
> + yield CToken(kind, value, pos,
> + brace_level, paren_level, bracket_level)
> +
> + def __init__(self, source):
Putting __init__() first is fairly standard, methinks.
> + ""
> + Create a regular expression to handle TOKEN_LIST.
> +
> + While I generally don't like using regex group naming via:
> + (?P<name>...)
> +
> + in this particular case, it makes sense, as we can pick the name
> + when matching a code via re_scanner().
> + ""
> + global re_scanner
> +
> + if not re_scanner:
> + re_tokens = []
> +
> + for kind, pattern in TOKEN_LIST:
> + name = CToken.to_name(kind)
> + re_tokens.append(f"(?P<{name}>{pattern})")
> +
> + re_scanner = KernRe("|".join(re_tokens), re.MULTILINE | re.DOTALL)
I still don't understand why you do this here - this is all constant, right?
> +
> + self.tokens = []
> + for tok in self._tokenize(source):
> + self.tokens.append(tok)
So you create a nice iterator structure, then just put it all together into a
list anyway?
> +
> + def __str__(self):
> > + out = ""
> + show_stack = [True]
> +
> + for tok in self.tokens:
> + if tok.kind == CToken.BEGIN:
> + show_stack.append(show_stack[-1])
> +
> + elif tok.kind == CToken.END:
> + prev = show_stack[-1]
> + if len(show_stack) > 1:
> + show_stack.pop()
> +
> + if not prev and show_stack[-1]:
> + #
> + # Try to preserve indent
> + #
> + out += "\t" * (len(show_stack) - 1)
> +
> + out += str(tok.value)
> + continue
> +
> + elif tok.kind == CToken.COMMENT:
> > + comment = RE_COMMENT_START.sub("", tok.value)
> +
> + if comment.startswith("private:"):
> + show_stack[-1] = False
> + show = False
> + elif comment.startswith("public:"):
> + show_stack[-1] = True
> +
> + continue
> +
> + if show_stack[-1]:
> + out += str(tok.value)
> +
> + return out
> +
> +
> #: Nested delimited pairs (brackets and parenthesis)
> DELIMITER_PAIRS = {
> '{': '}',
Thanks,
jon
* Re: [PATCH v2 05/28] docs: kdoc_re: add a C tokenizer
2026-03-16 23:29 ` Randy Dunlap
2026-03-16 23:40 ` Jonathan Corbet
@ 2026-03-17 7:03 ` Mauro Carvalho Chehab
1 sibling, 0 replies; 47+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-17 7:03 UTC (permalink / raw)
To: Randy Dunlap
Cc: Jonathan Corbet, Linux Doc Mailing List, linux-hardening,
linux-kernel, Aleksandr Loktionov
On Mon, 16 Mar 2026 16:29:37 -0700
Randy Dunlap <rdunlap@infradead.org> wrote:
> Uh, I find this review confusing.
> Do your (Jon) comments refer to the code above them?
> (more below)
I was about to comment the same thing: it sounds like b4 review made a
big mess of your comments, as it is very hard to identify which part
of the code you're referring to.
I'll reply to your comments on a separate e-mail - at least the ones I
understand.
>
>
> On 3/16/26 4:03 PM, Jonathan Corbet wrote:
> > On Thu, 12 Mar 2026 15:54:25 +0100, Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> >> Handling C code purely using regular expressions doesn't work well.
> >>
> >> Add a C tokenizer to help doing it the right way.
> >>
> >> The tokenizer was written using as basis the Python re documentation
> >> tokenizer example from:
> >> https://docs.python.org/3/library/re.html#writing-a-tokenizer
> >>
> >> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> >> Message-ID: <c63ad36c81fe043e9e33ca55630414893f127413.1773074166.git.mchehab+huawei@kernel.org>
> >> Message-ID: <8541ffa469647db1a7154f274fb2d55b4c127dcb.1773326442.git.mchehab+huawei@kernel.org>
> >
> > This is a combined effort to review this patch and to try out "b4 review",
> > we'll see how it goes :).
> >
> >> diff --git a/tools/lib/python/kdoc/kdoc_re.py b/tools/lib/python/kdoc/kdoc_re.py
> >> index 085b89a4547c0..7bed4e9a88108 100644
> >> --- a/tools/lib/python/kdoc/kdoc_re.py
> >> +++ b/tools/lib/python/kdoc/kdoc_re.py
> >> @@ -141,6 +141,240 @@ class KernRe:
> >> [ ... skip 4 lines ... ]
> >> +
> >> + @staticmethod
> >> + def __str__(val):
> >> + """Return the name of an enum value"""
> >> + return TokType._name_by_val.get(val, f"UNKNOWN({val})")
> >> +
> >
> > What is this class supposed to do?
> >
> >> [ ... skip 27 lines ... ]
> >> + _name_by_val = {v: k for k, v in dict(vars()).items() if isinstance(v, int)}
> >> +
> >> + # Dict to convert from string to an enum-like integer value.
> >> + _name_to_val = {k: v for v, k in _name_by_val.items()}
> >> +
> >> + @staticmethod
> >
> > This stuff strikes me as a bit overdone; _name_to_val is really just the
> > variable list for the class, right?
> >
> >> [ ... skip 30 lines ... ]
> >> + f"{self.brace_level}, {self.paren_level}, {self.bracket_level})"
> >> +
> >> +#: Tokens to parse C code.
> >> +TOKEN_LIST = [
> >> + (CToken.COMMENT, r"//[^\n]*|/\*[\s\S]*?\*/"),
> >> +
> >
> > So these aren't "tokens", this is a list of regexes; how is it intended
> > to be used?
> >
> >> + (CToken.STRING, r'"(?:\\.|[^"\\])*"'),
> >> + (CToken.CHAR, r"'(?:\\.|[^'\\])'"),
> >> +
> >> + (CToken.NUMBER, r"0[xX][0-9a-fA-F]+[uUlL]*|0[0-7]+[uUlL]*|"
> >
> > How does "[\s\S]*" differ from plain old "*" ?
> >
> >> [ ... skip 15 lines ... ]
> >> + (CToken.STRUCT, r"\bstruct\b"),
> >> + (CToken.UNION, r"\bunion\b"),
> >> + (CToken.ENUM, r"\benum\b"),
> >> + (CToken.TYPEDEF, r"\bkinddef\b"),
> >> +
> >> + (CToken.NAME, r"[A-Za-z_][A-Za-z0-9_]*"),
> >
> > "-" and "!" never need to be escaped.
> >
> >> +
> >> + (CToken.SPACE, r"[\s]+"),
> >> +
> >> + (CToken.MISMATCH,r"."),
> >> +]
> >> +
> >
> > "kinddef" ?
>
> What does that refer to?
>
> >
> >> +#: Handle C continuation lines.
> >> +RE_CONT = KernRe(r"\\\n")
> >> +
> >> +RE_COMMENT_START = KernRe(r'/\*\s*')
> >> +
> >
> > Don't need the [brackets] here
>
> what brackets?
>
> >
> >> [ ... skip 6 lines ... ]
> >> +
> >> + When converted to string, it drops comments and handle public/private
> >> + values, respecting depth.
> >> + """
> >> +
> >> + # This class is inspired and follows the basic concepts of:
> >
> > That seems weird, why don't you just initialize it here?
>
> I can't tell what that comment refers to.
>
> >> [ ... skip 14 lines ... ]
> >> + source = RE_CONT.sub("", source)
> >> +
> >> + brace_level = 0
> >> + paren_level = 0
> >> + bracket_level = 0
> >> +
> >
> > Do you mean "iterator" here?
>
> Ditto.
>
> >> [ ... skip 33 lines ... ]
> >> + in this particular case, it makes sense, as we can pick the name
> >> + when matching a code via re_scanner().
> >> + """
> >> + global re_scanner
> >> +
> >> + if not re_scanner:
> >
> > Putting __init__() first is fairly standard, methinks.
> >
> >> [ ... skip 15 lines ... ]
> >> +
> >> + for tok in self.tokens:
> >> + if tok.kind == CToken.BEGIN:
> >> + show_stack.append(show_stack[-1])
> >> +
> >> + elif tok.kind == CToken.END:
> >
> > I still don't understand why you do this here - this is all constant, right?
> >
> >> + prev = show_stack[-1]
> >> + if len(show_stack) > 1:
> >> + show_stack.pop()
> >> +
> >> + if not prev and show_stack[-1]:
> >
> > So you create a nice iterator structure, then just put it all together into a
> > list anyway?
> >
>
Thanks,
Mauro
* Re: [PATCH v2 05/28] docs: kdoc_re: add a C tokenizer
2026-03-16 23:01 ` Jonathan Corbet
@ 2026-03-17 7:59 ` Mauro Carvalho Chehab
0 siblings, 0 replies; 47+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-17 7:59 UTC (permalink / raw)
To: Jonathan Corbet
Cc: Linux Doc Mailing List, linux-hardening, linux-kernel,
Aleksandr Loktionov, Randy Dunlap
On Mon, 16 Mar 2026 17:01:11 -0600
Jonathan Corbet <corbet@lwn.net> wrote:
> On Thu, 12 Mar 2026 15:54:25 +0100, Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> > Handling C code purely using regular expressions doesn't work well.
> >
> > Add a C tokenizer to help doing it the right way.
> >
> > The tokenizer was written using as basis the Python re documentation
> > tokenizer example from:
> > https://docs.python.org/3/library/re.html#writing-a-tokenizer
> >
> > Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> > Message-ID: <c63ad36c81fe043e9e33ca55630414893f127413.1773074166.git.mchehab+huawei@kernel.org>
> > Message-ID: <8541ffa469647db1a7154f274fb2d55b4c127dcb.1773326442.git.mchehab+huawei@kernel.org>
>
> This is a combined effort to review this patch and to try out "b4 review",
> we'll see how it goes :).
>
> > diff --git a/tools/lib/python/kdoc/kdoc_re.py b/tools/lib/python/kdoc/kdoc_re.py
> > index 085b89a4547c0..7bed4e9a88108 100644
> > --- a/tools/lib/python/kdoc/kdoc_re.py
> > +++ b/tools/lib/python/kdoc/kdoc_re.py
> > @@ -141,6 +141,240 @@ class KernRe:
> > [ ... skip 4 lines ... ]
> > +
> > + @staticmethod
> > + def __str__(val):
> > + """Return the name of an enum value"""
> > + return TokType._name_by_val.get(val, f"UNKNOWN({val})")
> > +
>
> What is this class supposed to do?
This __str__() method ensures that, when printing a CToken object,
the name will be displayed, instead of a number. This is really
useful when debugging.
See, if I add a print:
<snip>
--- a/tools/lib/python/kdoc/kdoc_parser.py
+++ b/tools/lib/python/kdoc/kdoc_parser.py
@@ -87,6 +87,7 @@ def trim_private_members(text):
"""
tokens = CTokenizer(text)
+ print(tokens.tokens)
return str(tokens)
</snip>
the tokens will appear as names at the output:
$ ./scripts/kernel-doc -none er.c
[CToken(CToken.ENUM, "enum", 0, (0, 0, 0)), CToken(CToken.SPACE, " ", 4, (0, 0, 0)), CToken(CToken.NAME, "dmub_abm_ace_curve_type", 5, (0, 0, 0)), CToken(CToken.SPACE, " ", 28, (0, 0, 0)), CToken(CToken.BEGIN, "{", 29, (0, 0, 1)), CToken(CToken.SPACE, " ", 30, (0, 0, 1)), CToken(CToken.COMMENT, "/**
* ACE curve as defined by the SW layer. */", 31, (0, 0, 1)), CToken(CToken.SPACE, " ", 86, (0, 0, 1)), CToken(CToken.NAME, "ABM_ACE_CURVE_TYPE__SW", 87, (0, 0, 1)), CToken(CToken.SPACE, " ", 109, (0, 0, 1)), CToken(CToken.OP, "=", 110, (0, 0, 1)), CToken(CToken.SPACE, " ", 111, (0, 0, 1)), CToken(CToken.NUMBER, "0", 112, (0, 0, 1)), CToken(CToken.PUNC, ",", 113, (0, 0, 1)), CToken(CToken.SPACE, " ", 114, (0, 0, 1)), CToken(CToken.COMMENT, "/**
* ACE curve as defined by the SW to HW translation interface layer. */", 115, (0, 0, 1)), CToken(CToken.SPACE, " ", 198, (0, 0, 1)), CToken(CToken.NAME, "ABM_ACE_CURVE_TYPE__SW_IF", 199, (0, 0, 1)), CToken(CToken.SPACE, " ", 224, (0, 0, 1)), CToken(CToken.OP, "=", 225, (0, 0, 1)), CToken(CToken.SPACE, " ", 226, (0, 0, 1)), CToken(CToken.NUMBER, "1", 227, (0, 0, 1)), CToken(CToken.PUNC, ",", 228, (0, 0, 1)), CToken(CToken.SPACE, " ", 229, (0, 0, 1)), CToken(CToken.END, "}", 230, (0, 0, 0)), CToken(CToken.PUNC, ";", 231, (0, 0, 0))]
>
> > [ ... skip 27 lines ... ]
> > + _name_by_val = {v: k for k, v in dict(vars()).items() if isinstance(v, int)}
> > +
> > + # Dict to convert from string to an enum-like integer value.
> > + _name_to_val = {k: v for v, k in _name_by_val.items()}
> > +
> > + @staticmethod
>
> This stuff strikes me as a bit overdone; _name_to_val is really just the
> variable list for the class, right?
Those two vars are a kind of magic: they create two dictionaries:
- _name_by_val converts a token integer into a string;
- _name_to_val converts a string to an integer.
I opted to use this approach for a couple of reasons:
1. using tok.kind == "BEGIN" (and similar) everywhere is harder to
maintain, as python won't check for typos. Now, if one writes:
CToken.BEGHIN, an error will be raised;
2. the cost to convert from string to int is O(1), so it's not
much of a performance issue at the conversion;
3. using an integer on all checks should make the code faster as
it doesn't require a loop to check the string.
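A minimal sketch of this enum-with-lookup-dicts pattern (a hypothetical `Tok` class for illustration, not the patch's actual code) would be:

```python
# Hypothetical, simplified sketch of the enum-like pattern described
# above: class-level ints plus two lookup dicts built from vars().
class Tok:
    BEGIN = 0
    END = 1
    NAME = 2

    # int value -> attribute name, e.g. {0: "BEGIN", ...}
    _name_by_val = {v: k for k, v in dict(vars()).items() if isinstance(v, int)}
    # attribute name -> int value (the reverse map)
    _name_to_val = {k: v for v, k in _name_by_val.items()}

print(Tok._name_by_val[0])       # BEGIN
print(Tok._name_to_val["NAME"])  # 2
```

Note that both comprehensions work inside the class body because the outermost iterable (`dict(vars()).items()` and `_name_by_val.items()`) is evaluated in the class namespace.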
>
> > [ ... skip 30 lines ... ]
> > + f"{self.brace_level}, {self.paren_level}, {self.bracket_level})"
> > +
> > +#: Tokens to parse C code.
> > +TOKEN_LIST = [
> > + (CToken.COMMENT, r"//[^\n]*|/\*[\s\S]*?\*/"),
> > +
>
> So these aren't "tokens", this is a list of regexes; how is it intended
> to be used?
Right. I could have named it better, like RE_TOKEN_LIST, or
TOKEN_REGEX, as in the original example, which came from the Python
documentation for the "re" module:
https://docs.python.org/3/library/re.html#writing-a-tokenizer
basically, we have the token type as the first element at the tuple,
and regex as the second one.
When a regex matches, the CToken will be filled with kind=tuple[0].
the loop:
for match in re_scanner.finditer(code):
will parse the entire C code source, in order, converting it into
a token list. So, a file like this:
/**
* enum dmub_abm_ace_curve_type - ACE curve type.
*/
enum dmub_abm_ace_curve_type {
/**
* ACE curve as defined by the SW layer.
*/
ABM_ACE_CURVE_TYPE__SW = 0,
/**
* ACE curve as defined by the SW to HW translation interface layer.
*/
ABM_ACE_CURVE_TYPE__SW_IF = 1,
};
will become (I used pprint here to better align the tokens):
[CToken(CToken.ENUM, "enum", 0, (0, 0, 0)),
CToken(CToken.SPACE, " ", 4, (0, 0, 0)),
CToken(CToken.NAME, "dmub_abm_ace_curve_type", 5, (0, 0, 0)),
CToken(CToken.SPACE, " ", 28, (0, 0, 0)),
CToken(CToken.BEGIN, "{", 29, (0, 0, 1)),
CToken(CToken.SPACE, " ", 30, (0, 0, 1)),
CToken(CToken.COMMENT, "/**
* ACE curve as defined by the SW layer. */", 31, (0, 0, 1)),
CToken(CToken.SPACE, " ", 86, (0, 0, 1)),
CToken(CToken.NAME, "ABM_ACE_CURVE_TYPE__SW", 87, (0, 0, 1)),
CToken(CToken.SPACE, " ", 109, (0, 0, 1)),
CToken(CToken.OP, "=", 110, (0, 0, 1)),
CToken(CToken.SPACE, " ", 111, (0, 0, 1)),
CToken(CToken.NUMBER, "0", 112, (0, 0, 1)),
CToken(CToken.PUNC, ",", 113, (0, 0, 1)),
CToken(CToken.SPACE, " ", 114, (0, 0, 1)),
CToken(CToken.COMMENT, "/**
* ACE curve as defined by the SW to HW translation interface layer. */", 115, (0, 0, 1)),
CToken(CToken.SPACE, " ", 198, (0, 0, 1)),
CToken(CToken.NAME, "ABM_ACE_CURVE_TYPE__SW_IF", 199, (0, 0, 1)),
CToken(CToken.SPACE, " ", 224, (0, 0, 1)),
CToken(CToken.OP, "=", 225, (0, 0, 1)),
CToken(CToken.SPACE, " ", 226, (0, 0, 1)),
CToken(CToken.NUMBER, "1", 227, (0, 0, 1)),
CToken(CToken.PUNC, ",", 228, (0, 0, 1)),
CToken(CToken.SPACE, " ", 229, (0, 0, 1)),
CToken(CToken.END, "}", 230, (0, 0, 0)),
CToken(CToken.PUNC, ";", 231, (0, 0, 0))]
>
> > + (CToken.STRING, r'"(?:\\.|[^"\\])*"'),
> > + (CToken.CHAR, r"'(?:\\.|[^'\\])'"),
> > +
> > + (CToken.NUMBER, r"0[xX][0-9a-fA-F]+[uUlL]*|0[0-7]+[uUlL]*|"
>
> How does "[\s\S]*" differ from plain old "*" ?
They are not identical, as "." doesn't match "\n" by default. As the
tokenizer also needs to pick up "\n" in several cases, like in
comments, "[\s\S]" works better.
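A standalone snippet (not from the patch) illustrating the difference:

```python
import re

text = "/* line one\n   line two */"

# "." does not cross newlines unless re.DOTALL is set, so a dot-based
# comment pattern fails on a multi-line comment:
print(re.match(r"/\*.*?\*/", text))        # None

# "[\s\S]" matches any character including "\n", regardless of flags:
m = re.match(r"/\*[\s\S]*?\*/", text)
print(m.group())                            # the whole comment
```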
> > [ ... skip 15 lines ... ]
> > + (CToken.STRUCT, r"\bstruct\b"),
> > + (CToken.UNION, r"\bunion\b"),
> > + (CToken.ENUM, r"\benum\b"),
> > + (CToken.TYPEDEF, r"\bkinddef\b"),
> > +
> > + (CToken.NAME, r"[A-Za-z_][A-Za-z0-9_]*"),
>
> "-" and "!" never need to be escaped.
"-" usually needs to be escaped inside a character class, because it
can denote a range there. I had some trouble with the parser due to
the lack of escapes, so I ended up being conservative.
( I'm finding it a little bit hard to follow your comments...
Here, for instance, "!" is not at CToken.NAME regex.
Did b4 review place your comment at the wrong place? )
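The range behavior inside character classes can be illustrated with a generic example (not the patch's actual regexes):

```python
import re

# Unescaped "-" between two characters in a character class is a range;
# escaped, it is a literal dash. Outside a class, "-" is always literal.
print(re.findall(r"[a-c]", "abc-xyz"))    # ['a', 'b', 'c']
print(re.findall(r"[a\-c]", "abc-xyz"))   # ['a', 'c', '-']
print(re.findall(r"->", "p->next"))       # ['->']
```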
>
> > +
> > + (CToken.SPACE, r"[\s]+"),
> > +
> > + (CToken.MISMATCH,r"."),
> > +]
> > +
>
> "kinddef" ?
Should be "typedef".
This was due to a "sed s,type,kind," I applied to avoid using
"type" for the token type, as, when I started integrating it
with kdoc_re, it became confusing.
I'll fix it at the next respin.
>
> > +#: Handle C continuation lines.
> > +RE_CONT = KernRe(r"\\\n")
> > +
> > +RE_COMMENT_START = KernRe(r'/\*\s*')
> > +
>
> Don't need the [brackets] here
where?
>
> > [ ... skip 6 lines ... ]
> > +
> > + When converted to string, it drops comments and handle public/private
> > + values, respecting depth.
> > + """
> > +
> > + # This class is inspired and follows the basic concepts of:
>
> That seems weird, why don't you just initialize it here?
Hard to tell what you're referring to. Maybe this:
RE_SCANNER = fill_re_scanner(TOKEN_LIST)
The rationale is that I don't want to re-create this every time,
as this is const.
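That caching idea can be sketched like this (hypothetical helper name, not the patch's API):

```python
import re

# A tiny token table in the spirit of TOKEN_LIST; names are illustrative.
TOKEN_TABLE = [("NAME", r"[A-Za-z_][A-Za-z0-9_]*"), ("SPACE", r"\s+")]

_scanner = None  # compiled once, on first use

def get_scanner():
    """Build the combined named-group regex only on the first call."""
    global _scanner
    if _scanner is None:
        parts = [f"(?P<{name}>{pat})" for name, pat in TOKEN_TABLE]
        _scanner = re.compile("|".join(parts))
    return _scanner

# match.lastgroup recovers which named group (i.e. which token) matched.
tokens = [(m.lastgroup, m.group()) for m in get_scanner().finditer("foo bar")]
print(tokens)  # [('NAME', 'foo'), ('SPACE', ' '), ('NAME', 'bar')]
```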
>
> > [ ... skip 14 lines ... ]
> > + source = RE_CONT.sub("", source)
> > +
> > + brace_level = 0
> > + paren_level = 0
> > + bracket_level = 0
> > +
>
> Do you mean "iterator" here?
If you mean the typo at _tokenize() help text, yes:
interactor -> iterator
>
> > [ ... skip 33 lines ... ]
> > + in this particular case, it makes sense, as we can pick the name
> > + when matching a code via re_scanner().
> > + """
> > + global re_scanner
> > +
> > + if not re_scanner:
>
> Putting __init__() first is fairly standard, methinks.
The CTokenizer __init__() method calls the _tokenize() method.
My personal preference is to have the called methods before the methods
that actually call them, even inside a class, where the order doesn't
matter - or even in C, when we have an include with all prototypes.
But if you prefer, I can reorder it.
> > [ ... skip 15 lines ... ]
> > +
> > + for tok in self.tokens:
> > + if tok.kind == CToken.BEGIN:
> > + show_stack.append(show_stack[-1])
> > +
> > + elif tok.kind == CToken.END:
>
> I still don't understand why you do this here - this is all constant, right?
For this one, I didn't get what part of the code you're referring to.
All constants on this code are using upper case names. They are:
- TOKEN_LIST (which should probably be named as TOKEN_REGEX_LIST).
- CToken enum-like names (BEGIN, END, OP, NAME, ...)
- three regexes (RE_CONT, RE_COMMENT_START, RE_SCANNER)
See, what the tokenizer does is a linear transformation from a C
source string into a token list.
So, for each instance, its content will change. Also, when we apply
CMatch logic, its content will also change.
>
> > + prev = show_stack[-1]
> > + if len(show_stack) > 1:
> > + show_stack.pop()
> > +
> > + if not prev and show_stack[-1]:
>
> So you create a nice iterator structure, then just put it all together into a
> list anyway?
Not sure what you meant here.
The end result of the tokenizer is a list of tokens, in the order
they appear at the source code.
To be able to handle public/private and do code transforms, using it
as a list is completely fine.
Now, if we want to use the tokenizer to parse things like:
typedef struct ca_descr_info {
unsigned int num;
unsigned int type;
} ca_descr_info_t;
Then having iterators to parse tokens on both directions
would be great, as the typedef identifier is at the end.
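A backwards scan for that case might look like this simplified sketch, with a plain string list standing in for CToken objects:

```python
# Simplified: find the typedef'd name, which sits just before the
# final ";" in "typedef struct { ... } name_t;".
tokens = ["typedef", "struct", "ca_descr_info", "{",
          "unsigned", "int", "num", ";",
          "}", "ca_descr_info_t", ";"]

name = None
for tok in reversed(tokens):
    if tok == ";":
        continue  # skip the trailing semicolon
    name = tok   # first real token from the end is the identifier
    break

print(name)  # ca_descr_info_t
```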
Thanks,
Mauro
* Re: [PATCH v2 07/28] docs: kdoc: move C Tokenizer to c_lex module
2026-03-16 23:30 ` Jonathan Corbet
@ 2026-03-17 8:02 ` Mauro Carvalho Chehab
0 siblings, 0 replies; 47+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-17 8:02 UTC (permalink / raw)
To: Jonathan Corbet
Cc: Linux Doc Mailing List, linux-hardening, linux-kernel,
Aleksandr Loktionov, Randy Dunlap
On Mon, 16 Mar 2026 17:30:58 -0600
Jonathan Corbet <corbet@lwn.net> wrote:
> Mauro Carvalho Chehab <mchehab+huawei@kernel.org> writes:
>
> > Place the C tokenizer on a different module.
> >
> > Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> > ---
> > tools/lib/python/kdoc/c_lex.py | 239 +++++++++++++++++++++++++++
> > tools/lib/python/kdoc/kdoc_parser.py | 3 +-
> > tools/lib/python/kdoc/kdoc_re.py | 233 --------------------------
> > 3 files changed, 241 insertions(+), 234 deletions(-)
> > create mode 100644 tools/lib/python/kdoc/c_lex.py
>
> One has to ask...why not just put it in its own file in the first place?
Good question... As usual, this started simple. When complexity
increased, and especially when I wanted to mangle with NestedMatch,
it became clearer that placing it elsewhere would be for the best.
I can adjust it at the next respin.
>
> Thanks,
>
> jon
Thanks,
Mauro
* Re: [PATCH v2 05/28] docs: kdoc_re: add a C tokenizer
2026-03-16 23:40 ` Jonathan Corbet
@ 2026-03-17 8:21 ` Mauro Carvalho Chehab
2026-03-17 17:04 ` Jonathan Corbet
0 siblings, 1 reply; 47+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-17 8:21 UTC (permalink / raw)
To: Jonathan Corbet
Cc: Randy Dunlap, Linux Doc Mailing List, linux-hardening,
linux-kernel, Aleksandr Loktionov
On Mon, 16 Mar 2026 17:40:22 -0600
Jonathan Corbet <corbet@lwn.net> wrote:
> Randy Dunlap <rdunlap@infradead.org> writes:
>
> > Uh, I find this review confusing.
> > Do your (Jon) comments refer to the code above them?
> > (more below)
>
> They do
>
> Or, at least, they did...but they clearly got mixed up in the sending
> somewhere. Below is the intended version...
Oh, I should have read this one before... Ignore my previous comment.
I'll move the answers to this reply, and answer the other ones.
> > tools/lib/python/kdoc/kdoc_re.py | 234 +++++++++++++++++++++++++++++++
> > 1 file changed, 234 insertions(+)
> >
> > diff --git a/tools/lib/python/kdoc/kdoc_re.py b/tools/lib/python/kdoc/kdoc_re.py
> > index 085b89a4547c..7bed4e9a8810 100644
> > --- a/tools/lib/python/kdoc/kdoc_re.py
> > +++ b/tools/lib/python/kdoc/kdoc_re.py
> > @@ -141,6 +141,240 @@ class KernRe:
> >
> > return self.last_match.groups()
> >
> > +class TokType():
> > +
> > + @staticmethod
> > + def __str__(val):
> > + """Return the name of an enum value"""
> > + return TokType._name_by_val.get(val, f"UNKNOWN({val})")
>
> What is this class supposed to do?
This __str__() method ensures that, when printing a CToken object,
the name will be displayed, instead of a number. This is really
useful when debugging.
See, if I add a print:
<snip>
--- a/tools/lib/python/kdoc/kdoc_parser.py
+++ b/tools/lib/python/kdoc/kdoc_parser.py
@@ -87,6 +87,7 @@ def trim_private_members(text):
"""
tokens = CTokenizer(text)
+ print(tokens.tokens)
return str(tokens)
</snip>
the tokens will appear as names at the output:
$ ./scripts/kernel-doc -none er.c
[CToken(CToken.ENUM, "enum", 0, (0, 0, 0)), CToken(CToken.SPACE, " ", 4, (0, 0, 0)), CToken(CToken.NAME, "dmub_abm_ace_curve_type", 5, (0, 0, 0)), CToken(CToken.SPACE, " ", 28, (0, 0, 0)), CToken(CToken.BEGIN, "{", 29, (0, 0, 1)), CToken(CToken.SPACE, " ", 30, (0, 0, 1)), CToken(CToken.COMMENT, "/**
* ACE curve as defined by the SW layer. */", 31, (0, 0, 1)), CToken(CToken.SPACE, " ", 86, (0, 0, 1)), CToken(CToken.NAME, "ABM_ACE_CURVE_TYPE__SW", 87, (0, 0, 1)), CToken(CToken.SPACE, " ", 109, (0, 0, 1)), CToken(CToken.OP, "=", 110, (0, 0, 1)), CToken(CToken.SPACE, " ", 111, (0, 0, 1)), CToken(CToken.NUMBER, "0", 112, (0, 0, 1)), CToken(CToken.PUNC, ",", 113, (0, 0, 1)), CToken(CToken.SPACE, " ", 114, (0, 0, 1)), CToken(CToken.COMMENT, "/**
* ACE curve as defined by the SW to HW translation interface layer. */", 115, (0, 0, 1)), CToken(CToken.SPACE, " ", 198, (0, 0, 1)), CToken(CToken.NAME, "ABM_ACE_CURVE_TYPE__SW_IF", 199, (0, 0, 1)), CToken(CToken.SPACE, " ", 224, (0, 0, 1)), CToken(CToken.OP, "=", 225, (0, 0, 1)), CToken(CToken.SPACE, " ", 226, (0, 0, 1)), CToken(CToken.NUMBER, "1", 227, (0, 0, 1)), CToken(CToken.PUNC, ",", 228, (0, 0, 1)), CToken(CToken.SPACE, " ", 229, (0, 0, 1)), CToken(CToken.END, "}", 230, (0, 0, 0)), CToken(CToken.PUNC, ";", 231, (0, 0, 0))]
>
> > +
> > +class CToken():
> > + ""
> > + Data class to define a C token.
> > + ""
> > +
> > + # Tokens that can be used by the parser. Works like an C enum.
> > +
> > + COMMENT = 0 #: A standard C or C99 comment, including delimiter.
> > + STRING = 1 #: A string, including quotation marks.
> > + CHAR = 2 #: A character, including apostophes.
> > + NUMBER = 3 #: A number.
> > + PUNC = 4 #: A puntuation mark: ``;`` / ``,`` / ``.``.
> > + BEGIN = 5 #: A begin character: ``{`` / ``[`` / ``(``.
> > + END = 6 #: A end character: ``}`` / ``]`` / ``)``.
> > + CPP = 7 #: A preprocessor macro.
> > + HASH = 8 #: The hash character - useful to handle other macros.
> > + OP = 9 #: A C operator (add, subtract, ...).
> > + STRUCT = 10 #: A ``struct`` keyword.
> > + UNION = 11 #: An ``union`` keyword.
> > + ENUM = 12 #: A ``struct`` keyword.
> > + TYPEDEF = 13 #: A ``typedef`` keyword.
> > + NAME = 14 #: A name. Can be an ID or a type.
> > + SPACE = 15 #: Any space characters, including new lines
> > +
> > + MISMATCH = 255 #: an error indicator: should never happen in practice.
> > +
> > + # Dict to convert from an enum interger into a string.
> > + _name_by_val = {v: k for k, v in dict(vars()).items() if isinstance(v, int)}
> > +
> > + # Dict to convert from string to an enum-like integer value.
> > + _name_to_val = {k: v for v, k in _name_by_val.items()}
>
> This stuff strikes me as a bit overdone; _name_to_val is really just the
> variable list for the class, right?
Those two vars are a kind of magic: they create two dictionaries:
- _name_by_val converts a token integer into a string;
- _name_to_val converts a string to an integer.
I opted for this approach for a couple of reasons:
1. using tok.kind == "BEGIN" (and similar) everywhere is harder to
maintain, as Python won't check for typos. Now, if one writes
CToken.BEGHIN, an error will be raised;
2. the cost to convert from a string to an int is O(1), so the
conversion is not much of a performance issue;
3. using an integer on all checks should make the code faster, as
it doesn't require a character-by-character loop to compare strings.
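For reference, the dict-comprehension trick can be seen on a standalone
sketch outside kernel-doc - the class and values here are simplified
stand-ins, not the real CToken:

```python
# Toy version of the two lookup dicts built at class-creation time.
# Not the kernel-doc code itself - just the same pattern, reduced.
class Tok:
    COMMENT = 0
    STRING = 1
    NAME = 14

    # int -> name: vars() here is the class namespace being built
    _name_by_val = {v: k for k, v in dict(vars()).items() if isinstance(v, int)}
    # name -> int: the reverse mapping
    _name_to_val = {k: v for v, k in _name_by_val.items()}

print(Tok._name_by_val[14])        # NAME
print(Tok._name_to_val["STRING"])  # 1
```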
>
> > +
> > + @staticmethod
> > + def to_name(val):
> > + ""Convert from an integer value from CToken enum into a string""
> > +
> > + return CToken._name_by_val.get(val, f"UNKNOWN({val})")
> > +
> > + @staticmethod
> > + def from_name(name):
> > + ""Convert a string into a CToken enum value""
> > + if name in CToken._name_to_val:
> > + return CToken._name_to_val[name]
> > +
> > + return CToken.MISMATCH
> > +
> > + def __init__(self, kind, value, pos,
> > + brace_level, paren_level, bracket_level):
> > + self.kind = kind
> > + self.value = value
> > + self.pos = pos
> > + self.brace_level = brace_level
> > + self.paren_level = paren_level
> > + self.bracket_level = bracket_level
> > +
> > + def __repr__(self):
> > + name = self.to_name(self.kind)
> > + if isinstance(self.value, str):
> > + value = '"' + self.value + '"'
> > + else:
> > + value = self.value
> > +
> > + return f"CToken({name}, {value}, {self.pos}, " \
> > + f"{self.brace_level}, {self.paren_level}, {self.bracket_level})"
> > +
> > +#: Tokens to parse C code.
> > +TOKEN_LIST = [
>
> So these aren't "tokens", this is a list of regexes; how is it intended
> to be used?
>
> > + (CToken.COMMENT, r"//[^\n]*|/\*[\s\S]*?\*/"),
>
> How does "[\s\S]*" differ from plain old "*" ?
They are not identical: "." doesn't match "\n" unless re.DOTALL is
set, while "[\s\S]" matches any character, newlines included. As the
tokenizer also picks up "\n" on several cases, like on comments,
r"[\s\S]" works better.
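A quick way to see the difference with plain re (a standalone demo,
not code from the patch):

```python
import re

text = "/* multi\nline */ rest"

# "." stops at "\n" unless re.DOTALL is set, so the lazy match fails:
print(re.match(r"/\*.*?\*/", text))               # None
# "[\s\S]" matches any character, newlines included:
print(re.match(r"/\*[\s\S]*?\*/", text).group())  # the full two-line comment
```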
>
> > +
> > + (CToken.STRING, r'"(?:\\.|[^"\\])*"'),
> > + (CToken.CHAR, r"'(?:\\.|[^'\\])'"),
> > +
> > + (CToken.NUMBER, r"0[xX][0-9a-fA-F]+[uUlL]*|0[0-7]+[uUlL]*|"
> > + r"[0-9]+(\.[0-9]*)?([eE][+-]?[0-9]+)?[fFlL]*"),
> > +
> > + (CToken.PUNC, r"[;,\.]"),
> > +
> > + (CToken.BEGIN, r"[\[\(\{]"),
> > +
> > + (CToken.END, r"[\]\)\}]"),
> > +
> > + (CToken.CPP, r"#\s*(define|include|ifdef|ifndef|if|else|elif|endif|undef|pragma)\b"),
> > +
> > + (CToken.HASH, r"#"),
> > +
> > + (CToken.OP, r"\+\+|\-\-|\->|==|\!=|<=|>=|&&|\|\||<<|>>|\+=|\-=|\*=|/=|%="
> > + r"|&=|\|=|\^=|=|\+|\-|\*|/|%|<|>|&|\||\^|~|!|\?|\:"),
>
> "-" and "!" never need to be escaped.
"-" usually needs to be escaped, because it can be a range. I actually
tried without escaping it, but the regex failed. So I ended being
conservative.
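For completeness, both behaviors are easy to check with a standalone
snippet (again, not code from the patch):

```python
import re

# Outside a character class, "-" is literal, so escaping is optional:
print(re.findall(r"->|-", "a->b - c"))   # ['->', '-']

# Inside a class, position matters: "a-z" would be a range, while an
# escaped "\-" (or "-" placed first/last) is a literal dash:
print(re.findall(r"[a\-z]", "a-z m"))    # ['a', '-', 'z']
```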
>
> > +
> > + (CToken.STRUCT, r"\bstruct\b"),
> > + (CToken.UNION, r"\bunion\b"),
> > + (CToken.ENUM, r"\benum\b"),
> > + (CToken.TYPEDEF, r"\bkinddef\b"),
>
> "kinddef" ?
Should be "typedef".
This was due to a "sed s,type,kind," I applied to avoid using
"type" for the token type, as, when I started integrating it
with kdoc_re, it became confusing.
I'll fix at the next respin.
>
> > +
> > + (CToken.NAME, r"[A-Za-z_][A-Za-z0-9_]*"),
> > +
> > + (CToken.SPACE, r"[\s]+"),
>
> Don't need the [brackets] here
True. This was [ \t], and there was a separate token for newlines.
I merged them, but forgot to strip the brackets.
Will clean it up at the next respin.
>
> > +
> > + (CToken.MISMATCH,r"."),
> > +]
> > +
> > +#: Handle C continuation lines.
> > +RE_CONT = KernRe(r"\\\n")
> > +
> > +RE_COMMENT_START = KernRe(r'/\*\s*')
> > +
> > +#: tokenizer regex. Will be filled at the first CTokenizer usage.
> > +re_scanner = None
>
> That seems weird, why don't you just initialize it here?
Yeah, I changed this one to:
def fill_re_scanner(token_list):
"""Ancillary routine to convert TOKEN_LIST into a finditer regex"""
re_tokens = []
for kind, pattern in token_list:
name = CToken.to_name(kind)
re_tokens.append(f"(?P<{name}>{pattern})")
return KernRe("|".join(re_tokens), re.MULTILINE | re.DOTALL)
RE_SCANNER = fill_re_scanner(TOKEN_LIST)
but I guess this is in a later patch.
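In case it helps reviewing: the named-group trick the function relies
on can be seen in isolation with a toy token list (just an
illustration, not the kernel-doc patterns):

```python
import re

# Each (name, pattern) pair becomes a named group; after a match,
# match.lastgroup tells which token kind was hit.
TOKENS = [("NUMBER", r"\d+"), ("NAME", r"[A-Za-z_]\w*"), ("SPACE", r"\s+")]
scanner = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKENS))

for m in scanner.finditer("foo 42"):
    print(m.lastgroup, repr(m.group()))
# NAME 'foo'
# SPACE ' '
# NUMBER '42'
```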
>
> > +
> > +class CTokenizer():
> > + ""
> > + Scan C statements and definitions and produce tokens.
> > +
> > + When converted to string, it drops comments and handle public/private
> > + values, respecting depth.
> > + ""
> > +
> > + # This class is inspired and follows the basic concepts of:
> > + # https://docs.python.org/3/library/re.html#writing-a-tokenizer
> > +
> > + def _tokenize(self, source):
> > + ""
> > + Interactor that parses ``source``, splitting it into tokens, as defined
> > + at ``self.TOKEN_LIST``.
> > +
> > + The interactor returns a CToken class object.
> > + ""
>
> Do you mean "iterator" here?
Yes. will fix at the next respin.
>
> > +
> > + # Handle continuation lines. Note that kdoc_parser already has a
> > + # logic to do that. Still, let's keep it for completeness, as we might
> > + # end re-using this tokenizer outsize kernel-doc some day - or we may
> > + # eventually remove from there as a future cleanup.
> > + source = RE_CONT.sub("", source)
> > +
> > + brace_level = 0
> > + paren_level = 0
> > + bracket_level = 0
> > +
> > + for match in re_scanner.finditer(source):
> > + kind = CToken.from_name(match.lastgroup)
> > + pos = match.start()
> > + value = match.group()
> > +
> > + if kind == CToken.MISMATCH:
> > + raise RuntimeError(f"Unexpected token '{value}' on {pos}:\n\t{source}")
> > + elif kind == CToken.BEGIN:
> > + if value == '(':
> > + paren_level += 1
> > + elif value == '[':
> > + bracket_level += 1
> > + else: # value == '{'
> > + brace_level += 1
> > +
> > + elif kind == CToken.END:
> > + if value == ')' and paren_level > 0:
> > + paren_level -= 1
> > + elif value == ']' and bracket_level > 0:
> > + bracket_level -= 1
> > + elif brace_level > 0: # value == '}'
> > + brace_level -= 1
> > +
> > + yield CToken(kind, value, pos,
> > + brace_level, paren_level, bracket_level)
> > +
> > + def __init__(self, source):
>
> Putting __init__() first is fairly standard, methinks.
Yes, but __init__ calls _tokenize().
My personal preference is to place the called methods before the
methods that actually call them, even inside a class, where the order
doesn't matter - or even in C, when we have a header with all
prototypes.
But if you prefer, I can reorder it.
>
> > + ""
> > + Create a regular expression to handle TOKEN_LIST.
> > +
> > + While I generally don't like using regex group naming via:
> > + (?P<name>...)
> > +
> > + in this particular case, it makes sense, as we can pick the name
> > + when matching a code via re_scanner().
> > + ""
> > + global re_scanner
> > +
> > + if not re_scanner:
> > + re_tokens = []
> > +
> > + for kind, pattern in TOKEN_LIST:
> > + name = CToken.to_name(kind)
> > + re_tokens.append(f"(?P<{name}>{pattern})")
> > +
> > + re_scanner = KernRe("|".join(re_tokens), re.MULTILINE | re.DOTALL)
>
> I still don't understand why you do this here - this is all constant, right?
Yes. See above. I moved this logic to a function and called it at
module init time, so it happens just once.
>
> > +
> > + self.tokens = []
> > + for tok in self._tokenize(source):
> > + self.tokens.append(tok)
>
> So you create a nice iterator structure, then just put it all together into a
> list anyway?
We could have used yield here, but what's the point? Due to the C
transforms, we'll need to walk through all tokens multiple times.
Keeping them in a list ends up saving time, as we only need to
tokenize each source once.
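Just to illustrate the point with a toy generator (a stand-in, not the
real tokenizer): a generator is exhausted after one pass, while the
cached list can be walked as many times as needed:

```python
def tokenize(source):
    # stand-in for the real _tokenize() iterator
    yield from source.split()

gen = tokenize("a b c")
print(list(gen))   # ['a', 'b', 'c']
print(list(gen))   # [] - already exhausted

toks = list(tokenize("a b c"))  # cache once...
print(toks)        # ['a', 'b', 'c']
print(toks)        # ['a', 'b', 'c'] - ...reuse many times
```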
>
> > +
> > + def __str__(self):
> > + out="
> > + show_stack = [True]
> > +
> > + for tok in self.tokens:
> > + if tok.kind == CToken.BEGIN:
> > + show_stack.append(show_stack[-1])
> > +
> > + elif tok.kind == CToken.END:
> > + prev = show_stack[-1]
> > + if len(show_stack) > 1:
> > + show_stack.pop()
> > +
> > + if not prev and show_stack[-1]:
> > + #
> > + # Try to preserve indent
> > + #
> > + out += "\t" * (len(show_stack) - 1)
> > +
> > + out += str(tok.value)
> > + continue
> > +
> > + elif tok.kind == CToken.COMMENT:
> > + comment = RE_COMMENT_START.sub("", tok.value)
> > +
> > + if comment.startswith("private:"):
> > + show_stack[-1] = False
> > + show = False
> > + elif comment.startswith("public:"):
> > + show_stack[-1] = True
> > +
> > + continue
> > +
> > + if show_stack[-1]:
> > + out += str(tok.value)
> > +
> > + return out
> > +
> > +
> > #: Nested delimited pairs (brackets and parenthesis)
> > DELIMITER_PAIRS = {
> > '{': '}',
>
> Thanks,
>
> jon
>
Thanks,
Mauro
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH v2 05/28] docs: kdoc_re: add a C tokenizer
2026-03-17 8:21 ` Mauro Carvalho Chehab
@ 2026-03-17 17:04 ` Jonathan Corbet
0 siblings, 0 replies; 47+ messages in thread
From: Jonathan Corbet @ 2026-03-17 17:04 UTC (permalink / raw)
To: Mauro Carvalho Chehab
Cc: Randy Dunlap, Linux Doc Mailing List, linux-hardening,
linux-kernel, Aleksandr Loktionov
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> writes:
>> > tools/lib/python/kdoc/kdoc_re.py | 234 +++++++++++++++++++++++++++++++
>> > 1 file changed, 234 insertions(+)
>> >
>> > diff --git a/tools/lib/python/kdoc/kdoc_re.py b/tools/lib/python/kdoc/kdoc_re.py
>> > index 085b89a4547c..7bed4e9a8810 100644
>> > --- a/tools/lib/python/kdoc/kdoc_re.py
>> > +++ b/tools/lib/python/kdoc/kdoc_re.py
>> > @@ -141,6 +141,240 @@ class KernRe:
>> >
>> > return self.last_match.groups()
>> >
>> > +class TokType():
>> > +
>> > + @staticmethod
>> > + def __str__(val):
>> > + ""Return the name of an enum value""
>> > + return TokType._name_by_val.get(val, f"UNKNOWN({val})")
>>
>> What is this class supposed to do?
>
> This __str__() method ensures that, when printing a CToken object,
> the name will be displayed, instead of a number. This is really
> useful when debugging.
I was talking about the TokType class, though, not CToken. This class
doesn't appear to be used anywhere. Indeed, I notice now that when you
relocate CToken in patch 7, TokType is silently removed. So perhaps
it's better not to introduce it in the first place :)
jon
* Re: [PATCH v2 00/28] kernel-doc: use a C lexical tokenizer for transforms
2026-03-12 14:54 [PATCH v2 00/28] kernel-doc: use a C lexical tokenizer for transforms Mauro Carvalho Chehab
` (29 preceding siblings ...)
2026-03-13 9:17 ` [PATCH v2 00/28] kernel-doc: use a C lexical tokenizer for transforms Mauro Carvalho Chehab
@ 2026-03-17 17:12 ` Jonathan Corbet
2026-03-17 18:00 ` Mauro Carvalho Chehab
2026-03-17 18:57 ` Mauro Carvalho Chehab
30 siblings, 2 replies; 47+ messages in thread
From: Jonathan Corbet @ 2026-03-17 17:12 UTC (permalink / raw)
To: Mauro Carvalho Chehab, Kees Cook, Mauro Carvalho Chehab
Cc: Mauro Carvalho Chehab, linux-doc, linux-hardening, linux-kernel,
Gustavo A. R. Silva, Aleksandr Loktionov, Randy Dunlap,
Shuah Khan, Vincent Mailhol
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> writes:
> Sorry for respamming this one too quick. It ends that v1 had some
> bugs causing it to fail on several cases. I opted to add extra
> patches in the end. This way, it better integrates with kdoc_re.
> As part of it, now c_lex will output file name when reporting
> errors. With that regards, only more serious errors will raise
> an exception. They are meant to indicate problems at kernel-doc
> itself. Parsing errors are now using the same warning approach
> as kdoc_parser.
>
> I also added a filter at Ctokenizer __str__() logic for the
> string convertion to drop some weirdness whitespaces and uneeded
> ";" characters at the output.
>
> Finally, v2 address the undefined behavior about private: comment
> propagation.
>
> This patch series change how kdoc parser handles macro replacements.
So I have at least glanced at the whole series now; other than the few
things I pointed out, I don't find a whole lot to complain about. I do
worry about adding another 2000 lines to kernel-doc, even if more than
half of them are tests. But hopefully it leads to a better and more
maintainable system.
We're starting to get late enough in the cycle that I'm a bit leery of
applying this work for 7.1. What was your thinking on timing?
Thanks,
jon
* Re: [PATCH v2 00/28] kernel-doc: use a C lexical tokenizer for transforms
2026-03-17 17:12 ` Jonathan Corbet
@ 2026-03-17 18:00 ` Mauro Carvalho Chehab
2026-03-17 18:57 ` Mauro Carvalho Chehab
1 sibling, 0 replies; 47+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-17 18:00 UTC (permalink / raw)
To: Jonathan Corbet
Cc: Kees Cook, Mauro Carvalho Chehab, linux-doc, linux-hardening,
linux-kernel, Gustavo A. R. Silva, Aleksandr Loktionov,
Randy Dunlap, Shuah Khan, Vincent Mailhol
On Tue, 17 Mar 2026 11:12:50 -0600
Jonathan Corbet <corbet@lwn.net> wrote:
> Mauro Carvalho Chehab <mchehab+huawei@kernel.org> writes:
>
> > Sorry for respamming this one too quick. It ends that v1 had some
> > bugs causing it to fail on several cases. I opted to add extra
> > patches in the end. This way, it better integrates with kdoc_re.
> > As part of it, now c_lex will output file name when reporting
> > errors. With that regards, only more serious errors will raise
> > an exception. They are meant to indicate problems at kernel-doc
> > itself. Parsing errors are now using the same warning approach
> > as kdoc_parser.
> >
> > I also added a filter at Ctokenizer __str__() logic for the
> > string convertion to drop some weirdness whitespaces and uneeded
> > ";" characters at the output.
> >
> > Finally, v2 address the undefined behavior about private: comment
> > propagation.
> >
> > This patch series change how kdoc parser handles macro replacements.
>
> So I have at least glanced at the whole series now; other than the few
> things I pointed out, I don't find a whole lot to complain about. I do
> worry about adding another 2000 lines to kernel-doc, even if more than
> half of them are tests. But hopefully it leads to a better and more
> maintainable system.
>
> We're starting to get late enough in the cycle that I'm a bit leery of
> applying this work for 7.1. What was your thinking on timing?
I'm sending a v3 now. It basically addresses your points, which
reduced the series to 22 patches.
I'm adding the diff between the two versions here, as it may help
checking what changed. I'll also document the main changes at
patch 00/22.
--
Thanks,
Mauro
diff --git a/tools/lib/python/kdoc/c_lex.py b/tools/lib/python/kdoc/c_lex.py
index 95c4dd5afe77..b6d58bd470a9 100644
--- a/tools/lib/python/kdoc/c_lex.py
+++ b/tools/lib/python/kdoc/c_lex.py
@@ -50,7 +50,7 @@ class CToken():
STRING = 1 #: A string, including quotation marks.
CHAR = 2 #: A character, including apostophes.
NUMBER = 3 #: A number.
- PUNC = 4 #: A puntuation mark: ``;`` / ``,`` / ``.``.
+ PUNC = 4 #: A puntuation mark: / ``,`` / ``.``.
BEGIN = 5 #: A begin character: ``{`` / ``[`` / ``(``.
END = 6 #: A end character: ``}`` / ``]`` / ``)``.
CPP = 7 #: A preprocessor macro.
@@ -62,8 +62,9 @@ class CToken():
TYPEDEF = 13 #: A ``typedef`` keyword.
NAME = 14 #: A name. Can be an ID or a type.
SPACE = 15 #: Any space characters, including new lines
+ ENDSTMT = 16 #: End of an statement (``;``).
- BACKREF = 16 #: Not a valid C sequence, but used at sub regex patterns.
+ BACKREF = 17 #: Not a valid C sequence, but used at sub regex patterns.
MISMATCH = 255 #: an error indicator: should never happen in practice.
@@ -104,37 +105,42 @@ class CToken():
return f"CToken(CToken.{name}, {value}, {self.pos}, {self.level})"
-#: Tokens to parse C code.
-TOKEN_LIST = [
+#: Regexes to parse C code, transforming it into tokens.
+RE_SCANNER_LIST = [
+ #
+ # Note that \s\S is different than .*, as it also catches \n
+ #
(CToken.COMMENT, r"//[^\n]*|/\*[\s\S]*?\*/"),
(CToken.STRING, r'"(?:\\.|[^"\\])*"'),
(CToken.CHAR, r"'(?:\\.|[^'\\])'"),
- (CToken.NUMBER, r"0[xX][0-9a-fA-F]+[uUlL]*|0[0-7]+[uUlL]*|"
- r"[0-9]+(\.[0-9]*)?([eE][+-]?[0-9]+)?[fFlL]*"),
+ (CToken.NUMBER, r"0[xX][\da-fA-F]+[uUlL]*|0[0-7]+[uUlL]*|"
+ r"\d+(?:\.\d*)?(?:[eE][+-]?\d+)?[fFlL]*"),
- (CToken.PUNC, r"[;,\.]"),
+ (CToken.ENDSTMT, r"(?:\s+;|;)"),
+
+ (CToken.PUNC, r"[,\.]"),
(CToken.BEGIN, r"[\[\(\{]"),
(CToken.END, r"[\]\)\}]"),
- (CToken.CPP, r"#\s*(define|include|ifdef|ifndef|if|else|elif|endif|undef|pragma)\b"),
+ (CToken.CPP, r"#\s*(?:define|include|ifdef|ifndef|if|else|elif|endif|undef|pragma)\b"),
(CToken.HASH, r"#"),
(CToken.OP, r"\+\+|\-\-|\->|==|\!=|<=|>=|&&|\|\||<<|>>|\+=|\-=|\*=|/=|%="
- r"|&=|\|=|\^=|=|\+|\-|\*|/|%|<|>|&|\||\^|~|!|\?|\:|\@"),
+ r"|&=|\|=|\^=|[=\+\-\*/%<>&\|\^~!\?\:]"),
(CToken.STRUCT, r"\bstruct\b"),
(CToken.UNION, r"\bunion\b"),
(CToken.ENUM, r"\benum\b"),
- (CToken.TYPEDEF, r"\bkinddef\b"),
+ (CToken.TYPEDEF, r"\btypedef\b"),
- (CToken.NAME, r"[A-Za-z_][A-Za-z0-9_]*"),
+ (CToken.NAME, r"[A-Za-z_]\w*"),
- (CToken.SPACE, r"[\s]+"),
+ (CToken.SPACE, r"\s+"),
(CToken.BACKREF, r"\\\d+"),
@@ -142,7 +148,7 @@ TOKEN_LIST = [
]
def fill_re_scanner(token_list):
- """Ancillary routine to convert TOKEN_LIST into a finditer regex"""
+ """Ancillary routine to convert RE_SCANNER_LIST into a finditer regex"""
re_tokens = []
for kind, pattern in token_list:
@@ -157,7 +163,8 @@ RE_CONT = KernRe(r"\\\n")
RE_COMMENT_START = KernRe(r'/\*\s*')
#: tokenizer regex. Will be filled at the first CTokenizer usage.
-RE_SCANNER = fill_re_scanner(TOKEN_LIST)
+RE_SCANNER = fill_re_scanner(RE_SCANNER_LIST)
+
class CTokenizer():
"""
@@ -170,10 +177,39 @@ class CTokenizer():
# This class is inspired and follows the basic concepts of:
# https://docs.python.org/3/library/re.html#writing-a-tokenizer
+ def __init__(self, source=None, log=None):
+ """
+ Create a regular expression to handle RE_SCANNER_LIST.
+
+ While I generally don't like using regex group naming via:
+ (?P<name>...)
+
+ in this particular case, it makes sense, as we can pick the name
+ when matching a code via RE_SCANNER.
+ """
+
+ self.tokens = []
+
+ if not source:
+ return
+
+ if isinstance(source, list):
+ self.tokens = source
+ return
+
+ #
+ # While we could just use _tokenize directly via interator,
+ # As we'll need to use the tokenizer several times inside kernel-doc
+ # to handle macro transforms, cache the results on a list, as
+ # re-using it is cheaper than having to parse everytime.
+ #
+ for tok in self._tokenize(source):
+ self.tokens.append(tok)
+
def _tokenize(self, source):
"""
- Interactor that parses ``source``, splitting it into tokens, as defined
- at ``self.TOKEN_LIST``.
+ Iterator that parses ``source``, splitting it into tokens, as defined
+ at ``self.RE_SCANNER_LIST``.
The interactor returns a CToken class object.
"""
@@ -214,29 +250,6 @@ class CTokenizer():
yield CToken(kind, value, pos,
brace_level, paren_level, bracket_level)
- def __init__(self, source=None, log=None):
- """
- Create a regular expression to handle TOKEN_LIST.
-
- While I generally don't like using regex group naming via:
- (?P<name>...)
-
- in this particular case, it makes sense, as we can pick the name
- when matching a code via RE_SCANNER.
- """
-
- self.tokens = []
-
- if not source:
- return
-
- if isinstance(source, list):
- self.tokens = source
- return
-
- for tok in self._tokenize(source):
- self.tokens.append(tok)
-
def __str__(self):
out=""
show_stack = [True]
@@ -278,18 +291,10 @@ class CTokenizer():
# Do some cleanups before ";"
- if (tok.kind == CToken.SPACE and
- next_tok.kind == CToken.PUNC and
- next_tok.value == ";"):
-
+ if tok.kind == CToken.SPACE and next_tok.kind == CToken.ENDSTMT:
continue
- if (tok.kind == CToken.PUNC and
- next_tok.kind == CToken.PUNC and
- tok.value == ";" and
- next_tok.kind == CToken.PUNC and
- next_tok.value == ";"):
-
+ if tok.kind == CToken.ENDSTMT and next_tok.kind == tok.kind:
continue
out += str(tok.value)
@@ -368,9 +373,13 @@ class CTokenArgs:
if tok.kind == CToken.BEGIN:
inner_level += 1
- continue
- if tok.kind == CToken.END:
+ #
+ # Discard first begin
+ #
+ if not groups_list[0]:
+ continue
+ elif tok.kind == CToken.END:
inner_level -= 1
if inner_level < 0:
break
@@ -414,7 +423,7 @@ class CTokenArgs:
if inner_level < 0:
break
- if tok.kind == CToken.PUNC and delim == tok.value:
+ if tok.kind in [CToken.PUNC, CToken.ENDSTMT] and delim == tok.value:
pos += 1
if self.greedy and pos > self.max_group:
pos -= 1
@@ -458,6 +467,7 @@ class CTokenArgs:
return new.tokens
+
class CMatch:
"""
Finding nested delimiters is hard with regular expressions. It is
diff --git a/tools/lib/python/kdoc/kdoc_parser.py b/tools/lib/python/kdoc/kdoc_parser.py
index 3b99740ebed3..f6c4ee3b18c9 100644
--- a/tools/lib/python/kdoc/kdoc_parser.py
+++ b/tools/lib/python/kdoc/kdoc_parser.py
@@ -13,9 +13,8 @@ import sys
import re
from pprint import pformat
+from kdoc.c_lex import CTokenizer, tokenizer_set_log
from kdoc.kdoc_re import KernRe
-from kdoc.c_lex import tokenizer_set_log
-from kdoc.c_lex import CTokenizer
from kdoc.kdoc_item import KdocItem
#
diff --git a/tools/unittests/test_tokenizer.py b/tools/unittests/test_tokenizer.py
index 6a0bd49df72e..5634b4a7283e 100755
--- a/tools/unittests/test_tokenizer.py
+++ b/tools/unittests/test_tokenizer.py
@@ -76,13 +76,13 @@ TESTS_TOKENIZER = {
"expected": [
CToken(CToken.NAME, "int"),
CToken(CToken.NAME, "a"),
- CToken(CToken.PUNC, ";"),
+ CToken(CToken.ENDSTMT, ";"),
CToken(CToken.COMMENT, "// comment"),
CToken(CToken.NAME, "float"),
CToken(CToken.NAME, "b"),
CToken(CToken.OP, "="),
CToken(CToken.NUMBER, "1.23"),
- CToken(CToken.PUNC, ";"),
+ CToken(CToken.ENDSTMT, ";"),
],
},
@@ -103,7 +103,7 @@ TESTS_TOKENIZER = {
CToken(CToken.BEGIN, "[", brace_level=1, bracket_level=1),
CToken(CToken.NUMBER, "10", brace_level=1, bracket_level=1),
CToken(CToken.END, "]", brace_level=1),
- CToken(CToken.PUNC, ";", brace_level=1),
+ CToken(CToken.ENDSTMT, ";", brace_level=1),
CToken(CToken.NAME, "func", brace_level=1),
CToken(CToken.BEGIN, "(", brace_level=1, paren_level=1),
CToken(CToken.NAME, "a", brace_level=1, paren_level=1),
@@ -117,7 +117,7 @@ TESTS_TOKENIZER = {
CToken(CToken.NAME, "c", brace_level=1, paren_level=2),
CToken(CToken.END, ")", brace_level=1, paren_level=1),
CToken(CToken.END, ")", brace_level=1),
- CToken(CToken.PUNC, ";", brace_level=1),
+ CToken(CToken.ENDSTMT, ";", brace_level=1),
CToken(CToken.END, "}"),
],
},
* Re: [PATCH v2 00/28] kernel-doc: use a C lexical tokenizer for transforms
2026-03-17 17:12 ` Jonathan Corbet
2026-03-17 18:00 ` Mauro Carvalho Chehab
@ 2026-03-17 18:57 ` Mauro Carvalho Chehab
1 sibling, 0 replies; 47+ messages in thread
From: Mauro Carvalho Chehab @ 2026-03-17 18:57 UTC (permalink / raw)
To: Jonathan Corbet
Cc: Kees Cook, Mauro Carvalho Chehab, linux-doc, linux-hardening,
linux-kernel, Gustavo A. R. Silva, Aleksandr Loktionov,
Randy Dunlap, Shuah Khan, Vincent Mailhol
On Tue, 17 Mar 2026 11:12:50 -0600
Jonathan Corbet <corbet@lwn.net> wrote:
> Mauro Carvalho Chehab <mchehab+huawei@kernel.org> writes:
>
> > Sorry for respamming this one too quick. It ends that v1 had some
> > bugs causing it to fail on several cases. I opted to add extra
> > patches in the end. This way, it better integrates with kdoc_re.
> > As part of it, now c_lex will output file name when reporting
> > errors. With that regards, only more serious errors will raise
> > an exception. They are meant to indicate problems at kernel-doc
> > itself. Parsing errors are now using the same warning approach
> > as kdoc_parser.
> >
> > I also added a filter at Ctokenizer __str__() logic for the
> > string convertion to drop some weirdness whitespaces and uneeded
> > ";" characters at the output.
> >
> > Finally, v2 address the undefined behavior about private: comment
> > propagation.
> >
> > This patch series change how kdoc parser handles macro replacements.
>
> I do worry about adding another 2000 lines to kernel-doc, even if more than
> half of them are tests. But hopefully it leads to a better and more
> maintainable system.
Net change due to the parser itself was ~650 lines of code, excluding
unittests.
Yet, at least for me, the code looks a lot better with:
(CMatch("VIRTIO_DECLARE_FEATURES"), r"union { u64 \1; u64 \1_array[VIRTIO_FEATURES_U64S]; }"),
...
(CMatch("struct_group"), r"struct { \2+ };"),
(CMatch("struct_group_attr"), r"struct { \3+ };"),
(CMatch("struct_group_tagged"), r"struct { \3+ };"),
(CMatch("__struct_group"), r"struct { \4+ };"),
and other similar stuff than with the previous approach, which used
very complex regular expressions and/or handled it in two steps.
IMO this should be a lot easier to maintain as well.
Also, the unittests will hopefully help to detect regressions, and
to test new stuff there without hidden bugs.
> We're starting to get late enough in the cycle that I'm a bit leery of
> applying this work for 7.1. What was your thinking on timing?
There is something I want to change, but not sure if it will
be in time: get rid of the ugly code at:
- rewrite_struct_members
- create_parameter_list
- split_struct_proto
I started doing some changes with that regards, but unlikely to
have time for 7.1.
I do have a pile of patches sitting here to be rebased.
Among them, there are unittests for KernelDoc class.
IMO, it is worth rebasing at least some of them in time for this
merge window. The ones with unittests are independent (or might
require minimal changes). I'd like to have at least those merged
for 7.1.
Among them, there are several tests written by Randy with
regards to some parsing issues at kernel-doc. We should at
least merge the ones that already pass after the tokenizer ;-)
Thanks,
Mauro
2026-03-12 14:54 [PATCH v2 00/28] kernel-doc: use a C lexical tokenizer for transforms Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 01/28] docs: python: add helpers to run unit tests Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 02/28] unittests: add a testbench to check public/private kdoc comments Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 03/28] docs: kdoc: don't add broken comments inside prototypes Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 04/28] docs: kdoc: properly handle empty enum arguments Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 05/28] docs: kdoc_re: add a C tokenizer Mauro Carvalho Chehab
2026-03-16 23:01 ` Jonathan Corbet
2026-03-17 7:59 ` Mauro Carvalho Chehab
2026-03-16 23:03 ` Jonathan Corbet
2026-03-16 23:29 ` Randy Dunlap
2026-03-16 23:40 ` Jonathan Corbet
2026-03-17 8:21 ` Mauro Carvalho Chehab
2026-03-17 17:04 ` Jonathan Corbet
2026-03-17 7:03 ` Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 06/28] docs: kdoc: use tokenizer to handle comments on structs Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 07/28] docs: kdoc: move C Tokenizer to c_lex module Mauro Carvalho Chehab
2026-03-16 23:30 ` Jonathan Corbet
2026-03-17 8:02 ` Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 08/28] unittests: test_private: modify it to use CTokenizer directly Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 09/28] unittests: test_tokenizer: check if the tokenizer works Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 10/28] unittests: add a runner to execute all unittests Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 11/28] docs: kdoc: create a CMatch to match nested C blocks Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 12/28] tools: unittests: add tests for CMatch Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 13/28] docs: c_lex: properly implement a sub() method " Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 14/28] unittests: test_cmatch: add tests for sub() Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 15/28] docs: kdoc: replace NestedMatch with CMatch Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 16/28] docs: kdoc_re: get rid of NestedMatch class Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 17/28] docs: xforms_lists: handle struct_group directly Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 18/28] docs: xforms_lists: better evaluate struct_group macros Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 19/28] docs: c_lex: add support to work with pure name ids Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 20/28] docs: xforms_lists: use CMatch for all identifiers Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 21/28] docs: c_lex: add "@" operator Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 22/28] docs: c_lex: don't exclude an extra token Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 23/28] docs: c_lex: setup a logger to report tokenizer issues Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 24/28] docs: unittests: add and adjust tests to check for errors Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 25/28] docs: c_lex: better handle BEGIN/END at search Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 26/28] docs: kernel-doc.rst: document private: scope propagation Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 27/28] docs: c_lex: produce a cleaner str() representation Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 28/28] unittests: test_cmatch: remove weird stuff from expected results Mauro Carvalho Chehab
2026-03-13 8:34 ` [PATCH v2 29/28] docs: kdoc: ensure that comments are dropped before calling split_struct_proto() Mauro Carvalho Chehab
2026-03-13 8:34 ` [PATCH v2 30/28] docs: kdoc_parser: avoid tokenizing structs everytime Mauro Carvalho Chehab
2026-03-13 11:05 ` Loktionov, Aleksandr
2026-03-13 11:05 ` [PATCH v2 29/28] docs: kdoc: ensure that comments are dropped before calling split_struct_proto() Loktionov, Aleksandr
2026-03-13 9:17 ` [PATCH v2 00/28] kernel-doc: use a C lexical tokenizer for transforms Mauro Carvalho Chehab
2026-03-17 17:12 ` Jonathan Corbet
2026-03-17 18:00 ` Mauro Carvalho Chehab
2026-03-17 18:57 ` Mauro Carvalho Chehab