From: Jonathan Corbet <corbet@lwn.net>
To: Randy Dunlap <rdunlap@infradead.org>,
Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Cc: Linux Doc Mailing List <linux-doc@vger.kernel.org>,
linux-hardening@vger.kernel.org, linux-kernel@vger.kernel.org,
Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Subject: Re: [PATCH v2 05/28] docs: kdoc_re: add a C tokenizer
Date: Mon, 16 Mar 2026 17:40:22 -0600 [thread overview]
Message-ID: <87ldfrfimx.fsf@trenco.lwn.net> (raw)
In-Reply-To: <c53a3638-7a72-472c-81e8-86a6c235b598@infradead.org>
Randy Dunlap <rdunlap@infradead.org> writes:
> Uh, I find this review confusing.
> Do your (Jon) comments refer to the code above them?
> (more below)
They do.
Or, at least, they did...but they clearly got mixed up in the sending
somewhere. Below is the intended version...
> tools/lib/python/kdoc/kdoc_re.py | 234 +++++++++++++++++++++++++++++++
> 1 file changed, 234 insertions(+)
>
> diff --git a/tools/lib/python/kdoc/kdoc_re.py b/tools/lib/python/kdoc/kdoc_re.py
> index 085b89a4547c..7bed4e9a8810 100644
> --- a/tools/lib/python/kdoc/kdoc_re.py
> +++ b/tools/lib/python/kdoc/kdoc_re.py
> @@ -141,6 +141,240 @@ class KernRe:
>
> return self.last_match.groups()
>
> +class TokType():
> +
> + @staticmethod
> + def __str__(val):
> + ""Return the name of an enum value""
> + return TokType._name_by_val.get(val, f"UNKNOWN({val})")
What is this class supposed to do?
> +
> +class CToken():
> + ""
> + Data class to define a C token.
> + ""
> +
> + # Tokens that can be used by the parser. Works like a C enum.
> +
> + COMMENT = 0 #: A standard C or C99 comment, including delimiters.
> + STRING = 1 #: A string, including quotation marks.
> + CHAR = 2 #: A character, including apostrophes.
> + NUMBER = 3 #: A number.
> + PUNC = 4 #: A punctuation mark: ``;`` / ``,`` / ``.``.
> + BEGIN = 5 #: A begin character: ``{`` / ``[`` / ``(``.
> + END = 6 #: An end character: ``}`` / ``]`` / ``)``.
> + CPP = 7 #: A preprocessor macro.
> + HASH = 8 #: The hash character - useful to handle other macros.
> + OP = 9 #: A C operator (add, subtract, ...).
> + STRUCT = 10 #: A ``struct`` keyword.
> + UNION = 11 #: A ``union`` keyword.
> + ENUM = 12 #: An ``enum`` keyword.
> + TYPEDEF = 13 #: A ``typedef`` keyword.
> + NAME = 14 #: A name. Can be an ID or a type.
> + SPACE = 15 #: Any space characters, including newlines.
> +
> + MISMATCH = 255 #: an error indicator: should never happen in practice.
> +
> + # Dict to convert from an enum integer into a string.
> + _name_by_val = {v: k for k, v in dict(vars()).items() if isinstance(v, int)}
> +
> + # Dict to convert from string to an enum-like integer value.
> + _name_to_val = {k: v for v, k in _name_by_val.items()}
This stuff strikes me as a bit overdone; _name_to_val is really just the
variable list for the class, right?
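FWIW, the standard enum module gives you all of this machinery for free; an untested sketch, reusing the CToken name with only a few of the values shown:

```python
from enum import IntEnum

class CToken(IntEnum):
    """A few representative values; the real list is longer."""
    COMMENT = 0
    STRING = 1
    MISMATCH = 255

# The name<->value maps and the static helpers come built in:
assert CToken.COMMENT.name == "COMMENT"                        # to_name()
assert CToken["STRING"] is CToken.STRING                       # from_name()
assert CToken.__members__.get("BOGUS", CToken.MISMATCH) is CToken.MISMATCH
```

Members are still plain ints (IntEnum), so they can go on being compared against integer token kinds elsewhere.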
> +
> + @staticmethod
> + def to_name(val):
> + ""Convert from an integer value from CToken enum into a string""
> +
> + return CToken._name_by_val.get(val, f"UNKNOWN({val})")
> +
> + @staticmethod
> + def from_name(name):
> + ""Convert a string into a CToken enum value""
> + if name in CToken._name_to_val:
> + return CToken._name_to_val[name]
> +
> + return CToken.MISMATCH
> +
> + def __init__(self, kind, value, pos,
> + brace_level, paren_level, bracket_level):
> + self.kind = kind
> + self.value = value
> + self.pos = pos
> + self.brace_level = brace_level
> + self.paren_level = paren_level
> + self.bracket_level = bracket_level
> +
> + def __repr__(self):
> + name = self.to_name(self.kind)
> + if isinstance(self.value, str):
> + value = '"' + self.value + '"'
> + else:
> + value = self.value
> +
> + return f"CToken({name}, {value}, {self.pos}, " \
> + f"{self.brace_level}, {self.paren_level}, {self.bracket_level})"
> +
> +#: Tokens to parse C code.
> +TOKEN_LIST = [
So these aren't "tokens", this is a list of regexes; how is it intended
to be used?
> + (CToken.COMMENT, r"//[^\n]*|/\*[\s\S]*?\*/"),
How does "[\s\S]*" differ from plain old "*" ?
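The only difference I can see is newline handling - and since the scanner is compiled with re.DOTALL below, even that goes away. A quick (untested) check:

```python
import re

text = "/* line one\n   line two */"

# Without DOTALL, '.' stops at the newline...
assert re.match(r"/\*.*?\*/", text) is None
# ...but [\s\S] matches anything, newlines included:
assert re.match(r"/\*[\s\S]*?\*/", text) is not None
# With DOTALL, plain '.' behaves the same way:
assert re.match(r"/\*.*?\*/", text, re.DOTALL) is not None
```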
> +
> + (CToken.STRING, r'"(?:\\.|[^"\\])*"'),
> + (CToken.CHAR, r"'(?:\\.|[^'\\])'"),
> +
> + (CToken.NUMBER, r"0[xX][0-9a-fA-F]+[uUlL]*|0[0-7]+[uUlL]*|"
> + r"[0-9]+(\.[0-9]*)?([eE][+-]?[0-9]+)?[fFlL]*"),
> +
> + (CToken.PUNC, r"[;,\.]"),
> +
> + (CToken.BEGIN, r"[\[\(\{]"),
> +
> + (CToken.END, r"[\]\)\}]"),
> +
> + (CToken.CPP, r"#\s*(define|include|ifdef|ifndef|if|else|elif|endif|undef|pragma)\b"),
> +
> + (CToken.HASH, r"#"),
> +
> + (CToken.OP, r"\+\+|\-\-|\->|==|\!=|<=|>=|&&|\|\||<<|>>|\+=|\-=|\*=|/=|%="
> + r"|&=|\|=|\^=|=|\+|\-|\*|/|%|<|>|&|\||\^|~|!|\?|\:"),
"-" and "!" never need to be escaped.
> +
> + (CToken.STRUCT, r"\bstruct\b"),
> + (CToken.UNION, r"\bunion\b"),
> + (CToken.ENUM, r"\benum\b"),
> + (CToken.TYPEDEF, r"\bkinddef\b"),
"kinddef" ?
> +
> + (CToken.NAME, r"[A-Za-z_][A-Za-z0-9_]*"),
> +
> + (CToken.SPACE, r"[\s]+"),
Don't need the [brackets] here
> +
> + (CToken.MISMATCH, r"."),
> +]
> +
> +#: Handle C continuation lines.
> +RE_CONT = KernRe(r"\\\n")
> +
> +RE_COMMENT_START = KernRe(r'/\*\s*')
> +
> +#: tokenizer regex. Will be filled at the first CTokenizer usage.
> +re_scanner = None
That seems weird, why don't you just initialize it here?
> +
> +class CTokenizer():
> + ""
> + Scan C statements and definitions and produce tokens.
> +
> + When converted to string, it drops comments and handle public/private
> + values, respecting depth.
> + ""
> +
> + # This class is inspired and follows the basic concepts of:
> + # https://docs.python.org/3/library/re.html#writing-a-tokenizer
> +
> + def _tokenize(self, source):
> + ""
> + Interactor that parses ``source``, splitting it into tokens, as defined
> + at ``self.TOKEN_LIST``.
> +
> + The interactor returns a CToken class object.
> + ""
Do you mean "iterator" here?
> +
> + # Handle continuation lines. Note that kdoc_parser already has
> + # logic to do that. Still, let's keep it for completeness, as we might
> + # end up re-using this tokenizer outside kernel-doc some day - or we may
> + # eventually remove it from there as a future cleanup.
> + source = RE_CONT.sub("", source)
> +
> + brace_level = 0
> + paren_level = 0
> + bracket_level = 0
> +
> + for match in re_scanner.finditer(source):
> + kind = CToken.from_name(match.lastgroup)
> + pos = match.start()
> + value = match.group()
> +
> + if kind == CToken.MISMATCH:
> + raise RuntimeError(f"Unexpected token '{value}' on {pos}:\n\t{source}")
> + elif kind == CToken.BEGIN:
> + if value == '(':
> + paren_level += 1
> + elif value == '[':
> + bracket_level += 1
> + else: # value == '{'
> + brace_level += 1
> +
> + elif kind == CToken.END:
> + if value == ')' and paren_level > 0:
> + paren_level -= 1
> + elif value == ']' and bracket_level > 0:
> + bracket_level -= 1
> + elif brace_level > 0: # value == '}'
> + brace_level -= 1
> +
> + yield CToken(kind, value, pos,
> + brace_level, paren_level, bracket_level)
> +
> + def __init__(self, source):
Putting __init__() first is fairly standard, methinks.
> + ""
> + Create a regular expression to handle TOKEN_LIST.
> +
> + While I generally don't like using regex group naming via:
> + (?P<name>...)
> +
> + in this particular case, it makes sense, as we can pick the name
> + when matching a code via re_scanner().
> + ""
> + global re_scanner
> +
> + if not re_scanner:
> + re_tokens = []
> +
> + for kind, pattern in TOKEN_LIST:
> + name = CToken.to_name(kind)
> + re_tokens.append(f"(?P<{name}>{pattern})")
> +
> + re_scanner = KernRe("|".join(re_tokens), re.MULTILINE | re.DOTALL)
I still don't understand why you do this here - this is all constant, right?
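Since TOKEN_LIST never changes, the whole scanner could just be built at import time - no global, no lazy-init dance in __init__(). An untested sketch with a cut-down (name, pattern) list standing in for the real one:

```python
import re

# Illustrative stand-in for TOKEN_LIST: (name, pattern) pairs.
TOKEN_LIST = [
    ("NUMBER", r"[0-9]+"),
    ("NAME",   r"[A-Za-z_][A-Za-z0-9_]*"),
    ("SPACE",  r"\s+"),
]

# Built exactly once, when the module is first imported.
RE_SCANNER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_LIST),
    re.MULTILINE | re.DOTALL)

# match.lastgroup then recovers the token name directly:
tokens = [(m.lastgroup, m.group()) for m in RE_SCANNER.finditer("foo 42")]
```

With that, __init__() shrinks to just running the tokenizer over the source.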
> +
> + self.tokens = []
> + for tok in self._tokenize(source):
> + self.tokens.append(tok)
So you create a nice iterator structure, then just put it all together into a
list anyway?
> +
> + def __str__(self):
> + out="
> + show_stack = [True]
> +
> + for tok in self.tokens:
> + if tok.kind == CToken.BEGIN:
> + show_stack.append(show_stack[-1])
> +
> + elif tok.kind == CToken.END:
> + prev = show_stack[-1]
> + if len(show_stack) > 1:
> + show_stack.pop()
> +
> + if not prev and show_stack[-1]:
> + #
> + # Try to preserve indent
> + #
> + out += "\t" * (len(show_stack) - 1)
> +
> + out += str(tok.value)
> + continue
> +
> + elif tok.kind == CToken.COMMENT:
> + comment = RE_COMMENT_START.sub("", tok.value)
> +
> + if comment.startswith("private:"):
> + show_stack[-1] = False
> + show = False
> + elif comment.startswith("public:"):
> + show_stack[-1] = True
> +
> + continue
> +
> + if show_stack[-1]:
> + out += str(tok.value)
> +
> + return out
> +
> +
> #: Nested delimited pairs (brackets and parenthesis)
> DELIMITER_PAIRS = {
> '{': '}',
Thanks,
jon