From: Jonathan Corbet <corbet@lwn.net>
To: Randy Dunlap <rdunlap@infradead.org>,
	Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Cc: Linux Doc Mailing List <linux-doc@vger.kernel.org>,
	linux-hardening@vger.kernel.org, linux-kernel@vger.kernel.org,
	Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Subject: Re: [PATCH v2 05/28] docs: kdoc_re: add a C tokenizer
Date: Mon, 16 Mar 2026 17:40:22 -0600	[thread overview]
Message-ID: <87ldfrfimx.fsf@trenco.lwn.net> (raw)
In-Reply-To: <c53a3638-7a72-472c-81e8-86a6c235b598@infradead.org>

Randy Dunlap <rdunlap@infradead.org> writes:

> Uh, I find this review confusing.
> Do your (Jon) comments refer to the code above them?
> (more below)

They do

Or, at least, they did...but they clearly got mixed up in the sending
somewhere.  Below is the intended version...

> tools/lib/python/kdoc/kdoc_re.py | 234 +++++++++++++++++++++++++++++++
>  1 file changed, 234 insertions(+)
> 
> diff --git a/tools/lib/python/kdoc/kdoc_re.py b/tools/lib/python/kdoc/kdoc_re.py
> index 085b89a4547c..7bed4e9a8810 100644
> --- a/tools/lib/python/kdoc/kdoc_re.py
> +++ b/tools/lib/python/kdoc/kdoc_re.py
> @@ -141,6 +141,240 @@ class KernRe:
> 
> 	 return self.last_match.groups()
> 
> +class TokType():
> +
> +    @staticmethod
> +    def __str__(val):
> +        ""Return the name of an enum value""
> +        return TokType._name_by_val.get(val, f"UNKNOWN({val})")

What is this class supposed to do?

> +
> +class CToken():
> +    ""
> +    Data class to define a C token.
> +    ""
> +
> +    # Tokens that can be used by the parser. Works like a C enum.
> +
> +    COMMENT = 0     #: A standard C or C99 comment, including delimiter.
> +    STRING = 1      #: A string, including quotation marks.
> +    CHAR = 2        #: A character, including apostrophes.
> +    NUMBER = 3      #: A number.
> +    PUNC = 4        #: A punctuation mark: ``;`` / ``,`` / ``.``.
> +    BEGIN = 5       #: A begin character: ``{`` / ``[`` / ``(``.
> +    END = 6         #: An end character: ``}`` / ``]`` / ``)``.
> +    CPP = 7         #: A preprocessor macro.
> +    HASH = 8        #: The hash character - useful to handle other macros.
> +    OP = 9          #: A C operator (add, subtract, ...).
> +    STRUCT = 10     #: A ``struct`` keyword.
> +    UNION = 11      #: A ``union`` keyword.
> +    ENUM = 12       #: An ``enum`` keyword.
> +    TYPEDEF = 13    #: A ``typedef`` keyword.
> +    NAME = 14       #: A name. Can be an ID or a type.
> +    SPACE = 15      #: Any space characters, including newlines.
> +
> +    MISMATCH = 255  #: An error indicator: should never happen in practice.
> +
> +    # Dict to convert from an enum integer into a string.
> +    _name_by_val = {v: k for k, v in dict(vars()).items() if isinstance(v, int)}
> +
> +    # Dict to convert from string to an enum-like integer value.
> +    _name_to_val = {k: v for v, k in _name_by_val.items()}

This stuff strikes me as a bit overdone; _name_to_val is really just the
variable list for the class, right?
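
FWIW, the standard-library enum type would give you both mappings for
free.  A totally untested sketch (trimmed to a few of the values, with
the MISMATCH fallback from the patch kept as a small wrapper):

```python
from enum import IntEnum

class CToken(IntEnum):
    """Token kinds as a real enum, instead of bare class attributes."""
    COMMENT = 0
    STRING = 1
    CHAR = 2
    MISMATCH = 255

    @classmethod
    def from_name(cls, name):
        # Unknown names fall back to MISMATCH, as in the patch.
        return cls.__members__.get(name, cls.MISMATCH)

# .name replaces to_name(); __members__ replaces _name_to_val.
print(CToken.COMMENT.name)                # COMMENT
print(CToken.from_name("STRING").name)    # STRING
print(CToken.from_name("BOGUS").name)     # MISMATCH
```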

> +
> +    @staticmethod
> +    def to_name(val):
> +        ""Convert from an integer value from CToken enum into a string""
> +
> +        return CToken._name_by_val.get(val, f"UNKNOWN({val})")
> +
> +    @staticmethod
> +    def from_name(name):
> +        ""Convert a string into a CToken enum value""
> +        if name in CToken._name_to_val:
> +            return CToken._name_to_val[name]
> +
> +        return CToken.MISMATCH
> +
> +    def __init__(self, kind, value, pos,
> +                 brace_level, paren_level, bracket_level):
> +        self.kind = kind
> +        self.value = value
> +        self.pos = pos
> +        self.brace_level = brace_level
> +        self.paren_level = paren_level
> +        self.bracket_level = bracket_level
> +
> +    def __repr__(self):
> +        name = self.to_name(self.kind)
> +        if isinstance(self.value, str):
> +            value = '"' + self.value + '"'
> +        else:
> +            value = self.value
> +
> +        return f"CToken({name}, {value}, {self.pos}, " \
> +               f"{self.brace_level}, {self.paren_level}, {self.bracket_level})"
> +
> +#: Tokens to parse C code.
> +TOKEN_LIST = [

So these aren't "tokens", this is a list of regexes; how is it intended
to be used?

> +    (CToken.COMMENT, r"//[^\n]*|/\*[\s\S]*?\*/"),

How does "[\s\S]*" differ from plain old "*" ?

> +
> +    (CToken.STRING,  r'"(?:\\.|[^"\\])*"'),
> +    (CToken.CHAR,    r"'(?:\\.|[^'\\])'"),
> +
> +    (CToken.NUMBER,  r"0[xX][0-9a-fA-F]+[uUlL]*|0[0-7]+[uUlL]*|"
> +                     r"[0-9]+(\.[0-9]*)?([eE][+-]?[0-9]+)?[fFlL]*"),
> +
> +    (CToken.PUNC,    r"[;,\.]"),
> +
> +    (CToken.BEGIN,   r"[\[\(\{]"),
> +
> +    (CToken.END,     r"[\]\)\}]"),
> +
> +    (CToken.CPP,     r"#\s*(define|include|ifdef|ifndef|if|else|elif|endif|undef|pragma)\b"),
> +
> +    (CToken.HASH,    r"#"),
> +
> +    (CToken.OP,      r"\+\+|\-\-|\->|==|\!=|<=|>=|&&|\|\||<<|>>|\+=|\-=|\*=|/=|%="
> +                     r"|&=|\|=|\^=|=|\+|\-|\*|/|%|<|>|&|\||\^|~|!|\?|\:"),

"-" and "!" never need to be escaped.

> +
> +    (CToken.STRUCT,  r"\bstruct\b"),
> +    (CToken.UNION,   r"\bunion\b"),
> +    (CToken.ENUM,    r"\benum\b"),
> +    (CToken.TYPEDEF, r"\bkinddef\b"),

"kinddef" ?

> +
> +    (CToken.NAME,      r"[A-Za-z_][A-Za-z0-9_]*"),
> +
> +    (CToken.SPACE,   r"[\s]+"),

Don't need the [brackets] here

> +
> +    (CToken.MISMATCH,r"."),
> +]
> +
> +#: Handle C continuation lines.
> +RE_CONT = KernRe(r"\\\n")
> +
> +RE_COMMENT_START = KernRe(r'/\*\s*')
> +
> +#: tokenizer regex. Will be filled at the first CTokenizer usage.
> +re_scanner = None

That seems weird, why don't you just initialize it here?
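
I.e., nothing about it depends on the source being scanned, so it
could just be built at import time.  A stripped-down sketch, with
plain re standing in for KernRe and a shortened token list:

```python
import re

# A shortened TOKEN_LIST, just to show the shape (names from the patch):
TOKEN_LIST = [
    ("NUMBER",   r"[0-9]+"),
    ("NAME",     r"[A-Za-z_][A-Za-z0-9_]*"),
    ("SPACE",    r"\s+"),
    ("MISMATCH", r"."),
]

# Built exactly once, at module-load time -- TOKEN_LIST never changes.
re_scanner = re.compile(
    "|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_LIST),
    re.MULTILINE | re.DOTALL)

print([m.lastgroup for m in re_scanner.finditer("foo 42")])
```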

> +
> +class CTokenizer():
> +    ""
> +    Scan C statements and definitions and produce tokens.
> +
> +    When converted to string, it drops comments and handle public/private
> +    values, respecting depth.
> +    ""
> +
> +    # This class is inspired and follows the basic concepts of:
> +    #   https://docs.python.org/3/library/re.html#writing-a-tokenizer
> +
> +    def _tokenize(self, source):
> +        ""
> +        Interactor that parses ``source``, splitting it into tokens, as defined
> +        at ``self.TOKEN_LIST``.
> +
> +        The interactor returns a CToken class object.
> +        ""

Do you mean "iterator" here?

> +
> +        # Handle continuation lines. Note that kdoc_parser already has
> +        # logic to do that. Still, let's keep it for completeness, as we
> +        # might end up re-using this tokenizer outside kernel-doc some
> +        # day - or we may eventually remove it from there as a future cleanup.
> +        source = RE_CONT.sub("", source)
> +
> +        brace_level = 0
> +        paren_level = 0
> +        bracket_level = 0
> +
> +        for match in re_scanner.finditer(source):
> +            kind = CToken.from_name(match.lastgroup)
> +            pos = match.start()
> +            value = match.group()
> +
> +            if kind == CToken.MISMATCH:
> +                raise RuntimeError(f"Unexpected token '{value}' on {pos}:\n\t{source}")
> +            elif kind == CToken.BEGIN:
> +                if value == '(':
> +                    paren_level += 1
> +                elif value == '[':
> +                    bracket_level += 1
> +                else:  # value == '{'
> +                    brace_level += 1
> +
> +            elif kind == CToken.END:
> +                if value == ')' and paren_level > 0:
> +                    paren_level -= 1
> +                elif value == ']' and bracket_level > 0:
> +                    bracket_level -= 1
> +                elif brace_level > 0:    # value == '}'
> +                    brace_level -= 1
> +
> +            yield CToken(kind, value, pos,
> +                         brace_level, paren_level, bracket_level)
> +
> +    def __init__(self, source):

Putting __init__() first is fairly standard, methinks.

> +        ""
> +        Create a regular expression to handle TOKEN_LIST.
> +
> +        While I generally don't like using regex group naming via:
> +            (?P<name>...)
> +
> +        in this particular case, it makes sense, as we can pick the name
> +        when matching a code via re_scanner().
> +        ""
> +        global re_scanner
> +
> +        if not re_scanner:
> +            re_tokens = []
> +
> +            for kind, pattern in TOKEN_LIST:
> +                name = CToken.to_name(kind)
> +                re_tokens.append(f"(?P<{name}>{pattern})")
> +
> +            re_scanner = KernRe("|".join(re_tokens), re.MULTILINE | re.DOTALL)

I still don't understand why you do this here - this is all constant, right?

> +
> +        self.tokens = []
> +        for tok in self._tokenize(source):
> +            self.tokens.append(tok)

So you create a nice iterator structure, then just put it all together into a
list anyway?

> +
> +    def __str__(self):
> +        out="
> +        show_stack = [True]
> +
> +        for tok in self.tokens:
> +            if tok.kind == CToken.BEGIN:
> +                show_stack.append(show_stack[-1])
> +
> +            elif tok.kind == CToken.END:
> +                prev = show_stack[-1]
> +                if len(show_stack) > 1:
> +                    show_stack.pop()
> +
> +                if not prev and show_stack[-1]:
> +                    #
> +                    # Try to preserve indent
> +                    #
> +                    out += "\t" * (len(show_stack) - 1)
> +
> +                    out += str(tok.value)
> +                    continue
> +
> +            elif tok.kind == CToken.COMMENT:
> +                comment = RE_COMMENT_START.sub("", tok.value)
> +
> +                if comment.startswith("private:"):
> +                    show_stack[-1] = False
> +                    show = False
> +                elif comment.startswith("public:"):
> +                    show_stack[-1] = True
> +
> +                continue
> +
> +            if show_stack[-1]:
> +                out += str(tok.value)
> +
> +        return out
> +
> +
>  #: Nested delimited pairs (brackets and parenthesis)
>  DELIMITER_PAIRS = {
>      '{': '}',

Thanks,

jon



Thread overview: 47+ messages
2026-03-12 14:54 [PATCH v2 00/28] kernel-doc: use a C lexical tokenizer for transforms Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 01/28] docs: python: add helpers to run unit tests Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 02/28] unittests: add a testbench to check public/private kdoc comments Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 03/28] docs: kdoc: don't add broken comments inside prototypes Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 04/28] docs: kdoc: properly handle empty enum arguments Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 05/28] docs: kdoc_re: add a C tokenizer Mauro Carvalho Chehab
2026-03-16 23:01   ` Jonathan Corbet
2026-03-17  7:59     ` Mauro Carvalho Chehab
2026-03-16 23:03   ` Jonathan Corbet
2026-03-16 23:29     ` Randy Dunlap
2026-03-16 23:40       ` Jonathan Corbet [this message]
2026-03-17  8:21         ` Mauro Carvalho Chehab
2026-03-17 17:04           ` Jonathan Corbet
2026-03-17  7:03       ` Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 06/28] docs: kdoc: use tokenizer to handle comments on structs Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 07/28] docs: kdoc: move C Tokenizer to c_lex module Mauro Carvalho Chehab
2026-03-16 23:30   ` Jonathan Corbet
2026-03-17  8:02     ` Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 08/28] unittests: test_private: modify it to use CTokenizer directly Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 09/28] unittests: test_tokenizer: check if the tokenizer works Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 10/28] unittests: add a runner to execute all unittests Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 11/28] docs: kdoc: create a CMatch to match nested C blocks Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 12/28] tools: unittests: add tests for CMatch Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 13/28] docs: c_lex: properly implement a sub() method " Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 14/28] unittests: test_cmatch: add tests for sub() Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 15/28] docs: kdoc: replace NestedMatch with CMatch Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 16/28] docs: kdoc_re: get rid of NestedMatch class Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 17/28] docs: xforms_lists: handle struct_group directly Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 18/28] docs: xforms_lists: better evaluate struct_group macros Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 19/28] docs: c_lex: add support to work with pure name ids Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 20/28] docs: xforms_lists: use CMatch for all identifiers Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 21/28] docs: c_lex: add "@" operator Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 22/28] docs: c_lex: don't exclude an extra token Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 23/28] docs: c_lex: setup a logger to report tokenizer issues Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 24/28] docs: unittests: add and adjust tests to check for errors Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 25/28] docs: c_lex: better handle BEGIN/END at search Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 26/28] docs: kernel-doc.rst: document private: scope propagation Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 27/28] docs: c_lex: produce a cleaner str() representation Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 28/28] unittests: test_cmatch: remove weird stuff from expected results Mauro Carvalho Chehab
2026-03-13  8:34 ` [PATCH v2 29/28] docs: kdoc: ensure that comments are dropped before calling split_struct_proto() Mauro Carvalho Chehab
2026-03-13  8:34   ` [PATCH v2 30/28] docs: kdoc_parser: avoid tokenizing structs everytime Mauro Carvalho Chehab
2026-03-13 11:05     ` Loktionov, Aleksandr
2026-03-13 11:05   ` [PATCH v2 29/28] docs: kdoc: ensure that comments are dropped before calling split_struct_proto() Loktionov, Aleksandr
2026-03-13  9:17 ` [PATCH v2 00/28] kernel-doc: use a C lexical tokenizer for transforms Mauro Carvalho Chehab
2026-03-17 17:12 ` Jonathan Corbet
2026-03-17 18:00   ` Mauro Carvalho Chehab
2026-03-17 18:57   ` Mauro Carvalho Chehab
