From: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
To: Jonathan Corbet <corbet@lwn.net>
Cc: Randy Dunlap <rdunlap@infradead.org>,
	Linux Doc Mailing List <linux-doc@vger.kernel.org>,
	linux-hardening@vger.kernel.org, linux-kernel@vger.kernel.org,
	Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Subject: Re: [PATCH v2 05/28] docs: kdoc_re: add a C tokenizer
Date: Tue, 17 Mar 2026 09:21:53 +0100
Message-ID: <20260317092153.11f2b10d@foz.lan>
In-Reply-To: <87ldfrfimx.fsf@trenco.lwn.net>

On Mon, 16 Mar 2026 17:40:22 -0600
Jonathan Corbet <corbet@lwn.net> wrote:

> Randy Dunlap <rdunlap@infradead.org> writes:
> 
> > Uh, I find this review confusing.
> > Do your (Jon) comments refer to the code above them?
> > (more below)  
> 
> They do
> 
> Or, at least, they did...but they clearly got mixed up in the sending
> somewhere.  Below is the intended version...

Oh, I should have read this one first... Ignore my previous comment.
I'll move the answers to this reply and address the remaining ones.

> > tools/lib/python/kdoc/kdoc_re.py | 234 +++++++++++++++++++++++++++++++
> >  1 file changed, 234 insertions(+)
> > 
> > diff --git a/tools/lib/python/kdoc/kdoc_re.py b/tools/lib/python/kdoc/kdoc_re.py
> > index 085b89a4547c..7bed4e9a8810 100644
> > --- a/tools/lib/python/kdoc/kdoc_re.py
> > +++ b/tools/lib/python/kdoc/kdoc_re.py
> > @@ -141,6 +141,240 @@ class KernRe:
> > 
> > 	 return self.last_match.groups()
> > 
> > +class TokType():
> > +
> > +    @staticmethod
> > +    def __str__(val):
> > +        ""Return the name of an enum value""
> > +        return TokType._name_by_val.get(val, f"UNKNOWN({val})")  
> 
> What is this class supposed to do?

This __str__() method ensures that, when printing a CToken object,
the name will be displayed instead of a number. This is really
useful when debugging.

See, if I add a print:

<snip>
--- a/tools/lib/python/kdoc/kdoc_parser.py
+++ b/tools/lib/python/kdoc/kdoc_parser.py
@@ -87,6 +87,7 @@ def trim_private_members(text):
     """
 
     tokens = CTokenizer(text)
+    print(tokens.tokens)
     return str(tokens)
 
</snip>

the tokens will appear as names at the output:

$ ./scripts/kernel-doc -none er.c
[CToken(CToken.ENUM, "enum", 0, (0, 0, 0)), CToken(CToken.SPACE, " ", 4, (0, 0, 0)), CToken(CToken.NAME, "dmub_abm_ace_curve_type", 5, (0, 0, 0)), CToken(CToken.SPACE, " ", 28, (0, 0, 0)), CToken(CToken.BEGIN, "{", 29, (0, 0, 1)), CToken(CToken.SPACE, " ", 30, (0, 0, 1)), CToken(CToken.COMMENT, "/**
         * ACE curve as defined by the SW layer. */", 31, (0, 0, 1)), CToken(CToken.SPACE, " ", 86, (0, 0, 1)), CToken(CToken.NAME, "ABM_ACE_CURVE_TYPE__SW", 87, (0, 0, 1)), CToken(CToken.SPACE, " ", 109, (0, 0, 1)), CToken(CToken.OP, "=", 110, (0, 0, 1)), CToken(CToken.SPACE, " ", 111, (0, 0, 1)), CToken(CToken.NUMBER, "0", 112, (0, 0, 1)), CToken(CToken.PUNC, ",", 113, (0, 0, 1)), CToken(CToken.SPACE, " ", 114, (0, 0, 1)), CToken(CToken.COMMENT, "/**
         * ACE curve as defined by the SW to HW translation interface layer. */", 115, (0, 0, 1)), CToken(CToken.SPACE, " ", 198, (0, 0, 1)), CToken(CToken.NAME, "ABM_ACE_CURVE_TYPE__SW_IF", 199, (0, 0, 1)), CToken(CToken.SPACE, " ", 224, (0, 0, 1)), CToken(CToken.OP, "=", 225, (0, 0, 1)), CToken(CToken.SPACE, " ", 226, (0, 0, 1)), CToken(CToken.NUMBER, "1", 227, (0, 0, 1)), CToken(CToken.PUNC, ",", 228, (0, 0, 1)), CToken(CToken.SPACE, " ", 229, (0, 0, 1)), CToken(CToken.END, "}", 230, (0, 0, 0)), CToken(CToken.PUNC, ";", 231, (0, 0, 0))]

> 
> > +
> > +class CToken():
> > +    ""
> > +    Data class to define a C token.
> > +    ""
> > +
> > +    # Tokens that can be used by the parser. Works like an C enum.
> > +
> > +    COMMENT = 0     #: A standard C or C99 comment, including delimiter.
> > +    STRING = 1      #: A string, including quotation marks.
> > +    CHAR = 2        #: A character, including apostrophes.
> > +    NUMBER = 3      #: A number.
> > +    PUNC = 4        #: A punctuation mark: ``;`` / ``,`` / ``.``.
> > +    BEGIN = 5       #: A begin character: ``{`` / ``[`` / ``(``.
> > +    END = 6         #: An end character: ``}`` / ``]`` / ``)``.
> > +    CPP = 7         #: A preprocessor macro.
> > +    HASH = 8        #: The hash character - useful to handle other macros.
> > +    OP = 9          #: A C operator (add, subtract, ...).
> > +    STRUCT = 10     #: A ``struct`` keyword.
> > +    UNION = 11      #: A ``union`` keyword.
> > +    ENUM = 12       #: An ``enum`` keyword.
> > +    TYPEDEF = 13    #: A ``typedef`` keyword.
> > +    NAME = 14       #: A name. Can be an ID or a type.
> > +    SPACE = 15      #: Any space characters, including newlines.
> > +
> > +    MISMATCH = 255  #: An error indicator; should never happen in practice.
> > +
> > +    # Dict to convert from an enum integer into a string.
> > +    _name_by_val = {v: k for k, v in dict(vars()).items() if isinstance(v, int)}
> > +
> > +    # Dict to convert from string to an enum-like integer value.
> > +    _name_to_val = {k: v for v, k in _name_by_val.items()}  
> 
> This stuff strikes me as a bit overdone; _name_to_val is really just the
> variable list for the class, right?

Those two vars are a kind of magic: they create two dictionaries
(a standalone sketch of the trick follows the list below):

- _name_by_val converts a token integer into a string;
- _name_to_val converts a string to an integer.

I opted to use this approach for a couple of reasons:

1. using tok.kind == "BEGIN" (and similar) everywhere is harder to
   maintain, as Python won't check for typos. Now, if one writes
   CToken.BEGHIN, an error will be raised;

2. the cost to convert from string to int is O(1), so the conversion
   is not much of a performance issue;

3. using an integer on all checks should make the code faster, as
   integer comparison avoids a character-by-character string comparison.
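To illustrate the trick outside CToken, here is a minimal standalone
sketch (the Kind class is hypothetical, just for illustration):

	class Kind:
	    RED = 0
	    GREEN = 1

	    # vars() inside a class body returns the namespace built so far,
	    # so only the int attributes defined above survive the filter.
	    _name_by_val = {v: k for k, v in dict(vars()).items()
	                    if isinstance(v, int)}
	    _name_to_val = {k: v for v, k in _name_by_val.items()}

	print(Kind._name_by_val[1])       # GREEN
	print(Kind._name_to_val["RED"])   # 0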

> 
> > +
> > +    @staticmethod
> > +    def to_name(val):
> > +        ""Convert from an integer value from CToken enum into a string""
> > +
> > +        return CToken._name_by_val.get(val, f"UNKNOWN({val})")
> > +
> > +    @staticmethod
> > +    def from_name(name):
> > +        ""Convert a string into a CToken enum value""
> > +        if name in CToken._name_to_val:
> > +            return CToken._name_to_val[name]
> > +
> > +        return CToken.MISMATCH
> > +
> > +    def __init__(self, kind, value, pos,
> > +                 brace_level, paren_level, bracket_level):
> > +        self.kind = kind
> > +        self.value = value
> > +        self.pos = pos
> > +        self.brace_level = brace_level
> > +        self.paren_level = paren_level
> > +        self.bracket_level = bracket_level
> > +
> > +    def __repr__(self):
> > +        name = self.to_name(self.kind)
> > +        if isinstance(self.value, str):
> > +            value = '"' + self.value + '"'
> > +        else:
> > +            value = self.value
> > +
> > +        return f"CToken({name}, {value}, {self.pos}, " \
> > +               f"{self.brace_level}, {self.paren_level}, {self.bracket_level})"
> > +
> > +#: Tokens to parse C code.
> > +TOKEN_LIST = [  
> 
> So these aren't "tokens", this is a list of regexes; how is it intended
> to be used?
> 
> > +    (CToken.COMMENT, r"//[^\n]*|/\*[\s\S]*?\*/"),  
> 
> How does "[\s\S]*" differ from plain old "*" ?

They are not identical, as "." (and thus ".*") doesn't match "\n"
unless re.DOTALL is used. As the tokenizer also needs to pick up "\n"
in several cases, like inside comments, "[\s\S]" works better.
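
A standalone illustration (not from the patch):

	import re

	text = "/* a\n   multi-line comment */"

	# "." stops at the newline unless re.DOTALL is given...
	print(re.match(r"/\*.*\*/", text))        # None
	# ...while [\s\S] matches any character, newlines included.
	print(re.match(r"/\*[\s\S]*?\*/", text))  # matches the whole comment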

> 
> > +
> > +    (CToken.STRING,  r'"(?:\\.|[^"\\])*"'),
> > +    (CToken.CHAR,    r"'(?:\\.|[^'\\])'"),
> > +
> > +    (CToken.NUMBER,  r"0[xX][0-9a-fA-F]+[uUlL]*|0[0-7]+[uUlL]*|"
> > +                     r"[0-9]+(\.[0-9]*)?([eE][+-]?[0-9]+)?[fFlL]*"),
> > +
> > +    (CToken.PUNC,    r"[;,\.]"),
> > +
> > +    (CToken.BEGIN,   r"[\[\(\{]"),
> > +
> > +    (CToken.END,     r"[\]\)\}]"),
> > +
> > +    (CToken.CPP,     r"#\s*(define|include|ifdef|ifndef|if|else|elif|endif|undef|pragma)\b"),
> > +
> > +    (CToken.HASH,    r"#"),
> > +
> > +    (CToken.OP,      r"\+\+|\-\-|\->|==|\!=|<=|>=|&&|\|\||<<|>>|\+=|\-=|\*=|/=|%="
> > +                     r"|&=|\|=|\^=|=|\+|\-|\*|/|%|<|>|&|\||\^|~|!|\?|\:"),  
> 
> "-" and "!" never need to be escaped.

"-" usually needs to be escaped, because it can be a range. I actually
tried without escaping it, but the regex failed. So I ended being 
conservative. 
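
For reference, a standalone example of the range behavior that makes
me conservative - inside a character class, an unescaped "-" changes
the meaning:

	import re

	print(re.findall(r"[a-c]", "abc-"))   # ['a', 'b', 'c'] - a..c range
	print(re.findall(r"[a\-c]", "abc-"))  # ['a', 'c', '-'] - literal dash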

> 
> > +
> > +    (CToken.STRUCT,  r"\bstruct\b"),
> > +    (CToken.UNION,   r"\bunion\b"),
> > +    (CToken.ENUM,    r"\benum\b"),
> > +    (CToken.TYPEDEF, r"\bkinddef\b"),  
> 
> "kinddef" ?

Should be "typedef".

This was due to a "sed s,type,kind," I applied to avoid naming the
token attribute "type", as it became confusing when I started
integrating it with kdoc_re.

I'll fix it in the next respin.


> 
> > +
> > +    (CToken.NAME,      r"[A-Za-z_][A-Za-z0-9_]*"),
> > +
> > +    (CToken.SPACE,   r"[\s]+"),  
> 
> Don't need the [brackets] here

True. This was [ \t], and there was a separate token for newlines.
I merged them, but forgot to strip the brackets.

Will clean it up in the next respin.

> 
> > +
> > +    (CToken.MISMATCH,r"."),
> > +]
> > +
> > +#: Handle C continuation lines.
> > +RE_CONT = KernRe(r"\\\n")
> > +
> > +RE_COMMENT_START = KernRe(r'/\*\s*')
> > +
> > +#: tokenizer regex. Will be filled at the first CTokenizer usage.
> > +re_scanner = None  
> 
> That seems weird, why don't you just initialize it here?

Yeah, I changed this one to:

	def fill_re_scanner(token_list):
	    """Ancillary routine to convert TOKEN_LIST into a finditer regex"""
	    re_tokens = []

	    for kind, pattern in token_list:
	        name = CToken.to_name(kind)
	        re_tokens.append(f"(?P<{name}>{pattern})")

	    return KernRe("|".join(re_tokens), re.MULTILINE | re.DOTALL)

	RE_SCANNER = fill_re_scanner(TOKEN_LIST)

but I guess this belongs in a later patch.
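
For completeness, the reason the named groups pay off is that re
reports which alternative matched via match.lastgroup; a tiny
standalone sketch (simplified patterns, not the real TOKEN_LIST):

	import re

	scanner = re.compile(r"(?P<NUMBER>[0-9]+)|"
	                     r"(?P<NAME>[A-Za-z_]\w*)|"
	                     r"(?P<SPACE>\s+)")

	for m in scanner.finditer("foo 42"):
	    print(m.lastgroup, repr(m.group()))
	# NAME 'foo'
	# SPACE ' '
	# NUMBER '42'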

> 
> > +
> > +class CTokenizer():
> > +    ""
> > +    Scan C statements and definitions and produce tokens.
> > +
> > +    When converted to string, it drops comments and handle public/private
> > +    values, respecting depth.
> > +    ""
> > +
> > +    # This class is inspired and follows the basic concepts of:
> > +    #   https://docs.python.org/3/library/re.html#writing-a-tokenizer
> > +
> > +    def _tokenize(self, source):
> > +        ""
> > +        Interactor that parses ``source``, splitting it into tokens, as defined
> > +        at ``self.TOKEN_LIST``.
> > +
> > +        The interactor returns a CToken class object.
> > +        ""  
> 
> Do you mean "iterator" here?

Yes. Will fix it in the next respin.

> 
> > +
> > +        # Handle continuation lines. Note that kdoc_parser already has
> > +        # logic to do that. Still, let's keep it for completeness, as we might
> > +        # end up re-using this tokenizer outside kernel-doc some day - or we may
> > +        # eventually remove it from there as a future cleanup.
> > +        source = RE_CONT.sub("", source)
> > +
> > +        brace_level = 0
> > +        paren_level = 0
> > +        bracket_level = 0
> > +
> > +        for match in re_scanner.finditer(source):
> > +            kind = CToken.from_name(match.lastgroup)
> > +            pos = match.start()
> > +            value = match.group()
> > +
> > +            if kind == CToken.MISMATCH:
> > +                raise RuntimeError(f"Unexpected token '{value}' on {pos}:\n\t{source}")
> > +            elif kind == CToken.BEGIN:
> > +                if value == '(':
> > +                    paren_level += 1
> > +                elif value == '[':
> > +                    bracket_level += 1
> > +                else:  # value == '{'
> > +                    brace_level += 1
> > +
> > +            elif kind == CToken.END:
> > +                if value == ')' and paren_level > 0:
> > +                    paren_level -= 1
> > +                elif value == ']' and bracket_level > 0:
> > +                    bracket_level -= 1
> > +                elif brace_level > 0:    # value == '}'
> > +                    brace_level -= 1
> > +
> > +            yield CToken(kind, value, pos,
> > +                         brace_level, paren_level, bracket_level)
> > +
> > +    def __init__(self, source):  
> 
> Putting __init__() first is fairly standard, methinks.

Yes, but __init__ calls _tokenize().

My personal preference is to place the called methods before the
methods that call them, even inside a class, where the order doesn't
matter - or even in C, when there is a header with all the prototypes.

But if you prefer, I can reorder it.

> 
> > +        ""
> > +        Create a regular expression to handle TOKEN_LIST.
> > +
> > +        While I generally don't like using regex group naming via:
> > +            (?P<name>...)
> > +
> > +        in this particular case, it makes sense, as we can pick the name
> > +        when matching a code via re_scanner().
> > +        ""
> > +        global re_scanner
> > +
> > +        if not re_scanner:
> > +            re_tokens = []
> > +
> > +            for kind, pattern in TOKEN_LIST:
> > +                name = CToken.to_name(kind)
> > +                re_tokens.append(f"(?P<{name}>{pattern})")
> > +
> > +            re_scanner = KernRe("|".join(re_tokens), re.MULTILINE | re.DOTALL)  
> 
> I still don't understand why you do this here - this is all constant, right?

Yes. See above. I moved this logic into a function that is called at
module init time, so it happens just once.

> 
> > +
> > +        self.tokens = []
> > +        for tok in self._tokenize(source):
> > +            self.tokens.append(tok)  
> 
> So you create a nice iterator structure, then just put it all together into a
> list anyway?

We could have used yield here, but what's the point? Due to the C
transforms, we'll need to walk over all tokens multiple times.

Having them in a list ends up saving time, as we only need to
tokenize once per source file.
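
A toy example of the difference, using nothing but stock Python:

	def gen():
	    yield from (1, 2, 3)

	g = gen()
	print(list(g))     # [1, 2, 3]
	print(list(g))     # [] - a generator is exhausted after one pass

	toks = list(gen())
	print(list(toks))  # [1, 2, 3]
	print(list(toks))  # [1, 2, 3] - a list can be walked any number of times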



> 
> > +
> > +    def __str__(self):
> > +        out="
> > +        show_stack = [True]
> > +
> > +        for tok in self.tokens:
> > +            if tok.kind == CToken.BEGIN:
> > +                show_stack.append(show_stack[-1])
> > +
> > +            elif tok.kind == CToken.END:
> > +                prev = show_stack[-1]
> > +                if len(show_stack) > 1:
> > +                    show_stack.pop()
> > +
> > +                if not prev and show_stack[-1]:
> > +                    #
> > +                    # Try to preserve indent
> > +                    #
> > +                    out += "\t" * (len(show_stack) - 1)
> > +
> > +                    out += str(tok.value)
> > +                    continue
> > +
> > +            elif tok.kind == CToken.COMMENT:
> > +                comment = RE_COMMENT_START.sub("", tok.value)
> > +
> > +                if comment.startswith("private:"):
> > +                    show_stack[-1] = False
> > +                    show = False
> > +                elif comment.startswith("public:"):
> > +                    show_stack[-1] = True
> > +
> > +                continue
> > +
> > +            if show_stack[-1]:
> > +                out += str(tok.value)
> > +
> > +        return out
> > +
> > +
> >  #: Nested delimited pairs (brackets and parenthesis)
> >  DELIMITER_PAIRS = {
> >      '{': '}',  
> 
> Thanks,
> 
> jon
> 



Thanks,
Mauro
