From: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
To: Jonathan Corbet <corbet@lwn.net>
Cc: Linux Doc Mailing List <linux-doc@vger.kernel.org>,
linux-hardening@vger.kernel.org, linux-kernel@vger.kernel.org,
Aleksandr Loktionov <aleksandr.loktionov@intel.com>,
Randy Dunlap <rdunlap@infradead.org>
Subject: Re: [PATCH v2 05/28] docs: kdoc_re: add a C tokenizer
Date: Tue, 17 Mar 2026 08:59:33 +0100 [thread overview]
Message-ID: <20260317085933.68440366@foz.lan> (raw)
In-Reply-To: <177370207134.1753752.17403055172165325174.b4-review@b4>
On Mon, 16 Mar 2026 17:01:11 -0600
Jonathan Corbet <corbet@lwn.net> wrote:
> On Thu, 12 Mar 2026 15:54:25 +0100, Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> > Handling C code purely using regular expressions doesn't work well.
> >
> > Add a C tokenizer to help doing it the right way.
> >
> > The tokenizer was written using as basis the Python re documentation
> > tokenizer example from:
> > https://docs.python.org/3/library/re.html#writing-a-tokenizer
> >
> > Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> > Message-ID: <c63ad36c81fe043e9e33ca55630414893f127413.1773074166.git.mchehab+huawei@kernel.org>
> > Message-ID: <8541ffa469647db1a7154f274fb2d55b4c127dcb.1773326442.git.mchehab+huawei@kernel.org>
>
> This is a combined effort to review this patch and to try out "b4 review",
> we'll see how it goes :).
>
> > diff --git a/tools/lib/python/kdoc/kdoc_re.py b/tools/lib/python/kdoc/kdoc_re.py
> > index 085b89a4547c0..7bed4e9a88108 100644
> > --- a/tools/lib/python/kdoc/kdoc_re.py
> > +++ b/tools/lib/python/kdoc/kdoc_re.py
> > @@ -141,6 +141,240 @@ class KernRe:
> > [ ... skip 4 lines ... ]
> > +
> > + @staticmethod
> > + def __str__(val):
> > + """Return the name of an enum value"""
> > + return TokType._name_by_val.get(val, f"UNKNOWN({val})")
> > +
>
> What is this class supposed to do?
This __str__() method ensures that, when printing a CToken object,
the token kind will be displayed as a name instead of a number.
This is really useful when debugging.
See, if I add a print:
<snip>
--- a/tools/lib/python/kdoc/kdoc_parser.py
+++ b/tools/lib/python/kdoc/kdoc_parser.py
@@ -87,6 +87,7 @@ def trim_private_members(text):
"""
tokens = CTokenizer(text)
+ print(tokens.tokens)
return str(tokens)
</snip>
the token kinds will appear as names in the output:
$ ./scripts/kernel-doc -none er.c
[CToken(CToken.ENUM, "enum", 0, (0, 0, 0)), CToken(CToken.SPACE, " ", 4, (0, 0, 0)), CToken(CToken.NAME, "dmub_abm_ace_curve_type", 5, (0, 0, 0)), CToken(CToken.SPACE, " ", 28, (0, 0, 0)), CToken(CToken.BEGIN, "{", 29, (0, 0, 1)), CToken(CToken.SPACE, " ", 30, (0, 0, 1)), CToken(CToken.COMMENT, "/**
* ACE curve as defined by the SW layer. */", 31, (0, 0, 1)), CToken(CToken.SPACE, " ", 86, (0, 0, 1)), CToken(CToken.NAME, "ABM_ACE_CURVE_TYPE__SW", 87, (0, 0, 1)), CToken(CToken.SPACE, " ", 109, (0, 0, 1)), CToken(CToken.OP, "=", 110, (0, 0, 1)), CToken(CToken.SPACE, " ", 111, (0, 0, 1)), CToken(CToken.NUMBER, "0", 112, (0, 0, 1)), CToken(CToken.PUNC, ",", 113, (0, 0, 1)), CToken(CToken.SPACE, " ", 114, (0, 0, 1)), CToken(CToken.COMMENT, "/**
* ACE curve as defined by the SW to HW translation interface layer. */", 115, (0, 0, 1)), CToken(CToken.SPACE, " ", 198, (0, 0, 1)), CToken(CToken.NAME, "ABM_ACE_CURVE_TYPE__SW_IF", 199, (0, 0, 1)), CToken(CToken.SPACE, " ", 224, (0, 0, 1)), CToken(CToken.OP, "=", 225, (0, 0, 1)), CToken(CToken.SPACE, " ", 226, (0, 0, 1)), CToken(CToken.NUMBER, "1", 227, (0, 0, 1)), CToken(CToken.PUNC, ",", 228, (0, 0, 1)), CToken(CToken.SPACE, " ", 229, (0, 0, 1)), CToken(CToken.END, "}", 230, (0, 0, 0)), CToken(CToken.PUNC, ";", 231, (0, 0, 0))]
>
> > [ ... skip 27 lines ... ]
> > + _name_by_val = {v: k for k, v in dict(vars()).items() if isinstance(v, int)}
> > +
> > + # Dict to convert from string to an enum-like integer value.
> > + _name_to_val = {k: v for v, k in _name_by_val.items()}
> > +
> > + @staticmethod
>
> This stuff strikes me as a bit overdone; _name_to_val is really just the
> variable list for the class, right?
Those two vars are a kind of magic: they create two dictionaries:
- _name_by_val converts a token integer into a string;
- _name_to_val converts a string into an integer.
I opted for this approach for a couple of reasons:
1. using tok.kind == "BEGIN" (and similar) everywhere is harder to
maintain, as Python won't check strings for typos. With attributes,
writing CToken.BEGHIN raises an error;
2. the cost of converting from string to int is O(1), so the
conversion is not much of a performance issue;
3. using an integer in all checks should make the code faster, as
it doesn't require looping over the string's characters to compare.
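A minimal sketch of that pattern (the class and value names here are illustrative, not the actual patch code; the patch wires the value-to-name lookup through a __str__ helper):

```python
class TokType:
    """Enum-like token kinds: int values with O(1) name lookup."""
    COMMENT = 0
    STRING = 1
    NAME = 2

    # Map each int value back to its name, e.g. 2 -> "NAME".
    # vars() inside a class body returns the namespace built so far.
    _name_by_val = {v: k for k, v in dict(vars()).items()
                    if isinstance(v, int)}

    # And the reverse: "NAME" -> 2
    _name_to_val = {k: v for v, k in _name_by_val.items()}

    @staticmethod
    def name(val):
        """Return the name of an enum value"""
        return TokType._name_by_val.get(val, f"UNKNOWN({val})")
```

A typo like TokType.NAEM raises AttributeError immediately, while a typo in a string comparison would fail silently.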
>
> > [ ... skip 30 lines ... ]
> > + f"{self.brace_level}, {self.paren_level}, {self.bracket_level})"
> > +
> > +#: Tokens to parse C code.
> > +TOKEN_LIST = [
> > + (CToken.COMMENT, r"//[^\n]*|/\*[\s\S]*?\*/"),
> > +
>
> So these aren't "tokens", this is a list of regexes; how is it intended
> to be used?
Right. I could have named it better, like RE_TOKEN_LIST or
TOKEN_REGEX, as in the original example, which came from the Python
documentation for the "re" module:
https://docs.python.org/3/library/re.html#writing-a-tokenizer
Basically, we have the token kind as the first element of each tuple,
and its regex as the second one.
When a regex matches, the CToken will be filled with kind=tuple[0].
The regexes are joined into a single pattern (RE_SCANNER), and the loop:
for match in re.finditer(RE_SCANNER, code):
will parse the entire C source, in order, converting it into
a token list. So, a file like this:
/**
* enum dmub_abm_ace_curve_type - ACE curve type.
*/
enum dmub_abm_ace_curve_type {
/**
* ACE curve as defined by the SW layer.
*/
ABM_ACE_CURVE_TYPE__SW = 0,
/**
* ACE curve as defined by the SW to HW translation interface layer.
*/
ABM_ACE_CURVE_TYPE__SW_IF = 1,
};
will become (I used pprint here to better align the tokens):
[CToken(CToken.ENUM, "enum", 0, (0, 0, 0)),
CToken(CToken.SPACE, " ", 4, (0, 0, 0)),
CToken(CToken.NAME, "dmub_abm_ace_curve_type", 5, (0, 0, 0)),
CToken(CToken.SPACE, " ", 28, (0, 0, 0)),
CToken(CToken.BEGIN, "{", 29, (0, 0, 1)),
CToken(CToken.SPACE, " ", 30, (0, 0, 1)),
CToken(CToken.COMMENT, "/**
* ACE curve as defined by the SW layer. */", 31, (0, 0, 1)),
CToken(CToken.SPACE, " ", 86, (0, 0, 1)),
CToken(CToken.NAME, "ABM_ACE_CURVE_TYPE__SW", 87, (0, 0, 1)),
CToken(CToken.SPACE, " ", 109, (0, 0, 1)),
CToken(CToken.OP, "=", 110, (0, 0, 1)),
CToken(CToken.SPACE, " ", 111, (0, 0, 1)),
CToken(CToken.NUMBER, "0", 112, (0, 0, 1)),
CToken(CToken.PUNC, ",", 113, (0, 0, 1)),
CToken(CToken.SPACE, " ", 114, (0, 0, 1)),
CToken(CToken.COMMENT, "/**
* ACE curve as defined by the SW to HW translation interface layer. */", 115, (0, 0, 1)),
CToken(CToken.SPACE, " ", 198, (0, 0, 1)),
CToken(CToken.NAME, "ABM_ACE_CURVE_TYPE__SW_IF", 199, (0, 0, 1)),
CToken(CToken.SPACE, " ", 224, (0, 0, 1)),
CToken(CToken.OP, "=", 225, (0, 0, 1)),
CToken(CToken.SPACE, " ", 226, (0, 0, 1)),
CToken(CToken.NUMBER, "1", 227, (0, 0, 1)),
CToken(CToken.PUNC, ",", 228, (0, 0, 1)),
CToken(CToken.SPACE, " ", 229, (0, 0, 1)),
CToken(CToken.END, "}", 230, (0, 0, 0)),
CToken(CToken.PUNC, ";", 231, (0, 0, 0))]
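To make the mechanics concrete, here is a minimal, self-contained version of that loop, following the Python docs tokenizer example (the kinds and regexes are a simplified subset, not the actual patch code):

```python
import re

# Token kinds paired with their regexes (simplified subset; the real
# TOKEN_LIST has many more entries and order matters: earlier entries win)
TOKEN_LIST = [
    ("COMMENT",  r"//[^\n]*|/\*[\s\S]*?\*/"),
    ("NAME",     r"[A-Za-z_][A-Za-z0-9_]*"),
    ("NUMBER",   r"\d+"),
    ("OP",       r"[={},;]"),
    ("SPACE",    r"\s+"),
    ("MISMATCH", r"."),
]

# Join the regexes into one pattern with named groups, so that
# match.lastgroup tells which kind matched
tok_regex = "|".join(f"(?P<{kind}>{regex})" for kind, regex in TOKEN_LIST)

def tokenize(code):
    """Yield (kind, text, offset) tuples for the whole source, in order."""
    for match in re.finditer(tok_regex, code):
        yield match.lastgroup, match.group(), match.start()

# tokenize("x = 1;") yields ("NAME", "x", 0), ("SPACE", " ", 1), ...
```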
>
> > + (CToken.STRING, r'"(?:\\.|[^"\\])*"'),
> > + (CToken.CHAR, r"'(?:\\.|[^'\\])'"),
> > +
> > + (CToken.NUMBER, r"0[xX][0-9a-fA-F]+[uUlL]*|0[0-7]+[uUlL]*|"
>
> How does "[\s\S]*" differ from plain old "*" ?
They are not identical: "." doesn't match "\n" unless re.DOTALL is
set. As the tokenizer also picks up "\n" in several cases, like in
comments, r"[\s\S]" works better.
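A quick demonstration of the difference:

```python
import re

comment = "/* line one\n   line two */"

# "." stops at the newline, so the pattern can't reach the closing "*/"
assert re.match(r"/\*.*?\*/", comment) is None

# ...while [\s\S] matches any character, newlines included
m = re.match(r"/\*[\s\S]*?\*/", comment)
assert m.group() == comment

# re.DOTALL would make "." behave the same way, but a character class
# keeps the behavior local to one token regex instead of the whole scanner
assert re.match(r"/\*.*?\*/", comment, re.DOTALL).group() == comment
```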
> > [ ... skip 15 lines ... ]
> > + (CToken.STRUCT, r"\bstruct\b"),
> > + (CToken.UNION, r"\bunion\b"),
> > + (CToken.ENUM, r"\benum\b"),
> > + (CToken.TYPEDEF, r"\bkinddef\b"),
> > +
> > + (CToken.NAME, r"[A-Za-z_][A-Za-z0-9_]*"),
>
> "-" and "!" never need to be escaped.
"-" usually needs to be escaped, because it can be a range. I had
some troubles with the parser due to the lack of escapes, so I
ended being conservative.
( I'm finding a little bit hard to follow your comments...
Here, for instance, "!" is not at CToken.NAME regex.
Did b4 review place your comment at the wrong place? )
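The range pitfall is easy to trigger, which is why escaping is the safe default:

```python
import re

# Inside a character class, an unescaped "-" between two characters is a
# range: [A-z] spans ASCII 65..122, which also includes [ \ ] ^ _ `
assert re.match(r"[A-z]", "^") is not None

# Escaped, it is a literal dash, so "^" no longer matches
assert re.match(r"[A\-z]", "^") is None
assert re.match(r"[A\-z]", "-") is not None

# At the start or end of the class "-" is literal anyway, but escaping
# keeps the intent obvious when the class is later edited
assert re.match(r"[-Az]", "-") is not None
```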
>
> > +
> > + (CToken.SPACE, r"[\s]+"),
> > +
> > + (CToken.MISMATCH,r"."),
> > +]
> > +
>
> "kinddef" ?
Should be "typedef".
This was due to a "sed s,type,kind," I applied to avoid using
"type" for the token type, as, when I started integrating it
with kdoc_re, it became confusing.
I'll fix at the next respin.
>
> > +#: Handle C continuation lines.
> > +RE_CONT = KernRe(r"\\\n")
> > +
> > +RE_COMMENT_START = KernRe(r'/\*\s*')
> > +
>
> Don't need the [brackets] here
where?
>
> > [ ... skip 6 lines ... ]
> > +
> > + When converted to string, it drops comments and handle public/private
> > + values, respecting depth.
> > + """
> > +
> > + # This class is inspired and follows the basic concepts of:
>
> That seems weird, why don't you just initialize it here?
Hard to tell what you're referring to. Maybe this:
RE_SCANNER = fill_re_scanner(TOKEN_LIST)
The rationale is that I don't want to re-create it on every
instantiation, as it is constant.
>
> > [ ... skip 14 lines ... ]
> > + source = RE_CONT.sub("", source)
> > +
> > + brace_level = 0
> > + paren_level = 0
> > + bracket_level = 0
> > +
>
> Do you mean "iterator" here?
If you mean the typo in the _tokenize() help text, yes:
interactor -> iterator
>
> > [ ... skip 33 lines ... ]
> > + in this particular case, it makes sense, as we can pick the name
> > + when matching a code via re_scanner().
> > + """
> > + global re_scanner
> > +
> > + if not re_scanner:
>
> Putting __init__() first is fairly standard, methinks.
The CTokenizer __init__() method calls the _tokenize() method.
My personal preference is to have the caller methods before the
methods they call, even inside a class, where the order doesn't
matter - or even in C, when we have an include with all prototypes.
But if you prefer, I can reorder it.
> > [ ... skip 15 lines ... ]
> > +
> > + for tok in self.tokens:
> > + if tok.kind == CToken.BEGIN:
> > + show_stack.append(show_stack[-1])
> > +
> > + elif tok.kind == CToken.END:
>
> I still don't understand why you do this here - this is all constant, right?
This one I couldn't map to the part of the code you're referring to.
All constants in this code use upper-case names. They are:
- TOKEN_LIST (which should probably be renamed TOKEN_REGEX_LIST);
- the CToken enum-like names (BEGIN, END, OP, NAME, ...);
- three regexes (RE_CONT, RE_COMMENT_START, RE_SCANNER).
See, what the tokenizer does is a linear transformation from a C
source string into a token list.
So, for each instance, its content will change. Also, when we apply
the CMatch logic, its content will change again.
>
> > + prev = show_stack[-1]
> > + if len(show_stack) > 1:
> > + show_stack.pop()
> > +
> > + if not prev and show_stack[-1]:
>
> So you create a nice iterator structure, then just put it all together into a
> list anyway?
Not sure what you meant here.
The end result of the tokenizer is a list of tokens, in the order
they appear in the source code.
To handle public/private and do code transforms, using it as a
list is completely fine.
Now, if we want to use the tokenizer to parse things like:
Now, if we want to use the tokenizer to parse things like:
typedef struct ca_descr_info {
unsigned int num;
unsigned int type;
} ca_descr_info_t;
then having iterators that walk the tokens in both directions
would be great, as the typedef identifier comes at the end.
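For reference, the public/private filtering over such a token list can be sketched like this (illustrative names and tuple shape, not the patch code):

```python
def filter_private(tokens):
    """Keep only tokens in "public" scope. Each "{" (BEGIN) pushes the
    visibility inherited from the enclosing scope; each "}" (END) pops
    it, so a private: marker only affects its own brace level."""
    show_stack = [True]          # the outermost scope is public
    out = []
    for kind, text in tokens:
        if kind == "BEGIN":
            show_stack.append(show_stack[-1])
        elif kind == "END":
            if len(show_stack) > 1:
                show_stack.pop()
        elif kind == "COMMENT" and "private:" in text:
            show_stack[-1] = False
            continue                  # drop the marker comment itself
        elif kind == "COMMENT" and "public:" in text:
            show_stack[-1] = True
            continue
        if show_stack[-1]:
            out.append((kind, text))
    return out
```

The linear pass over the list is all that's needed here; only the bidirectional typedef case above would benefit from a real iterator.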
Thanks,
Mauro
Thread overview: 47+ messages
2026-03-12 14:54 [PATCH v2 00/28] kernel-doc: use a C lexical tokenizer for transforms Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 01/28] docs: python: add helpers to run unit tests Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 02/28] unittests: add a testbench to check public/private kdoc comments Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 03/28] docs: kdoc: don't add broken comments inside prototypes Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 04/28] docs: kdoc: properly handle empty enum arguments Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 05/28] docs: kdoc_re: add a C tokenizer Mauro Carvalho Chehab
2026-03-16 23:01 ` Jonathan Corbet
2026-03-17 7:59 ` Mauro Carvalho Chehab [this message]
2026-03-16 23:03 ` Jonathan Corbet
2026-03-16 23:29 ` Randy Dunlap
2026-03-16 23:40 ` Jonathan Corbet
2026-03-17 8:21 ` Mauro Carvalho Chehab
2026-03-17 17:04 ` Jonathan Corbet
2026-03-17 7:03 ` Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 06/28] docs: kdoc: use tokenizer to handle comments on structs Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 07/28] docs: kdoc: move C Tokenizer to c_lex module Mauro Carvalho Chehab
2026-03-16 23:30 ` Jonathan Corbet
2026-03-17 8:02 ` Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 08/28] unittests: test_private: modify it to use CTokenizer directly Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 09/28] unittests: test_tokenizer: check if the tokenizer works Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 10/28] unittests: add a runner to execute all unittests Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 11/28] docs: kdoc: create a CMatch to match nested C blocks Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 12/28] tools: unittests: add tests for CMatch Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 13/28] docs: c_lex: properly implement a sub() method " Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 14/28] unittests: test_cmatch: add tests for sub() Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 15/28] docs: kdoc: replace NestedMatch with CMatch Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 16/28] docs: kdoc_re: get rid of NestedMatch class Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 17/28] docs: xforms_lists: handle struct_group directly Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 18/28] docs: xforms_lists: better evaluate struct_group macros Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 19/28] docs: c_lex: add support to work with pure name ids Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 20/28] docs: xforms_lists: use CMatch for all identifiers Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 21/28] docs: c_lex: add "@" operator Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 22/28] docs: c_lex: don't exclude an extra token Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 23/28] docs: c_lex: setup a logger to report tokenizer issues Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 24/28] docs: unittests: add and adjust tests to check for errors Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 25/28] docs: c_lex: better handle BEGIN/END at search Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 26/28] docs: kernel-doc.rst: document private: scope propagation Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 27/28] docs: c_lex: produce a cleaner str() representation Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 28/28] unittests: test_cmatch: remove weird stuff from expected results Mauro Carvalho Chehab
2026-03-13 8:34 ` [PATCH v2 29/28] docs: kdoc: ensure that comments are dropped before calling split_struct_proto() Mauro Carvalho Chehab
2026-03-13 8:34 ` [PATCH v2 30/28] docs: kdoc_parser: avoid tokenizing structs everytime Mauro Carvalho Chehab
2026-03-13 11:05 ` Loktionov, Aleksandr
2026-03-13 11:05 ` [PATCH v2 29/28] docs: kdoc: ensure that comments are dropped before calling split_struct_proto() Loktionov, Aleksandr
2026-03-13 9:17 ` [PATCH v2 00/28] kernel-doc: use a C lexical tokenizer for transforms Mauro Carvalho Chehab
2026-03-17 17:12 ` Jonathan Corbet
2026-03-17 18:00 ` Mauro Carvalho Chehab
2026-03-17 18:57 ` Mauro Carvalho Chehab