Re: [PATCH v2 05/28] docs: kdoc_re: add a C tokenizer

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
To: Randy Dunlap <rdunlap@infradead.org>
Cc: Jonathan Corbet <corbet@lwn.net>,
	Linux Doc Mailing List <linux-doc@vger.kernel.org>,
	linux-hardening@vger.kernel.org, linux-kernel@vger.kernel.org,
	Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Subject: Re: [PATCH v2 05/28] docs: kdoc_re: add a C tokenizer
Date: Tue, 17 Mar 2026 08:03:32 +0100	[thread overview]
Message-ID: <20260317080332.355d451c@foz.lan> (raw)
In-Reply-To: <c53a3638-7a72-472c-81e8-86a6c235b598@infradead.org>

On Mon, 16 Mar 2026 16:29:37 -0700
Randy Dunlap <rdunlap@infradead.org> wrote:

> Uh, I find this review confusing.
> Do your (Jon) comments refer to the code above them?
> (more below)

I was about to comment the same thing: it sounds that b4 review did a
big mess with your comments, as it is very hard to identify what part
of the code you're referring to.

I'll reply to your comments on a separate e-mail - at least the ones I
understand.

> 
> 
> On 3/16/26 4:03 PM, Jonathan Corbet wrote:
> > On Thu, 12 Mar 2026 15:54:25 +0100, Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:  
> >> Handling C code purely using regular expressions doesn't work well.
> >>
> >> Add a C tokenizer to help doing it the right way.
> >>
> >> The tokenizer was written using as basis the Python re documentation
> >> tokenizer example from:
> >> 	https://docs.python.org/3/library/re.html#writing-a-tokenizer
> >>
> >> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> >> Message-ID: <c63ad36c81fe043e9e33ca55630414893f127413.1773074166.git.mchehab+huawei@kernel.org>
> >> Message-ID: <8541ffa469647db1a7154f274fb2d55b4c127dcb.1773326442.git.mchehab+huawei@kernel.org>  
> > 
> > This is a combined effort to review this patch and to try out "b4 review",
> > we'll see how it goes :).
> >   
> >> diff --git a/tools/lib/python/kdoc/kdoc_re.py b/tools/lib/python/kdoc/kdoc_re.py
> >> index 085b89a4547c0..7bed4e9a88108 100644
> >> --- a/tools/lib/python/kdoc/kdoc_re.py
> >> +++ b/tools/lib/python/kdoc/kdoc_re.py
> >> @@ -141,6 +141,240 @@ class KernRe:
> >> [ ... skip 4 lines ... ]
> >> +
> >> +    @staticmethod
> >> +    def __str__(val):
> >> +        """Return the name of an enum value"""
> >> +        return TokType._name_by_val.get(val, f"UNKNOWN({val})")
> >> +  
> > 
> > What is this class supposed to do?
> >   
> >> [ ... skip 27 lines ... ]
> >> +    _name_by_val = {v: k for k, v in dict(vars()).items() if isinstance(v, int)}
> >> +
> >> +    # Dict to convert from string to an enum-like integer value.
> >> +    _name_to_val = {k: v for v, k in _name_by_val.items()}
> >> +
> >> +    @staticmethod  
> > 
> > This stuff strikes me as a bit overdone; _name_to_val is really just the
> > variable list for the class, right?
> >   
> >> [ ... skip 30 lines ... ]
> >> +               f"{self.brace_level}, {self.paren_level}, {self.bracket_level})"
> >> +
> >> +#: Tokens to parse C code.
> >> +TOKEN_LIST = [
> >> +    (CToken.COMMENT, r"//[^\n]*|/\*[\s\S]*?\*/"),
> >> +  
> > 
> > So these aren't "tokens", this is a list of regexes; how is it intended
> > to be used?
> >   
> >> +    (CToken.STRING,  r'"(?:\\.|[^"\\])*"'),
> >> +    (CToken.CHAR,    r"'(?:\\.|[^'\\])'"),
> >> +
> >> +    (CToken.NUMBER,  r"0[xX][0-9a-fA-F]+[uUlL]*|0[0-7]+[uUlL]*|"  
> > 
> > How does "[\s\S]*" differ from plain old "*" ?
> >   
> >> [ ... skip 15 lines ... ]
> >> +    (CToken.STRUCT,  r"\bstruct\b"),
> >> +    (CToken.UNION,   r"\bunion\b"),
> >> +    (CToken.ENUM,    r"\benum\b"),
> >> +    (CToken.TYPEDEF, r"\bkinddef\b"),
> >> +
> >> +    (CToken.NAME,      r"[A-Za-z_][A-Za-z0-9_]*"),  
> > 
> > "-" and "!" never need to be escaped.
> >   
> >> +
> >> +    (CToken.SPACE,   r"[\s]+"),
> >> +
> >> +    (CToken.MISMATCH,r"."),
> >> +]
> >> +  
> > 
> > "kinddef" ?  
> 
> What does that refer to?
> 
> >   
> >> +#: Handle C continuation lines.
> >> +RE_CONT = KernRe(r"\\\n")
> >> +
> >> +RE_COMMENT_START = KernRe(r'/\*\s*')
> >> +  
> > 
> > Don't need the [brackets] here  
> 
> what brackets?
> 
> >   
> >> [ ... skip 6 lines ... ]
> >> +
> >> +    When converted to string, it drops comments and handle public/private
> >> +    values, respecting depth.
> >> +    """
> >> +
> >> +    # This class is inspired and follows the basic concepts of:  
> > 
> > That seems weird, why don't you just initialize it here?  
> 
> I can't tell what that comments refers to.
> 
> >> [ ... skip 14 lines ... ]
> >> +        source = RE_CONT.sub("", source)
> >> +
> >> +        brace_level = 0
> >> +        paren_level = 0
> >> +        bracket_level = 0
> >> +  
> > 
> > Do you mean "iterator" here?  
> 
> Ditto.
> 
> >> [ ... skip 33 lines ... ]
> >> +        in this particular case, it makes sense, as we can pick the name
> >> +        when matching a code via re_scanner().
> >> +        """
> >> +        global re_scanner
> >> +
> >> +        if not re_scanner:  
> > 
> > Putting __init__() first is fairly standard, methinks.
> >   
> >> [ ... skip 15 lines ... ]
> >> +
> >> +        for tok in self.tokens:
> >> +            if tok.kind == CToken.BEGIN:
> >> +                show_stack.append(show_stack[-1])
> >> +
> >> +            elif tok.kind == CToken.END:  
> > 
> > I still don't understand why you do this here - this is all constant, right?
> >   
> >> +                prev = show_stack[-1]
> >> +                if len(show_stack) > 1:
> >> +                    show_stack.pop()
> >> +
> >> +                if not prev and show_stack[-1]:  
> > 
> > So you create a nice iterator structure, then just put it all together into a
> > list anyway?
> >   
> 



Thanks,
Mauro

next prev parent reply	other threads:[~2026-03-17  7:03 UTC|newest]

Thread overview: 47+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-03-12 14:54 [PATCH v2 00/28] kernel-doc: use a C lexical tokenizer for transforms Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 01/28] docs: python: add helpers to run unit tests Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 02/28] unittests: add a testbench to check public/private kdoc comments Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 03/28] docs: kdoc: don't add broken comments inside prototypes Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 04/28] docs: kdoc: properly handle empty enum arguments Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 05/28] docs: kdoc_re: add a C tokenizer Mauro Carvalho Chehab
2026-03-16 23:01   ` Jonathan Corbet
2026-03-17  7:59     ` Mauro Carvalho Chehab
2026-03-16 23:03   ` Jonathan Corbet
2026-03-16 23:29     ` Randy Dunlap
2026-03-16 23:40       ` Jonathan Corbet
2026-03-17  8:21         ` Mauro Carvalho Chehab
2026-03-17 17:04           ` Jonathan Corbet
2026-03-17  7:03       ` Mauro Carvalho Chehab [this message]
2026-03-12 14:54 ` [PATCH v2 06/28] docs: kdoc: use tokenizer to handle comments on structs Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 07/28] docs: kdoc: move C Tokenizer to c_lex module Mauro Carvalho Chehab
2026-03-16 23:30   ` Jonathan Corbet
2026-03-17  8:02     ` Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 08/28] unittests: test_private: modify it to use CTokenizer directly Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 09/28] unittests: test_tokenizer: check if the tokenizer works Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 10/28] unittests: add a runner to execute all unittests Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 11/28] docs: kdoc: create a CMatch to match nested C blocks Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 12/28] tools: unittests: add tests for CMatch Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 13/28] docs: c_lex: properly implement a sub() method " Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 14/28] unittests: test_cmatch: add tests for sub() Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 15/28] docs: kdoc: replace NestedMatch with CMatch Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 16/28] docs: kdoc_re: get rid of NestedMatch class Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 17/28] docs: xforms_lists: handle struct_group directly Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 18/28] docs: xforms_lists: better evaluate struct_group macros Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 19/28] docs: c_lex: add support to work with pure name ids Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 20/28] docs: xforms_lists: use CMatch for all identifiers Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 21/28] docs: c_lex: add "@" operator Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 22/28] docs: c_lex: don't exclude an extra token Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 23/28] docs: c_lex: setup a logger to report tokenizer issues Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 24/28] docs: unittests: add and adjust tests to check for errors Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 25/28] docs: c_lex: better handle BEGIN/END at search Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 26/28] docs: kernel-doc.rst: document private: scope propagation Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 27/28] docs: c_lex: produce a cleaner str() representation Mauro Carvalho Chehab
2026-03-12 14:54 ` [PATCH v2 28/28] unittests: test_cmatch: remove weird stuff from expected results Mauro Carvalho Chehab
2026-03-13  8:34 ` [PATCH v2 29/28] docs: kdoc: ensure that comments are dropped before calling split_struct_proto() Mauro Carvalho Chehab
2026-03-13  8:34   ` [PATCH v2 30/28] docs: kdoc_parser: avoid tokenizing structs everytime Mauro Carvalho Chehab
2026-03-13 11:05     ` Loktionov, Aleksandr
2026-03-13 11:05   ` [PATCH v2 29/28] docs: kdoc: ensure that comments are dropped before calling split_struct_proto() Loktionov, Aleksandr
2026-03-13  9:17 ` [PATCH v2 00/28] kernel-doc: use a C lexical tokenizer for transforms Mauro Carvalho Chehab
2026-03-17 17:12 ` Jonathan Corbet
2026-03-17 18:00   ` Mauro Carvalho Chehab
2026-03-17 18:57   ` Mauro Carvalho Chehab

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260317080332.355d451c@foz.lan \
    --to=mchehab+huawei@kernel.org \
    --cc=aleksandr.loktionov@intel.com \
    --cc=corbet@lwn.net \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-hardening@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=rdunlap@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.