public inbox for linux-kernel@vger.kernel.org
From: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
To: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Alexander Lobakin <aleksander.lobakin@intel.com>,
	Jonathan Corbet <corbet@lwn.net>, Kees Cook <kees@kernel.org>,
	Mauro Carvalho Chehab <mchehab@kernel.org>,
	intel-wired-lan@lists.osuosl.org, linux-doc@vger.kernel.org,
	linux-hardening@vger.kernel.org, linux-kernel@vger.kernel.org,
	netdev@vger.kernel.org,
	"Gustavo A. R. Silva" <gustavoars@kernel.org>,
	Aleksandr Loktionov <aleksandr.loktionov@intel.com>,
	Randy Dunlap <rdunlap@infradead.org>,
	Shuah Khan <skhan@linuxfoundation.org>
Subject: Re: [PATCH 00/38] docs: several improvements to kernel-doc
Date: Tue, 3 Mar 2026 15:53:10 +0100
Message-ID: <20260303155310.5235b367@localhost>
In-Reply-To: <33d214091909b9a060637f56f81fb8f525cf433b@intel.com>

On Mon, 23 Feb 2026 15:47:00 +0200
Jani Nikula <jani.nikula@linux.intel.com> wrote:

> On Wed, 18 Feb 2026, Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> > As anyone that worked before with kernel-doc are aware, using regex to
> > handle C input is not great. Instead, we need something closer to how
> > C statements and declarations are handled.
> >
> > Yet, to avoid breaking  docs, I avoided touching the regex-based algorithms
> > inside it with one exception: struct_group logic was using very complex
> > regexes that are incompatible with Python internal "re" module.
> >
> > So, I came up with a different approach: NestedMatch. The logic inside
> > it is meant to properly handle brackets, square brackets and parenthesis,
> > which is closer to what C lexical parser does. On that time, I added
> > a TODO about the need to extend that.  
> 
> There's always the question, if you're putting a lot of effort into
> making kernel-doc closer to an actual C parser, why not put all that
> effort into using and adapting to, you know, an actual C parser?

Playing with this idea: it is not that hard to write an actual C
parser, or at least a tokenizer. The Python re documentation already
has an example of one, in its "Writing a Tokenizer" section:

	https://docs.python.org/3/library/re.html

I did a quick implementation, and it seems to be able to do its job:

    $ ./tokenizer.py ./include/net/netlink.h
      1:  0  COMMENT       '/* SPDX-License-Identifier: GPL-2.0 */'
      2:  0  CPP           '#ifndef'
      2:  8  ID            '__NET_NETLINK_H'
      3:  0  CPP           '#define'
      3:  8  ID            '__NET_NETLINK_H'
      5:  0  CPP           '#include'
      5:  9  OP            '<'
      5: 10  ID            'linux'
      5: 15  OP            '/'
      5: 16  ID            'types'
      5: 21  PUNC          '.'
      5: 22  ID            'h'
      5: 23  OP            '>'
      6:  0  CPP           '#include'
      6:  9  OP            '<'
      6: 10  ID            'linux'
      6: 15  OP            '/'
      6: 16  ID            'netlink'
      6: 23  PUNC          '.'
      6: 24  ID            'h'
      6: 25  OP            '>'
      7:  0  CPP           '#include'
      7:  9  OP            '<'
      7: 10  ID            'linux'
      7: 15  OP            '/'
      7: 16  ID            'jiffies'
      7: 23  PUNC          '.'
      7: 24  ID            'h'
      7: 25  OP            '>'
      8:  0  CPP           '#include'
      8:  9  OP            '<'
      8: 10  ID            'linux'
      8: 15  OP            '/'
      8: 16  ID            'in6'
...
     12:  1  COMMENT       '/**\n  * Standard attribute types to specify validation policy\n  */'
     13:  0  ENUM          'enum'
     13:  5  PUNC          '{'
     14:  1  ID            'NLA_UNSPEC'
     14: 11  PUNC          ','
     15:  1  ID            'NLA_U8'
     15:  7  PUNC          ','
     16:  1  ID            'NLA_U16'
     16:  8  PUNC          ','
     17:  1  ID            'NLA_U32'
     17:  8  PUNC          ','
     18:  1  ID            'NLA_U64'
     18:  8  PUNC          ','
     19:  1  ID            'NLA_STRING'
     19: 11  PUNC          ','
     20:  1  ID            'NLA_FLAG'
...
     41:  0  STRUCT        'struct'
     41:  7  ID            'netlink_range_validation'
     41: 32  PUNC          '{'
     42:  1  ID            'u64'
     42:  5  ID            'min'
     42:  8  PUNC          ','
     42: 10  ID            'max'
     42: 13  PUNC          ';'
     43:  0  PUNC          '}'
     43:  1  PUNC          ';'
     45:  0  STRUCT        'struct'
     45:  7  ID            'netlink_range_validation_signed'
     45: 39  PUNC          '{'
     46:  1  ID            's64'
     46:  5  ID            'min'
     46:  8  PUNC          ','
     46: 10  ID            'max'
     46: 13  PUNC          ';'
     47:  0  PUNC          '}'
     47:  1  PUNC          ';'
     49:  0  ENUM          'enum'
     49:  5  ID            'nla_policy_validation'
     49: 27  PUNC          '{'
     50:  1  ID            'NLA_VALIDATE_NONE'
     50: 18  PUNC          ','
     51:  1  ID            'NLA_VALIDATE_RANGE'
     51: 19  PUNC          ','
     52:  1  ID            'NLA_VALIDATE_RANGE_WARN_TOO_LONG'
     52: 33  PUNC          ','
     53:  1  ID            'NLA_VALIDATE_MIN'
     53: 17  PUNC          ','
     54:  1  ID            'NLA_VALIDATE_MAX'
     54: 17  PUNC          ','
     55:  1  ID            'NLA_VALIDATE_MASK'
     55: 18  PUNC          ','
     56:  1  ID            'NLA_VALIDATE_RANGE_PTR'
     56: 23  PUNC          ','
     57:  1  ID            'NLA_VALIDATE_FUNCTION'
     57: 22  PUNC          ','
     58:  0  PUNC          '}'
     58:  1  PUNC          ';'

Using it sounds doable and, at least on this example, it
properly picked the IDs.
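
As an illustration of what picking the IDs buys us: with a token
stream, the name of a struct or enum is simply the ID that follows the
keyword token. A minimal sketch, using a few (kind, value) pairs copied
by hand from the output above (the enum is abbreviated);
declared_names() is a hypothetical helper, not something kernel-doc
has today:

```python
# (kind, value) pairs taken from the tokenizer output above.
# The enum body is abbreviated to two entries for brevity.
TOKENS = [
    ("STRUCT", "struct"), ("ID", "netlink_range_validation"), ("PUNC", "{"),
    ("ID", "u64"), ("ID", "min"), ("PUNC", ","), ("ID", "max"), ("PUNC", ";"),
    ("PUNC", "}"), ("PUNC", ";"),
    ("ENUM", "enum"), ("ID", "nla_policy_validation"), ("PUNC", "{"),
    ("ID", "NLA_VALIDATE_NONE"), ("PUNC", ","),
    ("ID", "NLA_VALIDATE_RANGE"), ("PUNC", ","),
    ("PUNC", "}"), ("PUNC", ";"),
]

KEYWORDS = {"STRUCT", "UNION", "ENUM"}


def declared_names(tokens):
    """Yield (keyword, name) for each named struct/union/enum definition."""
    # Look at every window of three consecutive tokens:
    # keyword, identifier, opening brace.
    for first, second, third in zip(tokens, tokens[1:], tokens[2:]):
        if first[0] in KEYWORDS and second[0] == "ID" and third[1] == "{":
            yield first[1], second[1]


print(list(declared_names(TOKENS)))
# [('struct', 'netlink_range_validation'), ('enum', 'nla_policy_validation')]
```

Anonymous definitions, like the "enum {" at line 13 of the output, have
no ID after the keyword and are simply skipped by this sketch.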

On the other hand, using it would require lots of changes to
kernel-doc. So, I guess I'll add a tokenizer to kernel-doc, but
we should likely start using it gradually.

Maybe starting with NestedMatch and with public/private
comment handling (which is currently half-broken).
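
To show why a token stream would help with the nested matching: once
the input is tokenized, finding the close of a delimiter is just a
matter of tracking depth, and a ")" inside a string or a comment cannot
confuse it, as it never shows up as a PUNC token. A rough sketch,
assuming tokens are plain (kind, value) tuples; find_close() and the
reduced token table are made up for this example and are not existing
kernel-doc code:

```python
import re

# Reduced token table, just enough for this illustration.
TOKENS = [
    ("COMMENT", r"//[^\n]*|/\*[\s\S]*?\*/"),
    ("STRING",  r'"(?:\\.|[^"\\])*"'),
    ("ID",      r"[A-Za-z_][A-Za-z0-9_]*"),
    ("PUNC",    r"[;,.\[\](){}]"),
    ("SKIP",    r"\s+"),
    ("OTHER",   r"."),
]
SCANNER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKENS))

PAIRS = {"(": ")", "[": "]", "{": "}"}


def tokenize(code):
    """Return a list of (kind, value) tuples, dropping whitespace."""
    return [(m.lastgroup, m.group()) for m in SCANNER.finditer(code)
            if m.lastgroup != "SKIP"]


def find_close(tokens, start):
    """Return the index of the token closing the delimiter at tokens[start]."""
    kind, value = tokens[start]
    if kind != "PUNC" or value not in PAIRS:
        raise ValueError(f"token {start} is not an opening delimiter")

    stack = []  # closing delimiters we still expect, innermost last
    for pos in range(start, len(tokens)):
        kind, value = tokens[pos]
        if kind != "PUNC":
            continue  # strings, comments, IDs, ... never affect nesting
        if value in PAIRS:
            stack.append(PAIRS[value])
        elif value in PAIRS.values():
            if not stack or stack.pop() != value:
                raise ValueError(f"unbalanced {value!r} at token {pos}")
            if not stack:
                return pos
    raise ValueError("missing closing delimiter")


code = 'struct_group(hdr, int a; char b[4]; /* ) */ const char *s = ")";) int c;'
toks = tokenize(code)
start = toks.index(("PUNC", "("))
end = find_close(toks, start)
print(toks[start + 1:end])  # the macro arguments, as tokens
```

Here the ")" inside the comment and the one inside the string are both
ignored, and the match ends at the parenthesis right before "int c".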

As a reference, the above was generated with the code below,
which was based on the Python re documentation.

Comments?

---

One side note: right now, we're not using type hints in kernel-doc,
nor really following a proper coding style.

I wanted to use them during the conversion, and to place constants in
uppercase, as this is the current best practice, but doing
it while converting from Perl was very annoying. So, I opted
to make things simpler. Now that we have it coded, perhaps it
is time to define a coding style and apply it to kernel-doc.

-- 
Thanks,
Mauro

#!/usr/bin/env python3

import sys
import re

class Token():
    """A single token: its type, raw text and position in the source."""

    def __init__(self, type, value, line, column):
        self.type = type
        self.value = value
        self.line = line
        self.column = column

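class CTokenizer():
    """Regex-based scanner that turns C source into a stream of Tokens."""

    C_KEYWORDS = {
        "struct", "union", "enum",
    }

    # Alternatives are tried in order and the first match wins, so the
    # more specific patterns must come before the generic ones they
    # overlap with (COMMENT before OP, CPP before HASH, MISMATCH last).
    TOKEN_LIST = [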
        ("COMMENT", r"//[^\n]*|/\*[\s\S]*?\*/"),

        ("STRING",  r'"(?:\\.|[^"\\])*"'),
        ("CHAR",    r"'(?:\\.|[^'\\])'"),

        ("NUMBER",  r"0[xX][0-9a-fA-F]+[uUlL]*|0[0-7]+[uUlL]*|"
                    r"[0-9]+(\.[0-9]*)?([eE][+-]?[0-9]+)?[fFlL]*"),

        ("ID",      r"[A-Za-z_][A-Za-z0-9_]*"),

        ("OP",      r"\+\+|\-\-|\->|==|\!=|<=|>=|&&|\|\||<<|>>|\+=|\-=|\*=|/=|%="
                    r"|&=|\|=|\^=|=|\+|\-|\*|/|%|<|>|&|\||\^|~|!|\?|\:"),

        ("PUNC",    r"[;,\.\[\]\(\)\{\}]"),

        ("CPP",     r"#\s*(define|include|ifdef|ifndef|if|else|elif|endif|undef|pragma)"),

        ("HASH",    r"#"),

        ("NEWLINE", r"\n"),

        ("SKIP",    r"[\s]+"),

        ("MISMATCH",r"."),
    ]

    def __init__(self):
        re_tokens = []

        for name, pattern in self.TOKEN_LIST:
            re_tokens.append(f"(?P<{name}>{pattern})")

        self.re_scanner = re.compile("|".join(re_tokens),
                                     re.MULTILINE | re.DOTALL)

    def tokenize(self, code):
        # Join continuation lines. Note that this removes the newline,
        # so tokens after a continuation report a lower line number.
        code = re.sub(r"\\\n", "", code)

        line_num = 1
        line_start = 0

        for match in self.re_scanner.finditer(code):
            kind   = match.lastgroup
            value  = match.group()
            column = match.start() - line_start

            if kind == "NEWLINE":
                line_start = match.end()
                line_num += 1
                continue

            if kind == "SKIP":
                continue

            if kind == "MISMATCH":
                raise RuntimeError(f"Unexpected character {value!r} on line {line_num}")

            if kind == "ID" and value in self.C_KEYWORDS:
                kind = value.upper()

            # For all other tokens we keep the raw string value
            yield Token(kind, value, line_num, column)

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print(f"Usage: python {sys.argv[0]} <fname>")
        sys.exit(1)

    fname = sys.argv[1]

    try:
        with open(fname, 'r', encoding='utf-8') as file:
            sample = file.read()
    except FileNotFoundError:
        print(f"Error: The file '{fname}' was not found.")
        sys.exit(1)
    except Exception as e:
        print(f"An error occurred while reading the file: {str(e)}")
        sys.exit(1)

    print(f"Tokens from {fname}:")

    for tok in CTokenizer().tokenize(sample):
        print(f"{tok.line:3d}:{tok.column:3d}  {tok.type:12}  {tok.value!r}")
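
One limitation worth noting in the code above is line tracking.
line_num only advances on NEWLINE tokens, so the newlines inside a
multi-line comment are never counted: in the output, the three-line
comment at line 12 is followed by "enum" at line 13. The SKIP pattern
"[\s]+" can also swallow a newline when a line ends with trailing
whitespace. A possible way to handle both, sketched on a reduced token
table (the table and the tuple-based tokens are simplifications made
for this example only):

```python
import re

SCANNER = re.compile(
    r"(?P<COMMENT>//[^\n]*|/\*[\s\S]*?\*/)"
    r"|(?P<ID>[A-Za-z_][A-Za-z0-9_]*)"
    r"|(?P<NEWLINE>\n)"
    r"|(?P<SKIP>[^\S\n]+)"      # any whitespace except newline
    r"|(?P<OTHER>.)"
)


def tokenize(code):
    """Yield (kind, value, line, column), counting newlines inside tokens."""
    line_num = 1
    line_start = 0

    for match in SCANNER.finditer(code):
        kind = match.lastgroup
        value = match.group()

        if kind == "NEWLINE":
            line_num += 1
            line_start = match.end()
            continue

        if kind == "SKIP":
            continue

        yield kind, value, line_num, match.start() - line_start

        # A block comment may span several lines: move the counters
        # past it, so the next token gets the right line and column.
        newlines = value.count("\n")
        if newlines:
            line_num += newlines
            line_start = match.start() + value.rfind("\n") + 1


sample = "/**\n * doc\n */\nenum {\n\tA,\n};\n"
for tok in tokenize(sample):
    print(tok)
```

With this, the comment is reported at line 1 and "enum" at line 4,
matching what an editor would show for the same input.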


