From: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
To: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Alexander Lobakin <aleksander.lobakin@intel.com>,
Jonathan Corbet <corbet@lwn.net>, Kees Cook <kees@kernel.org>,
Mauro Carvalho Chehab <mchehab@kernel.org>,
intel-wired-lan@lists.osuosl.org, linux-doc@vger.kernel.org,
linux-hardening@vger.kernel.org, linux-kernel@vger.kernel.org,
netdev@vger.kernel.org,
"Gustavo A. R. Silva" <gustavoars@kernel.org>,
Aleksandr Loktionov <aleksandr.loktionov@intel.com>,
Randy Dunlap <rdunlap@infradead.org>,
Shuah Khan <skhan@linuxfoundation.org>
Subject: Re: [PATCH 00/38] docs: several improvements to kernel-doc
Date: Tue, 3 Mar 2026 15:53:10 +0100
Message-ID: <20260303155310.5235b367@localhost>
In-Reply-To: <33d214091909b9a060637f56f81fb8f525cf433b@intel.com>
On Mon, 23 Feb 2026 15:47:00 +0200
Jani Nikula <jani.nikula@linux.intel.com> wrote:
> On Wed, 18 Feb 2026, Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> > As anyone that worked before with kernel-doc are aware, using regex to
> > handle C input is not great. Instead, we need something closer to how
> > C statements and declarations are handled.
> >
> > Yet, to avoid breaking docs, I avoided touching the regex-based algorithms
> > inside it with one exception: struct_group logic was using very complex
> > regexes that are incompatible with Python internal "re" module.
> >
> > So, I came up with a different approach: NestedMatch. The logic inside
> > it is meant to properly handle brackets, square brackets and parenthesis,
> > which is closer to what C lexical parser does. On that time, I added
> > a TODO about the need to extend that.
>
> There's always the question, if you're putting a lot of effort into
> making kernel-doc closer to an actual C parser, why not put all that
> effort into using and adapting to, you know, an actual C parser?
Playing with this idea, it is not that hard to write an actual C
parser - or at least a tokenizer. There is already an example of a
regex-based tokenizer in the Python re module documentation:
https://docs.python.org/3/library/re.html
I did a quick implementation, and it seems to do its job:
$ ./tokenizer.py ./include/net/netlink.h
1: 0 COMMENT '/* SPDX-License-Identifier: GPL-2.0 */'
2: 0 CPP '#ifndef'
2: 8 ID '__NET_NETLINK_H'
3: 0 CPP '#define'
3: 8 ID '__NET_NETLINK_H'
5: 0 CPP '#include'
5: 9 OP '<'
5: 10 ID 'linux'
5: 15 OP '/'
5: 16 ID 'types'
5: 21 PUNC '.'
5: 22 ID 'h'
5: 23 OP '>'
6: 0 CPP '#include'
6: 9 OP '<'
6: 10 ID 'linux'
6: 15 OP '/'
6: 16 ID 'netlink'
6: 23 PUNC '.'
6: 24 ID 'h'
6: 25 OP '>'
7: 0 CPP '#include'
7: 9 OP '<'
7: 10 ID 'linux'
7: 15 OP '/'
7: 16 ID 'jiffies'
7: 23 PUNC '.'
7: 24 ID 'h'
7: 25 OP '>'
8: 0 CPP '#include'
8: 9 OP '<'
8: 10 ID 'linux'
8: 15 OP '/'
8: 16 ID 'in6'
...
12: 1 COMMENT '/**\n * Standard attribute types to specify validation policy\n */'
13: 0 ENUM 'enum'
13: 5 PUNC '{'
14: 1 ID 'NLA_UNSPEC'
14: 11 PUNC ','
15: 1 ID 'NLA_U8'
15: 7 PUNC ','
16: 1 ID 'NLA_U16'
16: 8 PUNC ','
17: 1 ID 'NLA_U32'
17: 8 PUNC ','
18: 1 ID 'NLA_U64'
18: 8 PUNC ','
19: 1 ID 'NLA_STRING'
19: 11 PUNC ','
20: 1 ID 'NLA_FLAG'
...
41: 0 STRUCT 'struct'
41: 7 ID 'netlink_range_validation'
41: 32 PUNC '{'
42: 1 ID 'u64'
42: 5 ID 'min'
42: 8 PUNC ','
42: 10 ID 'max'
42: 13 PUNC ';'
43: 0 PUNC '}'
43: 1 PUNC ';'
45: 0 STRUCT 'struct'
45: 7 ID 'netlink_range_validation_signed'
45: 39 PUNC '{'
46: 1 ID 's64'
46: 5 ID 'min'
46: 8 PUNC ','
46: 10 ID 'max'
46: 13 PUNC ';'
47: 0 PUNC '}'
47: 1 PUNC ';'
49: 0 ENUM 'enum'
49: 5 ID 'nla_policy_validation'
49: 27 PUNC '{'
50: 1 ID 'NLA_VALIDATE_NONE'
50: 18 PUNC ','
51: 1 ID 'NLA_VALIDATE_RANGE'
51: 19 PUNC ','
52: 1 ID 'NLA_VALIDATE_RANGE_WARN_TOO_LONG'
52: 33 PUNC ','
53: 1 ID 'NLA_VALIDATE_MIN'
53: 17 PUNC ','
54: 1 ID 'NLA_VALIDATE_MAX'
54: 17 PUNC ','
55: 1 ID 'NLA_VALIDATE_MASK'
55: 18 PUNC ','
56: 1 ID 'NLA_VALIDATE_RANGE_PTR'
56: 23 PUNC ','
57: 1 ID 'NLA_VALIDATE_FUNCTION'
57: 22 PUNC ','
58: 0 PUNC '}'
58: 1 PUNC ';'
Using it sounds doable and, at least on this example, it
properly picked the IDs.
On the other hand, using it would require lots of changes to
kernel-doc. So, I guess I'll add a tokenizer to kernel-doc, but
we should likely start using it gradually.
Maybe starting with NestedMatch and with public/private
comment handling (which is currently half-broken).
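To illustrate why a token stream helps with nested delimiters, here is a minimal sketch. It is not the NestedMatch code from kernel-doc: TOKEN_RE, PAIRS, CLOSERS and find_close() are hypothetical names, and the token pattern is deliberately much smaller than a real C tokenizer. The point is only that, once comments and strings are single tokens, a bracket inside them can no longer confuse the matching logic.

```python
import re

# Hypothetical, deliberately tiny token pattern: comments and strings are
# matched first so that delimiters inside them are never seen as tokens.
TOKEN_RE = re.compile(r'/\*[\s\S]*?\*/|"(?:\\.|[^"\\])*"|[A-Za-z_]\w*|\S')

PAIRS = {"(": ")", "[": "]", "{": "}"}
CLOSERS = set(PAIRS.values())


def find_close(tokens, start):
    """Return the index of the token that closes the opener at ``start``."""
    if tokens[start] not in PAIRS:
        raise ValueError(f"token {start} is not an opening delimiter")

    stack = []
    for i in range(start, len(tokens)):
        tok = tokens[i]
        if tok in PAIRS:
            # Remember which closing delimiter we expect next
            stack.append(PAIRS[tok])
        elif tok in CLOSERS:
            if stack.pop() != tok:
                raise ValueError(f"mismatched {tok!r} at token {i}")
            if not stack:
                return i
    raise ValueError("no matching closing delimiter")


# The ")" inside the comment is part of a COMMENT-like token, so it is
# ignored when looking for the end of the macro arguments.
src = "struct_group(hdr, int a; /* ) */ char b[2];) int c;"
tokens = TOKEN_RE.findall(src)
end = find_close(tokens, 1)
print(tokens[2:end])
```

The same stack-based walk works for parentheses, square brackets and braces, which is the part a plain regex has trouble expressing.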
As a reference, the above was generated with the code below,
which was based on the Python re documentation.
Comments?
---
One side note: right now, we're not using typing in kernel-doc,
nor really following a proper coding style.
I wanted to use it during the conversion, and to place constants in
uppercase, as this is the current best practice, but doing
it while converting from Perl was very annoying. So, I opted
to keep things simpler. Now that we have it coded, perhaps it
is time to define a coding style and apply it to kernel-doc.
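As a rough idea of what typing plus uppercase constants could look like, here is a minimal sketch applied to a token class. It is only an illustration, not a proposed kernel-doc style: the field name "kind" and the classify() helper are hypothetical, and the built-in generic syntax assumes Python 3.9 or newer.

```python
from dataclasses import dataclass

# Module-level constant, written in uppercase
C_KEYWORDS: frozenset[str] = frozenset({"struct", "union", "enum"})


@dataclass(frozen=True)
class Token:
    """One lexical token with its position in the source."""

    kind: str      # hypothetical name; avoids shadowing the built-in type()
    value: str
    line: int
    column: int


def classify(kind: str, value: str) -> str:
    """Promote identifiers that are C keywords to their own token kind."""
    if kind == "ID" and value in C_KEYWORDS:
        return value.upper()
    return kind


tok = Token(classify("ID", "struct"), "struct", 41, 0)
print(tok)
```

With annotations like these, a static checker can flag a token built with the arguments in the wrong order, which a plain __init__ would accept silently.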
--
Thanks,
Mauro

#!/usr/bin/env python3

import sys
import re


class Token():
    def __init__(self, type, value, line, column):
        self.type = type
        self.value = value
        self.line = line
        self.column = column


class CTokenizer():
    C_KEYWORDS = {
        "struct", "union", "enum",
    }

    TOKEN_LIST = [
        ("COMMENT", r"//[^\n]*|/\*[\s\S]*?\*/"),
        ("STRING", r'"(?:\\.|[^"\\])*"'),
        ("CHAR", r"'(?:\\.|[^'\\])'"),
        ("NUMBER", r"0[xX][0-9a-fA-F]+[uUlL]*|0[0-7]+[uUlL]*|"
                   r"[0-9]+(\.[0-9]*)?([eE][+-]?[0-9]+)?[fFlL]*"),
        ("ID", r"[A-Za-z_][A-Za-z0-9_]*"),
        ("OP", r"\+\+|\-\-|\->|==|\!=|<=|>=|&&|\|\||<<|>>|\+=|\-=|\*=|/=|%="
               r"|&=|\|=|\^=|=|\+|\-|\*|/|%|<|>|&|\||\^|~|!|\?|\:"),
        ("PUNC", r"[;,\.\[\]\(\)\{\}]"),
        ("CPP", r"#\s*(define|include|ifdef|ifndef|if|else|elif|endif|undef|pragma)"),
        ("HASH", r"#"),
        ("NEWLINE", r"\n"),
        ("SKIP", r"[\s]+"),
        ("MISMATCH", r"."),
    ]

    def __init__(self):
        re_tokens = []

        for name, pattern in self.TOKEN_LIST:
            re_tokens.append(f"(?P<{name}>{pattern})")

        self.re_scanner = re.compile("|".join(re_tokens),
                                     re.MULTILINE | re.DOTALL)

    def tokenize(self, code):
        # Handle continuation lines
        code = re.sub(r"\\\n", "", code)

        line_num = 1
        line_start = 0

        for match in self.re_scanner.finditer(code):
            kind = match.lastgroup
            value = match.group()
            column = match.start() - line_start

            if kind == "NEWLINE":
                line_start = match.end()
                line_num += 1
                continue

            if kind in {"SKIP"}:
                continue

            if kind == "MISMATCH":
                raise RuntimeError(f"Unexpected character {value!r} on line {line_num}")

            if kind == "ID" and value in self.C_KEYWORDS:
                kind = value.upper()

            # For all other tokens we keep the raw string value
            yield Token(kind, value, line_num, column)


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print(f"Usage: python {sys.argv[0]} <fname>")
        sys.exit(1)

    fname = sys.argv[1]

    try:
        with open(fname, 'r', encoding='utf-8') as file:
            sample = file.read()
    except FileNotFoundError:
        print(f"Error: The file '{fname}' was not found.")
        sys.exit(1)
    except Exception as e:
        print(f"An error occurred while reading the file: {str(e)}")
        sys.exit(1)

    print(f"Tokens from {fname}:")

    for tok in CTokenizer().tokenize(sample):
        print(f"{tok.line:3d}:{tok.column:3d} {tok.type:12} {tok.value!r}")
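
One thing worth checking in the tokenizer above: line_num is only incremented by the NEWLINE branch, while a multi-line COMMENT match (and a SKIP match that starts with other whitespace) contains newlines that never reach that branch. A minimal sketch of one possible alternative, deriving the position from the match offset instead of counting NEWLINE tokens; line_starts() and position() are hypothetical helpers, not part of kernel-doc:

```python
import bisect
import re


def line_starts(code):
    """Return the offset at which each line of ``code`` begins."""
    starts = [0]
    for m in re.finditer(r"\n", code):
        starts.append(m.end())
    return starts


def position(starts, offset):
    """Map a character offset to a (1-based line, 0-based column) pair."""
    idx = bisect.bisect_right(starts, offset) - 1
    return idx + 1, offset - starts[idx]


# A three-line comment followed by an enum: the enum keyword is on line 4
code = "/*\n * multi-line\n */\nenum {\n\tA,\n};\n"
starts = line_starts(code)

print(position(starts, code.index("enum")))
print(position(starts, code.index("A,")))
```

In tokenize(), position(starts, match.start()) could replace the line_num and line_start bookkeeping. The offsets would still be relative to the text after continuation lines were joined, so that step would need separate handling.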