* [PATCH 00/12] docs: kdoc: thrash up dump_struct()
@ 2025-08-01 0:13 Jonathan Corbet
2025-08-01 0:13 ` [PATCH 01/12] docs: kdoc: consolidate the stripping of private struct/union members Jonathan Corbet
` (12 more replies)
0 siblings, 13 replies; 39+ messages in thread
From: Jonathan Corbet @ 2025-08-01 0:13 UTC (permalink / raw)
To: linux-doc
Cc: linux-kernel, Mauro Carvalho Chehab, Akira Yokosawa,
Jonathan Corbet
In my ongoing effort to truly understand our new kernel-doc, I continue to
make changes to improve the code, and to try to make the understanding task
easier for the next person. These patches focus on dump_struct() in
particular, which starts out at nearly 300 lines long - to much to fit into
my little brain anyway. Hopefully the result is easier to manage.
There are no changes in the rendered docs.
(At some point I think this code could benefit from a deeper rework. We
are essentially making three parsing passes over these declarations -
dump_struct(), create_parameter_list(), and push_parameter() for structs -
and it seems like we ought to be able to do better. But that's for another
day.)
(This is post-merge-window material, obviously).
Jonathan Corbet (12):
docs: kdoc: consolidate the stripping of private struct/union members
docs: kdoc: Move a regex line in dump_struct()
docs: kdoc: backslashectomy in kdoc_parser
docs: kdoc: move the prefix transforms out of dump_struct()
docs: kdoc: split top-level prototype parsing out of dump_struct()
docs: kdoc: split struct-member rewriting out of dump_struct()
docs: kdoc: rework the rewrite_struct_members() main loop
docs: kdoc: remove an extraneous strip() call
docs: kdoc: Some rewrite_struct_members() commenting
docs: kdoc: further rewrite_struct_members() cleanup
docs: kdoc: extract output formatting from dump_struct()
docs: kdoc: a few final dump_struct() touches
scripts/lib/kdoc/kdoc_parser.py | 516 +++++++++++++++++---------------
1 file changed, 267 insertions(+), 249 deletions(-)
--
2.50.1
^ permalink raw reply [flat|nested] 39+ messages in thread
* [PATCH 01/12] docs: kdoc: consolidate the stripping of private struct/union members
2025-08-01 0:13 [PATCH 00/12] docs: kdoc: thrash up dump_struct() Jonathan Corbet
@ 2025-08-01 0:13 ` Jonathan Corbet
2025-08-01 5:29 ` Mauro Carvalho Chehab
2025-08-01 0:13 ` [PATCH 02/12] docs: kdoc: Move a regex line in dump_struct() Jonathan Corbet
` (11 subsequent siblings)
12 siblings, 1 reply; 39+ messages in thread
From: Jonathan Corbet @ 2025-08-01 0:13 UTC (permalink / raw)
To: linux-doc
Cc: linux-kernel, Mauro Carvalho Chehab, Akira Yokosawa,
Jonathan Corbet
There were two locations duplicating the logic of stripping private members
and associated comments; coalesce them into one, and add some comments
describing what's going on.
Output change: we now no longer add extraneous white space around macro
definitions.
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
---
scripts/lib/kdoc/kdoc_parser.py | 40 ++++++++++++++++++---------------
1 file changed, 22 insertions(+), 18 deletions(-)
diff --git a/scripts/lib/kdoc/kdoc_parser.py b/scripts/lib/kdoc/kdoc_parser.py
index c3fe4bd5eab4..93fcd8807aa8 100644
--- a/scripts/lib/kdoc/kdoc_parser.py
+++ b/scripts/lib/kdoc/kdoc_parser.py
@@ -81,6 +81,21 @@ multi_space = KernRe(r'\s\s+')
def trim_whitespace(s):
return multi_space.sub(' ', s.strip())
+#
+# Remove struct/enum members that have been marked "private".
+#
+def trim_private_members(text):
+ #
+ # First look for a "public:" block that ends a private region, then
+ # handle the "private until the end" case.
+ #
+ text = KernRe(r'/\*\s*private:.*?/\*\s*public:.*?\*/', flags=re.S).sub('', text)
+ text = KernRe(r'/\*\s*private:.*', flags=re.S).sub('', text)
+ #
+ # We needed the comments to do the above, but now we can take them out.
+ #
+ return KernRe(r'\s*/\*.*?\*/\s*', flags=re.S).sub('', text).strip()
+
class state:
"""
State machine enums
@@ -568,12 +583,6 @@ class KernelDoc:
args_pattern = r'([^,)]+)'
sub_prefixes = [
- (KernRe(r'\/\*\s*private:.*?\/\*\s*public:.*?\*\/', re.S | re.I), ''),
- (KernRe(r'\/\*\s*private:.*', re.S | re.I), ''),
-
- # Strip comments
- (KernRe(r'\/\*.*?\*\/', re.S), ''),
-
# Strip attributes
(attribute, ' '),
(KernRe(r'\s*__aligned\s*\([^;]*\)', re.S), ' '),
@@ -648,6 +657,7 @@ class KernelDoc:
(re.compile(r'\bSTRUCT_GROUP\('), r'\1'),
]
+ members = trim_private_members(members)
for search, sub in sub_prefixes:
members = search.sub(sub, members)
@@ -797,24 +807,18 @@ class KernelDoc:
"""
Stores an enum inside self.entries array.
"""
-
- # Ignore members marked private
- proto = KernRe(r'\/\*\s*private:.*?\/\*\s*public:.*?\*\/', flags=re.S).sub('', proto)
- proto = KernRe(r'\/\*\s*private:.*}', flags=re.S).sub('}', proto)
-
- # Strip comments
- proto = KernRe(r'\/\*.*?\*\/', flags=re.S).sub('', proto)
-
- # Strip #define macros inside enums
+ #
+ # Strip preprocessor directives. Note that this depends on the
+ # trailing semicolon we added in process_proto_type().
+ #
proto = KernRe(r'#\s*((define|ifdef|if)\s+|endif)[^;]*;', flags=re.S).sub('', proto)
-
#
# Parse out the name and members of the enum. Typedef form first.
#
r = KernRe(r'typedef\s+enum\s*\{(.*)\}\s*(\w*)\s*;')
if r.search(proto):
declaration_name = r.group(2)
- members = r.group(1).rstrip()
+ members = trim_private_members(r.group(1))
#
# Failing that, look for a straight enum
#
@@ -822,7 +826,7 @@ class KernelDoc:
r = KernRe(r'enum\s+(\w*)\s*\{(.*)\}')
if r.match(proto):
declaration_name = r.group(1)
- members = r.group(2).rstrip()
+ members = trim_private_members(r.group(2))
#
# OK, this isn't going to work.
#
--
2.50.1
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCH 02/12] docs: kdoc: Move a regex line in dump_struct()
2025-08-01 0:13 [PATCH 00/12] docs: kdoc: thrash up dump_struct() Jonathan Corbet
2025-08-01 0:13 ` [PATCH 01/12] docs: kdoc: consolidate the stripping of private struct/union members Jonathan Corbet
@ 2025-08-01 0:13 ` Jonathan Corbet
2025-08-01 5:29 ` Mauro Carvalho Chehab
2025-08-01 0:13 ` [PATCH 03/12] docs: kdoc: backslashectomy in kdoc_parser Jonathan Corbet
` (10 subsequent siblings)
12 siblings, 1 reply; 39+ messages in thread
From: Jonathan Corbet @ 2025-08-01 0:13 UTC (permalink / raw)
To: linux-doc
Cc: linux-kernel, Mauro Carvalho Chehab, Akira Yokosawa,
Jonathan Corbet
The complex struct_members regex was defined far from its use; bring the
two together. Remove some extraneous backslashes while making the move.
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
---
scripts/lib/kdoc/kdoc_parser.py | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/scripts/lib/kdoc/kdoc_parser.py b/scripts/lib/kdoc/kdoc_parser.py
index 93fcd8807aa8..9948ede739a5 100644
--- a/scripts/lib/kdoc/kdoc_parser.py
+++ b/scripts/lib/kdoc/kdoc_parser.py
@@ -551,7 +551,6 @@ class KernelDoc:
]
definition_body = r'\{(.*)\}\s*' + "(?:" + '|'.join(qualifiers) + ")?"
- struct_members = KernRe(type_pattern + r'([^\{\};]+)(\{)([^\{\}]*)(\})([^\{\}\;]*)(\;)')
# Extract struct/union definition
members = None
@@ -683,6 +682,7 @@ class KernelDoc:
# So, we need to have an extra loop on Python to override such
# re limitation.
+ struct_members = KernRe(type_pattern + r'([^{};]+)(\{)([^{}]*)(\})([^{};]*)(;)')
while True:
tuples = struct_members.findall(members)
if not tuples:
--
2.50.1
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCH 03/12] docs: kdoc: backslashectomy in kdoc_parser
2025-08-01 0:13 [PATCH 00/12] docs: kdoc: thrash up dump_struct() Jonathan Corbet
2025-08-01 0:13 ` [PATCH 01/12] docs: kdoc: consolidate the stripping of private struct/union members Jonathan Corbet
2025-08-01 0:13 ` [PATCH 02/12] docs: kdoc: Move a regex line in dump_struct() Jonathan Corbet
@ 2025-08-01 0:13 ` Jonathan Corbet
2025-08-01 4:27 ` Mauro Carvalho Chehab
2025-08-01 0:13 ` [PATCH 04/12] docs: kdoc: move the prefix transforms out of dump_struct() Jonathan Corbet
` (9 subsequent siblings)
12 siblings, 1 reply; 39+ messages in thread
From: Jonathan Corbet @ 2025-08-01 0:13 UTC (permalink / raw)
To: linux-doc
Cc: linux-kernel, Mauro Carvalho Chehab, Akira Yokosawa,
Jonathan Corbet
A lot of the regular expressions in this file have extraneous backslashes
that may have been needed in Perl, but aren't helpful here. Take them out
to reduce slightly the visual noise.
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
---
scripts/lib/kdoc/kdoc_parser.py | 40 ++++++++++++++++-----------------
1 file changed, 20 insertions(+), 20 deletions(-)
diff --git a/scripts/lib/kdoc/kdoc_parser.py b/scripts/lib/kdoc/kdoc_parser.py
index 9948ede739a5..e1efa65a3480 100644
--- a/scripts/lib/kdoc/kdoc_parser.py
+++ b/scripts/lib/kdoc/kdoc_parser.py
@@ -46,7 +46,7 @@ doc_decl = doc_com + KernRe(r'(\w+)', cache=False)
known_section_names = 'description|context|returns?|notes?|examples?'
known_sections = KernRe(known_section_names, flags = re.I)
doc_sect = doc_com + \
- KernRe(r'\s*(\@[.\w]+|\@\.\.\.|' + known_section_names + r')\s*:([^:].*)?$',
+ KernRe(r'\s*(@[.\w]+|@\.\.\.|' + known_section_names + r')\s*:([^:].*)?$',
flags=re.I, cache=False)
doc_content = doc_com_body + KernRe(r'(.*)', cache=False)
@@ -60,7 +60,7 @@ attribute = KernRe(r"__attribute__\s*\(\([a-z0-9,_\*\s\(\)]*\)\)",
export_symbol = KernRe(r'^\s*EXPORT_SYMBOL(_GPL)?\s*\(\s*(\w+)\s*\)\s*', cache=False)
export_symbol_ns = KernRe(r'^\s*EXPORT_SYMBOL_NS(_GPL)?\s*\(\s*(\w+)\s*,\s*"\S+"\)\s*', cache=False)
-type_param = KernRe(r"\@(\w*((\.\w+)|(->\w+))*(\.\.\.)?)", cache=False)
+type_param = KernRe(r"@(\w*((\.\w+)|(->\w+))*(\.\.\.)?)", cache=False)
#
# Tests for the beginning of a kerneldoc block in its various forms.
@@ -331,7 +331,7 @@ class KernelDoc:
self.entry.anon_struct_union = False
- param = KernRe(r'[\[\)].*').sub('', param, count=1)
+ param = KernRe(r'[)[].*').sub('', param, count=1)
if dtype == "" and param.endswith("..."):
if KernRe(r'\w\.\.\.$').search(param):
@@ -405,7 +405,7 @@ class KernelDoc:
for arg in args.split(splitter):
# Strip comments
- arg = KernRe(r'\/\*.*\*\/').sub('', arg)
+ arg = KernRe(r'/\*.*\*/').sub('', arg)
# Ignore argument attributes
arg = KernRe(r'\sPOS0?\s').sub(' ', arg)
@@ -428,14 +428,14 @@ class KernelDoc:
arg = arg.replace('#', ',')
- r = KernRe(r'[^\(]+\(\*?\s*([\w\[\]\.]*)\s*\)')
+ r = KernRe(r'[^(]+\(\*?\s*([\w[\].]*)\s*\)')
if r.match(arg):
param = r.group(1)
else:
self.emit_msg(ln, f"Invalid param: {arg}")
param = arg
- dtype = KernRe(r'([^\(]+\(\*?)\s*' + re.escape(param)).sub(r'\1', arg)
+ dtype = KernRe(r'([^(]+\(\*?)\s*' + re.escape(param)).sub(r'\1', arg)
self.push_parameter(ln, decl_type, param, dtype,
arg, declaration_name)
@@ -443,14 +443,14 @@ class KernelDoc:
# Array-of-pointers
arg = arg.replace('#', ',')
- r = KernRe(r'[^\(]+\(\s*\*\s*([\w\[\]\.]*?)\s*(\s*\[\s*[\w]+\s*\]\s*)*\)')
+ r = KernRe(r'[^(]+\(\s*\*\s*([\w[\].]*?)\s*(\s*\[\s*[\w]+\s*\]\s*)*\)')
if r.match(arg):
param = r.group(1)
else:
self.emit_msg(ln, f"Invalid param: {arg}")
param = arg
- dtype = KernRe(r'([^\(]+\(\*?)\s*' + re.escape(param)).sub(r'\1', arg)
+ dtype = KernRe(r'([^(]+\(\*?)\s*' + re.escape(param)).sub(r'\1', arg)
self.push_parameter(ln, decl_type, param, dtype,
arg, declaration_name)
@@ -637,8 +637,8 @@ class KernelDoc:
# it is better to also move those to the NestedMatch logic,
# to ensure that parenthesis will be properly matched.
- (KernRe(r'__ETHTOOL_DECLARE_LINK_MODE_MASK\s*\(([^\)]+)\)', re.S), r'DECLARE_BITMAP(\1, __ETHTOOL_LINK_MODE_MASK_NBITS)'),
- (KernRe(r'DECLARE_PHY_INTERFACE_MASK\s*\(([^\)]+)\)', re.S), r'DECLARE_BITMAP(\1, PHY_INTERFACE_MODE_MAX)'),
+ (KernRe(r'__ETHTOOL_DECLARE_LINK_MODE_MASK\s*\(([^)]+)\)', re.S), r'DECLARE_BITMAP(\1, __ETHTOOL_LINK_MODE_MASK_NBITS)'),
+ (KernRe(r'DECLARE_PHY_INTERFACE_MASK\s*\(([^)]+)\)', re.S), r'DECLARE_BITMAP(\1, PHY_INTERFACE_MODE_MAX)'),
(KernRe(r'DECLARE_BITMAP\s*\(' + args_pattern + r',\s*' + args_pattern + r'\)', re.S), r'unsigned long \1[BITS_TO_LONGS(\2)]'),
(KernRe(r'DECLARE_HASHTABLE\s*\(' + args_pattern + r',\s*' + args_pattern + r'\)', re.S), r'unsigned long \1[1 << ((\2) - 1)]'),
(KernRe(r'DECLARE_KFIFO\s*\(' + args_pattern + r',\s*' + args_pattern + r',\s*' + args_pattern + r'\)', re.S), r'\2 *\1'),
@@ -700,7 +700,7 @@ class KernelDoc:
s_id = s_id.strip()
newmember += f"{maintype} {s_id}; "
- s_id = KernRe(r'[:\[].*').sub('', s_id)
+ s_id = KernRe(r'[:[].*').sub('', s_id)
s_id = KernRe(r'^\s*\**(\S+)\s*').sub(r'\1', s_id)
for arg in content.split(';'):
@@ -709,7 +709,7 @@ class KernelDoc:
if not arg:
continue
- r = KernRe(r'^([^\(]+\(\*?\s*)([\w\.]*)(\s*\).*)')
+ r = KernRe(r'^([^(]+\(\*?\s*)([\w.]*)(\s*\).*)')
if r.match(arg):
# Pointer-to-function
dtype = r.group(1)
@@ -767,12 +767,12 @@ class KernelDoc:
self.check_sections(ln, declaration_name, decl_type)
# Adjust declaration for better display
- declaration = KernRe(r'([\{;])').sub(r'\1\n', declaration)
+ declaration = KernRe(r'([{;])').sub(r'\1\n', declaration)
declaration = KernRe(r'\}\s+;').sub('};', declaration)
# Better handle inlined enums
while True:
- r = KernRe(r'(enum\s+\{[^\}]+),([^\n])')
+ r = KernRe(r'(enum\s+\{[^}]+),([^\n])')
if not r.search(declaration):
break
@@ -969,8 +969,8 @@ class KernelDoc:
# - pci_match_device, __copy_to_user (long return type)
name = r'[a-zA-Z0-9_~:]+'
- prototype_end1 = r'[^\(]*'
- prototype_end2 = r'[^\{]*'
+ prototype_end1 = r'[^(]*'
+ prototype_end2 = r'[^{]*'
prototype_end = fr'\(({prototype_end1}|{prototype_end2})\)'
# Besides compiling, Perl qr{[\w\s]+} works as a non-capturing group.
@@ -1044,7 +1044,7 @@ class KernelDoc:
Stores a typedef inside self.entries array.
"""
- typedef_type = r'((?:\s+[\w\*]+\b){0,7}\s+(?:\w+\b|\*+))\s*'
+ typedef_type = r'((?:\s+[\w*]+\b){0,7}\s+(?:\w+\b|\*+))\s*'
typedef_ident = r'\*?\s*(\w\S+)\s*'
typedef_args = r'\s*\((.*)\);'
@@ -1265,7 +1265,7 @@ class KernelDoc:
self.dump_section()
# Look for doc_com + <text> + doc_end:
- r = KernRe(r'\s*\*\s*[a-zA-Z_0-9:\.]+\*/')
+ r = KernRe(r'\s*\*\s*[a-zA-Z_0-9:.]+\*/')
if r.match(line):
self.emit_msg(ln, f"suspicious ending line: {line}")
@@ -1476,14 +1476,14 @@ class KernelDoc:
"""Ancillary routine to process a function prototype"""
# strip C99-style comments to end of line
- line = KernRe(r"\/\/.*$", re.S).sub('', line)
+ line = KernRe(r"//.*$", re.S).sub('', line)
#
# Soak up the line's worth of prototype text, stopping at { or ; if present.
#
if KernRe(r'\s*#\s*define').match(line):
self.entry.prototype = line
elif not line.startswith('#'): # skip other preprocessor stuff
- r = KernRe(r'([^\{]*)')
+ r = KernRe(r'([^{]*)')
if r.match(line):
self.entry.prototype += r.group(1) + " "
#
--
2.50.1
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCH 04/12] docs: kdoc: move the prefix transforms out of dump_struct()
2025-08-01 0:13 [PATCH 00/12] docs: kdoc: thrash up dump_struct() Jonathan Corbet
` (2 preceding siblings ...)
2025-08-01 0:13 ` [PATCH 03/12] docs: kdoc: backslashectomy in kdoc_parser Jonathan Corbet
@ 2025-08-01 0:13 ` Jonathan Corbet
2025-08-01 5:28 ` Mauro Carvalho Chehab
2025-08-01 0:13 ` [PATCH 05/12] docs: kdoc: split top-level prototype parsing " Jonathan Corbet
` (8 subsequent siblings)
12 siblings, 1 reply; 39+ messages in thread
From: Jonathan Corbet @ 2025-08-01 0:13 UTC (permalink / raw)
To: linux-doc
Cc: linux-kernel, Mauro Carvalho Chehab, Akira Yokosawa,
Jonathan Corbet
dump_struct is one of the longest functions in the kdoc_parser class,
making it hard to read and reason about. Move the definition of the prefix
transformations out of the function, join them with the definition of
"attribute" (which was defined at the top of the file but only used here),
and reformat the code slightly for shorter line widths.
Just code movement in the end.
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
---
scripts/lib/kdoc/kdoc_parser.py | 178 +++++++++++++++++---------------
1 file changed, 96 insertions(+), 82 deletions(-)
diff --git a/scripts/lib/kdoc/kdoc_parser.py b/scripts/lib/kdoc/kdoc_parser.py
index e1efa65a3480..5e375318df9c 100644
--- a/scripts/lib/kdoc/kdoc_parser.py
+++ b/scripts/lib/kdoc/kdoc_parser.py
@@ -54,8 +54,6 @@ doc_inline_start = KernRe(r'^\s*/\*\*\s*$', cache=False)
doc_inline_sect = KernRe(r'\s*\*\s*(@\s*[\w][\w\.]*\s*):(.*)', cache=False)
doc_inline_end = KernRe(r'^\s*\*/\s*$', cache=False)
doc_inline_oneline = KernRe(r'^\s*/\*\*\s*(@[\w\s]+):\s*(.*)\s*\*/\s*$', cache=False)
-attribute = KernRe(r"__attribute__\s*\(\([a-z0-9,_\*\s\(\)]*\)\)",
- flags=re.I | re.S, cache=False)
export_symbol = KernRe(r'^\s*EXPORT_SYMBOL(_GPL)?\s*\(\s*(\w+)\s*\)\s*', cache=False)
export_symbol_ns = KernRe(r'^\s*EXPORT_SYMBOL_NS(_GPL)?\s*\(\s*(\w+)\s*,\s*"\S+"\)\s*', cache=False)
@@ -74,6 +72,97 @@ doc_begin_func = KernRe(str(doc_com) + # initial " * '
r'(?:[-:].*)?$', # description (not captured)
cache = False)
+#
+# Here begins a long set of transformations to turn structure member prefixes
+# and macro invocations into something we can parse and generate kdoc for.
+#
+struct_attribute = KernRe(r"__attribute__\s*\(\([a-z0-9,_\*\s\(\)]*\)\)",
+ flags=re.I | re.S, cache=False)
+struct_args_pattern = r'([^,)]+)'
+
+struct_prefixes = [
+ # Strip attributes
+ (struct_attribute, ' '),
+ (KernRe(r'\s*__aligned\s*\([^;]*\)', re.S), ' '),
+ (KernRe(r'\s*__counted_by\s*\([^;]*\)', re.S), ' '),
+ (KernRe(r'\s*__counted_by_(le|be)\s*\([^;]*\)', re.S), ' '),
+ (KernRe(r'\s*__packed\s*', re.S), ' '),
+ (KernRe(r'\s*CRYPTO_MINALIGN_ATTR', re.S), ' '),
+ (KernRe(r'\s*____cacheline_aligned_in_smp', re.S), ' '),
+ (KernRe(r'\s*____cacheline_aligned', re.S), ' '),
+ #
+ # Unwrap struct_group macros based on this definition:
+ # __struct_group(TAG, NAME, ATTRS, MEMBERS...)
+ # which has variants like: struct_group(NAME, MEMBERS...)
+ # Only MEMBERS arguments require documentation.
+ #
+ # Parsing them happens on two steps:
+ #
+ # 1. drop struct group arguments that aren't at MEMBERS,
+ # storing them as STRUCT_GROUP(MEMBERS)
+ #
+ # 2. remove STRUCT_GROUP() ancillary macro.
+ #
+ # The original logic used to remove STRUCT_GROUP() using an
+ # advanced regex:
+ #
+ # \bSTRUCT_GROUP(\(((?:(?>[^)(]+)|(?1))*)\))[^;]*;
+ #
+ # with two patterns that are incompatible with
+ # Python re module, as it has:
+ #
+ # - a recursive pattern: (?1)
+ # - an atomic grouping: (?>...)
+ #
+ # I tried a simpler version: but it didn't work either:
+ # \bSTRUCT_GROUP\(([^\)]+)\)[^;]*;
+ #
+ # As it doesn't properly match the end parenthesis on some cases.
+ #
+ # So, a better solution was crafted: there's now a NestedMatch
+ # class that ensures that delimiters after a search are properly
+ # matched. So, the implementation to drop STRUCT_GROUP() will be
+ # handled in separate.
+ #
+ (KernRe(r'\bstruct_group\s*\(([^,]*,)', re.S), r'STRUCT_GROUP('),
+ (KernRe(r'\bstruct_group_attr\s*\(([^,]*,){2}', re.S), r'STRUCT_GROUP('),
+ (KernRe(r'\bstruct_group_tagged\s*\(([^,]*),([^,]*),', re.S), r'struct \1 \2; STRUCT_GROUP('),
+ (KernRe(r'\b__struct_group\s*\(([^,]*,){3}', re.S), r'STRUCT_GROUP('),
+ #
+ # Replace macros
+ #
+ # TODO: use NestedMatch for FOO($1, $2, ...) matches
+ #
+ # it is better to also move those to the NestedMatch logic,
+ # to ensure that parenthesis will be properly matched.
+ #
+ (KernRe(r'__ETHTOOL_DECLARE_LINK_MODE_MASK\s*\(([^)]+)\)', re.S),
+ r'DECLARE_BITMAP(\1, __ETHTOOL_LINK_MODE_MASK_NBITS)'),
+ (KernRe(r'DECLARE_PHY_INTERFACE_MASK\s*\(([^)]+)\)', re.S),
+ r'DECLARE_BITMAP(\1, PHY_INTERFACE_MODE_MAX)'),
+ (KernRe(r'DECLARE_BITMAP\s*\(' + struct_args_pattern + r',\s*' + struct_args_pattern + r'\)',
+ re.S), r'unsigned long \1[BITS_TO_LONGS(\2)]'),
+ (KernRe(r'DECLARE_HASHTABLE\s*\(' + struct_args_pattern + r',\s*' + struct_args_pattern + r'\)',
+ re.S), r'unsigned long \1[1 << ((\2) - 1)]'),
+ (KernRe(r'DECLARE_KFIFO\s*\(' + struct_args_pattern + r',\s*' + struct_args_pattern +
+ r',\s*' + struct_args_pattern + r'\)', re.S), r'\2 *\1'),
+ (KernRe(r'DECLARE_KFIFO_PTR\s*\(' + struct_args_pattern + r',\s*' +
+ struct_args_pattern + r'\)', re.S), r'\2 *\1'),
+ (KernRe(r'(?:__)?DECLARE_FLEX_ARRAY\s*\(' + struct_args_pattern + r',\s*' +
+ struct_args_pattern + r'\)', re.S), r'\1 \2[]'),
+ (KernRe(r'DEFINE_DMA_UNMAP_ADDR\s*\(' + struct_args_pattern + r'\)', re.S), r'dma_addr_t \1'),
+ (KernRe(r'DEFINE_DMA_UNMAP_LEN\s*\(' + struct_args_pattern + r'\)', re.S), r'__u32 \1'),
+]
+#
+# Regexes here are guaranteed to have the end limiter matching
+# the start delimiter. Yet, right now, only one replace group
+# is allowed.
+#
+struct_nested_prefixes = [
+ (re.compile(r'\bSTRUCT_GROUP\('), r'\1'),
+]
+
+
#
# A little helper to get rid of excess white space
#
@@ -579,90 +668,15 @@ class KernelDoc:
f"expecting prototype for {decl_type} {self.entry.identifier}. Prototype was for {decl_type} {declaration_name} instead\n")
return
- args_pattern = r'([^,)]+)'
-
- sub_prefixes = [
- # Strip attributes
- (attribute, ' '),
- (KernRe(r'\s*__aligned\s*\([^;]*\)', re.S), ' '),
- (KernRe(r'\s*__counted_by\s*\([^;]*\)', re.S), ' '),
- (KernRe(r'\s*__counted_by_(le|be)\s*\([^;]*\)', re.S), ' '),
- (KernRe(r'\s*__packed\s*', re.S), ' '),
- (KernRe(r'\s*CRYPTO_MINALIGN_ATTR', re.S), ' '),
- (KernRe(r'\s*____cacheline_aligned_in_smp', re.S), ' '),
- (KernRe(r'\s*____cacheline_aligned', re.S), ' '),
-
- # Unwrap struct_group macros based on this definition:
- # __struct_group(TAG, NAME, ATTRS, MEMBERS...)
- # which has variants like: struct_group(NAME, MEMBERS...)
- # Only MEMBERS arguments require documentation.
- #
- # Parsing them happens on two steps:
- #
- # 1. drop struct group arguments that aren't at MEMBERS,
- # storing them as STRUCT_GROUP(MEMBERS)
- #
- # 2. remove STRUCT_GROUP() ancillary macro.
- #
- # The original logic used to remove STRUCT_GROUP() using an
- # advanced regex:
- #
- # \bSTRUCT_GROUP(\(((?:(?>[^)(]+)|(?1))*)\))[^;]*;
- #
- # with two patterns that are incompatible with
- # Python re module, as it has:
- #
- # - a recursive pattern: (?1)
- # - an atomic grouping: (?>...)
- #
- # I tried a simpler version: but it didn't work either:
- # \bSTRUCT_GROUP\(([^\)]+)\)[^;]*;
- #
- # As it doesn't properly match the end parenthesis on some cases.
- #
- # So, a better solution was crafted: there's now a NestedMatch
- # class that ensures that delimiters after a search are properly
- # matched. So, the implementation to drop STRUCT_GROUP() will be
- # handled in separate.
-
- (KernRe(r'\bstruct_group\s*\(([^,]*,)', re.S), r'STRUCT_GROUP('),
- (KernRe(r'\bstruct_group_attr\s*\(([^,]*,){2}', re.S), r'STRUCT_GROUP('),
- (KernRe(r'\bstruct_group_tagged\s*\(([^,]*),([^,]*),', re.S), r'struct \1 \2; STRUCT_GROUP('),
- (KernRe(r'\b__struct_group\s*\(([^,]*,){3}', re.S), r'STRUCT_GROUP('),
-
- # Replace macros
- #
- # TODO: use NestedMatch for FOO($1, $2, ...) matches
- #
- # it is better to also move those to the NestedMatch logic,
- # to ensure that parenthesis will be properly matched.
-
- (KernRe(r'__ETHTOOL_DECLARE_LINK_MODE_MASK\s*\(([^)]+)\)', re.S), r'DECLARE_BITMAP(\1, __ETHTOOL_LINK_MODE_MASK_NBITS)'),
- (KernRe(r'DECLARE_PHY_INTERFACE_MASK\s*\(([^)]+)\)', re.S), r'DECLARE_BITMAP(\1, PHY_INTERFACE_MODE_MAX)'),
- (KernRe(r'DECLARE_BITMAP\s*\(' + args_pattern + r',\s*' + args_pattern + r'\)', re.S), r'unsigned long \1[BITS_TO_LONGS(\2)]'),
- (KernRe(r'DECLARE_HASHTABLE\s*\(' + args_pattern + r',\s*' + args_pattern + r'\)', re.S), r'unsigned long \1[1 << ((\2) - 1)]'),
- (KernRe(r'DECLARE_KFIFO\s*\(' + args_pattern + r',\s*' + args_pattern + r',\s*' + args_pattern + r'\)', re.S), r'\2 *\1'),
- (KernRe(r'DECLARE_KFIFO_PTR\s*\(' + args_pattern + r',\s*' + args_pattern + r'\)', re.S), r'\2 *\1'),
- (KernRe(r'(?:__)?DECLARE_FLEX_ARRAY\s*\(' + args_pattern + r',\s*' + args_pattern + r'\)', re.S), r'\1 \2[]'),
- (KernRe(r'DEFINE_DMA_UNMAP_ADDR\s*\(' + args_pattern + r'\)', re.S), r'dma_addr_t \1'),
- (KernRe(r'DEFINE_DMA_UNMAP_LEN\s*\(' + args_pattern + r'\)', re.S), r'__u32 \1'),
- ]
-
- # Regexes here are guaranteed to have the end limiter matching
- # the start delimiter. Yet, right now, only one replace group
- # is allowed.
-
- sub_nested_prefixes = [
- (re.compile(r'\bSTRUCT_GROUP\('), r'\1'),
- ]
-
+ #
+ # Go through the list of members applying all of our transformations.
+ #
members = trim_private_members(members)
- for search, sub in sub_prefixes:
+ for search, sub in struct_prefixes:
members = search.sub(sub, members)
nested = NestedMatch()
-
- for search, sub in sub_nested_prefixes:
+ for search, sub in struct_nested_prefixes:
members = nested.sub(search, sub, members)
# Keeps the original declaration as-is
--
2.50.1
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCH 05/12] docs: kdoc: split top-level prototype parsing out of dump_struct()
2025-08-01 0:13 [PATCH 00/12] docs: kdoc: thrash up dump_struct() Jonathan Corbet
` (3 preceding siblings ...)
2025-08-01 0:13 ` [PATCH 04/12] docs: kdoc: move the prefix transforms out of dump_struct() Jonathan Corbet
@ 2025-08-01 0:13 ` Jonathan Corbet
2025-08-01 5:34 ` Mauro Carvalho Chehab
2025-08-01 0:13 ` [PATCH 06/12] docs: kdoc: split struct-member rewriting " Jonathan Corbet
` (7 subsequent siblings)
12 siblings, 1 reply; 39+ messages in thread
From: Jonathan Corbet @ 2025-08-01 0:13 UTC (permalink / raw)
To: linux-doc
Cc: linux-kernel, Mauro Carvalho Chehab, Akira Yokosawa,
Jonathan Corbet
Move the initial split of the prototype into its own function in the
ongoing effort to cut dump_struct() down to size.
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
---
scripts/lib/kdoc/kdoc_parser.py | 44 +++++++++++++++------------------
1 file changed, 20 insertions(+), 24 deletions(-)
diff --git a/scripts/lib/kdoc/kdoc_parser.py b/scripts/lib/kdoc/kdoc_parser.py
index 5e375318df9c..2bb0da22048f 100644
--- a/scripts/lib/kdoc/kdoc_parser.py
+++ b/scripts/lib/kdoc/kdoc_parser.py
@@ -624,13 +624,11 @@ class KernelDoc:
self.emit_msg(ln,
f"No description found for return value of '{declaration_name}'")
- def dump_struct(self, ln, proto):
- """
- Store an entry for an struct or union
- """
-
+ #
+ # Split apart a structure prototype; returns (struct|union, name, members) or None
+ #
+ def split_struct_proto(self, proto):
type_pattern = r'(struct|union)'
-
qualifiers = [
"__attribute__",
"__packed",
@@ -638,36 +636,34 @@ class KernelDoc:
"____cacheline_aligned_in_smp",
"____cacheline_aligned",
]
-
definition_body = r'\{(.*)\}\s*' + "(?:" + '|'.join(qualifiers) + ")?"
- # Extract struct/union definition
- members = None
- declaration_name = None
- decl_type = None
-
r = KernRe(type_pattern + r'\s+(\w+)\s*' + definition_body)
if r.search(proto):
- decl_type = r.group(1)
- declaration_name = r.group(2)
- members = r.group(3)
+ return (r.group(1), r.group(2), r.group(3))
else:
r = KernRe(r'typedef\s+' + type_pattern + r'\s*' + definition_body + r'\s*(\w+)\s*;')
-
if r.search(proto):
- decl_type = r.group(1)
- declaration_name = r.group(3)
- members = r.group(2)
+ return (r.group(1), r.group(3), r.group(2))
+ return None
- if not members:
+ def dump_struct(self, ln, proto):
+ """
+ Store an entry for an struct or union
+ """
+ #
+ # Do the basic parse to get the pieces of the declaration.
+ #
+ struct_parts = self.split_struct_proto(proto)
+ if not struct_parts:
self.emit_msg(ln, f"{proto} error: Cannot parse struct or union!")
return
+ decl_type, declaration_name, members = struct_parts
if self.entry.identifier != declaration_name:
- self.emit_msg(ln,
- f"expecting prototype for {decl_type} {self.entry.identifier}. Prototype was for {decl_type} {declaration_name} instead\n")
+ self.emit_msg(ln, f"expecting prototype for {decl_type} {self.entry.identifier}. "
+ f"Prototype was for {decl_type} {declaration_name} instead\n")
return
-
#
# Go through the list of members applying all of our transformations.
#
@@ -696,7 +692,7 @@ class KernelDoc:
# So, we need to have an extra loop on Python to override such
# re limitation.
- struct_members = KernRe(type_pattern + r'([^{};]+)(\{)([^{}]*)(\})([^{};]*)(;)')
+ struct_members = KernRe(r'(struct|union)([^{};]+)(\{)([^{}]*)(\})([^{};]*)(;)')
while True:
tuples = struct_members.findall(members)
if not tuples:
--
2.50.1
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCH 06/12] docs: kdoc: split struct-member rewriting out of dump_struct()
2025-08-01 0:13 [PATCH 00/12] docs: kdoc: thrash up dump_struct() Jonathan Corbet
` (4 preceding siblings ...)
2025-08-01 0:13 ` [PATCH 05/12] docs: kdoc: split top-level prototype parsing " Jonathan Corbet
@ 2025-08-01 0:13 ` Jonathan Corbet
2025-08-01 5:37 ` Mauro Carvalho Chehab
2025-08-01 0:13 ` [PATCH 07/12] docs: kdoc: rework the rewrite_struct_members() main loop Jonathan Corbet
` (6 subsequent siblings)
12 siblings, 1 reply; 39+ messages in thread
From: Jonathan Corbet @ 2025-08-01 0:13 UTC (permalink / raw)
To: linux-doc
Cc: linux-kernel, Mauro Carvalho Chehab, Akira Yokosawa,
Jonathan Corbet
The massive loop that massages struct members shares no data with the rest
of dump_struct(); split it out into its own function. Code movement only,
no other changes.
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
---
scripts/lib/kdoc/kdoc_parser.py | 65 +++++++++++++++++----------------
1 file changed, 34 insertions(+), 31 deletions(-)
diff --git a/scripts/lib/kdoc/kdoc_parser.py b/scripts/lib/kdoc/kdoc_parser.py
index 2bb0da22048f..5c4ad8febb9f 100644
--- a/scripts/lib/kdoc/kdoc_parser.py
+++ b/scripts/lib/kdoc/kdoc_parser.py
@@ -647,37 +647,7 @@ class KernelDoc:
return (r.group(1), r.group(3), r.group(2))
return None
- def dump_struct(self, ln, proto):
- """
- Store an entry for an struct or union
- """
- #
- # Do the basic parse to get the pieces of the declaration.
- #
- struct_parts = self.split_struct_proto(proto)
- if not struct_parts:
- self.emit_msg(ln, f"{proto} error: Cannot parse struct or union!")
- return
- decl_type, declaration_name, members = struct_parts
-
- if self.entry.identifier != declaration_name:
- self.emit_msg(ln, f"expecting prototype for {decl_type} {self.entry.identifier}. "
- f"Prototype was for {decl_type} {declaration_name} instead\n")
- return
- #
- # Go through the list of members applying all of our transformations.
- #
- members = trim_private_members(members)
- for search, sub in struct_prefixes:
- members = search.sub(sub, members)
-
- nested = NestedMatch()
- for search, sub in struct_nested_prefixes:
- members = nested.sub(search, sub, members)
-
- # Keeps the original declaration as-is
- declaration = members
-
+ def rewrite_struct_members(self, members):
# Split nested struct/union elements
#
# This loop was simpler at the original kernel-doc perl version, as
@@ -768,6 +738,39 @@ class KernelDoc:
newmember += f"{dtype} {s_id}.{name}; "
members = members.replace(oldmember, newmember)
+ return members
+
+ def dump_struct(self, ln, proto):
+ """
+ Store an entry for an struct or union
+ """
+ #
+ # Do the basic parse to get the pieces of the declaration.
+ #
+ struct_parts = self.split_struct_proto(proto)
+ if not struct_parts:
+ self.emit_msg(ln, f"{proto} error: Cannot parse struct or union!")
+ return
+ decl_type, declaration_name, members = struct_parts
+
+ if self.entry.identifier != declaration_name:
+ self.emit_msg(ln, f"expecting prototype for {decl_type} {self.entry.identifier}. "
+ f"Prototype was for {decl_type} {declaration_name} instead\n")
+ return
+ #
+ # Go through the list of members applying all of our transformations.
+ #
+ members = trim_private_members(members)
+ for search, sub in struct_prefixes:
+ members = search.sub(sub, members)
+
+ nested = NestedMatch()
+ for search, sub in struct_nested_prefixes:
+ members = nested.sub(search, sub, members)
+
+ # Keeps the original declaration as-is
+ declaration = members
+ members = self.rewrite_struct_members(members)
# Ignore other nested elements, like enums
members = re.sub(r'(\{[^\{\}]*\})', '', members)
--
2.50.1
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCH 07/12] docs: kdoc: rework the rewrite_struct_members() main loop
2025-08-01 0:13 [PATCH 00/12] docs: kdoc: thrash up dump_struct() Jonathan Corbet
` (5 preceding siblings ...)
2025-08-01 0:13 ` [PATCH 06/12] docs: kdoc: split struct-member rewriting " Jonathan Corbet
@ 2025-08-01 0:13 ` Jonathan Corbet
2025-08-01 5:42 ` Mauro Carvalho Chehab
2025-08-01 0:13 ` [PATCH 08/12] docs: kdoc: remove an extraneous strip() call Jonathan Corbet
` (5 subsequent siblings)
12 siblings, 1 reply; 39+ messages in thread
From: Jonathan Corbet @ 2025-08-01 0:13 UTC (permalink / raw)
To: linux-doc
Cc: linux-kernel, Mauro Carvalho Chehab, Akira Yokosawa,
Jonathan Corbet
Adopt a more Pythonic form for the main loop of this function, getting rid
of the "while True:" construction and making the actual loop invariant
explicit.
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
---
scripts/lib/kdoc/kdoc_parser.py | 8 +++-----
1 file changed, 3 insertions(+), 5 deletions(-)
diff --git a/scripts/lib/kdoc/kdoc_parser.py b/scripts/lib/kdoc/kdoc_parser.py
index 5c4ad8febb9f..efc5888fcc74 100644
--- a/scripts/lib/kdoc/kdoc_parser.py
+++ b/scripts/lib/kdoc/kdoc_parser.py
@@ -663,11 +663,8 @@ class KernelDoc:
# re limitation.
struct_members = KernRe(r'(struct|union)([^{};]+)(\{)([^{}]*)(\})([^{};]*)(;)')
- while True:
- tuples = struct_members.findall(members)
- if not tuples:
- break
-
+ tuples = struct_members.findall(members)
+ while tuples:
for t in tuples:
newmember = ""
maintype = t[0]
@@ -738,6 +735,7 @@ class KernelDoc:
newmember += f"{dtype} {s_id}.{name}; "
members = members.replace(oldmember, newmember)
+ tuples = struct_members.findall(members)
return members
def dump_struct(self, ln, proto):
--
2.50.1
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCH 08/12] docs: kdoc: remove an extraneous strip() call
2025-08-01 0:13 [PATCH 00/12] docs: kdoc: thrash up dump_struct() Jonathan Corbet
` (6 preceding siblings ...)
2025-08-01 0:13 ` [PATCH 07/12] docs: kdoc: rework the rewrite_struct_members() main loop Jonathan Corbet
@ 2025-08-01 0:13 ` Jonathan Corbet
2025-08-01 5:45 ` Mauro Carvalho Chehab
2025-08-01 0:13 ` [PATCH 09/12] docs: kdoc: Some rewrite_struct_members() commenting Jonathan Corbet
` (4 subsequent siblings)
12 siblings, 1 reply; 39+ messages in thread
From: Jonathan Corbet @ 2025-08-01 0:13 UTC (permalink / raw)
To: linux-doc
Cc: linux-kernel, Mauro Carvalho Chehab, Akira Yokosawa,
Jonathan Corbet
...the variable in question was already strip()ed at the top of the loop.
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
---
scripts/lib/kdoc/kdoc_parser.py | 1 -
1 file changed, 1 deletion(-)
diff --git a/scripts/lib/kdoc/kdoc_parser.py b/scripts/lib/kdoc/kdoc_parser.py
index efc5888fcc74..b751fa8edde7 100644
--- a/scripts/lib/kdoc/kdoc_parser.py
+++ b/scripts/lib/kdoc/kdoc_parser.py
@@ -703,7 +703,6 @@ class KernelDoc:
newmember += f"{dtype}{s_id}.{name}{extra}; "
else:
- arg = arg.strip()
# Handle bitmaps
arg = KernRe(r':\s*\d+\s*').sub('', arg)
--
2.50.1
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCH 09/12] docs: kdoc: Some rewrite_struct_members() commenting
2025-08-01 0:13 [PATCH 00/12] docs: kdoc: thrash up dump_struct() Jonathan Corbet
` (7 preceding siblings ...)
2025-08-01 0:13 ` [PATCH 08/12] docs: kdoc: remove an extraneous strip() call Jonathan Corbet
@ 2025-08-01 0:13 ` Jonathan Corbet
2025-08-01 5:50 ` Mauro Carvalho Chehab
2025-08-01 0:13 ` [PATCH 10/12] docs: kdoc: further rewrite_struct_members() cleanup Jonathan Corbet
` (3 subsequent siblings)
12 siblings, 1 reply; 39+ messages in thread
From: Jonathan Corbet @ 2025-08-01 0:13 UTC (permalink / raw)
To: linux-doc
Cc: linux-kernel, Mauro Carvalho Chehab, Akira Yokosawa,
Jonathan Corbet
Add comments to rewrite_struct_members() describing what it is actually
doing, and reformat/comment the main struct_members regex so that it is
(more) comprehensible to humans.
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
---
scripts/lib/kdoc/kdoc_parser.py | 32 +++++++++++++++++++-------------
1 file changed, 19 insertions(+), 13 deletions(-)
diff --git a/scripts/lib/kdoc/kdoc_parser.py b/scripts/lib/kdoc/kdoc_parser.py
index b751fa8edde7..20e0a2abe13b 100644
--- a/scripts/lib/kdoc/kdoc_parser.py
+++ b/scripts/lib/kdoc/kdoc_parser.py
@@ -647,22 +647,28 @@ class KernelDoc:
return (r.group(1), r.group(3), r.group(2))
return None
+ #
+ # Rewrite the members of a structure or union for easier formatting later on.
+ # Among other things, this function will turn a member like:
+ #
+ # struct { inner_members; } foo;
+ #
+ # into:
+ #
+ # struct foo; inner_members;
+ #
def rewrite_struct_members(self, members):
- # Split nested struct/union elements
- #
- # This loop was simpler at the original kernel-doc perl version, as
- # while ($members =~ m/$struct_members/) { ... }
- # reads 'members' string on each interaction.
#
- # Python behavior is different: it parses 'members' only once,
- # creating a list of tuples from the first interaction.
+ # Process struct/union members from the most deeply nested outward. The
+ # trick is in the ^{ below - it prevents a match of an outer struct/union
+ # until the inner one has been munged (removing the "{" in the process).
#
- # On other words, this won't get nested structs.
- #
- # So, we need to have an extra loop on Python to override such
- # re limitation.
-
- struct_members = KernRe(r'(struct|union)([^{};]+)(\{)([^{}]*)(\})([^{};]*)(;)')
+ struct_members = KernRe(r'(struct|union)' # 0: declaration type
+ r'([^{};]+)' # 1: possible name
+ r'(\{)'
+ r'([^{}]*)' # 3: Contents of declaration
+ r'(\})'
+ r'([^{};]*)(;)') # 5: Remaining stuff after declaration
tuples = struct_members.findall(members)
while tuples:
for t in tuples:
--
2.50.1
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCH 10/12] docs: kdoc: further rewrite_struct_members() cleanup
2025-08-01 0:13 [PATCH 00/12] docs: kdoc: thrash up dump_struct() Jonathan Corbet
` (8 preceding siblings ...)
2025-08-01 0:13 ` [PATCH 09/12] docs: kdoc: Some rewrite_struct_members() commenting Jonathan Corbet
@ 2025-08-01 0:13 ` Jonathan Corbet
2025-08-01 6:07 ` Mauro Carvalho Chehab
2025-08-01 0:13 ` [PATCH 11/12] docs: kdoc: extract output formatting from dump_struct() Jonathan Corbet
` (2 subsequent siblings)
12 siblings, 1 reply; 39+ messages in thread
From: Jonathan Corbet @ 2025-08-01 0:13 UTC (permalink / raw)
To: linux-doc
Cc: linux-kernel, Mauro Carvalho Chehab, Akira Yokosawa,
Jonathan Corbet
Get rid of some single-use variables and redundant checks, and generally
tighten up the code; no logical change.
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
---
scripts/lib/kdoc/kdoc_parser.py | 89 ++++++++++++++++-----------------
1 file changed, 42 insertions(+), 47 deletions(-)
diff --git a/scripts/lib/kdoc/kdoc_parser.py b/scripts/lib/kdoc/kdoc_parser.py
index 20e0a2abe13b..2b7d7e646367 100644
--- a/scripts/lib/kdoc/kdoc_parser.py
+++ b/scripts/lib/kdoc/kdoc_parser.py
@@ -673,73 +673,68 @@ class KernelDoc:
while tuples:
for t in tuples:
newmember = ""
- maintype = t[0]
- s_ids = t[5]
- content = t[3]
-
- oldmember = "".join(t)
-
- for s_id in s_ids.split(','):
+ oldmember = "".join(t) # Reconstruct the original formatting
+ #
+ # Pass through each field name, normalizing the form and formatting.
+ #
+ for s_id in t[5].split(','):
s_id = s_id.strip()
- newmember += f"{maintype} {s_id}; "
+ newmember += f"{t[0]} {s_id}; "
+ #
+ # Remove bitfield/array/pointer info, getting the bare name.
+ #
s_id = KernRe(r'[:[].*').sub('', s_id)
s_id = KernRe(r'^\s*\**(\S+)\s*').sub(r'\1', s_id)
-
- for arg in content.split(';'):
+ #
+ # Pass through the members of this inner structure/union.
+ #
+ for arg in t[3].split(';'):
arg = arg.strip()
-
- if not arg:
- continue
-
+ #
+ # Look for (type)(*name)(args) - pointer to function
+ #
r = KernRe(r'^([^(]+\(\*?\s*)([\w.]*)(\s*\).*)')
if r.match(arg):
# Pointer-to-function
- dtype = r.group(1)
- name = r.group(2)
- extra = r.group(3)
-
- if not name:
- continue
-
if not s_id:
# Anonymous struct/union
- newmember += f"{dtype}{name}{extra}; "
+ newmember += f"{r.group(1)}{r.group(2)}{r.group(3)}; "
else:
- newmember += f"{dtype}{s_id}.{name}{extra}; "
-
+ newmember += f"{r.group(1)}{s_id}.{r.group(2)}{r.group(3)}; "
+ #
+ # Otherwise a non-function member.
+ #
else:
- # Handle bitmaps
+ #
+ # Remove bitmap and array portions and spaces around commas
+ #
arg = KernRe(r':\s*\d+\s*').sub('', arg)
-
- # Handle arrays
arg = KernRe(r'\[.*\]').sub('', arg)
-
- # Handle multiple IDs
arg = KernRe(r'\s*,\s*').sub(',', arg)
-
+ #
+ # Look for a normal decl - "type name[,name...]"
+ #
r = KernRe(r'(.*)\s+([\S+,]+)')
-
if r.search(arg):
- dtype = r.group(1)
- names = r.group(2)
+ for name in r.group(2).split(','):
+ name = KernRe(r'^\s*\**(\S+)\s*').sub(r'\1', name)
+ if not s_id:
+ # Anonymous struct/union
+ newmember += f"{r.group(1)} {name}; "
+ else:
+ newmember += f"{r.group(1)} {s_id}.{name}; "
else:
newmember += f"{arg}; "
- continue
-
- for name in names.split(','):
- name = KernRe(r'^\s*\**(\S+)\s*').sub(r'\1', name).strip()
-
- if not name:
- continue
-
- if not s_id:
- # Anonymous struct/union
- newmember += f"{dtype} {name}; "
- else:
- newmember += f"{dtype} {s_id}.{name}; "
-
+ #
+ # At the end of the s_id loop, replace the original declaration with
+ # the munged version.
+ #
members = members.replace(oldmember, newmember)
+ #
+ # End of the tuple loop - search again and see if there are outer members
+ # that now turn up.
+ #
tuples = struct_members.findall(members)
return members
--
2.50.1
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCH 11/12] docs: kdoc: extract output formatting from dump_struct()
2025-08-01 0:13 [PATCH 00/12] docs: kdoc: thrash up dump_struct() Jonathan Corbet
` (9 preceding siblings ...)
2025-08-01 0:13 ` [PATCH 10/12] docs: kdoc: further rewrite_struct_members() cleanup Jonathan Corbet
@ 2025-08-01 0:13 ` Jonathan Corbet
2025-08-01 6:09 ` Mauro Carvalho Chehab
2025-08-01 0:13 ` [PATCH 12/12] docs: kdoc: a few final dump_struct() touches Jonathan Corbet
2025-08-01 6:23 ` [PATCH 00/12] docs: kdoc: thrash up dump_struct() Mauro Carvalho Chehab
12 siblings, 1 reply; 39+ messages in thread
From: Jonathan Corbet @ 2025-08-01 0:13 UTC (permalink / raw)
To: linux-doc
Cc: linux-kernel, Mauro Carvalho Chehab, Akira Yokosawa,
Jonathan Corbet
The last thing done in dump_struct() is to format the structure for
printing. That, too, is a separate activity; split it out into its own
function.
dump_struct() now fits in a single, full-hight editor screen.
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
---
scripts/lib/kdoc/kdoc_parser.py | 72 +++++++++++++++++----------------
1 file changed, 37 insertions(+), 35 deletions(-)
diff --git a/scripts/lib/kdoc/kdoc_parser.py b/scripts/lib/kdoc/kdoc_parser.py
index 2b7d7e646367..131956d89f84 100644
--- a/scripts/lib/kdoc/kdoc_parser.py
+++ b/scripts/lib/kdoc/kdoc_parser.py
@@ -738,6 +738,42 @@ class KernelDoc:
tuples = struct_members.findall(members)
return members
+ #
+ # Format the struct declaration into a standard form for inclusion in the
+ # resulting docs.
+ #
+ def format_struct_decl(self, declaration):
+ #
+ # Insert newlines, get rid of extra spaces.
+ #
+ declaration = KernRe(r'([{;])').sub(r'\1\n', declaration)
+ declaration = KernRe(r'\}\s+;').sub('};', declaration)
+ #
+ # Format inline enums with each member on its own line.
+ #
+ r = KernRe(r'(enum\s+\{[^}]+),([^\n])')
+ while r.search(declaration):
+ declaration = r.sub(r'\1,\n\2', declaration)
+ #
+ # Now go through and supply the right number of tabs
+ # for each line.
+ #
+ def_args = declaration.split('\n')
+ level = 1
+ declaration = ""
+ for clause in def_args:
+ clause = KernRe(r'\s+').sub(' ', clause.strip(), count=1)
+ if clause:
+ if '}' in clause and level > 1:
+ level -= 1
+ if not clause.startswith('#'):
+ declaration += "\t" * level
+ declaration += "\t" + clause + "\n"
+ if "{" in clause and "}" not in clause:
+ level += 1
+ return declaration
+
+
def dump_struct(self, ln, proto):
"""
Store an entry for an struct or union
@@ -776,42 +812,8 @@ class KernelDoc:
self.create_parameter_list(ln, decl_type, members, ';',
declaration_name)
self.check_sections(ln, declaration_name, decl_type)
-
- # Adjust declaration for better display
- declaration = KernRe(r'([{;])').sub(r'\1\n', declaration)
- declaration = KernRe(r'\}\s+;').sub('};', declaration)
-
- # Better handle inlined enums
- while True:
- r = KernRe(r'(enum\s+\{[^}]+),([^\n])')
- if not r.search(declaration):
- break
-
- declaration = r.sub(r'\1,\n\2', declaration)
-
- def_args = declaration.split('\n')
- level = 1
- declaration = ""
- for clause in def_args:
-
- clause = clause.strip()
- clause = KernRe(r'\s+').sub(' ', clause, count=1)
-
- if not clause:
- continue
-
- if '}' in clause and level > 1:
- level -= 1
-
- if not KernRe(r'^\s*#').match(clause):
- declaration += "\t" * level
-
- declaration += "\t" + clause + "\n"
- if "{" in clause and "}" not in clause:
- level += 1
-
self.output_declaration(decl_type, declaration_name,
- definition=declaration,
+ definition=self.format_struct_decl(declaration),
purpose=self.entry.declaration_purpose)
def dump_enum(self, ln, proto):
--
2.50.1
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCH 12/12] docs: kdoc: a few final dump_struct() touches
2025-08-01 0:13 [PATCH 00/12] docs: kdoc: thrash up dump_struct() Jonathan Corbet
` (10 preceding siblings ...)
2025-08-01 0:13 ` [PATCH 11/12] docs: kdoc: extract output formatting from dump_struct() Jonathan Corbet
@ 2025-08-01 0:13 ` Jonathan Corbet
2025-08-01 6:10 ` Mauro Carvalho Chehab
2025-08-01 6:23 ` [PATCH 00/12] docs: kdoc: thrash up dump_struct() Mauro Carvalho Chehab
12 siblings, 1 reply; 39+ messages in thread
From: Jonathan Corbet @ 2025-08-01 0:13 UTC (permalink / raw)
To: linux-doc
Cc: linux-kernel, Mauro Carvalho Chehab, Akira Yokosawa,
Jonathan Corbet
Add a couple more comments so that each phase of the process is
now clearly marked.
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
---
scripts/lib/kdoc/kdoc_parser.py | 11 ++++++-----
1 file changed, 6 insertions(+), 5 deletions(-)
diff --git a/scripts/lib/kdoc/kdoc_parser.py b/scripts/lib/kdoc/kdoc_parser.py
index 131956d89f84..fa2041276f8c 100644
--- a/scripts/lib/kdoc/kdoc_parser.py
+++ b/scripts/lib/kdoc/kdoc_parser.py
@@ -801,14 +801,15 @@ class KernelDoc:
nested = NestedMatch()
for search, sub in struct_nested_prefixes:
members = nested.sub(search, sub, members)
-
- # Keeps the original declaration as-is
+ #
+ # Deal with embedded struct and union members, and drop enums entirely.
+ #
declaration = members
members = self.rewrite_struct_members(members)
-
- # Ignore other nested elements, like enums
members = re.sub(r'(\{[^\{\}]*\})', '', members)
-
+ #
+ # Output the result and we are done.
+ #
self.create_parameter_list(ln, decl_type, members, ';',
declaration_name)
self.check_sections(ln, declaration_name, decl_type)
--
2.50.1
^ permalink raw reply related [flat|nested] 39+ messages in thread
* Re: [PATCH 03/12] docs: kdoc: backslashectomy in kdoc_parser
2025-08-01 0:13 ` [PATCH 03/12] docs: kdoc: backslashectomy in kdoc_parser Jonathan Corbet
@ 2025-08-01 4:27 ` Mauro Carvalho Chehab
2025-08-01 14:21 ` Jonathan Corbet
0 siblings, 1 reply; 39+ messages in thread
From: Mauro Carvalho Chehab @ 2025-08-01 4:27 UTC (permalink / raw)
To: Jonathan Corbet; +Cc: linux-doc, linux-kernel, Akira Yokosawa
Em Thu, 31 Jul 2025 18:13:17 -0600
Jonathan Corbet <corbet@lwn.net> escreveu:
> A lot of the regular expressions in this file have extraneous backslashes
This one is a bit scary... It could actually cause issues somewhere.
Also, IMHO, some expressions look worse on my eyes ;-)
> that may have been needed in Perl, but aren't helpful here. Take them out
> to reduce slightly the visual noise.
No idea if Perl actually requires, but, at least for me, I do prefer to
see all special characters properly escaped with a backslash. This way,
it is a lot clearer that what it is expecting is a string, instead of
using something that may affect regex processing.
This is specially important for my eyes when expecting for dots,
parenthesis and brackets.
> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
> ---
> scripts/lib/kdoc/kdoc_parser.py | 40 ++++++++++++++++-----------------
> 1 file changed, 20 insertions(+), 20 deletions(-)
>
> diff --git a/scripts/lib/kdoc/kdoc_parser.py b/scripts/lib/kdoc/kdoc_parser.py
> index 9948ede739a5..e1efa65a3480 100644
> --- a/scripts/lib/kdoc/kdoc_parser.py
> +++ b/scripts/lib/kdoc/kdoc_parser.py
> @@ -46,7 +46,7 @@ doc_decl = doc_com + KernRe(r'(\w+)', cache=False)
> known_section_names = 'description|context|returns?|notes?|examples?'
> known_sections = KernRe(known_section_names, flags = re.I)
> doc_sect = doc_com + \
> - KernRe(r'\s*(\@[.\w]+|\@\.\.\.|' + known_section_names + r')\s*:([^:].*)?$',
> + KernRe(r'\s*(@[.\w]+|@\.\.\.|' + known_section_names + r')\s*:([^:].*)?$',
> flags=re.I, cache=False)
>
> doc_content = doc_com_body + KernRe(r'(.*)', cache=False)
> @@ -60,7 +60,7 @@ attribute = KernRe(r"__attribute__\s*\(\([a-z0-9,_\*\s\(\)]*\)\)",
> export_symbol = KernRe(r'^\s*EXPORT_SYMBOL(_GPL)?\s*\(\s*(\w+)\s*\)\s*', cache=False)
> export_symbol_ns = KernRe(r'^\s*EXPORT_SYMBOL_NS(_GPL)?\s*\(\s*(\w+)\s*,\s*"\S+"\)\s*', cache=False)
>
> -type_param = KernRe(r"\@(\w*((\.\w+)|(->\w+))*(\.\.\.)?)", cache=False)
> +type_param = KernRe(r"@(\w*((\.\w+)|(->\w+))*(\.\.\.)?)", cache=False)
>
> #
> # Tests for the beginning of a kerneldoc block in its various forms.
> @@ -331,7 +331,7 @@ class KernelDoc:
>
> self.entry.anon_struct_union = False
>
> - param = KernRe(r'[\[\)].*').sub('', param, count=1)
> + param = KernRe(r'[)[].*').sub('', param, count=1)
This one, for instance, IMHO looks a lot worse for my eyes to understand
that there is a "[" that it is not an operator, but instead a string.
The open close parenthesis also looks weird. My regex-trained eyes think
that this would be part of a capture group.
>
> if dtype == "" and param.endswith("..."):
> if KernRe(r'\w\.\.\.$').search(param):
> @@ -405,7 +405,7 @@ class KernelDoc:
>
> for arg in args.split(splitter):
> # Strip comments
> - arg = KernRe(r'\/\*.*\*\/').sub('', arg)
> + arg = KernRe(r'/\*.*\*/').sub('', arg)
A pattern like /..../ is a standard way to pass search group with Regex
on many languages and utils that accept regular expressions like the
sed command. Dropping the backslash here IMHO makes it confusing ;-)
>
> # Ignore argument attributes
> arg = KernRe(r'\sPOS0?\s').sub(' ', arg)
> @@ -428,14 +428,14 @@ class KernelDoc:
>
> arg = arg.replace('#', ',')
>
> - r = KernRe(r'[^\(]+\(\*?\s*([\w\[\]\.]*)\s*\)')
> + r = KernRe(r'[^(]+\(\*?\s*([\w[\].]*)\s*\)')
Here, [.] is also a lot more confusing for me than [\.]
Ok, both works the same way on all implementations I know, but, as a doc
means any character, I need to re-read this two or three times to understand
that here it is waiting for a dot character instead of any character.
----
Here, I became too tired of reading regular expressions... better
stop to avoid headaches ;-)
Seriously, IMHO this patch makes a lot worse to understand what brackets,
parenthesis and dots are strings, and which ones are part of the regex
syntax.
> if r.match(arg):
> param = r.group(1)
> else:
> self.emit_msg(ln, f"Invalid param: {arg}")
> param = arg
>
> - dtype = KernRe(r'([^\(]+\(\*?)\s*' + re.escape(param)).sub(r'\1', arg)
> + dtype = KernRe(r'([^(]+\(\*?)\s*' + re.escape(param)).sub(r'\1', arg)
> self.push_parameter(ln, decl_type, param, dtype,
> arg, declaration_name)
>
> @@ -443,14 +443,14 @@ class KernelDoc:
> # Array-of-pointers
>
> arg = arg.replace('#', ',')
> - r = KernRe(r'[^\(]+\(\s*\*\s*([\w\[\]\.]*?)\s*(\s*\[\s*[\w]+\s*\]\s*)*\)')
> + r = KernRe(r'[^(]+\(\s*\*\s*([\w[\].]*?)\s*(\s*\[\s*[\w]+\s*\]\s*)*\)')
> if r.match(arg):
> param = r.group(1)
> else:
> self.emit_msg(ln, f"Invalid param: {arg}")
> param = arg
>
> - dtype = KernRe(r'([^\(]+\(\*?)\s*' + re.escape(param)).sub(r'\1', arg)
> + dtype = KernRe(r'([^(]+\(\*?)\s*' + re.escape(param)).sub(r'\1', arg)
>
> self.push_parameter(ln, decl_type, param, dtype,
> arg, declaration_name)
> @@ -637,8 +637,8 @@ class KernelDoc:
> # it is better to also move those to the NestedMatch logic,
> # to ensure that parenthesis will be properly matched.
>
> - (KernRe(r'__ETHTOOL_DECLARE_LINK_MODE_MASK\s*\(([^\)]+)\)', re.S), r'DECLARE_BITMAP(\1, __ETHTOOL_LINK_MODE_MASK_NBITS)'),
> - (KernRe(r'DECLARE_PHY_INTERFACE_MASK\s*\(([^\)]+)\)', re.S), r'DECLARE_BITMAP(\1, PHY_INTERFACE_MODE_MAX)'),
> + (KernRe(r'__ETHTOOL_DECLARE_LINK_MODE_MASK\s*\(([^)]+)\)', re.S), r'DECLARE_BITMAP(\1, __ETHTOOL_LINK_MODE_MASK_NBITS)'),
> + (KernRe(r'DECLARE_PHY_INTERFACE_MASK\s*\(([^)]+)\)', re.S), r'DECLARE_BITMAP(\1, PHY_INTERFACE_MODE_MAX)'),
> (KernRe(r'DECLARE_BITMAP\s*\(' + args_pattern + r',\s*' + args_pattern + r'\)', re.S), r'unsigned long \1[BITS_TO_LONGS(\2)]'),
> (KernRe(r'DECLARE_HASHTABLE\s*\(' + args_pattern + r',\s*' + args_pattern + r'\)', re.S), r'unsigned long \1[1 << ((\2) - 1)]'),
> (KernRe(r'DECLARE_KFIFO\s*\(' + args_pattern + r',\s*' + args_pattern + r',\s*' + args_pattern + r'\)', re.S), r'\2 *\1'),
> @@ -700,7 +700,7 @@ class KernelDoc:
> s_id = s_id.strip()
>
> newmember += f"{maintype} {s_id}; "
> - s_id = KernRe(r'[:\[].*').sub('', s_id)
> + s_id = KernRe(r'[:[].*').sub('', s_id)
> s_id = KernRe(r'^\s*\**(\S+)\s*').sub(r'\1', s_id)
>
> for arg in content.split(';'):
> @@ -709,7 +709,7 @@ class KernelDoc:
> if not arg:
> continue
>
> - r = KernRe(r'^([^\(]+\(\*?\s*)([\w\.]*)(\s*\).*)')
> + r = KernRe(r'^([^(]+\(\*?\s*)([\w.]*)(\s*\).*)')
> if r.match(arg):
> # Pointer-to-function
> dtype = r.group(1)
> @@ -767,12 +767,12 @@ class KernelDoc:
> self.check_sections(ln, declaration_name, decl_type)
>
> # Adjust declaration for better display
> - declaration = KernRe(r'([\{;])').sub(r'\1\n', declaration)
> + declaration = KernRe(r'([{;])').sub(r'\1\n', declaration)
> declaration = KernRe(r'\}\s+;').sub('};', declaration)
>
> # Better handle inlined enums
> while True:
> - r = KernRe(r'(enum\s+\{[^\}]+),([^\n])')
> + r = KernRe(r'(enum\s+\{[^}]+),([^\n])')
> if not r.search(declaration):
> break
>
> @@ -969,8 +969,8 @@ class KernelDoc:
> # - pci_match_device, __copy_to_user (long return type)
>
> name = r'[a-zA-Z0-9_~:]+'
> - prototype_end1 = r'[^\(]*'
> - prototype_end2 = r'[^\{]*'
> + prototype_end1 = r'[^(]*'
> + prototype_end2 = r'[^{]*'
> prototype_end = fr'\(({prototype_end1}|{prototype_end2})\)'
>
> # Besides compiling, Perl qr{[\w\s]+} works as a non-capturing group.
> @@ -1044,7 +1044,7 @@ class KernelDoc:
> Stores a typedef inside self.entries array.
> """
>
> - typedef_type = r'((?:\s+[\w\*]+\b){0,7}\s+(?:\w+\b|\*+))\s*'
> + typedef_type = r'((?:\s+[\w*]+\b){0,7}\s+(?:\w+\b|\*+))\s*'
> typedef_ident = r'\*?\s*(\w\S+)\s*'
> typedef_args = r'\s*\((.*)\);'
>
> @@ -1265,7 +1265,7 @@ class KernelDoc:
> self.dump_section()
>
> # Look for doc_com + <text> + doc_end:
> - r = KernRe(r'\s*\*\s*[a-zA-Z_0-9:\.]+\*/')
> + r = KernRe(r'\s*\*\s*[a-zA-Z_0-9:.]+\*/')
> if r.match(line):
> self.emit_msg(ln, f"suspicious ending line: {line}")
>
> @@ -1476,14 +1476,14 @@ class KernelDoc:
> """Ancillary routine to process a function prototype"""
>
> # strip C99-style comments to end of line
> - line = KernRe(r"\/\/.*$", re.S).sub('', line)
> + line = KernRe(r"//.*$", re.S).sub('', line)
> #
> # Soak up the line's worth of prototype text, stopping at { or ; if present.
> #
> if KernRe(r'\s*#\s*define').match(line):
> self.entry.prototype = line
> elif not line.startswith('#'): # skip other preprocessor stuff
> - r = KernRe(r'([^\{]*)')
> + r = KernRe(r'([^{]*)')
> if r.match(line):
> self.entry.prototype += r.group(1) + " "
> #
Thanks,
Mauro
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH 04/12] docs: kdoc: move the prefix transforms out of dump_struct()
2025-08-01 0:13 ` [PATCH 04/12] docs: kdoc: move the prefix transforms out of dump_struct() Jonathan Corbet
@ 2025-08-01 5:28 ` Mauro Carvalho Chehab
2025-08-01 5:35 ` Mauro Carvalho Chehab
0 siblings, 1 reply; 39+ messages in thread
From: Mauro Carvalho Chehab @ 2025-08-01 5:28 UTC (permalink / raw)
To: Jonathan Corbet; +Cc: linux-doc, linux-kernel, Akira Yokosawa
Em Thu, 31 Jul 2025 18:13:18 -0600
Jonathan Corbet <corbet@lwn.net> escreveu:
> dump_struct is one of the longest functions in the kdoc_parser class,
> making it hard to read and reason about. Move the definition of the prefix
> transformations out of the function, join them with the definition of
> "attribute" (which was defined at the top of the file but only used here),
> and reformat the code slightly for shorter line widths.
>
> Just code movement in the end.
This patch itself LGTM:
Reviewed-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
but see my notes below:
> +struct_prefixes = [
> + # Strip attributes
> + (struct_attribute, ' '),
> + (KernRe(r'\s*__aligned\s*\([^;]*\)', re.S), ' '),
> + (KernRe(r'\s*__counted_by\s*\([^;]*\)', re.S), ' '),
> + (KernRe(r'\s*__counted_by_(le|be)\s*\([^;]*\)', re.S), ' '),
> + (KernRe(r'\s*__packed\s*', re.S), ' '),
> + (KernRe(r'\s*CRYPTO_MINALIGN_ATTR', re.S), ' '),
> + (KernRe(r'\s*____cacheline_aligned_in_smp', re.S), ' '),
> + (KernRe(r'\s*____cacheline_aligned', re.S), ' '),
> + #
> + # Unwrap struct_group macros based on this definition:
> + # __struct_group(TAG, NAME, ATTRS, MEMBERS...)
> + # which has variants like: struct_group(NAME, MEMBERS...)
> + # Only MEMBERS arguments require documentation.
> + #
> + # Parsing them happens on two steps:
> + #
> + # 1. drop struct group arguments that aren't at MEMBERS,
> + # storing them as STRUCT_GROUP(MEMBERS)
> + #
> + # 2. remove STRUCT_GROUP() ancillary macro.
> + #
> + # The original logic used to remove STRUCT_GROUP() using an
> + # advanced regex:
> + #
> + # \bSTRUCT_GROUP(\(((?:(?>[^)(]+)|(?1))*)\))[^;]*;
> + #
> + # with two patterns that are incompatible with
> + # Python re module, as it has:
> + #
> + # - a recursive pattern: (?1)
> + # - an atomic grouping: (?>...)
> + #
> + # I tried a simpler version: but it didn't work either:
> + # \bSTRUCT_GROUP\(([^\)]+)\)[^;]*;
> + #
> + # As it doesn't properly match the end parenthesis on some cases.
> + #
> + # So, a better solution was crafted: there's now a NestedMatch
> + # class that ensures that delimiters after a search are properly
> + # matched. So, the implementation to drop STRUCT_GROUP() will be
> + # handled in separate.
> + #
> + (KernRe(r'\bstruct_group\s*\(([^,]*,)', re.S), r'STRUCT_GROUP('),
> + (KernRe(r'\bstruct_group_attr\s*\(([^,]*,){2}', re.S), r'STRUCT_GROUP('),
> + (KernRe(r'\bstruct_group_tagged\s*\(([^,]*),([^,]*),', re.S), r'struct \1 \2; STRUCT_GROUP('),
> + (KernRe(r'\b__struct_group\s*\(([^,]*,){3}', re.S), r'STRUCT_GROUP('),
> + #
> + # Replace macros
> + #
> + # TODO: use NestedMatch for FOO($1, $2, ...) matches
This comment is actually related to patch 03/12: regex cleanups:
If you want to simplify a lot the regular expressions here, the best
is to take a look at the NestedMatch class and improve it. There are lots
of regular expressions here that are very complex because they try
to ensure that something like these:
1. function(<arg1>)
2. function(<arg1>, <arg2>,<arg3>,...)
are properly parsed[1], but if we turn it into something that handle (2) as
well, we could use it like:
match = NestedMatch.search("function", string)
# or, alternatively:
# match = NestedMatch.search("function($1, $2, $3)", string)
if match:
arg1 = match.group(1)
arg2 = match.group(2)
arg3 = match.group(3)
or even do more complex changes like:
NestedMatch.sub("foo($1, $2)", "new_name($2)", string)
A class implementing that will help to transform all sorts of functions
and simplify the more complex regexes on kernel-doc. Doing that will
very likely simplify a lot the struct_prefixes, replacing it by something
a lot more easier to understand:
# Nice and simpler set of replacement rules
struct_nested_matches = [
("__aligned", ""),
("__counted_by", ""),
("__counted_by_(be|le)", ""),
...
# Picked those from stddef.h macro replacement rules
("struct_group(NAME, MEMBERS...)", "__struct_group(, NAME, , MEMBERS)"),
("struct_group(TAG, NAME, ATTRS, MEMBERS...)",
""" __struct_group(TAG, NAME, ATTRS, MEMBERS...)
union {
struct { MEMBERS } ATTRS;
struct __struct_group_tag(TAG) { MEMBERS } ATTRS NAME;
} ATTRS"""),
...
]
members = trim_private_members(members)
for from, to in struct_nested_matches:
members = NestedMatch.sub(from, to, members)
Granted, wiring this up takes some time and lots of testing - we should
likely have some unit tests to catch issues there - but IMO it is
worth the effort.
-
[1] NestedMatch() is currently limited to match function(<args>), as it was
written to replace really complex regular expressions with
recursive patterns and atomic grouping, that were used only to
capture macro calls for:
STRUCT_GROUP(...)
I might have used instead "import regex", but I didn't want to add the
extra dependency of a non-standard Python library at the Kernel build.
Thanks,
Mauro
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH 01/12] docs: kdoc: consolidate the stripping of private struct/union members
2025-08-01 0:13 ` [PATCH 01/12] docs: kdoc: consolidate the stripping of private struct/union members Jonathan Corbet
@ 2025-08-01 5:29 ` Mauro Carvalho Chehab
0 siblings, 0 replies; 39+ messages in thread
From: Mauro Carvalho Chehab @ 2025-08-01 5:29 UTC (permalink / raw)
To: Jonathan Corbet; +Cc: linux-doc, linux-kernel, Akira Yokosawa
Em Thu, 31 Jul 2025 18:13:15 -0600
Jonathan Corbet <corbet@lwn.net> escreveu:
> There were two locations duplicating the logic of stripping private members
> and associated comments; coalesce them into one, and add some comments
> describing what's going on.
>
> Output change: we now no longer add extraneous white space around macro
> definitions.
>
> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Reviewed-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> ---
> scripts/lib/kdoc/kdoc_parser.py | 40 ++++++++++++++++++---------------
> 1 file changed, 22 insertions(+), 18 deletions(-)
>
> diff --git a/scripts/lib/kdoc/kdoc_parser.py b/scripts/lib/kdoc/kdoc_parser.py
> index c3fe4bd5eab4..93fcd8807aa8 100644
> --- a/scripts/lib/kdoc/kdoc_parser.py
> +++ b/scripts/lib/kdoc/kdoc_parser.py
> @@ -81,6 +81,21 @@ multi_space = KernRe(r'\s\s+')
> def trim_whitespace(s):
> return multi_space.sub(' ', s.strip())
>
> +#
> +# Remove struct/enum members that have been marked "private".
> +#
> +def trim_private_members(text):
> + #
> + # First look for a "public:" block that ends a private region, then
> + # handle the "private until the end" case.
> + #
> + text = KernRe(r'/\*\s*private:.*?/\*\s*public:.*?\*/', flags=re.S).sub('', text)
> + text = KernRe(r'/\*\s*private:.*', flags=re.S).sub('', text)
> + #
> + # We needed the comments to do the above, but now we can take them out.
> + #
> + return KernRe(r'\s*/\*.*?\*/\s*', flags=re.S).sub('', text).strip()
> +
> class state:
> """
> State machine enums
> @@ -568,12 +583,6 @@ class KernelDoc:
> args_pattern = r'([^,)]+)'
>
> sub_prefixes = [
> - (KernRe(r'\/\*\s*private:.*?\/\*\s*public:.*?\*\/', re.S | re.I), ''),
> - (KernRe(r'\/\*\s*private:.*', re.S | re.I), ''),
> -
> - # Strip comments
> - (KernRe(r'\/\*.*?\*\/', re.S), ''),
> -
> # Strip attributes
> (attribute, ' '),
> (KernRe(r'\s*__aligned\s*\([^;]*\)', re.S), ' '),
> @@ -648,6 +657,7 @@ class KernelDoc:
> (re.compile(r'\bSTRUCT_GROUP\('), r'\1'),
> ]
>
> + members = trim_private_members(members)
> for search, sub in sub_prefixes:
> members = search.sub(sub, members)
>
> @@ -797,24 +807,18 @@ class KernelDoc:
> """
> Stores an enum inside self.entries array.
> """
> -
> - # Ignore members marked private
> - proto = KernRe(r'\/\*\s*private:.*?\/\*\s*public:.*?\*\/', flags=re.S).sub('', proto)
> - proto = KernRe(r'\/\*\s*private:.*}', flags=re.S).sub('}', proto)
> -
> - # Strip comments
> - proto = KernRe(r'\/\*.*?\*\/', flags=re.S).sub('', proto)
> -
> - # Strip #define macros inside enums
> + #
> + # Strip preprocessor directives. Note that this depends on the
> + # trailing semicolon we added in process_proto_type().
> + #
> proto = KernRe(r'#\s*((define|ifdef|if)\s+|endif)[^;]*;', flags=re.S).sub('', proto)
> -
> #
> # Parse out the name and members of the enum. Typedef form first.
> #
> r = KernRe(r'typedef\s+enum\s*\{(.*)\}\s*(\w*)\s*;')
> if r.search(proto):
> declaration_name = r.group(2)
> - members = r.group(1).rstrip()
> + members = trim_private_members(r.group(1))
> #
> # Failing that, look for a straight enum
> #
> @@ -822,7 +826,7 @@ class KernelDoc:
> r = KernRe(r'enum\s+(\w*)\s*\{(.*)\}')
> if r.match(proto):
> declaration_name = r.group(1)
> - members = r.group(2).rstrip()
> + members = trim_private_members(r.group(2))
> #
> # OK, this isn't going to work.
> #
Thanks,
Mauro
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH 02/12] docs: kdoc: Move a regex line in dump_struct()
2025-08-01 0:13 ` [PATCH 02/12] docs: kdoc: Move a regex line in dump_struct() Jonathan Corbet
@ 2025-08-01 5:29 ` Mauro Carvalho Chehab
0 siblings, 0 replies; 39+ messages in thread
From: Mauro Carvalho Chehab @ 2025-08-01 5:29 UTC (permalink / raw)
To: Jonathan Corbet; +Cc: linux-doc, linux-kernel, Akira Yokosawa
Em Thu, 31 Jul 2025 18:13:16 -0600
Jonathan Corbet <corbet@lwn.net> escreveu:
> The complex struct_members regex was defined far from its use; bring the
> two together. Remove some extraneous backslashes while making the move.
>
> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Reviewed-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> ---
> scripts/lib/kdoc/kdoc_parser.py | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/scripts/lib/kdoc/kdoc_parser.py b/scripts/lib/kdoc/kdoc_parser.py
> index 93fcd8807aa8..9948ede739a5 100644
> --- a/scripts/lib/kdoc/kdoc_parser.py
> +++ b/scripts/lib/kdoc/kdoc_parser.py
> @@ -551,7 +551,6 @@ class KernelDoc:
> ]
>
> definition_body = r'\{(.*)\}\s*' + "(?:" + '|'.join(qualifiers) + ")?"
> - struct_members = KernRe(type_pattern + r'([^\{\};]+)(\{)([^\{\}]*)(\})([^\{\}\;]*)(\;)')
>
> # Extract struct/union definition
> members = None
> @@ -683,6 +682,7 @@ class KernelDoc:
> # So, we need to have an extra loop on Python to override such
> # re limitation.
>
> + struct_members = KernRe(type_pattern + r'([^{};]+)(\{)([^{}]*)(\})([^{};]*)(;)')
> while True:
> tuples = struct_members.findall(members)
> if not tuples:
Thanks,
Mauro
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH 05/12] docs: kdoc: split top-level prototype parsing out of dump_struct()
2025-08-01 0:13 ` [PATCH 05/12] docs: kdoc: split top-level prototype parsing " Jonathan Corbet
@ 2025-08-01 5:34 ` Mauro Carvalho Chehab
2025-08-01 14:10 ` Jonathan Corbet
0 siblings, 1 reply; 39+ messages in thread
From: Mauro Carvalho Chehab @ 2025-08-01 5:34 UTC (permalink / raw)
To: Jonathan Corbet; +Cc: linux-doc, linux-kernel, Akira Yokosawa
Em Thu, 31 Jul 2025 18:13:19 -0600
Jonathan Corbet <corbet@lwn.net> escreveu:
> Move the initial split of the prototype into its own function in the
> ongoing effort to cut dump_struct() down to size.
>
> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
> ---
> scripts/lib/kdoc/kdoc_parser.py | 44 +++++++++++++++------------------
> 1 file changed, 20 insertions(+), 24 deletions(-)
>
> diff --git a/scripts/lib/kdoc/kdoc_parser.py b/scripts/lib/kdoc/kdoc_parser.py
> index 5e375318df9c..2bb0da22048f 100644
> --- a/scripts/lib/kdoc/kdoc_parser.py
> +++ b/scripts/lib/kdoc/kdoc_parser.py
> @@ -624,13 +624,11 @@ class KernelDoc:
> self.emit_msg(ln,
> f"No description found for return value of '{declaration_name}'")
>
> - def dump_struct(self, ln, proto):
> - """
> - Store an entry for an struct or union
> - """
> -
> + #
> + # Split apart a structure prototype; returns (struct|union, name, members) or None
> + #
> + def split_struct_proto(self, proto):
> type_pattern = r'(struct|union)'
> -
> qualifiers = [
> "__attribute__",
> "__packed",
> @@ -638,36 +636,34 @@ class KernelDoc:
> "____cacheline_aligned_in_smp",
> "____cacheline_aligned",
> ]
> -
> definition_body = r'\{(.*)\}\s*' + "(?:" + '|'.join(qualifiers) + ")?"
>
> - # Extract struct/union definition
> - members = None
> - declaration_name = None
> - decl_type = None
> -
> r = KernRe(type_pattern + r'\s+(\w+)\s*' + definition_body)
> if r.search(proto):
> - decl_type = r.group(1)
> - declaration_name = r.group(2)
> - members = r.group(3)
> + return (r.group(1), r.group(2), r.group(3))
> else:
> r = KernRe(r'typedef\s+' + type_pattern + r'\s*' + definition_body + r'\s*(\w+)\s*;')
> -
> if r.search(proto):
> - decl_type = r.group(1)
> - declaration_name = r.group(3)
> - members = r.group(2)
> + return (r.group(1), r.group(3), r.group(2))
> + return None
>
> - if not members:
> + def dump_struct(self, ln, proto):
> + """
> + Store an entry for an struct or union
> + """
> + #
> + # Do the basic parse to get the pieces of the declaration.
> + #
> + struct_parts = self.split_struct_proto(proto)
> + if not struct_parts:
> self.emit_msg(ln, f"{proto} error: Cannot parse struct or union!")
> return
> + decl_type, declaration_name, members = struct_parts
>
> if self.entry.identifier != declaration_name:
> - self.emit_msg(ln,
> - f"expecting prototype for {decl_type} {self.entry.identifier}. Prototype was for {decl_type} {declaration_name} instead\n")
> + self.emit_msg(ln, f"expecting prototype for {decl_type} {self.entry.identifier}. "
> + f"Prototype was for {decl_type} {declaration_name} instead\n")
> return
> -
> #
> # Go through the list of members applying all of our transformations.
> #
> @@ -696,7 +692,7 @@ class KernelDoc:
> # So, we need to have an extra loop on Python to override such
> # re limitation.
>
> - struct_members = KernRe(type_pattern + r'([^{};]+)(\{)([^{}]*)(\})([^{};]*)(;)')
> + struct_members = KernRe(r'(struct|union)([^{};]+)(\{)([^{}]*)(\})([^{};]*)(;)')
I would prefer keeping type_pattern here.
With that:
Reviewed-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> while True:
> tuples = struct_members.findall(members)
> if not tuples:
Thanks,
Mauro
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH 04/12] docs: kdoc: move the prefix transforms out of dump_struct()
2025-08-01 5:28 ` Mauro Carvalho Chehab
@ 2025-08-01 5:35 ` Mauro Carvalho Chehab
0 siblings, 0 replies; 39+ messages in thread
From: Mauro Carvalho Chehab @ 2025-08-01 5:35 UTC (permalink / raw)
To: Jonathan Corbet; +Cc: linux-doc, linux-kernel, Akira Yokosawa
Em Fri, 1 Aug 2025 07:28:41 +0200
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> escreveu:
> Em Thu, 31 Jul 2025 18:13:18 -0600
> Jonathan Corbet <corbet@lwn.net> escreveu:
>
> > dump_struct is one of the longest functions in the kdoc_parser class,
> > making it hard to read and reason about. Move the definition of the prefix
> > transformations out of the function, join them with the definition of
> > "attribute" (which was defined at the top of the file but only used here),
> > and reformat the code slightly for shorter line widths.
> >
> > Just code movement in the end.
>
> This patch itself LGTM:
>
> Reviewed-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
In time, my R-B from patch 4 and above assumes that patch 3 is
dropped, as I'm not re-checking the regular expressions.
>
> but see my notes below:
>
> > +struct_prefixes = [
> > + # Strip attributes
> > + (struct_attribute, ' '),
> > + (KernRe(r'\s*__aligned\s*\([^;]*\)', re.S), ' '),
> > + (KernRe(r'\s*__counted_by\s*\([^;]*\)', re.S), ' '),
> > + (KernRe(r'\s*__counted_by_(le|be)\s*\([^;]*\)', re.S), ' '),
> > + (KernRe(r'\s*__packed\s*', re.S), ' '),
> > + (KernRe(r'\s*CRYPTO_MINALIGN_ATTR', re.S), ' '),
> > + (KernRe(r'\s*____cacheline_aligned_in_smp', re.S), ' '),
> > + (KernRe(r'\s*____cacheline_aligned', re.S), ' '),
> > + #
> > + # Unwrap struct_group macros based on this definition:
> > + # __struct_group(TAG, NAME, ATTRS, MEMBERS...)
> > + # which has variants like: struct_group(NAME, MEMBERS...)
> > + # Only MEMBERS arguments require documentation.
> > + #
> > + # Parsing them happens on two steps:
> > + #
> > + # 1. drop struct group arguments that aren't at MEMBERS,
> > + # storing them as STRUCT_GROUP(MEMBERS)
> > + #
> > + # 2. remove STRUCT_GROUP() ancillary macro.
> > + #
> > + # The original logic used to remove STRUCT_GROUP() using an
> > + # advanced regex:
> > + #
> > + # \bSTRUCT_GROUP(\(((?:(?>[^)(]+)|(?1))*)\))[^;]*;
> > + #
> > + # with two patterns that are incompatible with
> > + # Python re module, as it has:
> > + #
> > + # - a recursive pattern: (?1)
> > + # - an atomic grouping: (?>...)
> > + #
> > + # I tried a simpler version: but it didn't work either:
> > + # \bSTRUCT_GROUP\(([^\)]+)\)[^;]*;
> > + #
> > + # As it doesn't properly match the end parenthesis on some cases.
> > + #
> > + # So, a better solution was crafted: there's now a NestedMatch
> > + # class that ensures that delimiters after a search are properly
> > + # matched. So, the implementation to drop STRUCT_GROUP() will be
> > + # handled in separate.
> > + #
> > + (KernRe(r'\bstruct_group\s*\(([^,]*,)', re.S), r'STRUCT_GROUP('),
> > + (KernRe(r'\bstruct_group_attr\s*\(([^,]*,){2}', re.S), r'STRUCT_GROUP('),
> > + (KernRe(r'\bstruct_group_tagged\s*\(([^,]*),([^,]*),', re.S), r'struct \1 \2; STRUCT_GROUP('),
> > + (KernRe(r'\b__struct_group\s*\(([^,]*,){3}', re.S), r'STRUCT_GROUP('),
> > + #
> > + # Replace macros
> > + #
> > + # TODO: use NestedMatch for FOO($1, $2, ...) matches
>
> This comment is actually related to patch 03/12: regex cleanups:
>
> If you want to simplify a lot the regular expressions here, the best
> is to take a look at the NestedMatch class and improve it. There are lots
> of regular expressions here that are very complex because they try
> to ensure that something like these:
>
> 1. function(<arg1>)
> 2. function(<arg1>, <arg2>,<arg3>,...)
>
> are properly parsed[1], but if we turn it into something that handle (2) as
> well, we could use it like:
>
> match = NestedMatch.search("function", string)
> # or, alternatively:
> # match = NestedMatch.search("function($1, $2, $3)", string)
>
> if match:
> arg1 = match.group(1)
> arg2 = match.group(2)
> arg3 = match.group(3)
>
> or even do more complex changes like:
>
> NestedMatch.sub("foo($1, $2)", "new_name($2)", string)
>
> A class implementing that will help to transform all sorts of functions
> and simplify the more complex regexes on kernel-doc. Doing that will
> very likely simplify a lot the struct_prefixes, replacing it by something
> a lot more easier to understand:
>
> # Nice and simpler set of replacement rules
> struct_nested_matches = [
> ("__aligned", ""),
> ("__counted_by", ""),
> ("__counted_by_(be|le)", ""),
> ...
> # Picked those from stddef.h macro replacement rules
> ("struct_group(NAME, MEMBERS...)", "__struct_group(, NAME, , MEMBERS)"),
> ("struct_group(TAG, NAME, ATTRS, MEMBERS...)",
> """ __struct_group(TAG, NAME, ATTRS, MEMBERS...)
> union {
> struct { MEMBERS } ATTRS;
> struct __struct_group_tag(TAG) { MEMBERS } ATTRS NAME;
> } ATTRS"""),
> ...
> ]
>
> members = trim_private_members(members)
> for from, to in struct_nested_matches:
> members = NestedMatch.sub(from, to, members)
>
> Granted, wiring this up takes some time and lots of testing - we should
> likely have some unit tests to catch issues there - but IMO it is
> worth the effort.
>
> -
>
> [1] NestedMatch() is currently limited to match function(<args>), as it was
> written to replace really complex regular expressions with
> recursive patterns and atomic grouping, that were used only to
> capture macro calls for:
>
> STRUCT_GROUP(...)
>
> I might have used instead "import regex", but I didn't want to add the
> extra dependency of a non-standard Python library at the Kernel build.
>
> Thanks,
> Mauro
Thanks,
Mauro
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH 06/12] docs: kdoc: split struct-member rewriting out of dump_struct()
2025-08-01 0:13 ` [PATCH 06/12] docs: kdoc: split struct-member rewriting " Jonathan Corbet
@ 2025-08-01 5:37 ` Mauro Carvalho Chehab
0 siblings, 0 replies; 39+ messages in thread
From: Mauro Carvalho Chehab @ 2025-08-01 5:37 UTC (permalink / raw)
To: Jonathan Corbet; +Cc: linux-doc, linux-kernel, Akira Yokosawa
Em Thu, 31 Jul 2025 18:13:20 -0600
Jonathan Corbet <corbet@lwn.net> escreveu:
> The massive loop that massages struct members shares no data with the rest
> of dump_struct(); split it out into its own function. Code movement only,
> no other changes.
>
> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
LGTM.
Reviewed-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> ---
> scripts/lib/kdoc/kdoc_parser.py | 65 +++++++++++++++++----------------
> 1 file changed, 34 insertions(+), 31 deletions(-)
>
> diff --git a/scripts/lib/kdoc/kdoc_parser.py b/scripts/lib/kdoc/kdoc_parser.py
> index 2bb0da22048f..5c4ad8febb9f 100644
> --- a/scripts/lib/kdoc/kdoc_parser.py
> +++ b/scripts/lib/kdoc/kdoc_parser.py
> @@ -647,37 +647,7 @@ class KernelDoc:
> return (r.group(1), r.group(3), r.group(2))
> return None
>
> - def dump_struct(self, ln, proto):
> - """
> - Store an entry for an struct or union
> - """
> - #
> - # Do the basic parse to get the pieces of the declaration.
> - #
> - struct_parts = self.split_struct_proto(proto)
> - if not struct_parts:
> - self.emit_msg(ln, f"{proto} error: Cannot parse struct or union!")
> - return
> - decl_type, declaration_name, members = struct_parts
> -
> - if self.entry.identifier != declaration_name:
> - self.emit_msg(ln, f"expecting prototype for {decl_type} {self.entry.identifier}. "
> - f"Prototype was for {decl_type} {declaration_name} instead\n")
> - return
> - #
> - # Go through the list of members applying all of our transformations.
> - #
> - members = trim_private_members(members)
> - for search, sub in struct_prefixes:
> - members = search.sub(sub, members)
> -
> - nested = NestedMatch()
> - for search, sub in struct_nested_prefixes:
> - members = nested.sub(search, sub, members)
> -
> - # Keeps the original declaration as-is
> - declaration = members
> -
> + def rewrite_struct_members(self, members):
> # Split nested struct/union elements
> #
> # This loop was simpler at the original kernel-doc perl version, as
> @@ -768,6 +738,39 @@ class KernelDoc:
> newmember += f"{dtype} {s_id}.{name}; "
>
> members = members.replace(oldmember, newmember)
> + return members
> +
> + def dump_struct(self, ln, proto):
> + """
> + Store an entry for an struct or union
> + """
> + #
> + # Do the basic parse to get the pieces of the declaration.
> + #
> + struct_parts = self.split_struct_proto(proto)
> + if not struct_parts:
> + self.emit_msg(ln, f"{proto} error: Cannot parse struct or union!")
> + return
> + decl_type, declaration_name, members = struct_parts
> +
> + if self.entry.identifier != declaration_name:
> + self.emit_msg(ln, f"expecting prototype for {decl_type} {self.entry.identifier}. "
> + f"Prototype was for {decl_type} {declaration_name} instead\n")
> + return
> + #
> + # Go through the list of members applying all of our transformations.
> + #
> + members = trim_private_members(members)
> + for search, sub in struct_prefixes:
> + members = search.sub(sub, members)
> +
> + nested = NestedMatch()
> + for search, sub in struct_nested_prefixes:
> + members = nested.sub(search, sub, members)
> +
> + # Keeps the original declaration as-is
> + declaration = members
> + members = self.rewrite_struct_members(members)
>
> # Ignore other nested elements, like enums
> members = re.sub(r'(\{[^\{\}]*\})', '', members)
Thanks,
Mauro
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH 07/12] docs: kdoc: rework the rewrite_struct_members() main loop
2025-08-01 0:13 ` [PATCH 07/12] docs: kdoc: rework the rewrite_struct_members() main loop Jonathan Corbet
@ 2025-08-01 5:42 ` Mauro Carvalho Chehab
0 siblings, 0 replies; 39+ messages in thread
From: Mauro Carvalho Chehab @ 2025-08-01 5:42 UTC (permalink / raw)
To: Jonathan Corbet; +Cc: linux-doc, linux-kernel, Akira Yokosawa
Em Thu, 31 Jul 2025 18:13:21 -0600
Jonathan Corbet <corbet@lwn.net> escreveu:
> Adopt a more Pythonic form for the main loop of this function, getting rid
> of the "while True:" construction and making the actual loop invariant
> explicit.
LGTM.
Reviewed-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
>
> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
> ---
> scripts/lib/kdoc/kdoc_parser.py | 8 +++-----
> 1 file changed, 3 insertions(+), 5 deletions(-)
>
> diff --git a/scripts/lib/kdoc/kdoc_parser.py b/scripts/lib/kdoc/kdoc_parser.py
> index 5c4ad8febb9f..efc5888fcc74 100644
> --- a/scripts/lib/kdoc/kdoc_parser.py
> +++ b/scripts/lib/kdoc/kdoc_parser.py
> @@ -663,11 +663,8 @@ class KernelDoc:
> # re limitation.
>
> struct_members = KernRe(r'(struct|union)([^{};]+)(\{)([^{}]*)(\})([^{};]*)(;)')
> - while True:
> - tuples = struct_members.findall(members)
> - if not tuples:
> - break
> -
> + tuples = struct_members.findall(members)
> + while tuples:
> for t in tuples:
> newmember = ""
> maintype = t[0]
> @@ -738,6 +735,7 @@ class KernelDoc:
> newmember += f"{dtype} {s_id}.{name}; "
>
> members = members.replace(oldmember, newmember)
> + tuples = struct_members.findall(members)
> return members
>
> def dump_struct(self, ln, proto):
Thanks,
Mauro
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH 08/12] docs: kdoc: remove an extraneous strip() call
2025-08-01 0:13 ` [PATCH 08/12] docs: kdoc: remove an extraneous strip() call Jonathan Corbet
@ 2025-08-01 5:45 ` Mauro Carvalho Chehab
0 siblings, 0 replies; 39+ messages in thread
From: Mauro Carvalho Chehab @ 2025-08-01 5:45 UTC (permalink / raw)
To: Jonathan Corbet; +Cc: linux-doc, linux-kernel, Akira Yokosawa
Em Thu, 31 Jul 2025 18:13:22 -0600
Jonathan Corbet <corbet@lwn.net> escreveu:
> ...the variable in question was already strip()ed at the top of the loop.
Good catch. This was probably a left-over from some code changes.
Reviewed-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
> ---
> scripts/lib/kdoc/kdoc_parser.py | 1 -
> 1 file changed, 1 deletion(-)
>
> diff --git a/scripts/lib/kdoc/kdoc_parser.py b/scripts/lib/kdoc/kdoc_parser.py
> index efc5888fcc74..b751fa8edde7 100644
> --- a/scripts/lib/kdoc/kdoc_parser.py
> +++ b/scripts/lib/kdoc/kdoc_parser.py
> @@ -703,7 +703,6 @@ class KernelDoc:
> newmember += f"{dtype}{s_id}.{name}{extra}; "
>
> else:
> - arg = arg.strip()
> # Handle bitmaps
> arg = KernRe(r':\s*\d+\s*').sub('', arg)
>
Thanks,
Mauro
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH 09/12] docs: kdoc: Some rewrite_struct_members() commenting
2025-08-01 0:13 ` [PATCH 09/12] docs: kdoc: Some rewrite_struct_members() commenting Jonathan Corbet
@ 2025-08-01 5:50 ` Mauro Carvalho Chehab
0 siblings, 0 replies; 39+ messages in thread
From: Mauro Carvalho Chehab @ 2025-08-01 5:50 UTC (permalink / raw)
To: Jonathan Corbet; +Cc: linux-doc, linux-kernel, Akira Yokosawa
Em Thu, 31 Jul 2025 18:13:23 -0600
Jonathan Corbet <corbet@lwn.net> escreveu:
> Add comments to rewrite_struct_members() describing what it is actually
> doing, and reformat/comment the main struct_members regex so that it is
> (more) comprehensible to humans.
>
> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
> ---
> scripts/lib/kdoc/kdoc_parser.py | 32 +++++++++++++++++++-------------
> 1 file changed, 19 insertions(+), 13 deletions(-)
>
> diff --git a/scripts/lib/kdoc/kdoc_parser.py b/scripts/lib/kdoc/kdoc_parser.py
> index b751fa8edde7..20e0a2abe13b 100644
> --- a/scripts/lib/kdoc/kdoc_parser.py
> +++ b/scripts/lib/kdoc/kdoc_parser.py
> @@ -647,22 +647,28 @@ class KernelDoc:
> return (r.group(1), r.group(3), r.group(2))
> return None
>
> + #
> + # Rewrite the members of a structure or union for easier formatting later on.
> + # Among other things, this function will turn a member like:
> + #
> + # struct { inner_members; } foo;
> + #
> + # into:
> + #
> + # struct foo; inner_members;
> + #
> def rewrite_struct_members(self, members):
> - # Split nested struct/union elements
> - #
> - # This loop was simpler at the original kernel-doc perl version, as
> - # while ($members =~ m/$struct_members/) { ... }
> - # reads 'members' string on each interaction.
> #
> - # Python behavior is different: it parses 'members' only once,
> - # creating a list of tuples from the first interaction.
> + # Process struct/union members from the most deeply nested outward. The
> + # trick is in the ^{ below - it prevents a match of an outer struct/union
> + # until the inner one has been munged (removing the "{" in the process).
> #
> - # On other words, this won't get nested structs.
> - #
> - # So, we need to have an extra loop on Python to override such
> - # re limitation.
> -
> - struct_members = KernRe(r'(struct|union)([^{};]+)(\{)([^{}]*)(\})([^{};]*)(;)')
> + struct_members = KernRe(r'(struct|union)' # 0: declaration type
> + r'([^{};]+)' # 1: possible name
> + r'(\{)'
> + r'([^{}]*)' # 3: Contents of declaration
> + r'(\})'
> + r'([^{};]*)(;)') # 5: Remaining stuff after declaration
I liked breaking it like these, but I do miss backslashes before some
'{' and '}' to make this actually more readable on my eyes.
Re-adding that, you can add:
Reviewed-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Thanks,
Mauro
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH 10/12] docs: kdoc: further rewrite_struct_members() cleanup
2025-08-01 0:13 ` [PATCH 10/12] docs: kdoc: further rewrite_struct_members() cleanup Jonathan Corbet
@ 2025-08-01 6:07 ` Mauro Carvalho Chehab
2025-08-01 22:52 ` Jonathan Corbet
0 siblings, 1 reply; 39+ messages in thread
From: Mauro Carvalho Chehab @ 2025-08-01 6:07 UTC (permalink / raw)
To: Jonathan Corbet; +Cc: linux-doc, linux-kernel, Akira Yokosawa
Em Thu, 31 Jul 2025 18:13:24 -0600
Jonathan Corbet <corbet@lwn.net> escreveu:
> Get rid of some single-use variables and redundant checks, and generally
> tighten up the code; no logical change.
>
> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
> ---
> scripts/lib/kdoc/kdoc_parser.py | 89 ++++++++++++++++-----------------
> 1 file changed, 42 insertions(+), 47 deletions(-)
>
> diff --git a/scripts/lib/kdoc/kdoc_parser.py b/scripts/lib/kdoc/kdoc_parser.py
> index 20e0a2abe13b..2b7d7e646367 100644
> --- a/scripts/lib/kdoc/kdoc_parser.py
> +++ b/scripts/lib/kdoc/kdoc_parser.py
> @@ -673,73 +673,68 @@ class KernelDoc:
> while tuples:
> for t in tuples:
> newmember = ""
> - maintype = t[0]
> - s_ids = t[5]
> - content = t[3]
The reason I opted for this particular approach...
> -
> - oldmember = "".join(t)
> -
> - for s_id in s_ids.split(','):
> + oldmember = "".join(t) # Reconstruct the original formatting
> + #
> + # Pass through each field name, normalizing the form and formatting.
> + #
> + for s_id in t[5].split(','):
... is that it is easier to understand and to maintain:
for s_id in s_ids.split(','):
than when magic numbers like this are used:
for s_id in t[5].split(','):
> s_id = s_id.strip()
>
> - newmember += f"{maintype} {s_id}; "
> + newmember += f"{t[0]} {s_id}; "
> + #
> + # Remove bitfield/array/pointer info, getting the bare name.
> + #
> s_id = KernRe(r'[:[].*').sub('', s_id)
> s_id = KernRe(r'^\s*\**(\S+)\s*').sub(r'\1', s_id)
> -
> - for arg in content.split(';'):
> + #
> + # Pass through the members of this inner structure/union.
> + #
> + for arg in t[3].split(';'):
Here, for example, we're far away from the tuple definition... I can't
recall anymore what "3" magic number means ;-)
> arg = arg.strip()
> -
> - if not arg:
> - continue
> -
> + #
> + # Look for (type)(*name)(args) - pointer to function
> + #
> r = KernRe(r'^([^(]+\(\*?\s*)([\w.]*)(\s*\).*)')
> if r.match(arg):
> # Pointer-to-function
> - dtype = r.group(1)
> - name = r.group(2)
> - extra = r.group(3)
Same applies here. Having a named var makes easier to understand/maintain
rest of the code. If you're willing to do something like that, better to
use named capture groups, like:
r = KernRe(r'^(?P<dtype>[^(]+\(\*?\s*)'
r'(?P<name>[\w.]*)'
r'(?P<extra>\s*\).*)')
together with a syntax using match.group(group_name)
I'm not a particular fan of named groups, as it adds a lot more stuff
at regexes. They're already hard enough to understand without ?P<name>,
but at least match.group('dtype'), match.group('name'), match.group('extra')
inside the next calls would be easier to maintain than when using magic
numbers.
Same comments apply to other changes below.
> -
> - if not name:
> - continue
> -
> if not s_id:
> # Anonymous struct/union
> - newmember += f"{dtype}{name}{extra}; "
> + newmember += f"{r.group(1)}{r.group(2)}{r.group(3)}; "
> else:
> - newmember += f"{dtype}{s_id}.{name}{extra}; "
> -
> + newmember += f"{r.group(1)}{s_id}.{r.group(2)}{r.group(3)}; "
> + #
> + # Otherwise a non-function member.
> + #
> else:
> - # Handle bitmaps
> + #
> + # Remove bitmap and array portions and spaces around commas
> + #
> arg = KernRe(r':\s*\d+\s*').sub('', arg)
> -
> - # Handle arrays
> arg = KernRe(r'\[.*\]').sub('', arg)
> -
> - # Handle multiple IDs
> arg = KernRe(r'\s*,\s*').sub(',', arg)
> -
> + #
> + # Look for a normal decl - "type name[,name...]"
> + #
> r = KernRe(r'(.*)\s+([\S+,]+)')
> -
> if r.search(arg):
> - dtype = r.group(1)
> - names = r.group(2)
> + for name in r.group(2).split(','):
> + name = KernRe(r'^\s*\**(\S+)\s*').sub(r'\1', name)
> + if not s_id:
> + # Anonymous struct/union
> + newmember += f"{r.group(1)} {name}; "
> + else:
> + newmember += f"{r.group(1)} {s_id}.{name}; "
> else:
> newmember += f"{arg}; "
> - continue
> -
> - for name in names.split(','):
> - name = KernRe(r'^\s*\**(\S+)\s*').sub(r'\1', name).strip()
> -
> - if not name:
> - continue
> -
> - if not s_id:
> - # Anonymous struct/union
> - newmember += f"{dtype} {name}; "
> - else:
> - newmember += f"{dtype} {s_id}.{name}; "
> -
> + #
> + # At the end of the s_id loop, replace the original declaration with
> + # the munged version.
> + #
> members = members.replace(oldmember, newmember)
> + #
> + # End of the tuple loop - search again and see if there are outer members
> + # that now turn up.
> + #
> tuples = struct_members.findall(members)
> return members
>
Thanks,
Mauro
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH 11/12] docs: kdoc: extract output formatting from dump_struct()
2025-08-01 0:13 ` [PATCH 11/12] docs: kdoc: extract output formatting from dump_struct() Jonathan Corbet
@ 2025-08-01 6:09 ` Mauro Carvalho Chehab
0 siblings, 0 replies; 39+ messages in thread
From: Mauro Carvalho Chehab @ 2025-08-01 6:09 UTC (permalink / raw)
To: Jonathan Corbet; +Cc: linux-doc, linux-kernel, Akira Yokosawa
Em Thu, 31 Jul 2025 18:13:25 -0600
Jonathan Corbet <corbet@lwn.net> escreveu:
> The last thing done in dump_struct() is to format the structure for
> printing. That, too, is a separate activity; split it out into its own
> function.
>
> dump_struct() now fits in a single, full-hight editor screen.
>
> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Splitting it on a separate function LGTM.
Reviewed-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> ---
> scripts/lib/kdoc/kdoc_parser.py | 72 +++++++++++++++++----------------
> 1 file changed, 37 insertions(+), 35 deletions(-)
>
> diff --git a/scripts/lib/kdoc/kdoc_parser.py b/scripts/lib/kdoc/kdoc_parser.py
> index 2b7d7e646367..131956d89f84 100644
> --- a/scripts/lib/kdoc/kdoc_parser.py
> +++ b/scripts/lib/kdoc/kdoc_parser.py
> @@ -738,6 +738,42 @@ class KernelDoc:
> tuples = struct_members.findall(members)
> return members
>
> + #
> + # Format the struct declaration into a standard form for inclusion in the
> + # resulting docs.
> + #
> + def format_struct_decl(self, declaration):
> + #
> + # Insert newlines, get rid of extra spaces.
> + #
> + declaration = KernRe(r'([{;])').sub(r'\1\n', declaration)
> + declaration = KernRe(r'\}\s+;').sub('};', declaration)
> + #
> + # Format inline enums with each member on its own line.
> + #
> + r = KernRe(r'(enum\s+\{[^}]+),([^\n])')
> + while r.search(declaration):
> + declaration = r.sub(r'\1,\n\2', declaration)
> + #
> + # Now go through and supply the right number of tabs
> + # for each line.
> + #
> + def_args = declaration.split('\n')
> + level = 1
> + declaration = ""
> + for clause in def_args:
> + clause = KernRe(r'\s+').sub(' ', clause.strip(), count=1)
> + if clause:
> + if '}' in clause and level > 1:
> + level -= 1
> + if not clause.startswith('#'):
> + declaration += "\t" * level
> + declaration += "\t" + clause + "\n"
> + if "{" in clause and "}" not in clause:
> + level += 1
> + return declaration
> +
> +
> def dump_struct(self, ln, proto):
> """
> Store an entry for an struct or union
> @@ -776,42 +812,8 @@ class KernelDoc:
> self.create_parameter_list(ln, decl_type, members, ';',
> declaration_name)
> self.check_sections(ln, declaration_name, decl_type)
> -
> - # Adjust declaration for better display
> - declaration = KernRe(r'([{;])').sub(r'\1\n', declaration)
> - declaration = KernRe(r'\}\s+;').sub('};', declaration)
> -
> - # Better handle inlined enums
> - while True:
> - r = KernRe(r'(enum\s+\{[^}]+),([^\n])')
> - if not r.search(declaration):
> - break
> -
> - declaration = r.sub(r'\1,\n\2', declaration)
> -
> - def_args = declaration.split('\n')
> - level = 1
> - declaration = ""
> - for clause in def_args:
> -
> - clause = clause.strip()
> - clause = KernRe(r'\s+').sub(' ', clause, count=1)
> -
> - if not clause:
> - continue
> -
> - if '}' in clause and level > 1:
> - level -= 1
> -
> - if not KernRe(r'^\s*#').match(clause):
> - declaration += "\t" * level
> -
> - declaration += "\t" + clause + "\n"
> - if "{" in clause and "}" not in clause:
> - level += 1
> -
> self.output_declaration(decl_type, declaration_name,
> - definition=declaration,
> + definition=self.format_struct_decl(declaration),
> purpose=self.entry.declaration_purpose)
>
> def dump_enum(self, ln, proto):
Thanks,
Mauro
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH 12/12] docs: kdoc: a few final dump_struct() touches
2025-08-01 0:13 ` [PATCH 12/12] docs: kdoc: a few final dump_struct() touches Jonathan Corbet
@ 2025-08-01 6:10 ` Mauro Carvalho Chehab
0 siblings, 0 replies; 39+ messages in thread
From: Mauro Carvalho Chehab @ 2025-08-01 6:10 UTC (permalink / raw)
To: Jonathan Corbet; +Cc: linux-doc, linux-kernel, Akira Yokosawa
Em Thu, 31 Jul 2025 18:13:26 -0600
Jonathan Corbet <corbet@lwn.net> escreveu:
> Add a couple more comments so that each phase of the process is
> now clearly marked.
Reviewed-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
>
> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
> ---
> scripts/lib/kdoc/kdoc_parser.py | 11 ++++++-----
> 1 file changed, 6 insertions(+), 5 deletions(-)
>
> diff --git a/scripts/lib/kdoc/kdoc_parser.py b/scripts/lib/kdoc/kdoc_parser.py
> index 131956d89f84..fa2041276f8c 100644
> --- a/scripts/lib/kdoc/kdoc_parser.py
> +++ b/scripts/lib/kdoc/kdoc_parser.py
> @@ -801,14 +801,15 @@ class KernelDoc:
> nested = NestedMatch()
> for search, sub in struct_nested_prefixes:
> members = nested.sub(search, sub, members)
> -
> - # Keeps the original declaration as-is
> + #
> + # Deal with embedded struct and union members, and drop enums entirely.
> + #
> declaration = members
> members = self.rewrite_struct_members(members)
> -
> - # Ignore other nested elements, like enums
> members = re.sub(r'(\{[^\{\}]*\})', '', members)
> -
> + #
> + # Output the result and we are done.
> + #
> self.create_parameter_list(ln, decl_type, members, ';',
> declaration_name)
> self.check_sections(ln, declaration_name, decl_type)
Thanks,
Mauro
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH 00/12] docs: kdoc: thrash up dump_struct()
2025-08-01 0:13 [PATCH 00/12] docs: kdoc: thrash up dump_struct() Jonathan Corbet
` (11 preceding siblings ...)
2025-08-01 0:13 ` [PATCH 12/12] docs: kdoc: a few final dump_struct() touches Jonathan Corbet
@ 2025-08-01 6:23 ` Mauro Carvalho Chehab
12 siblings, 0 replies; 39+ messages in thread
From: Mauro Carvalho Chehab @ 2025-08-01 6:23 UTC (permalink / raw)
To: Jonathan Corbet; +Cc: linux-doc, linux-kernel, Akira Yokosawa
Em Thu, 31 Jul 2025 18:13:14 -0600
Jonathan Corbet <corbet@lwn.net> escreveu:
> In my ongoing effort to truly understand our new kernel-doc, I continue to
> make changes to improve the code, and to try to make the understanding task
> easier for the next person. These patches focus on dump_struct() in
> particular, which starts out at nearly 300 lines long - to much to fit into
> my little brain anyway. Hopefully the result is easier to manage.
>
> There are no changes in the rendered docs.
>
> (At some point I think this code could benefit from a deeper rework. We
> are essentially making three parsing passes over these declarations -
> dump_struct(), create_parameter_list(), and push_parameter() for structs -
> and it seems like we ought to be able to do better. But that's for another
> day.)
True. I tried not to do too much optimizations during conversion, as it
would make harder to compare with kernel_doc.pl, but yeah, the entire
logic around parsing structs and functions has always been a nightmare.
My understanding is that the original Perl code was written this way to
make easier to handle typedefs and structs the same way. So, common
code was placed at create_parameter_list(). The push_parameter() is
there to have some common code used on several parts of
create_parameter_list() on a single place.
If I were designing it from scratch with no strings attached to Perl,
I would probably create a separate class just to manage struct
parameters - or alternatively, to deal with structs as a hole.
Thanks,
Mauro
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH 05/12] docs: kdoc: split top-level prototype parsing out of dump_struct()
2025-08-01 5:34 ` Mauro Carvalho Chehab
@ 2025-08-01 14:10 ` Jonathan Corbet
2025-08-04 12:20 ` Mauro Carvalho Chehab
0 siblings, 1 reply; 39+ messages in thread
From: Jonathan Corbet @ 2025-08-01 14:10 UTC (permalink / raw)
To: Mauro Carvalho Chehab; +Cc: linux-doc, linux-kernel, Akira Yokosawa
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> writes:
> Em Thu, 31 Jul 2025 18:13:19 -0600
> Jonathan Corbet <corbet@lwn.net> escreveu:
>>
>> - struct_members = KernRe(type_pattern + r'([^{};]+)(\{)([^{}]*)(\})([^{};]*)(;)')
>> + struct_members = KernRe(r'(struct|union)([^{};]+)(\{)([^{}]*)(\})([^{};]*)(;)')
>
> I would prefer keeping type_pattern here.
The problem is that type_pattern no longer exists in that function. I'd
have to redefine it, or make it global. It seems like a rather trivial
thing to make global (and, as a result, make people go to the top of the
file to figure out what it really is).
jon
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH 03/12] docs: kdoc: backslashectomy in kdoc_parser
2025-08-01 4:27 ` Mauro Carvalho Chehab
@ 2025-08-01 14:21 ` Jonathan Corbet
2025-08-04 12:58 ` Mauro Carvalho Chehab
0 siblings, 1 reply; 39+ messages in thread
From: Jonathan Corbet @ 2025-08-01 14:21 UTC (permalink / raw)
To: Mauro Carvalho Chehab; +Cc: linux-doc, linux-kernel, Akira Yokosawa
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> writes:
> Em Thu, 31 Jul 2025 18:13:17 -0600
> Jonathan Corbet <corbet@lwn.net> escreveu:
>
>> A lot of the regular expressions in this file have extraneous backslashes
>
> This one is a bit scary... It could actually cause issues somewhere.
What kind of issues?
> Also, IMHO, some expressions look worse on my eyes ;-)
Here I think we're going to disagree. The extra backslashes are really
just visual noise as far as I'm concerned.
>> that may have been needed in Perl, but aren't helpful here. Take them out
>> to reduce slightly the visual noise.
>
> No idea if Perl actually requires, but, at least for me, I do prefer to
> see all special characters properly escaped with a backslash. This way,
> it is a lot clearer that what it is expecting is a string, instead of
> using something that may affect regex processing.
I guess my point is that, in the given cases, the characters in question
*aren't* special.
>> - param = KernRe(r'[\[\)].*').sub('', param, count=1)
>> + param = KernRe(r'[)[].*').sub('', param, count=1)
>
> This one, for instance, IMHO looks a lot worse for my eyes to understand
> that there is a "[" that it is not an operator, but instead a string.
> The open close parenthesis also looks weird. My regex-trained eyes think
> that this would be part of a capture group.
...and mine say "that's in [brackets] why are you escaping it?" :)
>> if dtype == "" and param.endswith("..."):
>> if KernRe(r'\w\.\.\.$').search(param):
>> @@ -405,7 +405,7 @@ class KernelDoc:
>>
>> for arg in args.split(splitter):
>> # Strip comments
>> - arg = KernRe(r'\/\*.*\*\/').sub('', arg)
>> + arg = KernRe(r'/\*.*\*/').sub('', arg)
>
> A pattern like /..../ is a standard way to pass search group with Regex
> on many languages and utils that accept regular expressions like the
> sed command. Dropping the backslash here IMHO makes it confusing ;-)
...but it is definitely not any such in Python and never has been, so
escaping slashes looks weird and makes the reader wonder what they are
missing.
> Seriously, IMHO this patch makes a lot worse to understand what brackets,
> parenthesis and dots are strings, and which ones are part of the regex
> syntax.
So I guess I won't fight this one to the death, but I really do
disagree. Writing regexes in a non-canonical style just makes it harder
for anybody else who comes along to figure out what is going on; it
certainly made it harder for me.
Thanks,
jon
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH 10/12] docs: kdoc: further rewrite_struct_members() cleanup
2025-08-01 6:07 ` Mauro Carvalho Chehab
@ 2025-08-01 22:52 ` Jonathan Corbet
2025-08-04 13:15 ` Mauro Carvalho Chehab
0 siblings, 1 reply; 39+ messages in thread
From: Jonathan Corbet @ 2025-08-01 22:52 UTC (permalink / raw)
To: Mauro Carvalho Chehab; +Cc: linux-doc, linux-kernel, Akira Yokosawa
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> writes:
> Em Thu, 31 Jul 2025 18:13:24 -0600
> Jonathan Corbet <corbet@lwn.net> escreveu:
>
>> Get rid of some single-use variables and redundant checks, and generally
>> tighten up the code; no logical change.
>>
>> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
>> ---
>> scripts/lib/kdoc/kdoc_parser.py | 89 ++++++++++++++++-----------------
>> 1 file changed, 42 insertions(+), 47 deletions(-)
>>
>> diff --git a/scripts/lib/kdoc/kdoc_parser.py b/scripts/lib/kdoc/kdoc_parser.py
>> index 20e0a2abe13b..2b7d7e646367 100644
>> --- a/scripts/lib/kdoc/kdoc_parser.py
>> +++ b/scripts/lib/kdoc/kdoc_parser.py
>> @@ -673,73 +673,68 @@ class KernelDoc:
>> while tuples:
>> for t in tuples:
>> newmember = ""
>> - maintype = t[0]
>> - s_ids = t[5]
>> - content = t[3]
>
> The reason I opted for this particular approach...
>> -
>> - oldmember = "".join(t)
>> -
>> - for s_id in s_ids.split(','):
>> + oldmember = "".join(t) # Reconstruct the original formatting
>> + #
>> + # Pass through each field name, normalizing the form and formatting.
>> + #
>> + for s_id in t[5].split(','):
>
> ... is that it is easier to understand and to maintain:
>
> for s_id in s_ids.split(','):
>
> than when magic numbers like this are used:
>
> for s_id in t[5].split(','):
Coming into this code, I had a different experience, and found the
variables to just be a layer of indirection I had to pass through to get
to the capture groups and see what was really going on. That was part
of why I put the group numbers in the comments next to that gnarly
regex, to make that mapping more direct and easier to understand.
I will not insist on this change either - at least not indefinitely :)
I do feel, though, that adding a step between the regex and its use just
serves to obscure things.
(And yes, I don't really think that named groups make things better.
I've found those useful in situations where multiple regexes are in use
and the ordering of the groups may vary, but otherwise have generally
avoided them).
Thanks,
jon
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH 05/12] docs: kdoc: split top-level prototype parsing out of dump_struct()
2025-08-01 14:10 ` Jonathan Corbet
@ 2025-08-04 12:20 ` Mauro Carvalho Chehab
0 siblings, 0 replies; 39+ messages in thread
From: Mauro Carvalho Chehab @ 2025-08-04 12:20 UTC (permalink / raw)
To: Jonathan Corbet; +Cc: linux-doc, linux-kernel, Akira Yokosawa
Em Fri, 01 Aug 2025 08:10:05 -0600
Jonathan Corbet <corbet@lwn.net> escreveu:
> Mauro Carvalho Chehab <mchehab+huawei@kernel.org> writes:
>
> > Em Thu, 31 Jul 2025 18:13:19 -0600
> > Jonathan Corbet <corbet@lwn.net> escreveu:
> >>
> >> - struct_members = KernRe(type_pattern + r'([^{};]+)(\{)([^{}]*)(\})([^{};]*)(;)')
> >> + struct_members = KernRe(r'(struct|union)([^{};]+)(\{)([^{}]*)(\})([^{};]*)(;)')
> >
> > I would prefer keeping type_pattern here.
>
> The problem is that type_pattern no longer exists in that function.
Ah, I see. If this is the only place now where we have this, then it
sounds OK to have it like that.
Feel free to add my R-B:
Reviewed-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> I'd
> have to redefine it, or make it global. It seems like a rather trivial
> thing to make global (and, as a result, make people go to the top of the
> file to figure out what it really is).
Thanks,
Mauro
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH 03/12] docs: kdoc: backslashectomy in kdoc_parser
2025-08-01 14:21 ` Jonathan Corbet
@ 2025-08-04 12:58 ` Mauro Carvalho Chehab
2025-08-04 16:00 ` Mauro Carvalho Chehab
0 siblings, 1 reply; 39+ messages in thread
From: Mauro Carvalho Chehab @ 2025-08-04 12:58 UTC (permalink / raw)
To: Jonathan Corbet; +Cc: linux-doc, linux-kernel, Akira Yokosawa
Em Fri, 01 Aug 2025 08:21:49 -0600
Jonathan Corbet <corbet@lwn.net> escreveu:
> Mauro Carvalho Chehab <mchehab+huawei@kernel.org> writes:
>
> > Em Thu, 31 Jul 2025 18:13:17 -0600
> > Jonathan Corbet <corbet@lwn.net> escreveu:
> >
> >> A lot of the regular expressions in this file have extraneous backslashes
> >
> > This one is a bit scary... It could actually cause issues somewhere.
>
> What kind of issues?
I caught several issues in the past due to the lack of it. Don't
recall the specific cases, but using reserved symbols without
backslashes have giving me enough headaches.
Yet, see POSIX rules for some cases:
https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03
like this one:
"The character sequences "[.", "[=", and "[:" shall be special
inside a bracket expression"
Basically, if you don't know exactly what you're doing, and just
place special characters there without extra case, you may be
in serious troubles. And see, this is just for BRE (basic regular
expressions). There are also other weirdness with ERE (extended
regular expressions):
"The <period>, <left-square-bracket>, <backslash>, and
<left-parenthesis> shall be special except when used
in a bracket expression"
> > Also, IMHO, some expressions look worse on my eyes ;-)
>
> Here I think we're going to disagree. The extra backslashes are really
> just visual noise as far as I'm concerned.
>
> >> that may have been needed in Perl, but aren't helpful here. Take them out
> >> to reduce slightly the visual noise.
> >
> > No idea if Perl actually requires, but, at least for me, I do prefer to
> > see all special characters properly escaped with a backslash. This way,
> > it is a lot clearer that what it is expecting is a string, instead of
> > using something that may affect regex processing.
>
> I guess my point is that, in the given cases, the characters in question
> *aren't* special.
They are special in the sense that we're using characters that
have meanings in regular expressions and even placing them on
a random order may cause POSIX violations (and eventually cause
troubles if, for instance, we need to use "regex" instead of "re",
or if someone fixes python native "re" to be more POSIX compliant.
> >> - param = KernRe(r'[\[\)].*').sub('', param, count=1)
> >> + param = KernRe(r'[)[].*').sub('', param, count=1)
> >
> > This one, for instance, IMHO looks a lot worse for my eyes to understand
> > that there is a "[" that it is not an operator, but instead a string.
> > The open close parenthesis also looks weird. My regex-trained eyes think
> > that this would be part of a capture group.
>
> ...and mine say "that's in [brackets] why are you escaping it?" :)
Heh, all those years writing and reviewing kernel code, for me
seeing unmatched parenthesis/brackets really bugs me... perhaps
it starts some sort of TOC syndrome ;-)
Perhaps one alternative would be to have a separate var, like:
# Before touching this, see:
# https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_04
# As some char sequences inside brackets have special meanings
escape_chars = ")["
param = KernRe(rf'[{escape_chars}].*').sub('', param, count=1)
or to use re_escape().
> >> if dtype == "" and param.endswith("..."):
> >> if KernRe(r'\w\.\.\.$').search(param):
> >> @@ -405,7 +405,7 @@ class KernelDoc:
> >>
> >> for arg in args.split(splitter):
> >> # Strip comments
> >> - arg = KernRe(r'\/\*.*\*\/').sub('', arg)
> >> + arg = KernRe(r'/\*.*\*/').sub('', arg)
> >
> > A pattern like /..../ is a standard way to pass search group with Regex
> > on many languages and utils that accept regular expressions like the
> > sed command. Dropping the backslash here IMHO makes it confusing ;-)
>
> ...but it is definitely not any such in Python and never has been, so
> escaping slashes looks weird and makes the reader wonder what they are
> missing.
After re-reading, this specific change is actually ok, but yeah, I
still need to read it twice or three times, as on sed, perl and other
languages that are more POSIX compliant, /re/ means a regex delimiter:
https://en.wikipedia.org/wiki/Regular_expression
> > Seriously, IMHO this patch makes a lot worse to understand what brackets,
> > parenthesis and dots are strings, and which ones are part of the regex
> > syntax.
>
> So I guess I won't fight this one to the death, but I really do
> disagree. Writing regexes in a non-canonical style just makes it harder
> for anybody else who comes along to figure out what is going on; it
> certainly made it harder for me.
Heh, for me, my main concerns are:
- unmatched brackets/parenthesis
- POSIX violations - it may work today, but future Python versions
that fix "re" module will cause regressions. It is also annoying
to write/understand regex that only works on Python.
I can live with the other ones.
Thanks,
Mauro
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH 10/12] docs: kdoc: further rewrite_struct_members() cleanup
2025-08-01 22:52 ` Jonathan Corbet
@ 2025-08-04 13:15 ` Mauro Carvalho Chehab
2025-08-05 22:46 ` Jonathan Corbet
0 siblings, 1 reply; 39+ messages in thread
From: Mauro Carvalho Chehab @ 2025-08-04 13:15 UTC (permalink / raw)
To: Jonathan Corbet; +Cc: linux-doc, linux-kernel, Akira Yokosawa
Em Fri, 01 Aug 2025 16:52:33 -0600
Jonathan Corbet <corbet@lwn.net> escreveu:
> Mauro Carvalho Chehab <mchehab+huawei@kernel.org> writes:
>
> > Em Thu, 31 Jul 2025 18:13:24 -0600
> > Jonathan Corbet <corbet@lwn.net> escreveu:
> >
> >> Get rid of some single-use variables and redundant checks, and generally
> >> tighten up the code; no logical change.
> >>
> >> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
> >> ---
> >> scripts/lib/kdoc/kdoc_parser.py | 89 ++++++++++++++++-----------------
> >> 1 file changed, 42 insertions(+), 47 deletions(-)
> >>
> >> diff --git a/scripts/lib/kdoc/kdoc_parser.py b/scripts/lib/kdoc/kdoc_parser.py
> >> index 20e0a2abe13b..2b7d7e646367 100644
> >> --- a/scripts/lib/kdoc/kdoc_parser.py
> >> +++ b/scripts/lib/kdoc/kdoc_parser.py
> >> @@ -673,73 +673,68 @@ class KernelDoc:
> >> while tuples:
> >> for t in tuples:
> >> newmember = ""
> >> - maintype = t[0]
> >> - _ids = t[5]s
> >> - content = t[3]
> >
> > The reason I opted for this particular approach...
> >> -
> >> - oldmember = "".join(t)
> >> -
> >> - for s_id in s_ids.split(','):
> >> + oldmember = "".join(t) # Reconstruct the original formatting
> >> + #
> >> + # Pass through each field name, normalizing the form and formatting.
> >> + #
> >> + for s_id in t[5].split(','):
> >
> > ... is that it is easier to understand and to maintain:
> >
> > for s_id in s_ids.split(','):
> >
> > than when magic numbers like this are used:
> >
> > for s_id in t[5].split(','):
>
> Coming into this code, I had a different experience, and found the
> variables to just be a layer of indirection I had to pass through to get
> to the capture groups and see what was really going on. That was part
> of why I put the group numbers in the comments next to that gnarly
> regex, to make that mapping more direct and easier to understand.
>
> I will not insist on this change either - at least not indefinitely :)
> I do feel, though, that adding a step between the regex and its use just
> serves to obscure things.
>
> (And yes, I don't really think that named groups make things better.
> I've found those useful in situations where multiple regexes are in use
> and the ordering of the groups may vary, but otherwise have generally
> avoided them).
I'd say that, when the magic number is within up to 3-lines hunk
distance - e.g. if all of them will appear at the same hunk, it is
probably safe to use, but when it gets far away, it makes more harm
than good.
Perhaps one alternative would do something like:
tuples = struct_members.findall(members)
if not tuples:
break
maintype, -, -, content, -, s_ids = tuples
(assuming that we don't need t[1], t[2] and t[4] here)
Btw, on this specific case, better to use non-capture group matches
to avoid those "empty" spaces, e.g. (if I got it right):
# Curly Brackets are not captured
struct_members = KernRe(type_pattern + # Capture main type
r'([^\{\};]+)' +
r'(?:\{)' +
r'(?:[^\{\}]*)' + # Capture content
r'(?:\})' +
r'([^\{\}\;]*)(\;)') # Capture IDs
...
tuples = struct_members.findall(members)
if not tuples:
break
maintype, content, s_ids = tuples
Btw, a cleanup like the above is, IMHO, another good reason why not using
magic numbers: people may end fixing match groups to use non-capture
matches, but end forgetting to fix some hidden magic numbers. It is hard
for a reviewer to see it if the affected magic numbers aren't within the
3 lines range of a default unified diff, and may introduce hidden bugs.
Thanks,
Mauro
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH 03/12] docs: kdoc: backslashectomy in kdoc_parser
2025-08-04 12:58 ` Mauro Carvalho Chehab
@ 2025-08-04 16:00 ` Mauro Carvalho Chehab
2025-08-04 18:29 ` Jonathan Corbet
0 siblings, 1 reply; 39+ messages in thread
From: Mauro Carvalho Chehab @ 2025-08-04 16:00 UTC (permalink / raw)
To: Jonathan Corbet; +Cc: linux-doc, linux-kernel, Akira Yokosawa
Em Mon, 4 Aug 2025 14:58:18 +0200
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> escreveu:
> Em Fri, 01 Aug 2025 08:21:49 -0600
> Jonathan Corbet <corbet@lwn.net> escreveu:
>
> > Mauro Carvalho Chehab <mchehab+huawei@kernel.org> writes:
> >
> > > Em Thu, 31 Jul 2025 18:13:17 -0600
> > > Jonathan Corbet <corbet@lwn.net> escreveu:
> > >
> > >> A lot of the regular expressions in this file have extraneous backslashes
> > >
> > > This one is a bit scary... It could actually cause issues somewhere.
> >
> > What kind of issues?
>
> I caught several issues in the past due to the lack of it. Don't
> recall the specific cases, but using reserved symbols without
> backslashes have giving me enough headaches.
>
> Yet, see POSIX rules for some cases:
>
> https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03
>
> like this one:
>
> "The character sequences "[.", "[=", and "[:" shall be special
> inside a bracket expression"
>
> Basically, if you don't know exactly what you're doing, and just
> place special characters there without extra case, you may be
> in serious troubles. And see, this is just for BRE (basic regular
> expressions). There are also other weirdness with ERE (extended
> regular expressions):
>
> "The <period>, <left-square-bracket>, <backslash>, and
> <left-parenthesis> shall be special except when used
> in a bracket expression"
>
> > > Also, IMHO, some expressions look worse on my eyes ;-)
> >
> > Here I think we're going to disagree. The extra backslashes are really
> > just visual noise as far as I'm concerned.
> >
> > >> that may have been needed in Perl, but aren't helpful here. Take them out
> > >> to reduce slightly the visual noise.
> > >
> > > No idea if Perl actually requires, but, at least for me, I do prefer to
> > > see all special characters properly escaped with a backslash. This way,
> > > it is a lot clearer that what it is expecting is a string, instead of
> > > using something that may affect regex processing.
> >
> > I guess my point is that, in the given cases, the characters in question
> > *aren't* special.
>
> They are special in the sense that we're using characters that
> have meanings in regular expressions and even placing them on
> a random order may cause POSIX violations (and eventually cause
> troubles if, for instance, we need to use "regex" instead of "re",
> or if someone fixes python native "re" to be more POSIX compliant.
>
> > >> - param = KernRe(r'[\[\)].*').sub('', param, count=1)
> > >> + param = KernRe(r'[)[].*').sub('', param, count=1)
> > >
> > > This one, for instance, IMHO looks a lot worse for my eyes to understand
> > > that there is a "[" that it is not an operator, but instead a string.
> > > The open close parenthesis also looks weird. My regex-trained eyes think
> > > that this would be part of a capture group.
> >
> > ...and mine say "that's in [brackets] why are you escaping it?" :)
>
> Heh, all those years writing and reviewing kernel code, for me
> seeing unmatched parenthesis/brackets really bugs me... perhaps
> it starts some sort of TOC syndrome ;-)
>
> Perhaps one alternative would be to have a separate var, like:
>
> # Before touching this, see:
> # https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_04
> # As some char sequences inside brackets have special meanings
> escape_chars = ")["
>
> param = KernRe(rf'[{escape_chars}].*').sub('', param, count=1)
>
> or to use re_escape().
>
> > >> if dtype == "" and param.endswith("..."):
> > >> if KernRe(r'\w\.\.\.$').search(param):
> > >> @@ -405,7 +405,7 @@ class KernelDoc:
> > >>
> > >> for arg in args.split(splitter):
> > >> # Strip comments
> > >> - arg = KernRe(r'\/\*.*\*\/').sub('', arg)
> > >> + arg = KernRe(r'/\*.*\*/').sub('', arg)
> > >
> > > A pattern like /..../ is a standard way to pass search group with Regex
> > > on many languages and utils that accept regular expressions like the
> > > sed command. Dropping the backslash here IMHO makes it confusing ;-)
> >
> > ...but it is definitely not any such in Python and never has been, so
> > escaping slashes looks weird and makes the reader wonder what they are
> > missing.
>
> After re-reading, this specific change is actually ok, but yeah, I
> still need to read it twice or three times, as on sed, perl and other
> languages that are more POSIX compliant, /re/ means a regex delimiter:
>
> https://en.wikipedia.org/wiki/Regular_expression
>
> > > Seriously, IMHO this patch makes a lot worse to understand what brackets,
> > > parenthesis and dots are strings, and which ones are part of the regex
> > > syntax.
> >
> > So I guess I won't fight this one to the death, but I really do
> > disagree. Writing regexes in a non-canonical style just makes it harder
> > for anybody else who comes along to figure out what is going on; it
> > certainly made it harder for me.
>
> Heh, for me, my main concerns are:
> - unmatched brackets/parenthesis
> - POSIX violations - it may work today, but future Python versions
> that fix "re" module will cause regressions. It is also annoying
> to write/understand regex that only works on Python.
In time: I mean *possible* POSIX violations.
I very much prefer more backslashs than needed or use re.escape()
than to read thoughtfully POSIX and Python-specific specific rules.
Python, in particular, is not very reliable between versions - each
new version comes with a set of incompatible changes.
Regex in Python for instance had incompatible changes in 3.6, 3.7, 3.11
and 3.13 (according with a LLM query I did). None relevant for our cases,
but there were incompatible changes on 3.6, 3.7 and 3.13 on common
patterns like \b (affecting re.split) and \w (affecting utf). The number
of escaped chars on re.escape() also increased on 3.7.
In summary, if you agree with always escape brackets, curly brackets and
parenthesis inside brackets on kernel-doc, we should be free of not
opened/not closed "symbols" with is an annoyance at least for me, and
we should be freed of possible POSIX issues and undefined behavior(*).
On such case, feel free to add my Reviewed-by.
Regards,
Mauro
(*) Python spec:
https://docs.python.org/3/library/re.html#regular-expression-syntax
Says:
"Backslash either escapes characters which have special meaning in
a set such as '-', ']', '^' and '\\' itself..."
Here, "such as" is vague: it doesn't say anything about "[[]" or "[\[]".
As not escaping "[" is a POSIX violation, I'd say that this is undefined
behavior that could change with time specially if they want to stick
with POSIX and/or implement POSIX ERE in the future.
---
Btw, I asked a LLM (*) to generate a list of special chars inside brackets,
listing how such characters are handled. Neither Deepseek nor Chatgpt
considered brackets without escape as valid or good practice.
That's the Deepseek version:
+------------------+-----------+-----------+-----------+-----------+-----------+-----------+
| | Python re | Perl | POSIX ERE | POSIX BRE | grep -E | sed (BRE) |
+==================+===========+===========+===========+===========+===========+===========+
| ] (not first) | \] | \] | \] | \] | \] | \] |
+------------------+-----------+-----------+-----------+-----------+-----------+-----------+
| ] (first) | ] | ] | ] | ] | ] | ] |
+------------------+-----------+-----------+-----------+-----------+-----------+-----------+
| [ | \[ | \[ | \[ | \[ | \[ | \[ |
+------------------+-----------+-----------+-----------+-----------+-----------+-----------+
| - (middle) | \-, a-z | \-, a-z | \-, a-z | \-, a-z | \-, a-z | \- |
+------------------+-----------+-----------+-----------+-----------+-----------+-----------+
| - (first/last) | - | - | - | - | - | - |
+------------------+-----------+-----------+-----------+-----------+-----------+-----------+
| ^ (first) | ^ | ^ | ^ | ^ | ^ | \^ |
+------------------+-----------+-----------+-----------+-----------+-----------+-----------+
| ^ (not first) | ^ | ^ | \^ | \^ | \^ | \^ |
+------------------+-----------+-----------+-----------+-----------+-----------+-----------+
| \ | \\ | \\ | \\ | \\ | \\ | \\ |
+------------------+-----------+-----------+-----------+-----------+-----------+-----------+
| POSIX classes | [:alpha:] | [:alpha:] | [:alpha:] | [:alpha:] | [:alpha:] | [:alpha:] |
| (e.g. [:alpha:]) | | | | | | |
+------------------+-----------+-----------+-----------+-----------+-----------+-----------+
| Special POSIX | - | - | [.ch.] | [.ch.] | - | [.ch.] |
| collating [.ch.] | | | [=a=] | [=a=] | | [=a=] |
+------------------+-----------+-----------+-----------+-----------+-----------+-----------+
| * | * | * | * | \* | * | \* |
+------------------+-----------+-----------+-----------+-----------+-----------+-----------+
| $ | $ | $ | $ | \$ | $ | \$ |
+------------------+-----------+-----------+-----------+-----------+-----------+-----------+
| . | . | . | . | \. | . | \. |
+------------------+-----------+-----------+-----------+-----------+-----------+-----------+
And using the LLM query produced by Deepseek, (see below), that's Chatgpt results:
+------------------------------------+------------+--------+-------------+-------------+-----------+-------------+
| Feature | Python re | Perl | POSIX ERE | POSIX BRE | grep -E | sed (BRE) |
+====================================+============+========+=============+=============+===========+=============+
| `]` (first position) | literal | literal| literal | literal | literal | literal |
+------------------------------------+------------+--------+-------------+-------------+-----------+-------------+
| `]` (non-first position) | \] | \] | \] | \] | \] | \] |
+------------------------------------+------------+--------+-------------+-------------+-----------+-------------+
| `[` | literal | literal| literal | literal | literal | literal |
+------------------------------------+------------+--------+-------------+-------------+-----------+-------------+
| `-` (between literals) | range `a-z`| range | range | range | range | range |
+------------------------------------+------------+--------+-------------+-------------+-----------+-------------+
| `-` (start or end) | literal | literal| literal | literal | literal | literal |
+------------------------------------+------------+--------+-------------+-------------+-----------+-------------+
| `^` (first position) | negation | negation| negation | negation | negation | negation |
+------------------------------------+------------+--------+-------------+-------------+-----------+-------------+
| `^` (non-first position) | literal | literal| literal | literal | literal | literal |
+------------------------------------+------------+--------+-------------+-------------+-----------+-------------+
| `\` | \\ | \\ | \\ | \\ | \\ | \\ |
+------------------------------------+------------+--------+-------------+-------------+-----------+-------------+
| POSIX class `[:alpha:]` | literal | literal| supported | supported | supported | supported |
+------------------------------------+------------+--------+-------------+-------------+-----------+-------------+
| POSIX special `[.ch.]` | - | - | supported | supported | supported | supported |
+------------------------------------+------------+--------+-------------+-------------+-----------+-------------+
| POSIX equivalence `[=a=]` | - | - | supported | supported | supported | supported |
+------------------------------------+------------+--------+-------------+-------------+-----------+-------------+
| `*` | literal | literal| literal | literal | literal | literal |
+------------------------------------+------------+--------+-------------+-------------+-----------+-------------+
| `$` | literal | literal| literal | literal | literal | literal |
+------------------------------------+------------+--------+-------------+-------------+-----------+-------------+
| `.` | literal | literal| literal | literal | literal | literal |
+------------------------------------+------------+--------+-------------+-------------+-----------+-------------+
Yet, Chatgpt original result was wrong. I had to do an extra query
there:
- Please check again for [ inside brackets. POSIX spec mentions some sequences starting with it that aren't literal. Re-check it and update the table.
Chatgpt now gave a better answer:
You're correct to call this out — in **POSIX**, an unescaped `[` inside a character class can indeed initiate **special bracket expressions** (`[:class:]`, `[=equiv=]`, `[.collate.]`). This makes it **not always treated as literal**, depending on **context and position**.
Here is the **updated full table** in **strict ReST ASCII format**, reflecting that correction:
```
+------------------------------------+------------+--------+-------------+-------------+-----------+-------------+
| Feature | Python re | Perl | POSIX ERE | POSIX BRE | grep -E | sed (BRE) |
+====================================+============+========+=============+=============+===========+=============+
| `]` (first position) | literal | literal| literal | literal | literal | literal |
+------------------------------------+------------+--------+-------------+-------------+-----------+-------------+
| `]` (non-first position) | \] | \] | \] | \] | \] | \] |
+------------------------------------+------------+--------+-------------+-------------+-----------+-------------+
| `[` | literal | literal| special* | special* | special* | special* |
+------------------------------------+------------+--------+-------------+-------------+-----------+-------------+
| `-` (between literals) | range `a-z`| range | range | range | range | range |
+------------------------------------+------------+--------+-------------+-------------+-----------+-------------+
| `-` (start or end) | literal | literal| literal | literal | literal | literal |
+------------------------------------+------------+--------+-------------+-------------+-----------+-------------+
| `^` (first position) | negation | negation| negation | negation | negation | negation |
+------------------------------------+------------+--------+-------------+-------------+-----------+-------------+
| `^` (non-first position) | literal | literal| literal | literal | literal | literal |
+------------------------------------+------------+--------+-------------+-------------+-----------+-------------+
| `\` | \\ | \\ | \\ | \\ | \\ | \\ |
+------------------------------------+------------+--------+-------------+-------------+-----------+-------------+
| POSIX class `[:alpha:]` | literal | literal| supported | supported | supported | supported |
+------------------------------------+------------+--------+-------------+-------------+-----------+-------------+
| POSIX special `[.ch.]` | - | - | supported | supported | supported | supported |
+------------------------------------+------------+--------+-------------+-------------+-----------+-------------+
| POSIX equivalence `[=a=]` | - | - | supported | supported | supported | supported |
+------------------------------------+------------+--------+-------------+-------------+-----------+-------------+
| `*` | literal | literal| literal | literal | literal | literal |
+------------------------------------+------------+--------+-------------+-------------+-----------+-------------+
| `$` | literal | literal| literal | literal | literal | literal |
+------------------------------------+------------+--------+-------------+-------------+-----------+-------------+
| `.` | literal | literal| literal | literal | literal | literal |
+------------------------------------+------------+--------+-------------+-------------+-----------+-------------+
```
### Footnote:
* \* In **POSIX ERE**, **POSIX BRE**, **`grep -E`**, and **`sed`**, an unescaped `[` **immediately followed by `:`, `=`, or `.`** at the start of a character class (e.g., `[[:`, `[.=`, `[==`) initiates a **special POSIX expression** and is **not a literal**. To match a literal `[`, avoid placing it in this position or use placement later in the class. Escaping `[` (`\[`), while common in practice, is not guaranteed portable in POSIX.
Let me know if you'd like to add more edge cases (e.g., `[[:[:alpha:]:]]` or nested misuse).
-
LLM queries to produce the above result with Deepseek:
- list on an ascii table python changes to re and regex modules that caused backward-compatibility issues.
- for internal "re" module, compare changes for "[]" brakets that contain characters on it. what chars/sequences of special characters can cause troubles on Python? What's the difference in terms of needing backslashes when compared with POSIX, Perl, sed and grep (with -E)? Ignore unicode. We're interested just in ascii special chars.
- get the table (3) and transpose it. check if different versions do different cases. Include both POSIX ERE and BRE. Add special sequences. for instance, POSIX have special meanings for having "[" followed by some other special characters. Show only the table
To allow you to reproduce, I also asked:
- I want to pass to someone else the LLM requests for him to produce exactly the same table.
It produced the following LLM query:
<snip>
Create a comparison table in ReST format showing escaping rules for special ASCII characters inside regex character classes `[]`. Follow these specifications:
1. **Tools**: Compare Python `re`, Perl, POSIX ERE, POSIX BRE, `grep -E`, and `sed` (BRE)
2. **Structure**: Transposed layout (features as rows, tools as columns)
3. **Content**: Include all these cases:
- `]` (first vs. non-first position)
- `[`, `-`, `^`, `\`
- POSIX classes (`[:alpha:]`)
- POSIX specials (`[.ch.]`, `[=a=]`)
- `*`, `$`, `.`
4. **Rules**:
- Mark required escapes with `\`
- Show range syntax (`a-z`) where applicable
- Indicate "literal" when no escape needed
5. **Format**: Strict ReST ASCII table with grid lines
6. **Notes**:
- Exclude Unicode
- Highlight version differences in footnotes
- Keep all tool columns even if identical
- Use `-` for unsupported features
Provide only the final table with no commentary.
</snip>
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH 03/12] docs: kdoc: backslashectomy in kdoc_parser
2025-08-04 16:00 ` Mauro Carvalho Chehab
@ 2025-08-04 18:29 ` Jonathan Corbet
0 siblings, 0 replies; 39+ messages in thread
From: Jonathan Corbet @ 2025-08-04 18:29 UTC (permalink / raw)
To: Mauro Carvalho Chehab; +Cc: linux-doc, linux-kernel, Akira Yokosawa
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> writes:
> In time: I mean *possible* POSIX violations.
>
> I very much prefer more backslashs than needed or use re.escape()
> than to read thoughtfully POSIX and Python-specific specific rules.
>
> Python, in particular, is not very reliable between versions - each
> new version comes with a set of incompatible changes.
People like to complain about that ... but the LWN site code has been in
Python since 2002, and with the notable exception of the Python 3
transition, it has been really painless.
FWIW, the 3.13 re module will warn if it sees a construction like "[[]"
- evidently there is a possible future change that might make "[[" mean
something special and new. So I avoided that combination.
> In summary, if you agree with always escape brackets, curly brackets and
> parenthesis inside brackets on kernel-doc, we should be free of not
> opened/not closed "symbols" with is an annoyance at least for me, and
> we should be freed of possible POSIX issues and undefined behavior(*).
It shall be a mighty struggle, but I think I can find a way to live with
that... :)
Thanks,
jon
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH 10/12] docs: kdoc: further rewrite_struct_members() cleanup
2025-08-04 13:15 ` Mauro Carvalho Chehab
@ 2025-08-05 22:46 ` Jonathan Corbet
2025-08-06 9:05 ` Mauro Carvalho Chehab
0 siblings, 1 reply; 39+ messages in thread
From: Jonathan Corbet @ 2025-08-05 22:46 UTC (permalink / raw)
To: Mauro Carvalho Chehab; +Cc: linux-doc, linux-kernel, Akira Yokosawa
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> writes:
> Perhaps one alternative would do something like:
>
> tuples = struct_members.findall(members)
> if not tuples:
> break
>
> maintype, -, -, content, -, s_ids = tuples
>
> (assuming that we don't need t[1], t[2] and t[4] here)
>
> Btw, on this specific case, better to use non-capture group matches
> to avoid those "empty" spaces, e.g. (if I got it right):
The problem is this line here:
oldmember = "".join(t) # Reconstruct the original formatting
The regex *has* to capture the entire match string so that it can be
reconstructed back to its original form, which we need to edit the full
list of members later on.
This code could use a deep rethink, but it works for now :)
Thanks,
jon
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH 10/12] docs: kdoc: further rewrite_struct_members() cleanup
2025-08-05 22:46 ` Jonathan Corbet
@ 2025-08-06 9:05 ` Mauro Carvalho Chehab
2025-08-06 13:00 ` Jonathan Corbet
0 siblings, 1 reply; 39+ messages in thread
From: Mauro Carvalho Chehab @ 2025-08-06 9:05 UTC (permalink / raw)
To: Jonathan Corbet; +Cc: linux-doc, linux-kernel, Akira Yokosawa
Em Tue, 05 Aug 2025 16:46:10 -0600
Jonathan Corbet <corbet@lwn.net> escreveu:
> Mauro Carvalho Chehab <mchehab+huawei@kernel.org> writes:
>
> > Perhaps one alternative would do something like:
> >
> > tuples = struct_members.findall(members)
> > if not tuples:
> > break
> >
> > maintype, -, -, content, -, s_ids = tuples
> >
> > (assuming that we don't need t[1], t[2] and t[4] here)
> >
> > Btw, on this specific case, better to use non-capture group matches
> > to avoid those "empty" spaces, e.g. (if I got it right):
>
> The problem is this line here:
>
> oldmember = "".join(t) # Reconstruct the original formatting
>
> The regex *has* to capture the entire match string so that it can be
> reconstructed back to its original form, which we need to edit the full
> list of members later on.
>
> This code could use a deep rethink, but it works for now :)
well, we can still do:
for t in tuples:
maintype, -, -, content, -, s_ids = t
oldmember = "".join(t)
this way, we'll be naming the relevant parameters and reconstructing
the the original form.
IMO, this is a lot better than using t[0], t[3], t[5] at the code,
as the names makes it clear what each one actually captured.
-
Btw, while re.findall() has an API that doesn't return match
objects which is incoherent with the normal re API, while looking
at the specs today(*), there is an alternative: re.finditer().
We could add it to KernRE cass and use it on a way that it will use
a Match instance. Something like:
# Original regex expression
res = Re.finditer(...)
# Not much difference here. Probably not worh using it
for match in res:
oldmember = "".join(match.groups())
maintype, -, -, content, -, s_ids = match.groups()
Or alternatively:
res = Re.finditer(...)
# Not much difference here. Probably not worth using it
for match in res:
oldmember = "".join(match.groups())
# replace at the code below:
# maintype -> match.group('maintype')
# content -> match.group('content')
# s_ids -> match.group('s_ids')
No idea about performance differences between findall and finditer.
(*) https://docs.python.org/3/library/re.html
btw, one possibility that would avoid having tuples and t is
to do something like:
struct_members = KernRe("(" + # group 1: the entire pattern
type_pattern + # Capture main type
r'([^\{\};]+)' +
r'(?:\{)' +
r'(?:[^\{\}]*)' + # Capture content
r'(?:\})' +
r'([^\{\};]*)(;)') # Capture IDs
")")
match = struct_members.finditer(line)
for match in res:
oldmember, maintype, content, s_ids = match.groups()
(disclaimer notice: none of the above was tested)
Thanks,
Mauro
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH 10/12] docs: kdoc: further rewrite_struct_members() cleanup
2025-08-06 9:05 ` Mauro Carvalho Chehab
@ 2025-08-06 13:00 ` Jonathan Corbet
2025-08-06 21:27 ` Mauro Carvalho Chehab
0 siblings, 1 reply; 39+ messages in thread
From: Jonathan Corbet @ 2025-08-06 13:00 UTC (permalink / raw)
To: Mauro Carvalho Chehab; +Cc: linux-doc, linux-kernel, Akira Yokosawa
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> writes:
>> > Btw, on this specific case, better to use non-capture group matches
>> > to avoid those "empty" spaces, e.g. (if I got it right):
>>
>> The problem is this line here:
>>
>> oldmember = "".join(t) # Reconstruct the original formatting
>>
>> The regex *has* to capture the entire match string so that it can be
>> reconstructed back to its original form, which we need to edit the full
>> list of members later on.
>>
>> This code could use a deep rethink, but it works for now :)
>
> well, we can still do:
>
> for t in tuples:
> maintype, -, -, content, -, s_ids = t
> oldmember = "".join(t)
>
> this way, we'll be naming the relevant parameters and reconstructing
> the the original form.
I've already made a change much like that (the "-" syntax doesn't work,
of course); I hope to post the updates series today, but it's going to
be busy.
> Btw, while re.findall() has an API that doesn't return match
> objects which is incoherent with the normal re API, while looking
> at the specs today(*), there is an alternative: re.finditer().
> We could add it to KernRE cass and use it on a way that it will use
> a Match instance. Something like:
>
> # Original regex expression
> res = Re.finditer(...)
> [...]
A definite possible improvement for later... :)
Thanks,
jon
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH 10/12] docs: kdoc: further rewrite_struct_members() cleanup
2025-08-06 13:00 ` Jonathan Corbet
@ 2025-08-06 21:27 ` Mauro Carvalho Chehab
0 siblings, 0 replies; 39+ messages in thread
From: Mauro Carvalho Chehab @ 2025-08-06 21:27 UTC (permalink / raw)
To: Jonathan Corbet; +Cc: linux-doc, linux-kernel, Akira Yokosawa
Em Wed, 06 Aug 2025 07:00:19 -0600
Jonathan Corbet <corbet@lwn.net> escreveu:
> Mauro Carvalho Chehab <mchehab+huawei@kernel.org> writes:
>
> >> > Btw, on this specific case, better to use non-capture group matches
> >> > to avoid those "empty" spaces, e.g. (if I got it right):
> >>
> >> The problem is this line here:
> >>
> >> oldmember = "".join(t) # Reconstruct the original formatting
> >>
> >> The regex *has* to capture the entire match string so that it can be
> >> reconstructed back to its original form, which we need to edit the full
> >> list of members later on.
> >>
> >> This code could use a deep rethink, but it works for now :)
> >
> > well, we can still do:
> >
> > for t in tuples:
> > maintype, -, -, content, -, s_ids = t
> > oldmember = "".join(t)
> >
> > this way, we'll be naming the relevant parameters and reconstructing
> > the the original form.
>
> I've already made a change much like that (the "-" syntax doesn't work,
> of course);
Gah! sorry for the typo. I meant to say:
maintype, _, _, content, _, s_ids = t
> I hope to post the updates series today, but it's going to
> be busy.
>
> > Btw, while re.findall() has an API that doesn't return match
> > objects which is incoherent with the normal re API, while looking
> > at the specs today(*), there is an alternative: re.finditer().
> > We could add it to KernRE cass and use it on a way that it will use
> > a Match instance. Something like:
> >
> > # Original regex expression
> > res = Re.finditer(...)
> > [...]
>
> A definite possible improvement for later... :)
Ok.
>
> Thanks,
>
> jon
Thanks,
Mauro
^ permalink raw reply [flat|nested] 39+ messages in thread
end of thread, other threads:[~2025-08-06 21:27 UTC | newest]
Thread overview: 39+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-08-01 0:13 [PATCH 00/12] docs: kdoc: thrash up dump_struct() Jonathan Corbet
2025-08-01 0:13 ` [PATCH 01/12] docs: kdoc: consolidate the stripping of private struct/union members Jonathan Corbet
2025-08-01 5:29 ` Mauro Carvalho Chehab
2025-08-01 0:13 ` [PATCH 02/12] docs: kdoc: Move a regex line in dump_struct() Jonathan Corbet
2025-08-01 5:29 ` Mauro Carvalho Chehab
2025-08-01 0:13 ` [PATCH 03/12] docs: kdoc: backslashectomy in kdoc_parser Jonathan Corbet
2025-08-01 4:27 ` Mauro Carvalho Chehab
2025-08-01 14:21 ` Jonathan Corbet
2025-08-04 12:58 ` Mauro Carvalho Chehab
2025-08-04 16:00 ` Mauro Carvalho Chehab
2025-08-04 18:29 ` Jonathan Corbet
2025-08-01 0:13 ` [PATCH 04/12] docs: kdoc: move the prefix transforms out of dump_struct() Jonathan Corbet
2025-08-01 5:28 ` Mauro Carvalho Chehab
2025-08-01 5:35 ` Mauro Carvalho Chehab
2025-08-01 0:13 ` [PATCH 05/12] docs: kdoc: split top-level prototype parsing " Jonathan Corbet
2025-08-01 5:34 ` Mauro Carvalho Chehab
2025-08-01 14:10 ` Jonathan Corbet
2025-08-04 12:20 ` Mauro Carvalho Chehab
2025-08-01 0:13 ` [PATCH 06/12] docs: kdoc: split struct-member rewriting " Jonathan Corbet
2025-08-01 5:37 ` Mauro Carvalho Chehab
2025-08-01 0:13 ` [PATCH 07/12] docs: kdoc: rework the rewrite_struct_members() main loop Jonathan Corbet
2025-08-01 5:42 ` Mauro Carvalho Chehab
2025-08-01 0:13 ` [PATCH 08/12] docs: kdoc: remove an extraneous strip() call Jonathan Corbet
2025-08-01 5:45 ` Mauro Carvalho Chehab
2025-08-01 0:13 ` [PATCH 09/12] docs: kdoc: Some rewrite_struct_members() commenting Jonathan Corbet
2025-08-01 5:50 ` Mauro Carvalho Chehab
2025-08-01 0:13 ` [PATCH 10/12] docs: kdoc: further rewrite_struct_members() cleanup Jonathan Corbet
2025-08-01 6:07 ` Mauro Carvalho Chehab
2025-08-01 22:52 ` Jonathan Corbet
2025-08-04 13:15 ` Mauro Carvalho Chehab
2025-08-05 22:46 ` Jonathan Corbet
2025-08-06 9:05 ` Mauro Carvalho Chehab
2025-08-06 13:00 ` Jonathan Corbet
2025-08-06 21:27 ` Mauro Carvalho Chehab
2025-08-01 0:13 ` [PATCH 11/12] docs: kdoc: extract output formatting from dump_struct() Jonathan Corbet
2025-08-01 6:09 ` Mauro Carvalho Chehab
2025-08-01 0:13 ` [PATCH 12/12] docs: kdoc: a few final dump_struct() touches Jonathan Corbet
2025-08-01 6:10 ` Mauro Carvalho Chehab
2025-08-01 6:23 ` [PATCH 00/12] docs: kdoc: thrash up dump_struct() Mauro Carvalho Chehab
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).