Re: [RFC v2 18/38] docs: sphinx/kernel_abi: use AbiParser directly

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

From: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
To: Jonathan Corbet <corbet@lwn.net>
Cc: Linux Doc Mailing List <linux-doc@vger.kernel.org>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	linux-kernel@vger.kernel.org
Subject: Re: [RFC v2 18/38] docs: sphinx/kernel_abi: use AbiParser directly
Date: Sun, 16 Feb 2025 13:53:56 +0100	[thread overview]
Message-ID: <20250216135356.188be652@foz.lan> (raw)
In-Reply-To: <20250210170354.18c04f7c@sal.lan>

Em Mon, 10 Feb 2025 17:03:54 +0100
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> escreveu:

> Em Mon, 10 Feb 2025 07:40:02 -0700
> Jonathan Corbet <corbet@lwn.net> escreveu:
> 
> > Mauro Carvalho Chehab <mchehab+huawei@kernel.org> writes:
> >   
> > > I took a look on Markus work: it was licensed under GPL 3.0 and it was
> > > written in 2016. There were several changes on kerneldoc since them,
> > > including the addition of a regex that it is not compatible with
> > > Python re[1]:
> > >
> > >         $members =~ s/\bSTRUCT_GROUP(\(((?:(?>[^)(]+)|(?1))*)\))[^;]*;/$2/gos;
> > >
> > > This one use:
> > >
> > > 	- recursive patterns: ?1
> > > 	- atomic grouping (?>...)
> > >
> > > Also, it is hard to map what he does with the existing script. I'm
> > > opting to write a new script from scratch.    
> > 
> > That's fine, I just wanted to be sure you'd had a chance to look at
> > it... 
> >   
> > >     Another option would be to re-implement such regexes without using
> > >     such advanced patterns.    
> > 
> > Seems like a preferred option if that can be done.  Banging one's head
> > against all those regexes is often the hardest part of dealing with that
> > script; anything that makes it simpler is welcome.  
> 
> Agreed. This one, in special, is very hard for me to understand, as I
> never used recursive patterns or atomic grouping. The net result of
> the struct_group*() handling is that it removes some parameters when
> generating the function prototype. This is done using a complex logic
> on two steps:
> 
>        # unwrap struct_group():
>         # - first eat non-declaration parameters and rewrite for final match
>         # - then remove macro, outer parens, and trailing semicolon
>         $members =~ s/\bstruct_group\s*\(([^,]*,)/STRUCT_GROUP(/gos;
>         $members =~ s/\bstruct_group_attr\s*\(([^,]*,){2}/STRUCT_GROUP(/gos;
>         $members =~ s/\bstruct_group_tagged\s*\(([^,]*),([^,]*),/struct $1 $2; STRUCT_GROUP(/gos;
>         $members =~ s/\b__struct_group\s*\(([^,]*,){3}/STRUCT_GROUP(/gos;
>         $members =~ s/\bSTRUCT_GROUP(\(((?:(?>[^)(]+)|(?1))*)\))[^;]*;/$2/gos;
> 
> 
> The first step basically eliminates some members of the function. At the
> places I checked, the second step was just removing parenthesis from the
> macro (and the STRUCT_GROUP name).
> 
> I suspect that the same result could be done with a much simpler expression
> like:
> 
> 	$members =~ s/\bSTRUCT_GROUP\((.*)\)[^;]*;/$2/gos;
> 
> But maybe there are some corner cases that would cause such simpler
> regex to fail.

A simpler regex didn't work.

Instead of using Python regex module and keep such complex regular
expressions, I opted to implement it on a completely different way.
See patch below.

Basically, I implemented a new class (NestedMatch) just to
handle patterns like:

	FOO(.*)

converting them into:

	.*

I expect that the new logic there would work properly for any
number of [], {} and () paired delimiters.

I ended writing such class from scratch, based on a suggestion from
stackoverflow.

Using such logic, replacing STRUCT_GROUP(.*) by .* is as simple
as:

        sub_nested_prefixes = [
            (re.compile(r'\bSTRUCT_GROUP'),  r'\1'),
        ]

        nested = NestedMatch()

        for search, sub in sub_nested_prefixes:
            members = nested.sub(search, sub, members)

This should probably help cleaning up other similar patterns.
Let's see. Anyway, I expect that this would make the C parsing
part easier to maintain.

For the first version, I'll use the new way only when I notice
discrepancies between Perl and Python versions.

Thanks,
Mauro

---

[PATCH] scripts/kernel-doc.py: properly handle struct_group macros

Handing nested parenthesis with regular expressions is not an
easy task. It is even harder with Python's re module, as it
has a limited subset of regular expressions, missing more
advanced features.

We might use instead Python regex module, but still the
regular expressions are very hard to understand. So, instead,
add a logic to properly match delimiters.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

diff --git a/scripts/kernel-doc.py b/scripts/kernel-doc.py
index 9b10fd05b6af..df824692c4e7 100755
--- a/scripts/kernel-doc.py
+++ b/scripts/kernel-doc.py
@@ -92,6 +92,138 @@ class Re:
     def group(self, num):
         return self.last_match.group(num)
 
+class NestedMatch:
+    """
+    Finding nested delimiters is hard with regular expressions. It is
+    even harder on Python with its normal re module, as there are several
+    advanced regular expressions that are missing.
+
+    This is the case of this pattern:
+
+            '\bSTRUCT_GROUP(\(((?:(?>[^)(]+)|(?1))*)\))[^;]*;'
+
+    which is used to properly match open/close parenthesis of the
+    string search STRUCT_GROUP(),
+
+    Add a class that counts pairs of delimiters, using it to match and
+    replace nested expressions.
+
+    The original approach was suggested by:
+        https://stackoverflow.com/questions/5454322/python-how-to-match-nested-parentheses-with-regex
+
+    Although I re-implemented it to make it more generic and match 3 types
+    of delimiters. The logic checks if delimiters are paired. If not, it
+    will ignore the search string.
+    """
+
+    DELIMITER_PAIRS = {
+        '{': '}',
+        '(': ')',
+        '[': ']',
+    }
+
+    RE_DELIM = re.compile(r'[\{\}\[\]\(\)]')
+
+    def _search(self, regex, line):
+        """
+        Finds paired blocks.
+
+        If a given regex is not empty, the logic will first seek for its
+        position, and then handles the next delimiters afterwards
+
+        The suggestion of using finditer to match pairs came from:
+        https://stackoverflow.com/questions/5454322/python-how-to-match-nested-parentheses-with-regex
+        but I ended using a different implementation to align all three types
+        of delimiters and seek for an initial regular expression.
+
+        The algorithm seeks for open/close paired delimiters and place them
+        into a stack, yielding a start/stop position of each match  when the
+        stack is zeroed.
+
+        The algorithm shoud work fine for properly paired lines, but will
+        silently ignore end delimiters that preceeds an start delimiter.
+        This should be OK for kernel-doc parser, as unaligned delimiters
+        would cause compilation errors. So, we don't need to rise exceptions
+        to cover such issues.
+        """
+
+        stack = []
+
+        for match_re in regex.finditer(line):
+            offset = match_re.end()
+
+            for match in self.RE_DELIM.finditer(line[offset:]):
+                pos = match.start() + offset
+
+                d = line[pos]
+
+                if d in self.DELIMITER_PAIRS.keys():
+                    end = self.DELIMITER_PAIRS[d]
+
+                    stack.append(end)
+                    continue
+
+                # Does the end delimiter match what it is expected?
+                if stack and d == stack[-1]:
+                    stack.pop()
+
+                    if not stack:
+                        yield offset, pos + 1
+
+    def search(self, regex, line):
+        """
+        This is similar to re.search:
+
+        It matches a regex that it is followed by a delimiter,
+        returning occurrences only if all delimiters are paired.
+        """
+
+        for prev_pos, pos in self._search(regex, line):
+
+            yield line[prev_pos:pos]
+
+    def sub(self, regex, sub, line, count=0):
+        """
+        This is similar to re.sub:
+
+        It matches a regex that it is followed by a delimiter,
+        replacing occurrences only if all delimiters are paired.
+
+        if r'\1' is used, it works just like re: it places there the
+        matched paired data with the delimiter stripped.
+
+        If count is different than zero, it will replace at most count
+        items.
+        """
+        out = ""
+
+        cur_pos = 0
+        n = 0
+
+        for start, end in self._search(regex, line):
+            out += line[cur_pos:start]
+
+            # Value, ignoring start/end delimiters
+            value = line[start + 1:end - 1]
+
+            # replaces \1 at the sub string, if \1 is used there
+            new_sub = sub
+            new_sub = new_sub.replace(r'\1', value)
+
+            out += new_sub
+
+            cur_pos = end
+            n += 1
+
+            if count and count >= n:
+                break
+
+        # Append the remaining string
+        l = len(line)
+        out += line[cur_pos:l]
+
+        return out
+
 #
 # Regular expressions used to parse kernel-doc markups at KernelDoc class.
 #
@@ -667,19 +799,40 @@ class KernelDoc:
             # Unwrap struct_group() based on this definition:
             # __struct_group(TAG, NAME, ATTRS, MEMBERS...)
             # which has variants like: struct_group(NAME, MEMBERS...)
+            #
+            # Parsing them are done on two steps:
+            # 1. drop arguments that aren't struct names;
+            # 2. remove STRUCT_GROUP() ancillary macro.
+            #
+            # NOTE: the original logic which replaces STRUCT_GROUP(.*) by .*
+            # is incompatible with Python re, as it uses:
+            #
+            #   - a recursive pattern: (?1)
+            #   - an atomic grouping: (?>...)
+            #
+            # This is the original expression:
+            #   '\bSTRUCT_GROUP(\(((?:(?>[^)(]+)|(?1))*)\))[^;]*;'
+            #
+            # I tried a simpler version: but it didn't work either:
+            #   \bSTRUCT_GROUP\(([^\)]+)\)[^;]*;
+            #
+            # As it doesn't properly match the end parenthesis.
+            #
+            # So, a better solution was crafted: there's now a NestedMatch
+            # class that ensures that delimiters after a search are properly
+            # matched. So, the implementation to drop STRUCT_GROUP() will be
+            # handled in separate.
 
             (Re(r'\bstruct_group\s*\(([^,]*,)', re.S),  r'STRUCT_GROUP('),
             (Re(r'\bstruct_group_attr\s*\(([^,]*,){2}', re.S),  r'STRUCT_GROUP('),
             (Re(r'\bstruct_group_tagged\s*\(([^,]*),([^,]*),', re.S),  r'struct \1 \2; STRUCT_GROUP('),
             (Re(r'\b__struct_group\s*\(([^,]*,){3}', re.S),  r'STRUCT_GROUP('),
 
-            # This is incompatible with Python re, as it uses:
-            #  recursive patterns ((?1)) and atomic grouping ((?>...)):
-            #   '\bSTRUCT_GROUP(\(((?:(?>[^)(]+)|(?1))*)\))[^;]*;'
-            # Let's see if this works instead:
-            (Re(r'\bSTRUCT_GROUP\(([^\)]+)\)[^;]*;', re.S),  r'\1'),
-
             # Replace macros
+            #
+            # TODO: it is better to also move those to the NestedMatch logic,
+            # to endure that parenthesis will be properly matched.
+
             (Re(r'__ETHTOOL_DECLARE_LINK_MODE_MASK\s*\(([^\)]+)\)', re.S),  r'DECLARE_BITMAP(\1, __ETHTOOL_LINK_MODE_MASK_NBITS)'),
             (Re(r'DECLARE_PHY_INTERFACE_MASK\s*\(([^\)]+)\)', re.S),  r'DECLARE_BITMAP(\1, PHY_INTERFACE_MODE_MAX)'),
             (Re(r'DECLARE_BITMAP\s*\(' + args_pattern + r',\s*' + args_pattern + r'\)', re.S),  r'unsigned long \1[BITS_TO_LONGS(\2)]'),
@@ -691,9 +844,18 @@ class KernelDoc:
             (Re(r'DEFINE_DMA_UNMAP_LEN\s*\(' + args_pattern + r'\)', re.S),  r'__u32 \1'),
         ]
 
+        sub_nested_prefixes = [
+            (re.compile(r'\bSTRUCT_GROUP'),  r'\1'),
+        ]
+
         for search, sub in sub_prefixes:
             members = search.sub(sub, members)
 
+        nested = NestedMatch()
+
+        for search, sub in sub_nested_prefixes:
+            members = nested.sub(search, sub, members)
+
         # Keeps the original declaration as-is
         declaration = members

next prev parent reply	other threads:[~2025-02-16 12:54 UTC|newest]

Thread overview: 61+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-01-28  0:05 [RFC v2 00/38] Improve ABI documentation generation Mauro Carvalho Chehab
2025-01-28  0:05 ` [RFC v2 01/38] docs: power: video.rst: fix a footnote reference Mauro Carvalho Chehab
2025-01-28 22:10   ` Jonathan Corbet
2025-01-28  0:05 ` [RFC v2 02/38] docs: media: ipu3: fix two footnote references Mauro Carvalho Chehab
2025-01-28  0:05 ` [RFC v2 03/38] docs: block: ublk.rst: remove a reference from a dropped text Mauro Carvalho Chehab
2025-01-28  0:05 ` [RFC v2 04/38] docs: sphinx: remove kernellog.py file Mauro Carvalho Chehab
2025-01-28  0:05 ` [RFC v2 05/38] docs: sphinx/kernel_abi: adjust coding style Mauro Carvalho Chehab
2025-02-04 17:19   ` Jonathan Corbet
2025-01-28  0:05 ` [RFC v2 06/38] docs: admin-guide: abi: add SPDX tags to ABI files Mauro Carvalho Chehab
2025-01-28  0:05 ` [RFC v2 07/38] ABI: sysfs-class-rfkill: fix kernelversion tags Mauro Carvalho Chehab
2025-01-28  0:05 ` [RFC v2 08/38] ABI: sysfs-bus-coresight-*: " Mauro Carvalho Chehab
2025-02-04 10:30   ` Suzuki K Poulose
2025-01-28  0:05 ` [RFC v2 09/38] ABI: sysfs-driver-dma-idxd: fix date tags Mauro Carvalho Chehab
2025-01-28  0:05 ` [RFC v2 10/38] ABI: sysfs-fs-f2fs: " Mauro Carvalho Chehab
2025-01-28  0:06 ` [RFC v2 11/38] ABI: sysfs-power: fix a what tag Mauro Carvalho Chehab
2025-01-28  0:06 ` [RFC v2 12/38] scripts/documentation-file-ref-check: don't check perl/python scripts Mauro Carvalho Chehab
2025-01-28  0:06 ` [RFC v2 13/38] scripts/get_abi.py: add a Python tool to generate ReST output Mauro Carvalho Chehab
2025-01-28  0:06 ` [RFC v2 14/38] scripts/get_abi.py: add support for symbol search Mauro Carvalho Chehab
2025-01-28  0:06 ` [RFC v2 15/38] docs: use get_abi.py for ABI generation Mauro Carvalho Chehab
2025-01-28  0:06 ` [RFC v2 16/38] scripts/get_abi.py: optimize parse_abi() function Mauro Carvalho Chehab
2025-01-28  0:06 ` [RFC v2 17/38] scripts/get_abi.py: use an interactor for ReST output Mauro Carvalho Chehab
2025-01-29 11:04   ` Akira Yokosawa
2025-01-29 14:06     ` Mauro Carvalho Chehab
2025-01-28  0:06 ` [RFC v2 18/38] docs: sphinx/kernel_abi: use AbiParser directly Mauro Carvalho Chehab
2025-01-28 22:37   ` Jonathan Corbet
2025-01-29  0:43     ` Mauro Carvalho Chehab
2025-02-02 14:56       ` Mauro Carvalho Chehab
2025-02-04 17:12         ` Jonathan Corbet
2025-02-10  7:27           ` Mauro Carvalho Chehab
2025-02-10 14:40             ` Jonathan Corbet
2025-02-10 16:03               ` Mauro Carvalho Chehab
2025-02-16 12:53                 ` Mauro Carvalho Chehab [this message]
2025-01-28  0:06 ` [RFC v2 19/38] docs: sphinx/kernel_abi: reduce buffer usage for ABI messages Mauro Carvalho Chehab
2025-01-28  0:06 ` [RFC v2 20/38] docs: sphinx/kernel_abi: properly split lines Mauro Carvalho Chehab
2025-01-28  0:06 ` [RFC v2 21/38] scripts/get_abi.pl: Add filtering capabilities to rest output Mauro Carvalho Chehab
2025-01-28  0:06 ` [RFC v2 22/38] scripts/get_abi.pl: add support to parse ABI README file Mauro Carvalho Chehab
2025-01-28  0:06 ` [RFC v2 23/38] docs: sphinx/kernel_abi: parse ABI files only once Mauro Carvalho Chehab
2025-01-28  0:06 ` [RFC v2 24/38] docs: admin-guide/abi: split files from symbols Mauro Carvalho Chehab
2025-01-28  0:06 ` [RFC v2 25/38] docs: sphinx/automarkup: add cross-references for ABI Mauro Carvalho Chehab
2025-01-28  0:06 ` [RFC v2 26/38] docs: sphinx/kernel_abi: avoid warnings during Sphinx module init Mauro Carvalho Chehab
2025-01-28  0:06 ` [RFC v2 27/38] scripts/get_abi.py: Rename title name for ABI files Mauro Carvalho Chehab
2025-01-28  0:06 ` [RFC v2 28/38] docs: media: Allow creating cross-references for RC ABI Mauro Carvalho Chehab
2025-01-28  0:06 ` [RFC v2 29/38] docs: thunderbolt: Allow creating cross-references for ABI Mauro Carvalho Chehab
2025-01-28 11:27   ` Mika Westerberg
2025-01-28  0:06 ` [RFC v2 30/38] docs: arm: asymmetric-32bit: " Mauro Carvalho Chehab
2025-01-28  0:06 ` [RFC v2 31/38] docs: arm: generic-counter: " Mauro Carvalho Chehab
2025-01-28  0:06 ` [RFC v2 32/38] docs: iio: Allow creating cross-references ABI Mauro Carvalho Chehab
2025-01-28 18:27   ` Jonathan Cameron
2025-01-28  0:06 ` [RFC v2 33/38] docs: networking: Allow creating cross-references statistics ABI Mauro Carvalho Chehab
2025-01-28  0:06 ` [RFC v2 34/38] docs: submit-checklist: Allow creating cross-references for ABI README Mauro Carvalho Chehab
2025-01-28  0:06 ` [RFC v2 35/38] docs: translations: " Mauro Carvalho Chehab
2025-01-28  0:06 ` [RFC v2 36/38] docs: ABI: drop two duplicate symbols Mauro Carvalho Chehab
2025-01-28  0:06 ` [RFC v2 37/38] scripts/get_abi.py: add support for undefined ABIs Mauro Carvalho Chehab
2025-01-28  0:06 ` [RFC v2 38/38] scripts/get_abi.pl: drop now obsoleted script Mauro Carvalho Chehab
2025-01-28 22:42 ` [RFC v2 00/38] Improve ABI documentation generation Jonathan Corbet
2025-01-29  1:45   ` Mauro Carvalho Chehab
2025-01-29 14:22     ` Mauro Carvalho Chehab
2025-01-29 15:41   ` Mauro Carvalho Chehab
2025-01-29 15:58     ` Jonathan Corbet
2025-01-29 16:19       ` Mauro Carvalho Chehab
2025-03-27 21:58 ` [f2fs-dev] " patchwork-bot+f2fs

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:9b10fd05b6a dfblob:df824692c4e )
 OR (
bs:"Re: [RFC v2 18/38] docs: sphinx/kernel_abi: use AbiParser directly" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20250216135356.188be652@foz.lan \
    --to=mchehab+huawei@kernel.org \
    --cc=corbet@lwn.net \
    --cc=gregkh@linuxfoundation.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox