All of lore.kernel.org
 help / color / mirror / Atom feed
From: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
To: Jonathan Corbet <corbet@lwn.net>
Cc: Linux Doc Mailing List <linux-doc@vger.kernel.org>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	linux-kernel@vger.kernel.org
Subject: Re: [RFC v2 18/38] docs: sphinx/kernel_abi: use AbiParser directly
Date: Sun, 16 Feb 2025 13:53:56 +0100	[thread overview]
Message-ID: <20250216135356.188be652@foz.lan> (raw)
In-Reply-To: <20250210170354.18c04f7c@sal.lan>

Em Mon, 10 Feb 2025 17:03:54 +0100
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> escreveu:

> Em Mon, 10 Feb 2025 07:40:02 -0700
> Jonathan Corbet <corbet@lwn.net> escreveu:
> 
> > Mauro Carvalho Chehab <mchehab+huawei@kernel.org> writes:
> >   
> > > I took a look on Markus work: it was licensed under GPL 3.0 and it was
> > > written in 2016. There were several changes on kerneldoc since them,
> > > including the addition of a regex that it is not compatible with
> > > Python re[1]:
> > >
> > >         $members =~ s/\bSTRUCT_GROUP(\(((?:(?>[^)(]+)|(?1))*)\))[^;]*;/$2/gos;
> > >
> > > This one use:
> > >
> > > 	- recursive patterns: ?1
> > > 	- atomic grouping (?>...)
> > >
> > > Also, it is hard to map what he does with the existing script. I'm
> > > opting to write a new script from scratch.    
> > 
> > That's fine, I just wanted to be sure you'd had a chance to look at
> > it... 
> >   
> > >     Another option would be to re-implement such regexes without using
> > >     such advanced patterns.    
> > 
> > Seems like a preferred option if that can be done.  Banging one's head
> > against all those regexes is often the hardest part of dealing with that
> > script; anything that makes it simpler is welcome.  
> 
> Agreed. This one, in special, is very hard for me to understand, as I
> never used recursive patterns or atomic grouping. The net result of
> the struct_group*() handling is that it removes some parameters when
> generating the function prototype. This is done using a complex logic
> on two steps:
> 
>        # unwrap struct_group():
>         # - first eat non-declaration parameters and rewrite for final match
>         # - then remove macro, outer parens, and trailing semicolon
>         $members =~ s/\bstruct_group\s*\(([^,]*,)/STRUCT_GROUP(/gos;
>         $members =~ s/\bstruct_group_attr\s*\(([^,]*,){2}/STRUCT_GROUP(/gos;
>         $members =~ s/\bstruct_group_tagged\s*\(([^,]*),([^,]*),/struct $1 $2; STRUCT_GROUP(/gos;
>         $members =~ s/\b__struct_group\s*\(([^,]*,){3}/STRUCT_GROUP(/gos;
>         $members =~ s/\bSTRUCT_GROUP(\(((?:(?>[^)(]+)|(?1))*)\))[^;]*;/$2/gos;
> 
> 
> The first step basically eliminates some members of the function. At the
> places I checked, the second step was just removing parenthesis from the
> macro (and the STRUCT_GROUP name).
> 
> I suspect that the same result could be done with a much simpler expression
> like:
> 
> 	$members =~ s/\bSTRUCT_GROUP\((.*)\)[^;]*;/$2/gos;
> 
> But maybe there are some corner cases that would cause such simpler
> regex to fail.

A simpler regex didn't work.

Instead of using Python regex module and keep such complex regular
expressions, I opted to implement it on a completely different way.
See patch below.

Basically, I implemented a new class (NestedMatch) just to
handle patterns like:

	FOO(.*)

converting them into:

	.*

I expect that the new logic there would work properly for any
number of [], {} and () paired delimiters.

I ended writing such class from scratch, based on a suggestion from
stackoverflow.

Using such logic, replacing STRUCT_GROUP(.*) by .* is as simple
as:

        sub_nested_prefixes = [
            (re.compile(r'\bSTRUCT_GROUP'),  r'\1'),
        ]

        nested = NestedMatch()

        for search, sub in sub_nested_prefixes:
            members = nested.sub(search, sub, members)

This should probably help cleaning up other similar patterns.
Let's see. Anyway, I expect that this would make the C parsing
part easier to maintain.

For the first version, I'll use the new way only when I notice
discrepancies between Perl and Python versions.

Thanks,
Mauro

---

[PATCH] scripts/kernel-doc.py: properly handle struct_group macros

Handing nested parenthesis with regular expressions is not an
easy task. It is even harder with Python's re module, as it
has a limited subset of regular expressions, missing more
advanced features.

We might use instead Python regex module, but still the
regular expressions are very hard to understand. So, instead,
add a logic to properly match delimiters.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

diff --git a/scripts/kernel-doc.py b/scripts/kernel-doc.py
index 9b10fd05b6af..df824692c4e7 100755
--- a/scripts/kernel-doc.py
+++ b/scripts/kernel-doc.py
@@ -92,6 +92,138 @@ class Re:
     def group(self, num):
         return self.last_match.group(num)
 
+class NestedMatch:
+    """
+    Finding nested delimiters is hard with regular expressions. It is
+    even harder on Python with its normal re module, as there are several
+    advanced regular expressions that are missing.
+
+    This is the case of this pattern:
+
+            '\bSTRUCT_GROUP(\(((?:(?>[^)(]+)|(?1))*)\))[^;]*;'
+
+    which is used to properly match open/close parenthesis of the
+    string search STRUCT_GROUP(),
+
+    Add a class that counts pairs of delimiters, using it to match and
+    replace nested expressions.
+
+    The original approach was suggested by:
+        https://stackoverflow.com/questions/5454322/python-how-to-match-nested-parentheses-with-regex
+
+    Although I re-implemented it to make it more generic and match 3 types
+    of delimiters. The logic checks if delimiters are paired. If not, it
+    will ignore the search string.
+    """
+
+    DELIMITER_PAIRS = {
+        '{': '}',
+        '(': ')',
+        '[': ']',
+    }
+
+    RE_DELIM = re.compile(r'[\{\}\[\]\(\)]')
+
+    def _search(self, regex, line):
+        """
+        Finds paired blocks.
+
+        If a given regex is not empty, the logic will first seek for its
+        position, and then handles the next delimiters afterwards
+
+        The suggestion of using finditer to match pairs came from:
+        https://stackoverflow.com/questions/5454322/python-how-to-match-nested-parentheses-with-regex
+        but I ended using a different implementation to align all three types
+        of delimiters and seek for an initial regular expression.
+
+        The algorithm seeks for open/close paired delimiters and place them
+        into a stack, yielding a start/stop position of each match  when the
+        stack is zeroed.
+
+        The algorithm shoud work fine for properly paired lines, but will
+        silently ignore end delimiters that preceeds an start delimiter.
+        This should be OK for kernel-doc parser, as unaligned delimiters
+        would cause compilation errors. So, we don't need to rise exceptions
+        to cover such issues.
+        """
+
+        stack = []
+
+        for match_re in regex.finditer(line):
+            offset = match_re.end()
+
+            for match in self.RE_DELIM.finditer(line[offset:]):
+                pos = match.start() + offset
+
+                d = line[pos]
+
+                if d in self.DELIMITER_PAIRS.keys():
+                    end = self.DELIMITER_PAIRS[d]
+
+                    stack.append(end)
+                    continue
+
+                # Does the end delimiter match what it is expected?
+                if stack and d == stack[-1]:
+                    stack.pop()
+
+                    if not stack:
+                        yield offset, pos + 1
+
+    def search(self, regex, line):
+        """
+        This is similar to re.search:
+
+        It matches a regex that it is followed by a delimiter,
+        returning occurrences only if all delimiters are paired.
+        """
+
+        for prev_pos, pos in self._search(regex, line):
+
+            yield line[prev_pos:pos]
+
+    def sub(self, regex, sub, line, count=0):
+        """
+        This is similar to re.sub:
+
+        It matches a regex that it is followed by a delimiter,
+        replacing occurrences only if all delimiters are paired.
+
+        if r'\1' is used, it works just like re: it places there the
+        matched paired data with the delimiter stripped.
+
+        If count is different than zero, it will replace at most count
+        items.
+        """
+        out = ""
+
+        cur_pos = 0
+        n = 0
+
+        for start, end in self._search(regex, line):
+            out += line[cur_pos:start]
+
+            # Value, ignoring start/end delimiters
+            value = line[start + 1:end - 1]
+
+            # replaces \1 at the sub string, if \1 is used there
+            new_sub = sub
+            new_sub = new_sub.replace(r'\1', value)
+
+            out += new_sub
+
+            cur_pos = end
+            n += 1
+
+            if count and count >= n:
+                break
+
+        # Append the remaining string
+        l = len(line)
+        out += line[cur_pos:l]
+
+        return out
+
 #
 # Regular expressions used to parse kernel-doc markups at KernelDoc class.
 #
@@ -667,19 +799,40 @@ class KernelDoc:
             # Unwrap struct_group() based on this definition:
             # __struct_group(TAG, NAME, ATTRS, MEMBERS...)
             # which has variants like: struct_group(NAME, MEMBERS...)
+            #
+            # Parsing them are done on two steps:
+            # 1. drop arguments that aren't struct names;
+            # 2. remove STRUCT_GROUP() ancillary macro.
+            #
+            # NOTE: the original logic which replaces STRUCT_GROUP(.*) by .*
+            # is incompatible with Python re, as it uses:
+            #
+            #   - a recursive pattern: (?1)
+            #   - an atomic grouping: (?>...)
+            #
+            # This is the original expression:
+            #   '\bSTRUCT_GROUP(\(((?:(?>[^)(]+)|(?1))*)\))[^;]*;'
+            #
+            # I tried a simpler version: but it didn't work either:
+            #   \bSTRUCT_GROUP\(([^\)]+)\)[^;]*;
+            #
+            # As it doesn't properly match the end parenthesis.
+            #
+            # So, a better solution was crafted: there's now a NestedMatch
+            # class that ensures that delimiters after a search are properly
+            # matched. So, the implementation to drop STRUCT_GROUP() will be
+            # handled in separate.
 
             (Re(r'\bstruct_group\s*\(([^,]*,)', re.S),  r'STRUCT_GROUP('),
             (Re(r'\bstruct_group_attr\s*\(([^,]*,){2}', re.S),  r'STRUCT_GROUP('),
             (Re(r'\bstruct_group_tagged\s*\(([^,]*),([^,]*),', re.S),  r'struct \1 \2; STRUCT_GROUP('),
             (Re(r'\b__struct_group\s*\(([^,]*,){3}', re.S),  r'STRUCT_GROUP('),
 
-            # This is incompatible with Python re, as it uses:
-            #  recursive patterns ((?1)) and atomic grouping ((?>...)):
-            #   '\bSTRUCT_GROUP(\(((?:(?>[^)(]+)|(?1))*)\))[^;]*;'
-            # Let's see if this works instead:
-            (Re(r'\bSTRUCT_GROUP\(([^\)]+)\)[^;]*;', re.S),  r'\1'),
-
             # Replace macros
+            #
+            # TODO: it is better to also move those to the NestedMatch logic,
+            # to endure that parenthesis will be properly matched.
+
             (Re(r'__ETHTOOL_DECLARE_LINK_MODE_MASK\s*\(([^\)]+)\)', re.S),  r'DECLARE_BITMAP(\1, __ETHTOOL_LINK_MODE_MASK_NBITS)'),
             (Re(r'DECLARE_PHY_INTERFACE_MASK\s*\(([^\)]+)\)', re.S),  r'DECLARE_BITMAP(\1, PHY_INTERFACE_MODE_MAX)'),
             (Re(r'DECLARE_BITMAP\s*\(' + args_pattern + r',\s*' + args_pattern + r'\)', re.S),  r'unsigned long \1[BITS_TO_LONGS(\2)]'),
@@ -691,9 +844,18 @@ class KernelDoc:
             (Re(r'DEFINE_DMA_UNMAP_LEN\s*\(' + args_pattern + r'\)', re.S),  r'__u32 \1'),
         ]
 
+        sub_nested_prefixes = [
+            (re.compile(r'\bSTRUCT_GROUP'),  r'\1'),
+        ]
+
         for search, sub in sub_prefixes:
             members = search.sub(sub, members)
 
+        nested = NestedMatch()
+
+        for search, sub in sub_nested_prefixes:
+            members = nested.sub(search, sub, members)
+
         # Keeps the original declaration as-is
         declaration = members
 





  reply	other threads:[~2025-02-16 12:54 UTC|newest]

Thread overview: 70+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-01-28  0:05 [RFC v2 00/38] Improve ABI documentation generation Mauro Carvalho Chehab
2025-01-28  0:05 ` [f2fs-dev] " Mauro Carvalho Chehab via Linux-f2fs-devel
2025-01-28  0:05 ` [RFC v2 01/38] docs: power: video.rst: fix a footnote reference Mauro Carvalho Chehab
2025-01-28 22:10   ` Jonathan Corbet
2025-01-28  0:05 ` [RFC v2 02/38] docs: media: ipu3: fix two footnote references Mauro Carvalho Chehab
2025-01-28  0:05 ` [RFC v2 03/38] docs: block: ublk.rst: remove a reference from a dropped text Mauro Carvalho Chehab
2025-01-28  0:05 ` [RFC v2 04/38] docs: sphinx: remove kernellog.py file Mauro Carvalho Chehab
2025-01-28  0:05 ` [RFC v2 05/38] docs: sphinx/kernel_abi: adjust coding style Mauro Carvalho Chehab
2025-02-04 17:19   ` Jonathan Corbet
2025-01-28  0:05 ` [RFC v2 06/38] docs: admin-guide: abi: add SPDX tags to ABI files Mauro Carvalho Chehab
2025-01-28  0:05 ` [RFC v2 07/38] ABI: sysfs-class-rfkill: fix kernelversion tags Mauro Carvalho Chehab
2025-01-28  0:05 ` [RFC v2 08/38] ABI: sysfs-bus-coresight-*: " Mauro Carvalho Chehab
2025-02-04 10:30   ` Suzuki K Poulose
2025-01-28  0:05 ` [RFC v2 09/38] ABI: sysfs-driver-dma-idxd: fix date tags Mauro Carvalho Chehab
2025-01-28  0:05 ` [RFC v2 10/38] ABI: sysfs-fs-f2fs: " Mauro Carvalho Chehab
2025-01-28  0:05   ` [f2fs-dev] " Mauro Carvalho Chehab via Linux-f2fs-devel
2025-01-28  0:06 ` [RFC v2 11/38] ABI: sysfs-power: fix a what tag Mauro Carvalho Chehab
2025-01-28  0:06 ` [RFC v2 12/38] scripts/documentation-file-ref-check: don't check perl/python scripts Mauro Carvalho Chehab
2025-01-28  0:06 ` [RFC v2 13/38] scripts/get_abi.py: add a Python tool to generate ReST output Mauro Carvalho Chehab
2025-01-28  0:06 ` [RFC v2 14/38] scripts/get_abi.py: add support for symbol search Mauro Carvalho Chehab
2025-01-28  0:06 ` [RFC v2 15/38] docs: use get_abi.py for ABI generation Mauro Carvalho Chehab
2025-01-28  0:06 ` [RFC v2 16/38] scripts/get_abi.py: optimize parse_abi() function Mauro Carvalho Chehab
2025-01-28  0:06 ` [RFC v2 17/38] scripts/get_abi.py: use an interactor for ReST output Mauro Carvalho Chehab
2025-01-29 11:04   ` Akira Yokosawa
2025-01-29 14:06     ` Mauro Carvalho Chehab
2025-01-28  0:06 ` [RFC v2 18/38] docs: sphinx/kernel_abi: use AbiParser directly Mauro Carvalho Chehab
2025-01-28 22:37   ` Jonathan Corbet
2025-01-29  0:43     ` Mauro Carvalho Chehab
2025-02-02 14:56       ` Mauro Carvalho Chehab
2025-02-04 17:12         ` Jonathan Corbet
2025-02-10  7:27           ` Mauro Carvalho Chehab
2025-02-10 14:40             ` Jonathan Corbet
2025-02-10 16:03               ` Mauro Carvalho Chehab
2025-02-16 12:53                 ` Mauro Carvalho Chehab [this message]
2025-01-28  0:06 ` [RFC v2 19/38] docs: sphinx/kernel_abi: reduce buffer usage for ABI messages Mauro Carvalho Chehab
2025-01-28  0:06 ` [RFC v2 20/38] docs: sphinx/kernel_abi: properly split lines Mauro Carvalho Chehab
2025-01-28  0:06 ` [RFC v2 21/38] scripts/get_abi.pl: Add filtering capabilities to rest output Mauro Carvalho Chehab
2025-01-28  0:06 ` [RFC v2 22/38] scripts/get_abi.pl: add support to parse ABI README file Mauro Carvalho Chehab
2025-01-28  0:06 ` [RFC v2 23/38] docs: sphinx/kernel_abi: parse ABI files only once Mauro Carvalho Chehab
2025-01-28  0:06 ` [RFC v2 24/38] docs: admin-guide/abi: split files from symbols Mauro Carvalho Chehab
2025-01-28  0:06 ` [RFC v2 25/38] docs: sphinx/automarkup: add cross-references for ABI Mauro Carvalho Chehab
2025-01-28  0:06 ` [RFC v2 26/38] docs: sphinx/kernel_abi: avoid warnings during Sphinx module init Mauro Carvalho Chehab
2025-01-28  0:06 ` [RFC v2 27/38] scripts/get_abi.py: Rename title name for ABI files Mauro Carvalho Chehab
2025-01-28  0:06 ` [RFC v2 28/38] docs: media: Allow creating cross-references for RC ABI Mauro Carvalho Chehab
2025-01-28  0:06 ` [RFC v2 29/38] docs: thunderbolt: Allow creating cross-references for ABI Mauro Carvalho Chehab
2025-01-28 11:27   ` Mika Westerberg
2025-01-28  0:06 ` [RFC v2 30/38] docs: arm: asymmetric-32bit: " Mauro Carvalho Chehab
2025-01-28  0:06 ` [RFC v2 31/38] docs: arm: generic-counter: " Mauro Carvalho Chehab
2025-01-28  0:06 ` [RFC v2 32/38] docs: iio: Allow creating cross-references ABI Mauro Carvalho Chehab
2025-01-28 18:27   ` Jonathan Cameron
2025-01-28  0:06 ` [RFC v2 33/38] docs: networking: Allow creating cross-references statistics ABI Mauro Carvalho Chehab
2025-01-28  0:06 ` [RFC v2 34/38] docs: submit-checklist: Allow creating cross-references for ABI README Mauro Carvalho Chehab
2025-01-28  0:06 ` [RFC v2 35/38] docs: translations: " Mauro Carvalho Chehab
2025-01-28  0:06 ` [RFC v2 36/38] docs: ABI: drop two duplicate symbols Mauro Carvalho Chehab
2025-01-28  0:06 ` [RFC v2 37/38] scripts/get_abi.py: add support for undefined ABIs Mauro Carvalho Chehab
2025-01-28  0:06 ` [RFC v2 38/38] scripts/get_abi.pl: drop now obsoleted script Mauro Carvalho Chehab
2025-01-28 22:42 ` [RFC v2 00/38] Improve ABI documentation generation Jonathan Corbet
2025-01-28 22:42   ` [f2fs-dev] " Jonathan Corbet
2025-01-29  1:45   ` Mauro Carvalho Chehab
2025-01-29  1:45     ` [f2fs-dev] " Mauro Carvalho Chehab via Linux-f2fs-devel
2025-01-29 14:22     ` Mauro Carvalho Chehab
2025-01-29 14:22       ` [f2fs-dev] " Mauro Carvalho Chehab via Linux-f2fs-devel
2025-01-29 15:41   ` Mauro Carvalho Chehab
2025-01-29 15:41     ` [f2fs-dev] " Mauro Carvalho Chehab via Linux-f2fs-devel
2025-01-29 15:58     ` Jonathan Corbet
2025-01-29 15:58       ` [f2fs-dev] " Jonathan Corbet
2025-01-29 16:19       ` Mauro Carvalho Chehab
2025-01-29 16:19         ` [f2fs-dev] " Mauro Carvalho Chehab via Linux-f2fs-devel
2025-03-27 21:58 ` patchwork-bot+f2fs
2025-03-27 21:58   ` patchwork-bot+f2fs--- via Linux-f2fs-devel

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20250216135356.188be652@foz.lan \
    --to=mchehab+huawei@kernel.org \
    --cc=corbet@lwn.net \
    --cc=gregkh@linuxfoundation.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.