[PATCH 0/8] Fix Python string escapes

From: Benjamin Gray <bgray@linux.ibm.com>
To: linux-kernel@vger.kernel.org, linux-ia64@vger.kernel.org,
	linux-doc@vger.kernel.org, bpf@vger.kernel.org,
	linux-pm@vger.kernel.org
Cc: abbotti@mev.co.uk, hsweeten@visionengravers.com,
	jan.kiszka@siemens.com, kbingham@kernel.org, mykolal@fb.com,
	Benjamin Gray <bgray@linux.ibm.com>
Subject: [PATCH 0/8] Fix Python string escapes
Date: Mon, 14 Aug 2023 16:06:56 +1000
Message-ID: <20230814060704.79655-1-bgray@linux.ibm.com>

Python 3.6 introduced a DeprecationWarning for invalid escape sequences.
This is upgraded to a SyntaxWarning in Python 3.12 (3.12.0rc1), and is
intended to eventually be a syntax error.

This series fixes the existing invalid escapes now, before they become
errors.
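
As a rough illustration of the warning described above, the sketch below
compiles a snippet and records which warning categories are raised. The
helper name escape_warning_categories is made up for this example; the
category reported depends on the interpreter version (DeprecationWarning
before 3.12, SyntaxWarning from 3.12 on).

```python
import warnings


def escape_warning_categories(source: str) -> list[str]:
    """Compile source and return the names of the warning categories raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # do not let default filters hide anything
        compile(source, "<example>", "exec")
    return [w.category.__name__ for w in caught]


# "\d" is not a recognised escape; r"\d" and "\\d" are both fine.
print(escape_warning_categories('pattern = "\\d+"'))
print(escape_warning_categories('pattern = r"\\d+"'))
```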

Most of the patches are generated by the Python script below. It uses
the built-in ast module to parse a Python file, locate every string
literal, find the corresponding source text, and check it for invalid
escape sequences.
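
A minimal sketch of the "locate all strings and find their source" step.
It is not the script's own code: it assumes Python 3.8+ and uses the
standard ast.get_source_segment() instead of a hand-written offset
helper. The function name string_literals is invented for the example.

```python
import ast


def string_literals(source: str) -> list[tuple[int, str]]:
    """Return (line number, source text) for each string constant in source."""
    tree = ast.parse(source)
    found = []
    for node in ast.walk(tree):
        # Only str constants are of interest; bytes and numbers are skipped.
        if isinstance(node, ast.Constant) and isinstance(node.value, str):
            segment = ast.get_source_segment(source, node)
            if segment is not None:
                found.append((node.lineno, segment))
    return found


for lineno, text in string_literals('x = "a\\tb"\ny = r"\\d+"\n'):
    print(lineno, text)
```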

If it finds anything to fix, it applies the fixes and reparses the file
into a fixed AST. Finally, it compares dumps of the original and fixed
ASTs to ensure no semantic difference has been introduced (ast.dump()
omits node location information, which is expected to differ).
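
The comparison can be sketched on its own as follows. The helper name
same_ast is hypothetical, and warnings are silenced here only because
the unfixed input is expected to trigger the invalid-escape warning.

```python
import ast
import warnings


def same_ast(before: str, after: str) -> bool:
    """True if both sources parse to the same tree, ignoring node locations."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")  # the unfixed source is expected to warn
        old_tree = ast.parse(before)
        new_tree = ast.parse(after)
    # ast.dump() leaves out lineno/col_offset unless include_attributes=True.
    return ast.dump(old_tree) == ast.dump(new_tree)


# Doubling the backslash in "\d" keeps the value; doing so in "\n" changes it.
print(same_ast('p = "\\d"', 'p = "\\\\d"'))
print(same_ast('p = "\\n"', 'p = "\\\\n"'))
```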

The ast module has some limitations; in particular, it throws away a
lot of information about the string source. f-strings interact
especially poorly here (the literal slices between format fields are
presented as separate strings, but the node range of each is the entire
f-string), so the script skips them. f-strings are handled manually in
the final patch.
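
To see how an f-string is split up, the sketch below lists the node
types that make up one. The helper name fstring_parts is made up for
illustration. Only the node types are shown, because the positions
reported for the constant slices depend on the interpreter version.

```python
import ast


def fstring_parts(expression: str) -> list[str]:
    """Return the AST node type names that make up an f-string expression."""
    node = ast.parse(expression, mode="eval").body
    if not isinstance(node, ast.JoinedStr):
        raise ValueError("not an f-string")
    return [type(part).__name__ for part in node.values]


# The literal text on each side of {ident} becomes its own Constant node.
print(fstring_parts('f"id={ident} done"'))
```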

Many of the fixes are for regex patterns, which could be changed to use
r-strings instead. That is harder to automate, so I avoided doing so in
this series. AST verification should still be possible though, because
whether a string is plain or raw is not recorded in the AST.
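
A small check of that last point, assuming nothing beyond the standard
ast module (dump_of is a name invented here): a plain string with a
doubled backslash and the equivalent r-string dump identically.

```python
import ast


def dump_of(source: str) -> str:
    """Dump the AST of source without location information."""
    return ast.dump(ast.parse(source))


plain = 'pattern = "\\\\d+"'  # source text: pattern = "\\d+"
raw = 'pattern = r"\\d+"'     # source text: pattern = r"\d+"

# Both literals evaluate to the same three characters, and the AST keeps
# only the value, not the spelling.
print(dump_of(plain) == dump_of(raw))
```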

---
#!/usr/bin/env python3

"""
Fix all bad escape characters in strings
"""

import ast
from pathlib import Path

def get_offset(source: str, row: int, col: int) -> int:
    """
    Turn a row + column pair into a byte offset
    """
    offset = 0

    cur_row = 1  # 1-indexed rows
    cur_col = 0  # 0-indexed columns

    for c in source:
        if cur_row == row and cur_col == col:
            return offset

        offset += 1

        if c == "\n":
            cur_row += 1
            cur_col = 0
        else:
            cur_col += 1

    raise Exception("Failed to get offset")


parse_failures: list[Path] = []
fix_failures = 0
bad_escapes = 0
fstrings: set[Path] = set()

for pyfile in Path(".").glob("**/*.py"):
    content = pyfile.read_text("utf-8")

    try:
        syntax = ast.parse(content, filename=pyfile)
    except (SyntaxError, ValueError):
        print(f"{pyfile}: ERROR Failed to parse, is it Python3?")
        parse_failures.append(pyfile)
        continue

    fixes: list[tuple[int, int, str]] = []

    for node in ast.walk(syntax):
        if not isinstance(node, ast.Constant):
            continue

        if not isinstance(node.value, str):
            continue

        if node.value.count("\\") == 0:
            continue

        assert(isinstance(node.end_lineno, int))
        assert(isinstance(node.end_col_offset, int))

        start = get_offset(content, node.lineno, node.col_offset)
        end = get_offset(content, node.end_lineno, node.end_col_offset)
        raw = content[start:end]

        # backslashes in r-strings are already literal
        if raw.startswith(("r", "R")):
            continue

        # f-strings are difficult to handle correctly
        if raw.startswith(("f", "F")):
            fstrings.add(pyfile)
            continue

        fixed = ""  # The fixed representation of the string
        escaped = False  # If the current character is escaped by a previous backslash
        allowed = '\n\\\'"abfnrtv01234567xNuU'  # characters allowed after a backslash

        for c in raw:
            if escaped:
                if c not in allowed:
                    fixed += '\\'

                fixed += c
                escaped = False
                continue

            fixed += c

            if c == '\\':
                escaped = True

        if fixed != raw:
            print(f"{pyfile}:{node.lineno}:{node.col_offset}: FOUND {raw}")
            fixes.append((start, end, fixed))

    if len(fixes) == 0:
        continue

    bad_escapes += len(fixes)

    # Apply fixes in reverse order to keep offsets valid
    for start, end, fix in reversed(sorted(fixes, key=lambda k: k[0])):
        print(f"{pyfile}:[{start}-{end}]: APPLY {fix}")
        content = content[:start] + fix + content[end:]

    fixed_syntax = ast.parse(content, filename=f"{pyfile}+fixed")

    if ast.dump(syntax) != ast.dump(fixed_syntax):
        print(f"{pyfile}: ERROR Fixed syntax tree yields different AST")
        fix_failures += 1
        continue

    pyfile.write_text(content, "utf-8")


print(f"---------------------------------")
print(f"Parse failures:               {len(parse_failures)}")
for f in sorted(parse_failures):
    print(f"  - {f}")

print(f"Bad escapes fixed:            {bad_escapes}")
print(f"Fixes that broke the AST:     {fix_failures}")
print(f"Files with skipped f-strings: {len(fstrings)}")
for f in sorted(fstrings):
    print(f"  - {f}")

---
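
The character loop at the heart of the script can be tried in isolation.
The sketch below wraps the same logic in a function; the names
fix_escapes and ALLOWED are introduced here for illustration and do not
appear in the script.

```python
ALLOWED = '\n\\\'"abfnrtv01234567xNuU'  # characters allowed after a backslash


def fix_escapes(raw: str) -> str:
    """Double any backslash in raw that does not start a valid escape."""
    fixed = ""
    escaped = False  # True if the previous character was an unescaped backslash
    for c in raw:
        if escaped and c not in ALLOWED:
            fixed += "\\"  # turn the invalid escape into an escaped backslash
        fixed += c
        escaped = (c == "\\") and not escaped
    return fixed


# Source text "\d+\n": only the backslash before d is doubled.
print(fix_escapes('"\\d+\\n"'))
```

Like the script, this works on the literal's source text, quotes
included, and leaves already-valid escapes such as \n or \\ untouched.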

Benjamin Gray (8):
  ia64: fix Python string escapes
  Documentation/sphinx: fix Python string escapes
  drivers/comedi: fix Python string escapes
  scripts: fix Python string escapes
  tools/perf: fix Python string escapes
  tools/power: fix Python string escapes
  selftests/bpf: fix Python string escapes
  selftests/bpf: fix Python string escapes in f-strings

 Documentation/sphinx/cdomain.py               |  2 +-
 Documentation/sphinx/kernel_abi.py            |  2 +-
 Documentation/sphinx/kernel_feat.py           |  2 +-
 Documentation/sphinx/kerneldoc.py             |  2 +-
 Documentation/sphinx/maintainers_include.py   |  8 +--
 arch/ia64/scripts/unwcheck.py                 |  2 +-
 .../ni_routing/tools/convert_csv_to_c.py      |  2 +-
 scripts/bpf_doc.py                            | 56 +++++++++----------
 scripts/clang-tools/gen_compile_commands.py   |  2 +-
 scripts/gdb/linux/symbols.py                  |  2 +-
 tools/perf/pmu-events/jevents.py              |  2 +-
 .../scripts/python/arm-cs-trace-disasm.py     |  4 +-
 tools/perf/scripts/python/compaction-times.py |  2 +-
 .../scripts/python/exported-sql-viewer.py     |  4 +-
 tools/power/pm-graph/bootgraph.py             | 12 ++--
 .../selftests/bpf/test_bpftool_synctypes.py   | 26 ++++-----
 tools/testing/selftests/bpf/test_offload.py   |  2 +-
 17 files changed, 66 insertions(+), 66 deletions(-)

--
2.41.0
