From: Mauro Carvalho Chehab <mchehab@kernel.org>
To: Randy Dunlap <rdunlap@infradead.org>
Cc: "Michal Suchánek" <msuchanek@suse.de>,
"Matthew Wilcox" <willy@infradead.org>,
"Markus Heiser" <markus.heiser@darmarit.de>,
linux-doc@vger.kernel.org, "Jonathan Corbet" <corbet@lwn.net>
Subject: Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
Date: Mon, 10 May 2021 10:17:57 +0200
Message-ID: <20210510101757.145087d3@coco.lan>
In-Reply-To: <347657c8-f5ae-517c-0b43-fb60d50f1dd8@infradead.org>
On Sat, 8 May 2021 08:55:11 -0700
Randy Dunlap <rdunlap@infradead.org> wrote:
> > In the mean time, I'm already preparing a patch series addressing
> > the issues inside documentation, using some scripting to avoid
> > manual mistakes:
> >
> > https://git.linuxtv.org/mchehab/experimental.git/log/?h=fix_utf8
> >
> > (patch series is not 100% yet... some adjustments are still
> > needed on some places).
>
>
> Thanks for digging into this and providing fixes.
Just pushed a new version there, rebasing the branch:
https://git.linuxtv.org/mchehab/experimental.git/log/?h=fix_utf8
I'll be submitting the patches via e-mail later today.
The first three patches were written manually, in order to address
a couple of special cases.
The remaining ones were generated by a script that looks for non-ASCII
UTF-8 characters only inside Documentation .rst and ABI files, applying
this conversion:
	my %char_map = (
		0x2010 => '-',		# HYPHEN
		0xad   => '-',		# SOFT HYPHEN
		0x2013 => '-',		# EN DASH
		0x2014 => '-',		# EM DASH
		0x2018 => "'",		# LEFT SINGLE QUOTATION MARK
		0x2019 => "'",		# RIGHT SINGLE QUOTATION MARK
		0xb4   => "'",		# ACUTE ACCENT
		0x201c => '"',		# LEFT DOUBLE QUOTATION MARK
		0x201d => '"',		# RIGHT DOUBLE QUOTATION MARK
		0x2212 => '-',		# MINUS SIGN
		0x2217 => '*',		# ASTERISK OPERATOR
		0xd7   => 'x',		# MULTIPLICATION SIGN
		0xbb   => '>',		# RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
		0xa0   => ' ',		# NO-BREAK SPACE
		0xfeff => '',		# ZERO WIDTH NO-BREAK SPACE
	);
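To make the effect of that map concrete, here is a minimal sketch in Python (not the Perl script that generated the patches, which is not shown here). The names CHAR_MAP and to_ascii_where_mapped are made up for illustration. It only shows the per-character substitution; the file selection (.rst and ABI files under Documentation/) is left out.

```python
# Illustrative Python rendering of the conversion table above.
# This is an assumption-laden sketch, not the actual (Perl) script.
CHAR_MAP = {
    0x2010: "-",   # HYPHEN
    0x00AD: "-",   # SOFT HYPHEN
    0x2013: "-",   # EN DASH
    0x2014: "-",   # EM DASH
    0x2018: "'",   # LEFT SINGLE QUOTATION MARK
    0x2019: "'",   # RIGHT SINGLE QUOTATION MARK
    0x00B4: "'",   # ACUTE ACCENT
    0x201C: '"',   # LEFT DOUBLE QUOTATION MARK
    0x201D: '"',   # RIGHT DOUBLE QUOTATION MARK
    0x2212: "-",   # MINUS SIGN
    0x2217: "*",   # ASTERISK OPERATOR
    0x00D7: "x",   # MULTIPLICATION SIGN
    0x00BB: ">",   # RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
    0x00A0: " ",   # NO-BREAK SPACE
    0xFEFF: "",    # ZERO WIDTH NO-BREAK SPACE (dropped)
}


def to_ascii_where_mapped(text: str) -> str:
    """Replace only the code points listed in CHAR_MAP.

    Any other non-ASCII character is left untouched.
    """
    # str.translate() accepts a dict keyed by code point (int).
    return text.translate(CHAR_MAP)


if __name__ == "__main__":
    # Escapes are used so that this example file itself stays ASCII.
    sample = "\u201cfoo\u201d \u2013 2\u00d74 \u00b5s"
    print(to_ascii_where_mapped(sample))
```

Characters that are not in the table, such as the MICRO SIGN in the sample, pass through unchanged.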
After the conversion, the following non-ASCII characters will still
remain under Documentation/:
- U+00a9 ('©'): COPYRIGHT SIGN
- U+00ac ('¬'): NOT SIGN # only at Documentation/powerpc/transactional_memory.rst
- U+00ae ('®'): REGISTERED SIGN
- U+00b0 ('°'): DEGREE SIGN
- U+00b1 ('±'): PLUS-MINUS SIGN
- U+00b2 ('²'): SUPERSCRIPT TWO
- U+00b5 ('µ'): MICRO SIGN
- U+00b7 ('·'): MIDDLE DOT # See below
- U+00bd ('½'): VULGAR FRACTION ONE HALF
- U+00c7 ('Ç'): LATIN CAPITAL LETTER C WITH CEDILLA
- U+00df ('ß'): LATIN SMALL LETTER SHARP S
- U+00e1 ('á'): LATIN SMALL LETTER A WITH ACUTE
- U+00e4 ('ä'): LATIN SMALL LETTER A WITH DIAERESIS
- U+00e6 ('æ'): LATIN SMALL LETTER AE
- U+00e7 ('ç'): LATIN SMALL LETTER C WITH CEDILLA
- U+00e9 ('é'): LATIN SMALL LETTER E WITH ACUTE
- U+00ea ('ê'): LATIN SMALL LETTER E WITH CIRCUMFLEX
- U+00eb ('ë'): LATIN SMALL LETTER E WITH DIAERESIS
- U+00f3 ('ó'): LATIN SMALL LETTER O WITH ACUTE
- U+00f4 ('ô'): LATIN SMALL LETTER O WITH CIRCUMFLEX
- U+00f6 ('ö'): LATIN SMALL LETTER O WITH DIAERESIS
- U+00f8 ('ø'): LATIN SMALL LETTER O WITH STROKE
- U+00fa ('ú'): LATIN SMALL LETTER U WITH ACUTE
- U+00fc ('ü'): LATIN SMALL LETTER U WITH DIAERESIS
- U+00fd ('ý'): LATIN SMALL LETTER Y WITH ACUTE
- U+011f ('ğ'): LATIN SMALL LETTER G WITH BREVE
- U+0142 ('ł'): LATIN SMALL LETTER L WITH STROKE
- U+03bc ('μ'): GREEK SMALL LETTER MU
- U+2026 ('…'): HORIZONTAL ELLIPSIS
- U+2122 ('™'): TRADE MARK SIGN
- U+2191 ('↑'): UPWARDS ARROW
- U+2192 ('→'): RIGHTWARDS ARROW
- U+2193 ('↓'): DOWNWARDS ARROW
- U+2264 ('≤'): LESS-THAN OR EQUAL TO
- U+2265 ('≥'): GREATER-THAN OR EQUAL TO
- U+2500 ('─'): BOX DRAWINGS LIGHT HORIZONTAL
- U+2502 ('│'): BOX DRAWINGS LIGHT VERTICAL
- U+2514 ('└'): BOX DRAWINGS LIGHT UP AND RIGHT
- U+251c ('├'): BOX DRAWINGS LIGHT VERTICAL AND RIGHT
- U+2b0d ('⬍'): UP DOWN BLACK ARROW
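A list like the one above can be produced with a few lines of code. The following is only a sketch in Python, assuming the text has already been read and decoded as UTF-8; the function name non_ascii_report is made up for illustration, and this is not the tool that was actually used to build the list.

```python
import unicodedata
from typing import List


def non_ascii_report(text: str) -> List[str]:
    """Return one line per distinct non-ASCII character, sorted by code point."""
    remaining = sorted({ch for ch in text if ord(ch) > 0x7F})
    return [
        # Unassigned or unnamed code points fall back to "UNKNOWN".
        "- U+%04x ('%s'): %s" % (ord(ch), ch, unicodedata.name(ch, "UNKNOWN"))
        for ch in remaining
    ]


if __name__ == "__main__":
    # Escapes keep the example itself ASCII-only.
    for line in non_ascii_report("5 \u00b5s \u2264 10 \u00b5s"):
        print(line)
```

Running it over the concatenated contents of the files of interest would give one entry per character still present after the conversion.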
For U+00b7 ('·'): MIDDLE DOT, I opted to keep it in a few places:
- Documentation/devicetree/bindings/clock/qcom,rpmcc.txt
  As this file will some day be converted to yaml, where the
  MIDDLE DOT will be removed, I guess it is not worth touching it.
- Documentation/scheduler/sched-deadline.rst
  There, it is used in math expressions, so it is better to keep it.
- Documentation/devicetree/bindings/media/video-interface-devices.yaml
  There, it is part of an ASCII artwork.
- translations/zh_CN
  I prefer not to touch it, as it might have some special meaning
  in Simplified Chinese.
Thanks,
Mauro