From: Mauro Carvalho Chehab <mchehab@kernel.org>
To: Randy Dunlap <rdunlap@infradead.org>
Cc: "Michal Suchánek" <msuchanek@suse.de>,
"Matthew Wilcox" <willy@infradead.org>,
"Markus Heiser" <markus.heiser@darmarit.de>,
linux-doc@vger.kernel.org, "Jonathan Corbet" <corbet@lwn.net>
Subject: Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
Date: Mon, 10 May 2021 10:17:57 +0200
Message-ID: <20210510101757.145087d3@coco.lan>
In-Reply-To: <347657c8-f5ae-517c-0b43-fb60d50f1dd8@infradead.org>
Em Sat, 8 May 2021 08:55:11 -0700
Randy Dunlap <rdunlap@infradead.org> escreveu:
> > In the mean time, I'm already preparing a patch series addressing
> > the issues inside documentation, using some scripting to avoid
> > manual mistakes:
> >
> > https://git.linuxtv.org/mchehab/experimental.git/log/?h=fix_utf8
> >
> > (patch series is not 100% yet... some adjustments are still
> > needed on some places).
>
>
> Thanks for digging into this and providing fixes.
Just pushed a new version there, rebasing the branch:
https://git.linuxtv.org/mchehab/experimental.git/log/?h=fix_utf8
I'll be submitting the patches via e-mail later today.
The first three patches were written manually, in order to address
a couple of special cases.
The remaining ones were generated by a script that searches for
non-ASCII UTF-8 characters only inside Documentation .rst and ABI
files, applying this conversion:
	my %char_map = (
		0x2010 => '-',		# HYPHEN
		0xad   => '-',		# SOFT HYPHEN
		0x2013 => '-',		# EN DASH
		0x2014 => '-',		# EM DASH
		0x2018 => "'",		# LEFT SINGLE QUOTATION MARK
		0x2019 => "'",		# RIGHT SINGLE QUOTATION MARK
		0xb4   => "'",		# ACUTE ACCENT
		0x201c => '"',		# LEFT DOUBLE QUOTATION MARK
		0x201d => '"',		# RIGHT DOUBLE QUOTATION MARK
		0x2212 => '-',		# MINUS SIGN
		0x2217 => '*',		# ASTERISK OPERATOR
		0xd7   => 'x',		# MULTIPLICATION SIGN
		0xbb   => '>',		# RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
		0xa0   => ' ',		# NO-BREAK SPACE
		0xfeff => '',		# ZERO WIDTH NO-BREAK SPACE
	);
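For readers who want to try the mapping on a string, here is a minimal Python sketch of the same table. It is only an illustration and not the script that generated the patches; the names CHAR_MAP and to_ascii_punctuation are made up for this example, and it relies on nothing beyond the standard str.translate() method.

```python
# Illustrative Python rendering of the %char_map table above.
# NOT the script used for the patch series; names are hypothetical.
CHAR_MAP = {
    0x2010: "-",   # HYPHEN
    0x00AD: "-",   # SOFT HYPHEN
    0x2013: "-",   # EN DASH
    0x2014: "-",   # EM DASH
    0x2018: "'",   # LEFT SINGLE QUOTATION MARK
    0x2019: "'",   # RIGHT SINGLE QUOTATION MARK
    0x00B4: "'",   # ACUTE ACCENT
    0x201C: '"',   # LEFT DOUBLE QUOTATION MARK
    0x201D: '"',   # RIGHT DOUBLE QUOTATION MARK
    0x2212: "-",   # MINUS SIGN
    0x2217: "*",   # ASTERISK OPERATOR
    0x00D7: "x",   # MULTIPLICATION SIGN
    0x00BB: ">",   # RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
    0x00A0: " ",   # NO-BREAK SPACE
    0xFEFF: "",    # ZERO WIDTH NO-BREAK SPACE (dropped)
}


def to_ascii_punctuation(text: str) -> str:
    """Replace the mapped code points; leave every other character as-is."""
    return text.translate(CHAR_MAP)
```

Characters that are not in the table, such as the degree sign, pass through unchanged, which matches the idea that some non-ASCII characters are meant to stay.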
After the conversion, the following non-ASCII characters will
remain under Documentation/:
- U+00a9 ('©'): COPYRIGHT SIGN
- U+00ac ('¬'): NOT SIGN # only at Documentation/powerpc/transactional_memory.rst
- U+00ae ('®'): REGISTERED SIGN
- U+00b0 ('°'): DEGREE SIGN
- U+00b1 ('±'): PLUS-MINUS SIGN
- U+00b2 ('²'): SUPERSCRIPT TWO
- U+00b5 ('µ'): MICRO SIGN
- U+00b7 ('·'): MIDDLE DOT # See below
- U+00bd ('½'): VULGAR FRACTION ONE HALF
- U+00c7 ('Ç'): LATIN CAPITAL LETTER C WITH CEDILLA
- U+00df ('ß'): LATIN SMALL LETTER SHARP S
- U+00e1 ('á'): LATIN SMALL LETTER A WITH ACUTE
- U+00e4 ('ä'): LATIN SMALL LETTER A WITH DIAERESIS
- U+00e6 ('æ'): LATIN SMALL LETTER AE
- U+00e7 ('ç'): LATIN SMALL LETTER C WITH CEDILLA
- U+00e9 ('é'): LATIN SMALL LETTER E WITH ACUTE
- U+00ea ('ê'): LATIN SMALL LETTER E WITH CIRCUMFLEX
- U+00eb ('ë'): LATIN SMALL LETTER E WITH DIAERESIS
- U+00f3 ('ó'): LATIN SMALL LETTER O WITH ACUTE
- U+00f4 ('ô'): LATIN SMALL LETTER O WITH CIRCUMFLEX
- U+00f6 ('ö'): LATIN SMALL LETTER O WITH DIAERESIS
- U+00f8 ('ø'): LATIN SMALL LETTER O WITH STROKE
- U+00fa ('ú'): LATIN SMALL LETTER U WITH ACUTE
- U+00fc ('ü'): LATIN SMALL LETTER U WITH DIAERESIS
- U+00fd ('ý'): LATIN SMALL LETTER Y WITH ACUTE
- U+011f ('ğ'): LATIN SMALL LETTER G WITH BREVE
- U+0142 ('ł'): LATIN SMALL LETTER L WITH STROKE
- U+03bc ('μ'): GREEK SMALL LETTER MU
- U+2026 ('…'): HORIZONTAL ELLIPSIS
- U+2122 ('™'): TRADE MARK SIGN
- U+2191 ('↑'): UPWARDS ARROW
- U+2192 ('→'): RIGHTWARDS ARROW
- U+2193 ('↓'): DOWNWARDS ARROW
- U+2264 ('≤'): LESS-THAN OR EQUAL TO
- U+2265 ('≥'): GREATER-THAN OR EQUAL TO
- U+2500 ('─'): BOX DRAWINGS LIGHT HORIZONTAL
- U+2502 ('│'): BOX DRAWINGS LIGHT VERTICAL
- U+2514 ('└'): BOX DRAWINGS LIGHT UP AND RIGHT
- U+251c ('├'): BOX DRAWINGS LIGHT VERTICAL AND RIGHT
- U+2b0d ('⬍'): UP DOWN BLACK ARROW
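A list in this format can be produced with a few lines of Python. The sketch below is an assumption about how such a report could be built, not the tool that was actually used; remaining_non_ascii is a hypothetical helper, and the character names come from the standard unicodedata module.

```python
import unicodedata
from typing import List


def remaining_non_ascii(text: str) -> List[str]:
    """Return one "U+xxxx ('c'): NAME" line per distinct non-ASCII character."""
    # Collect each non-ASCII character once, ordered by code point.
    chars = sorted({ch for ch in text if ord(ch) > 0x7F})
    return [
        # unicodedata.name() falls back to "UNKNOWN" for unnamed code points.
        f"U+{ord(ch):04x} ('{ch}'): {unicodedata.name(ch, 'UNKNOWN')}"
        for ch in chars
    ]
```

Running it over the contents of a file after the conversion would show which characters are still left in that file.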
For U+00b7 ('·'): MIDDLE DOT, I opted to keep it in a few places:
- Documentation/devicetree/bindings/clock/qcom,rpmcc.txt
  As this file will some day be converted to yaml, where the
  MIDDLE DOT will be removed, I guess it is not worth touching it.
- Documentation/scheduler/sched-deadline.rst
  There, it is used in math expressions, so it is better to keep it.
- Documentation/devicetree/bindings/media/video-interface-devices.yaml
  There, it is part of an ASCII artwork.
- translations/zh_CN
  I prefer not touching it, as it might have some special meaning
  in Simplified Chinese.
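Per-file exceptions like these can be expressed as a small lookup. The Python sketch below is hypothetical: the message does not say how the exceptions were implemented, so the substring match and the names KEEP_MIDDLE_DOT and keeps_middle_dot are assumptions. The path entries are copied from the list above.

```python
# Paths where U+00b7 MIDDLE DOT is kept, copied from the list above.
# How the real script handled these is not stated; substring matching
# is an assumption made for this sketch.
KEEP_MIDDLE_DOT = (
    "Documentation/devicetree/bindings/clock/qcom,rpmcc.txt",
    "Documentation/scheduler/sched-deadline.rst",
    "Documentation/devicetree/bindings/media/video-interface-devices.yaml",
    "translations/zh_CN",
)


def keeps_middle_dot(path: str) -> bool:
    """True if the MIDDLE DOT should be left alone in this file."""
    return any(entry in path for entry in KEEP_MIDDLE_DOT)
```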
Thanks,
Mauro