linux-doc.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Mauro Carvalho Chehab <mchehab@kernel.org>
To: Randy Dunlap <rdunlap@infradead.org>
Cc: "Michal Suchánek" <msuchanek@suse.de>,
	"Matthew Wilcox" <willy@infradead.org>,
	"Markus Heiser" <markus.heiser@darmarit.de>,
	linux-doc@vger.kernel.org, "Jonathan Corbet" <corbet@lwn.net>
Subject: Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
Date: Mon, 10 May 2021 10:17:57 +0200	[thread overview]
Message-ID: <20210510101757.145087d3@coco.lan> (raw)
In-Reply-To: <347657c8-f5ae-517c-0b43-fb60d50f1dd8@infradead.org>

Em Sat, 8 May 2021 08:55:11 -0700
Randy Dunlap <rdunlap@infradead.org> escreveu:

> > In the mean time, I'm already preparing a patch series addressing
> > the issues inside documentation, using some scripting to avoid
> > manual mistakes:
> > 
> > 	https://git.linuxtv.org/mchehab/experimental.git/log/?h=fix_utf8
> > 
> > (patch series is not 100% yet... some adjustments are still
> > needed on some places).  
> 
> 
> Thanks for digging into this and providing fixes.

Just pushed a new version there, rebasing the branch:

	https://git.linuxtv.org/mchehab/experimental.git/log/?h=fix_utf8

The first tree patches were manually written, in order to address 
a couple of special cases.

I'll be submitting the patches via e-mail later today.

The remaining ones were generated by a script that seeks for UTF-8
characters only inside Documentation .rst and ABI files, doing this
conversion:

my %char_map = (
	0x2010 => '-',		# HYPHEN
	0xad   => '-',		# SOFT HYPHEN
	0x2013 => '-',		# EN DASH
	0x2014 => '-',		# EM DASH

	0x2018 => "'",		# LEFT SINGLE QUOTATION MARK
	0x2019 => "'",		# RIGHT SINGLE QUOTATION MARK
	0xb4   => "'",		# ACUTE ACCENT

	0x201c => '"',		# LEFT DOUBLE QUOTATION MARK
	0x201d => '"',		# RIGHT DOUBLE QUOTATION MARK

	0x2212 => '-',		# MINUS SIGN
	0x2217 => '*',		# ASTERISK OPERATOR
	0xd7   => 'x',		# MULTIPLICATION SIGN

	0xbb   => '>',		# RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK

	0xa0   => ' ',		# NO-BREAK SPACE
	0xfeff => '',		# ZERO WIDTH NO-BREAK SPACE
);

Basically, after the conversion, those UTF-8 chars will remain
at Documentation/:

	- U+00a9 ('©'): COPYRIGHT SIGN
	- U+00ac ('¬'): NOT SIGN		# only at Documentation/powerpc/transactional_memory.rst
	- U+00ae ('®'): REGISTERED SIGN
	- U+00b0 ('°'): DEGREE SIGN
	- U+00b1 ('±'): PLUS-MINUS SIGN
	- U+00b2 ('²'): SUPERSCRIPT TWO
	- U+00b5 ('µ'): MICRO SIGN
	- U+00b7 ('·'): MIDDLE DOT		# See below
	- U+00bd ('½'): VULGAR FRACTION ONE HALF
	- U+00c7 ('Ç'): LATIN CAPITAL LETTER C WITH CEDILLA
	- U+00df ('ß'): LATIN SMALL LETTER SHARP S
	- U+00e1 ('á'): LATIN SMALL LETTER A WITH ACUTE
	- U+00e4 ('ä'): LATIN SMALL LETTER A WITH DIAERESIS
	- U+00e6 ('æ'): LATIN SMALL LETTER AE
	- U+00e7 ('ç'): LATIN SMALL LETTER C WITH CEDILLA
	- U+00e9 ('é'): LATIN SMALL LETTER E WITH ACUTE
	- U+00ea ('ê'): LATIN SMALL LETTER E WITH CIRCUMFLEX
	- U+00eb ('ë'): LATIN SMALL LETTER E WITH DIAERESIS
	- U+00f3 ('ó'): LATIN SMALL LETTER O WITH ACUTE
	- U+00f4 ('ô'): LATIN SMALL LETTER O WITH CIRCUMFLEX
	- U+00f6 ('ö'): LATIN SMALL LETTER O WITH DIAERESIS
	- U+00f8 ('ø'): LATIN SMALL LETTER O WITH STROKE
	- U+00fa ('ú'): LATIN SMALL LETTER U WITH ACUTE
	- U+00fc ('ü'): LATIN SMALL LETTER U WITH DIAERESIS
	- U+00fd ('ý'): LATIN SMALL LETTER Y WITH ACUTE
	- U+011f ('ğ'): LATIN SMALL LETTER G WITH BREVE
	- U+0142 ('ł'): LATIN SMALL LETTER L WITH STROKE
	- U+03bc ('μ'): GREEK SMALL LETTER MU
	- U+2026 ('…'): HORIZONTAL ELLIPSIS
	- U+2122 ('™'): TRADE MARK SIGN
	- U+2191 ('↑'): UPWARDS ARROW
	- U+2192 ('→'): RIGHTWARDS ARROW
	- U+2193 ('↓'): DOWNWARDS ARROW
	- U+2264 ('≤'): LESS-THAN OR EQUAL TO
	- U+2265 ('≥'): GREATER-THAN OR EQUAL TO
	- U+2500 ('─'): BOX DRAWINGS LIGHT HORIZONTAL
	- U+2502 ('│'): BOX DRAWINGS LIGHT VERTICAL
	- U+2514 ('└'): BOX DRAWINGS LIGHT UP AND RIGHT
	- U+251c ('├'): BOX DRAWINGS LIGHT VERTICAL AND RIGHT
	- U+2b0d ('⬍'): UP DOWN BLACK ARROW

For U+00b7 ('·'): MIDDLE DOT, I opted to keep it on a few places:

- Documentation/devicetree/bindings/clock/qcom,rpmcc.txt

  As this file will be some day converted to yaml, where the 
  MIDDLE DOT will be removed, I guess it is not worth touching it.

- Documentation/scheduler/sched-deadline.rst

  There, it is used on a math expressions. So, better to keep.

- Documentation/devicetree/bindings/media/video-interface-devices.yaml

  There, it part of an ASCII artwork.

- translations/zh_CN

  I prefer not touching it, as it might have some special meaning
  in Simplified Chinese.

Thanks,
Mauro

  parent reply	other threads:[~2021-05-10  8:18 UTC|newest]

Thread overview: 41+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-05-06 10:39 Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256) Michal Suchánek
2021-05-06 11:20 ` Mauro Carvalho Chehab
2021-05-06 13:32   ` Michal Suchánek
2021-05-06 14:24     ` Mauro Carvalho Chehab
2021-05-06 14:35       ` Michal Suchánek
2021-05-06 15:57 ` Markus Heiser
2021-05-06 16:46   ` Mauro Carvalho Chehab
2021-05-06 17:04     ` Markus Heiser
2021-05-06 17:27       ` Mauro Carvalho Chehab
2021-05-06 17:53         ` Markus Heiser
2021-05-06 18:06           ` Michal Suchánek
2021-05-07  8:52             ` Mauro Carvalho Chehab
2021-05-06 17:57         ` Randy Dunlap
2021-05-06 18:08           ` Matthew Wilcox
2021-05-06 21:21             ` Randy Dunlap
2021-05-07  6:39               ` Mauro Carvalho Chehab
2021-05-07  6:49                 ` Randy Dunlap
2021-05-07  8:04                 ` Mauro Carvalho Chehab
2021-05-07  8:35                   ` Michal Suchánek
2021-05-07  8:56                     ` Markus Heiser
2021-05-07  9:14                       ` Mauro Carvalho Chehab
2021-05-07  9:51                         ` Markus Heiser
2021-05-07 10:29                           ` Michal Suchánek
2021-05-07  9:02                     ` Mauro Carvalho Chehab
2021-05-08  9:22                 ` Mauro Carvalho Chehab
2021-05-08 10:41                   ` Michal Suchánek
2021-05-08 14:41                     ` Mauro Carvalho Chehab
2021-05-08 15:55                       ` Randy Dunlap
2021-05-08 17:09                         ` Michal Suchánek
2021-05-08 17:46                           ` Randy Dunlap
2021-05-10  6:22                             ` Mauro Carvalho Chehab
2021-05-10  8:17                         ` Mauro Carvalho Chehab [this message]
2021-05-06 17:48       ` Michal Suchánek
2021-05-06 17:59         ` Markus Heiser
2021-05-06 18:16           ` Michal Suchánek
2021-05-12  6:22         ` Mauro Carvalho Chehab
2021-05-12  7:01           ` Michal Suchánek
2021-05-12  7:18             ` Markus Heiser
2021-05-12  7:37               ` Markus Heiser
2021-05-12  7:59             ` Mauro Carvalho Chehab
2021-05-17 13:10               ` Michal Suchánek

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210510101757.145087d3@coco.lan \
    --to=mchehab@kernel.org \
    --cc=corbet@lwn.net \
    --cc=linux-doc@vger.kernel.org \
    --cc=markus.heiser@darmarit.de \
    --cc=msuchanek@suse.de \
    --cc=rdunlap@infradead.org \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).