From: Mauro Carvalho Chehab <mchehab@kernel.org>
To: "Michal Suchánek" <msuchanek@suse.de>
Cc: Randy Dunlap <rdunlap@infradead.org>,
Matthew Wilcox <willy@infradead.org>,
Markus Heiser <markus.heiser@darmarit.de>,
linux-doc@vger.kernel.org, Jonathan Corbet <corbet@lwn.net>
Subject: Re: Sphinx parallel build error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 18-20: ordinal not in range(256)
Date: Sat, 8 May 2021 16:41:45 +0200
Message-ID: <20210508164145.26f7b1e0@coco.lan>
In-Reply-To: <20210508104157.GC12700@kitsune.suse.cz>
Em Sat, 8 May 2021 12:41:57 +0200
Michal Suchánek <msuchanek@suse.de> escreveu:
> On Sat, May 08, 2021 at 11:22:05AM +0200, Mauro Carvalho Chehab wrote:
> > Em Fri, 7 May 2021 08:39:24 +0200
> > Mauro Carvalho Chehab <mchehab@kernel.org> escreveu:
> >
> > > Em Thu, 6 May 2021 14:21:01 -0700
> > > Randy Dunlap <rdunlap@infradead.org> escreveu:
> > >
> > > > On 5/6/21 11:08 AM, Matthew Wilcox wrote:
> > > > > On Thu, May 06, 2021 at 10:57:53AM -0700, Randy Dunlap wrote:
> > > > >> I have been going thru some of the Documentation/ files...
> > > > >>
> > > > >> Why do several of the files begin with
> > > > >> (hex) ef bb bf followed by "=================="
> > > > >> for a heading, instead of just "===================".
> > > > >> See e.g. Documentation/timers/no_hz.rst.
> > >
> > > No idea! It seems that the text editor I used at that time added
> > > it for whatever reason.
> >
> > > I'll prepare a patch fixing it. Some care should be taken, however, as
> > > it has two places where UTF-8 chars should be used[2].
> >
> > OK, I wrote a small script to check which special characters we
> > currently have (next-20210507) under Documentation/, excluding the
> > translations.
> >
> > Based on my script results, we have these groups:
> >
> > 1. Latin accented characters:
> > - U+00c7 (LATIN CAPITAL LETTER C WITH CEDILLA) (Ç)
> > - U+00df (LATIN SMALL LETTER SHARP S) (ß)
> > - U+00e1 (LATIN SMALL LETTER A WITH ACUTE) (á)
> > - U+00e4 (LATIN SMALL LETTER A WITH DIAERESIS) (ä)
> > - U+00e6 (LATIN SMALL LETTER AE) (æ)
> > - U+00e7 (LATIN SMALL LETTER C WITH CEDILLA) (ç)
> > - U+00e9 (LATIN SMALL LETTER E WITH ACUTE) (é)
> > - U+00ea (LATIN SMALL LETTER E WITH CIRCUMFLEX) (ê)
> > - U+00eb (LATIN SMALL LETTER E WITH DIAERESIS) (ë)
> > - U+00f3 (LATIN SMALL LETTER O WITH ACUTE) (ó)
> > - U+00f4 (LATIN SMALL LETTER O WITH CIRCUMFLEX) (ô)
> > - U+00f6 (LATIN SMALL LETTER O WITH DIAERESIS) (ö)
> > - U+00f8 (LATIN SMALL LETTER O WITH STROKE) (ø)
> > - U+00fc (LATIN SMALL LETTER U WITH DIAERESIS) (ü)
> > - U+011f (LATIN SMALL LETTER G WITH BREVE) (ğ)
> > - U+0142 (LATIN SMALL LETTER L WITH STROKE) (ł)
> >
> > 2. symbols:
> > - U+00a9 (COPYRIGHT SIGN) (©)
> > - U+2122 (TRADE MARK SIGN) (™)
> > - U+00ae (REGISTERED SIGN) (®)
> > - U+00b0 (DEGREE SIGN) (°)
> > - U+00b1 (PLUS-MINUS SIGN) (±)
> > - U+00b2 (SUPERSCRIPT TWO) (²)
> > - U+00b5 (MICRO SIGN) (µ)
> > - U+00bd (VULGAR FRACTION ONE HALF) (½)
> > - U+2026 (HORIZONTAL ELLIPSIS) (…)
> >
> > 3. arrows:
> > - U+2191 (UPWARDS ARROW) (↑)
> > - U+2192 (RIGHTWARDS ARROW) (→)
> > - U+2193 (DOWNWARDS ARROW) (↓)
> > - U+2b0d (UP DOWN BLACK ARROW) (⬍)
> >
> > 4. box drawings:
> > - U+2500 (BOX DRAWINGS LIGHT HORIZONTAL) (─)
> > - U+2502 (BOX DRAWINGS LIGHT VERTICAL) (│)
> > - U+2514 (BOX DRAWINGS LIGHT UP AND RIGHT) (└)
> > - U+251c (BOX DRAWINGS LIGHT VERTICAL AND RIGHT) (├)
> >
> > 5. math symbols:
> > - U+00b7 (MIDDLE DOT) (·)
> > - U+00d7 (MULTIPLICATION SIGN) (×)
> > - U+2212 (MINUS SIGN) (−)
> > - U+2217 (ASTERISK OPERATOR) (∗)
> > - U+223c (TILDE OPERATOR) (∼)
> > - U+2264 (LESS-THAN OR EQUAL TO) (≤)
> > - U+2265 (GREATER-THAN OR EQUAL TO) (≥)
> > - U+27e8 (MATHEMATICAL LEFT ANGLE BRACKET) (⟨)
> > - U+27e9 (MATHEMATICAL RIGHT ANGLE BRACKET) (⟩)
> > - U+00ac (NOT SIGN) (¬)
>
Hi Michal,
> Clearly this is supposed to be an ASCII tilde:
> Documentation/cdrom/cdrom-standard.rst: if (cdo->capability & ∼cdi->mask & CDC _⟨capability⟩)
Yes, for this specific file, iconv //translit should solve everything.
In the case of cdrom-standard, those came from the LaTeX conversion.
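
As a rough illustration of the kind of replacement meant here, the sketch below maps only the lookalike characters discussed in this thread to their ASCII counterparts. The table and the helper name are hypothetical; this is an explicit mapping, not a description of what iconv //translit does internally, and iconv's output can differ between implementations and locales.

```python
# Hypothetical table: lookalike characters mentioned in this thread -> ASCII.
# This is an explicit mapping, not a reimplementation of iconv //translit.
LOOKALIKES = str.maketrans({
    "\u223c": "~",  # TILDE OPERATOR
    "\u2217": "*",  # ASTERISK OPERATOR
    "\u2212": "-",  # MINUS SIGN
    "\u27e8": "<",  # MATHEMATICAL LEFT ANGLE BRACKET
    "\u27e9": ">",  # MATHEMATICAL RIGHT ANGLE BRACKET
})


def to_ascii_lookalikes(text: str) -> str:
    """Replace only the characters listed in LOOKALIKES; leave the rest alone."""
    return text.translate(LOOKALIKES)


if __name__ == "__main__":
    sample = "NULL, /\u2217 lseek \u2217/"
    print(to_ascii_lookalikes(sample))
```

An explicit table leaves characters such as accented names, ×, ≤ and ≥ untouched, which matches the intent of only fixing conversion debris.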
>
> Use of ¬ is also very dubious in documentation (in fonts it is understandable):
> Documentation/ABI/obsolete/sysfs-kernel-fadump_registered:This ABI is renamed and moved to a new location /sys/kernel/fadump/registered.¬
> Documentation/ABI/obsolete/sysfs-kernel-fadump_release_mem:This ABI is renamed and moved to a new location /sys/kernel/fadump/release_mem.¬
> Documentation/powerpc/transactional_memory.rst: if (MSR 29:31 ¬ = 0b010 | SRR1 29:31 ¬ = 0b000) then
Yeah, this would probably be better written as (¬= meaning "not equal"):
	if (MSR 29:31 != 0b010 | SRR1 29:31 != 0b000) then
> The use of − is rare and could be replaced with ASCII hyphen-minus entirely
> without making the text harder to understand:
>
> Documentation/devicetree/bindings/iio/adc/adi,ad7124.yaml: 0: REFIN1(+)/REFIN1(−).
> Documentation/devicetree/bindings/iio/adc/adi,ad7124.yaml: 1: REFIN2(+)/REFIN2(−).
> Documentation/devicetree/bindings/iio/adc/adi,ad7192.yaml: External reference applied between the P1/REFIN2(+) and P0/REFIN2(−) pins.
> Documentation/scheduler/sched-deadline.rst: ((M − 1) · WCET_max − WCET_min)/(M − (M − 2) · U_max) + WCET_max
> drivers/gpu/drm/drm_color_mgmt.c: * - range: [-2^2, 2^2 - 2^−15]
> drivers/iio/light/tsl2583.c: * sheet (TAOS134 − MARCH 2011):
> drivers/staging/iio/adc/ad7280a.c: * (Number of Conversions per Part)) −
> sound/soc/codecs/sgtl5000.c: * is the array index and the following formula: 10^((idx−15)/40) * 100
Agreed.
> Asterisk operator is clearly meant to be ASCII:
> Documentation/cdrom/cdrom-standard.rst: NULL, /∗ lseek ∗/
> Documentation/cdrom/cdrom-standard.rst: block _read , /∗ read—general block-dev read ∗/
> Documentation/cdrom/cdrom-standard.rst: block _write, /∗ write—general block-dev write ∗/
> Documentation/cdrom/cdrom-standard.rst: NULL, /∗ readdir ∗/
> Documentation/cdrom/cdrom-standard.rst: NULL, /∗ select ∗/
> Documentation/cdrom/cdrom-standard.rst: cdrom_ioctl, /∗ ioctl ∗/
> Documentation/cdrom/cdrom-standard.rst: NULL, /∗ mmap ∗/
> Documentation/cdrom/cdrom-standard.rst: cdrom_open, /∗ open ∗/
> Documentation/cdrom/cdrom-standard.rst: cdrom_release, /∗ release ∗/
> Documentation/cdrom/cdrom-standard.rst: NULL, /∗ fsync ∗/
> Documentation/cdrom/cdrom-standard.rst: NULL, /∗ fasync ∗/
> Documentation/cdrom/cdrom-standard.rst: NULL /∗ revalidate ∗/
> Documentation/filesystems/ext4/blockgroup.rst:filesystem size to 2^21 ∗ 2^27 = 2^48bytes or 256TiB.
>
> There is only one place where ⟨⟩ is used which is very dubious:
> Documentation/cdrom/cdrom-standard.rst: if (cdo->capability & ∼cdi->mask & CDC _⟨capability⟩) ...
Yeah. Again, this was due to LaTeX to text conversion.
> The middle dot is mostly used in mathematical formulas that would be
> unintelligible otherwise but there are a few odd uses:
> Documentation/ABI/testing/sysfs-module:KernelVersion:»·3.3
> Documentation/ABI/testing/sysfs-module:KernelVersion:»·3.3
> Documentation/devicetree/bindings/clock/qcom,rpmcc.txt: "qcom,rpmcc-msm8992",·"qcom,rpmcc"
> Documentation/devicetree/bindings/clock/qcom,rpmcc.txt: "qcom,rpmcc-msm8994",·"qcom,rpmcc"
Yeah. It sounds like a space would be the best replacement there.
> Documentation/translations/zh_CN/kernel-hacking/hacking.rst: 阿列克谢·库兹涅佐夫享用的糟糕伏特加有关。
> Documentation/translations/zh_CN/process/howto.rst: 《C程序设计语言(第2版·新版)》(徐宝文 李志 译)[机械工业出版社]
> Documentation/translations/zh_CN/process/management-style.rst:.. [#cnf2] 保罗·西蒙演唱了“离开爱人的50种方法”,因为坦率地说,“告诉开发者
I wouldn't touch translations.
> The × ≤ and ≥ uses look fine.
Agreed.
Thanks for double-checking those. I'll address them.
In the meantime, I'm already preparing a patch series addressing
the issues inside documentation, using some scripting to avoid
manual mistakes:
https://git.linuxtv.org/mchehab/experimental.git/log/?h=fix_utf8
(the patch series is not 100% ready yet... some adjustments are still
needed in some places).
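
For reference, an inventory like the one quoted earlier in this message could be reproduced with something along these lines. This is only a sketch, not the script actually used: the function names, the default root and the directory-name based exclusion of translations are all assumptions.

```python
import os
import sys
import unicodedata
from collections import Counter


def non_ascii_inventory(root, skip_dirs=("translations",)):
    """Count every non-ASCII character in files under root, skipping skip_dirs."""
    counts = Counter()
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune skipped directories in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if d not in skip_dirs]
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8") as f:
                    text = f.read()
            except (UnicodeDecodeError, OSError):
                continue  # not valid UTF-8 or unreadable: ignored in this sketch
            counts.update(ch for ch in text if ord(ch) > 127)
    return counts


def describe(ch):
    """Format one character as 'U+xxxx (NAME) (char)', like the list above."""
    name = unicodedata.name(ch, "UNKNOWN")
    return f"U+{ord(ch):04x} ({name}) ({ch})"


if __name__ == "__main__":
    root = sys.argv[1] if len(sys.argv) > 1 else "Documentation"
    for ch, n in sorted(non_ascii_inventory(root).items()):
        print(f"{n:6d}  {describe(ch)}")
```

Because the files are read as "utf-8" rather than "utf-8-sig", a leading ef bb bf sequence is reported as U+feff, so the same pass also flags the files that start with a byte order mark.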
Thanks,
Mauro