From: "Michael Kerrisk (man-pages)" <mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
To: Colin Watson <cjwatson-8fiUuRrzOP0dnm+yROfE0A@public.gmane.org>
Cc: mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org,
linux-man <linux-man-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
Bruno Haible <bruno-nWNVUoHt2MvYtjvyW6yDsg@public.gmane.org>,
Werner Lemberg <wl-mXXj517/zsQ@public.gmane.org>,
Peter Schiffer <pschiffe-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Subject: Re: Converting man-pages to UTF-8
Date: Fri, 14 Feb 2014 16:28:11 +0100
Message-ID: <52FE360B.9050302@gmail.com>
In-Reply-To: <20140214114216.GE6397-K2jUmMR1UYV4cg9Nei1l7Q@public.gmane.org>

On 02/14/2014 12:42 PM, Colin Watson wrote:
> On Fri, Feb 14, 2014 at 11:43:30AM +0100, Michael Kerrisk (man-pages) wrote:
>> At https://bugzilla.kernel.org/show_bug.cgi?id=60807 is a proposal to
>> convert the pages of the "man-pages" project to UTF-8. I thought
>> it worthwhile bringing that topic to the list, and CCing a few people
>> who may have some ideas about this step, since I'm not too sure of the
>> implications.
>>
>> Peter Schiffer has kindly written some scripts to do the
>> conversion, which would touch about 40 files. However, as far as I can
>> tell, many of the pages that have non-ASCII characters have them only
>> inside groff comments (authors' names, etc.). The only pages that have
>> non-ASCII characters in the rendered source are various man7 pages on
>> character sets. These were the pages to which I added a groff encoding
>> marker in response to Colin Watson's input on this Debian bug:
>> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=519209
>>
>> Moving to UTF-8 for the pages seems like a good idea, at least at some
>> point. However, I'm wondering whether there are any backward
>> compatibility issues that I need to worry about. As far as I
>> know, groff added UTF-8 support back in Jan 2009, so, just over 5
>> years ago. Perhaps that's long enough ago now, that any backward
>> compatibility issues with old versions of groff would be minimal.
>> (I.e., the number of people installing new man-pages on systems with
>> old groff is likely to be very small, and anyway, only a dozen or so
>> pages in Section 7 are affected. Furthermore, I'm assuming that Linux
>> distros have been shipping groff v1.20+ for quite a long time now.)
>
> I think for characters in comments you're probably fine, and any
> problems you might have had should be gone as of groff 1.20. Debian
> switched to that in July 2009, and I think we were late to the party
> because we had some difficult historical baggage to clean up at the same
> time. I'm not aware of anyone shipping older versions of groff any
> more.
>
> When you convert characters that show up in rendered source, I suspect
> systems using the other man package (1.6g or similar versions) may
> render them poorly, because it invokes nroff in some fairly naïve and
> hardcoded ways. However, they already break in various related ways,
> and most distributions have switched to man-db now, or dealt with things
> some other way. My rough survey of the major distributions for this is:
>
> Arch has been good since about 2009
>
> Debian and descendants are good as of late 2007 / early 2008 (addition
> of manconv to man-db)
>
> Fedora is definitely good as of 2010 (switch to man-db), and I think
> was good before that as IIRC they did a flag day to switch everything
> to UTF-8 with man
>
> Gentoo switched to man-db at the end of 2013, so should be good now
>
> Mageia has a current groff, but uses man 1.6g with a stack of patches
> (some encoding-related)
>
> openSUSE has been fine for about the same length of time as Debian
>
> Slackware has a current groff, but uses man 1.6g without much in the
> way of special patches (just one to make things work for UTF-8
> *output*)
>
> My guess is that Mageia and Slackware may find that things only work
> properly for users in UTF-8 locales, but most other major distributions
> should be fine. You won't be the first author to switch to UTF-8 manual
> pages; all you'll be doing is making existing shortcomings perhaps
> marginally more obvious. In any case, the pages currently encoded in
> ISO-8859-1 won't be very seriously affected, and users of problematic
> systems will only have been able to read the other pages with good luck
> and a following wind anyway; switching to UTF-8 will probably actually
> improve things for them if they're using a UTF-8 locale. (That is, the
> problems that the affected systems have generally relate to attempting
> to read pages whose encoding doesn't match that of their locale.) They
> might possibly need to add the -k option to their nroff invocation in
> man.conf.
>
> If I were you I would just go ahead.
>
> Regarding your questions in the bug, please do keep the "coding:" tag in
> there; man-db will figure this out by brute force, but if left to its
> own devices I think groff's preconv will default to the locale's
> encoding, so it will only work for some people.

Hello Colin,

Thanks for the extensive reply! One final point. For the pages that
have non-ASCII characters only in source comments, not in rendered
input source, does it matter whether or not the "coding:" tag is added?
I ask because, simply for documentary purposes, I'm wondering whether
we should add that tag only in the pages that have UTF-8 in the rendered
input.
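
To make the distinction concrete, here is a rough Python sketch of how
one might sort pages into those two groups. It is only an illustration
under stated assumptions: the function names are made up for this
example, and it treats everything from the first \" on a line as a
comment, which ignores corner cases such as an escaped backslash
directly before a double quote.

```python
import re

# Assumption: a roff comment starts at the first \" on a line and runs to
# the end of that line (this also covers lines beginning with .\" or '\").
_COMMENT_START = re.compile(r'\\"')


def classify_non_ascii(text):
    """Return (in_comments, in_rendered): 1-based line numbers that contain
    non-ASCII characters inside a comment or in the rendered part."""
    in_comments, in_rendered = [], []
    for lineno, line in enumerate(text.splitlines(), start=1):
        match = _COMMENT_START.search(line)
        if match:
            body, comment = line[:match.start()], line[match.start():]
        else:
            body, comment = line, ""
        if not body.isascii():
            in_rendered.append(lineno)
        if not comment.isascii():
            in_comments.append(lineno)
    return in_comments, in_rendered


def needs_coding_tag(text):
    """Hypothetical policy: tag a page only if non-ASCII text is rendered."""
    return bool(classify_non_ascii(text)[1])
```

A page could be read with encoding="latin-1" before being passed in,
since that decoding accepts any byte sequence and so works for both the
ISO-8859-1 and the UTF-8 pages when all we want is to spot non-ASCII
bytes.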

Cheers,

Michael
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/