From: "Michael Kerrisk (man-pages)" <mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
To: Colin Watson <cjwatson-8fiUuRrzOP0dnm+yROfE0A@public.gmane.org>
Cc: mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org,
	linux-man <linux-man-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
	Bruno Haible <bruno-nWNVUoHt2MvYtjvyW6yDsg@public.gmane.org>,
	Werner Lemberg <wl-mXXj517/zsQ@public.gmane.org>,
	Peter Schiffer <pschiffe-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Subject: Re: Converting man-pages to UTF-8
Date: Fri, 14 Feb 2014 16:28:11 +0100
Message-ID: <52FE360B.9050302@gmail.com>
In-Reply-To: <20140214114216.GE6397-K2jUmMR1UYV4cg9Nei1l7Q@public.gmane.org>

On 02/14/2014 12:42 PM, Colin Watson wrote:
> On Fri, Feb 14, 2014 at 11:43:30AM +0100, Michael Kerrisk (man-pages) wrote:
>> At https://bugzilla.kernel.org/show_bug.cgi?id=60807 is a proposal to
>> convert the pages of the "man-pages" project to UTF-8. I thought
>> it worthwhile bringing that topic to the list, and CCing a few people
>> who may have some ideas about this step, since I'm not too sure of the
>> implications.
>>
>> Peter Schiffer has kindly written some scripts to do the
>> conversion, which would touch about 40 files. However, as far as I can
>> tell, many of the pages that have non-ASCII characters have them inside
>> groff comments (authors' names, etc.). The only pages that have
>> non-ASCII characters in the rendered source are various man7 pages on
>> character sets. These were the pages to which I added a groff encoding
>> marker in response to Colin Watson's input on this Debian bug:
>> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=519209
>>
>> Moving to UTF-8 for the pages seems like a good idea, at least at some
>> point. However, I'm wondering whether there are any backward
>> compatibility issues that I need to worry about. As far as I
>> know, groff added UTF-8 support back in Jan 2009, so, just over 5
>> years ago. Perhaps that's long enough ago now, that any backward
>> compatibility issues with old versions of groff would be minimal.
>> (I.e., the number of people installing new man-pages on systems with
>> old groff is likely to be very small, and anyway, only a dozen or so
>> pages in Section 7 are affected. Furthermore, I'm assuming that Linux
>> distros have been shipping groff v1.20+ for quite a long time now.)
> 
> I think for characters in comments you're probably fine, and any
> problems you might have had should be gone as of groff 1.20.  Debian
> switched to that in July 2009, and I think we were late to the party
> because we had some difficult historical baggage to clean up at the same
> time.  I'm not aware of anyone shipping older versions of groff any
> more.
> 
> When you convert characters that show up in rendered source, I suspect
> systems using the other man package (1.6g or similar versions) may
> render them poorly, because it invokes nroff in some fairly naïve and
> hardcoded ways.  However, they already break in various related ways,
> and most distributions have switched to man-db now, or dealt with things
> some other way.  My rough survey of the major distributions for this is:
> 
>   Arch has been good since about 2009
> 
>   Debian and descendants are good as of late 2007 / early 2008 (addition
>   of manconv to man-db)
> 
>   Fedora is definitely good as of 2010 (switch to man-db), and I think
>   was good before that as IIRC they did a flag day to switch everything
>   to UTF-8 with man
> 
>   Gentoo switched to man-db at the end of 2013, so should be good now
> 
>   Mageia has a current groff, but uses man 1.6g with a stack of patches
>   (some encoding-related)
> 
>   openSUSE has been fine for about the same length of time as Debian
> 
>   Slackware has a current groff, but uses man 1.6g without much in the
>   way of special patches (just one to make things work for UTF-8
>   *output*)
> 
> My guess is that Mageia and Slackware may find that things only work
> properly for users in UTF-8 locales, but most other major distributions
> should be fine.  You won't be the first author to switch to UTF-8 manual
> pages; all you'll be doing is making existing shortcomings perhaps
> marginally more obvious.  In any case, the pages currently encoded in
> ISO-8859-1 won't be very seriously affected, and users of problematic
> systems will only have been able to read the other pages with good luck
> and a following wind anyway; switching to UTF-8 will probably actually
> improve things for them if they're using a UTF-8 locale.  (That is, the
> problems that the affected systems have generally relate to attempting
> to read pages whose encoding doesn't match that of their locale.)  They
> might possibly need to add the -k option to their nroff invocation in
> man.conf.
> 
> If I were you I would just go ahead.
> 
> Regarding your questions in the bug, please do keep the "coding:" tag in
> there; man-db will figure this out by brute force, but if left to its
> own devices I think groff's preconv will default to the locale's
> encoding, so it will only work for some people.

Hello Colin,

Thanks for the extensive reply! One final point. For the pages that
have non-ASCII characters only in source comments, not in rendered
input source, does it matter whether or not the "coding:" tag is added?
I ask because, simply for documentary purposes, I'm wondering whether
we should add that tag only in the pages that have UTF-8 in the rendered
input.
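
To make the distinction concrete, here is a rough sketch of how one
might sort pages into "non-ASCII only in comments" versus "non-ASCII in
rendered input". It is not part of Peter's scripts. The function names
(classify, coding_tag) are made up for illustration, and the comment
detection is deliberately simplistic: it treats everything from the
first \" on a line as a comment and ignores corner cases such as an
escaped backslash immediately before a double quote. Looking for the
tag in the first two lines is likewise an assumption about where it is
expected to sit.

```python
import re

# Simplifying assumption: a roff comment starts at the first \" on a line.
# This covers full-line comments (.\" and '\") and trailing comments.
COMMENT_START = re.compile(r'\\"')

# Emacs-style tag, e.g.  '\" t -*- coding: UTF-8 -*-
CODING_TAG = re.compile(r"coding:\s*([A-Za-z0-9_.-]+)")


def classify(source: str) -> str:
    """Return 'ascii', 'comments-only', or 'rendered' for a page's source."""
    found_in_comment = False
    for line in source.splitlines():
        match = COMMENT_START.search(line)
        if match:
            text, comment = line[:match.start()], line[match.start():]
        else:
            text, comment = line, ""
        if not text.isascii():
            # Non-ASCII outside a comment reaches the formatter's output.
            return "rendered"
        if not comment.isascii():
            found_in_comment = True
    return "comments-only" if found_in_comment else "ascii"


def coding_tag(source: str):
    """Return the encoding named in a coding: tag in the first two lines, or None."""
    for line in source.splitlines()[:2]:
        match = CODING_TAG.search(line)
        if match:
            return match.group(1)
    return None


if __name__ == "__main__":
    import sys

    # Usage: python classify_pages.py man7/*.7
    # Assumes the files have already been converted to UTF-8.
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8") as handle:
            page = handle.read()
        print(path, classify(page), coding_tag(page))
```

Under that scheme, pages reported as "comments-only" are the ones my
question is about, and pages reported as "rendered" are the ones where
the tag clearly matters.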

Cheers,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/