public inbox for linux-kernel@vger.kernel.org
From: Timothy Miller <miller@techsource.com>
To: linux-kernel@vger.kernel.org
Subject: Painlessly shrinking kernel messages (Re: kernel support for non-english user messages)
Date: Thu, 10 Apr 2003 18:08:45 -0400
Message-ID: <3E95EB6D.4020004@techsource.com>

I took the liberty of reading the FAQ (yeah, I saw 9.16) and joining the 
list after reading an interesting recent discussion on i18n of kernel 
messages.  In short, the primary maintainers of the kernel don't want 
it, and I agree with them.

HOWEVER, the discussion inspired me to think about ways of reducing some 
of the unfortunate but necessary bloat caused by keeping all of those 
strings in RAM.  Naturally, any way to do this must be absolutely 
painless, so I came up with the following set of restrictions:

- Absolutely no requirement to change existing strings, unless you feel 
like it
- Must be easy to use
- Must actually shrink the kernel
- The impact on the way kernel messages appear should be minimized

To be brief, the idea I came up with was to identify the 128 most common 
words in kernel messages and replace them with single character values 
above 127 which printk would decode on the way out.  Once the list was 
determined, there would be a header file people could use, at their 
leisure, to make substitutions.  So, for instance, instead of having this:

    printk("invalid: ...");

We would have this:

    #define MSG_INVALID "\200"
    ...
    printk(MSG_INVALID ": ...");
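
To make the decode step concrete, here is a minimal userspace sketch of 
what the expansion on the way out might look like.  It is not kernel 
code and not an actual printk patch: the table contents, the MSG_FAILED 
and MSG_DEVICE macros, and the msg_expand() helper are all hypothetical 
names made up for illustration.  Only MSG_INVALID "\200" comes from the 
example above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical word table: byte value 0x80 + i stands for msg_words[i]. */
static const char *const msg_words[] = {
    "invalid",  /* "\200" */
    "failed",   /* "\201" */
    "device",   /* "\202" */
};
#define MSG_WORD_COUNT (sizeof(msg_words) / sizeof(msg_words[0]))

/* Header-file macros; only MSG_INVALID appears in the original post. */
#define MSG_INVALID "\200"
#define MSG_FAILED  "\201"
#define MSG_DEVICE  "\202"

/*
 * Expand token bytes in src into dst, which holds dstlen bytes.
 * Output is truncated to fit and always NUL-terminated.
 * Returns the number of characters written, not counting the NUL.
 */
static size_t msg_expand(char *dst, size_t dstlen, const char *src)
{
    const unsigned char *p;
    size_t out = 0;

    assert(dst != NULL && src != NULL && dstlen > 0);

    for (p = (const unsigned char *)src; *p != '\0'; p++) {
        if (*p >= 0x80 && (size_t)(*p - 0x80) < MSG_WORD_COUNT) {
            /* Token byte: copy the whole word, clamped to the room left. */
            const char *w = msg_words[*p - 0x80];
            size_t n = strlen(w);

            if (n > dstlen - 1 - out)
                n = dstlen - 1 - out;
            memcpy(dst + out, w, n);
            out += n;
        } else if (out < dstlen - 1) {
            /* Ordinary byte (or unknown token): pass through unchanged. */
            dst[out++] = (char)*p;
        }
    }
    dst[out] = '\0';
    return out;
}
```

With this table, msg_expand(buf, sizeof(buf), MSG_INVALID ": bad " 
MSG_DEVICE) fills buf with "invalid: bad device".  Strings that contain 
no bytes above 127 pass through untouched, which is what keeps the 
scheme optional for existing messages.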


To judge the practicality of this, I used 'strings' on an uncompressed 
kernel image (2.4.20, IIRC) and then ran it through this:

tr '[:lower:]' '[:upper:]' | tr '[:blank:]' '\n' | sort | uniq -c | tr ' ' 0

This gave me a list of all words found in the kernel along with their 
counts.  Then I ran it through a positively awful little C program which 
I wrote to determine not the 128 most frequent words, but rather the 128 
that would result in the maximum shrinkage (maximize count * (length-1)). 
The results of that run are given below.  The test suggests that this 
approach might save up to 62424 bytes of kernel space, which is only 
about 3% of the kernel image size I got the strings from, but nearly 27% 
of the total output I got from 'strings'.  Is it worth it?  Maybe not 
yet, but then again, there may be an even more intelligent approach to 
this compression that we could use, hopefully one which wouldn't require 
any more effort to use.
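
The selection step can be sketched roughly as follows.  This is not the 
original program, which is not shown here; it is a small illustration of 
ranking words by count * (length-1) and summing the best entries.  The 
struct word_count type and the word_saving(), by_saving_desc() and 
pick_best() names are assumptions made for the example, and it treats 
every word as independent (no allowance for overlapping words or for 
the size of the decode table).

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* One line of the "count word" list produced by the pipeline. */
struct word_count {
    const char *word;
    unsigned long count;
};

/* Bytes saved if every occurrence of the word shrinks to one byte. */
static unsigned long word_saving(const struct word_count *wc)
{
    size_t len = strlen(wc->word);

    return len > 1 ? wc->count * (unsigned long)(len - 1) : 0;
}

/* qsort comparator: largest saving first. */
static int by_saving_desc(const void *a, const void *b)
{
    unsigned long sa = word_saving(a);
    unsigned long sb = word_saving(b);

    return (sa < sb) - (sa > sb);
}

/*
 * Sort the n entries in place by saving and return the total saving
 * of the best `slots` of them (128 in the scheme described above).
 */
static unsigned long pick_best(struct word_count *words, size_t n,
                               size_t slots)
{
    unsigned long total = 0;
    size_t i;

    qsort(words, n, sizeof(words[0]), by_saving_desc);
    for (i = 0; i < n && i < slots; i++)
        total += word_saving(&words[i]);
    return total;
}
```

For example, CONTROLLER with a count of 290 is worth 290 * 9 = 2610 
bytes, while CO with a count of 1665 is worth only 1665 * 1 = 1665, 
which is why frequency alone is not the right ranking.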

Here are the results:

   count string
-------- --------
      37 GIGABIT
     102 BLOCK
      62 NULL
     871 [^_]
      26 INTERFACE
      23 MICROSYSTEMS
      75 RAGE
     338 SE
     226 TECH
     113 DEVICE
     214 <3>
     838 PC
      19 <3>INIT_MODULE:
      35 REGISTER
      41 <3>EXT3-FS
     656 UWVS
      57 NETWORK
      32 SUPPORT
      97 COMPUTER
     878 [^_
     137 NET
     198 MODE
     534 INC
      33 INTERNATIONAL
      59 CARDBUS
     203 TECHNO
     119 TECHNOLOGY
      46 CORP.
      31 EXT2-FS
     290 CONTROLLER
      64 ASSERTION
      83 DATA/FAX
     249 DATA
      60 KERNEL:
     304 CONTROL
      33 INVALID
     322 %D
     486 PCI
     185 INC.
      61 ERROR
      80 PORT
     154 IDE
      74 INODE
     102 <4>
      88 KERNEL
      52 ELECTRONICS
      44 <3>EXT3
     117 FAILED
      70 AUDIO
      83 HOST
      27 SEMICONDUCTOR
      50 CHIPS
      63 DEVFS
     117 ETHERNET
     299 ID
     291 COM
      46 CANNOT
      24 TRANSACTION
     238 TO
      79 TECHNOLOGIES
      63 %08X
      98 D$$
      37 PROCESS
     288 CORP
      56 DATA/FAX/VOICE
      39 COMMUNICATIONS
      44 10/100
      38 SERIAL
     146 CORPORATION
     236 TEC
     107 MICRO
      26 MICROSYSTEM
      95 ADAPTER
     324 NO
      50 POWER
     121 56K
      27 ACCELERATOR
      33 RESEARCH
      21 INTEGRATED
     271 PRO
      19 TECHNOLOGIES,
     237 LT
      43 CHIPSET
      28 NETWORKS
     317 L$
      40 <3>EXT3-FS:
    1665 CO
     192 BRIDGE
      13 MICROELECTRONICS
     157 JOURNAL
     147 FOR
      91 9D$
      18 CYBERSERIAL
      54 CYBER
      56 MEMORY
      34 DATA/FAX/VOICE/SPKP
      49 SMART
     207 LTD
     137 TCP
      57 CACHE
     407 T$
     160 <6>
      26 GRAPHICS
     888 D$
     140 SYSTEMS
     249 AT
       6 JOURNAL->J_COMMITTING_TRANSACTION
     142 MODEM
      32 CHANNEL
     131 %S:
     394 %S
      14 COMMIT_TRANSACTION
      63 FILE
      28 SMARTDAA)
      67 CHIP
      30 WINMODEM
     113 NOT
     139 ETH
     331 DEV
     197 FO
      52 VIDEO
      73 ELECTRONIC
      67 EXT3
      99 CARD
    1336 IN
     222 SYSTEM
     197 AD
      53 COMMUNICATION
Total reduction: 62424

Comments?

NOTE:  I realize that some of those words probably aren't actually 
"strings" in the kernel.  This is a feasibility test, not a suggested list.



