Re: Painlessly shrinking kernel messages (Re: kernel support for non-english user messages)

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* Re: Painlessly shrinking kernel messages (Re: kernel support for non-english user messages)
  2003-04-10 22:08 Painlessly shrinking kernel messages (Re: kernel support for non-english user messages) Timothy Miller
@ 2003-04-10 21:42 ` Alan Cox
  2003-04-10 23:58   ` Timothy Miller
  2003-04-11 23:02   ` Timothy Miller
  0 siblings, 2 replies; 6+ messages in thread
From: Alan Cox @ 2003-04-10 21:42 UTC (permalink / raw)
  To: Timothy Miller; +Cc: Linux Kernel Mailing List

On Thu, 2003-04-10 at 23:08, Timothy Miller wrote:
> To be brief, the idea I came up with was to identify the 128 most common 
> words in kernel messages and replace them with single character values 
> above 127 which printk would decode on the way out.  Once the list was 
> determined, there would be a header file people could use, at their 
> leisure, to make stubstitutions.  So, for instance, instead of having this:

Not a totally crazy idea. You could also do 5pack and some of the other
string tricks people have used in time. You also dont need to do word
boundaries.

For embedded at least this is far from ludicrous as a concept. The
tricky piece for all of these is working out how to grab each printk
format string and do things to it. That lets you do compression,
removal, internationalisation, cataloguing ..



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Painlessly shrinking kernel messages (Re: kernel support for non-english user messages)
@ 2003-04-10 22:08 Timothy Miller
  2003-04-10 21:42 ` Alan Cox
  0 siblings, 1 reply; 6+ messages in thread
From: Timothy Miller @ 2003-04-10 22:08 UTC (permalink / raw)
  To: linux-kernel

I took the liberty of reading the FAQ (yeah, I saw 9.16) and joining the 
list after reading an interesting recent discussion on i18n of kernel 
messages.  In short, the primary maintainers of the kernel don't want 
it, and I agree with them.

HOWEVER, the discussion inspired me to think about ways of reducing some 
of the unfortunate but necessary bloat caused by keeping all of those 
strings in RAM.  Naturally, any way to do this must be absolutely 
painless, so I came up with the following set of restrictions:

- Absolutely no requirement to change existing strings, unless you feel 
like it
- Must be easy to use
- Must actually shrink the kernel
- The impact on the way kernel messages appear should be minimized

To be brief, the idea I came up with was to identify the 128 most common 
words in kernel messages and replace them with single character values 
above 127 which printk would decode on the way out.  Once the list was 
determined, there would be a header file people could use, at their 
leisure, to make stubstitutions.  So, for instance, instead of having this:

    printk("invalid: ...");

We would have this:

    #define MSG_INVALID "\200"
    ...
    prink(MSG_INVALID "...");

To judge the practicality of this, I used 'strings' on an uncompressed 
kernel image (2.4.20, IIRC) and then ran it through this:

tr '[:lower:]' '[:upper:]' | tr '[:blank:]' '\n' | sort | uniq -c | tr ' ' 0

This gave me a list of all words found in the kernel along with their 
counts.  Then I ran it through a positively awful little C program which 
I wrote to determine not the 128 most frequent, but rather, the 128 that 
would result in the maximum shrinkage (maximize count * (length-1)). 
 The results of that run are given below.  The results of the test are 
that this approach might save up to 62424 bytes of kernel space which is 
only about 3% of the kernel image size I got the strings from, but it's 
nearly 27% of the total output I got from 'strings'.  Is it worth it? 
 Maybe not yet, but then again, there may be an even more intelligent 
approach to this compression that we could use, hopefully one which 
wouldn't require any more effort to use.

Here's are the results:

   count string
-------- --------
      37 GIGABIT
     102 BLOCK
      62 NULL
     871 [^_]
      26 INTERFACE
      23 MICROSYSTEMS
      75 RAGE
     338 SE
     226 TECH
     113 DEVICE
     214 <3>
     838 PC
      19 <3>INIT_MODULE:
      35 REGISTER
      41 <3>EXT3-FS
     656 UWVS
      57 NETWORK
      32 SUPPORT
      97 COMPUTER
     878 [^_
     137 NET
     198 MODE
     534 INC
      33 INTERNATIONAL
      59 CARDBUS
     203 TECHNO
     119 TECHNOLOGY
      46 CORP.
      31 EXT2-FS
     290 CONTROLLER
      64 ASSERTION
      83 DATA/FAX
     249 DATA
      60 KERNEL:
     304 CONTROL
      33 INVALID
     322 %D
     486 PCI
     185 INC.
      61 ERROR
      80 PORT
     154 IDE
      74 INODE
     102 <4>
      88 KERNEL
      52 ELECTRONICS
      44 <3>EXT3
     117 FAILED
      70 AUDIO
      83 HOST
      27 SEMICONDUCTOR
      50 CHIPS
      63 DEVFS
     117 ETHERNET
     299 ID
     291 COM
      46 CANNOT
      24 TRANSACTION
     238 TO
      79 TECHNOLOGIES
      63 %08X
      98 D$$
      37 PROCESS
     288 CORP
      56 DATA/FAX/VOICE
      39 COMMUNICATIONS
      44 10/100
      38 SERIAL
     146 CORPORATION
     236 TEC
     107 MICRO
      26 MICROSYSTEM
      95 ADAPTER
     324 NO
      50 POWER
     121 56K
      27 ACCELERATOR
      33 RESEARCH
      21 INTEGRATED
     271 PRO
      19 TECHNOLOGIES,
     237 LT
      43 CHIPSET
      28 NETWORKS
     317 L$
      40 <3>EXT3-FS:
    1665 CO
     192 BRIDGE
      13 MICROELECTRONICS
     157 JOURNAL
     147 FOR
      91 9D$
      18 CYBERSERIAL
      54 CYBER
      56 MEMORY
      34 DATA/FAX/VOICE/SPKP
      49 SMART
     207 LTD
     137 TCP
      57 CACHE
     407 T$
     160 <6>
      26 GRAPHICS
     888 D$
     140 SYSTEMS
     249 AT
       6 JOURNAL->J_COMMITTING_TRANSACTION
     142 MODEM
      32 CHANNEL
     131 %S:
     394 %S
      14 COMMIT_TRANSACTION
      63 FILE
      28 SMARTDAA)
      67 CHIP
      30 WINMODEM
     113 NOT
     139 ETH
     331 DEV
     197 FO
      52 VIDEO
      73 ELECTRONIC
      67 EXT3
      99 CARD
    1336 IN
     222 SYSTEM
     197 AD
      53 COMMUNICATION
Total reduction: 62424

Comments?

NOTE:  I realize that some of those words probably aren't actually 
"strings" in the kernel.  This is a feasibility test, not a suggested list.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Painlessly shrinking kernel messages (Re: kernel support for non-english user messages)
  2003-04-10 21:42 ` Alan Cox
@ 2003-04-10 23:58   ` Timothy Miller
  2003-04-11  1:14     ` Alan Cox
  2003-04-11 23:02   ` Timothy Miller
  1 sibling, 1 reply; 6+ messages in thread
From: Timothy Miller @ 2003-04-10 23:58 UTC (permalink / raw)
  To: Alan Cox; +Cc: Linux Kernel Mailing List

Alan Cox wrote:

>Not a totally crazy idea. You could also do 5pack and some of the other
>string tricks people have used in time. You also dont need to do word
>boundaries.
>
My google search for '5pack' didn't come up with anything relevant. 
 Things that come to mind include converting to a character set which 
requires fewer than 8 bits per character and then packing them into 
bytes.  Or perhaps making a list of every quintuplet of characters that 
ever occurs and assign them codes.

I initially considered the idea of ignoring word boundaries.  I rejected 
it because part of the "painless" factor would be that it could be done 
manually without a lot of thinking.  But I will run a test which ignores 
word boundaries and see what kinds of results I get.  Of course, if we 
want to do something that involves some post-compile magic or whatnot, 
then we can do all sorts of gnarley tricks.  But that doesn't differ (in 
complexity) much from the idea someone else mentioned which was to 
completely remove all messages from the kernel by magically converting 
them to numbers or hashes and then decoding them outside of the kernel.  

There was mentioned a valid point that boot messages need to be handled 
properly by the kernel before any services are up.  Separating the boot 
messages from the non-boot messages would require manual intervention 
that goes against the painless factor, and is the pie slice containing 
only non-boot messages large enough that it's worth it?  There seem to 
be quite a lot of boot messages that could benefit from some sort of 
completely-in-kernel compression.

>
>For embedded at least this is far from ludicrous as a concept. The
>tricky piece for all of these is working out how to grab each printk
>format string and do things to it. That lets you do compression,
>removal, internationalisation, cataloguing ..
>  
>

Hmmm...
- Make gcc produce assember output
- Find all calls to prink
- Cross-reference those against all static strings
- Compress the strings
- Run through gas, etc.

The problem with this approach is that we have to deal with different 
architectures.  The plus is that any unsupported arch just doesn't run 
the compression tool and uses regular printk.

How about:
- Use perl or yacc or something to parse the kernel source for strings
- Compress them
- Make the substitutions inline in the source as part of the 
pre-processing stage
- Compile

Heck, we could just embed this functionality directly into the 
preprocessor.  Unfortunately, this one is somewhat beyond my current 
knowledge of the tools that would make it convenient.

Just as a note, I worked on my test program to make it a more accurate. 
 For 128 codes, the actual reduction is 38946 bytes.  For this 
algorithm, I look to see if any of the shorter words are contained in 
any of the larger ones; in the case where the shorter word's 
substitution would shrink the kernel more than the larger, I add the 
larger word's count to the smaller and delete the larger.  

If we were to outlaw some of the lower characters, such as most 
non-printing characters and all lower-case, then that brings us up to 
having 184 codes to work with.  That lets us save 42692 bytes.  If we 
were to go to two-character codes, where the first one is 128-255 and 
the second is 1-255, that brings the number of codes up to 32640.  It 
turns out that, with my current algorithm, it doesn't buy anything, and 
it also violates the painless factor by giving people a huge list of 
words they have to pick from when writing kernel messages.  Also, it 
turns out that there are only just over 500 different words which would 
save more than 2 bytes by being encoded.

I need to get a LOT more clever about this before it's worth doing. 
 I'll try the no-word-boundaries approach.  And we'll see how interested 
other people are in having to DEAL with it.

BTW, should I faint or something because THE Alan Cox responded to my 
first post to lkml?  :)
You hate it when people say that sort of thing, don't you.  :)

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Painlessly shrinking kernel messages (Re: kernel support for non-english user messages)
  2003-04-10 23:58   ` Timothy Miller
@ 2003-04-11  1:14     ` Alan Cox
  0 siblings, 0 replies; 6+ messages in thread
From: Alan Cox @ 2003-04-11  1:14 UTC (permalink / raw)
  To: Timothy Miller; +Cc: Linux Kernel Mailing List

On Fri, 2003-04-11 at 00:58, Timothy Miller wrote:
> My google search for '5pack' didn't come up with anything relevant. 
>  Things that come to mind include converting to a character set which 

Its a thing from the old 8bit gaming world.  You code in 5bit chunks
with a leading length marker. 5bits is enough for a-z and some bits
of punctuation, plus capital implying space and 'escape' for an 8bit
sequence block.

Gets you a bit under 40% compression with real life data and takes about
200 bytes to decode



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Painlessly shrinking kernel messages (Re: kernel support for non-english user messages)
  2003-04-10 21:42 ` Alan Cox
  2003-04-10 23:58   ` Timothy Miller
@ 2003-04-11 23:02   ` Timothy Miller
  2003-04-11 23:03     ` David Lang
  1 sibling, 1 reply; 6+ messages in thread
From: Timothy Miller @ 2003-04-11 23:02 UTC (permalink / raw)
  To: linux-kernel

This time, my text compression program ignores word boundaries.  Below 
are the results.

"shrink" is the size reduction due to replacing the given string with a 
one-byte code. (accounts for space required for codes and the decode 
string plus nul).
"count"  is how many times the string appears.
Strings are delimited by [brackets].

At the end, the "original size" accounts for null termination for all 
strings.  
The "new size" is the size of the text with the codes inserted plus the 
amount of space required to store what the codes map to.

Nothing accounts for the space wasted due to the strings being 32-bit 
aligned or whatever various architectures might require.  This is just 
an estimate.

What the heck is "[^_]" ?  Is it a machine instruction that 'strings' is 
mistaking for a string?  I grep'd through the kernel source and didn't 
find it.

The results are that the kernel messages are reduced from 232690 to 
154365, which is a savings of 33%.  Not bad, but it's probably still not 
worth it yet; the pain is still greater than the benefit.

Here's the list.

shrink count word
------ ----- ------------------------------
3166   1057  [[^_]]
2888    290  [ CONTROLLER]
1963    656  [UWVS]
1771    198  [ TECHNOLOG]
1758    881  [ IN]
1678    561  [TION]
1438    721  [ CO]
1268    255  [SYSTEM]
1126    189  [ BRIDGE]
1116    560  [PCI]
 994    167  [JOURNAL]
 986     53  [F 56K DATA/FAX/VOICE]
 964    323  [    ]
 932    468  [TER]
 870    110  [ ETHERNET]
 824    414  [ING]
 772    388  [DEV]
 745     84  [ELECTRONIC]
 742    249  [MODE]
 738    371  [POR]
 706    355  [ED ]
 706     60  [KERNEL: ASSER]
 643    216  [LOCK]
 592    298  [PRO]
 586    295  [ENT]
 554    279  [S: ]
 516    260  [REA]
 510    257  [IDE]
 496    167  [ABLE]
 478     81  [TRANSAC]
 474    239  [AGE]
 474    239  [NET]
 466    235  [ON ]
 464    234  [EXT]
 458    116  [CACHE]
 424    214  [<3>]
 418    106  [REGIS]
 416    210  [RES]
 415    140  [ LTD]
 415    140  [FAIL]
 414    209  [ %D]
 409    138  [DATA]
 394    199  [COM]
 394    133  [NOT ]
 390    197  [ARD]
 388    196  [E_R]
 386     98  [ ADAP]
 378    191  [FOR]
 374     95  [MICRO]
 373     16  [ ROAD RUNNER FRAME GRABBER]
 370    187  [SET]
 354    179  [VER]
 350    177  [ELE]
 348    176  [ALL]
 326    165  [100]
 318    161  [AND]
 312    158  [_IN]
 307    104  [TIME]
 306    155  [<6>]
 303     62  [AIC-78]
 298     26  [SEMICONDUCTOR]
 294    149  [SER]
 288    146  [AL ]
 288     59  [MEMORY]
 283     96  [%02X]
 282    143  [LIN]
 280    142  [ATE]
 273     56  [BUFFER]
 272    138  [TCP]
 270    137  [ %S]
 262     67  [AUDIO]
 258     66  [82801]
 256    130  [ECT]
 254     65  [ HOST]
 254    129  [PER]
 253     16  [08X %08X %08X %08X]
 250    127  [TO ]
 248    126  [OUN]
 248    126  [PHT]
 246     63  [C(%D)]
 246     32  [SMARTDAA)]
 244     42  [CANNOT ]
 244    124  [ORT]
 238    121  [9D$]
 238    121  [CHI]
 238     49  [F 56K ]
 238     21  [FIBRE CHANNEL]
 236    120  [T_R]
 234    119  [BIT]
 233     48  [MODULE]
 232    118  [CON]
 232    118  [NIC]
 230    117  [%LU]
 230     59  [POWER]
 230    117  [REE]
 226     58  [ERROR]
 215     32  [%U.%U.%U]
 215     32  [GRAPHICS]
 214    109  [EST]
 214     73  [NULL]
 214     73  [SOCK]
 212    108  [ALI]
 212    108  [TRI]
 210     54  [WRITE]
 208    106  [%08]
 202    103  [BUS]
 202     69  [GDT ]
 202     52  [GROUP]
 200    102  [<4>]
 200    102  [D$ ]
 200    102  [OUT]
 196     67  [FILE]
 196     67  [GET_]
 193     16  [ MIL-STD-1553 ]
 190     25  [ [APOLLO ]
 184     94  [ESS]
 182     47  [CHECK]
 182     93  [D$$]
 176     90  [STA]
 174     89  [AGP]
 174     45  [CYBER]
 173      8  [ABCDEFGHIJKLMNOPQRSTUVWXYZ]
 169     58  [TECH]
 168     86  [ RE]
 166     85  [82C]
 160     82  [000]
 160     82  [ACC]
 160     82  [END]
original size=232690, new size=154365

You may notice a goof my program makes.  The string "%U.%U.%U" appears 
in there, when what is actually found in the kernel is "%U.%U.%U.%U". 
 That's because there are more of the former, but only because the 
program didn't realize that pairs of the former overlap with each other. 
 I'll see about fixing that later, but really, I think we'll need to 
hand-pick strings anyhow.

Lastly, I decided to allow two-character strings to be encoded and upped 
the number of codes to 184.  This is the result:
original size=232690, new size=109133
That's a reduction of 53%.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Painlessly shrinking kernel messages (Re: kernel support for non-english user messages)
  2003-04-11 23:02   ` Timothy Miller
@ 2003-04-11 23:03     ` David Lang
  0 siblings, 0 replies; 6+ messages in thread
From: David Lang @ 2003-04-11 23:03 UTC (permalink / raw)
  To: Timothy Miller; +Cc: linux-kernel

this is definantly something that wouldn't make sense to do manually, but
if somone can figure out how to do this as part of the build process the
80K saved can't hurt.

David Lang

On Fri, 11 Apr 2003, Timothy Miller wrote:

> Date: Fri, 11 Apr 2003 19:02:33 -0400
> From: Timothy Miller <miller@techsource.com>
> To: linux-kernel@vger.kernel.org
> Subject: Re: Painlessly shrinking kernel messages (Re: kernel support for
>     non-english user messages)
>
> This time, my text compression program ignores word boundaries.  Below
> are the results.
>
> "shrink" is the size reduction due to replacing the given string with a
> one-byte code. (accounts for space required for codes and the decode
> string plus nul).
> "count"  is how many times the string appears.
> Strings are delimited by [brackets].
>
> At the end, the "original size" accounts for null termination for all
> strings.
> The "new size" is the size of the text with the codes inserted plus the
> amount of space required to store what the codes map to.
>
> Nothing accounts for the space wasted due to the strings being 32-bit
> aligned or whatever various architectures might require.  This is just
> an estimate.
>
> What the heck is "[^_]" ?  Is it a machine instruction that 'strings' is
> mistaking for a string?  I grep'd through the kernel source and didn't
> find it.
>
>
> The results are that the kernel messages are reduced from 232690 to
> 154365, which is a savings of 33%.  Not bad, but it's probably still not
> worth it yet; the pain is still greater than the benefit.
>
>
> Here's the list.
>
> shrink count word
> ------ ----- ------------------------------
> 3166   1057  [[^_]]
> 2888    290  [ CONTROLLER]
> 1963    656  [UWVS]
> 1771    198  [ TECHNOLOG]
> 1758    881  [ IN]
> 1678    561  [TION]
> 1438    721  [ CO]
> 1268    255  [SYSTEM]
> 1126    189  [ BRIDGE]
> 1116    560  [PCI]
>  994    167  [JOURNAL]
>  986     53  [F 56K DATA/FAX/VOICE]
>  964    323  [    ]
>  932    468  [TER]
>  870    110  [ ETHERNET]
>  824    414  [ING]
>  772    388  [DEV]
>  745     84  [ELECTRONIC]
>  742    249  [MODE]
>  738    371  [POR]
>  706    355  [ED ]
>  706     60  [KERNEL: ASSER]
>  643    216  [LOCK]
>  592    298  [PRO]
>  586    295  [ENT]
>  554    279  [S: ]
>  516    260  [REA]
>  510    257  [IDE]
>  496    167  [ABLE]
>  478     81  [TRANSAC]
>  474    239  [AGE]
>  474    239  [NET]
>  466    235  [ON ]
>  464    234  [EXT]
>  458    116  [CACHE]
>  424    214  [<3>]
>  418    106  [REGIS]
>  416    210  [RES]
>  415    140  [ LTD]
>  415    140  [FAIL]
>  414    209  [ %D]
>  409    138  [DATA]
>  394    199  [COM]
>  394    133  [NOT ]
>  390    197  [ARD]
>  388    196  [E_R]
>  386     98  [ ADAP]
>  378    191  [FOR]
>  374     95  [MICRO]
>  373     16  [ ROAD RUNNER FRAME GRABBER]
>  370    187  [SET]
>  354    179  [VER]
>  350    177  [ELE]
>  348    176  [ALL]
>  326    165  [100]
>  318    161  [AND]
>  312    158  [_IN]
>  307    104  [TIME]
>  306    155  [<6>]
>  303     62  [AIC-78]
>  298     26  [SEMICONDUCTOR]
>  294    149  [SER]
>  288    146  [AL ]
>  288     59  [MEMORY]
>  283     96  [%02X]
>  282    143  [LIN]
>  280    142  [ATE]
>  273     56  [BUFFER]
>  272    138  [TCP]
>  270    137  [ %S]
>  262     67  [AUDIO]
>  258     66  [82801]
>  256    130  [ECT]
>  254     65  [ HOST]
>  254    129  [PER]
>  253     16  [08X %08X %08X %08X]
>  250    127  [TO ]
>  248    126  [OUN]
>  248    126  [PHT]
>  246     63  [C(%D)]
>  246     32  [SMARTDAA)]
>  244     42  [CANNOT ]
>  244    124  [ORT]
>  238    121  [9D$]
>  238    121  [CHI]
>  238     49  [F 56K ]
>  238     21  [FIBRE CHANNEL]
>  236    120  [T_R]
>  234    119  [BIT]
>  233     48  [MODULE]
>  232    118  [CON]
>  232    118  [NIC]
>  230    117  [%LU]
>  230     59  [POWER]
>  230    117  [REE]
>  226     58  [ERROR]
>  215     32  [%U.%U.%U]
>  215     32  [GRAPHICS]
>  214    109  [EST]
>  214     73  [NULL]
>  214     73  [SOCK]
>  212    108  [ALI]
>  212    108  [TRI]
>  210     54  [WRITE]
>  208    106  [%08]
>  202    103  [BUS]
>  202     69  [GDT ]
>  202     52  [GROUP]
>  200    102  [<4>]
>  200    102  [D$ ]
>  200    102  [OUT]
>  196     67  [FILE]
>  196     67  [GET_]
>  193     16  [ MIL-STD-1553 ]
>  190     25  [ [APOLLO ]
>  184     94  [ESS]
>  182     47  [CHECK]
>  182     93  [D$$]
>  176     90  [STA]
>  174     89  [AGP]
>  174     45  [CYBER]
>  173      8  [ABCDEFGHIJKLMNOPQRSTUVWXYZ]
>  169     58  [TECH]
>  168     86  [ RE]
>  166     85  [82C]
>  160     82  [000]
>  160     82  [ACC]
>  160     82  [END]
> original size=232690, new size=154365
>
>
> You may notice a goof my program makes.  The string "%U.%U.%U" appears
> in there, when what is actually found in the kernel is "%U.%U.%U.%U".
>  That's because there are more of the former, but only because the
> program didn't realize that pairs of the former overlap with each other.
>  I'll see about fixing that later, but really, I think we'll need to
> hand-pick strings anyhow.
>
> Lastly, I decided to allow two-character strings to be encoded and upped
> the number of codes to 184.  This is the result:
> original size=232690, new size=109133
> That's a reduction of 53%.
>
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2003-04-11 22:55 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2003-04-10 22:08 Painlessly shrinking kernel messages (Re: kernel support for non-english user messages) Timothy Miller
2003-04-10 21:42 ` Alan Cox
2003-04-10 23:58   ` Timothy Miller
2003-04-11  1:14     ` Alan Cox
2003-04-11 23:02   ` Timothy Miller
2003-04-11 23:03     ` David Lang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox