* Re: Painlessly shrinking kernel messages (Re: kernel support for non-english user messages)
2003-04-10 22:08 Painlessly shrinking kernel messages (Re: kernel support for non-english user messages) Timothy Miller
@ 2003-04-10 21:42 ` Alan Cox
2003-04-10 23:58 ` Timothy Miller
2003-04-11 23:02 ` Timothy Miller
0 siblings, 2 replies; 6+ messages in thread
From: Alan Cox @ 2003-04-10 21:42 UTC (permalink / raw)
To: Timothy Miller; +Cc: Linux Kernel Mailing List
On Thu, 2003-04-10 at 23:08, Timothy Miller wrote:
> To be brief, the idea I came up with was to identify the 128 most common
> words in kernel messages and replace them with single character values
> above 127 which printk would decode on the way out. Once the list was
> determined, there would be a header file people could use, at their
> leisure, to make stubstitutions. So, for instance, instead of having this:
Not a totally crazy idea. You could also do 5pack and some of the other
string tricks people have used in time. You also dont need to do word
boundaries.
For embedded at least this is far from ludicrous as a concept. The
tricky piece for all of these is working out how to grab each printk
format string and do things to it. That lets you do compression,
removal, internationalisation, cataloguing ..
^ permalink raw reply [flat|nested] 6+ messages in thread
* Painlessly shrinking kernel messages (Re: kernel support for non-english user messages)
@ 2003-04-10 22:08 Timothy Miller
2003-04-10 21:42 ` Alan Cox
0 siblings, 1 reply; 6+ messages in thread
From: Timothy Miller @ 2003-04-10 22:08 UTC (permalink / raw)
To: linux-kernel
I took the liberty of reading the FAQ (yeah, I saw 9.16) and joining the
list after reading an interesting recent discussion on i18n of kernel
messages. In short, the primary maintainers of the kernel don't want
it, and I agree with them.
HOWEVER, the discussion inspired me to think about ways of reducing some
of the unfortunate but necessary bloat caused by keeping all of those
strings in RAM. Naturally, any way to do this must be absolutely
painless, so I came up with the following set of restrictions:
- Absolutely no requirement to change existing strings, unless you feel
like it
- Must be easy to use
- Must actually shrink the kernel
- The impact on the way kernel messages appear should be minimized
To be brief, the idea I came up with was to identify the 128 most common
words in kernel messages and replace them with single character values
above 127 which printk would decode on the way out. Once the list was
determined, there would be a header file people could use, at their
leisure, to make stubstitutions. So, for instance, instead of having this:
printk("invalid: ...");
We would have this:
#define MSG_INVALID "\200"
...
prink(MSG_INVALID "...");
To judge the practicality of this, I used 'strings' on an uncompressed
kernel image (2.4.20, IIRC) and then ran it through this:
tr '[:lower:]' '[:upper:]' | tr '[:blank:]' '\n' | sort | uniq -c | tr ' ' 0
This gave me a list of all words found in the kernel along with their
counts. Then I ran it through a positively awful little C program which
I wrote to determine not the 128 most frequent, but rather, the 128 that
would result in the maximum shrinkage (maximize count * (length-1)).
The results of that run are given below. The results of the test are
that this approach might save up to 62424 bytes of kernel space which is
only about 3% of the kernel image size I got the strings from, but it's
nearly 27% of the total output I got from 'strings'. Is it worth it?
Maybe not yet, but then again, there may be an even more intelligent
approach to this compression that we could use, hopefully one which
wouldn't require any more effort to use.
Here's are the results:
count string
-------- --------
37 GIGABIT
102 BLOCK
62 NULL
871 [^_]
26 INTERFACE
23 MICROSYSTEMS
75 RAGE
338 SE
226 TECH
113 DEVICE
214 <3>
838 PC
19 <3>INIT_MODULE:
35 REGISTER
41 <3>EXT3-FS
656 UWVS
57 NETWORK
32 SUPPORT
97 COMPUTER
878 [^_
137 NET
198 MODE
534 INC
33 INTERNATIONAL
59 CARDBUS
203 TECHNO
119 TECHNOLOGY
46 CORP.
31 EXT2-FS
290 CONTROLLER
64 ASSERTION
83 DATA/FAX
249 DATA
60 KERNEL:
304 CONTROL
33 INVALID
322 %D
486 PCI
185 INC.
61 ERROR
80 PORT
154 IDE
74 INODE
102 <4>
88 KERNEL
52 ELECTRONICS
44 <3>EXT3
117 FAILED
70 AUDIO
83 HOST
27 SEMICONDUCTOR
50 CHIPS
63 DEVFS
117 ETHERNET
299 ID
291 COM
46 CANNOT
24 TRANSACTION
238 TO
79 TECHNOLOGIES
63 %08X
98 D$$
37 PROCESS
288 CORP
56 DATA/FAX/VOICE
39 COMMUNICATIONS
44 10/100
38 SERIAL
146 CORPORATION
236 TEC
107 MICRO
26 MICROSYSTEM
95 ADAPTER
324 NO
50 POWER
121 56K
27 ACCELERATOR
33 RESEARCH
21 INTEGRATED
271 PRO
19 TECHNOLOGIES,
237 LT
43 CHIPSET
28 NETWORKS
317 L$
40 <3>EXT3-FS:
1665 CO
192 BRIDGE
13 MICROELECTRONICS
157 JOURNAL
147 FOR
91 9D$
18 CYBERSERIAL
54 CYBER
56 MEMORY
34 DATA/FAX/VOICE/SPKP
49 SMART
207 LTD
137 TCP
57 CACHE
407 T$
160 <6>
26 GRAPHICS
888 D$
140 SYSTEMS
249 AT
6 JOURNAL->J_COMMITTING_TRANSACTION
142 MODEM
32 CHANNEL
131 %S:
394 %S
14 COMMIT_TRANSACTION
63 FILE
28 SMARTDAA)
67 CHIP
30 WINMODEM
113 NOT
139 ETH
331 DEV
197 FO
52 VIDEO
73 ELECTRONIC
67 EXT3
99 CARD
1336 IN
222 SYSTEM
197 AD
53 COMMUNICATION
Total reduction: 62424
Comments?
NOTE: I realize that some of those words probably aren't actually
"strings" in the kernel. This is a feasibility test, not a suggested list.
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: Painlessly shrinking kernel messages (Re: kernel support for non-english user messages)
2003-04-10 21:42 ` Alan Cox
@ 2003-04-10 23:58 ` Timothy Miller
2003-04-11 1:14 ` Alan Cox
2003-04-11 23:02 ` Timothy Miller
1 sibling, 1 reply; 6+ messages in thread
From: Timothy Miller @ 2003-04-10 23:58 UTC (permalink / raw)
To: Alan Cox; +Cc: Linux Kernel Mailing List
Alan Cox wrote:
>Not a totally crazy idea. You could also do 5pack and some of the other
>string tricks people have used in time. You also dont need to do word
>boundaries.
>
My google search for '5pack' didn't come up with anything relevant.
Things that come to mind include converting to a character set which
requires fewer than 8 bits per character and then packing them into
bytes. Or perhaps making a list of every quintuplet of characters that
ever occurs and assign them codes.
I initially considered the idea of ignoring word boundaries. I rejected
it because part of the "painless" factor would be that it could be done
manually without a lot of thinking. But I will run a test which ignores
word boundaries and see what kinds of results I get. Of course, if we
want to do something that involves some post-compile magic or whatnot,
then we can do all sorts of gnarley tricks. But that doesn't differ (in
complexity) much from the idea someone else mentioned which was to
completely remove all messages from the kernel by magically converting
them to numbers or hashes and then decoding them outside of the kernel.
There was mentioned a valid point that boot messages need to be handled
properly by the kernel before any services are up. Separating the boot
messages from the non-boot messages would require manual intervention
that goes against the painless factor, and is the pie slice containing
only non-boot messages large enough that it's worth it? There seem to
be quite a lot of boot messages that could benefit from some sort of
completely-in-kernel compression.
>
>For embedded at least this is far from ludicrous as a concept. The
>tricky piece for all of these is working out how to grab each printk
>format string and do things to it. That lets you do compression,
>removal, internationalisation, cataloguing ..
>
>
Hmmm...
- Make gcc produce assember output
- Find all calls to prink
- Cross-reference those against all static strings
- Compress the strings
- Run through gas, etc.
The problem with this approach is that we have to deal with different
architectures. The plus is that any unsupported arch just doesn't run
the compression tool and uses regular printk.
How about:
- Use perl or yacc or something to parse the kernel source for strings
- Compress them
- Make the substitutions inline in the source as part of the
pre-processing stage
- Compile
Heck, we could just embed this functionality directly into the
preprocessor. Unfortunately, this one is somewhat beyond my current
knowledge of the tools that would make it convenient.
Just as a note, I worked on my test program to make it a more accurate.
For 128 codes, the actual reduction is 38946 bytes. For this
algorithm, I look to see if any of the shorter words are contained in
any of the larger ones; in the case where the shorter word's
substitution would shrink the kernel more than the larger, I add the
larger word's count to the smaller and delete the larger.
If we were to outlaw some of the lower characters, such as most
non-printing characters and all lower-case, then that brings us up to
having 184 codes to work with. That lets us save 42692 bytes. If we
were to go to two-character codes, where the first one is 128-255 and
the second is 1-255, that brings the number of codes up to 32640. It
turns out that, with my current algorithm, it doesn't buy anything, and
it also violates the painless factor by giving people a huge list of
words they have to pick from when writing kernel messages. Also, it
turns out that there are only just over 500 different words which would
save more than 2 bytes by being encoded.
I need to get a LOT more clever about this before it's worth doing.
I'll try the no-word-boundaries approach. And we'll see how interested
other people are in having to DEAL with it.
BTW, should I faint or something because THE Alan Cox responded to my
first post to lkml? :)
You hate it when people say that sort of thing, don't you. :)
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: Painlessly shrinking kernel messages (Re: kernel support for non-english user messages)
2003-04-10 23:58 ` Timothy Miller
@ 2003-04-11 1:14 ` Alan Cox
0 siblings, 0 replies; 6+ messages in thread
From: Alan Cox @ 2003-04-11 1:14 UTC (permalink / raw)
To: Timothy Miller; +Cc: Linux Kernel Mailing List
On Fri, 2003-04-11 at 00:58, Timothy Miller wrote:
> My google search for '5pack' didn't come up with anything relevant.
> Things that come to mind include converting to a character set which
Its a thing from the old 8bit gaming world. You code in 5bit chunks
with a leading length marker. 5bits is enough for a-z and some bits
of punctuation, plus capital implying space and 'escape' for an 8bit
sequence block.
Gets you a bit under 40% compression with real life data and takes about
200 bytes to decode
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: Painlessly shrinking kernel messages (Re: kernel support for non-english user messages)
2003-04-10 21:42 ` Alan Cox
2003-04-10 23:58 ` Timothy Miller
@ 2003-04-11 23:02 ` Timothy Miller
2003-04-11 23:03 ` David Lang
1 sibling, 1 reply; 6+ messages in thread
From: Timothy Miller @ 2003-04-11 23:02 UTC (permalink / raw)
To: linux-kernel
This time, my text compression program ignores word boundaries. Below
are the results.
"shrink" is the size reduction due to replacing the given string with a
one-byte code. (accounts for space required for codes and the decode
string plus nul).
"count" is how many times the string appears.
Strings are delimited by [brackets].
At the end, the "original size" accounts for null termination for all
strings.
The "new size" is the size of the text with the codes inserted plus the
amount of space required to store what the codes map to.
Nothing accounts for the space wasted due to the strings being 32-bit
aligned or whatever various architectures might require. This is just
an estimate.
What the heck is "[^_]" ? Is it a machine instruction that 'strings' is
mistaking for a string? I grep'd through the kernel source and didn't
find it.
The results are that the kernel messages are reduced from 232690 to
154365, which is a savings of 33%. Not bad, but it's probably still not
worth it yet; the pain is still greater than the benefit.
Here's the list.
shrink count word
------ ----- ------------------------------
3166 1057 [[^_]]
2888 290 [ CONTROLLER]
1963 656 [UWVS]
1771 198 [ TECHNOLOG]
1758 881 [ IN]
1678 561 [TION]
1438 721 [ CO]
1268 255 [SYSTEM]
1126 189 [ BRIDGE]
1116 560 [PCI]
994 167 [JOURNAL]
986 53 [F 56K DATA/FAX/VOICE]
964 323 [ ]
932 468 [TER]
870 110 [ ETHERNET]
824 414 [ING]
772 388 [DEV]
745 84 [ELECTRONIC]
742 249 [MODE]
738 371 [POR]
706 355 [ED ]
706 60 [KERNEL: ASSER]
643 216 [LOCK]
592 298 [PRO]
586 295 [ENT]
554 279 [S: ]
516 260 [REA]
510 257 [IDE]
496 167 [ABLE]
478 81 [TRANSAC]
474 239 [AGE]
474 239 [NET]
466 235 [ON ]
464 234 [EXT]
458 116 [CACHE]
424 214 [<3>]
418 106 [REGIS]
416 210 [RES]
415 140 [ LTD]
415 140 [FAIL]
414 209 [ %D]
409 138 [DATA]
394 199 [COM]
394 133 [NOT ]
390 197 [ARD]
388 196 [E_R]
386 98 [ ADAP]
378 191 [FOR]
374 95 [MICRO]
373 16 [ ROAD RUNNER FRAME GRABBER]
370 187 [SET]
354 179 [VER]
350 177 [ELE]
348 176 [ALL]
326 165 [100]
318 161 [AND]
312 158 [_IN]
307 104 [TIME]
306 155 [<6>]
303 62 [AIC-78]
298 26 [SEMICONDUCTOR]
294 149 [SER]
288 146 [AL ]
288 59 [MEMORY]
283 96 [%02X]
282 143 [LIN]
280 142 [ATE]
273 56 [BUFFER]
272 138 [TCP]
270 137 [ %S]
262 67 [AUDIO]
258 66 [82801]
256 130 [ECT]
254 65 [ HOST]
254 129 [PER]
253 16 [08X %08X %08X %08X]
250 127 [TO ]
248 126 [OUN]
248 126 [PHT]
246 63 [C(%D)]
246 32 [SMARTDAA)]
244 42 [CANNOT ]
244 124 [ORT]
238 121 [9D$]
238 121 [CHI]
238 49 [F 56K ]
238 21 [FIBRE CHANNEL]
236 120 [T_R]
234 119 [BIT]
233 48 [MODULE]
232 118 [CON]
232 118 [NIC]
230 117 [%LU]
230 59 [POWER]
230 117 [REE]
226 58 [ERROR]
215 32 [%U.%U.%U]
215 32 [GRAPHICS]
214 109 [EST]
214 73 [NULL]
214 73 [SOCK]
212 108 [ALI]
212 108 [TRI]
210 54 [WRITE]
208 106 [%08]
202 103 [BUS]
202 69 [GDT ]
202 52 [GROUP]
200 102 [<4>]
200 102 [D$ ]
200 102 [OUT]
196 67 [FILE]
196 67 [GET_]
193 16 [ MIL-STD-1553 ]
190 25 [ [APOLLO ]
184 94 [ESS]
182 47 [CHECK]
182 93 [D$$]
176 90 [STA]
174 89 [AGP]
174 45 [CYBER]
173 8 [ABCDEFGHIJKLMNOPQRSTUVWXYZ]
169 58 [TECH]
168 86 [ RE]
166 85 [82C]
160 82 [000]
160 82 [ACC]
160 82 [END]
original size=232690, new size=154365
You may notice a goof my program makes. The string "%U.%U.%U" appears
in there, when what is actually found in the kernel is "%U.%U.%U.%U".
That's because there are more of the former, but only because the
program didn't realize that pairs of the former overlap with each other.
I'll see about fixing that later, but really, I think we'll need to
hand-pick strings anyhow.
Lastly, I decided to allow two-character strings to be encoded and upped
the number of codes to 184. This is the result:
original size=232690, new size=109133
That's a reduction of 53%.
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: Painlessly shrinking kernel messages (Re: kernel support for non-english user messages)
2003-04-11 23:02 ` Timothy Miller
@ 2003-04-11 23:03 ` David Lang
0 siblings, 0 replies; 6+ messages in thread
From: David Lang @ 2003-04-11 23:03 UTC (permalink / raw)
To: Timothy Miller; +Cc: linux-kernel
this is definantly something that wouldn't make sense to do manually, but
if somone can figure out how to do this as part of the build process the
80K saved can't hurt.
David Lang
On Fri, 11 Apr 2003, Timothy Miller wrote:
> Date: Fri, 11 Apr 2003 19:02:33 -0400
> From: Timothy Miller <miller@techsource.com>
> To: linux-kernel@vger.kernel.org
> Subject: Re: Painlessly shrinking kernel messages (Re: kernel support for
> non-english user messages)
>
> This time, my text compression program ignores word boundaries. Below
> are the results.
>
> "shrink" is the size reduction due to replacing the given string with a
> one-byte code. (accounts for space required for codes and the decode
> string plus nul).
> "count" is how many times the string appears.
> Strings are delimited by [brackets].
>
> At the end, the "original size" accounts for null termination for all
> strings.
> The "new size" is the size of the text with the codes inserted plus the
> amount of space required to store what the codes map to.
>
> Nothing accounts for the space wasted due to the strings being 32-bit
> aligned or whatever various architectures might require. This is just
> an estimate.
>
> What the heck is "[^_]" ? Is it a machine instruction that 'strings' is
> mistaking for a string? I grep'd through the kernel source and didn't
> find it.
>
>
> The results are that the kernel messages are reduced from 232690 to
> 154365, which is a savings of 33%. Not bad, but it's probably still not
> worth it yet; the pain is still greater than the benefit.
>
>
> Here's the list.
>
> shrink count word
> ------ ----- ------------------------------
> 3166 1057 [[^_]]
> 2888 290 [ CONTROLLER]
> 1963 656 [UWVS]
> 1771 198 [ TECHNOLOG]
> 1758 881 [ IN]
> 1678 561 [TION]
> 1438 721 [ CO]
> 1268 255 [SYSTEM]
> 1126 189 [ BRIDGE]
> 1116 560 [PCI]
> 994 167 [JOURNAL]
> 986 53 [F 56K DATA/FAX/VOICE]
> 964 323 [ ]
> 932 468 [TER]
> 870 110 [ ETHERNET]
> 824 414 [ING]
> 772 388 [DEV]
> 745 84 [ELECTRONIC]
> 742 249 [MODE]
> 738 371 [POR]
> 706 355 [ED ]
> 706 60 [KERNEL: ASSER]
> 643 216 [LOCK]
> 592 298 [PRO]
> 586 295 [ENT]
> 554 279 [S: ]
> 516 260 [REA]
> 510 257 [IDE]
> 496 167 [ABLE]
> 478 81 [TRANSAC]
> 474 239 [AGE]
> 474 239 [NET]
> 466 235 [ON ]
> 464 234 [EXT]
> 458 116 [CACHE]
> 424 214 [<3>]
> 418 106 [REGIS]
> 416 210 [RES]
> 415 140 [ LTD]
> 415 140 [FAIL]
> 414 209 [ %D]
> 409 138 [DATA]
> 394 199 [COM]
> 394 133 [NOT ]
> 390 197 [ARD]
> 388 196 [E_R]
> 386 98 [ ADAP]
> 378 191 [FOR]
> 374 95 [MICRO]
> 373 16 [ ROAD RUNNER FRAME GRABBER]
> 370 187 [SET]
> 354 179 [VER]
> 350 177 [ELE]
> 348 176 [ALL]
> 326 165 [100]
> 318 161 [AND]
> 312 158 [_IN]
> 307 104 [TIME]
> 306 155 [<6>]
> 303 62 [AIC-78]
> 298 26 [SEMICONDUCTOR]
> 294 149 [SER]
> 288 146 [AL ]
> 288 59 [MEMORY]
> 283 96 [%02X]
> 282 143 [LIN]
> 280 142 [ATE]
> 273 56 [BUFFER]
> 272 138 [TCP]
> 270 137 [ %S]
> 262 67 [AUDIO]
> 258 66 [82801]
> 256 130 [ECT]
> 254 65 [ HOST]
> 254 129 [PER]
> 253 16 [08X %08X %08X %08X]
> 250 127 [TO ]
> 248 126 [OUN]
> 248 126 [PHT]
> 246 63 [C(%D)]
> 246 32 [SMARTDAA)]
> 244 42 [CANNOT ]
> 244 124 [ORT]
> 238 121 [9D$]
> 238 121 [CHI]
> 238 49 [F 56K ]
> 238 21 [FIBRE CHANNEL]
> 236 120 [T_R]
> 234 119 [BIT]
> 233 48 [MODULE]
> 232 118 [CON]
> 232 118 [NIC]
> 230 117 [%LU]
> 230 59 [POWER]
> 230 117 [REE]
> 226 58 [ERROR]
> 215 32 [%U.%U.%U]
> 215 32 [GRAPHICS]
> 214 109 [EST]
> 214 73 [NULL]
> 214 73 [SOCK]
> 212 108 [ALI]
> 212 108 [TRI]
> 210 54 [WRITE]
> 208 106 [%08]
> 202 103 [BUS]
> 202 69 [GDT ]
> 202 52 [GROUP]
> 200 102 [<4>]
> 200 102 [D$ ]
> 200 102 [OUT]
> 196 67 [FILE]
> 196 67 [GET_]
> 193 16 [ MIL-STD-1553 ]
> 190 25 [ [APOLLO ]
> 184 94 [ESS]
> 182 47 [CHECK]
> 182 93 [D$$]
> 176 90 [STA]
> 174 89 [AGP]
> 174 45 [CYBER]
> 173 8 [ABCDEFGHIJKLMNOPQRSTUVWXYZ]
> 169 58 [TECH]
> 168 86 [ RE]
> 166 85 [82C]
> 160 82 [000]
> 160 82 [ACC]
> 160 82 [END]
> original size=232690, new size=154365
>
>
> You may notice a goof my program makes. The string "%U.%U.%U" appears
> in there, when what is actually found in the kernel is "%U.%U.%U.%U".
> That's because there are more of the former, but only because the
> program didn't realize that pairs of the former overlap with each other.
> I'll see about fixing that later, but really, I think we'll need to
> hand-pick strings anyhow.
>
> Lastly, I decided to allow two-character strings to be encoded and upped
> the number of codes to 184. This is the result:
> original size=232690, new size=109133
> That's a reduction of 53%.
>
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2003-04-11 22:55 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2003-04-10 22:08 Painlessly shrinking kernel messages (Re: kernel support for non-english user messages) Timothy Miller
2003-04-10 21:42 ` Alan Cox
2003-04-10 23:58 ` Timothy Miller
2003-04-11 1:14 ` Alan Cox
2003-04-11 23:02 ` Timothy Miller
2003-04-11 23:03 ` David Lang
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox