From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S264266AbTDJXeL (for ); Thu, 10 Apr 2003 19:34:11 -0400
Received: (majordomo@vger.kernel.org) by vger.kernel.org
	id S264265AbTDJXeJ (for ); Thu, 10 Apr 2003 19:34:09 -0400
Received: from watch.techsource.com ([209.208.48.130]:41974 "EHLO techsource.com")
	by vger.kernel.org with ESMTP id S264264AbTDJXeE (for );
	Thu, 10 Apr 2003 19:34:04 -0400
Message-ID: <3E960536.5010900@techsource.com>
Date: Thu, 10 Apr 2003 19:58:46 -0400
From: Timothy Miller
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.1) Gecko/20020823 Netscape/7.0
X-Accept-Language: en-us, en
MIME-Version: 1.0
To: Alan Cox
CC: Linux Kernel Mailing List
Subject: Re: Painlessly shrinking kernel messages (Re: kernel support for non-english user messages)
References: <3E95EB6D.4020004@techsource.com> <1050010963.12494.132.camel@dhcp22.swansea.linux.org.uk>
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org

Alan Cox wrote:

>Not a totally crazy idea. You could also do 5pack and some of the other
>string tricks people have used in time. You also dont need to do word
>boundaries.
>

My Google search for '5pack' didn't come up with anything relevant.
Things that come to mind include converting to a character set which
requires fewer than 8 bits per character and then packing those
characters into bytes. Or perhaps making a list of every quintuplet of
characters that ever occurs and assigning each one a code.

I initially considered the idea of ignoring word boundaries. I rejected
it because part of the "painless" factor would be that it could be done
manually without a lot of thinking. But I will run a test which ignores
word boundaries and see what kinds of results I get.

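To make the "fewer than 8 bits per character" guess concrete, here is a minimal sketch of packing text into 5-bit symbols. This is only one reading of what "5pack" might mean, not a known definition of it. The 32-symbol `ALPHABET` and the names `pack5` and `unpack5` are assumptions made up for illustration; any real message set would need its own alphabet and an escape for characters outside it.

```python
# Hypothetical 32-symbol alphabet (26 letters plus 6 others), so every
# symbol fits in 5 bits. Chosen for illustration only.
ALPHABET = "abcdefghijklmnopqrstuvwxyz .,:%\n"
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def pack5(text):
    """Pack text (alphabet characters only) into bytes, 5 bits per char."""
    bits = 0       # bit accumulator
    nbits = 0      # number of valid bits currently in the accumulator
    out = bytearray()
    for ch in text:
        bits = (bits << 5) | INDEX[ch]
        nbits += 5
        while nbits >= 8:
            nbits -= 8
            out.append((bits >> nbits) & 0xFF)
        bits &= (1 << nbits) - 1   # drop the bits already emitted
    if nbits:
        # Pad the final partial byte with zero bits on the right.
        out.append((bits << (8 - nbits)) & 0xFF)
    return bytes(out)


def unpack5(data, length):
    """Inverse of pack5; length is the original character count."""
    bits = 0
    nbits = 0
    chars = []
    for byte in data:
        bits = (bits << 8) | byte
        nbits += 8
        while nbits >= 5 and len(chars) < length:
            nbits -= 5
            chars.append(ALPHABET[(bits >> nbits) & 0x1F])
        bits &= (1 << nbits) - 1
    return "".join(chars)


msg = "kernel: out of memory"
packed = pack5(msg)
print(len(msg), "chars ->", len(packed), "bytes")
```

The length has to be stored separately because the last byte may carry padding bits. The 21-character example message packs into 14 bytes, but upper-case letters, digits and most printk format specifiers are not representable in this toy alphabet.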
Of course, if we want to do something that involves some post-compile
magic or whatnot, then we can do all sorts of gnarly tricks. But that
doesn't differ much (in complexity) from the idea someone else
mentioned, which was to completely remove all messages from the kernel
by magically converting them to numbers or hashes and then decoding
them outside of the kernel.

A valid point was raised that boot messages need to be handled properly
by the kernel before any services are up. Separating the boot messages
from the non-boot messages would require manual intervention that goes
against the painless factor, and is the pie slice containing only
non-boot messages large enough that it's worth it? There seem to be
quite a lot of boot messages that could benefit from some sort of
completely-in-kernel compression.

>
>For embedded at least this is far from ludicrous as a concept. The
>tricky piece for all of these is working out how to grab each printk
>format string and do things to it. That lets you do compression,
>removal, internationalisation, cataloguing ..
>
>

Hmmm...

- Make gcc produce assembler output
- Find all calls to printk
- Cross-reference those against all static strings
- Compress the strings
- Run through gas, etc.

The problem with this approach is that we have to deal with different
architectures. The plus is that any unsupported arch just doesn't run
the compression tool and uses regular printk.

How about:

- Use perl or yacc or something to parse the kernel source for strings
- Compress them
- Make the substitutions inline in the source as part of the
  pre-processing stage
- Compile

Heck, we could just embed this functionality directly into the
preprocessor. Unfortunately, this one is somewhat beyond my current
knowledge of the tools that would make it convenient.

Just as a note, I worked on my test program to make it more accurate.
For 128 codes, the actual reduction is 38946 bytes.
For this algorithm, I look to see if any of the shorter words are
contained in any of the longer ones; in the case where the shorter
word's substitution would shrink the kernel more than the longer
word's, I add the longer word's count to the shorter word's and delete
the longer word.

If we were to outlaw some of the lower characters (those below 128),
such as most non-printing characters and all lower-case, then that
brings us up to having 184 codes to work with. That lets us save 42692
bytes.

If we were to go to two-character codes, where the first byte is
128-255 and the second is 1-255, that brings the number of codes up to
32640. It turns out that, with my current algorithm, it doesn't buy
anything, and it also violates the painless factor by giving people a
huge list of words they have to pick from when writing kernel messages.
Also, it turns out that there are only just over 500 different words
which would save more than 2 bytes by being encoded.

I need to get a LOT more clever about this before it's worth doing.
I'll try the no-word-boundaries approach. And we'll see how interested
other people are in having to DEAL with it.

BTW, should I faint or something because THE Alan Cox responded to my
first post to lkml? :)

You hate it when people say that sort of thing, don't you. :)
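
For reference, the word-selection step and the code-space arithmetic described above can be sketched roughly as follows. This is not the actual test program, which is not shown in this message. The cost model in `saving` (each occurrence shrinks to one code byte, minus one dictionary entry of the word plus a terminator) is an assumption, as are the names `saving`, `pick_codes` and `two_byte_code_space`.

```python
import re
from collections import Counter


def saving(word, count, code_len=1):
    """Estimated bytes saved by replacing every occurrence of word with a
    code of code_len bytes. Assumed cost model: the dictionary stores the
    word once, plus a one-byte terminator."""
    return count * (len(word) - code_len) - (len(word) + 1)


def pick_codes(messages, n_codes=128):
    """Return up to n_codes words, ranked by estimated saving."""
    counts = Counter(w for m in messages for w in re.findall(r"[A-Za-z]+", m))
    # Folding heuristic from the text: if a shorter word is contained in a
    # longer one and would save more, credit the longer word's count to
    # the shorter word and delete the longer word.
    for short in sorted(counts, key=len):
        if short not in counts:
            continue  # already folded into an even shorter word
        longer = [w for w in counts if len(w) > len(short) and short in w]
        for long_word in longer:
            if saving(short, counts[short]) > saving(long_word, counts[long_word]):
                counts[short] += counts.pop(long_word)
    ranked = sorted(counts, key=lambda w: saving(w, counts[w]), reverse=True)
    return [w for w in ranked if saving(w, counts[w]) > 0][:n_codes]


def two_byte_code_space():
    """First byte 128-255, second byte 1-255, as described in the text."""
    return len(range(128, 256)) * len(range(1, 256))


msgs = ["error: device not ready"] * 10
print(pick_codes(msgs, n_codes=2))
print(two_byte_code_space())
```

The one-byte scheme uses the 128 byte values 128-255 as codes; reclaiming 56 of the low values gives the 184 figure, and the two-byte scheme gives 128 * 255 = 32640. Note that the folding step only makes sense if the encoder later substitutes inside longer words, which this sketch does not implement.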