* [PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts
@ 2005-08-30 16:15 Knut Petersen
2005-08-30 16:18 ` Geert Uytterhoeven
2005-08-31 1:14 ` Antonino A. Daplas
0 siblings, 2 replies; 16+ messages in thread
From: Knut Petersen @ 2005-08-30 16:15 UTC (permalink / raw)
To: Andrew Morton
Cc: Antonino A. Daplas, linux-fbdev-devel, linux-kernel, Jochen Hein
This trivial patch gives a performance boost to the framebuffer console
Constructing the bitmaps that are given to the bitblit functions of the
framebuffer
drivers is time consuming. Here we avoide a call to the slow
fb_pad_aligned_buffer().
The patch replaces that call with a simple but much more efficient
bytewise copy.
The kernel spends a significant time at this place if you use 8x* fonts.
Every
pixel displayed on your screen is prepared here.
Some benchmark results:
Displaying a file of 2000 lines with 160 characters each takes 889 ms
system
time using cyblafb on my system (I´m using a 1280x1024 video mode,
resulting in a 160x64 character console)
Displaying the same file with the enclosed patch applied to 2.6.13 only
takes
760 ms system time, saving 129 ms or 14.5%.
Font widths other than 8 are not affected.
The advantage and correctness of this patch should be obvious.
Signed-off-by: Knut Petersen <Knut_Petersen@t-online.de>
diff -uprN -X linux/Documentation/dontdiff -x '*.bak' -x '*.ctx' linuxorig/drivers/video/console/bitblit.c linux/drivers/video/console/bitblit.c
--- linuxorig/drivers/video/console/bitblit.c 2005-08-29 01:41:01.000000000 +0200
+++ linux/drivers/video/console/bitblit.c 2005-08-30 17:19:57.000000000 +0200
@@ -114,7 +114,7 @@ static void bit_putcs(struct vc_data *vc
unsigned int scan_align = info->pixmap.scan_align - 1;
unsigned int buf_align = info->pixmap.buf_align - 1;
unsigned int shift_low = 0, mod = vc->vc_font.width % 8;
- unsigned int shift_high = 8, pitch, cnt, size, k;
+ unsigned int shift_high = 8, pitch, cnt, size, i, k;
unsigned int idx = vc->vc_font.width >> 3;
unsigned int attribute = get_attribute(info, scr_readw(s));
struct fb_image image;
@@ -175,7 +175,11 @@ static void bit_putcs(struct vc_data *vc
src = buf;
}
- fb_pad_aligned_buffer(dst, pitch, src, idx, image.height);
+ if (idx == 1)
+ for(i=0; i < image.height; i++)
+ dst[pitch*i] = src[i];
+ else
+ fb_pad_aligned_buffer(dst, pitch, src, idx, image.height);
dst += width;
}
}
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts
2005-08-30 16:15 [PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts Knut Petersen
@ 2005-08-30 16:18 ` Geert Uytterhoeven
2005-08-30 17:58 ` Knut Petersen
2005-08-31 1:14 ` Antonino A. Daplas
1 sibling, 1 reply; 16+ messages in thread
From: Geert Uytterhoeven @ 2005-08-30 16:18 UTC (permalink / raw)
To: Linux Frame Buffer Device Development
Cc: Andrew Morton, Antonino A. Daplas, Linux Kernel Development,
Jochen Hein
On Tue, 30 Aug 2005, Knut Petersen wrote:
> linux/drivers/video/console/bitblit.c
> --- linuxorig/drivers/video/console/bitblit.c 2005-08-29 01:41:01.000000000
> +0200
> +++ linux/drivers/video/console/bitblit.c 2005-08-30 17:19:57.000000000
> +0200
> @@ -114,7 +114,7 @@ static void bit_putcs(struct vc_data *vc
> unsigned int scan_align = info->pixmap.scan_align - 1;
> unsigned int buf_align = info->pixmap.buf_align - 1;
> unsigned int shift_low = 0, mod = vc->vc_font.width % 8;
> - unsigned int shift_high = 8, pitch, cnt, size, k;
> + unsigned int shift_high = 8, pitch, cnt, size, i, k;
> unsigned int idx = vc->vc_font.width >> 3;
> unsigned int attribute = get_attribute(info, scr_readw(s));
> struct fb_image image;
> @@ -175,7 +175,11 @@ static void bit_putcs(struct vc_data *vc
> src = buf;
> }
>
> - fb_pad_aligned_buffer(dst, pitch, src, idx,
> image.height);
> + if (idx == 1)
> + for(i=0; i < image.height; i++)
> + dst[pitch*i] = src[i];
^^^^^^^
> + else
> + fb_pad_aligned_buffer(dst, pitch, src,
> idx, image.height);
> dst += width;
> }
> }
Probably you can make it even faster by avoiding the multiplication, like
unsigned int offset = 0;
for (i = 0; i < image.height; i++) {
dst[offset] = src[i];
offset += pitch;
}
Gr{oetje,eeting}s,
Geert
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org
In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
-------------------------------------------------------
SF.Net email is Sponsored by the Better Software Conference & EXPO
September 19-22, 2005 * San Francisco, CA * Development Lifecycle Practices
Agile & Plan-Driven Development * Managing Projects & Teams * Testing & QA
Security * Process Improvement & Measurement * http://www.sqe.com/bsce5sf
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts
2005-08-30 16:18 ` Geert Uytterhoeven
@ 2005-08-30 17:58 ` Knut Petersen
2005-08-30 19:13 ` Roman Zippel
` (2 more replies)
0 siblings, 3 replies; 16+ messages in thread
From: Knut Petersen @ 2005-08-30 17:58 UTC (permalink / raw)
To: linux-fbdev-devel
Cc: Andrew Morton, Antonino A. Daplas, Linux Kernel Development,
Jochen Hein
>Probably you can make it even faster by avoiding the multiplication, like
>
> unsigned int offset = 0;
> for (i = 0; i < image.height; i++) {
> dst[offset] = src[i];
> offset += pitch;
> }
>
More than two decades ago I learned to avoid mul and imul. Use shifts,
add and lea instead,
that was the credo those days. The name of the game was CP/M 80/86, a86,
d86 and ddt ;-)
But let´s get serious again.
Your proposed change of the patch results in a 21 ms performance
decrease on my system.
Yes, I do know that this is hard to believe. I tested a similar
variation before, and the results
were even worse.
Avoiding mul is a good idea in assembly language today, but often it is
better to write a
multiplication with the loop counter in C and not to introduce an extra
variable instead. The
compiler will optimize the code and it´s easier for gcc without that
extra variable.
More interesting would be the question what should be done for idx==2 or
idx==3. Probably
fb_pad_aligned_buffer() is also slower for those cases. But does anybody
use such fonts?
cu,
knut
-------------------------------------------------------
SF.Net email is Sponsored by the Better Software Conference & EXPO
September 19-22, 2005 * San Francisco, CA * Development Lifecycle Practices
Agile & Plan-Driven Development * Managing Projects & Teams * Testing & QA
Security * Process Improvement & Measurement * http://www.sqe.com/bsce5sf
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts
2005-08-30 17:58 ` Knut Petersen
@ 2005-08-30 19:13 ` Roman Zippel
2005-08-30 22:26 ` Knut Petersen
2005-08-30 19:59 ` Geert Uytterhoeven
2005-08-31 1:14 ` Antonino A. Daplas
2 siblings, 1 reply; 16+ messages in thread
From: Roman Zippel @ 2005-08-30 19:13 UTC (permalink / raw)
To: Knut Petersen
Cc: linux-fbdev-devel, Andrew Morton, Antonino A. Daplas,
Linux Kernel Development, Jochen Hein
[-- Attachment #1: Type: TEXT/PLAIN, Size: 1444 bytes --]
Hi,
On Tue, 30 Aug 2005, Knut Petersen wrote:
> > Probably you can make it even faster by avoiding the multiplication, like
> >
> > unsigned int offset = 0;
> > for (i = 0; i < image.height; i++) {
> > dst[offset] = src[i];
> > offset += pitch;
> > }
> >
>
> More than two decades ago I learned to avoid mul and imul. Use shifts, add and
> lea instead,
> that was the credo those days. The name of the game was CP/M 80/86, a86, d86
> and ddt ;-)
>
> But let´s get serious again.
>
> Your proposed change of the patch results in a 21 ms performance decrease on
> my system.
> Yes, I do know that this is hard to believe. I tested a similar variation
> before, and the results
> were even worse.
Could you try the patch below, for a few extra cycles you might want to
make it an inline function.
bye, Roman
Index: linux-2.6/drivers/video/fbmem.c
===================================================================
--- linux-2.6.orig/drivers/video/fbmem.c 2005-08-30 01:55:18.000000000 +0200
+++ linux-2.6/drivers/video/fbmem.c 2005-08-30 21:10:25.705462837 +0200
@@ -82,11 +82,11 @@ void fb_pad_aligned_buffer(u8 *dst, u32
{
int i, j;
+ d_pitch -= s_pitch;
for (i = height; i--; ) {
/* s_pitch is a few bytes at the most, memcpy is suboptimal */
for (j = 0; j < s_pitch; j++)
- dst[j] = src[j];
- src += s_pitch;
+ *dst++ = *src++;
dst += d_pitch;
}
}
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Linux-fbdev-devel] [PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts
2005-08-30 17:58 ` Knut Petersen
2005-08-30 19:13 ` Roman Zippel
@ 2005-08-30 19:59 ` Geert Uytterhoeven
2005-08-31 1:14 ` Antonino A. Daplas
2 siblings, 0 replies; 16+ messages in thread
From: Geert Uytterhoeven @ 2005-08-30 19:59 UTC (permalink / raw)
To: Linux Frame Buffer Device Development
Cc: Andrew Morton, Antonino A. Daplas, Linux Kernel Development,
Jochen Hein
[-- Attachment #1: Type: TEXT/PLAIN, Size: 1897 bytes --]
On Tue, 30 Aug 2005, Knut Petersen wrote:
> > Probably you can make it even faster by avoiding the multiplication, like
> >
> > unsigned int offset = 0;
> > for (i = 0; i < image.height; i++) {
> > dst[offset] = src[i];
> > offset += pitch;
> > }
>
> More than two decades ago I learned to avoid mul and imul. Use shifts, add and
> lea instead,
> that was the credo those days. The name of the game was CP/M 80/86, a86, d86
> and ddt ;-)
>
> But let�s get serious again.
On modern CPUs, a multiplication indeed takes 1 cycle, just like an addition.
But on older CPUs (still supported by Linux), this is not true.
> Your proposed change of the patch results in a 21 ms performance decrease on
> my system.
> Yes, I do know that this is hard to believe. I tested a similar variation
> before, and the results
> were even worse.
>
> Avoiding mul is a good idea in assembly language today, but often it is better
> to write a
> multiplication with the loop counter in C and not to introduce an extra
> variable instead. The
> compiler will optimize the code and it�s easier for gcc without that extra
> variable.
But you are right. On actual inspection of the generated assembly code for a
very simple test case, it turns out both (m68k-linux-)gcc 2.95.2 and 3.3.3 are
smart enough to convert the multiplication to an addition...
And interestingly, if I avoid the multiplication explicitly, gcc 2.95.2 still
generates the same code, but 3.3.3 adds a few extra instructions to
save/restore local vars. So this probably explains why it turned out to be
slower for you. Ugh...
Gr{oetje,eeting}s,
Geert
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org
In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts
2005-08-30 19:13 ` Roman Zippel
@ 2005-08-30 22:26 ` Knut Petersen
2005-08-31 0:51 ` [Linux-fbdev-devel] " Roman Zippel
0 siblings, 1 reply; 16+ messages in thread
From: Knut Petersen @ 2005-08-30 22:26 UTC (permalink / raw)
To: Roman Zippel
Cc: linux-fbdev-devel, Andrew Morton, Antonino A. Daplas,
Linux Kernel Development, Jochen Hein
Hi Roman,
>Could you try the patch below, for a few extra cycles you might want to
>make it an inline function.
>
>
No, it does not help. If there is any difference, it is too small to be
measured on
my system ... and my system does run at 1000 Hz.
After 2.6.12 fb_pad_aligned_buffer() was changed to use memcpy() instead
of a
bytewise copy. That slowed things down a lot, some weeks ago that was
reverted.
fb_pad_aligned _buffer() isn´t that slow, it´s just an external function
to be called
and that means a lot of unnecessary code.
How could I make it an inline function? It is used in console/bitblit.c,
nvidia/nvidia.c,
riva/fbdev.c and softcursor.c.
cu,
Knut
-------------------------------------------------------
SF.Net email is Sponsored by the Better Software Conference & EXPO
September 19-22, 2005 * San Francisco, CA * Development Lifecycle Practices
Agile & Plan-Driven Development * Managing Projects & Teams * Testing & QA
Security * Process Improvement & Measurement * http://www.sqe.com/bsce5sf
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Linux-fbdev-devel] [PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts
2005-08-30 22:26 ` Knut Petersen
@ 2005-08-31 0:51 ` Roman Zippel
2005-08-31 6:42 ` Antonino A. Daplas
2005-08-31 12:46 ` Knut Petersen
0 siblings, 2 replies; 16+ messages in thread
From: Roman Zippel @ 2005-08-31 0:51 UTC (permalink / raw)
To: Knut Petersen
Cc: linux-fbdev-devel, Andrew Morton, Antonino A. Daplas,
Linux Kernel Development, Jochen Hein
Hi,
On Wed, 31 Aug 2005, Knut Petersen wrote:
> How could I make it an inline function? It is used in console/bitblit.c,
> nvidia/nvidia.c,
> riva/fbdev.c and softcursor.c.
Something like below, which has the advantange that there is still only
one implementation of the function and if it's still slower, we really
need to check the compiler.
bye, Roman
drivers/video/console/bitblit.c | 5 ++++-
drivers/video/fbmem.c | 10 +---------
include/linux/fb.h | 13 +++++++++++++
3 files changed, 18 insertions(+), 10 deletions(-)
Index: linux-2.6/drivers/video/fbmem.c
===================================================================
--- linux-2.6.orig/drivers/video/fbmem.c 2005-08-30 21:17:44.000000000 +0200
+++ linux-2.6/drivers/video/fbmem.c 2005-08-31 01:20:37.000000000 +0200
@@ -80,15 +80,7 @@ EXPORT_SYMBOL(fb_get_color_depth);
*/
void fb_pad_aligned_buffer(u8 *dst, u32 d_pitch, u8 *src, u32 s_pitch, u32 height)
{
- int i, j;
-
- for (i = height; i--; ) {
- /* s_pitch is a few bytes at the most, memcpy is suboptimal */
- for (j = 0; j < s_pitch; j++)
- dst[j] = src[j];
- src += s_pitch;
- dst += d_pitch;
- }
+ __fb_pad_aligned_buffer(dst, d_pitch, src, s_pitch, height);
}
EXPORT_SYMBOL(fb_pad_aligned_buffer);
Index: linux-2.6/drivers/video/console/bitblit.c
===================================================================
--- linux-2.6.orig/drivers/video/console/bitblit.c 2005-08-30 01:55:20.000000000 +0200
+++ linux-2.6/drivers/video/console/bitblit.c 2005-08-31 01:25:30.000000000 +0200
@@ -175,7 +175,10 @@ static void bit_putcs(struct vc_data *vc
src = buf;
}
- fb_pad_aligned_buffer(dst, pitch, src, idx, image.height);
+ if (likely(idx == 1))
+ __fb_pad_aligned_buffer(dst, pitch, src, 1, image.height);
+ else
+ fb_pad_aligned_buffer(dst, pitch, src, idx, image.height);
dst += width;
}
}
Index: linux-2.6/include/linux/fb.h
===================================================================
--- linux-2.6.orig/include/linux/fb.h 2005-08-30 01:56:29.000000000 +0200
+++ linux-2.6/include/linux/fb.h 2005-08-31 01:21:04.000000000 +0200
@@ -824,6 +824,19 @@ extern int fb_get_color_depth(struct fb_
extern int fb_get_options(char *name, char **option);
extern int fb_new_modelist(struct fb_info *info);
+static inline void __fb_pad_aligned_buffer(u8 *dst, u32 d_pitch, u8 *src, u32 s_pitch, u32 height)
+{
+ int i, j;
+
+ d_pitch -= s_pitch;
+ for (i = height; i--; ) {
+ /* s_pitch is a few bytes at the most, memcpy is suboptimal */
+ for (j = 0; j < s_pitch; j++)
+ *dst++ = *src++;
+ dst += d_pitch;
+ }
+}
+
extern struct fb_info *registered_fb[FB_MAX];
extern int num_registered_fb;
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts
2005-08-30 16:15 [PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts Knut Petersen
2005-08-30 16:18 ` Geert Uytterhoeven
@ 2005-08-31 1:14 ` Antonino A. Daplas
1 sibling, 0 replies; 16+ messages in thread
From: Antonino A. Daplas @ 2005-08-31 1:14 UTC (permalink / raw)
To: Knut Petersen; +Cc: Andrew Morton, linux-fbdev-devel, linux-kernel, Jochen Hein
Knut Petersen wrote:
> This trivial patch gives a performance boost to the framebuffer console
>
> Constructing the bitmaps that are given to the bitblit functions of the
> framebuffer
> drivers is time consuming. Here we avoide a call to the slow
> fb_pad_aligned_buffer().
> The patch replaces that call with a simple but much more efficient
> bytewise copy.
>
> The kernel spends a significant time at this place if you use 8x* fonts.
> Every
> pixel displayed on your screen is prepared here.
>
> Some benchmark results:
>
> Displaying a file of 2000 lines with 160 characters each takes 889 ms
> system
> time using cyblafb on my system (I´m using a 1280x1024 video mode,
> resulting in a 160x64 character console)
>
> Displaying the same file with the enclosed patch applied to 2.6.13 only
> takes
> 760 ms system time, saving 129 ms or 14.5%.
Where did this 14.5% come from? Is it bit_putcs alone or is more
real world, such as 'time cat text_file'? If I'm to guess, a large
percent of the improvement is caused by the inlining of the code.
I'm not against the patch, it will benefit drivers with very fast
imageblits. However, since most drivers have imageblits done in software,
a large proportion of the processing time will go to the imageblit itself,
so I don't think you'll get that high a number (I get only a 4%
improvement, and this is in a driver with accelerated blits, and it will
probably be lower, ie, in vesafb).
Anyway, with the addition of your patch, bit_putcs has now reached an
'ugliness threshhold' for a revamp.
Tony
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Linux-fbdev-devel] [PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts
2005-08-30 17:58 ` Knut Petersen
2005-08-30 19:13 ` Roman Zippel
2005-08-30 19:59 ` Geert Uytterhoeven
@ 2005-08-31 1:14 ` Antonino A. Daplas
2 siblings, 0 replies; 16+ messages in thread
From: Antonino A. Daplas @ 2005-08-31 1:14 UTC (permalink / raw)
To: linux-fbdev-devel
Cc: Andrew Morton, Linux Kernel Development, Jochen Hein,
Knut Petersen
Knut Petersen wrote:
> fb_pad_aligned_buffer() is also slower for those cases. But does anybody
> use such fonts?
Yes, there are 16x30 fonts out there in the wild.
Tony
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Linux-fbdev-devel] [PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts
2005-08-31 0:51 ` [Linux-fbdev-devel] " Roman Zippel
@ 2005-08-31 6:42 ` Antonino A. Daplas
2005-08-31 15:49 ` Roman Zippel
2005-08-31 12:46 ` Knut Petersen
1 sibling, 1 reply; 16+ messages in thread
From: Antonino A. Daplas @ 2005-08-31 6:42 UTC (permalink / raw)
To: Roman Zippel
Cc: Knut Petersen, linux-fbdev-devel, Andrew Morton,
Linux Kernel Development, Jochen Hein
Roman Zippel wrote:
> Hi,
>
> On Wed, 31 Aug 2005, Knut Petersen wrote:
>
>> How could I make it an inline function? It is used in console/bitblit.c,
>> nvidia/nvidia.c,
>> riva/fbdev.c and softcursor.c.
>
> Something like below, which has the advantange that there is still only
> one implementation of the function and if it's still slower, we really
> need to check the compiler.
>
I do get better numbers with this, not much, but better than Knut's (5ms), and
definitely much better than the old uninlined one (100ms).
Andrew, don't get this yet. I'll incorporate this with the bit_putcs()
breakup that I'm currently doing.
Roman, okay if you have a 'Signed-off-by' line?
Tony
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts
2005-08-31 0:51 ` [Linux-fbdev-devel] " Roman Zippel
2005-08-31 6:42 ` Antonino A. Daplas
@ 2005-08-31 12:46 ` Knut Petersen
2005-08-31 17:15 ` [Linux-fbdev-devel] " Roman Zippel
1 sibling, 1 reply; 16+ messages in thread
From: Knut Petersen @ 2005-08-31 12:46 UTC (permalink / raw)
To: Andrew Morton
Cc: Roman Zippel, linux-fbdev-devel, Antonino A. Daplas,
Linux Kernel Development, Jochen Hein, Geert Uytterhoeven
>Something like below, which has the advantange that there is still only
>one implementation of the function
>
True, that´s a great advantage.
> and if it's still slower, we really need to check the compiler
>
>
Please have a look at the following patch. It takes your idea of
inlining but moves
the special cases into the macro, speeding things up for the very likely
case of
s_pitch == 1 and the less likely case of s_pitch of 2. Treating s_pitch
== 2 special
gives a still significant performance improvement of more than 10 % for
16x30
fonts.
This way also bit_putcs looks better again ...
Andrew, as this way is better than and still as fast as my first
approach I think
framebuffer-bit_putcs-optimization-for-8x.patch should be reverted and the
following patch should be applied instead.
Antonino, Roman, Geert, do you agree?
cu,
knut
diff -uprN -X linux/Documentation/dontdiff -x '*.bak' -x '*.ctx' linuxorig/drivers/video/console/bitblit.c linux/drivers/video/console/bitblit.c
--- linuxorig/drivers/video/console/bitblit.c 2005-08-29 01:41:01.000000000 +0200
+++ linux/drivers/video/console/bitblit.c 2005-08-31 10:06:22.000000000 +0200
@@ -175,7 +175,7 @@ static void bit_putcs(struct vc_data *vc
src = buf;
}
- fb_pad_aligned_buffer(dst, pitch, src, idx, image.height);
+ __fb_pad_aligned_buffer(dst, pitch, src, idx, image.height);
dst += width;
}
}
diff -uprN -X linux/Documentation/dontdiff -x '*.bak' -x '*.ctx' linuxorig/drivers/video/fbmem.c linux/drivers/video/fbmem.c
--- linuxorig/drivers/video/fbmem.c 2005-08-29 01:41:01.000000000 +0200
+++ linux/drivers/video/fbmem.c 2005-08-31 13:36:16.000000000 +0200
@@ -80,15 +80,7 @@ EXPORT_SYMBOL(fb_get_color_depth);
*/
void fb_pad_aligned_buffer(u8 *dst, u32 d_pitch, u8 *src, u32 s_pitch, u32 height)
{
- int i, j;
-
- for (i = height; i--; ) {
- /* s_pitch is a few bytes at the most, memcpy is suboptimal */
- for (j = 0; j < s_pitch; j++)
- dst[j] = src[j];
- src += s_pitch;
- dst += d_pitch;
- }
+ __fb_pad_aligned_buffer(dst, d_pitch, src, s_pitch, height);
}
EXPORT_SYMBOL(fb_pad_aligned_buffer);
diff -uprN -X linux/Documentation/dontdiff -x '*.bak' -x '*.ctx' linuxorig/include/linux/fb.h linux/include/linux/fb.h
--- linuxorig/include/linux/fb.h 2005-08-29 01:41:01.000000000 +0200
+++ linux/include/linux/fb.h 2005-08-31 12:45:08.000000000 +0200
@@ -824,6 +824,38 @@ extern int fb_get_color_depth(struct fb_
extern int fb_get_options(char *name, char **option);
extern int fb_new_modelist(struct fb_info *info);
+
+/*
+ * Don't change without testing performance of framebuffer
+ * bitblitting. Inlining is necessary for performance reasons.
+ * Although the code might not _look_ fast because of some
+ * multiplications, it really _is_ fast as it is easier for gcc
+ * to optimize well.
+ */
+
+static inline void __fb_pad_aligned_buffer(u8 *dst, u32 d_pitch, u8 *src,
+ u32 s_pitch, u32 height)
+{
+ int i, j;
+
+ if (likely(s_pitch==1))
+ for(i=0; i < height; i++)
+ dst[d_pitch*i] = src[i];
+ else if (s_pitch==2)
+ for(i=0; i < height; i++) {
+ *(u16 *)dst = ((u16 *)src)[i];
+ dst += d_pitch;
+ }
+ else {
+ d_pitch -= s_pitch;
+ for (i = height; i--; ) {
+ for (j = 0; j < s_pitch; j++)
+ *dst++ = *src++;
+ dst += d_pitch;
+ }
+ }
+}
+
extern struct fb_info *registered_fb[FB_MAX];
extern int num_registered_fb;
-------------------------------------------------------
SF.Net email is Sponsored by the Better Software Conference & EXPO
September 19-22, 2005 * San Francisco, CA * Development Lifecycle Practices
Agile & Plan-Driven Development * Managing Projects & Teams * Testing & QA
Security * Process Improvement & Measurement * http://www.sqe.com/bsce5sf
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts
2005-08-31 6:42 ` Antonino A. Daplas
@ 2005-08-31 15:49 ` Roman Zippel
0 siblings, 0 replies; 16+ messages in thread
From: Roman Zippel @ 2005-08-31 15:49 UTC (permalink / raw)
To: Antonino A. Daplas
Cc: Knut Petersen, linux-fbdev-devel, Andrew Morton,
Linux Kernel Development, Jochen Hein
Hi,
On Wed, 31 Aug 2005, Antonino A. Daplas wrote:
> Roman, okay if you have a 'Signed-off-by' line?
Okay.
bye, Roman
-------------------------------------------------------
SF.Net email is Sponsored by the Better Software Conference & EXPO
September 19-22, 2005 * San Francisco, CA * Development Lifecycle Practices
Agile & Plan-Driven Development * Managing Projects & Teams * Testing & QA
Security * Process Improvement & Measurement * http://www.sqe.com/bsce5sf
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Linux-fbdev-devel] [PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts
2005-08-31 12:46 ` Knut Petersen
@ 2005-08-31 17:15 ` Roman Zippel
2005-08-31 19:19 ` Knut Petersen
0 siblings, 1 reply; 16+ messages in thread
From: Roman Zippel @ 2005-08-31 17:15 UTC (permalink / raw)
To: Knut Petersen
Cc: Andrew Morton, linux-fbdev-devel, Antonino A. Daplas,
Linux Kernel Development, Jochen Hein, Geert Uytterhoeven
Hi,
On Wed, 31 Aug 2005, Knut Petersen wrote:
> +static inline void __fb_pad_aligned_buffer(u8 *dst, u32 d_pitch, u8 *src, +
> u32 s_pitch, u32 height)
> +{
> + int i, j;
> +
> + if (likely(s_pitch==1))
> + for(i=0; i < height; i++)
> + dst[d_pitch*i] = src[i];
> + else if (s_pitch==2)
> + for(i=0; i < height; i++) {
> + *(u16 *)dst = ((u16 *)src)[i];
> + dst += d_pitch;
> + }
> + else {
> + d_pitch -= s_pitch;
> + for (i = height; i--; ) {
> + for (j = 0; j < s_pitch; j++)
> + *dst++ = *src++;
> + dst += d_pitch;
> + }
> + }
> +}
Why did you add the multiply back?
You have now 3 slightly different variants of the same, which isn't really
an improvement. In my example I showed you how to generate the first and
last version from the same source.
If you also want to optimize for other sizes, you might want to always
inline the function, if the function call overhead is the largest part
anyway, the special case for 2 bytes might not be needed anymore.
BTW this version saves another condition:
static inline void __fb_pad_aligned_buffer(u8 *dst, u32 d_pitch, u8 *src, u32 s_pitch, u32 height)
{
int i, j;
d_pitch -= s_pitch;
i = height;
do {
/* s_pitch is a few bytes at the most, memcpy is suboptimal */
j = s_pitch;
do
*dst++ = *src++;
while (--j > 0);
dst += d_pitch;
} while (--i > 0);
}
bye, Roman
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Linux-fbdev-devel] [PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts
2005-08-31 17:15 ` [Linux-fbdev-devel] " Roman Zippel
@ 2005-08-31 19:19 ` Knut Petersen
2005-08-31 19:34 ` Roman Zippel
0 siblings, 1 reply; 16+ messages in thread
From: Knut Petersen @ 2005-08-31 19:19 UTC (permalink / raw)
To: Roman Zippel
Cc: Andrew Morton, linux-fbdev-devel, Antonino A. Daplas,
Linux Kernel Development, Jochen Hein, Geert Uytterhoeven
Hi Roman!
>>+static inline void __fb_pad_aligned_buffer(u8 *dst, u32 d_pitch, u8 *src, +
>>u32 s_pitch, u32 height)
>>+{
>>+ int i, j;
>>+
>>+ if (likely(s_pitch==1))
>>+ for(i=0; i < height; i++)
>>+ dst[d_pitch*i] = src[i];
>>
>>
I added the multiply back because gcc (v. 3.3.4) does generate the
fastest code
if I write it this way. I compiled, inspected the generated assembly and
benchmarked
about a dozend variations of the code (benchmark as previously described).
The special case for s_pitch == 1 saves about 10 ms system time (770 ms
-> 760 ms)
The special case for s_pitch == 2 saves about 270 ms system time (2120
-> 1850ms)
with a 16x30 font.
The third case is for even bigger fonts ... I believe that it will not
often be used but
something like that must be present.
>You have now 3 slightly different variants of the same, which isn't really
>an improvement. In my example I showed you how to generate the first and
>last version from the same source.
>
The first version will only be generated when gcc can be sure that
s_pitch is 1.
Therefore you had to explicitly call __fb_pad_aligned_buffer with that
value:
if (likely(idx == 1))
__fb_pad_aligned_buffer(dst, pitch, src, 1, image.height);
else
fb_pad_aligned_buffer(dst, pitch, src, idx, image.height);
With the version I propose it큦 enough to write
__fb_pad_aligned_buffer(dst, pitch, src, idx, image.height);
instead, and you will get good performance for all cases. If the value of
idx/s_pitch is know at compile time, the compiler can and will ignore the
other cases.
>If you also want to optimize for other sizes, you might want to always
>inline the function, if the function call overhead is the largest part
>anyway, the special case for 2 bytes might not be needed anymore.
>
>
fb_pad_aligned_buffer() is usefull to save some space in cases like
softcursor.
It큦 also used by some drivers (nvidia and riva), but the authors of
those drivers
have to decide if they prefer the inlined version or the version fixed
in fbmem.
>BTW this version saves another condition:
>
>static inline void __fb_pad_aligned_buffer(u8 *dst, u32 d_pitch, u8 *src, u32 s_pitch, u32 height)
>{
> int i, j;
>
> d_pitch -= s_pitch;
> i = height;
> do {
> /* s_pitch is a few bytes at the most, memcpy is suboptimal */
> j = s_pitch;
> do
> *dst++ = *src++;
> while (--j > 0);
> dst += d_pitch;
> } while (--i > 0);
>}
>
I tested that code, together with the followig code in but_putcs():
if(idx==1)
__fb_pad_aligned_buffer(dst, pitch, src, 1,
image.height);
else if (idx==2)
__fb_pad_aligned_buffer(dst, pitch, src, 2,
image.height);
else
fb_pad_aligned_buffer(dst, pitch, src, idx,
image.height);
dst += width;
It큦 as fast/slow as your previous version, the measurements are almost
identical.
Let큦 summarize:
Your version of __fb_pad_aligned_buffer looks much better, but it needs
not so nice
conditionals when used. My version looks bad, but it is easier to use
and it is
_faster_.
cu,
knut
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Linux-fbdev-devel] [PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts
2005-08-31 19:19 ` Knut Petersen
@ 2005-08-31 19:34 ` Roman Zippel
2005-08-31 19:52 ` Knut Petersen
0 siblings, 1 reply; 16+ messages in thread
From: Roman Zippel @ 2005-08-31 19:34 UTC (permalink / raw)
To: Knut Petersen
Cc: Andrew Morton, linux-fbdev-devel, Antonino A. Daplas,
Linux Kernel Development, Jochen Hein, Geert Uytterhoeven
[-- Attachment #1: Type: TEXT/PLAIN, Size: 699 bytes --]
Hi,
On Wed, 31 Aug 2005, Knut Petersen wrote:
> I added the multiply back because gcc (v. 3.3.4) does generate the fastest
> code
> if I write it this way.
The multiply is not generally faster, so your version may be the fastest,
but in other situations it will be a lot slower. My version is generally
fast.
> The special case for s_pitch == 2 saves about 270 ms system time (2120 ->
> 1850ms)
> with a 16x30 font.
Compared to what? How much is the function call overhead?
> It´s as fast/slow as your previous version, the measurements are almost
> identical.
The generated code is a bit smaller, so it mostly depends on the icache to
see a difference.
bye, Roman
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Linux-fbdev-devel] [PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts
2005-08-31 19:34 ` Roman Zippel
@ 2005-08-31 19:52 ` Knut Petersen
0 siblings, 0 replies; 16+ messages in thread
From: Knut Petersen @ 2005-08-31 19:52 UTC (permalink / raw)
To: Roman Zippel
Cc: Andrew Morton, linux-fbdev-devel, Antonino A. Daplas,
Linux Kernel Development, Jochen Hein, Geert Uytterhoeven
>>The special case for s_pitch == 2 saves about 270 ms system time (2120 ->
>>1850ms)
>>with a 16x30 font.
>>
>>
>Compared to what? How much is the function call overhead?
>
>
>
Your version of the inline code inserted after an if (idx==2) in
bit_putcs against my version of the
inline code.
cu,
knut
^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2005-08-31 19:52 UTC | newest]
Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2005-08-30 16:15 [PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts Knut Petersen
2005-08-30 16:18 ` Geert Uytterhoeven
2005-08-30 17:58 ` Knut Petersen
2005-08-30 19:13 ` Roman Zippel
2005-08-30 22:26 ` Knut Petersen
2005-08-31 0:51 ` [Linux-fbdev-devel] " Roman Zippel
2005-08-31 6:42 ` Antonino A. Daplas
2005-08-31 15:49 ` Roman Zippel
2005-08-31 12:46 ` Knut Petersen
2005-08-31 17:15 ` [Linux-fbdev-devel] " Roman Zippel
2005-08-31 19:19 ` Knut Petersen
2005-08-31 19:34 ` Roman Zippel
2005-08-31 19:52 ` Knut Petersen
2005-08-30 19:59 ` Geert Uytterhoeven
2005-08-31 1:14 ` Antonino A. Daplas
2005-08-31 1:14 ` Antonino A. Daplas
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).