Re: Re: dmix plugin


* Re: Re: dmix plugin
@ 2003-02-17 15:32 Jaroslaw Sobierski
  2003-02-17 19:45 ` Jaroslav Kysela
  0 siblings, 1 reply; 41+ messages in thread
From: Jaroslaw Sobierski @ 2003-02-17 15:32 UTC (permalink / raw)
  To: abramo.bagnara; +Cc: perex, alsa-devel

>> I see, the read/saturate/write must be atomic, too. In this case, it would
>> be better to use a global (or a set of) mutex(es) to lock the hardware
>> ring buffer. The futexes are nice.
>
>They are nice indeed, but definitely not the right solution here.
>
>Although I don't know if it's the absolute best solution, the 'retry'
>approach I've proposed is far better and much more efficient.

I have to agree with Abramo. A global mutex would cause long and unnecessary 
waits for the processes trying to write to the plugin. Locking access to
individual parts of the buffer is messy. Notice that concurrent writes 
to the same sample in the buffer will occur sporadically, and the "re-read"
in the loop costs almost nothing, while synchronization mechanisms could 
block often.
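
A minimal sketch of the "re-read" idea, assuming C11 atomics in place of
raw x86 instructions. The names mix_retry, saturate16, sum and hw are
hypothetical and only illustrate the point: the writer never blocks, it
simply repeats the cheap clip-and-store step if another writer changed the
running sum in the meantime.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Clip a 32-bit running sum to the signed 16-bit sample range. */
static int16_t saturate16(int32_t v)
{
    if (v > INT16_MAX)
        return INT16_MAX;
    if (v < INT16_MIN)
        return INT16_MIN;
    return (int16_t)v;
}

/*
 * Add one source sample to the shared sum, then publish the clipped
 * value. If the sum changed while we were publishing, read it again
 * and publish once more; no lock is taken at any point.
 */
static void mix_retry(_Atomic int32_t *sum, _Atomic int16_t *hw, int16_t src)
{
    int32_t v;

    atomic_fetch_add(sum, src);
    do {
        v = atomic_load(sum);
        atomic_store(hw, saturate16(v));
    } while (v != atomic_load(sum));   /* rarely true: concurrent writer */
}
```

With a single writer the loop body runs exactly once, which is the common
case the paragraph above relies on.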

--------------
Fycio (J.Sobierski)
 fycio@gucio.com


-------------------------------------------------------
This sf.net email is sponsored by:ThinkGeek
Welcome to geek heaven.
http://thinkgeek.com/sf


* Re: Re: dmix plugin
  2003-02-17 15:32 Re: dmix plugin Jaroslaw Sobierski
@ 2003-02-17 19:45 ` Jaroslav Kysela
  2003-02-17 20:44   ` tomasz motylewski
  2003-02-18 10:00   ` Abramo Bagnara
  0 siblings, 2 replies; 41+ messages in thread
From: Jaroslav Kysela @ 2003-02-17 19:45 UTC (permalink / raw)
  To: Jaroslaw Sobierski
  Cc: abramo.bagnara@libero.it, alsa-devel@lists.sourceforge.net

On Mon, 17 Feb 2003, Jaroslaw Sobierski wrote:

> >> I see, the read/saturate/write must be atomic, too. In this case, it would
> >> be better to use a global (or a set of) mutex(es) to lock the hardware
> >> ring buffer. The futexes are nice.
> >
> >They are nice indeed, but definitely not the right solution here.
> >
> >Although I don't know if it's the absolute best solution, the 'retry'
> >approach I've proposed is far better and much more efficient.
> 
> I have to agree with Abramo. A global mutex would cause long and unnecessary 
> waits for the processes trying to write to the plugin. Locking access to
> individual parts of the buffer is messy. Notice that concurrent writes 
> to the same sample in the buffer will occur sporadically, and the "re-read"
> in the loop costs almost nothing, while synchronization mechanisms could 
> block often.

Note that all your nice ideas lead into a blind alley. Who will silence the 
sum buffer? The driver silences only the hardware buffer, which will not be 
used for the calculation in your algorithm.

						Jaroslav

-----
Jaroslav Kysela <perex@suse.cz>
Linux Kernel Sound Maintainer
ALSA Project, SuSE Labs





* Re: Re: dmix plugin
  2003-02-17 19:45 ` Jaroslav Kysela
@ 2003-02-17 20:44   ` tomasz motylewski
  2003-02-17 20:59     ` Jaroslav Kysela
  2003-02-18 10:00   ` Abramo Bagnara
  1 sibling, 1 reply; 41+ messages in thread
From: tomasz motylewski @ 2003-02-17 20:44 UTC (permalink / raw)
  To: Jaroslav Kysela
  Cc: Jaroslaw Sobierski, abramo.bagnara@libero.it,
	alsa-devel@lists.sourceforge.net

On Mon, 17 Feb 2003, Jaroslav Kysela wrote:

> Note that your all nice ideas go to some blind alley. Who will silence the 
> sum buffer? Driver silences only hardware buffer which will not be used 
> for the calculation in your algorithm.

Silencing is not time critical: if the buffer is big enough, it does not matter
whether it is done 1 ms or 100 ms after the card has played the data. Therefore
it may be done by a separate thread/process/kernel task without any
interference with other processes writing to the buffer.

Anyway, I strongly support writing/adding directly to DMA buffer - lowest
latency possible. Precise information about current position of HW pointer
should be available to each application so it may tune the delay (synchronize
the data coming from the source with slightly different clock frequency!) by
adding/deleting single samples (with interpolation). Mutexes optional.
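
One way to picture "adding/deleting single samples (with interpolation)"
is to stretch or shrink a short block by one sample using linear
interpolation. This is only a sketch under that assumption; the function
name resample_linear and the 16.16 fixed-point position are illustrative
choices, not something proposed in this thread.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Map n_in input samples onto n_out output samples by linear
 * interpolation, keeping the first and last sample in place.
 * Requires n_in >= 2 and n_out >= 2. Choosing n_out = n_in + 1 adds
 * one sample, n_out = n_in - 1 deletes one.
 */
static void resample_linear(const int16_t *in, size_t n_in,
                            int16_t *out, size_t n_out)
{
    for (size_t i = 0; i < n_out; i++) {
        /* Position in the input block as 16.16 fixed point. */
        uint64_t pos = (uint64_t)i * (n_in - 1) * 65536u / (n_out - 1);
        size_t k = (size_t)(pos >> 16);
        int64_t frac = (int64_t)(pos & 0xffff);
        int64_t a = in[k];
        int64_t b = (k + 1 < n_in) ? in[k + 1] : in[k];

        out[i] = (int16_t)(a + (b - a) * frac / 65536);
    }
}
```

For example, the ramp 0, 100, 200, 300 stretched to five samples becomes
0, 75, 150, 225, 300.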

Best regards,
--
Tomasz Motylewski
BFAD GmbH & Co. KG


* Re: Re: dmix plugin
  2003-02-17 20:44   ` tomasz motylewski
@ 2003-02-17 20:59     ` Jaroslav Kysela
  0 siblings, 0 replies; 41+ messages in thread
From: Jaroslav Kysela @ 2003-02-17 20:59 UTC (permalink / raw)
  To: tomasz motylewski
  Cc: Jaroslaw Sobierski, abramo.bagnara@libero.it,
	alsa-devel@lists.sourceforge.net

On Mon, 17 Feb 2003, tomasz motylewski wrote:

> On Mon, 17 Feb 2003, Jaroslav Kysela wrote:
> 
> > Note that your all nice ideas go to some blind alley. Who will silence the 
> > sum buffer? Driver silences only hardware buffer which will not be used 
> > for the calculation in your algorithm.
> 
> Silencing is not time critical, if buffer is big enough it does not matter
> whether is it done 1 ms or 100 ms after the card has played the data. Therefore
> it may be done by a separate thread/process/kernel task without any
> interference with other processes writing to the buffer.

It is time critical for the dmix plugin, because other processes might 
write new samples to "empty" areas.

						Jaroslav

-----
Jaroslav Kysela <perex@suse.cz>
Linux Kernel Sound Maintainer
ALSA Project, SuSE Labs




* Re: Re: dmix plugin
  2003-02-17 19:45 ` Jaroslav Kysela
  2003-02-17 20:44   ` tomasz motylewski
@ 2003-02-18 10:00   ` Abramo Bagnara
  2003-02-18 12:52     ` Jaroslav Kysela
  2003-02-18 21:07     ` Jaroslav Kysela
  1 sibling, 2 replies; 41+ messages in thread
From: Abramo Bagnara @ 2003-02-18 10:00 UTC (permalink / raw)
  To: Jaroslav Kysela; +Cc: Jaroslaw Sobierski, alsa-devel@lists.sourceforge.net

Jaroslav Kysela wrote:
> 
> On Mon, 17 Feb 2003, Jaroslaw Sobierski wrote:
> 
> > >> I see, the read/saturate/write must be atomic, too. In this case, it would
> > >> be better to use a global (or a set of) mutex(es) to lock the hardware
> > >> ring buffer. The futexes are nice.
> > >
> > >They are nice indeed, but definitely not the right solution here.
> > >
> > >Although I don't know if it's the absolute best solution, the 'retry'
> > >approach I've proposed is far better and much more efficient.
> >
> > I have to agree with Abramo. A global mutex would cause long and unnecessary
> > waits for the processes trying to write to the plugin. Locking access to
> > individual parts of the buffer is messy. Notice that concurrent writes
> > to the same sample in the buffer will occur sporadically, and the "re-read"
> > in the loop costs almost nothing, while synchronization mechanisms could
> > block often.
> 
> Note that your all nice ideas go to some blind alley. Who will silence the
> sum buffer? Driver silences only hardware buffer which will not be used
> for the calculation in your algorithm.


Not so blind ;-)

        v = *src;
        if (cmpxchg(hw, 0, 1) == 0)
                v -= *sw;
        xadd(sw, v);
        do {
                v = *sw;
                if (v > 0x7fff)
                        s = 0x7fff;
                else if (v < -0x8000)
                        s = -0x8000;
                else
                        s = v;
                *hw = s;
        } while (unlikely(v != *sw));

Have I convinced you?

However, as I wrote in my first message, the evil of the dmix
approach lies in the details: they might destroy the efficiency of the approach
rather easily.
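
For readers who want to try the snippet above outside the kernel headers,
here is a hedged transcription to C11 atomics. It assumes that
atomic_compare_exchange_strong stands in for cmpxchg and atomic_fetch_add
for xadd; the function names mix_sample and clip16 are made up for this
sketch, and hw/sw keep the meaning used in the thread (hardware sample and
32-bit sum).

```c
#include <stdatomic.h>
#include <stdint.h>

/* Clip the 32-bit sum to the signed 16-bit range. */
static int16_t clip16(int32_t v)
{
    if (v > 0x7fff)
        return 0x7fff;
    if (v < -0x8000)
        return -0x8000;
    return (int16_t)v;
}

static void mix_sample(_Atomic int16_t *hw, _Atomic int32_t *sw, int16_t src)
{
    int32_t v = src;
    int16_t expected = 0;

    /*
     * A zero in hw means the driver silenced this slot. The first
     * writer to notice claims it (0 -> 1) and subtracts the stale sum,
     * so that the following add leaves only its own sample in sw.
     */
    if (atomic_compare_exchange_strong(hw, &expected, 1))
        v -= atomic_load(sw);
    atomic_fetch_add(sw, v);

    /* Publish the clipped sum; retry if another writer changed it. */
    do {
        v = atomic_load(sw);
        atomic_store(hw, clip16(v));
    } while (v != atomic_load(sw));
}
```

Starting from a silenced slot with a stale sum of 1234, mixing 100 and
then 200 leaves 300 in both sw and hw; after the slot is silenced again,
mixing -50 leaves -50.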

-- 
Abramo Bagnara                       mailto:abramo.bagnara@libero.it

Opera Unica                          Phone: +39.546.656023
Via Emilia Interna, 140
48014 Castel Bolognese (RA) - Italy



* Re: Re: dmix plugin
  2003-02-18 10:00   ` Abramo Bagnara
@ 2003-02-18 12:52     ` Jaroslav Kysela
  2003-02-18 13:10       ` Jaroslaw Sobierski
  2003-02-18 14:51       ` Paul Davis
  2003-02-18 21:07     ` Jaroslav Kysela
  1 sibling, 2 replies; 41+ messages in thread
From: Jaroslav Kysela @ 2003-02-18 12:52 UTC (permalink / raw)
  To: Abramo Bagnara; +Cc: Jaroslaw Sobierski, alsa-devel@lists.sourceforge.net

On Tue, 18 Feb 2003, Abramo Bagnara wrote:

> Jaroslav Kysela wrote:
> > 
> > On Mon, 17 Feb 2003, Jaroslaw Sobierski wrote:
> > 
> > > >> I see, the read/saturate/write must be atomic, too. In this case, it would
> > > >> be better to use a global (or a set of) mutex(es) to lock the hardware
> > > >> ring buffer. The futexes are nice.
> > > >
> > > >They are nice indeed, but definitely not the right solution here.
> > > >
> > > >Although I don't know if it's the absolute best solution, the 'retry'
> > > >approach I've proposed is far better and much more efficient.
> > >
> > > I have to agree with Abramo. A global mutex would cause long and unnecessary
> > > waits for the processes trying to write to the plugin. Locking access to
> > > individual parts of the buffer is messy. Notice that concurrent writes
> > > to the same sample in the buffer will occur sporadically, and the "re-read"
> > > in the loop costs almost nothing, while synchronization mechanisms could
> > > block often.
> > 
> > Note that your all nice ideas go to some blind alley. Who will silence the
> > sum buffer? Driver silences only hardware buffer which will not be used
> > for the calculation in your algorithm.
> 
> 
> Not so blind ;-)
> 
> 	v = *src;
> 	if (cmpxchg(hw, 0, 1) == 0)
> 		v -= *sw;
>         xadd(sw, v);
>         do {
>                 v = *sw;
>                 if (v > 0x7fff)
>                         s = 0x7fff;
>                 else if (v < -0x8000)
>                         s = -0x8000;
>                 else
>                         s = v;

A small correction (we have to avoid zero results in the hw buffer):

		  else if (v == 0)
			s = 1;
		  else
			s = v;

>                 *hw = s;
>         } while (unlikely(v != *sw));
> 
> I've convinced you?
> 
> However as I've written in the my first message the evil of dmix
> approach lies in details: they might destroy efficiency of approach
> rather easily.

Yes, but it seems that we can still do the job properly without global locks, 
which is pretty nice. Thank you for your help.

						Jaroslav

-----
Jaroslav Kysela <perex@suse.cz>
Linux Kernel Sound Maintainer
ALSA Project, SuSE Labs




* Re: Re: dmix plugin
  2003-02-18 12:52     ` Jaroslav Kysela
@ 2003-02-18 13:10       ` Jaroslaw Sobierski
  2003-02-18 13:19         ` Jaroslav Kysela
  2003-02-18 14:51       ` Paul Davis
  1 sibling, 1 reply; 41+ messages in thread
From: Jaroslaw Sobierski @ 2003-02-18 13:10 UTC (permalink / raw)
  To: Jaroslav Kysela; +Cc: Abramo Bagnara, alsa-devel@lists.sourceforge.net

Quoting Jaroslav Kysela:
[...]
> > 
> > 	v = *src;
> > 	if (cmpxchg(hw, 0, 1) == 0)
> > 		v -= *sw;
> >         xadd(sw, v);
> >         do {
> >                 v = *sw;
> >                 if (v > 0x7fff)
> >                         s = 0x7fff;
> >                 else if (v < -0x8000)
> >                         s = -0x8000;
> >                 else
> >                         s = v;
> 
> A bit correction (we have to avoid zero results in hw buffer):
> 
> 		  else if (v == 0)
> 			s = 1;
> 		  else
> 			s = v;
> 

Why?! As I wrote yesterday: even if the resulting sample
is zero, we can still treat the hw buffer as cleared. It makes no
difference whether it was reset by the driver or the samples just
added up to zero. If we have zero in the hw not because of a reset
we must also have 0 in sw, so the clearing code will have no effect.
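
The argument can be checked with a single-threaded model of the same
steps. This is a simplified sketch with plain variables instead of atomic
operations (mix_model and clip_sample are hypothetical names), so it only
shows the arithmetic, not the concurrency.

```c
#include <stdint.h>

/* Clip the 32-bit sum to the signed 16-bit range. */
static int16_t clip_sample(int32_t v)
{
    if (v > 0x7fff)
        return 0x7fff;
    if (v < -0x8000)
        return -0x8000;
    return (int16_t)v;
}

/* Single-threaded model of one mixing step. */
static void mix_model(int16_t *hw, int32_t *sw, int16_t src)
{
    int32_t v = src;

    if (*hw == 0) {     /* stands in for cmpxchg(hw, 0, 1) == 0 */
        *hw = 1;
        v -= *sw;       /* if *sw is already 0 this subtracts nothing */
    }
    *sw += v;
    *hw = clip_sample(*sw);
}
```

Mixing 500, then -500, then 70 into a freshly silenced slot ends with 70
in both buffers: the zero produced by 500 + (-500) triggers the clearing
branch, but the sum it subtracts is 0, so nothing is lost.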

--------------
Fycio (J.Sobierski)
 fycio@gucio.com



* Re: Re: dmix plugin
  2003-02-18 13:10       ` Jaroslaw Sobierski
@ 2003-02-18 13:19         ` Jaroslav Kysela
  0 siblings, 0 replies; 41+ messages in thread
From: Jaroslav Kysela @ 2003-02-18 13:19 UTC (permalink / raw)
  To: Jaroslaw Sobierski; +Cc: Abramo Bagnara, alsa-devel@lists.sourceforge.net

On Tue, 18 Feb 2003, Jaroslaw Sobierski wrote:

> Quoting Jaroslav Kysela:
> [...]
> > > 
> > > 	v = *src;
> > > 	if (cmpxchg(hw, 0, 1) == 0)
> > > 		v -= *sw;
> > >         xadd(sw, v);
> > >         do {
> > >                 v = *sw;
> > >                 if (v > 0x7fff)
> > >                         s = 0x7fff;
> > >                 else if (v < -0x8000)
> > >                         s = -0x8000;
> > >                 else
> > >                         s = v;
> > 
> > A bit correction (we have to avoid zero results in hw buffer):
> > 
> > 		  else if (v == 0)
> > 			s = 1;
> > 		  else
> > 			s = v;
> > 
> 
> Why?! It's like I've written yesterday : even if the outcoming sample
> is zero, we can still treat the hw buffer as cleared. It makes no
> difference whether it was reset by the driver or the samples just
> added up to zero. If we have zero in the hw not because of a reset
> we must also have 0 in sw, so the clearing code will have no effect.

Thanks for the correction. Some things are not visible at first glance.

						Jaroslav

-----
Jaroslav Kysela <perex@suse.cz>
Linux Kernel Sound Maintainer
ALSA Project, SuSE Labs




* Re: Re: dmix plugin
  2003-02-18 12:52     ` Jaroslav Kysela
  2003-02-18 13:10       ` Jaroslaw Sobierski
@ 2003-02-18 14:51       ` Paul Davis
  2003-02-18 16:51         ` Jaroslav Kysela
  1 sibling, 1 reply; 41+ messages in thread
From: Paul Davis @ 2003-02-18 14:51 UTC (permalink / raw)
  To: alsa-devel@lists.sourceforge.net

>> 	v = *src;
>> 	if (cmpxchg(hw, 0, 1) == 0)
>> 		v -= *sw;
>>         xadd(sw, v);
>>         do {
>>                 v = *sw;
>>                 if (v > 0x7fff)
>>                         s = 0x7fff;
>>                 else if (v < -0x8000)
>>                         s = -0x8000;
>>                 else
>>                         s = v;
>
>A bit correction (we have to avoid zero results in hw buffer):
>
>		  else if (v == 0)
>			s = 1;
>		  else
>			s = v;
>
>>                 *hw = s;
>>         } while (unlikely(v != *sw));

Help me out here. Is this the code path that has to be followed to write
a single sample to the buffer?



* Re: Re: dmix plugin
  2003-02-18 14:51       ` Paul Davis
@ 2003-02-18 16:51         ` Jaroslav Kysela
  0 siblings, 0 replies; 41+ messages in thread
From: Jaroslav Kysela @ 2003-02-18 16:51 UTC (permalink / raw)
  To: Paul Davis; +Cc: alsa-devel@lists.sourceforge.net

On Tue, 18 Feb 2003, Paul Davis wrote:

> >> 	v = *src;
> >> 	if (cmpxchg(hw, 0, 1) == 0)
> >> 		v -= *sw;
> >>         xadd(sw, v);
> >>         do {
> >>                 v = *sw;
> >>                 if (v > 0x7fff)
> >>                         s = 0x7fff;
> >>                 else if (v < -0x8000)
> >>                         s = -0x8000;
> >>                 else
> >>                         s = v;
> >
> >A bit correction (we have to avoid zero results in hw buffer):
> >
> >		  else if (v == 0)
> >			s = 1;
> >		  else
> >			s = v;
> >
> >>                 *hw = s;
> >>         } while (unlikely(v != *sw));
> 
> help me out here. is this the code path that has be followed to write
> a single sample to the buffer?

Yes, this code updates one sample in the hardware buffer.

						Jaroslav

-----
Jaroslav Kysela <perex@suse.cz>
Linux Kernel Sound Maintainer
ALSA Project, SuSE Labs




* Re: Re: dmix plugin
  2003-02-18 10:00   ` Abramo Bagnara
  2003-02-18 12:52     ` Jaroslav Kysela
@ 2003-02-18 21:07     ` Jaroslav Kysela
  2003-02-19 10:20       ` Abramo Bagnara
  2003-02-19 10:33       ` Jaroslaw Sobierski
  1 sibling, 2 replies; 41+ messages in thread
From: Jaroslav Kysela @ 2003-02-18 21:07 UTC (permalink / raw)
  To: Abramo Bagnara; +Cc: Jaroslaw Sobierski, alsa-devel@lists.sourceforge.net

On Tue, 18 Feb 2003, Abramo Bagnara wrote:

> Jaroslav Kysela wrote:
> > 
> > On Mon, 17 Feb 2003, Jaroslaw Sobierski wrote:
> > 
> > > >> I see, the read/saturate/write must be atomic, too. In this case, it would
> > > >> be better to use a global (or a set of) mutex(es) to lock the hardware
> > > >> ring buffer. The futexes are nice.
> > > >
> > > >They are nice indeed, but definitely not the right solution here.
> > > >
> > > >Although I don't know if it's the absolute best solution, the 'retry'
> > > >approach I've proposed is far better and much more efficient.
> > >
> > > I have to agree with Abramo. A global mutex would cause long and unnecessary
> > > waits for the processes trying to write to the plugin. Locking access to
> > > individual parts of the buffer is messy. Notice that concurrent writes
> > > to the same sample in the buffer will occur sporadically, and the "re-read"
> > > in the loop costs almost nothing, while synchronization mechanisms could
> > > block often.
> > 
> > Note that your all nice ideas go to some blind alley. Who will silence the
> > sum buffer? Driver silences only hardware buffer which will not be used
> > for the calculation in your algorithm.
> 
> 
> Not so blind ;-)
> 
> 	v = *src;
> 	if (cmpxchg(hw, 0, 1) == 0)
> 		v -= *sw;
>         xadd(sw, v);
>         do {
>                 v = *sw;
>                 if (v > 0x7fff)
>                         s = 0x7fff;
>                 else if (v < -0x8000)
>                         s = -0x8000;
>                 else
>                         s = v;
>                 *hw = s;
>         } while (unlikely(v != *sw));
> 
> I've convinced you?
> 
> However as I've written in the my first message the evil of dmix
> approach lies in details: they might destroy efficiency of approach
> rather easily.

I've implemented the whole transfer and mix loop in assembly and it works
without any drastic impact on CPU usage. I tried to optimize the assembler
part as much as I could, but if some assembler guru wants to take a look,
I'd appreciate it. The function is named mix_areas1() in
alsa-lib/src/pcm/pcm_dmix.c.

						Jaroslav

-----
Jaroslav Kysela <perex@suse.cz>
Linux Kernel Sound Maintainer
ALSA Project, SuSE Labs




* Re: Re: dmix plugin
  2003-02-18 21:07     ` Jaroslav Kysela
@ 2003-02-19 10:20       ` Abramo Bagnara
  2003-02-19 11:01         ` Jaroslav Kysela
  2003-02-19 10:33       ` Jaroslaw Sobierski
  1 sibling, 1 reply; 41+ messages in thread
From: Abramo Bagnara @ 2003-02-19 10:20 UTC (permalink / raw)
  To: Jaroslav Kysela; +Cc: Jaroslaw Sobierski, alsa-devel@lists.sourceforge.net

Jaroslav Kysela wrote:
> 
> I've implemented the whole transfer and mix loop in assembly and it works
> without any drastic impact on CPU usage. I tried to optimize the assembler
> part as much as I can, but if some assembler guru want to give a glance,
> I'll appreciate it. The function is named mix_areas1() in
> alsa-lib/src/pcm/pcm_dmix.c.

one comment:

It's better to execute the interleaved check once, rather than inside mix_areas
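
A rough illustration of what hoisting such a check could look like,
assuming 16-bit samples and a layout that is either fully interleaved or
fully planar. The names mix_channel, mix_block and clip_s16 are
hypothetical and this is not the code in pcm_dmix.c; the mixing here is
also non-atomic to keep the sketch short.

```c
#include <stddef.h>
#include <stdint.h>

/* Clip the 32-bit sum to the signed 16-bit range. */
static int16_t clip_s16(int32_t v)
{
    if (v > INT16_MAX)
        return INT16_MAX;
    if (v < INT16_MIN)
        return INT16_MIN;
    return (int16_t)v;
}

/* Inner loop: only strides, no layout decisions. */
static void mix_channel(int16_t *dst, int32_t *sum, const int16_t *src,
                        size_t step, size_t frames)
{
    while (frames-- > 0) {
        *sum += *src;
        *dst = clip_s16(*sum);
        dst += step;
        sum += step;
        src += step;
    }
}

/* The layout is examined once per transfer, outside the inner loop. */
static void mix_block(int16_t *dst, int32_t *sum, const int16_t *src,
                      size_t frames, size_t channels, int interleaved)
{
    size_t step = interleaved ? channels : 1;     /* sample to sample */
    size_t chan_off = interleaved ? 1 : frames;   /* channel to channel */

    for (size_t c = 0; c < channels; c++)
        mix_channel(dst + c * chan_off, sum + c * chan_off,
                    src + c * chan_off, step, frames);
}
```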

one objection:

I doubt very much that you gain anything by coding the mixing loop in
assembler. Do you have data showing that?


-- 
Abramo Bagnara                       mailto:abramo.bagnara@libero.it

Opera Unica                          Phone: +39.546.656023
Via Emilia Interna, 140
48014 Castel Bolognese (RA) - Italy


-------------------------------------------------------
This SF.net email is sponsored by: SlickEdit Inc. Develop an edge.
The most comprehensive and flexible code editor you can use.
Code faster. C/C++, C#, Java, HTML, XML, many more. FREE 30-Day Trial.
www.slickedit.com/sourceforge


* Re: Re: dmix plugin
  2003-02-18 21:07     ` Jaroslav Kysela
  2003-02-19 10:20       ` Abramo Bagnara
@ 2003-02-19 10:33       ` Jaroslaw Sobierski
  2003-02-19 11:08         ` Jaroslav Kysela
  1 sibling, 1 reply; 41+ messages in thread
From: Jaroslaw Sobierski @ 2003-02-19 10:33 UTC (permalink / raw)
  To: Jaroslav Kysela; +Cc: Abramo Bagnara, alsa-devel@lists.sourceforge.net

Quoting Jaroslav Kysela <perex@suse.cz>:
> 
> I've implemented the whole transfer and mix loop in assembly and it works
> without any drastic impact on CPU usage. I tried to optimize the assembler
> part as much as I can, but if some assembler guru want to give a glance,
> I'll appreciate it. The function is named mix_areas1() in
> alsa-lib/src/pcm/pcm_dmix.c.
> 

It seems to me it would make sense to code it for MMX (to use the saturation
it offers, for example). If you go for pure 386 there's little to win.
Did you look at the assembly generated by gcc when compiling with 
optimizations? I usually make this the starting point when moving time-critical 
code to assembly, and if it looks optimized enough, I leave it at that,
unless I can use tricks not available to the compiler - like, again, mmx.
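
To make the MMX remark concrete: the MMX instruction PADDSW adds four
signed 16-bit lanes with saturation in one step. The portable sketch
below only emulates that behaviour in scalar C (adds16 and adds16x4 are
hypothetical helper names), so it shows the semantics rather than the
speed. Note that it saturates each pairwise add, which is not the same as
keeping a 32-bit sum and clipping at the end as in the algorithm
discussed earlier.

```c
#include <stdint.h>

/* Signed 16-bit add that clamps instead of wrapping. */
static int16_t adds16(int16_t a, int16_t b)
{
    int32_t v = (int32_t)a + (int32_t)b;

    if (v > INT16_MAX)
        return INT16_MAX;
    if (v < INT16_MIN)
        return INT16_MIN;
    return (int16_t)v;
}

/* Four lanes at once, like one PADDSW on a 64-bit MMX register. */
static void adds16x4(int16_t dst[4], const int16_t src[4])
{
    for (int i = 0; i < 4; i++)
        dst[i] = adds16(dst[i], src[i]);
}
```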

I don't know how well gcc is optimized for Intel CPUs, but I remember that you
really had to work your ass off to beat inner loops optimized by the Watcom
compilers (BTW I heard they're coming back with open source compilers :-). 
Not to mention the proprietary Intel compilers, which can take into
account things like word alignment for data and code, cache hit / miss 
situations, branch prediction and all kinds of magical stuff.

I'll take a closer look at the code when I have more time though.

--------------
Fycio (J.Sobierski)
 fycio@gucio.com


* Re: Re: dmix plugin
  2003-02-19 10:20       ` Abramo Bagnara
@ 2003-02-19 11:01         ` Jaroslav Kysela
  2003-02-19 11:17           ` Abramo Bagnara
  0 siblings, 1 reply; 41+ messages in thread
From: Jaroslav Kysela @ 2003-02-19 11:01 UTC (permalink / raw)
  To: Abramo Bagnara; +Cc: Jaroslaw Sobierski, alsa-devel@lists.sourceforge.net

On Wed, 19 Feb 2003, Abramo Bagnara wrote:

> Jaroslav Kysela wrote:
> > 
> > I've implemented the whole transfer and mix loop in assembly and it works
> > without any drastic impact on CPU usage. I tried to optimize the assembler
> > part as much as I can, but if some assembler guru want to give a glance,
> > I'll appreciate it. The function is named mix_areas1() in
> > alsa-lib/src/pcm/pcm_dmix.c.
> 
> one comment:
> 
> It's better to execute interleaved check once and not in mix_areas

Done. I was too tired yesterday to bother with these details.

> one objection:
> 
> I doubt very much that you gain anything coding the mixing loop in
> assembler, you've data showing that?

I think that I saved some ticks by duplicating the code for saturation, and 
the main while{} loop is also more efficient than what GCC generates. But it's 
only a guess.

						Jaroslav

-----
Jaroslav Kysela <perex@suse.cz>
Linux Kernel Sound Maintainer
ALSA Project, SuSE Labs





* Re: Re: dmix plugin
  2003-02-19 10:33       ` Jaroslaw Sobierski
@ 2003-02-19 11:08         ` Jaroslav Kysela
  0 siblings, 0 replies; 41+ messages in thread
From: Jaroslav Kysela @ 2003-02-19 11:08 UTC (permalink / raw)
  To: Jaroslaw Sobierski; +Cc: Abramo Bagnara, alsa-devel@lists.sourceforge.net

On Wed, 19 Feb 2003, Jaroslaw Sobierski wrote:

> Quoting Jaroslav Kysela <perex@suse.cz>:
> > 
> > I've implemented the whole transfer and mix loop in assembly and it works
> > without any drastic impact on CPU usage. I tried to optimize the assembler
> > part as much as I can, but if some assembler guru want to give a glance,
> > I'll appreciate it. The function is named mix_areas1() in
> > alsa-lib/src/pcm/pcm_dmix.c.
> > 
> 
> It seems to me it would make sens to code it for mmx (to use the saturation
> it offers for example). If you go for pure 386 there's little to win.

Yes and no. I don't think that saturation will be needed often, so the
saturation code path mostly takes 4 instructions (two
compares, two skipped conditional jumps).

> Did you look at the assembly generated by gcc when compiling with 
> optimiazations? I usually make this a start point when moving time-critical 

Yes, my code is based on the code from GCC.

> code to assembly, and if it looks optimized enough - I leave it at that,
> unless I can use tricks not available to the compiler - like, again, mmx.
> 
> I don't know how well gcc is optimized for intels, but I remember that you
> really had to work your ass of to beat inner loops optimized by Watcomm
> compilers (BTW I heard they're coming back with open source compilers :-). 
> Not to mention proprietary Intel compilers which can take into
> account things like word alignment for data and code, cache hit / miss 
> situations, branch preditiction and all kinds of magical stuff.

Yes, of course. I've not claimed that I wrote the best code in the world ;-)
But it's something we can start with.

						Jaroslav

-----
Jaroslav Kysela <perex@suse.cz>
Linux Kernel Sound Maintainer
ALSA Project, SuSE Labs




* Re: Re: dmix plugin
  2003-02-19 11:01         ` Jaroslav Kysela
@ 2003-02-19 11:17           ` Abramo Bagnara
  2003-02-19 13:49             ` Abramo Bagnara
  0 siblings, 1 reply; 41+ messages in thread
From: Abramo Bagnara @ 2003-02-19 11:17 UTC (permalink / raw)
  To: Jaroslav Kysela; +Cc: Jaroslaw Sobierski, alsa-devel@lists.sourceforge.net

Jaroslav Kysela wrote:
> 
> On Wed, 19 Feb 2003, Abramo Bagnara wrote:
> 
> > Jaroslav Kysela wrote:
> > >
> > > I've implemented the whole transfer and mix loop in assembly and it works
> > > without any drastic impact on CPU usage. I tried to optimize the assembler
> > > part as much as I can, but if some assembler guru want to give a glance,
> > > I'll appreciate it. The function is named mix_areas1() in
> > > alsa-lib/src/pcm/pcm_dmix.c.
> >
> > one comment:
> >
> > It's better to execute interleaved check once and not in mix_areas
> 
> Done. I was tired enough yesterday to bother with these details.
> 
> > one objection:
> >
> > I doubt very much that you gain anything coding the mixing loop in
> > assembler, you've data showing that?
> 
> I think that I spent some ticks by duplicating code for saturation and
> also the main while{} loop is more effective than GCC generates. But it's
> only guess.

I hope to find the time to check it this evening

-- 
Abramo Bagnara                       mailto:abramo.bagnara@libero.it

Opera Unica                          Phone: +39.546.656023
Via Emilia Interna, 140
48014 Castel Bolognese (RA) - Italy



* Re: Re: dmix plugin
  2003-02-19 11:17           ` Abramo Bagnara
@ 2003-02-19 13:49             ` Abramo Bagnara
  2003-02-19 15:45               ` Jaroslaw Sobierski
  2003-02-19 18:34               ` Jaroslav Kysela
  0 siblings, 2 replies; 41+ messages in thread
From: Abramo Bagnara @ 2003-02-19 13:49 UTC (permalink / raw)
  To: Jaroslav Kysela, Jaroslaw Sobierski,
	alsa-devel@lists.sourceforge.net

[-- Attachment #1: Type: text/plain, Size: 2816 bytes --]

Abramo Bagnara wrote:
> 
> Jaroslav Kysela wrote:
> >
> > On Wed, 19 Feb 2003, Abramo Bagnara wrote:
> >
> > > Jaroslav Kysela wrote:
> > > >
> > > > I've implemented the whole transfer and mix loop in assembly and it works
> > > > without any drastic impact on CPU usage. I tried to optimize the assembler
> > > > part as much as I can, but if some assembler guru want to give a glance,
> > > > I'll appreciate it. The function is named mix_areas1() in
> > > > alsa-lib/src/pcm/pcm_dmix.c.
> > >
> > > one comment:
> > >
> > > It's better to execute interleaved check once and not in mix_areas
> >
> > Done. I was tired enough yesterday to bother with these details.
> >
> > > one objection:
> > >
> > > I doubt very much that you gain anything coding the mixing loop in
> > > assembler, you've data showing that?
> >
> > I think that I spent some ticks by duplicating code for saturation and
> > also the main while{} loop is more effective than GCC generates. But it's
> > only guess.
> 
> I hope to find the time to check it this evening

I've stolen some time from paid work.

The results are amazing, and I'd say Jaroslav has made some mistakes in
his handmade asm.

$ cat /proc/cpuinfo
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 6
model           : 6
model name      : AMD Athlon(tm) XP 1700+
stepping        : 2
cpu MHz         : 1460.471
cache size      : 256 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca
cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow
bogomips        : 2916.35
$ gcc -v
Reading specs from /usr/lib/gcc-lib/i386-redhat-linux/3.2.1/specs
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man
--infodir=/usr/share/info --enable-shared --enable-threads=posix
--disable-checking --with-system-zlib --enable-__cxa_atexit
--host=i386-redhat-linux
Thread model: posix
gcc version 3.2.1 20021125 (Red Hat Linux 8.0 3.2.1-1)
$ make
gcc -O6 -W -Wall   -c -o sum.o sum.c
sum.c: In function `main':
sum.c:242: warning: implicit declaration of function `printf'
sum.c:219: warning: unused parameter `argc'
sum.c:255: warning: control reaches end of non-void function
sum.c: In function `mix_areas0':
sum.c:64: warning: unused parameter `sum'
gcc   sum.o   -o sum
$ ./sum 2048 4 32767
mix_areas0: 110603
mix_areas1: 1512610
mix_areas2: 157597

mix_areas0 is the naive, incorrect version
mix_areas1 is Jaroslav's asm
mix_areas2 is my best attempt

Time in clock ticks.
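
For readers who don't want to dig through the attachment, the retry scheme
behind mix_areas2 can be restated with C11 atomics. This is only a sketch
under assumptions: mix_retry and clamp16 are made-up names, it is not the
code that was timed above, and it inherits whatever races the posted
algorithm has. It treats a zero in dst as "nobody has mixed into this slot
yet", which is how the cmpxchg(dst, 0, 1) step in sum.c reads.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp a 32-bit running total to the signed 16-bit range. */
static int16_t clamp16(int32_t v)
{
	if (v > INT16_MAX)
		return INT16_MAX;
	if (v < INT16_MIN)
		return INT16_MIN;
	return (int16_t)v;
}

/*
 * Hypothetical restatement of the mix_areas2 steps (names are not from
 * alsa-lib). dst is the 16-bit output buffer, sum holds the unclamped
 * 32-bit total for the same sample position.
 */
static void mix_retry(size_t size, _Atomic int16_t *dst,
		      const int16_t *src, _Atomic int32_t *sum)
{
	while (size-- > 0) {
		int32_t sample = *src++;
		int16_t expected = 0;

		/* First writer into an empty slot: cancel the old total. */
		if (atomic_compare_exchange_strong(dst, &expected, 1))
			sample -= atomic_load(sum);
		atomic_fetch_add(sum, sample);

		/* Re-read until the total we clamped is still current. */
		do {
			sample = atomic_load(sum);
			atomic_store(dst, clamp16(sample));
		} while (sample != atomic_load(sum));

		dst++;
		sum++;
	}
}
```

The only locked operations are the compare-and-swap and the add; the
saturate-and-store step is simply repeated if another writer changed the
total in the meantime.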

-- 
Abramo Bagnara                       mailto:abramo.bagnara@libero.it

Opera Unica                          Phone: +39.546.656023
Via Emilia Interna, 140
48014 Castel Bolognese (RA) - Italy

[-- Attachment #2: sum.c --]
[-- Type: text/plain, Size: 5168 bytes --]

#include <stdlib.h>
#include <stdlib.h>
#include <string.h>

#define rdtscll(val) \
     __asm__ __volatile__("rdtsc" : "=A" (val))

#define likely(x)       __builtin_expect((x),1)
#define unlikely(x)     __builtin_expect((x),0)

typedef short int s16;
typedef int s32;

#ifdef CONFIG_SMP
#define LOCK_PREFIX "lock ; "
#else
#define LOCK_PREFIX ""
#endif

struct __xchg_dummy { unsigned long a[100]; };
#define __xg(x) ((struct __xchg_dummy *)(x))

static inline unsigned long __cmpxchg(volatile void *ptr, unsigned long old,
				      unsigned long new, int size)
{
	unsigned long prev;
	switch (size) {
	case 1:
		__asm__ __volatile__(LOCK_PREFIX "cmpxchgb %b1,%2"
				     : "=a"(prev)
				     : "q"(new), "m"(*__xg(ptr)), "0"(old)
				     : "memory");
		return prev;
	case 2:
		__asm__ __volatile__(LOCK_PREFIX "cmpxchgw %w1,%2"
				     : "=a"(prev)
				     : "q"(new), "m"(*__xg(ptr)), "0"(old)
				     : "memory");
		return prev;
	case 4:
		__asm__ __volatile__(LOCK_PREFIX "cmpxchgl %1,%2"
				     : "=a"(prev)
				     : "q"(new), "m"(*__xg(ptr)), "0"(old)
				     : "memory");
		return prev;
	}
	return old;
}

#define cmpxchg(ptr,o,n)\
	((__typeof__(*(ptr)))__cmpxchg((ptr),(unsigned long)(o),\
					(unsigned long)(n),sizeof(*(ptr))))

static inline void atomic_add(volatile int *dst, int v)
{
	__asm__ __volatile__(
		LOCK_PREFIX "addl %0,%1"
		:"=m" (*dst)
		:"ir" (v));
}

void mix_areas0(unsigned int size,
		volatile s16 *dst, s16 *src,
		volatile s32 *sum,
		unsigned int dst_step, unsigned int src_step)
{
	while (size-- > 0) {
		s32 sample = *dst + *src;
		if (unlikely(sample & 0xffff0000))
			*dst = sample > 0 ? 0x7fff : -0x8000;
		else
			*dst = sample;
		dst += dst_step;
		src += src_step;
	}
}

void mix_areas1(unsigned int size,
		volatile s16 *dst, s16 *src,
		volatile s32 *sum, unsigned int dst_step,
		unsigned int src_step, unsigned int sum_step)
{
	/*
	 *  ESI - src
	 *  EDI - dst
	 *  EBX - sum
	 *  ECX - old sample
	 *  EAX - sample / temporary
	 *  EDX - size
	 */
	__asm__ __volatile__ (
		"\n"

		/*
		 *  initialization, load EDX, ESI, EDI, EBX registers
		 */
		"\tmovl %0, %%edx\n"
		"\tmovl %1, %%edi\n"
		"\tmovl %2, %%esi\n"
		"\tmovl %3, %%ebx\n"

		/*
		 * while (size-- > 0) {
		 */
		"\tcmp $0, %%edx\n"
		"jz 6f\n"

		"1:"

		/*
		 *   sample = *src;
		 *   if (cmpxchg(*dst, 0, 1) == 0)
		 *     sample -= *sum;
		 *   xadd(*sum, sample);
		 */
		"\tmovw $0, %%ax\n"
		"\tmovw $1, %%cx\n"
		"\tlock; cmpxchgw %%cx, (%%edi)\n"
		"\tmovswl (%%esi), %%ecx\n"
		"\tjnz 2f\n"
		"\tsubl (%%ebx), %%ecx\n"
		"2:"
		"\tlock; addl %%ecx, (%%ebx)\n"

		/*
		 *   do {
		 *     sample = old_sample = *sum;
		 *     saturate(v);
		 *     *dst = sample;
		 *   } while (v != *sum);
		 */

		"3:"
		"\tmovl (%%ebx), %%ecx\n"
		"\tcmpl $0x7fff,%%ecx\n"
		"\tjg 4f\n"
		"\tcmpl $-0x8000,%%ecx\n"
		"\tjl 5f\n"
		"\tmovw %%cx, (%%edi)\n"
		"\tcmpl %%ecx, (%%ebx)\n"
		"\tjnz 3b\n"

		/*
		 * while (size-- > 0)
		 */
		"\tadd %4, %%edi\n"
		"\tadd %5, %%esi\n"
		"\tadd %6, %%ebx\n"
		"\tdecl %%edx\n"
		"\tjnz 1b\n"
		"\tjmp 6f\n"

		/*
		 *  sample > 0x7fff
		 */

		"4:"
		"\tmovw $0x7fff, %%ax\n"
		"\tmovw %%ax, (%%edi)\n"
		"\tcmpl %%ecx,(%%ebx)\n"
		"\tjnz 3b\n"
		"\tadd %4, %%edi\n"
		"\tadd %5, %%esi\n"
		"\tadd %6, %%ebx\n"
		"\tdecl %%edx\n"
		"\tjnz 1b\n"
		"\tjmp 6f\n"

		/*
		 *  sample < -0x8000
		 */

		"5:"
		"\tmovw $-0x8000, %%ax\n"
		"\tmovw %%ax, (%%edi)\n"
		"\tcmpl %%ecx, (%%ebx)\n"
		"\tjnz 3b\n"
		"\tadd %4, %%edi\n"
		"\tadd %5, %%esi\n"
		"\tadd %6, %%ebx\n"
		"\tdecl %%edx\n"
		"\tjnz 1b\n"
		// "\tjmp 6f\n"
		
		"6:"

		: /* no output regs */
		: "m" (size), "m" (dst), "m" (src), "m" (sum), "m" (dst_step), "m" (src_step), "m" (sum_step)
		: "esi", "edi", "edx", "ecx", "ebx", "eax"
	);
}


void mix_areas2(unsigned int size,
		volatile s16 *dst, s16 *src,
		volatile s32 *sum,
		unsigned int dst_step, unsigned int src_step)
{
	while (size-- > 0) {
		s32 sample = *src;
		if (cmpxchg(dst, 0, 1) == 0)
			sample -= *sum;
		atomic_add(sum, sample);
		do {
			sample = *sum;
			s16 s;
			if (unlikely(sample & 0xffff0000))
				s = sample > 0 ? 0x7fff : -0x8000;
			else
				s = sample;
			*dst = s;
		} while (unlikely(sample != *sum));
		sum++;
		dst += dst_step;
		src += src_step;
	}
}

int main(int argc, char **argv)
{
	int size = atoi(argv[1]);
	int n = atoi(argv[2]);
	int max = atoi(argv[3]);
	int i;
	unsigned long long begin, end;
	s16 *dst = malloc(sizeof(*dst) * size);
	s32 *sum = calloc(size, sizeof(*sum));
	s16 **srcs = malloc(sizeof(*srcs) * n);
	for (i = 0; i < n; i++) {
		int k;
		s16 *s;
		srcs[i] = s = malloc(sizeof(s16) * size);
		for (k = 0; k < size; ++k, ++s) {
			*s = (rand() % (max * 2)) - max;
		}
	}
	rdtscll(begin);
	for (i = 0; i < n; i++) {
		mix_areas0(size, dst, srcs[i], sum, 1, 1);
	}
	rdtscll(end);
	printf("mix_areas0: %lld\n", end - begin);
	rdtscll(begin);
	for (i = 0; i < n; i++) {
		mix_areas1(size, dst, srcs[i], sum, 1, 1, 1);
	}
	rdtscll(end);
	printf("mix_areas1: %lld\n", end - begin);
	rdtscll(begin);
	for (i = 0; i < n; i++) {
		mix_areas2(size, dst, srcs[i], sum, 1, 1);
	}
	rdtscll(end);
	printf("mix_areas2: %lld\n", end - begin);
}

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Re: dmix plugin
  2003-02-19 13:49             ` Abramo Bagnara
@ 2003-02-19 15:45               ` Jaroslaw Sobierski
  2003-02-19 20:39                 ` Abramo Bagnara
  2003-02-19 18:34               ` Jaroslav Kysela
  1 sibling, 1 reply; 41+ messages in thread
From: Jaroslaw Sobierski @ 2003-02-19 15:45 UTC (permalink / raw)
  To: Abramo Bagnara; +Cc: Jaroslav Kysela, alsa-devel@lists.sourceforge.net

Quoting Abramo Bagnara <abramo.bagnara@libero.it>:
> 
> The results are amazing and I'd say Jaroslav has made some mistakes in
> his handmade asm.
> 

This may be true, but I think you're trying to be a little too quick yourself.
Did you *test* your code? I only had time to take a short glance at it, but
to me it seems that this is not the correct check for overflow on signed
numbers:

>                       if (unlikely(sample & 0xffff0000))
>                                s = sample > 0 ? 0x7fff : -0x8000;
>                        else
>                                s = sample;

I noticed it because this is the first thought I had, but it only works
for unsigned. Notice that -1 will be 0xffffffff in a 32 bit sample. So
your code will "saturate" all negative samples to -0x8000 effectively
killing half of the wave, the way a diode does. I'm pretty sure this
would not sound good ;-). Still, even if you change this to two normal
ifs I assume the speed will not be affected by an order of magnitude.
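
To make the failure concrete, here is a minimal sketch that puts the mask
test next to a plain two-sided range check. The function names are made up
for illustration; only the mask expression and the ternary come from the
posted sum.c.

```c
#include <stdint.h>

/* The test from sum.c: any bit set in the upper half counts as overflow. */
static int overflows_by_mask(int32_t sample)
{
	return ((uint32_t)sample & 0xffff0000u) != 0;
}

/* Saturation as posted in mix_areas2, built on the mask test. */
static int16_t saturate_by_mask(int32_t sample)
{
	if (overflows_by_mask(sample))
		return sample > 0 ? 0x7fff : -0x8000;
	return (int16_t)sample;
}

/* The "two normal ifs" alternative: compare against both bounds. */
static int16_t saturate_by_range(int32_t sample)
{
	if (sample > 0x7fff)
		return 0x7fff;
	if (sample < -0x8000)
		return -0x8000;
	return (int16_t)sample;
}
```

With these definitions saturate_by_mask(-1) yields -0x8000, because the
upper half of 0xffffffff is all ones, while saturate_by_range(-1) leaves
the sample at -1.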

Secondly, the test code is hardly a good representation of our "working"
environment because we're expecting multiple processes to write
concurrently to the buffer. I think you should have a "verification"
procedure which carefully mixes the waves one by one; then the
n test mixes should be run in m processes concurrently, and the result
compared to the "verification" table.
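
A rough sketch of the "verification" half of that idea follows. It only
covers building the reference table and counting differences; the m
concurrent processes are left out. The names reference_mix and
count_mismatches are hypothetical, and clamping once after summing in
32 bits is an assumption that matches the sum-buffer approach discussed
in this thread.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Build the reference table: add every source in 32 bits, one by one,
 * and clamp to 16 bits only once at the end.
 */
static void reference_mix(size_t size, size_t n,
			  const int16_t *const *srcs, int16_t *out)
{
	for (size_t k = 0; k < size; k++) {
		int32_t total = 0;

		for (size_t i = 0; i < n; i++)
			total += srcs[i][k];
		if (total > INT16_MAX)
			total = INT16_MAX;
		else if (total < INT16_MIN)
			total = INT16_MIN;
		out[k] = (int16_t)total;
	}
}

/* Count positions where the mixer under test disagrees with the table. */
static size_t count_mismatches(const int16_t *got, const int16_t *want,
			       size_t size)
{
	size_t bad = 0;

	for (size_t k = 0; k < size; k++)
		if (got[k] != want[k])
			bad++;
	return bad;
}
```

A concurrent run would fill got[] from several writers at once and then
expect count_mismatches() to return 0.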

--------------
Fycio (J.Sobierski)
 fycio@gucio.com

-------------------------------------------------------
This SF.net email is sponsored by: SlickEdit Inc. Develop an edge.
The most comprehensive and flexible code editor you can use.
Code faster. C/C++, C#, Java, HTML, XML, many more. FREE 30-Day Trial.
www.slickedit.com/sourceforge

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Re: dmix plugin
  2003-02-19 13:49             ` Abramo Bagnara
  2003-02-19 15:45               ` Jaroslaw Sobierski
@ 2003-02-19 18:34               ` Jaroslav Kysela
  2003-02-19 21:24                 ` Jaroslav Kysela
                                   ` (3 more replies)
  1 sibling, 4 replies; 41+ messages in thread
From: Jaroslav Kysela @ 2003-02-19 18:34 UTC (permalink / raw)
  To: Abramo Bagnara; +Cc: Jaroslaw Sobierski, alsa-devel@lists.sourceforge.net

On Wed, 19 Feb 2003, Abramo Bagnara wrote:

> The results are amazing and I'd say Jaroslav has made some mistakes in
> his handmade asm.

I don't think so. It seems that my brain still remembers assembler ;-)
You passed the wrong step values to my code, so it did unaligned accesses.

Fixes to make the versions comparable:

--- sum.c	2003-02-19 18:55:20.000000000 +0100
+++ a.c	2003-02-19 19:31:00.000000000 +0100
@@ -11,6 +11,8 @@
 typedef short int s16;
 typedef int s32;
 
+#define CONFIG_SMP
+
 #ifdef CONFIG_SMP
 #define LOCK_PREFIX "lock ; "
 #else
@@ -54,7 +56,7 @@
 static inline void atomic_add(volatile int *dst, int v)
 {
 	__asm__ __volatile__(
-		LOCK_PREFIX "addl %0,%1"
+		LOCK_PREFIX "addl %1,%0"
 		:"=m" (*dst)
 		:"ir" (v));
 }
@@ -62,7 +64,9 @@
 void mix_areas0(unsigned int size,
 		volatile s16 *dst, s16 *src,
 		volatile s32 *sum,
-		unsigned int dst_step, unsigned int src_step)
+		unsigned int dst_step,
+		unsigned int src_step,
+		unsigned int sum_step)
 {
 	while (size-- > 0) {
 		s32 sample = *dst + *src;
@@ -70,8 +74,8 @@
 			*dst = sample > 0 ? 0x7fff : -0x8000;
 		else
 			*dst = sample;
-		dst += dst_step;
-		src += src_step;
+		((char *)dst) += dst_step;
+		((char *)src) += src_step;
 	}
 }
 
@@ -194,7 +198,9 @@
 void mix_areas2(unsigned int size,
 		volatile s16 *dst, s16 *src,
 		volatile s32 *sum,
-		unsigned int dst_step, unsigned int src_step)
+		unsigned int dst_step,
+		unsigned int src_step,
+		unsigned int sum_step)
 {
 	while (size-- > 0) {
 		s32 sample = *src;
@@ -204,15 +210,15 @@
 		do {
 			sample = *sum;
 			s16 s;
-			if (unlikely(sample & 0xffff0000))
+			if (unlikely(sample & 0x7fff0000))
 				s = sample > 0 ? 0x7fff : -0x8000;
 			else
 				s = sample;
 			*dst = s;
 		} while (unlikely(sample != *sum));
-		sum++;
-		dst += dst_step;
-		src += src_step;
+		((char *)sum) += sum_step;
+		((char *)dst) += dst_step;
+		((char *)src) += src_step;
 	}
 }
 
@@ -236,19 +242,19 @@
 	}
 	rdtscll(begin);
 	for (i = 0; i < n; i++) {
-		mix_areas0(size, dst, srcs[i], sum, 1, 1);
+		mix_areas0(size, dst, srcs[i], sum, 2, 2, 4);
 	}
 	rdtscll(end);
 	printf("mix_areas0: %lld\n", end - begin);
 	rdtscll(begin);
 	for (i = 0; i < n; i++) {
-		mix_areas1(size, dst, srcs[i], sum, 1, 1, 1);
+		mix_areas1(size, dst, srcs[i], sum, 2, 2, 4);
 	}
 	rdtscll(end);
 	printf("mix_areas1: %lld\n", end - begin);
 	rdtscll(begin);
 	for (i = 0; i < n; i++) {
-		mix_areas2(size, dst, srcs[i], sum, 1, 1);
+		mix_areas2(size, dst, srcs[i], sum, 2, 2, 4);
 	}
 	rdtscll(end);
 	printf("mix_areas2: %lld\n", end - begin);
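
The step arguments in this patch are byte counts (2 for the s16 buffers,
4 for the s32 sum buffer). The asm adds them straight to its pointers, so
the original 1, 1, 1 call moved every pointer by a single byte per
iteration. The patch advances the C pointers with a cast used as an
lvalue, which is a GCC extension that, as far as I know, later GCC
releases dropped. A sketch of the same byte-stride advance in plain C is
below; mix_strided is a hypothetical name and the loop body is the naive
mixer, used only to show the pointer arithmetic.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical strided mixer: both steps are byte counts, as in the patch. */
static void mix_strided(size_t size, int16_t *dst, const int16_t *src,
			size_t dst_step, size_t src_step)
{
	while (size-- > 0) {
		int32_t sample = *dst + *src;

		if (sample > INT16_MAX)
			sample = INT16_MAX;
		else if (sample < INT16_MIN)
			sample = INT16_MIN;
		*dst = (int16_t)sample;

		/* Advance by bytes without using a cast as an lvalue. */
		dst = (int16_t *)((char *)dst + dst_step);
		src = (const int16_t *)((const char *)src + src_step);
	}
}
```

Mixing a mono source into the left channel of an interleaved stereo
buffer would use dst_step = 2 * sizeof(int16_t) and
src_step = sizeof(int16_t).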

perex@pnote:~> cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 8
model name      : Pentium III (Coppermine)
stepping        : 6
cpu MHz         : 847.473
cache size      : 256 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov 
pat pse36 mmx fxsr sse
bogomips        : 1679.36

perex@pnote:~> ./a.out 2048 4 32267
mix_areas0: 170691
mix_areas1: 675795
mix_areas2: 708995


					Have fun,
						Jaroslav

-----
Jaroslav Kysela <perex@suse.cz>
Linux Kernel Sound Maintainer
ALSA Project, SuSE Labs




^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Re: dmix plugin
  2003-02-19 15:45               ` Jaroslaw Sobierski
@ 2003-02-19 20:39                 ` Abramo Bagnara
  0 siblings, 0 replies; 41+ messages in thread
From: Abramo Bagnara @ 2003-02-19 20:39 UTC (permalink / raw)
  To: Jaroslaw Sobierski; +Cc: Jaroslav Kysela, alsa-devel@lists.sourceforge.net

Jaroslaw Sobierski wrote:
> 
> Quoting Abramo Bagnara <abramo.bagnara@libero.it>:
> >
> > The results are amazing and I'd say Jaroslav has made some mistakes in
> > his handmade asm.
> >
> 
> This may be true, but I think you're trying to be a little too quick yourself.

No doubt about that; I was in a hurry.

> Did you *test* your code? I only had time to take a short glance at it, but
> to me it seems that this is not the correct check for overflow on signed
> numbers:
> 
> >                       if (unlikely(sample & 0xffff0000))
> >                                s = sample > 0 ? 0x7fff : -0x8000;
> >                        else
> >                                s = sample;
> 
> I noticed it because this is the first thought I had, but it only works
> for unsigned. Notice that -1 will be 0xffffffff in a 32 bit sample. So
> your code will "saturate" all negative samples to -0x8000 effectively
> killing half of the wave, the way a diode does. I'm pretty sure this
> would not sound good ;-). Still, even if you change this to two normal
> ifs I assume the speed will not be affected by an order of magnitude.
> 
> Secondly, the test code is hardly a good representation of our "working"
> environment because we're expecting multiple processes to write
> concurrently to the buffer. I think you should have a "verification"
> procedure which carefully mixes the waves one by one; then the
> n test mixes should be run in m processes concurrently, and the result
> compared to the "verification" table.

This is best tested on an SMP machine, and I don't have easy access to
one.

That aside, you're perfectly right, and this was exactly my intention.

-- 
Abramo Bagnara                       mailto:abramo.bagnara@libero.it

Opera Unica                          Phone: +39.546.656023
Via Emilia Interna, 140
48014 Castel Bolognese (RA) - Italy



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Re: dmix plugin
  2003-02-19 18:34               ` Jaroslav Kysela
@ 2003-02-19 21:24                 ` Jaroslav Kysela
  2003-02-20  8:28                 ` Abramo Bagnara
                                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 41+ messages in thread
From: Jaroslav Kysela @ 2003-02-19 21:24 UTC (permalink / raw)
  To: Abramo Bagnara; +Cc: Jaroslaw Sobierski, alsa-devel@lists.sourceforge.net

On Wed, 19 Feb 2003, Jaroslav Kysela wrote:

> perex@pnote:~> cat /proc/cpuinfo
> processor       : 0
> vendor_id       : GenuineIntel
> cpu family      : 6
> model           : 8
> model name      : Pentium III (Coppermine)
> stepping        : 6
> cpu MHz         : 847.473
> cache size      : 256 KB
> fdiv_bug        : no
> hlt_bug         : no
> f00f_bug        : no
> coma_bug        : no
> fpu             : yes
> fpu_exception   : yes
> cpuid level     : 2
> wp              : yes
> flags           : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov 
> pat pse36 mmx fxsr sse
> bogomips        : 1679.36
> 
> perex@pnote:~> ./a.out 2048 4 32267
> mix_areas0: 170691
> mix_areas1: 675795
> mix_areas2: 708995

More results (with MMX code):

perex@pnote:~/alsa/alsa-lib/test> ./code 2048 4 32767
mix_areas0    : 172345
mix_areas1    : 677021
mix_areas1_mmx: 620597
mix_areas2    : 702227

Note - the test utility is in CVS - alsa-lib/test/code.c - now.

						Jaroslav

-----
Jaroslav Kysela <perex@suse.cz>
Linux Kernel Sound Maintainer
ALSA Project, SuSE Labs




^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Re: dmix plugin
  2003-02-19 18:34               ` Jaroslav Kysela
  2003-02-19 21:24                 ` Jaroslav Kysela
@ 2003-02-20  8:28                 ` Abramo Bagnara
  2003-02-20  8:30                 ` Jaroslaw Sobierski
  2003-02-20  8:53                 ` Re: dmix plugin Abramo Bagnara
  3 siblings, 0 replies; 41+ messages in thread
From: Abramo Bagnara @ 2003-02-20  8:28 UTC (permalink / raw)
  To: Jaroslav Kysela; +Cc: Jaroslaw Sobierski, alsa-devel@lists.sourceforge.net

Jaroslav Kysela wrote:
> 
> On Wed, 19 Feb 2003, Abramo Bagnara wrote:
> 
> > The results are amazing and I'd say Jaroslav has made some mistakes in
> > his handmade asm.
> 
> I don't think so. It seems that my brain still remembers assembler ;-)

I have no doubt about that ;-)

> You passed the wrong step values to my code, so it did unaligned accesses.

I guessed as much, but I was too lazy to analyze your asm in depth.

> Fixes to make the versions comparable:

>                 volatile s32 *sum,
> -               unsigned int dst_step, unsigned int src_step)
> +               unsigned int dst_step,
> +               unsigned int src_step,
> +               unsigned int sum_step)

sum_step is useless; I deliberately removed it.
Please remove it from your code as well.

> +               ((char *)dst) += dst_step;
> +               ((char *)src) += src_step;

IMHO it's a sane assumption that the step is a multiple of the sample
size.
However, this should not have any impact on efficiency (at least I
believe so).

> -                       if (unlikely(sample & 0xffff0000))
> +                       if (unlikely(sample & 0x7fff0000))

As Jaroslaw has written, this is a mistake, and I've verified that the
right version has no speed benefits.

-- 
Abramo Bagnara                       mailto:abramo.bagnara@libero.it

Opera Unica                          Phone: +39.546.656023
Via Emilia Interna, 140
48014 Castel Bolognese (RA) - Italy



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Re: dmix plugin
  2003-02-19 18:34               ` Jaroslav Kysela
  2003-02-19 21:24                 ` Jaroslav Kysela
  2003-02-20  8:28                 ` Abramo Bagnara
@ 2003-02-20  8:30                 ` Jaroslaw Sobierski
  2003-02-20  8:48                   ` Abramo Bagnara
  2003-02-20  9:17                   ` Echoaudio drivers Giuliano Pochini
  2003-02-20  8:53                 ` Re: dmix plugin Abramo Bagnara
  3 siblings, 2 replies; 41+ messages in thread
From: Jaroslaw Sobierski @ 2003-02-20  8:30 UTC (permalink / raw)
  To: Jaroslav Kysela; +Cc: Abramo Bagnara, alsa-devel@lists.sourceforge.net

Quoting Jaroslav Kysela <perex@suse.cz>:

> I don't think so. It seems that my brain still remembers assembler ;-)
...
>  			sample = *sum;
>  			s16 s;
> -			if (unlikely(sample & 0xffff0000))
> +			if (unlikely(sample & 0x7fff0000))
>  				s = sample > 0 ? 0x7fff : -0x8000;
>  			else
>  				s = sample;

I think I remember some x86 assembly myself, and this correction
does not fix the problem. This code will still "saturate" all negative
samples to -0x8000. You cannot detect an overflow into the upper half of
the register with a simple bitwise AND. The actual test should be as
follows:
- sign-extend the lower half
- check whether the upper half matches the result of the extension
 if it does - there is no overflow
 if it differs - there was overflow and you need to saturate.
examples:
value 0x 0000 0335
ext   0x 0000 0335
  -> no overflow

value 0x 0002 43b1
ext   0x 0000 43b1
  -> overflow

value 0x ffff f25b
ext   0x ffff f25b
  -> no overflow

value 0x ff1c 35c9
ext   0x 0000 35c9
  -> overflow

to put it in asm:

mov ebx,eax
cwde
cmp eax,ebx

The problem is that cwde operates only on ax/eax.
This may sound complicated, but in fact it amounts to a very simple
question: does the sample fit in a 16-bit int or not? So
I guess in C it could look something like:

    s16 s=sample;
    if (unlikely(sample != (s32)s))

The cast is just there for clarity; I believe it would be done
implicitly anyway. But don't take my word for it - I did not
test this.
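
For what it's worth, here is that round-trip check wrapped into small
functions and run against the four examples above. The names fits_s16
and saturate_fit are made up for this sketch. One caveat: converting an
out-of-range value to a 16-bit type is implementation-defined in C, so
this assumes the usual two's-complement truncation that GCC performs.

```c
#include <stdint.h>

/*
 * Nonzero when sample survives a round trip through 16 bits.
 * Assumes the narrowing conversion truncates (implementation-defined
 * in C, but this is what GCC and Clang do).
 */
static int fits_s16(int32_t sample)
{
	int16_t s = (int16_t)sample;

	return sample == (int32_t)s;
}

/* Saturate using the round-trip test instead of a bit mask. */
static int16_t saturate_fit(int32_t sample)
{
	if (!fits_s16(sample))
		return sample > 0 ? INT16_MAX : INT16_MIN;
	return (int16_t)sample;
}
```

In decimal, the four examples are 821 (0x00000335), 148401 (0x000243b1),
-3493 (0xfffff25b) and -14928439 (0xff1c35c9); the first and third fit,
the second and fourth do not.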

--------------
Fycio (J.Sobierski)
 fycio@gucio.com



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Re: dmix plugin
  2003-02-20  8:30                 ` Jaroslaw Sobierski
@ 2003-02-20  8:48                   ` Abramo Bagnara
  2003-02-20  9:17                   ` Echoaudio drivers Giuliano Pochini
  1 sibling, 0 replies; 41+ messages in thread
From: Abramo Bagnara @ 2003-02-20  8:48 UTC (permalink / raw)
  To: Jaroslaw Sobierski; +Cc: Jaroslav Kysela, alsa-devel

Jaroslaw Sobierski wrote:
> 

> 
>     s16 s=sample;
>     if (unlikely(sample != (s32)s))
> 

I checked exactly this yesterday evening, but it's less efficient
than an ordinary boundary check.

-- 
Abramo Bagnara                       mailto:abramo.bagnara@libero.it

Opera Unica                          Phone: +39.546.656023
Via Emilia Interna, 140
48014 Castel Bolognese (RA) - Italy



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Re: dmix plugin
  2003-02-19 18:34               ` Jaroslav Kysela
                                   ` (2 preceding siblings ...)
  2003-02-20  8:30                 ` Jaroslaw Sobierski
@ 2003-02-20  8:53                 ` Abramo Bagnara
  2003-02-20 16:49                   ` Jaroslav Kysela
  3 siblings, 1 reply; 41+ messages in thread
From: Abramo Bagnara @ 2003-02-20  8:53 UTC (permalink / raw)
  To: Jaroslav Kysela; +Cc: Jaroslaw Sobierski, alsa-devel@lists.sourceforge.net

[-- Attachment #1: Type: text/plain, Size: 893 bytes --]

Jaroslav Kysela wrote:
> 
> On Wed, 19 Feb 2003, Abramo Bagnara wrote:
> 
> > The results are amazing and I'd say Jaroslav has made some mistakes in
> > his handmade asm.
> 
> I don't think so. It seems that my brain still remembers assembler ;-)
> You passed the wrong step values to my code, so it did unaligned accesses.
> 
> Fixes to make the versions comparable:

I've made the needed changes in my version of sum.c to get correct
results from the asm version, but I still can't get good performance
numbers out of it.

I'm puzzled...

$ ./sum 2048 8 32768
CPU clock: 1460474444.671998
mix_areas0: 90773 0.033459%
mix_areas1: 141173 0.052036% (1103)
mix_areas2: 870134 0.320731% (0)
mix_areas3: 343792 0.126722% (0)
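
The percentage column comes from the printf expression in the attached
sum.c: ticks per mixed sample, times 2 * 44100 samples per second, divided
by the measured CPU clock. Reading 2 * 44100 as a 44.1 kHz stereo stream
is an interpretation, not something the code states. A small sketch of
the same arithmetic, with cpu_percent as a made-up helper name:

```c
/*
 * Share of one CPU (in percent) needed to mix in real time.
 * ticks:           TSC ticks measured for n passes over size samples
 * samples_per_sec: samples consumed per second (sum.c uses 2 * 44100)
 * cpu_hz:          measured TSC frequency
 */
static double cpu_percent(unsigned long long ticks, unsigned int size,
			  unsigned int n, double samples_per_sec,
			  double cpu_hz)
{
	double ticks_per_sample = (double)ticks / ((double)size * n);

	return 100.0 * ticks_per_sample * samples_per_sec / cpu_hz;
}
```

Plugging in the first line of the run above (90773 ticks, size 2048,
n 8, clock 1460474444.671998) gives about 0.0335, which agrees with the
0.033459% that was printed.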


-- 
Abramo Bagnara                       mailto:abramo.bagnara@libero.it

Opera Unica                          Phone: +39.546.656023
Via Emilia Interna, 140
48014 Castel Bolognese (RA) - Italy

[-- Attachment #2: sum.c --]
[-- Type: text/plain, Size: 7213 bytes --]

#include <stdlib.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/time.h>

#define rdtscll(val) \
     __asm__ __volatile__("rdtsc" : "=A" (val))

#define likely(x)       __builtin_expect((x),1)
#define unlikely(x)     __builtin_expect((x),0)

typedef short int s16;
typedef int s32;

#ifdef CONFIG_SMP
#define LOCK_PREFIX "lock ; "
#else
#define LOCK_PREFIX ""
#endif

struct __xchg_dummy { unsigned long a[100]; };
#define __xg(x) ((struct __xchg_dummy *)(x))

static inline unsigned long __cmpxchg(volatile void *ptr, unsigned long old,
				      unsigned long new, int size)
{
	unsigned long prev;
	switch (size) {
	case 1:
		__asm__ __volatile__(LOCK_PREFIX "cmpxchgb %b1,%2"
				     : "=a"(prev)
				     : "q"(new), "m"(*__xg(ptr)), "0"(old)
				     : "memory");
		return prev;
	case 2:
		__asm__ __volatile__(LOCK_PREFIX "cmpxchgw %w1,%2"
				     : "=a"(prev)
				     : "q"(new), "m"(*__xg(ptr)), "0"(old)
				     : "memory");
		return prev;
	case 4:
		__asm__ __volatile__(LOCK_PREFIX "cmpxchgl %1,%2"
				     : "=a"(prev)
				     : "q"(new), "m"(*__xg(ptr)), "0"(old)
				     : "memory");
		return prev;
	}
	return old;
}

#define cmpxchg(ptr,o,n)\
	((__typeof__(*(ptr)))__cmpxchg((ptr),(unsigned long)(o),\
					(unsigned long)(n),sizeof(*(ptr))))

static inline void atomic_add(volatile int *dst, int v)
{
	__asm__ __volatile__(
		LOCK_PREFIX "addl %1,%0"
		:"=m" (*dst)
		:"ir" (v), "m" (*dst));
}


static double
detect_cpu_clock()
{
        struct timeval tm_begin, tm_end;
        unsigned long long tsc_begin, tsc_end;

        /* Warm cache */
        gettimeofday(&tm_begin, 0);

        rdtscll(tsc_begin);
        gettimeofday(&tm_begin, 0);

        usleep(1000000);

        rdtscll(tsc_end);
        gettimeofday(&tm_end, 0);

        return (tsc_end - tsc_begin) / (tm_end.tv_sec - tm_begin.tv_sec + (tm_end.tv_usec - tm_begin.tv_usec) / 1e6);
}

void mix_areas0(unsigned int size,
		const s16 *src,
		volatile s32 *sum,
		unsigned int src_step)
{
	while (size-- > 0) {
		atomic_add(sum, *src);
		(char*)src += src_step;
		sum++;
	}
}

void saturate(unsigned int size,
	      s16 *dst, const s32 *sum,
	      unsigned int dst_step)
{
	while (size-- > 0) {
		s32 sample = *sum;
		if (unlikely(sample < -0x8000))
			*dst = -0x8000;
		else if (unlikely(sample > 0x7fff))
			*dst = 0x7fff;
		else
			*dst = sample;
		(char*)dst += dst_step;
		sum++;
	}
}

void mix_areas1(unsigned int size,
		volatile s16 *dst, const s16 *src,
		unsigned int dst_step, unsigned int src_step)
{
	while (size-- > 0) {
		s32 sample = *dst + *src;
		if (unlikely(sample < -0x8000))
			*dst = -0x8000;
		else if (unlikely(sample > 0x7fff))
			*dst = 0x7fff;
		else
			*dst = sample;
		(char*)dst += dst_step;
		(char*)src += src_step;
	}
}

void mix_areas2(unsigned int size,
		volatile s16 *dst, const s16 *src,
		volatile s32 *sum, unsigned int dst_step,
		unsigned int src_step, unsigned int sum_step)
{
	/*
	 *  ESI - src
	 *  EDI - dst
	 *  EBX - sum
	 *  ECX - old sample
	 *  EAX - sample / temporary
	 *  EDX - size
	 */
	__asm__ __volatile__ (
		"\n"

		/*
		 *  initialization, load EDX, ESI, EDI, EBX registers
		 */
		"\tmovl %0, %%edx\n"
		"\tmovl %1, %%edi\n"
		"\tmovl %2, %%esi\n"
		"\tmovl %3, %%ebx\n"

		/*
		 * while (size-- > 0) {
		 */
		"\tcmp $0, %%edx\n"
		"jz 6f\n"

		"1:"

		/*
		 *   sample = *src;
		 *   if (cmpxchg(*dst, 0, 1) == 0)
		 *     sample -= *sum;
		 *   xadd(*sum, sample);
		 */
		"\tmovw $0, %%ax\n"
		"\tmovw $1, %%cx\n"
		"\tlock; cmpxchgw %%cx, (%%edi)\n"
		"\tmovswl (%%esi), %%ecx\n"
		"\tjnz 2f\n"
		"\tsubl (%%ebx), %%ecx\n"
		"2:"
		"\tlock; addl %%ecx, (%%ebx)\n"

		/*
		 *   do {
		 *     sample = old_sample = *sum;
		 *     saturate(v);
		 *     *dst = sample;
		 *   } while (v != *sum);
		 */

		"3:"
		"\tmovl (%%ebx), %%ecx\n"
		"\tcmpl $0x7fff,%%ecx\n"
		"\tjg 4f\n"
		"\tcmpl $-0x8000,%%ecx\n"
		"\tjl 5f\n"
		"\tmovw %%cx, (%%edi)\n"
		"\tcmpl %%ecx, (%%ebx)\n"
		"\tjnz 3b\n"

		/*
		 * while (size-- > 0)
		 */
		"\tadd %4, %%edi\n"
		"\tadd %5, %%esi\n"
		"\tadd %6, %%ebx\n"
		"\tdecl %%edx\n"
		"\tjnz 1b\n"
		"\tjmp 6f\n"

		/*
		 *  sample > 0x7fff
		 */

		"4:"
		"\tmovw $0x7fff, %%ax\n"
		"\tmovw %%ax, (%%edi)\n"
		"\tcmpl %%ecx,(%%ebx)\n"
		"\tjnz 3b\n"
		"\tadd %4, %%edi\n"
		"\tadd %5, %%esi\n"
		"\tadd %6, %%ebx\n"
		"\tdecl %%edx\n"
		"\tjnz 1b\n"
		"\tjmp 6f\n"

		/*
		 *  sample < -0x8000
		 */

		"5:"
		"\tmovw $-0x8000, %%ax\n"
		"\tmovw %%ax, (%%edi)\n"
		"\tcmpl %%ecx, (%%ebx)\n"
		"\tjnz 3b\n"
		"\tadd %4, %%edi\n"
		"\tadd %5, %%esi\n"
		"\tadd %6, %%ebx\n"
		"\tdecl %%edx\n"
		"\tjnz 1b\n"
		// "\tjmp 6f\n"
		
		"6:"

		: /* no output regs */
		: "m" (size), "m" (dst), "m" (src), "m" (sum), "m" (dst_step), "m" (src_step), "m" (sum_step)
		: "esi", "edi", "edx", "ecx", "ebx", "eax"
	);
}


void mix_areas3(unsigned int size,
		volatile s16 *dst, const s16 *src,
		volatile s32 *sum,
		unsigned int dst_step, unsigned int src_step)
{
	while (size-- > 0) {
		s32 sample = *src;
		if (cmpxchg(dst, 0, 1) == 0)
			sample -= *sum;
		atomic_add(sum, sample);
		do {
			sample = *sum;
			if (unlikely(sample < -0x8000))
				*dst = -0x8000;
			else if (unlikely(sample > 0x7fff))
				*dst = 0x7fff;
			else
				*dst = sample;
		} while (unlikely(sample != *sum));
		sum++;
		(char*)dst += dst_step;
		(char*)src += src_step;
	}
}

int compare(const s16* b1, const s16 *b2, unsigned int size)
{
	unsigned int c = 0;
	while (size-- > 0) {
		if (*b1 != *b2)
			c++;
		b1++;
		b2++;
	}
	return c;
}

int main(int argc, char **argv)
{
	int size = atoi(argv[1]);
	int n = atoi(argv[2]);
	int max = atoi(argv[3]);
	int i;
	unsigned long long begin, end;
	s16 *dst = malloc(sizeof(*dst) * size);
	s16 *check = malloc(sizeof(*check) * size);
	s32 *sum = malloc(sizeof(*sum) * size);
	s16 **srcs = malloc(sizeof(*srcs) * n);
	double cpu_clock = detect_cpu_clock();
	printf("CPU clock: %f\n", cpu_clock);
	for (i = 0; i < n; i++) {
		int k;
		s16 *s;
		srcs[i] = s = malloc(sizeof(s16) * size);
		for (k = 0; k < size; ++k, ++s) {
			*s = (rand() % (max * 2)) - max;
		}
	}

	memset(sum, 0, sizeof(*sum) * size);
	rdtscll(begin);
	for (i = 0; i < n; i++) {
		mix_areas0(size, srcs[i], sum, 2);
	}
	saturate(size, check, sum, 2);
	rdtscll(end);
	printf("mix_areas0: %lld %f%%\n", end - begin, 100*2*44100.0*(end - begin)/(size*n*cpu_clock));

	memset(dst, 0, sizeof(*dst) * size);
	rdtscll(begin);
	for (i = 0; i < n; i++) {
		mix_areas1(size, dst, srcs[i], 2, 2);
	}
	rdtscll(end);
	printf("mix_areas1: %lld %f%% (%d)\n", end - begin, 100*2*44100.0*(end - begin)/(size*n*cpu_clock), compare(dst, check, size));

	memset(sum, 0, sizeof(*sum) * size);
	rdtscll(begin);
	for (i = 0; i < n; i++) {
		mix_areas2(size, dst, srcs[i], sum, 2, 2, 4);
	}
	rdtscll(end);
	printf("mix_areas2: %lld %f%% (%d)\n", end - begin, 100*2*44100.0*(end - begin)/(size*n*cpu_clock), compare(dst, check, size));

	memset(sum, 0, sizeof(*sum) * size);
	rdtscll(begin);
	for (i = 0; i < n; i++) {
		mix_areas3(size, dst, srcs[i], sum, 2, 2);
	}
	rdtscll(end);
	printf("mix_areas3: %lld %f%% (%d)\n", end - begin, 100*2*44100.0*(end - begin)/(size*n*cpu_clock), compare(dst, check, size));
	return 0;
}

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Echoaudio drivers
  2003-02-20  8:30                 ` Jaroslaw Sobierski
  2003-02-20  8:48                   ` Abramo Bagnara
@ 2003-02-20  9:17                   ` Giuliano Pochini
  2003-02-20 14:37                     ` David Olofson
  1 sibling, 1 reply; 41+ messages in thread
From: Giuliano Pochini @ 2003-02-20  9:17 UTC (permalink / raw)
  To: alsa-devel


Is anyone writing drivers for Echoaudio cards? Perhaps
I'll buy one soon, and I can try to write drivers if nobody
else is working on it.

Bye.




^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Echoaudio drivers
  2003-02-20  9:17                   ` Echoaudio drivers Giuliano Pochini
@ 2003-02-20 14:37                     ` David Olofson
  2003-02-20 15:40                       ` Giuliano Pochini
  0 siblings, 1 reply; 41+ messages in thread
From: David Olofson @ 2003-02-20 14:37 UTC (permalink / raw)
  To: alsa-devel

On Thursday 20 February 2003 10.17, Giuliano Pochini wrote:
> Is someone writing drivers for Echoaudio cards ?  Perhaps
> I'll buy one soon and I can try to write drivers if nobody
> else is working on it.

I have an old Layla20 and intend to write a driver for it. It 
shouldn't be too much work to get the other cards working I think, 
but I can't test on anything but Layla20 myself.

Anyway, I'm short on hacking time these days, and I have some other 
projects I need to deal with first.


//David Olofson - Programmer, Composer, Open Source Advocate

.- The Return of Audiality! --------------------------------.
| Free/Open Source Audio Engine for use in Games or Studio. |
| RT and off-line synth. Scripting. Sample accurate timing. |
`---------------------------> http://olofson.net/audiality -'
   --- http://olofson.net --- http://www.reologica.se ---




^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Echoaudio drivers
  2003-02-20 14:37                     ` David Olofson
@ 2003-02-20 15:40                       ` Giuliano Pochini
  2003-02-20 16:03                         ` David Olofson
  0 siblings, 1 reply; 41+ messages in thread
From: Giuliano Pochini @ 2003-02-20 15:40 UTC (permalink / raw)
  To: David Olofson; +Cc: alsa-devel


On 20-Feb-2003 David Olofson wrote:
> On Thursday 20 February 2003 10.17, Giuliano Pochini wrote:
>> Is someone writing drivers for Echoaudio cards ?  Perhaps
>> I'll buy one soon and I can try to write drivers if nobody
>> else is working on it.
>
> I have an old Layla20 and intend to write a driver for it. It
> shouldn't be too much work to get the other cards working I think,
> but I can't test on anything but Layla20 myself.

According to official docs, 20 and 24bit versions have only a
different DAC. The driver always sends data in the same format.
I've not looked at the sources yet.


Bye.




^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Echoaudio drivers
  2003-02-20 15:40                       ` Giuliano Pochini
@ 2003-02-20 16:03                         ` David Olofson
  0 siblings, 0 replies; 41+ messages in thread
From: David Olofson @ 2003-02-20 16:03 UTC (permalink / raw)
  To: alsa-devel

On Thursday 20 February 2003 16.40, Giuliano Pochini wrote:
[...]
> According to official docs, 20 and 24bit versions have only a
> different DAC. The driver always sends data in the same format.
> I've not looked at the sources yet.

Yes, that seems to be the case. All models use 24 bit signal paths 
internally, and they seem to use the same multichannel DMA engine and 
stuff as well. There's specific firmware for pretty much every model 
in their driver, but on the host side, it seems like it's mostly 
about configurations and feature sets.


//David Olofson - Programmer, Composer, Open Source Advocate

.- The Return of Audiality! --------------------------------.
| Free/Open Source Audio Engine for use in Games or Studio. |
| RT and off-line synth. Scripting. Sample accurate timing. |
`---------------------------> http://olofson.net/audiality -'
   --- http://olofson.net --- http://www.reologica.se ---




^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Re: dmix plugin
  2003-02-20  8:53                 ` Re: dmix plugin Abramo Bagnara
@ 2003-02-20 16:49                   ` Jaroslav Kysela
  2003-02-20 17:57                     ` Abramo Bagnara
  0 siblings, 1 reply; 41+ messages in thread
From: Jaroslav Kysela @ 2003-02-20 16:49 UTC (permalink / raw)
  To: Abramo Bagnara; +Cc: Jaroslaw Sobierski, alsa-devel@lists.sourceforge.net

On Thu, 20 Feb 2003, Abramo Bagnara wrote:

> Jaroslav Kysela wrote:
> > 
> > On Wed, 19 Feb 2003, Abramo Bagnara wrote:
> > 
> > > The results are amazing and I'd say Jaroslav has done some mistakes in
> > > his handmade asm.
> > 
> > I don't think so. It seems that my brain still remembers assembler ;-)
> > You passed wrong values to my code so it did unaligned accesses.
> > 
> > Fixes to make things same:
> 
> I've done the needed changes in my version of sum.c to get correct
> results from asm version, but I'm still unable to get from it good
> performance numbers.
> 
> I'm puzzled...
> 
> $ ./sum 2048 8 32768
> CPU clock: 1460474444.671998
> mix_areas0: 90773 0.033459%
> mix_areas1: 141173 0.052036% (1103)
> mix_areas2: 870134 0.320731% (0)
> mix_areas3: 343792 0.126722% (0)

1) my asm code uses the lock prefix, so the timings differ hugely between
   UP and MP on i386
2) we need to clear the dst and sum buffers so that all routines work with
   the same values
3) we need to clear the CPU caches

I've committed an updated alsa-lib/test/code.c which solves all these
troubles, and I've added further optimizations to my asm routine. The results
(not impressive, but I beat GCC, especially when using the MMX saturation
instruction):

pnote:/home/perex/alsa/alsa-lib/test # ./code 2048 8 32768
Scheduler set to Round Robin with priority 99...
CPU clock: 847.293134Mhz (UP)

Summary (the best times):
mix_areas0    : 548456
mix_areas1    : 863636
mix_areas1_mmx: 629765
mix_areas2    : 910819

pnote:/home/perex/alsa/alsa-lib/test # ./code 2048 8 32768
Scheduler set to Round Robin with priority 99...
CPU clock: 847.293395Mhz (SMP)

Summary (the best times):
mix_areas0    : 562342
mix_areas1    : 1705274
mix_areas1_mmx: 1565539
mix_areas2    : 1735491

						Jaroslav

-----
Jaroslav Kysela <perex@suse.cz>
Linux Kernel Sound Maintainer
ALSA Project, SuSE Labs




^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Re: dmix plugin
  2003-02-20 16:49                   ` Jaroslav Kysela
@ 2003-02-20 17:57                     ` Abramo Bagnara
  2003-02-20 18:26                       ` Paul Davis
  2003-02-20 19:55                       ` Jaroslav Kysela
  0 siblings, 2 replies; 41+ messages in thread
From: Abramo Bagnara @ 2003-02-20 17:57 UTC (permalink / raw)
  To: Jaroslav Kysela; +Cc: Jaroslaw Sobierski, alsa-devel@lists.sourceforge.net

Jaroslav Kysela wrote:
> 
> On Thu, 20 Feb 2003, Abramo Bagnara wrote:
> 
> > Jaroslav Kysela wrote:
> > >
> > > On Wed, 19 Feb 2003, Abramo Bagnara wrote:
> > >
> > > > The results are amazing and I'd say Jaroslav has done some mistakes in
> > > > his handmade asm.
> > >
> > > I don't think so. It seems that my brain still remembers assembler ;-)
> > > You passed wrong values to my code so it did unaligned accesses.
> > >
> > > Fixes to make things same:
> >
> > I've done the needed changes in my version of sum.c to get correct
> > results from asm version, but I'm still unable to get from it good
> > performance numbers.
> >
> > I'm puzzled...
> >
> > $ ./sum 2048 8 32768
> > CPU clock: 1460474444.671998
> > mix_areas0: 90773 0.033459%
> > mix_areas1: 141173 0.052036% (1103)
> > mix_areas2: 870134 0.320731% (0)
> > mix_areas3: 343792 0.126722% (0)
> 
> 1) my asm code used lock prefix so there are huge differences in code for
>    UP and MP on i386

Indeed, this made the difference.

> 2) we need to clear dst and sum buffers to work with same values for all
>    routines

This was present in sum.c

> 3) we need to clear the CPU caches

This has irrelevant impact in sum.c.

> I've committed updated alsa-lib/test/code.c which solves all these troubles
> and I've added next optimizations to my asm routine and results are (not
> impressive, but I'm better than GCC, especially using MMX
> saturation instruction):

Now I'm able to get the same results you see.

However I think that we need to extract some results from this data.

I'll leave alone MMX optimizations because I want to compare apples with
apples.

The distributed saturation (also when it's missing the check/repeat
concurrency correctness part) costs more than 4 times the ticks needed
for a (fully correct wrt concurrency) saturate once approach for the
case 2048 8 32768.

CPU clock: 1460477150.884593
mix_areas0: 86747 0.031975%
mix_areas1: 259424 0.095623% (0)
mix_areas1_mmx: 253894 0.093585% (0)
mix_areas2: 132321 0.048773% (365)
mix_areas3: 332411 0.122526% (0)

The server-based approach has the added cost of an extra context switch
every period (about 1500 cycles on my machine, for example), but this is
fully amortized by such a huge difference.

What's your opinion?

-- 
Abramo Bagnara                       mailto:abramo.bagnara@libero.it

Opera Unica                          Phone: +39.546.656023
Via Emilia Interna, 140
48014 Castel Bolognese (RA) - Italy



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Re: dmix plugin
  2003-02-20 17:57                     ` Abramo Bagnara
@ 2003-02-20 18:26                       ` Paul Davis
  2003-02-20 19:23                         ` unterminated conditionals: @HAVE_JACK_TRUE@ tomasz motylewski
  2003-02-20 22:14                         ` Re: dmix plugin Abramo Bagnara
  2003-02-20 19:55                       ` Jaroslav Kysela
  1 sibling, 2 replies; 41+ messages in thread
From: Paul Davis @ 2003-02-20 18:26 UTC (permalink / raw)
  To: alsa-devel@lists.sourceforge.net

>The server based approach has an added cost of an extra context switch
>every period (about 1500 cycles on my machine i.e.), but this is fully
>amortized by such an huge difference.

recall that (1) the context switch time is not a fixed cost but
depends on the memory behaviour between switches and (2) isn't it
either two switches per participating client/application, or if they
are chained (as in JACK), N+2 switches, where N is the number of
clients/applications ?




^ permalink raw reply	[flat|nested] 41+ messages in thread

* unterminated conditionals: @HAVE_JACK_TRUE@
  2003-02-20 18:26                       ` Paul Davis
@ 2003-02-20 19:23                         ` tomasz motylewski
  2003-02-20 19:57                           ` Jaroslav Kysela
  2003-02-20 22:14                         ` Re: dmix plugin Abramo Bagnara
  1 sibling, 1 reply; 41+ messages in thread
From: tomasz motylewski @ 2003-02-20 19:23 UTC (permalink / raw)
  To: alsa-devel@lists.sourceforge.net


Debian woody, current cvs:

./build  prep
Pre-configuring alsa-driver
make: Nothing to be done for `all-deps'.
Pre-configuring alsa-lib
src/pcm/Makefile.am:6: JACK_PLUGIN multiply defined in condition
automake: src/pcm/Makefile.am: unterminated conditionals: @HAVE_JACK_TRUE@
src/pcm/Makefile.am:9: warning: automake does not support conditional
definition of JACK_PLUGIN in libpcm_la_SOURCES

Then after ./build config I get in alsa-lib/src/pcm/Makefile

@HAVE_JACK_TRUE@else !HAVE_JACK
@HAVE_JACK_TRUE@endif !HAVE_JACK

@HAVE_JACK_TRUE@all: libpcm.la



Best regards,
--
Tomasz Motylewski





^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Re: dmix plugin
  2003-02-20 17:57                     ` Abramo Bagnara
  2003-02-20 18:26                       ` Paul Davis
@ 2003-02-20 19:55                       ` Jaroslav Kysela
  2003-02-20 21:19                         ` tomasz motylewski
                                           ` (2 more replies)
  1 sibling, 3 replies; 41+ messages in thread
From: Jaroslav Kysela @ 2003-02-20 19:55 UTC (permalink / raw)
  To: Abramo Bagnara; +Cc: Jaroslaw Sobierski, alsa-devel@lists.sourceforge.net

On Thu, 20 Feb 2003, Abramo Bagnara wrote:

> Now I'm able to get the same results you see.
> 
> However I think that we need to extract some results from this data.
> 
> I'll leave alone MMX optimizations because I want to compare apples with
> apples.
> 
> The distributed saturation (also when it's missing the check/repeat
> concurrency correctness part) costs more than 4 times the ticks needed
> for a (fully correct wrt concurrency) saturate once approach for the
> case 2048 8 32768.
> 
> CPU clock: 1460477150.884593
> mix_areas0: 86747 0.031975%
> mix_areas1: 259424 0.095623% (0)
> mix_areas1_mmx: 253894 0.093585% (0)
> mix_areas2: 132321 0.048773% (365)
> mix_areas3: 332411 0.122526% (0)
> 
> The server based approach has an added cost of an extra context switch
> every period (about 1500 cycles on my machine i.e.), but this is fully
> amortized by such an huge difference.
> 
> What's your opinion?

Interestingly, my Intel P3 CPU gives slightly different times:

pnote:/home/perex/alsa/alsa-lib/test # ./code 2048 8 32768
Scheduler set to Round Robin with priority 99...
CPU clock: 847.292487Mhz (UP)

Summary (the best times):
mix_areas_srv : 576382 0.366206%
mix_areas0    : 556852 0.353798%
mix_areas1    : 867989 0.551480%
mix_areas1_mmx: 625144 0.397187%
mix_areas2    : 903335 0.573937%

areas1/srv ratio     : 1.505927
areas1_mmx/srv ratio : 1.084600
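
As a cross-check of the percentages above, here is a minimal sketch of the
load formula taken from the printf calls in sum.c earlier in this thread.
The helper name cpu_load_percent is made up for illustration, and reading
the constant 2*44100 as stereo playback at 44.1 kHz is an assumption.

```c
#include <assert.h>

/*
 * Hypothetical helper (not part of sum.c or code.c): converts a tick
 * count into the percentage printed by the benchmark, using the same
 * expression as the printf calls in sum.c:
 *   100 * 2 * 44100.0 * ticks / (size * n * cpu_clock)
 */
static double cpu_load_percent(long long ticks, int size, int n, double cpu_hz)
{
	assert(size > 0 && n > 0 && cpu_hz > 0.0);
	return 100.0 * 2.0 * 44100.0 * (double)ticks /
	       ((double)size * n * cpu_hz);
}
```

With the figures above, cpu_load_percent(576382, 2048, 8, 847292487.0)
comes out at roughly 0.3662, in line with the mix_areas_srv row, and
867989.0 / 576382 gives the posted areas1/srv ratio of about 1.5059.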

I think that we can lose more in the client/server model. Also, note that
we could even use futexes (if the possible context switch is acceptable);
then we could drop the cmpxchg trick and the write-retry trick and use MMX
to saturate two samples in parallel (the MMX part can be used in the
client/server model too).
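
To make the write-retry idea concrete, here is a rough sketch using C11
atomics. It is not the asm from code.c: the names saturate16 and mix_sample
are made up, and using atomic_fetch_add where this thread talks about
lock-prefixed instructions is an assumption for illustration. Each client
adds its sample to a shared 32-bit sum, writes the saturated value to the
16-bit output, and repeats the write if the sum changed in the meantime.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Clamp a 32-bit sum to the signed 16-bit sample range. */
static int16_t saturate16(int32_t v)
{
	if (v > INT16_MAX)
		return INT16_MAX;
	if (v < INT16_MIN)
		return INT16_MIN;
	return (int16_t)v;
}

/*
 * Hypothetical helper: mix one source sample into a shared position.
 * sum is the 32-bit accumulator, dst the 16-bit output sample.
 */
static void mix_sample(_Atomic int32_t *sum, _Atomic int16_t *dst, int16_t src)
{
	int32_t seen;

	assert(sum && dst);
	atomic_fetch_add(sum, src);	/* atomic read-modify-write */
	do {
		seen = atomic_load(sum);
		atomic_store(dst, saturate16(seen));
		/* Another client may have added meanwhile: re-read, retry. */
	} while (atomic_load(sum) != seen);
}
```

Without contention the loop body runs once, which is the cheap "re-read"
mentioned at the start of this thread.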

						Jaroslav

-----
Jaroslav Kysela <perex@suse.cz>
Linux Kernel Sound Maintainer
ALSA Project, SuSE Labs




^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: unterminated conditionals: @HAVE_JACK_TRUE@
  2003-02-20 19:23                         ` unterminated conditionals: @HAVE_JACK_TRUE@ tomasz motylewski
@ 2003-02-20 19:57                           ` Jaroslav Kysela
  2003-02-20 20:30                             ` tomasz motylewski
  0 siblings, 1 reply; 41+ messages in thread
From: Jaroslav Kysela @ 2003-02-20 19:57 UTC (permalink / raw)
  To: tomasz motylewski; +Cc: alsa-devel@lists.sourceforge.net

On Thu, 20 Feb 2003, tomasz motylewski wrote:

> 
> Debian woody, current cvs:
> 
> ./build  prep
> Pre-configuring alsa-driver
> make: Nothing to be done for `all-deps'.
> Pre-configuring alsa-lib
> src/pcm/Makefile.am:6: JACK_PLUGIN multiply defined in condition
> automake: src/pcm/Makefile.am: unterminated conditionals: @HAVE_JACK_TRUE@
> src/pcm/Makefile.am:9: warning: automake does not support conditional
> definition of JACK_PLUGIN in libpcm_la_SOURCES
> 
> Then after ./build config I get in alsa-lib/src/pcm/Makefile
> 
> @HAVE_JACK_TRUE@else !HAVE_JACK
> @HAVE_JACK_TRUE@endif !HAVE_JACK
> 
> @HAVE_JACK_TRUE@all: libpcm.la

Could you try removing the !HAVE_JACK string?

						Jaroslav

-----
Jaroslav Kysela <perex@suse.cz>
Linux Kernel Sound Maintainer
ALSA Project, SuSE Labs




^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: unterminated conditionals: @HAVE_JACK_TRUE@
  2003-02-20 19:57                           ` Jaroslav Kysela
@ 2003-02-20 20:30                             ` tomasz motylewski
  0 siblings, 0 replies; 41+ messages in thread
From: tomasz motylewski @ 2003-02-20 20:30 UTC (permalink / raw)
  To: Jaroslav Kysela; +Cc: alsa-devel@lists.sourceforge.net

On Thu, 20 Feb 2003, Jaroslav Kysela wrote:

> > Then after ./build config I get in alsa-lib/src/pcm/Makefile
> > 
> > @HAVE_JACK_TRUE@else !HAVE_JACK
> > @HAVE_JACK_TRUE@endif !HAVE_JACK
> > 
> > @HAVE_JACK_TRUE@all: libpcm.la
> 
> Could you try to remove !HAVE_JACK string?

From where?

I have just removed those 3 lines from that Makefile and run ./build all
again. This time:

Making all in alsamixer
make[1]: Entering directory `/root/ALSA/alsa-utils/alsamixer'
cd .. && automake --foreign alsamixer/Makefile
cd .. \
  && CONFIG_FILES=alsamixer/Makefile CONFIG_HEADERS= /bin/sh ./config.status
creating alsamixer/Makefile
make[1]: Leaving directory `/root/ALSA/alsa-utils/alsamixer'
make[1]: Entering directory `/root/ALSA/alsa-utils/alsamixer'
gcc -DHAVE_CONFIG_H -I. -I. -I../include     -g -O2 -c alsamixer.c
gcc  -g -O2  -o alsamixer  alsamixer.o -lncurses -lasound -lm -ldl -lpthread
alsamixer.o: In function `update_enum_list':
/root/ALSA/alsa-utils/alsamixer/alsamixer.c:513: undefined reference to
`snd_mixer_selem_get_enum_item'
/root/ALSA/alsa-utils/alsamixer/alsamixer.c:520: undefined reference to
`snd_mixer_selem_get_enum_items'
/root/ALSA/alsa-utils/alsamixer/alsamixer.c:527: undefined reference to
`snd_mixer_selem_set_enum_item'
alsamixer.o: In function `display_enum_list':
/root/ALSA/alsa-utils/alsamixer/alsamixer.c:696: undefined reference to
`snd_mixer_selem_get_enum_item'
/root/ALSA/alsa-utils/alsamixer/alsamixer.c:699: undefined reference to
`snd_mixer_selem_get_enum_item_name'
alsamixer.o: In function `mixer_reinit':
/root/ALSA/alsa-utils/alsamixer/alsamixer.c:1512: undefined reference to
`snd_mixer_selem_is_enumerated'
collect2: ld returned 1 exit status
make[1]: *** [alsamixer] Error 1
make[1]: Leaving directory `/root/ALSA/alsa-utils/alsamixer'
make: *** [all-recursive] Error 1

The problem is that I have a (previous?) version of
/usr/lib/libasound.so.2
/usr/lib/libasound.so.2.0.0

But shouldn't the build script take care of that by linking against
../../alsa-lib/ ?

I ran "make install" in alsa-lib and then ./build all again;
it went through OK.

Best regards,
--
Tomek


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Re: dmix plugin
  2003-02-20 19:55                       ` Jaroslav Kysela
@ 2003-02-20 21:19                         ` tomasz motylewski
  2003-02-20 21:27                           ` Jaroslav Kysela
  2003-02-21 10:25                         ` Abramo Bagnara
  2003-02-21 14:08                         ` Jaroslaw Sobierski
  2 siblings, 1 reply; 41+ messages in thread
From: tomasz motylewski @ 2003-02-20 21:19 UTC (permalink / raw)
  To: Jaroslav Kysela
  Cc: Abramo Bagnara, Jaroslaw Sobierski,
	alsa-devel@lists.sourceforge.net

Jaroslav:
> I think that we can lose more in the client/server model. Also, note that

client/server will have higher latency. The server has to copy the samples
to the DMA buffer at the "last minute", and the client has to finish before
the server copies the data. In the direct model only the client's timing has
to be within the typical (maximum) system latency.

Please note that on many cards supporting DMA, if the client is late by just a
few samples but still adds the whole period, only those few samples will be
silent. The "nondestructive underrun detection" is the beauty here. The client
knows it is late (by comparing its pointer with the HW pointer) but may continue
nevertheless if it knows the next data will arrive on time. You know, throwing
out all the samples or stopping the card on a small underrun is like pulling
the emergency brake because the train is a bit late. It only makes things worse.
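
A small sketch of the arithmetic behind this point, under stated
assumptions: positions are free-running frame counters (no ring-buffer
wraparound), and the function name frames_lost is made up. It is not ALSA
API. If the hardware pointer has already passed the client's write
position, only the frames in between are lost; the rest of the period
still plays.

```c
#include <assert.h>

/*
 * Hypothetical helper: how many frames of a period written at
 * write_pos arrive too late to be played, given the hardware
 * position hw_pos. Both are free-running frame counters.
 */
static unsigned long frames_lost(unsigned long hw_pos,
				 unsigned long write_pos,
				 unsigned long period)
{
	unsigned long late;

	assert(period > 0);
	if (hw_pos <= write_pos)
		return 0;		/* the client is on time */
	late = hw_pos - write_pos;
	/* At most the whole period can be lost. */
	return late < period ? late : period;
}
```

For a 256-frame period written 6 frames late, frames_lost returns 6, so
the remaining 250 frames are still heard.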

With client/server, either all is good or the whole period is lost.

Do I understand it correctly that the server stores data in 32 bit buffer and
then puts it in 16 bit DMA buffer of the card? This is one operation more
compared with mixing directly in DMA buffer.

Best regards,
--
Tomek


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Re: dmix plugin
  2003-02-20 21:19                         ` tomasz motylewski
@ 2003-02-20 21:27                           ` Jaroslav Kysela
  0 siblings, 0 replies; 41+ messages in thread
From: Jaroslav Kysela @ 2003-02-20 21:27 UTC (permalink / raw)
  To: tomasz motylewski
  Cc: Abramo Bagnara, Jaroslaw Sobierski,
	alsa-devel@lists.sourceforge.net

On Thu, 20 Feb 2003, tomasz motylewski wrote:

> Do I understand it correctly that the server stores data in 32 bit buffer and
> then puts it in 16 bit DMA buffer of the card? This is one operation more
> compared with mixing directly in DMA buffer.

There is no server; the 32-bit buffer is used for the total sum of samples
from all clients. Otherwise you'd get saturation errors (wrong clipping), as
described in the previous discussion.
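
A minimal sketch of why the 32-bit sum matters. The function names are
made up for illustration and this is not the dmix code: saturating after
every addition gives a different (wrong) result from keeping the full sum
and saturating once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp a 32-bit value to the signed 16-bit sample range. */
static int16_t clip16(int32_t v)
{
	if (v > INT16_MAX)
		return INT16_MAX;
	if (v < INT16_MIN)
		return INT16_MIN;
	return (int16_t)v;
}

/* Saturate after every addition, keeping only a 16-bit running value. */
static int16_t mix_clip_each_step(const int16_t *s, size_t n)
{
	int16_t acc = 0;
	size_t i;

	assert(s != NULL);
	for (i = 0; i < n; i++)
		acc = clip16((int32_t)acc + s[i]);
	return acc;
}

/* Keep the full sum in 32 bits and saturate once at the end. */
static int16_t mix_sum_then_clip(const int16_t *s, size_t n)
{
	int32_t sum = 0;
	size_t i;

	assert(s != NULL);
	for (i = 0; i < n; i++)
		sum += s[i];
	return clip16(sum);
}
```

For the three samples 30000, 30000, -30000 the first variant clips the
intermediate 60000 to 32767 and ends at 2767, while the second returns
the true sum 30000.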

						Jaroslav

-----
Jaroslav Kysela <perex@suse.cz>
Linux Kernel Sound Maintainer
ALSA Project, SuSE Labs




^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Re: dmix plugin
  2003-02-20 18:26                       ` Paul Davis
  2003-02-20 19:23                         ` unterminated conditionals: @HAVE_JACK_TRUE@ tomasz motylewski
@ 2003-02-20 22:14                         ` Abramo Bagnara
  1 sibling, 0 replies; 41+ messages in thread
From: Abramo Bagnara @ 2003-02-20 22:14 UTC (permalink / raw)
  To: Paul Davis; +Cc: alsa-devel@lists.sourceforge.net

Paul Davis wrote:
> 
> >The server based approach has an added cost of an extra context switch
> >every period (about 1500 cycles on my machine i.e.), but this is fully
> >amortized by such an huge difference.
> 
> recall that (1) the context switch time is not a fixed cost but

Mine was only a very rough approximation for trivial audio generating
processes.

> depends on the memory behaviour between switches and (2) isn't it
> either two switches per participating client/application, or if they
> are chained (as in JACK), N+2 switches, where N is the number of
> clients/applications ?

I don't understand why...

Suppose that on an otherwise idle UP system we have 3 applications
generating output for the current pcm_dmix.
In this case we have something like ABCABCABCABC... etc.

In the pcm_mix case we use a saturate/transfer/zero thread called M, and then
we'll have something like ABCMABCMABCMABCM... etc.

Do you agree?

-- 
Abramo Bagnara                       mailto:abramo.bagnara@libero.it

Opera Unica                          Phone: +39.546.656023
Via Emilia Interna, 140
48014 Castel Bolognese (RA) - Italy



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Re: dmix plugin
  2003-02-20 19:55                       ` Jaroslav Kysela
  2003-02-20 21:19                         ` tomasz motylewski
@ 2003-02-21 10:25                         ` Abramo Bagnara
  2003-02-21 14:08                         ` Jaroslaw Sobierski
  2 siblings, 0 replies; 41+ messages in thread
From: Abramo Bagnara @ 2003-02-21 10:25 UTC (permalink / raw)
  To: Jaroslav Kysela; +Cc: Jaroslaw Sobierski, alsa-devel@lists.sourceforge.net

Jaroslav Kysela wrote:
> 
> On Thu, 20 Feb 2003, Abramo Bagnara wrote:
> 
> > Now I'm able to get the same results you see.
> >
> > However I think that we need to extract some results from this data.
> >
> > I'll leave alone MMX optimizations because I want to compare apples with
> > apples.
> >
> > The distributed saturation (also when it's missing the check/repeat
> > concurrency correctness part) costs more than 4 times the ticks needed
> > for a (fully correct wrt concurrency) saturate once approach for the
> > case 2048 8 32768.
> >
> > CPU clock: 1460477150.884593
> > mix_areas0: 86747 0.031975%
> > mix_areas1: 259424 0.095623% (0)
> > mix_areas1_mmx: 253894 0.093585% (0)
> > mix_areas2: 132321 0.048773% (365)
> > mix_areas3: 332411 0.122526% (0)
> >
> > The server based approach has an added cost of an extra context switch
> > every period (about 1500 cycles on my machine i.e.), but this is fully
> > amortized by such an huge difference.
> >
> > What's your opinion?
> 
> Interesting is that my Intel P3 CPU has slightly different times:
> 
> pnote:/home/perex/alsa/alsa-lib/test # ./code 2048 8 32768
> Scheduler set to Round Robin with priority 99...
> CPU clock: 847.292487Mhz (UP)
> 
> Summary (the best times):
> mix_areas_srv : 576382 0.366206%
> mix_areas0    : 556852 0.353798%
> mix_areas1    : 867989 0.551480%
> mix_areas1_mmx: 625144 0.397187%
> mix_areas2    : 903335 0.573937%
> 
> areas1/srv ratio     : 1.505927
> areas1_mmx/srv ratio : 1.084600

This is due to the cache poisoning effect, which is quite surprising to me.
With a warm cache mix_areas_srv is 3 times faster than with a cold cache,
while the difference is smaller for the other alternatives.

I've modified code.c so that you can test this effect too.

However, I think that the realistic scenario is neither 0 nor 1024KB
of cache poisoning.

> I think that we can lose more in the client/server model. Also, note that
> we can use even futexes (if there's a hope that the possible context
> switch is acceptable) and then we can remove the cmpxchg trick and
> write-retry trick and use MMX for parallel saturation of two samples (this
> last can be used in the client/server model, too, indeed).

I really doubt that a futex would be of much help, as it's very difficult
to choose the unit it protects. I also very much like the fact that the
concurrent processes are totally independent. With a futex, if one of them
exits badly you're screwed.

What seems more interesting to me in the dmix approach is (as Tomasz
has pointed out) the exceptionally good latency (which is the other side
of the repeated saturation cost).

However, we will enjoy this benefit *only* if pcm_dmix is the last PCM of
the chain.

-- 
Abramo Bagnara                       mailto:abramo.bagnara@libero.it

Opera Unica                          Phone: +39.546.656023
Via Emilia Interna, 140
48014 Castel Bolognese (RA) - Italy



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Re: dmix plugin
  2003-02-20 19:55                       ` Jaroslav Kysela
  2003-02-20 21:19                         ` tomasz motylewski
  2003-02-21 10:25                         ` Abramo Bagnara
@ 2003-02-21 14:08                         ` Jaroslaw Sobierski
  2 siblings, 0 replies; 41+ messages in thread
From: Jaroslaw Sobierski @ 2003-02-21 14:08 UTC (permalink / raw)
  To: Jaroslav Kysela
  Cc: Abramo Bagnara, Tomasz Motylewski,
	alsa-devel@lists.sourceforge.net

Quoting Jaroslav Kysela <perex@suse.cz>:

> On Thu, 20 Feb 2003, Abramo Bagnara wrote:
> 
> > Now I'm able to get the same results you see.
> > 
> > However I think that we need to extract some results from this data.
> > 
> > I'll leave alone MMX optimizations because I want to compare apples with
> > apples.
> > 
> > The distributed saturation (also when it's missing the check/repeat
> > concurrency correctness part) costs more than 4 times the ticks needed
> > for a (fully correct wrt concurrency) saturate once approach for the
> > case 2048 8 32768.
> > 
> > CPU clock: 1460477150.884593
> > mix_areas0: 86747 0.031975%
> > mix_areas1: 259424 0.095623% (0)
> > mix_areas1_mmx: 253894 0.093585% (0)
> > mix_areas2: 132321 0.048773% (365)
> > mix_areas3: 332411 0.122526% (0)
> > 
> > The server based approach has an added cost of an extra context switch
> > every period (about 1500 cycles on my machine i.e.), but this is fully
> > amortized by such an huge difference.
> > 
> > What's your opinion?
> 
> Interesting is that my Intel P3 CPU has slightly different times:
> 
> pnote:/home/perex/alsa/alsa-lib/test # ./code 2048 8 32768
> Scheduler set to Round Robin with priority 99...
> CPU clock: 847.292487Mhz (UP)
> 
> Summary (the best times):
> mix_areas_srv : 576382 0.366206%
> mix_areas0    : 556852 0.353798%
> mix_areas1    : 867989 0.551480%
> mix_areas1_mmx: 625144 0.397187%
> mix_areas2    : 903335 0.573937%
> 
> areas1/srv ratio     : 1.505927
> areas1_mmx/srv ratio : 1.084600
> 
> I think that we can lose more in the client/server model. Also, note that
> we can use even futexes (if there's a hope that the possible context
> switch is acceptable) and then we can remove the cmpxchg trick and
> write-retry trick and use MMX for parallel saturation of two samples (this
> last can be used in the client/server model, too, indeed).
> 
> 						Jaroslav
> 

I'm not sure exactly what solution you're proposing here, but it seems to be
in line with my train of thought after seeing the results of these tests.
It seems that a fast, thread-unsafe implementation could have such a huge
speed advantage that the waiting imposed on other processes by global
locking would still be compensated. To give an example: if we can have a
mixing procedure that is 4 times quicker, then instead of having 3 threads
write concurrently for 12 seconds (that's 4 seconds of CPU time per thread),
they would write in turns, 1 second each, giving a total of 3 seconds. So the
1st thread to gain access could return after 1 second, the 2nd thread after
2 seconds and the 3rd after 3. That's still better than one thread writing
alone (for 4 seconds)! Yes, there is greater latency, but it seems well
compensated, at least for a reasonable number of sound sources connected.
Anything above 4 doesn't make much sense anyway if our approach is to
saturate rather than average: above this, distortion will be very
audible.
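
As a rough check of the arithmetic above, here is a minimal C sketch. It
assumes a single CPU shared evenly among the clients and ignores locking
overhead; the function names are hypothetical and not part of alsa-lib.

```c
#include <assert.h>
#include <stddef.h>

/* Time at which the k-th client (1-based) finishes when clients take
 * turns under a global lock and each needs t_fast seconds of mixing.
 * Assumption: lock hand-over itself costs nothing. */
double serialized_finish(size_t k, double t_fast)
{
    assert(k >= 1);
    return (double)k * t_fast;
}

/* Approximate time at which all n clients finish when they run the
 * slower, concurrency-safe mixer at the same time on one CPU.
 * Each needs t_slow seconds of CPU time, so the CPU is busy for
 * n * t_slow seconds in total. */
double concurrent_finish(size_t n, double t_slow)
{
    assert(n >= 1);
    return (double)n * t_slow;
}
```

With t_fast = 1 and t_slow = 4, the third serialized client finishes at
3 seconds, while three concurrent clients finish at about 12 seconds.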

And if we devise a smart locking mechanism, this latency problem can
be reduced to a minimum. The locking and unlocking code would live inside
the mixing function, so a badly coded application could not hold a lock
and block the others indefinitely.

A simple locking mechanism I'm considering is the following:
- we maintain a short table of the ranges locked by the clients (one
  entry per client).
- access to the table is synchronized with a single mutex.
- a request to lock a region can be granted partially, i.e.
  if thread 1 has locked offsets 300-500 and thread 2 wants 200-400,
  it will get access to 200-300, can mix there and then ask for the
  rest.
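
The table described above could be sketched roughly as follows. This is
only an illustration under assumptions: the names (range_table,
range_lock_partial, MAX_CLIENTS) are hypothetical, ranges are half-open
[start, end), ring-buffer wrap-around is ignored, and a client must
release its range before asking for another one.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_CLIENTS 8            /* hypothetical upper bound */

struct range { size_t start, end; int used; };

struct range_table {
    pthread_mutex_t lock;        /* the single mutex guarding the table */
    struct range held[MAX_CLIENTS];   /* one entry per client */
};

void range_table_init(struct range_table *t)
{
    pthread_mutex_init(&t->lock, NULL);
    for (int i = 0; i < MAX_CLIENTS; i++)
        t->held[i].used = 0;
}

/* Try to lock [start, end) for `client`. Returns the end of the granted
 * prefix [start, ret). ret == start means nothing could be granted. */
size_t range_lock_partial(struct range_table *t, int client,
                          size_t start, size_t end)
{
    assert(client >= 0 && client < MAX_CLIENTS && start <= end);
    size_t granted_end = end;

    pthread_mutex_lock(&t->lock);
    for (int i = 0; i < MAX_CLIENTS; i++) {
        const struct range *r = &t->held[i];
        if (i == client || !r->used)
            continue;
        if (r->end <= start || r->start >= granted_end)
            continue;            /* no overlap with what is still wanted */
        /* Overlap: keep only the part in front of the other range. */
        granted_end = r->start > start ? r->start : start;
    }
    if (granted_end > start)
        t->held[client] = (struct range){ start, granted_end, 1 };
    pthread_mutex_unlock(&t->lock);
    return granted_end;
}

void range_unlock(struct range_table *t, int client)
{
    assert(client >= 0 && client < MAX_CLIENTS);
    pthread_mutex_lock(&t->lock);
    t->held[client].used = 0;
    pthread_mutex_unlock(&t->lock);
}
```

With client 0 holding 300-500, a request by client 1 for 200-400 returns
300, i.e. only 200-300 is granted; the remainder has to be requested again
once client 0 has released its range.
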
Additionally, the mixing function could be implemented to break the
buffer passed in into chunks of, say, 1024 bytes and would try to
lock and mix those segments in sequence. This would minimize the
time spent waiting for other threads. It is a sound compromise
(excuse the pun) between the convenience of never waiting for other
threads, which effectively means synchronizing on a per sample basis, and
the speed afforded by code which doesn't need to care about
synchronization, yet is not hindered by global blocking.
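
A minimal sketch of such chunked mixing, assuming 16-bit signed samples
(so 1024 bytes is 512 samples) and saturation rather than averaging. The
lock and unlock hooks, CHUNK_SAMPLES, mix_chunked and count_hook are all
hypothetical names used only for illustration; the hooks stand in for
whatever region locking is chosen.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CHUNK_SAMPLES 512   /* 1024 bytes of 16-bit samples (assumption) */

/* Hypothetical hook type: called with the half-open sample range
 * [start, end) before and after a chunk is mixed. */
typedef void (*region_hook)(void *ctx, size_t start, size_t end);

/* Stand-in hook for demonstration: counts how often it is called. */
void count_hook(void *ctx, size_t start, size_t end)
{
    (void)start;
    (void)end;
    ++*(int *)ctx;
}

/* Clamp a 32-bit sum to the 16-bit sample range. */
static int16_t saturate16(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Mix `count` samples of src into dst, one chunk at a time. The inner
 * loop is plain, thread-unsafe code; safety is assumed to come from the
 * lock hook holding the region while it is being mixed. */
void mix_chunked(int16_t *dst, const int16_t *src, size_t count,
                 region_hook lock, region_hook unlock, void *ctx)
{
    assert(dst && src && lock && unlock);
    for (size_t start = 0; start < count; start += CHUNK_SAMPLES) {
        size_t end = start + CHUNK_SAMPLES;
        if (end > count)
            end = count;
        lock(ctx, start, end);
        for (size_t i = start; i < end; i++)
            dst[i] = saturate16((int32_t)dst[i] + (int32_t)src[i]);
        unlock(ctx, start, end);
    }
}
```

Because the lock is taken and released per chunk inside the mixing
function, a client never holds more than one chunk at a time.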

Am I making myself clear or does this sound totally convoluted?

--------------
Fycio (J.Sobierski)
 fycio@gucio.com




Thread overview: 41+ messages
2003-02-17 15:32 Re: dmix plugin Jaroslaw Sobierski
2003-02-17 19:45 ` Jaroslav Kysela
2003-02-17 20:44   ` tomasz motylewski
2003-02-17 20:59     ` Jaroslav Kysela
2003-02-18 10:00   ` Abramo Bagnara
2003-02-18 12:52     ` Jaroslav Kysela
2003-02-18 13:10       ` Jaroslaw Sobierski
2003-02-18 13:19         ` Jaroslav Kysela
2003-02-18 14:51       ` Paul Davis
2003-02-18 16:51         ` Jaroslav Kysela
2003-02-18 21:07     ` Jaroslav Kysela
2003-02-19 10:20       ` Abramo Bagnara
2003-02-19 11:01         ` Jaroslav Kysela
2003-02-19 11:17           ` Abramo Bagnara
2003-02-19 13:49             ` Abramo Bagnara
2003-02-19 15:45               ` Jaroslaw Sobierski
2003-02-19 20:39                 ` Abramo Bagnara
2003-02-19 18:34               ` Jaroslav Kysela
2003-02-19 21:24                 ` Jaroslav Kysela
2003-02-20  8:28                 ` Abramo Bagnara
2003-02-20  8:30                 ` Jaroslaw Sobierski
2003-02-20  8:48                   ` Abramo Bagnara
2003-02-20  9:17                   ` Echoaudio drivers Giuliano Pochini
2003-02-20 14:37                     ` David Olofson
2003-02-20 15:40                       ` Giuliano Pochini
2003-02-20 16:03                         ` David Olofson
2003-02-20  8:53                 ` Re: dmix plugin Abramo Bagnara
2003-02-20 16:49                   ` Jaroslav Kysela
2003-02-20 17:57                     ` Abramo Bagnara
2003-02-20 18:26                       ` Paul Davis
2003-02-20 19:23                         ` unterminated conditionals: @HAVE_JACK_TRUE@ tomasz motylewski
2003-02-20 19:57                           ` Jaroslav Kysela
2003-02-20 20:30                             ` tomasz motylewski
2003-02-20 22:14                         ` Re: dmix plugin Abramo Bagnara
2003-02-20 19:55                       ` Jaroslav Kysela
2003-02-20 21:19                         ` tomasz motylewski
2003-02-20 21:27                           ` Jaroslav Kysela
2003-02-21 10:25                         ` Abramo Bagnara
2003-02-21 14:08                         ` Jaroslaw Sobierski
2003-02-19 10:33       ` Jaroslaw Sobierski
2003-02-19 11:08         ` Jaroslav Kysela
