* Re: Re: dmix plugin
  2003-02-17 10:04 Jaroslaw Sobierski
@ 2003-02-17 10:15 ` Jaroslav Kysela
  2003-02-17 12:15   ` Abramo Bagnara
  2003-02-17 10:32 ` tomasz motylewski
  1 sibling, 1 reply; 50+ messages in thread
From: Jaroslav Kysela @ 2003-02-17 10:15 UTC (permalink / raw)
  To: Jaroslaw Sobierski; +Cc: alsa-devel@lists.sourceforge.net

On Mon, 17 Feb 2003, Jaroslaw Sobierski wrote:

> > > b) sum overflow: we can lower volume of samples before sum; I think that
> > >    hardware works in this way, too
> > 
> > Here I don't understand you. Suppose we have 3 samples to mix:
> > a = 0x7500
> > b = 0x7400
> > c = 0x8300
> > 
> > If you do a + b + c (in this order) you obtain:
> > d=0
> > d += a -> 7500
> > d += b -> 0xe900 -> 0x7fff
> > d += c -> 0x02ff
> > 
> > while the correct result is 0x6c00. You see?
> 
> AFAIK most hardware does not mix by reducing volume before the sum. On the
> contrary, it is usually summed "as is" to a wider register, and often even so
> used. For example, a sound card able to mix 16 channels of 16 bits would have
> a 16+4 bit or 24 bit register where the channels are added and no saturation
> can occur. In good hardware this would not even be downscaled back to 16 bits,
> but a 24 bit D/A converter would be used instead. In older times (Gravis Ultra
> Sound and I think older SB AWE) this could easily be spotted by the difference
> in supported "hardware" channels and "software" channels. A card with a 32 bit
> sum register and 24 bit DA could support (as above) 16 hardware channels and
> for example 64 software channels (mixed together in quadruplets to the 16 hw).
> 
> In our case, such "solution" would have to affect the whole buffer, meaning 
> we would need 3 (or better yet 4) bytes per sample, which would eventually get
> reduced back to 2 bytes on the way out to the sound card. This seems neither
> elegant nor memory efficient but would work, and also solves the "a)" problem
> because we don't need to saturate so an atomic add can be performed on each
> sample. 

Yes, this solution is good. I've thought about it, too. Unfortunately, it
adds additional transfers, including saturation, from the "sum" ring buffer
to the DMA buffer of the hardware.
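
The a/b/c example quoted above can be checked with a minimal C sketch. It
assumes signed 16-bit samples (so 0x8300 is written as -0x7d00) and a 32-bit
accumulator standing in for the "sum" ring buffer. The names saturate16,
mix_saturate_each and mix_saturate_once are illustrative only, not dmix
functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp a wide sum into the signed 16-bit range. */
static int16_t saturate16(int32_t v)
{
	if (v > INT16_MAX)
		return INT16_MAX;
	if (v < INT16_MIN)
		return INT16_MIN;
	return (int16_t)v;
}

/* Saturate after every addition, as a 16-bit shared buffer forces us to. */
static int16_t mix_saturate_each(const int16_t *s, size_t n)
{
	int16_t d = 0;

	assert(s || n == 0);
	for (size_t i = 0; i < n; i++)
		d = saturate16((int32_t)d + s[i]);
	return d;
}

/* Sum into a 32-bit accumulator and saturate once at the end. */
static int16_t mix_saturate_once(const int16_t *s, size_t n)
{
	int32_t sum = 0;

	assert(s || n == 0);
	for (size_t i = 0; i < n; i++)
		sum += s[i];
	return saturate16(sum);
}
```

With a = 0x7500, b = 0x7400 and c = -0x7d00 (in this order),
mix_saturate_each() gives 0x02ff while mix_saturate_once() gives 0x6c00.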

						Jaroslav

-----
Jaroslav Kysela <perex@suse.cz>
Linux Kernel Sound Maintainer
ALSA Project, SuSE Labs



-------------------------------------------------------
This sf.net email is sponsored by:ThinkGeek
Welcome to geek heaven.
http://thinkgeek.com/sf


* Re: Re: dmix plugin
  2003-02-17 10:04 Jaroslaw Sobierski
  2003-02-17 10:15 ` Jaroslav Kysela
@ 2003-02-17 10:32 ` tomasz motylewski
  1 sibling, 0 replies; 50+ messages in thread
From: tomasz motylewski @ 2003-02-17 10:32 UTC (permalink / raw)
  To: Jaroslaw Sobierski; +Cc: alsa-devel

On Mon, 17 Feb 2003, Jaroslaw Sobierski wrote:

> > Here I don't understand you. Suppose we have 3 samples to mix:
> > a = 0x7500
> > b = 0x7400
> > c = 0x8300
> > 
> > If you do a + b + c (in this order) you obtain:
> > d=0
> > d += a -> 7500
> > d += b -> 0xe900 -> 0x7fff
> > d += c -> 0x02ff
> > 
> > while the correct result is 0x6c00. You see?

Well, but when adding a+b we have no idea that the overflow will be compensated
by the next very big negative sample. Also, mixing signals which already fill
90% of the dynamic range is not a good idea. My "fix" is heuristic: it works
for occasional _small_ overflows, e.g. for 0x4100+0x4000 the result 0x7fff is
much better than 0x8100.

The idea of dmix, as I understand it, is that the buffer is already in the
native format for the sound card. So if the sound card supports 24 bit, OK.
But then people will start mixing 24 bit samples :-)

> AFAIK most hardware does not mix by reducing volume before the sum. On the
> contrary, it is usually summed "as is" to a wider register, and often even so

And here our "wider register" is 16 bit. That means end users should not expect
too much if they mix full power signals on it.

BTW. If you have uncorrelated signals, then to mix 4 signals it may be good
enough to reduce their amplitude by just a factor of 2, because power will drop
by a factor of 4. Occasionally there will be overruns, but the 0x7fff limit
will make them almost inaudible. Not a correct fix, but I can assure you that
it works in standard cases :-)
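
A minimal sketch of that heuristic, assuming four signed 16-bit inputs and a
32-bit temporary for the sum. Halving the amplitude quarters the power of each
input, and the powers of uncorrelated signals add, so four halved inputs carry
roughly the power of one full-scale signal. mix4_half_gain and saturate16 are
hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp a wide sum into the signed 16-bit range. */
static int16_t saturate16(int32_t v)
{
	if (v > INT16_MAX)
		return INT16_MAX;
	if (v < INT16_MIN)
		return INT16_MIN;
	return (int16_t)v;
}

/* Mix four samples after halving each one (gain 1/2 rather than 1/4). */
static int16_t mix4_half_gain(const int16_t s[4])
{
	int32_t sum = 0;

	assert(s);
	for (int i = 0; i < 4; i++)
		sum += s[i] / 2;	/* divide: well defined for negative samples */
	return saturate16(sum);		/* rare overruns are clipped, not wrapped */
}
```

Four inputs of 0x2000 mix to 0x4000 without clipping; four correlated inputs
of 0x7000 still overrun and are clamped to 0x7fff.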

Best regards,
--
Tomasz Motylewski
BFAD GmbH & Co. KG




* Re: Re: dmix plugin
@ 2003-02-17 11:18 Jaroslaw Sobierski
  2003-02-17 11:53 ` Jaroslav Kysela
  0 siblings, 1 reply; 50+ messages in thread
From: Jaroslaw Sobierski @ 2003-02-17 11:18 UTC (permalink / raw)
  To: perex; +Cc: alsa-devel

>> In our case, such "solution" would have to affect the whole buffer, meaning 
>> we would need 3 (or better yet 4) bytes per sample, which would eventually get
>> reduced back to 2 bytes on the way out to the sound card. This seems neither
>> elegant nor memory efficient but would work, and also solves the "a)" problem
>> because we don't need to saturate so an atomic add can be performed on each
>> sample. 
>
>Yes, this solution is good. I've though about it, too. Unfortunately, it 
>adds additional transfers including saturation from the "sum" ring buffer 
>to the DMA buffer of hardware.

Hmmm... Not exactly. This is not a problem. First of all, it is way
better to saturate once (i.e. just before the transfer) than to do it for
every channel individually, since saturation is a costly operation involving
a conditional jump (unless you optimize for MMX). If you're mixing 4
channels you do it once, not 4 times. Needing to store the
result in a different buffer, rather than putting it back in its original
place, hardly seems a big difference (except for cache hits maybe).

Also, if you insist on sparing memory (the buffer is not *that*
big, is it?) you can lay it out as two separate (ring) buffers, one
holding upper words, the other holding lower words. Now instead of
shifting the samples right by n bits before adding to the buffer, you
shift them left by 16-n bits. In effect you will get a buffer (the upper part)
which can be directly sent to the audio hw, and which was summed and 
divided without losing precision. The drawback is you lose the atomic 
add. If you don't shift, you can still saturate into the "upper" buffer 
and DMA from there.
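
A rough sketch of the split layout, under the assumption that the upper and
lower 16-bit words of each accumulator live in two separate arrays and that
at most 2^n clients are mixed. split_add is a hypothetical helper. It also
shows why the atomic add is lost: the update ends in two separate stores.

```c
#include <assert.h>
#include <stdint.h>

/* Add one sample, pre-scaled by 2^(16-n), to a split hi/lo accumulator. */
static void split_add(uint16_t *hi, uint16_t *lo, int16_t sample, unsigned n)
{
	int64_t scaled;
	uint32_t acc;

	assert(hi && lo && n <= 16);
	/* Multiply instead of shifting left: shifting a negative value is undefined. */
	scaled = (int64_t)sample * ((int64_t)1 << (16 - n));
	acc = ((uint32_t)*hi << 16) | *lo;
	acc += (uint32_t)scaled;	/* wraps modulo 2^32, i.e. a two's complement add */
	*hi = (uint16_t)(acc >> 16);	/* two separate stores: not one atomic add */
	*lo = (uint16_t)acc;
}
```

With n = 2, four samples of value 1 leave hi == 1, whereas shifting each
sample right by 2 before adding would have produced 0. The upper word holds
the sum divided by 4, and the lower word keeps the bits that a right shift
would have thrown away.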


--------------
Fycio (J.Sobierski)
 fycio@gucio.com



* Re: Re: dmix plugin
  2003-02-17 11:18 Jaroslaw Sobierski
@ 2003-02-17 11:53 ` Jaroslav Kysela
  0 siblings, 0 replies; 50+ messages in thread
From: Jaroslav Kysela @ 2003-02-17 11:53 UTC (permalink / raw)
  To: Jaroslaw Sobierski; +Cc: alsa-devel@lists.sourceforge.net

On Mon, 17 Feb 2003, Jaroslaw Sobierski wrote:

> >> In our case, such "solution" would have to affect the whole buffer, meaning 
> >> we would need 3 (or better yet 4) bytes per sample, which would eventually get
> >> reduced back to 2 bytes on the way out to the sound card. This seems neither
> >> elegant nor memory efficient but would work, and also solves the "a)" problem
> >> because we don't need to saturate so an atomic add can be performed on each
> >> sample. 
> >
> >Yes, this solution is good. I've though about it, too. Unfortunately, it 
> >adds additional transfers including saturation from the "sum" ring buffer 
> >to the DMA buffer of hardware.
> 
> Hmmm... Not exactly. This is not a problem. First of all: it is way
> better to saturate once (i.e. just before the transfer) since this is
> a costly operation involving a conditional jump (unless you optimize for
> mmx) than do it for every channel individually. If you're mixing 4
> channels you do it once, not 4 times. Just because you need to store the 
> result in a different buffer, rather than putting it in it's original 
> place seems hardly a big difference (except for cache hits maybe).
> 
> Also, if you insist on sparing memory (the buffer is not *that*
> big is it?) you can lay it out as two separate (ring) buffers, one 
> holding upper words, the other holding lower words. Now instead of 
> shifting the samples right n-bits before adding to the buffer, you 
> shift them left 16-n. In effect you will get a buffer (the upper part) 
> which can be directly sent to the audio hw, and which was summed and 
> divided without losing precision. The drawback is you lose the atomic 
> add. If you don't shift, you can still saturate into the "upper" buffer 
> and DMA from there.

My point was that all processes operate simultaneously and independently.
So if one process updates an area in the "sum" ring buffer, then it MUST
transfer the changed area (with saturation) to the DMA buffer. So there is no
"once saturation" as you think. Anyway, the current implementation also uses
saturation for all clients (processes), so the only drawback is the
additional access to the "sum" ring buffer memory area.

						Jaroslav

-----
Jaroslav Kysela <perex@suse.cz>
Linux Kernel Sound Maintainer
ALSA Project, SuSE Labs




* Re: Re: dmix plugin
  2003-02-17 10:15 ` Jaroslav Kysela
@ 2003-02-17 12:15   ` Abramo Bagnara
  2003-02-17 13:12     ` Jaroslav Kysela
  0 siblings, 1 reply; 50+ messages in thread
From: Abramo Bagnara @ 2003-02-17 12:15 UTC (permalink / raw)
  To: Jaroslav Kysela; +Cc: Jaroslaw Sobierski, alsa-devel@lists.sourceforge.net

Jaroslav Kysela wrote:
> 
> On Mon, 17 Feb 2003, Jaroslaw Sobierski wrote:
> 
> > > > b) sum overflow: we can lower volume of samples before sum; I think that
> > > >    hardware works in this way, too
> > >
> > > Here I don't understand you. Suppose we have 3 samples to mix:
> > > a = 0x7500
> > > b = 0x7400
> > > c = 0x8300
> > >
> > > If you do a + b + c (in this order) you obtain:
> > > d=0
> > > d += a -> 7500
> > > d += b -> 0xe900 -> 0x7fff
> > > d += c -> 0x02ff
> > >
> > > while the correct result is 0x6c00. You see?
> >
> > AFAIK most hardware does not mix by reducing volume before the sum. On the
> > contrary, it is usually summed "as is" to a wider register, and often even so
> > used. For example, a sound card able to mix 16 chanels of 16 bits would have
> > a 16+4 bits or 24 bit register were the channels are added and no saturation
> > can occur. In good hardware this would not even be downscaled back to 16 bits,
> > but a 24 bit D/A converter would be used instead. In older times (Gravis Ultra
> > Sound and I think older SB AWE) this could easily be spotted by the difference
> > in supported "hardware" channels and "software" channels. A card with a 32 bit
> > sum register and 24 bit DA could support (as above) 16 hardware channels and
> > for example 64 software channels (mixed together in quadrouplets to the 16 hw).
> >
> > In our case, such "solution" would have to affect the whole buffer, meaning
> > we would need 3 (or better yet 4) bytes per sample, which would eventually get
> > reduced back to 2 bytes on the way out to the sound card. This seems neither
> > elegant nor memory efficient but would work, and also solves the "a)" problem
> > because we don't need to saturate so an atomic add can be performed on each
> > sample.
> 
> Yes, this solution is good. I've though about it, too. Unfortunately, it
> adds additional transfers including saturation from the "sum" ring buffer
> to the DMA buffer of hardware.

I remind you that the main point of dmix's existence is the "direct"
part.

If we need to use an intermediate buffer and a mixing thread, the dmix
approach loses our interest.

A solution might be to have a shared parallel sw ring buffer in which to
store the exact value:

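	xadd(sw, *src);			/* atomically add our sample to the exact sum */
	do {
		v = *sw;
		if (v > 0x7fff)
			s = 0x7fff;
		else if (v < -0x8000)
			s = -0x8000;
		else
			s = v;
		*hw = s;		/* store the saturated value, not the raw sum */
	} while (unlikely(v != *sw));	/* retry if another client changed the sum */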
	
This should also solve the atomicity problem.

Comments?
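
The retry loop above could be written with C11 atomics roughly as follows.
This is only a sketch: atomic_fetch_add() stands in for xadd, mix_sample and
saturate16 are hypothetical names, and a real DMA buffer slot would not be an
_Atomic object.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Clamp a wide sum into the signed 16-bit range. */
static int16_t saturate16(int32_t v)
{
	if (v > INT16_MAX)
		return INT16_MAX;
	if (v < INT16_MIN)
		return INT16_MIN;
	return (int16_t)v;
}

/*
 * Add one client sample to the shared exact sum, then publish the
 * saturated value, retrying if another client changed the sum meanwhile.
 */
static void mix_sample(_Atomic int32_t *sw, _Atomic int16_t *hw, int16_t src)
{
	int32_t v;

	assert(sw && hw);
	atomic_fetch_add(sw, src);		/* plays the role of xadd */
	do {
		v = atomic_load(sw);
		atomic_store(hw, saturate16(v));
	} while (v != atomic_load(sw));		/* sum moved on: redo the store */
}
```

Whichever client stores last re-checks the sum afterwards, so a stale store is
overwritten on the next pass of the loop.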

-- 
Abramo Bagnara                       mailto:abramo.bagnara@libero.it

Opera Unica                          Phone: +39.546.656023
Via Emilia Interna, 140
48014 Castel Bolognese (RA) - Italy



* Re: Re: dmix plugin
  2003-02-17 12:15   ` Abramo Bagnara
@ 2003-02-17 13:12     ` Jaroslav Kysela
  2003-02-17 13:29       ` Abramo Bagnara
  0 siblings, 1 reply; 50+ messages in thread
From: Jaroslav Kysela @ 2003-02-17 13:12 UTC (permalink / raw)
  To: Abramo Bagnara; +Cc: Jaroslaw Sobierski, alsa-devel@lists.sourceforge.net

On Mon, 17 Feb 2003, Abramo Bagnara wrote:

> Jaroslav Kysela wrote:
> > 
> > On Mon, 17 Feb 2003, Jaroslaw Sobierski wrote:
> > 
> > > > > b) sum overflow: we can lower volume of samples before sum; I think that
> > > > >    hardware works in this way, too
> > > >
> > > > Here I don't understand you. Suppose we have 3 samples to mix:
> > > > a = 0x7500
> > > > b = 0x7400
> > > > c = 0x8300
> > > >
> > > > If you do a + b + c (in this order) you obtain:
> > > > d=0
> > > > d += a -> 7500
> > > > d += b -> 0xe900 -> 0x7fff
> > > > d += c -> 0x02ff
> > > >
> > > > while the correct result is 0x6c00. You see?
> > >
> > > AFAIK most hardware does not mix by reducing volume before the sum. On the
> > > contrary, it is usually summed "as is" to a wider register, and often even so
> > > used. For example, a sound card able to mix 16 chanels of 16 bits would have
> > > a 16+4 bits or 24 bit register were the channels are added and no saturation
> > > can occur. In good hardware this would not even be downscaled back to 16 bits,
> > > but a 24 bit D/A converter would be used instead. In older times (Gravis Ultra
> > > Sound and I think older SB AWE) this could easily be spotted by the difference
> > > in supported "hardware" channels and "software" channels. A card with a 32 bit
> > > sum register and 24 bit DA could support (as above) 16 hardware channels and
> > > for example 64 software channels (mixed together in quadrouplets to the 16 hw).
> > >
> > > In our case, such "solution" would have to affect the whole buffer, meaning
> > > we would need 3 (or better yet 4) bytes per sample, which would eventually get
> > > reduced back to 2 bytes on the way out to the sound card. This seems neither
> > > elegant nor memory efficient but would work, and also solves the "a)" problem
> > > because we don't need to saturate so an atomic add can be performed on each
> > > sample.
> > 
> > Yes, this solution is good. I've though about it, too. Unfortunately, it
> > adds additional transfers including saturation from the "sum" ring buffer
> > to the DMA buffer of hardware.
> 
> I remember you that the main point of dmix existence is the "direct"
> part.
> 
> If we'd need to use an intermediate buffer and a mixing thread, the dmix
> approach lose our interest.
> 
> A solution might be to have a shared parallel sw ring buffer where to
> store the exact value:
> 
>         xadd(sw, *src);
> 	do {
> 		v = *sw;
> 		if (v > 0x7fff)
> 			s = 0x7fff;
> 		else if (v < -0x8000)
> 			s = -0x8000;
> 		else
> 	     		s = v;
> 		*hw = s;
> 	} while (unlikely(v != *sw));
> 	
> This should solve also the atomicity update.
> 
> Comments?

We are probably talking about the same thing, but in different words. I also
don't think that atomicity is a problem when xadd() is atomic (and it is
atomic on i386).

Then you need to do the saturation and store to the hardware ring buffer,
but if this operation is after xadd() then we don't care about atomicity,
because we are 100% sure that we have a valid result.

Algorithm:

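	while (count-- > 0) {
		/* pass the slot's address so the add updates shared memory */
		atomic_xadd(&sum_ring_buffer[idx], local_buffer[idx]);
		hw_ring_buffer[idx] = saturate(sum_ring_buffer[idx]);
		idx++;	/* ring wrap-around omitted */
	}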


						Jaroslav

-----
Jaroslav Kysela <perex@suse.cz>
Linux Kernel Sound Maintainer
ALSA Project, SuSE Labs




* Re: Re: dmix plugin
@ 2003-02-17 13:12 Jaroslaw Sobierski
  2003-02-17 13:22 ` Jaroslav Kysela
  2003-02-17 13:24 ` Jaroslav Kysela
  0 siblings, 2 replies; 50+ messages in thread
From: Jaroslaw Sobierski @ 2003-02-17 13:12 UTC (permalink / raw)
  To: abramo.bagnara; +Cc: perex, alsa-devel

Abramo Bagnara wrote:
>If we'd need to use an intermediate buffer and a mixing thread, the dmix
>approach lose our interest.
>
>A solution might be to have a shared parallel sw ring buffer where to
>store the exact value:
>
>        xadd(sw, *src);
>	do {
>		v = *sw;
>		if (v > 0x7fff)
>			s = 0x7fff;
>		else if (v < -0x8000)
>			s = -0x8000;
>		else
>	     		s = v;
>		*hw = s;
>	} while (unlikely(v != *sw));
>	
>This should solve also the atomicity update.

Very true, and it is consistent with what
Jaroslav Kysela wrote:
> My point was that all processes operates simultaneously and independently.  
> So if one process updates area in the "sum" ring buffer, then it MUST
> transfer changed area (with saturation) to the DMA buffer. So there is no 
> "once saturation" as you think. Anyway, the current implementation uses 
> also saturation for all clients (processes) so the only drawback is the 
> additional access to the "sum" ring buffer memory area.

So it seems like a good compromise to solve all our problems :-). 

Still, don't we already *have* a feeding thread for the sound card? I mean,
it doesn't just grab the memory buffer all by itself whenever it wants?
Excuse my ignorance on this topic; I'm only just starting with ALSA, and I
have not had the time yet to go through the entire source code ;-).
I remember when I was writing a driver for an MPEG-2 decoder card that I
had to create 2 threads, one for feeding video and one for audio. The
FIFO level was checked either by polling or via interrupt handlers, but
I still had control over what was transferred and when. I could let the
card pull the data via DMA using bus mastering, but I still knew what
would be sent and from where...
Does the problem lie in the fact that it is actually a plugin and has
no control of the transfer? Maybe it would be worth considering a callback
for the plugin from the main ALSA module to inform it that a new piece
of the DMA buffer must be "prepared", whatever that could mean for a
particular plugin. Anyway, just a thought.

--------------
Fycio (J.Sobierski)
 fycio@gucio.com



* Re: Re: dmix plugin
  2003-02-17 13:12 Re: dmix plugin Jaroslaw Sobierski
@ 2003-02-17 13:22 ` Jaroslav Kysela
  2003-02-17 18:15   ` Paul Davis
  2003-02-17 13:24 ` Jaroslav Kysela
  1 sibling, 1 reply; 50+ messages in thread
From: Jaroslav Kysela @ 2003-02-17 13:22 UTC (permalink / raw)
  To: Jaroslaw Sobierski
  Cc: abramo.bagnara@libero.it, alsa-devel@lists.sourceforge.net

On Mon, 17 Feb 2003, Jaroslaw Sobierski wrote:

> Still, don't we already *have* a feeding thread for the sound card? I mean
> it doesn't just grab the memory buffer all by itself whenever it wants?

Nope. The idea of the dmix plugin is that we share the DMA ring buffer
among several threads (processes). There is no "master" thread which operates
exclusively on the DMA buffer.

						Jaroslav

-----
Jaroslav Kysela <perex@suse.cz>
Linux Kernel Sound Maintainer
ALSA Project, SuSE Labs




* Re: Re: dmix plugin
  2003-02-17 13:12 Re: dmix plugin Jaroslaw Sobierski
  2003-02-17 13:22 ` Jaroslav Kysela
@ 2003-02-17 13:24 ` Jaroslav Kysela
  1 sibling, 0 replies; 50+ messages in thread
From: Jaroslav Kysela @ 2003-02-17 13:24 UTC (permalink / raw)
  To: Jaroslaw Sobierski
  Cc: abramo.bagnara@libero.it, alsa-devel@lists.sourceforge.net

On Mon, 17 Feb 2003, Jaroslaw Sobierski wrote:

> Does the problem lie in the fact that it is actually a plugin and has
> no control of the transfer? Maybe it would be worth considering a callback
> for the plugin from the main alsa module to inform it that a new piece
> of the DMA buffer must be "prepared" whatever that could mean for a
> particular plugin. Anyway, just a thought.

We use poll and a slave timer source, which generates ticks when an
interrupt from the PCM hardware arrives. That is sufficient for our purpose.

						Jaroslav

-----
Jaroslav Kysela <perex@suse.cz>
Linux Kernel Sound Maintainer
ALSA Project, SuSE Labs




* Re: Re: dmix plugin
  2003-02-17 13:12     ` Jaroslav Kysela
@ 2003-02-17 13:29       ` Abramo Bagnara
  2003-02-17 15:00         ` Jaroslav Kysela
  0 siblings, 1 reply; 50+ messages in thread
From: Abramo Bagnara @ 2003-02-17 13:29 UTC (permalink / raw)
  To: Jaroslav Kysela; +Cc: Jaroslaw Sobierski, alsa-devel@lists.sourceforge.net

Jaroslav Kysela wrote:
> 
> On Mon, 17 Feb 2003, Abramo Bagnara wrote:
> 
> > Jaroslav Kysela wrote:
> > >
> > > On Mon, 17 Feb 2003, Jaroslaw Sobierski wrote:
> > >
> > > > > > b) sum overflow: we can lower volume of samples before sum; I think that
> > > > > >    hardware works in this way, too
> > > > >
> > > > > Here I don't understand you. Suppose we have 3 samples to mix:
> > > > > a = 0x7500
> > > > > b = 0x7400
> > > > > c = 0x8300
> > > > >
> > > > > If you do a + b + c (in this order) you obtain:
> > > > > d=0
> > > > > d += a -> 7500
> > > > > d += b -> 0xe900 -> 0x7fff
> > > > > d += c -> 0x02ff
> > > > >
> > > > > while the correct result is 0x6c00. You see?
> > > >
> > > > AFAIK most hardware does not mix by reducing volume before the sum. On the
> > > > contrary, it is usually summed "as is" to a wider register, and often even so
> > > > used. For example, a sound card able to mix 16 chanels of 16 bits would have
> > > > a 16+4 bits or 24 bit register were the channels are added and no saturation
> > > > can occur. In good hardware this would not even be downscaled back to 16 bits,
> > > > but a 24 bit D/A converter would be used instead. In older times (Gravis Ultra
> > > > Sound and I think older SB AWE) this could easily be spotted by the difference
> > > > in supported "hardware" channels and "software" channels. A card with a 32 bit
> > > > sum register and 24 bit DA could support (as above) 16 hardware channels and
> > > > for example 64 software channels (mixed together in quadrouplets to the 16 hw).
> > > >
> > > > In our case, such "solution" would have to affect the whole buffer, meaning
> > > > we would need 3 (or better yet 4) bytes per sample, which would eventually get
> > > > reduced back to 2 bytes on the way out to the sound card. This seems neither
> > > > elegant nor memory efficient but would work, and also solves the "a)" problem
> > > > because we don't need to saturate so an atomic add can be performed on each
> > > > sample.
> > >
> > > Yes, this solution is good. I've though about it, too. Unfortunately, it
> > > adds additional transfers including saturation from the "sum" ring buffer
> > > to the DMA buffer of hardware.
> >
> > I remember you that the main point of dmix existence is the "direct"
> > part.
> >
> > If we'd need to use an intermediate buffer and a mixing thread, the dmix
> > approach lose our interest.
> >
> > A solution might be to have a shared parallel sw ring buffer where to
> > store the exact value:
> >
> >         xadd(sw, *src);
> >       do {
> >               v = *sw;
> >               if (v > 0x7fff)
> >                       s = 0x7fff;
> >               else if (v < -0x8000)
> >                       s = -0x8000;
> >               else
> >                       s = v;
> >               *hw = s;
> >       } while (unlikely(v != *sw));
> >
> > This should solve also the atomicity update.
> >
> > Comments?
> 
> We probably talk about same thing, but in different words. I also don't
> think that atomicity is an problem when xadd() is atomic (and it is atomic
> for i386).
> 
> Then you need to do the saturation and store to the hardware ring buffer,
> but if this operation is after xadd() then we don't care about atomicity,
> because we are 100% sure that we have a valid result.
> 
> Algorithm:
> 
>         while (count) {
>                 atomic_xadd(sum_ring_buffer[idx], local_buffer[idx]);
>                 hw_ring_buffer[idx] = saturate(sum_ring_buffer[idx]);
>         }

You're wrong: xadd is atomic but xadd/read/saturation/write is not.

Without the loop I've added, you risk writing an *old* value to
hw_ring_buffer:

A:		B:
xadd
read
		xadd
		read
		saturate
		write
saturate
write
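
The interleaving above can be replayed by hand in a single thread to see the
stale write. This is a deterministic sketch, not a real race: the two clients
are simulated step by step, and replay_stale_write and saturate16 are
hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp a wide sum into the signed 16-bit range. */
static int16_t saturate16(int32_t v)
{
	if (v > INT16_MAX)
		return INT16_MAX;
	if (v < INT16_MIN)
		return INT16_MIN;
	return (int16_t)v;
}

/*
 * Replay the A/B interleaving without a retry loop. Returns the value left
 * in the hardware buffer slot; *sum_out receives the exact sum.
 */
static int16_t replay_stale_write(int16_t a, int16_t b, int32_t *sum_out)
{
	int32_t sum = 0;	/* shared "sum" ring buffer slot */
	int16_t hw = 0;		/* shared hardware ring buffer slot */
	int32_t read_a, read_b;

	assert(sum_out);
	sum += a;			/* A: xadd */
	read_a = sum;			/* A: read */
	sum += b;			/* B: xadd */
	read_b = sum;			/* B: read */
	hw = saturate16(read_b);	/* B: saturate, write */
	hw = saturate16(read_a);	/* A: saturate, write (stale value wins) */

	*sum_out = sum;
	return hw;
}
```

With a = 0x1000 and b = 0x2000 the exact sum is 0x3000, but the hardware slot
ends up holding 0x1000.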

-- 
Abramo Bagnara                       mailto:abramo.bagnara@libero.it

Opera Unica                          Phone: +39.546.656023
Via Emilia Interna, 140
48014 Castel Bolognese (RA) - Italy



* Re: Re: dmix plugin
  2003-02-17 13:29       ` Abramo Bagnara
@ 2003-02-17 15:00         ` Jaroslav Kysela
  2003-02-17 15:21           ` Abramo Bagnara
  0 siblings, 1 reply; 50+ messages in thread
From: Jaroslav Kysela @ 2003-02-17 15:00 UTC (permalink / raw)
  To: Abramo Bagnara; +Cc: Jaroslaw Sobierski, alsa-devel@lists.sourceforge.net

On Mon, 17 Feb 2003, Abramo Bagnara wrote:

> You're wrong: xadd is atomic but xadd/read/saturation/write is not.
> 
> Without the loop I've added you risk to write on hw_ring_buffer an *old*
> value:
> 
> A:		B:
> xadd
> read
> 		xadd
> 		read
> 		saturate
> 		write
> saturate
> write

I see, the read/saturate/write must be atomic, too. In this case, it would
be better to use a global (or a set of) mutex(es) to lock the hardware
ring buffer. The futexes are nice.

						Jaroslav

-----
Jaroslav Kysela <perex@suse.cz>
Linux Kernel Sound Maintainer
ALSA Project, SuSE Labs




* Re: Re: dmix plugin
  2003-02-17 15:00         ` Jaroslav Kysela
@ 2003-02-17 15:21           ` Abramo Bagnara
  0 siblings, 0 replies; 50+ messages in thread
From: Abramo Bagnara @ 2003-02-17 15:21 UTC (permalink / raw)
  To: Jaroslav Kysela; +Cc: Jaroslaw Sobierski, alsa-devel@lists.sourceforge.net

Jaroslav Kysela wrote:
> 
> On Mon, 17 Feb 2003, Abramo Bagnara wrote:
> 
> > You're wrong: xadd is atomic but xadd/read/saturation/write is not.
> >
> > Without the loop I've added you risk to write on hw_ring_buffer an *old*
> > value:
> >
> > A:            B:
> > xadd
> > read
> >               xadd
> >               read
> >               saturate
> >               write
> > saturate
> > write
> 
> I see, the read/saturate/write must be atomic, too. In this case, it would
> be better to use a global (or a set of) mutex(es) to lock the hardware
> ring buffer. The futexes are nice.

They are nice indeed, but definitely not the right solution here.

Although I don't know if it's the absolute best solution, the 'retry'
approach I've proposed is far better and much more efficient.

-- 
Abramo Bagnara                       mailto:abramo.bagnara@libero.it

Opera Unica                          Phone: +39.546.656023
Via Emilia Interna, 140
48014 Castel Bolognese (RA) - Italy



* Re: Re: dmix plugin
@ 2003-02-17 15:32 Jaroslaw Sobierski
  2003-02-17 19:45 ` Jaroslav Kysela
  0 siblings, 1 reply; 50+ messages in thread
From: Jaroslaw Sobierski @ 2003-02-17 15:32 UTC (permalink / raw)
  To: abramo.bagnara; +Cc: perex, alsa-devel

>> I see, the read/saturate/write must be atomic, too. In this case, it would
>> be better to use a global (or a set of) mutex(es) to lock the hardware
>> ring buffer. The futexes are nice.
>
>They are nice indeed, but definitely not the right solution here.
>
>Although I don't know if it's the absolute best solution, the 'retry'
>approach I've proposed is far better and much more efficient.

I have to agree with Abramo. A global mutex would cause long and unnecessary 
waits for the processes trying to write to the plugin. Locking access to
individual parts of the buffer is messy. Notice that concurrent writes 
to the same sample in the buffer will occur sporadically, and the "re-read"
in the loop costs almost nothing, while synchronization mechanisms could 
block often.

--------------
Fycio (J.Sobierski)
 fycio@gucio.com



* Re: Re: dmix plugin
@ 2003-02-17 16:18 Jaroslaw Sobierski
  0 siblings, 0 replies; 50+ messages in thread
From: Jaroslaw Sobierski @ 2003-02-17 16:18 UTC (permalink / raw)
  To: T.Motylewski; +Cc: perex, abramo.bagnara, alsa-devel

>
>Well, but when adding a+b we have no idea that that overlow will be compensated
>by next very big negative sample. Also mixing signals which already fill 90% of
>dynamic range is not a good idea. My "fix" is heuristic - it works for
>occasional _small_ overflows like 0x4100+0x4000 -> 0x7fff is much better than
>0x8100. 
>
>The idea of dmix as I understand it is that buffer is already in the native
>format for the sound card. So if sound card supports 24 bit, OK. But then
>people will start mixing 24 bit samples :-)
>
>> AFAIK most hardware does not mix by reducing volume before the sum. On the
>> contrary, it is usually summed "as is" to a wider register, and often even so
>
>And here our "wider register" is 16bit. That means end users should not expect
>too much if thay mix full power signals on it.
>
>BTW. If you have uncorrelated signals, then to mix 4 signals it may be good
>enough to reduce the amplitude of them just factor 2, because power will drop
>factor 4. Ocassionally there will be overrruns, but 0x7fff limit will make it
>almost not hearable. Not a correct fix, but I can assure you that it works in
>standard cases :-)

That's a good point. As long as we're dealing with 2 or 3 channels we probably
can do with saturating. But we should consider adding a shift right by one
(after adding, before saturation) once we have 4 channels, by two at 8
channels, or something similar.
Otherwise we will start getting some ugly clipping artifacts. The problem is,
this can cause a (noticeable) sudden drop in volume when a "threshold" client
connects/disconnects. We could ramp, but that's a multiplication...
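
A minimal sketch of such a threshold rule, assuming the exact sum is kept in
32 bits and the client count is known at mix time. The thresholds (4 and 8)
are the ones suggested above; shift_for_clients and scale_and_saturate are
hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp a wide sum into the signed 16-bit range. */
static int16_t saturate16(int32_t v)
{
	if (v > INT16_MAX)
		return INT16_MAX;
	if (v < INT16_MIN)
		return INT16_MIN;
	return (int16_t)v;
}

/* No scaling below 4 clients, divide by 2 from 4 clients, by 4 from 8. */
static unsigned shift_for_clients(unsigned clients)
{
	if (clients >= 8)
		return 2;
	if (clients >= 4)
		return 1;
	return 0;
}

/* Scale the exact sum after adding and before saturating. */
static int16_t scale_and_saturate(int32_t sum, unsigned clients)
{
	assert(clients > 0);
	/* Divide rather than shift so negative sums are handled portably. */
	return saturate16(sum / (1 << shift_for_clients(clients)));
}
```

The same sum of 0x9000 is clipped to 0x7fff with 3 clients but comes out as
0x4800 with 4, which is the sudden volume drop mentioned above.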

--------------
Fycio (J.Sobierski)
 fycio@gucio.com



* Re: Re: dmix plugin
  2003-02-17 13:22 ` Jaroslav Kysela
@ 2003-02-17 18:15   ` Paul Davis
  2003-02-18 22:36     ` Abramo Bagnara
  0 siblings, 1 reply; 50+ messages in thread
From: Paul Davis @ 2003-02-17 18:15 UTC (permalink / raw)
  To: Jaroslav Kysela
  Cc: Jaroslaw Sobierski, abramo.bagnara@libero.it,
	alsa-devel@lists.sourceforge.net

>> Still, don't we already *have* a feeding thread for the sound card? I mean
>> it doesn't just grab the memory buffer all by itself whenever it wants?
>
>Nope. The idea for the dmix plugin is that we share the DMA ring buffer 
>with more threads (processes). There is no "master" thread which operates 
>exclusively with the DMA buffer.

that would be called "JACK", right ?

--p



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Re: dmix plugin
  2003-02-17 15:32 Jaroslaw Sobierski
@ 2003-02-17 19:45 ` Jaroslav Kysela
  2003-02-17 20:44   ` tomasz motylewski
  2003-02-18 10:00   ` Abramo Bagnara
  0 siblings, 2 replies; 50+ messages in thread
From: Jaroslav Kysela @ 2003-02-17 19:45 UTC (permalink / raw)
  To: Jaroslaw Sobierski
  Cc: abramo.bagnara@libero.it, alsa-devel@lists.sourceforge.net

On Mon, 17 Feb 2003, Jaroslaw Sobierski wrote:

> >> I see, the read/saturate/write must be atomic, too. In this case, it would
> >> be better to use a global (or a set of) mutex(es) to lock the hardware
> >> ring buffer. The futexes are nice.
> >
> >They are nice indeed, but definitely not the right solution here.
> >
> >Although I don't know if it's the absolute best solution, the 'retry'
> >approach I've proposed is far better and much more efficient.
> 
> I have to agree with Abramo. A global mutex would cause long and unnecessary 
> waits for the processes trying to write to the plugin. Locking access to
> individual parts of the buffer is messy. Notice that concurrent writes 
> to the same sample in the buffer will occur sporadically, and the "re-read"
> in the loop costs almost nothing, while synchronization mechanisms could 
> block often.

Note that all your nice ideas lead to a blind alley. Who will silence the
sum buffer? The driver silences only the hardware buffer, which will not be
used for the calculation in your algorithm.

						Jaroslav

-----
Jaroslav Kysela <perex@suse.cz>
Linux Kernel Sound Maintainer
ALSA Project, SuSE Labs





^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Re: dmix plugin
  2003-02-17 19:45 ` Jaroslav Kysela
@ 2003-02-17 20:44   ` tomasz motylewski
  2003-02-17 20:59     ` Jaroslav Kysela
  2003-02-18 10:00   ` Abramo Bagnara
  1 sibling, 1 reply; 50+ messages in thread
From: tomasz motylewski @ 2003-02-17 20:44 UTC (permalink / raw)
  To: Jaroslav Kysela
  Cc: Jaroslaw Sobierski, abramo.bagnara@libero.it,
	alsa-devel@lists.sourceforge.net

On Mon, 17 Feb 2003, Jaroslav Kysela wrote:

> Note that your all nice ideas go to some blind alley. Who will silence the 
> sum buffer? Driver silences only hardware buffer which will not be used 
> for the calculation in your algorithm.

Silencing is not time critical: if the buffer is big enough, it does not matter
whether it is done 1 ms or 100 ms after the card has played the data. Therefore
it may be done by a separate thread/process/kernel task without any
interference with other processes writing to the buffer.

Anyway, I strongly support writing/adding directly to DMA buffer - lowest
latency possible. Precise information about current position of HW pointer
should be available to each application so it may tune the delay (synchronize
the data coming from the source with slightly different clock frequency!) by
adding/deleting single samples (with interpolation). Mutexes optional.

Best regards,
--
Tomasz Motylewski
BFAD GmbH & Co. KG




^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Re: dmix plugin
  2003-02-17 20:44   ` tomasz motylewski
@ 2003-02-17 20:59     ` Jaroslav Kysela
  0 siblings, 0 replies; 50+ messages in thread
From: Jaroslav Kysela @ 2003-02-17 20:59 UTC (permalink / raw)
  To: tomasz motylewski
  Cc: Jaroslaw Sobierski, abramo.bagnara@libero.it,
	alsa-devel@lists.sourceforge.net

On Mon, 17 Feb 2003, tomasz motylewski wrote:

> On Mon, 17 Feb 2003, Jaroslav Kysela wrote:
> 
> > Note that your all nice ideas go to some blind alley. Who will silence the 
> > sum buffer? Driver silences only hardware buffer which will not be used 
> > for the calculation in your algorithm.
> 
> Silencing is not time critical, if buffer is big enough it does not matter
> whether is it done 1 ms or 100 ms after the card has played the data. Therefore
> it may be done by a separate thread/process/kernel task without any
> interference with other processes writing to the buffer.

It is time critical for the dmix plugin, because other processes might 
write new samples to "empty" areas.

						Jaroslav

-----
Jaroslav Kysela <perex@suse.cz>
Linux Kernel Sound Maintainer
ALSA Project, SuSE Labs




^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Re: dmix plugin
@ 2003-02-17 22:28 Jaroslaw Sobierski
  0 siblings, 0 replies; 50+ messages in thread
From: Jaroslaw Sobierski @ 2003-02-17 22:28 UTC (permalink / raw)
  To: perex; +Cc: T.Motylewski, abramo.bagnara, alsa-devel


>> On Mon, 17 Feb 2003, Jaroslav Kysela wrote:
>> 
>> > Note that your all nice ideas go to some blind alley. Who will silence the 
>> > sum buffer? Driver silences only hardware buffer which will not be used 
>> > for the calculation in your algorithm.
>> 
>> Silencing is not time critical, if buffer is big enough it does not matter
>> whether is it done 1 ms or 100 ms after the card has played the data. Therefore
>> it may be done by a separate thread/process/kernel task without any
>> interference with other processes writing to the buffer.
>
>It is time critical for the dmix plugin, because other processes might 
>write new samples to "empty" areas.
>

Clearing the sum buffer would be a task analogous, or I should probably say
reverse, to the saturation operation. You see, before you take the value in
the sum buffer and add your sample and so forth, you can check if the
destination sample in the DMA buffer is zero. If it is, you disregard the
value in the sum (it is now considered stale), overwrite it with your sample
and proceed to saturate it normally. If another thread has already written
something there, the final buffer will be non-zero, and you proceed as
discussed before. If another thread has written zeroes, or the result has
summed up to zero, it still doesn't matter, because then the sum buffer
would also have to contain a zero, so it is right to disregard its value.
And that's it. OK, some synchronization would be in order so that you don't
kill a sample just written by some other thread as in:

A                         B
check hw buff 0? yes
                          check hw buff 0? yes
                          write B sample to sum/hw
write A sample to sum/hw

A re-read after the write does not solve the problem this time, because
thread B could (though it is very unlikely) have the same sample value.
But I'm sure we can come up with something for this.
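
A single-threaded sketch of the stale-sum check described above, assuming one
16-bit hardware sample and one 32-bit sum cell per position. The names
clamp16() and mix_into() are hypothetical, and the sketch deliberately does
nothing about the A/B race shown in the diagram.

```c
#include <stdint.h>

/* Saturate a wide sum to the signed 16-bit range. */
static int16_t clamp16(int32_t v)
{
    if (v > INT16_MAX)
        return INT16_MAX;
    if (v < INT16_MIN)
        return INT16_MIN;
    return (int16_t)v;
}

/* Add one sample at one buffer position. A zero in the hardware buffer
 * means the sum cell is stale (or genuinely zero), so it is overwritten
 * instead of accumulated. */
static void mix_into(int16_t *hw, int32_t *sum, int16_t sample)
{
    if (*hw == 0)
        *sum = sample;
    else
        *sum += sample;
    *hw = clamp16(*sum);
}
```

Starting from hw = 0 with garbage left in the sum cell, the first write
replaces the garbage and later writes accumulate. If the writes happen to add
up to zero, the next writer overwrites a sum that is already zero, which
changes nothing.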

That said, I still think it would be a better solution altogether to have
a buffer in an alsa-native not hardware-native format and have the driver
do the translation / saturation and the like. Yeah, I know that's not what
you want, I got it ;-).

--------------
Fycio (J.Sobierski)
 fycio@gucio.com



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Re: dmix plugin
  2003-02-17 19:45 ` Jaroslav Kysela
  2003-02-17 20:44   ` tomasz motylewski
@ 2003-02-18 10:00   ` Abramo Bagnara
  2003-02-18 12:52     ` Jaroslav Kysela
  2003-02-18 21:07     ` Jaroslav Kysela
  1 sibling, 2 replies; 50+ messages in thread
From: Abramo Bagnara @ 2003-02-18 10:00 UTC (permalink / raw)
  To: Jaroslav Kysela; +Cc: Jaroslaw Sobierski, alsa-devel@lists.sourceforge.net

Jaroslav Kysela wrote:
> 
> On Mon, 17 Feb 2003, Jaroslaw Sobierski wrote:
> 
> > >> I see, the read/saturate/write must be atomic, too. In this case, it would
> > >> be better to use a global (or a set of) mutex(es) to lock the hardware
> > >> ring buffer. The futexes are nice.
> > >
> > >They are nice indeed, but definitely not the right solution here.
> > >
> > >Although I don't know if it's the absolute best solution, the 'retry'
> > >approach I've proposed is far better and much more efficient.
> >
> > I have to agree with Abramo. A global mutex would cause long and unnecessary
> > waits for the processes trying to write to the plugin. Locking access to
> > individual parts of the buffer is messy. Notice that concurrent writes
> > to the same sample in the buffer will occur sporadically, and the "re-read"
> > in the loop costs almost nothing, while synchronization mechanisms could
> > block often.
> 
> Note that your all nice ideas go to some blind alley. Who will silence the
> sum buffer? Driver silences only hardware buffer which will not be used
> for the calculation in your algorithm.


Not so blind ;-)

	v = *src;
	if (cmpxchg(hw, 0, 1) == 0)
		v -= *sw;
	xadd(sw, v);
	do {
		v = *sw;
		if (v > 0x7fff)
			s = 0x7fff;
		else if (v < -0x8000)
			s = -0x8000;
		else
			s = v;
		*hw = s;
	} while (unlikely(v != *sw));

Have I convinced you?

However, as I wrote in my first message, the evil of the dmix
approach lies in the details: they might destroy the efficiency of the
approach rather easily.
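
The loop above, rendered with C11 <stdatomic.h> as a sketch. The function
name dmix_add_sample() and its parameters are made up for illustration, and
mapping cmpxchg/xadd onto atomic_compare_exchange_strong/atomic_fetch_add is
an assumption about the intended semantics, not the code that ended up in
alsa-lib. The unlikely() hint is dropped.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Mix one source sample into one position of the shared buffers.
 * hw: 16-bit sample the card plays, sw: 32-bit running sum. */
static void dmix_add_sample(_Atomic int16_t *hw, _Atomic int32_t *sw,
                            int16_t src)
{
    int32_t v = src;
    int16_t expected = 0;
    int16_t s;

    /* hw == 0 means the position was silenced: mark it as in use and
     * cancel whatever stale value is left in the sum cell. */
    if (atomic_compare_exchange_strong(hw, &expected, (int16_t)1))
        v -= atomic_load(sw);
    atomic_fetch_add(sw, v);

    /* Saturate and publish; retry if another writer changed the sum
     * while we were storing. */
    do {
        v = atomic_load(sw);
        if (v > 0x7fff)
            s = 0x7fff;
        else if (v < -0x8000)
            s = -0x8000;
        else
            s = (int16_t)v;
        atomic_store(hw, s);
    } while (v != atomic_load(sw));
}
```

Run single-threaded on the 0x7500 / 0x7400 / 0x8300 example from earlier in
the thread (0x8300 is -0x7d00 as a signed 16-bit value), the hardware sample
goes 0x7500, then 0x7fff, then 0x6c00, because saturation is applied to the
wide sum rather than to each partial result.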

-- 
Abramo Bagnara                       mailto:abramo.bagnara@libero.it

Opera Unica                          Phone: +39.546.656023
Via Emilia Interna, 140
48014 Castel Bolognese (RA) - Italy



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Re: dmix plugin
  2003-02-18 10:00   ` Abramo Bagnara
@ 2003-02-18 12:52     ` Jaroslav Kysela
  2003-02-18 13:10       ` Jaroslaw Sobierski
  2003-02-18 14:51       ` Paul Davis
  2003-02-18 21:07     ` Jaroslav Kysela
  1 sibling, 2 replies; 50+ messages in thread
From: Jaroslav Kysela @ 2003-02-18 12:52 UTC (permalink / raw)
  To: Abramo Bagnara; +Cc: Jaroslaw Sobierski, alsa-devel@lists.sourceforge.net

On Tue, 18 Feb 2003, Abramo Bagnara wrote:

> Jaroslav Kysela wrote:
> > 
> > On Mon, 17 Feb 2003, Jaroslaw Sobierski wrote:
> > 
> > > >> I see, the read/saturate/write must be atomic, too. In this case, it would
> > > >> be better to use a global (or a set of) mutex(es) to lock the hardware
> > > >> ring buffer. The futexes are nice.
> > > >
> > > >They are nice indeed, but definitely not the right solution here.
> > > >
> > > >Although I don't know if it's the absolute best solution, the 'retry'
> > > >approach I've proposed is far better and much more efficient.
> > >
> > > I have to agree with Abramo. A global mutex would cause long and unnecessary
> > > waits for the processes trying to write to the plugin. Locking access to
> > > individual parts of the buffer is messy. Notice that concurrent writes
> > > to the same sample in the buffer will occur sporadically, and the "re-read"
> > > in the loop costs almost nothing, while synchronization mechanisms could
> > > block often.
> > 
> > Note that your all nice ideas go to some blind alley. Who will silence the
> > sum buffer? Driver silences only hardware buffer which will not be used
> > for the calculation in your algorithm.
> 
> 
> Not so blind ;-)
> 
> 	v = *src;
> 	if (cmpxchg(hw, 0, 1) == 0)
> 		v -= *sw;
>         xadd(sw, v);
>         do {
>                 v = *sw;
>                 if (v > 0x7fff)
>                         s = 0x7fff;
>                 else if (v < -0x8000)
>                         s = -0x8000;
>                 else
>                         s = v;

A small correction (we have to avoid zero results in the hw buffer):

		  else if (v == 0)
			s = 1;
		  else
			s = v;

>                 *hw = s;
>         } while (unlikely(v != *sw));
> 
> I've convinced you?
> 
> However as I've written in the my first message the evil of dmix
> approach lies in details: they might destroy efficiency of approach
> rather easily.

Yes, but it seems that we can still do the job properly without global locks,
which seems pretty nice. Thank you for your help.

						Jaroslav

-----
Jaroslav Kysela <perex@suse.cz>
Linux Kernel Sound Maintainer
ALSA Project, SuSE Labs




^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Re: dmix plugin
  2003-02-18 12:52     ` Jaroslav Kysela
@ 2003-02-18 13:10       ` Jaroslaw Sobierski
  2003-02-18 13:19         ` Jaroslav Kysela
  2003-02-18 14:51       ` Paul Davis
  1 sibling, 1 reply; 50+ messages in thread
From: Jaroslaw Sobierski @ 2003-02-18 13:10 UTC (permalink / raw)
  To: Jaroslav Kysela; +Cc: Abramo Bagnara, alsa-devel@lists.sourceforge.net

Quoting Jaroslav Kysela:
[...]
> > 
> > 	v = *src;
> > 	if (cmpxchg(hw, 0, 1) == 0)
> > 		v -= *sw;
> >         xadd(sw, v);
> >         do {
> >                 v = *sw;
> >                 if (v > 0x7fff)
> >                         s = 0x7fff;
> >                 else if (v < -0x8000)
> >                         s = -0x8000;
> >                 else
> >                         s = v;
> 
> A bit correction (we have to avoid zero results in hw buffer):
> 
> 		  else if (v == 0)
> 			s = 1;
> 		  else
> 			s = v;
> 

Why?! It's as I wrote yesterday: even if the resulting sample
is zero, we can still treat the hw buffer as cleared. It makes no
difference whether it was reset by the driver or the samples just
added up to zero. If we have zero in the hw not because of a reset
we must also have 0 in sw, so the clearing code will have no effect.
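
The argument rests on saturation never turning a non-zero sum into zero. A
small brute-force check of that property, as a sketch: saturate16() mirrors
the clamp in the quoted loop, zero_only_from_zero() is a hypothetical helper,
and nothing here covers concurrent writers or a sum that wraps its 32 bits.

```c
#include <stdint.h>

/* Same clamp as in the quoted loop: 32-bit sum to signed 16 bit. */
static int16_t saturate16(int32_t v)
{
    if (v > 0x7fff)
        return 0x7fff;
    if (v < -0x8000)
        return -0x8000;
    return (int16_t)v;
}

/* Returns 1 if no non-zero sum in [lo, hi] saturates to 0.
 * hi must be below INT32_MAX so the loop counter cannot overflow. */
static int zero_only_from_zero(int32_t lo, int32_t hi)
{
    for (int32_t v = lo; v <= hi; v++) {
        if (v != 0 && saturate16(v) == 0)
            return 0;
    }
    return 1;
}
```

So whenever hw holds 0 after a completed write, sw holds 0 as well, and
subtracting *sw in the clearing branch subtracts nothing.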

--------------
Fycio (J.Sobierski)
 fycio@gucio.com



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Re: dmix plugin
  2003-02-18 13:10       ` Jaroslaw Sobierski
@ 2003-02-18 13:19         ` Jaroslav Kysela
  0 siblings, 0 replies; 50+ messages in thread
From: Jaroslav Kysela @ 2003-02-18 13:19 UTC (permalink / raw)
  To: Jaroslaw Sobierski; +Cc: Abramo Bagnara, alsa-devel@lists.sourceforge.net

On Tue, 18 Feb 2003, Jaroslaw Sobierski wrote:

> Quoting Jaroslav Kysela:
> [...]
> > > 
> > > 	v = *src;
> > > 	if (cmpxchg(hw, 0, 1) == 0)
> > > 		v -= *sw;
> > >         xadd(sw, v);
> > >         do {
> > >                 v = *sw;
> > >                 if (v > 0x7fff)
> > >                         s = 0x7fff;
> > >                 else if (v < -0x8000)
> > >                         s = -0x8000;
> > >                 else
> > >                         s = v;
> > 
> > A bit correction (we have to avoid zero results in hw buffer):
> > 
> > 		  else if (v == 0)
> > 			s = 1;
> > 		  else
> > 			s = v;
> > 
> 
> Why?! It's like I've written yesterday : even if the outcoming sample
> is zero, we can still treat the hw buffer as cleared. It makes no
> difference whether it was reset by the driver or the samples just
> added up to zero. If we have zero in the hw not because of a reset
> we must also have 0 in sw, so the clearing code will have no effect.

Thanks for the correction. Some things are not visible at first glance.

						Jaroslav

-----
Jaroslav Kysela <perex@suse.cz>
Linux Kernel Sound Maintainer
ALSA Project, SuSE Labs




^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Re: dmix plugin
  2003-02-18 12:52     ` Jaroslav Kysela
  2003-02-18 13:10       ` Jaroslaw Sobierski
@ 2003-02-18 14:51       ` Paul Davis
  2003-02-18 16:51         ` Jaroslav Kysela
  1 sibling, 1 reply; 50+ messages in thread
From: Paul Davis @ 2003-02-18 14:51 UTC (permalink / raw)
  To: alsa-devel@lists.sourceforge.net

>> 	v = *src;
>> 	if (cmpxchg(hw, 0, 1) == 0)
>> 		v -= *sw;
>>         xadd(sw, v);
>>         do {
>>                 v = *sw;
>>                 if (v > 0x7fff)
>>                         s = 0x7fff;
>>                 else if (v < -0x8000)
>>                         s = -0x8000;
>>                 else
>>                         s = v;
>
>A bit correction (we have to avoid zero results in hw buffer):
>
>		  else if (v == 0)
>			s = 1;
>		  else
>			s = v;
>
>>                 *hw = s;
>>         } while (unlikely(v != *sw));

help me out here. is this the code path that has to be followed to write
a single sample to the buffer?



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Re: dmix plugin
  2003-02-18 14:51       ` Paul Davis
@ 2003-02-18 16:51         ` Jaroslav Kysela
  0 siblings, 0 replies; 50+ messages in thread
From: Jaroslav Kysela @ 2003-02-18 16:51 UTC (permalink / raw)
  To: Paul Davis; +Cc: alsa-devel@lists.sourceforge.net

On Tue, 18 Feb 2003, Paul Davis wrote:

> >> 	v = *src;
> >> 	if (cmpxchg(hw, 0, 1) == 0)
> >> 		v -= *sw;
> >>         xadd(sw, v);
> >>         do {
> >>                 v = *sw;
> >>                 if (v > 0x7fff)
> >>                         s = 0x7fff;
> >>                 else if (v < -0x8000)
> >>                         s = -0x8000;
> >>                 else
> >>                         s = v;
> >
> >A bit correction (we have to avoid zero results in hw buffer):
> >
> >		  else if (v == 0)
> >			s = 1;
> >		  else
> >			s = v;
> >
> >>                 *hw = s;
> >>         } while (unlikely(v != *sw));
> 
> help me out here. is this the code path that has be followed to write
> a single sample to the buffer?

Yes, this code updates one sample in the hardware buffer.

						Jaroslav

-----
Jaroslav Kysela <perex@suse.cz>
Linux Kernel Sound Maintainer
ALSA Project, SuSE Labs




^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Re: dmix plugin
  2003-02-18 10:00   ` Abramo Bagnara
  2003-02-18 12:52     ` Jaroslav Kysela
@ 2003-02-18 21:07     ` Jaroslav Kysela
  2003-02-19 10:20       ` Abramo Bagnara
  2003-02-19 10:33       ` Jaroslaw Sobierski
  1 sibling, 2 replies; 50+ messages in thread
From: Jaroslav Kysela @ 2003-02-18 21:07 UTC (permalink / raw)
  To: Abramo Bagnara; +Cc: Jaroslaw Sobierski, alsa-devel@lists.sourceforge.net

On Tue, 18 Feb 2003, Abramo Bagnara wrote:

> Jaroslav Kysela wrote:
> > 
> > On Mon, 17 Feb 2003, Jaroslaw Sobierski wrote:
> > 
> > > >> I see, the read/saturate/write must be atomic, too. In this case, it would
> > > >> be better to use a global (or a set of) mutex(es) to lock the hardware
> > > >> ring buffer. The futexes are nice.
> > > >
> > > >They are nice indeed, but definitely not the right solution here.
> > > >
> > > >Although I don't know if it's the absolute best solution, the 'retry'
> > > >approach I've proposed is far better and much more efficient.
> > >
> > > I have to agree with Abramo. A global mutex would cause long and unnecessary
> > > waits for the processes trying to write to the plugin. Locking access to
> > > individual parts of the buffer is messy. Notice that concurrent writes
> > > to the same sample in the buffer will occur sporadically, and the "re-read"
> > > in the loop costs almost nothing, while synchronization mechanisms could
> > > block often.
> > 
> > Note that your all nice ideas go to some blind alley. Who will silence the
> > sum buffer? Driver silences only hardware buffer which will not be used
> > for the calculation in your algorithm.
> 
> 
> Not so blind ;-)
> 
> 	v = *src;
> 	if (cmpxchg(hw, 0, 1) == 0)
> 		v -= *sw;
>         xadd(sw, v);
>         do {
>                 v = *sw;
>                 if (v > 0x7fff)
>                         s = 0x7fff;
>                 else if (v < -0x8000)
>                         s = -0x8000;
>                 else
>                         s = v;
>                 *hw = s;
>         } while (unlikely(v != *sw));
> 
> I've convinced you?
> 
> However as I've written in the my first message the evil of dmix
> approach lies in details: they might destroy efficiency of approach
> rather easily.

I've implemented the whole transfer and mix loop in assembly and it works
without any drastic impact on CPU usage. I tried to optimize the assembler
part as much as I could, but if some assembler guru wants to take a look,
I'd appreciate it. The function is named mix_areas1() in
alsa-lib/src/pcm/pcm_dmix.c.

						Jaroslav

-----
Jaroslav Kysela <perex@suse.cz>
Linux Kernel Sound Maintainer
ALSA Project, SuSE Labs




^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Re: dmix plugin
  2003-02-17 18:15   ` Paul Davis
@ 2003-02-18 22:36     ` Abramo Bagnara
  0 siblings, 0 replies; 50+ messages in thread
From: Abramo Bagnara @ 2003-02-18 22:36 UTC (permalink / raw)
  To: alsa-devel

Paul Davis wrote:
> 
> >> Still, don't we already *have* a feeding thread for the sound card? I mean
> >> it doesn't just grab the memory buffer all by itself whenever it wants?
> >
> >Nope. The idea for the dmix plugin is that we share the DMA ring buffer
> >with more threads (processes). There is no "master" thread which operates
> >exclusively with the DMA buffer.
> 
> that would be called "JACK", right ?

Not necessarily, sorry.

I've just explained in many ways that IMO the callback-only model choice
will doom Jack to remain in a niche.

And I say this with grief: Jack is the nicest acronym I've ever heard
;-)

-- 
Abramo Bagnara                       mailto:abramo.bagnara@libero.it

Opera Unica                          Phone: +39.546.656023
Via Emilia Interna, 140
48014 Castel Bolognese (RA) - Italy



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Re: dmix plugin
  2003-02-18 21:07     ` Jaroslav Kysela
@ 2003-02-19 10:20       ` Abramo Bagnara
  2003-02-19 11:01         ` Jaroslav Kysela
  2003-02-19 10:33       ` Jaroslaw Sobierski
  1 sibling, 1 reply; 50+ messages in thread
From: Abramo Bagnara @ 2003-02-19 10:20 UTC (permalink / raw)
  To: Jaroslav Kysela; +Cc: Jaroslaw Sobierski, alsa-devel@lists.sourceforge.net

Jaroslav Kysela wrote:
> 
> I've implemented the whole transfer and mix loop in assembly and it works
> without any drastic impact on CPU usage. I tried to optimize the assembler
> part as much as I can, but if some assembler guru want to give a glance,
> I'll appreciate it. The function is named mix_areas1() in
> alsa-lib/src/pcm/pcm_dmix.c.

one comment:

It's better to execute the interleaved check once, not inside mix_areas.

one objection:

I doubt very much that you gain anything by coding the mixing loop in
assembler. Do you have data showing that?


-- 
Abramo Bagnara                       mailto:abramo.bagnara@libero.it

Opera Unica                          Phone: +39.546.656023
Via Emilia Interna, 140
48014 Castel Bolognese (RA) - Italy


-------------------------------------------------------
This SF.net email is sponsored by: SlickEdit Inc. Develop an edge.
The most comprehensive and flexible code editor you can use.
Code faster. C/C++, C#, Java, HTML, XML, many more. FREE 30-Day Trial.
www.slickedit.com/sourceforge

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Re: dmix plugin
  2003-02-18 21:07     ` Jaroslav Kysela
  2003-02-19 10:20       ` Abramo Bagnara
@ 2003-02-19 10:33       ` Jaroslaw Sobierski
  2003-02-19 11:08         ` Jaroslav Kysela
  1 sibling, 1 reply; 50+ messages in thread
From: Jaroslaw Sobierski @ 2003-02-19 10:33 UTC (permalink / raw)
  To: Jaroslav Kysela; +Cc: Abramo Bagnara, alsa-devel@lists.sourceforge.net

Quoting Jaroslav Kysela <perex@suse.cz>:
> 
> I've implemented the whole transfer and mix loop in assembly and it works
> without any drastic impact on CPU usage. I tried to optimize the assembler
> part as much as I can, but if some assembler guru want to give a glance,
> I'll appreciate it. The function is named mix_areas1() in
> alsa-lib/src/pcm/pcm_dmix.c.
> 

It seems to me it would make sense to code it for MMX (to use the saturation
it offers, for example). If you go for pure 386 there's little to win.
Did you look at the assembly generated by gcc when compiling with
optimizations? I usually make this a starting point when moving time-critical
code to assembly, and if it looks optimized enough, I leave it at that,
unless I can use tricks not available to the compiler, like, again, MMX.
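
For reference, a portable scalar model of what a four-lane saturating 16-bit
add does (MMX offers this as the PADDSW instruction). This is plain C, not
MMX code, and sat_add16() / sat_add16_x4() are hypothetical names. It also
only saturates pairwise, so it would not replace the wide sum buffer.

```c
#include <stdint.h>

/* Signed 16-bit add that clamps instead of wrapping. */
static int16_t sat_add16(int16_t a, int16_t b)
{
    int32_t wide = (int32_t)a + (int32_t)b;

    if (wide > INT16_MAX)
        return INT16_MAX;
    if (wide < INT16_MIN)
        return INT16_MIN;
    return (int16_t)wide;
}

/* Four lanes at once, dst[i] += src[i] with saturation per lane. */
static void sat_add16_x4(int16_t dst[4], const int16_t src[4])
{
    for (int i = 0; i < 4; i++)
        dst[i] = sat_add16(dst[i], src[i]);
}
```

Chaining pairwise saturating adds reproduces the problem from earlier in the
thread: 0x7500 + 0x7400 clips to 0x7fff, and adding 0x8300 (-0x7d00) then
gives 0x02ff instead of 0x6c00.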

I don't know how well gcc is optimized for Intel CPUs, but I remember that you
really had to work your ass off to beat inner loops optimized by Watcom
compilers (BTW I heard they're coming back with open source compilers :-).
Not to mention proprietary Intel compilers, which can take into
account things like word alignment for data and code, cache hit / miss
situations, branch prediction and all kinds of magical stuff.

I'll take a closer look at the code when I have more time though.

--------------
Fycio (J.Sobierski)
 fycio@gucio.com



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Re: dmix plugin
  2003-02-19 10:20       ` Abramo Bagnara
@ 2003-02-19 11:01         ` Jaroslav Kysela
  2003-02-19 11:17           ` Abramo Bagnara
  0 siblings, 1 reply; 50+ messages in thread
From: Jaroslav Kysela @ 2003-02-19 11:01 UTC (permalink / raw)
  To: Abramo Bagnara; +Cc: Jaroslaw Sobierski, alsa-devel@lists.sourceforge.net

On Wed, 19 Feb 2003, Abramo Bagnara wrote:

> Jaroslav Kysela wrote:
> > 
> > I've implemented the whole transfer and mix loop in assembly and it works
> > without any drastic impact on CPU usage. I tried to optimize the assembler
> > part as much as I can, but if some assembler guru want to give a glance,
> > I'll appreciate it. The function is named mix_areas1() in
> > alsa-lib/src/pcm/pcm_dmix.c.
> 
> one comment:
> 
> It's better to execute interleaved check once and not in mix_areas

Done. I was too tired yesterday to bother with these details.

> one objection:
> 
> I doubt very much that you gain anything coding the mixing loop in
> assembler, you've data showing that?

I think that I saved some ticks by duplicating the code for saturation, and
also the main while{} loop is more efficient than what GCC generates. But
it's only a guess.

						Jaroslav

-----
Jaroslav Kysela <perex@suse.cz>
Linux Kernel Sound Maintainer
ALSA Project, SuSE Labs





^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Re: dmix plugin
  2003-02-19 10:33       ` Jaroslaw Sobierski
@ 2003-02-19 11:08         ` Jaroslav Kysela
  0 siblings, 0 replies; 50+ messages in thread
From: Jaroslav Kysela @ 2003-02-19 11:08 UTC (permalink / raw)
  To: Jaroslaw Sobierski; +Cc: Abramo Bagnara, alsa-devel@lists.sourceforge.net

On Wed, 19 Feb 2003, Jaroslaw Sobierski wrote:

> Quoting Jaroslav Kysela <perex@suse.cz>:
> > 
> > I've implemented the whole transfer and mix loop in assembly and it works
> > without any drastic impact on CPU usage. I tried to optimize the assembler
> > part as much as I can, but if some assembler guru want to give a glance,
> > I'll appreciate it. The function is named mix_areas1() in
> > alsa-lib/src/pcm/pcm_dmix.c.
> > 
> 
> It seems to me it would make sens to code it for mmx (to use the saturation
> it offers for example). If you go for pure 386 there's little to win.

Yes and no. I don't think saturation will be needed very often, so the
saturation code path mostly takes 4 instructions (two compares, two skipped
conditional jumps).

> Did you look at the assembly generated by gcc when compiling with 
> optimiazations? I usually make this a start point when moving time-critical 

Yes, my code is based on the code from GCC.

> code to assembly, and if it looks optimized enough - I leave it at that,
> unless I can use tricks not available to the compiler - like, again, mmx.
> 
> I don't know how well gcc is optimized for intels, but I remember that you
> really had to work your ass of to beat inner loops optimized by Watcomm
> compilers (BTW I heard they're coming back with open source compilers :-). 
> Not to mention proprietary Intel compilers which can take into
> account things like word alignment for data and code, cache hit / miss 
> situations, branch preditiction and all kinds of magical stuff.

Yes, of course. I've not claimed that I wrote the best code in the world ;-)
But it's something we can start with.

						Jaroslav

-----
Jaroslav Kysela <perex@suse.cz>
Linux Kernel Sound Maintainer
ALSA Project, SuSE Labs




^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Re: dmix plugin
  2003-02-19 11:01         ` Jaroslav Kysela
@ 2003-02-19 11:17           ` Abramo Bagnara
  2003-02-19 13:49             ` Abramo Bagnara
  0 siblings, 1 reply; 50+ messages in thread
From: Abramo Bagnara @ 2003-02-19 11:17 UTC (permalink / raw)
  To: Jaroslav Kysela; +Cc: Jaroslaw Sobierski, alsa-devel@lists.sourceforge.net

Jaroslav Kysela wrote:
> 
> On Wed, 19 Feb 2003, Abramo Bagnara wrote:
> 
> > Jaroslav Kysela wrote:
> > >
> > > I've implemented the whole transfer and mix loop in assembly and it works
> > > without any drastic impact on CPU usage. I tried to optimize the assembler
> > > part as much as I could, but if some assembler guru wants to take a look,
> > > I'd appreciate it. The function is named mix_areas1() in
> > > alsa-lib/src/pcm/pcm_dmix.c.
> >
> > one comment:
> >
> > It's better to execute the interleaved check once and not in mix_areas
> 
> Done. I was too tired yesterday to bother with these details.
> 
> > one objection:
> >
> > I doubt very much that you gain anything by coding the mixing loop in
> > assembler. Do you have data showing that?
> 
> I think that I saved some ticks by duplicating the code for saturation, and
> the main while{} loop is also more efficient than what GCC generates. But
> it's only a guess.

I hope to find the time to check it this evening

-- 
Abramo Bagnara                       mailto:abramo.bagnara@libero.it

Opera Unica                          Phone: +39.546.656023
Via Emilia Interna, 140
48014 Castel Bolognese (RA) - Italy



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Re: dmix plugin
  2003-02-19 11:17           ` Abramo Bagnara
@ 2003-02-19 13:49             ` Abramo Bagnara
  2003-02-19 15:45               ` Jaroslaw Sobierski
  2003-02-19 18:34               ` Jaroslav Kysela
  0 siblings, 2 replies; 50+ messages in thread
From: Abramo Bagnara @ 2003-02-19 13:49 UTC (permalink / raw)
  To: Jaroslav Kysela, Jaroslaw Sobierski,
	alsa-devel@lists.sourceforge.net

[-- Attachment #1: Type: text/plain, Size: 2816 bytes --]

Abramo Bagnara wrote:
> 
> Jaroslav Kysela wrote:
> >
> > On Wed, 19 Feb 2003, Abramo Bagnara wrote:
> >
> > > Jaroslav Kysela wrote:
> > > >
> > > > I've implemented the whole transfer and mix loop in assembly and it works
> > > > without any drastic impact on CPU usage. I tried to optimize the assembler
> > > > part as much as I could, but if some assembler guru wants to take a look,
> > > > I'd appreciate it. The function is named mix_areas1() in
> > > > alsa-lib/src/pcm/pcm_dmix.c.
> > >
> > > one comment:
> > >
> > > It's better to execute the interleaved check once and not in mix_areas
> >
> > Done. I was too tired yesterday to bother with these details.
> >
> > > one objection:
> > >
> > > I doubt very much that you gain anything by coding the mixing loop in
> > > assembler. Do you have data showing that?
> >
> > I think that I saved some ticks by duplicating the code for saturation, and
> > the main while{} loop is also more efficient than what GCC generates. But
> > it's only a guess.
> 
> I hope to find the time to check it this evening

I've stolen some time from paid work.

The results are amazing and I'd say Jaroslav has done some mistakes in
his handmade asm.

$ cat /proc/cpuinfo
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 6
model           : 6
model name      : AMD Athlon(tm) XP 1700+
stepping        : 2
cpu MHz         : 1460.471
cache size      : 256 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca
cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow
bogomips        : 2916.35
$ gcc -v
Reading specs from /usr/lib/gcc-lib/i386-redhat-linux/3.2.1/specs
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man
--infodir=/usr/share/info --enable-shared --enable-threads=posix
--disable-checking --with-system-zlib --enable-__cxa_atexit
--host=i386-redhat-linux
Thread model: posix
gcc version 3.2.1 20021125 (Red Hat Linux 8.0 3.2.1-1)
$ make
gcc -O6 -W -Wall   -c -o sum.o sum.c
sum.c: In function `main':
sum.c:242: warning: implicit declaration of function `printf'
sum.c:219: warning: unused parameter `argc'
sum.c:255: warning: control reaches end of non-void function
sum.c: In function `mix_areas0':
sum.c:64: warning: unused parameter `sum'
gcc   sum.o   -o sum
$ ./sum 2048 4 32767
mix_areas0: 110603
mix_areas1: 1512610
mix_areas2: 157597

mix_areas0 is the naive, incorrect version
mix_areas1 is Jaroslav's asm
mix_areas2 is my best attempt

Time in clock ticks.
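
For anyone who does not want to dig through the attachment: the per-sample
scheme that mix_areas1 and mix_areas2 both follow can be restated roughly as
below. This is only a sketch written with C11 <stdatomic.h> instead of the
inline asm in sum.c; mix_one and clamp_s16 are made-up names, and the code
mirrors the posted loop without claiming that the loop is free of races.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical helper: clamp a 32-bit sum to the signed 16-bit range. */
static int16_t clamp_s16(int32_t v)
{
	if (v > 0x7fff)
		return 0x7fff;
	if (v < -0x8000)
		return -0x8000;
	return (int16_t)v;
}

/* Hypothetical helper: mix one source sample into a shared (dst, sum) slot. */
static void mix_one(_Atomic int16_t *dst, _Atomic int32_t *sum, int16_t src)
{
	int32_t sample = src;
	int16_t expected = 0;

	/* dst still 0 means nobody has written this slot yet, so the value
	 * left in sum is stale and gets cancelled out by the add below. */
	if (atomic_compare_exchange_strong(dst, &expected, 1))
		sample -= atomic_load(sum);
	atomic_fetch_add(sum, sample);

	/* Publish the clamped sum; retry if another writer changed sum
	 * while we were storing. */
	do {
		sample = atomic_load(sum);
		atomic_store(dst, clamp_s16(sample));
	} while (sample != atomic_load(sum));
}
```

Calling mix_one() once per sample and per writer corresponds to one pass of
the while loop in mix_areas2.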

-- 
Abramo Bagnara                       mailto:abramo.bagnara@libero.it

Opera Unica                          Phone: +39.546.656023
Via Emilia Interna, 140
48014 Castel Bolognese (RA) - Italy

[-- Attachment #2: sum.c --]
[-- Type: text/plain, Size: 5168 bytes --]

#include <stdlib.h>
#include <stdlib.h>
#include <string.h>

#define rdtscll(val) \
     __asm__ __volatile__("rdtsc" : "=A" (val))

#define likely(x)       __builtin_expect((x),1)
#define unlikely(x)     __builtin_expect((x),0)

typedef short int s16;
typedef int s32;

#ifdef CONFIG_SMP
#define LOCK_PREFIX "lock ; "
#else
#define LOCK_PREFIX ""
#endif

struct __xchg_dummy { unsigned long a[100]; };
#define __xg(x) ((struct __xchg_dummy *)(x))

static inline unsigned long __cmpxchg(volatile void *ptr, unsigned long old,
				      unsigned long new, int size)
{
	unsigned long prev;
	switch (size) {
	case 1:
		__asm__ __volatile__(LOCK_PREFIX "cmpxchgb %b1,%2"
				     : "=a"(prev)
				     : "q"(new), "m"(*__xg(ptr)), "0"(old)
				     : "memory");
		return prev;
	case 2:
		__asm__ __volatile__(LOCK_PREFIX "cmpxchgw %w1,%2"
				     : "=a"(prev)
				     : "q"(new), "m"(*__xg(ptr)), "0"(old)
				     : "memory");
		return prev;
	case 4:
		__asm__ __volatile__(LOCK_PREFIX "cmpxchgl %1,%2"
				     : "=a"(prev)
				     : "q"(new), "m"(*__xg(ptr)), "0"(old)
				     : "memory");
		return prev;
	}
	return old;
}

#define cmpxchg(ptr,o,n)\
	((__typeof__(*(ptr)))__cmpxchg((ptr),(unsigned long)(o),\
					(unsigned long)(n),sizeof(*(ptr))))

static inline void atomic_add(volatile int *dst, int v)
{
	__asm__ __volatile__(
		LOCK_PREFIX "addl %0,%1"
		:"=m" (*dst)
		:"ir" (v));
}

void mix_areas0(unsigned int size,
		volatile s16 *dst, s16 *src,
		volatile s32 *sum,
		unsigned int dst_step, unsigned int src_step)
{
	while (size-- > 0) {
		s32 sample = *dst + *src;
		if (unlikely(sample & 0xffff0000))
			*dst = sample > 0 ? 0x7fff : -0x8000;
		else
			*dst = sample;
		dst += dst_step;
		src += src_step;
	}
}

void mix_areas1(unsigned int size,
		volatile s16 *dst, s16 *src,
		volatile s32 *sum, unsigned int dst_step,
		unsigned int src_step, unsigned int sum_step)
{
	/*
	 *  ESI - src
	 *  EDI - dst
	 *  EBX - sum
	 *  ECX - old sample
	 *  EAX - sample / temporary
	 *  EDX - size
	 */
	__asm__ __volatile__ (
		"\n"

		/*
		 *  initialization, load EDX, ESI, EDI, EBX registers
		 */
		"\tmovl %0, %%edx\n"
		"\tmovl %1, %%edi\n"
		"\tmovl %2, %%esi\n"
		"\tmovl %3, %%ebx\n"

		/*
		 * while (size-- > 0) {
		 */
		"\tcmp $0, %%edx\n"
		"jz 6f\n"

		"1:"

		/*
		 *   sample = *src;
		 *   if (cmpxchg(*dst, 0, 1) == 0)
		 *     sample -= *sum;
		 *   xadd(*sum, sample);
		 */
		"\tmovw $0, %%ax\n"
		"\tmovw $1, %%cx\n"
		"\tlock; cmpxchgw %%cx, (%%edi)\n"
		"\tmovswl (%%esi), %%ecx\n"
		"\tjnz 2f\n"
		"\tsubl (%%ebx), %%ecx\n"
		"2:"
		"\tlock; addl %%ecx, (%%ebx)\n"

		/*
		 *   do {
		 *     sample = old_sample = *sum;
		 *     saturate(sample);
		 *     *dst = sample;
		 *   } while (old_sample != *sum);
		 */

		"3:"
		"\tmovl (%%ebx), %%ecx\n"
		"\tcmpl $0x7fff,%%ecx\n"
		"\tjg 4f\n"
		"\tcmpl $-0x8000,%%ecx\n"
		"\tjl 5f\n"
		"\tmovw %%cx, (%%edi)\n"
		"\tcmpl %%ecx, (%%ebx)\n"
		"\tjnz 3b\n"

		/*
		 * while (size-- > 0)
		 */
		"\tadd %4, %%edi\n"
		"\tadd %5, %%esi\n"
		"\tadd %6, %%ebx\n"
		"\tdecl %%edx\n"
		"\tjnz 1b\n"
		"\tjmp 6f\n"

		/*
		 *  sample > 0x7fff
		 */

		"4:"
		"\tmovw $0x7fff, %%ax\n"
		"\tmovw %%ax, (%%edi)\n"
		"\tcmpl %%ecx,(%%ebx)\n"
		"\tjnz 3b\n"
		"\tadd %4, %%edi\n"
		"\tadd %5, %%esi\n"
		"\tadd %6, %%ebx\n"
		"\tdecl %%edx\n"
		"\tjnz 1b\n"
		"\tjmp 6f\n"

		/*
		 *  sample < -0x8000
		 */

		"5:"
		"\tmovw $-0x8000, %%ax\n"
		"\tmovw %%ax, (%%edi)\n"
		"\tcmpl %%ecx, (%%ebx)\n"
		"\tjnz 3b\n"
		"\tadd %4, %%edi\n"
		"\tadd %5, %%esi\n"
		"\tadd %6, %%ebx\n"
		"\tdecl %%edx\n"
		"\tjnz 1b\n"
		// "\tjmp 6f\n"
		
		"6:"

		: /* no output regs */
		: "m" (size), "m" (dst), "m" (src), "m" (sum), "m" (dst_step), "m" (src_step), "m" (sum_step)
		: "esi", "edi", "edx", "ecx", "ebx", "eax"
	);
}


void mix_areas2(unsigned int size,
		volatile s16 *dst, s16 *src,
		volatile s32 *sum,
		unsigned int dst_step, unsigned int src_step)
{
	while (size-- > 0) {
		s32 sample = *src;
		if (cmpxchg(dst, 0, 1) == 0)
			sample -= *sum;
		atomic_add(sum, sample);
		do {
			sample = *sum;
			s16 s;
			if (unlikely(sample & 0xffff0000))
				s = sample > 0 ? 0x7fff : -0x8000;
			else
				s = sample;
			*dst = s;
		} while (unlikely(sample != *sum));
		sum++;
		dst += dst_step;
		src += src_step;
	}
}

int main(int argc, char **argv)
{
	int size = atoi(argv[1]);
	int n = atoi(argv[2]);
	int max = atoi(argv[3]);
	int i;
	unsigned long long begin, end;
	s16 *dst = malloc(sizeof(*dst) * size);
	s32 *sum = calloc(size, sizeof(*sum));
	s16 **srcs = malloc(sizeof(*srcs) * n);
	for (i = 0; i < n; i++) {
		int k;
		s16 *s;
		srcs[i] = s = malloc(sizeof(s16) * size);
		for (k = 0; k < size; ++k, ++s) {
			*s = (rand() % (max * 2)) - max;
		}
	}
	rdtscll(begin);
	for (i = 0; i < n; i++) {
		mix_areas0(size, dst, srcs[i], sum, 1, 1);
	}
	rdtscll(end);
	printf("mix_areas0: %lld\n", end - begin);
	rdtscll(begin);
	for (i = 0; i < n; i++) {
		mix_areas1(size, dst, srcs[i], sum, 1, 1, 1);
	}
	rdtscll(end);
	printf("mix_areas1: %lld\n", end - begin);
	rdtscll(begin);
	for (i = 0; i < n; i++) {
		mix_areas2(size, dst, srcs[i], sum, 1, 1);
	}
	rdtscll(end);
	printf("mix_areas2: %lld\n", end - begin);
}

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Re: dmix plugin
  2003-02-19 13:49             ` Abramo Bagnara
@ 2003-02-19 15:45               ` Jaroslaw Sobierski
  2003-02-19 20:39                 ` Abramo Bagnara
  2003-02-19 18:34               ` Jaroslav Kysela
  1 sibling, 1 reply; 50+ messages in thread
From: Jaroslaw Sobierski @ 2003-02-19 15:45 UTC (permalink / raw)
  To: Abramo Bagnara; +Cc: Jaroslav Kysela, alsa-devel@lists.sourceforge.net

Quoting Abramo Bagnara <abramo.bagnara@libero.it>:
> 
> The results are amazing and I'd say Jaroslav has done some mistakes in
> his handmade asm.
> 

This may be true, but I think you're trying to be a little too quick yourself.
Did you *test* your code? I only had time to take a short glance at it, but
to me it seems that this is not the correct check for overflow on signed
numbers:

>                       if (unlikely(sample & 0xffff0000))
>                                s = sample > 0 ? 0x7fff : -0x8000;
>                        else
>                                s = sample;

I noticed it because this is the first thought I had, but it only works
for unsigned. Notice that -1 will be 0xffffffff in a 32 bit sample. So
your code will "saturate" all negative samples to -0x8000, effectively
killing half of the wave, the way a diode does. I'm pretty sure this
would not sound good ;-). Still, even if you change this to two normal
ifs I assume the speed will not be affected by an order of magnitude.
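
To make the point concrete, here is a small sketch of the two checks side by
side. The helper names are made up for illustration and a 32 bit int is
assumed; it is not code taken from sum.c.

```c
#include <stdint.h>

/* Hypothetical helper: the mask-based check as posted in mix_areas2. */
static int16_t saturate_mask(int32_t sample)
{
	/* The upper 16 bits are all ones for every negative value,
	 * so this branch is taken for -1, -2, ... as well. */
	if (sample & 0xffff0000)
		return sample > 0 ? 0x7fff : -0x8000;
	return (int16_t)sample;
}

/* Hypothetical helper: the same clamp written as two ordinary ifs. */
static int16_t saturate_range(int32_t sample)
{
	if (sample > 0x7fff)
		return 0x7fff;
	if (sample < -0x8000)
		return -0x8000;
	return (int16_t)sample;
}
```

With these, saturate_mask(-1) comes out as -0x8000 while saturate_range(-1)
stays -1; for small positive samples both agree.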

Secondly, the test code is hardly a good representation of our "working"
environment because we're expecting multiple processes to write
concurrently to the buffer. I think you should have a "verification"
procedure which carefully mixes the waves one by one, and then the
n test mixes should be run in m processes concurrently and the result
compared to the "verification" table.
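
The single-process half of that idea could look roughly like the sketch
below. The names reference_mix and count_mismatches are hypothetical, and the
concurrent half (m processes writing to shared memory) is deliberately left
out.

```c
#include <stddef.h>
#include <stdint.h>

/* Reference mix: add all n sources into a 32-bit accumulator per sample,
 * then clamp once at the end. */
static void reference_mix(int16_t *out, const int16_t *const *srcs,
			  size_t n, size_t size)
{
	for (size_t k = 0; k < size; k++) {
		int32_t acc = 0;
		for (size_t i = 0; i < n; i++)
			acc += srcs[i][k];
		if (acc > 0x7fff)
			acc = 0x7fff;
		else if (acc < -0x8000)
			acc = -0x8000;
		out[k] = (int16_t)acc;
	}
}

/* Count the samples where the mixer under test disagrees with the reference. */
static size_t count_mismatches(const int16_t *got, const int16_t *want,
			       size_t size)
{
	size_t bad = 0;
	for (size_t k = 0; k < size; k++)
		if (got[k] != want[k])
			bad++;
	return bad;
}
```

A test run would fill the "verification" table with reference_mix(), run the
concurrent mixers on the same sources, and expect count_mismatches() to
return 0.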

--------------
Fycio (J.Sobierski)
 fycio@gucio.com



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Re: dmix plugin
  2003-02-19 13:49             ` Abramo Bagnara
  2003-02-19 15:45               ` Jaroslaw Sobierski
@ 2003-02-19 18:34               ` Jaroslav Kysela
  2003-02-19 21:24                 ` Jaroslav Kysela
                                   ` (3 more replies)
  1 sibling, 4 replies; 50+ messages in thread
From: Jaroslav Kysela @ 2003-02-19 18:34 UTC (permalink / raw)
  To: Abramo Bagnara; +Cc: Jaroslaw Sobierski, alsa-devel@lists.sourceforge.net

On Wed, 19 Feb 2003, Abramo Bagnara wrote:

> The results are amazing and I'd say Jaroslav has done some mistakes in
> his handmade asm.

I don't think so. It seems that my brain still remembers assembler ;-)
You passed wrong step values to my code (the steps are in bytes, not in
samples), so it did unaligned accesses.

Fixes to make things the same:

--- sum.c	2003-02-19 18:55:20.000000000 +0100
+++ a.c	2003-02-19 19:31:00.000000000 +0100
@@ -11,6 +11,8 @@
 typedef short int s16;
 typedef int s32;
 
+#define CONFIG_SMP
+
 #ifdef CONFIG_SMP
 #define LOCK_PREFIX "lock ; "
 #else
@@ -54,7 +56,7 @@
 static inline void atomic_add(volatile int *dst, int v)
 {
 	__asm__ __volatile__(
-		LOCK_PREFIX "addl %0,%1"
+		LOCK_PREFIX "addl %1,%0"
 		:"=m" (*dst)
 		:"ir" (v));
 }
@@ -62,7 +64,9 @@
 void mix_areas0(unsigned int size,
 		volatile s16 *dst, s16 *src,
 		volatile s32 *sum,
-		unsigned int dst_step, unsigned int src_step)
+		unsigned int dst_step,
+		unsigned int src_step,
+		unsigned int sum_step)
 {
 	while (size-- > 0) {
 		s32 sample = *dst + *src;
@@ -70,8 +74,8 @@
 			*dst = sample > 0 ? 0x7fff : -0x8000;
 		else
 			*dst = sample;
-		dst += dst_step;
-		src += src_step;
+		((char *)dst) += dst_step;
+		((char *)src) += src_step;
 	}
 }
 
@@ -194,7 +198,9 @@
 void mix_areas2(unsigned int size,
 		volatile s16 *dst, s16 *src,
 		volatile s32 *sum,
-		unsigned int dst_step, unsigned int src_step)
+		unsigned int dst_step,
+		unsigned int src_step,
+		unsigned int sum_step)
 {
 	while (size-- > 0) {
 		s32 sample = *src;
@@ -204,15 +210,15 @@
 		do {
 			sample = *sum;
 			s16 s;
-			if (unlikely(sample & 0xffff0000))
+			if (unlikely(sample & 0x7fff0000))
 				s = sample > 0 ? 0x7fff : -0x8000;
 			else
 				s = sample;
 			*dst = s;
 		} while (unlikely(sample != *sum));
-		sum++;
-		dst += dst_step;
-		src += src_step;
+		((char *)sum) += sum_step;
+		((char *)dst) += dst_step;
+		((char *)src) += src_step;
 	}
 }
 
@@ -236,19 +242,19 @@
 	}
 	rdtscll(begin);
 	for (i = 0; i < n; i++) {
-		mix_areas0(size, dst, srcs[i], sum, 1, 1);
+		mix_areas0(size, dst, srcs[i], sum, 2, 2, 4);
 	}
 	rdtscll(end);
 	printf("mix_areas0: %lld\n", end - begin);
 	rdtscll(begin);
 	for (i = 0; i < n; i++) {
-		mix_areas1(size, dst, srcs[i], sum, 1, 1, 1);
+		mix_areas1(size, dst, srcs[i], sum, 2, 2, 4);
 	}
 	rdtscll(end);
 	printf("mix_areas1: %lld\n", end - begin);
 	rdtscll(begin);
 	for (i = 0; i < n; i++) {
-		mix_areas2(size, dst, srcs[i], sum, 1, 1);
+		mix_areas2(size, dst, srcs[i], sum, 2, 2, 4);
 	}
 	rdtscll(end);
 	printf("mix_areas2: %lld\n", end - begin);

perex@pnote:~> cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 8
model name      : Pentium III (Coppermine)
stepping        : 6
cpu MHz         : 847.473
cache size      : 256 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov 
pat pse36 mmx fxsr sse
bogomips        : 1679.36

perex@pnote:~> ./a.out 2048 4 32267
mix_areas0: 170691
mix_areas1: 675795
mix_areas2: 708995


					Have fun,
						Jaroslav

-----
Jaroslav Kysela <perex@suse.cz>
Linux Kernel Sound Maintainer
ALSA Project, SuSE Labs




^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Re: dmix plugin
  2003-02-19 15:45               ` Jaroslaw Sobierski
@ 2003-02-19 20:39                 ` Abramo Bagnara
  0 siblings, 0 replies; 50+ messages in thread
From: Abramo Bagnara @ 2003-02-19 20:39 UTC (permalink / raw)
  To: Jaroslaw Sobierski; +Cc: Jaroslav Kysela, alsa-devel@lists.sourceforge.net

Jaroslaw Sobierski wrote:
> 
> Quoting Abramo Bagnara <abramo.bagnara@libero.it>:
> >
> > The results are amazing and I'd say Jaroslav has done some mistakes in
> > his handmade asm.
> >
> 
> This may be true, but I think you're trying to be a little too quick yourself.

No doubt about that, I was in a hurry.

> Did you *test* your code? I only had time to take a short glance at it, but
> to me it seems that this is not the correct check for overflow on signed
> numbers:
> 
> >                       if (unlikely(sample & 0xffff0000))
> >                                s = sample > 0 ? 0x7fff : -0x8000;
> >                        else
> >                                s = sample;
> 
> I noticed it because this is the first thought I had, but it only works
> for unsigned. Notice that -1 will be 0xffffffff in a 32 bit sample. So
> your code will "saturate" all negative samples to -0x8000, effectively
> killing half of the wave, the way a diode does. I'm pretty sure this
> would not sound good ;-). Still, even if you change this to two normal
> ifs I assume the speed will not be affected by an order of magnitude.
> 
> Secondly, the test code is hardly a good representation of our "working"
> environment because we're expecting multiple processes to write
> concurrently to the buffer. I think you should have a "verification"
> procedure which carefully mixes the waves one by one, and then the
> n test mixes should be run in m processes concurrently and the result
> compared to the "verification" table.

This is best tested on an SMP machine and I don't have easy access to
one.

That aside, you're perfectly right and this was exactly my intention.

-- 
Abramo Bagnara                       mailto:abramo.bagnara@libero.it

Opera Unica                          Phone: +39.546.656023
Via Emilia Interna, 140
48014 Castel Bolognese (RA) - Italy



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Re: dmix plugin
  2003-02-19 18:34               ` Jaroslav Kysela
@ 2003-02-19 21:24                 ` Jaroslav Kysela
  2003-02-20  8:28                 ` Abramo Bagnara
                                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 50+ messages in thread
From: Jaroslav Kysela @ 2003-02-19 21:24 UTC (permalink / raw)
  To: Abramo Bagnara; +Cc: Jaroslaw Sobierski, alsa-devel@lists.sourceforge.net

On Wed, 19 Feb 2003, Jaroslav Kysela wrote:

> perex@pnote:~> cat /proc/cpuinfo
> processor       : 0
> vendor_id       : GenuineIntel
> cpu family      : 6
> model           : 8
> model name      : Pentium III (Coppermine)
> stepping        : 6
> cpu MHz         : 847.473
> cache size      : 256 KB
> fdiv_bug        : no
> hlt_bug         : no
> f00f_bug        : no
> coma_bug        : no
> fpu             : yes
> fpu_exception   : yes
> cpuid level     : 2
> wp              : yes
> flags           : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov 
> pat pse36 mmx fxsr sse
> bogomips        : 1679.36
> 
> perex@pnote:~> ./a.out 2048 4 32267
> mix_areas0: 170691
> mix_areas1: 675795
> mix_areas2: 708995

More results (with MMX code):

perex@pnote:~/alsa/alsa-lib/test> ./code 2048 4 32767
mix_areas0    : 172345
mix_areas1    : 677021
mix_areas1_mmx: 620597
mix_areas2    : 702227

Note - the test utility is in CVS - alsa-lib/test/code.c - now.

						Jaroslav

-----
Jaroslav Kysela <perex@suse.cz>
Linux Kernel Sound Maintainer
ALSA Project, SuSE Labs




^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Re: dmix plugin
  2003-02-19 18:34               ` Jaroslav Kysela
  2003-02-19 21:24                 ` Jaroslav Kysela
@ 2003-02-20  8:28                 ` Abramo Bagnara
  2003-02-20  8:30                 ` Jaroslaw Sobierski
  2003-02-20  8:53                 ` Abramo Bagnara
  3 siblings, 0 replies; 50+ messages in thread
From: Abramo Bagnara @ 2003-02-20  8:28 UTC (permalink / raw)
  To: Jaroslav Kysela; +Cc: Jaroslaw Sobierski, alsa-devel@lists.sourceforge.net

Jaroslav Kysela wrote:
> 
> On Wed, 19 Feb 2003, Abramo Bagnara wrote:
> 
> > The results are amazing and I'd say Jaroslav has done some mistakes in
> > his handmade asm.
> 
> I don't think so. It seems that my brain still remembers assembler ;-)

I've no doubts about that ;-)

> You passed wrong values to my code so it did unaligned accesses.

I guessed that but I was too lazy to deeply analyze your asm.

> Fixes to make things same:

>                 volatile s32 *sum,
> -               unsigned int dst_step, unsigned int src_step)
> +               unsigned int dst_step,
> +               unsigned int src_step,
> +               unsigned int sum_step)

sum_step is useless; I've deliberately removed it.
Please do the same in your code.

> +               ((char *)dst) += dst_step;
> +               ((char *)src) += src_step;

IMHO it's sane to assume that the step is a multiple of the sample
size.
However, this should not have any impact on efficiency (at least I
believe so).
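
As a side note, the ((char *)dst) += dst_step spelling in the patch relies on
GCC's old cast-as-lvalue extension, which current compilers reject. A portable
way to advance by a byte step might look like the sketch below; advance_bytes
and sum_strided are made-up names, and the step is assumed to be a multiple of
sizeof(int16_t) so that the resulting pointer stays aligned.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: advance a sample pointer by a step given in bytes. */
static const int16_t *advance_bytes(const int16_t *p, size_t step_bytes)
{
	return (const int16_t *)((const char *)p + step_bytes);
}

/* Add up count samples that lie step_bytes apart, for example one channel
 * of an interleaved stereo buffer when step_bytes is 4. */
static int32_t sum_strided(const int16_t *src, size_t count, size_t step_bytes)
{
	int32_t total = 0;
	while (count-- > 0) {
		total += *src;
		src = advance_bytes(src, step_bytes);
	}
	return total;
}
```

A step of 1 here would move the pointer into the middle of a sample, which is
the unaligned access mentioned earlier in the thread; a step of 2 walks a mono
buffer one sample at a time.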

> -                       if (unlikely(sample & 0xffff0000))
> +                       if (unlikely(sample & 0x7fff0000))

As Jaroslaw has written, this is a mistake, and I've verified that the
right version has no speed benefit.

-- 
Abramo Bagnara                       mailto:abramo.bagnara@libero.it

Opera Unica                          Phone: +39.546.656023
Via Emilia Interna, 140
48014 Castel Bolognese (RA) - Italy



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Re: dmix plugin
  2003-02-19 18:34               ` Jaroslav Kysela
  2003-02-19 21:24                 ` Jaroslav Kysela
  2003-02-20  8:28                 ` Abramo Bagnara
@ 2003-02-20  8:30                 ` Jaroslaw Sobierski
  2003-02-20  8:48                   ` Abramo Bagnara
  2003-02-20  8:53                 ` Abramo Bagnara
  3 siblings, 1 reply; 50+ messages in thread
From: Jaroslaw Sobierski @ 2003-02-20  8:30 UTC (permalink / raw)
  To: Jaroslav Kysela; +Cc: Abramo Bagnara, alsa-devel@lists.sourceforge.net

Quoting Jaroslav Kysela <perex@suse.cz>:

> I don't think so. It seems that my brain still remembers assembler ;-)
...
>  			sample = *sum;
>  			s16 s;
> -			if (unlikely(sample & 0xffff0000))
> +			if (unlikely(sample & 0x7fff0000))
>  				s = sample > 0 ? 0x7fff : -0x8000;
>  			else
>  				s = sample;

I think I remember some of the x86 assembly myself and this correction
does not fix the problem. This code will still "saturate" all negative
samples to -0x8000. You cannot detect an overflow into the upper half of
the register with a simple bitwise and. The actual test should be as
follows:
- extend the sign of the lower half
- check whether the upper half is the same as the result of the extension
 if it is - there is no overflow
 if it differs - there was overflow and you need to saturate.
examples:
value 0x 0000 0335
ext   0x 0000 0335
  -> no overflow

value 0x 0002 43b1
ext   0x 0000 43b1
  -> overflow

value 0x ffff f25b
ext   0x ffff f25b
  -> no overflow

value 0x ff1c 35c9
ext   0x 0000 35c9
  -> overflow

to put it in asm:

mov ebx,eax
cwde
cmp eax,ebx

The problem is cwde operates only on ax/eax.
This may sound complicated but in fact it amounts to a very simple
question: does the sample fit in a 16 bit int, or does it not? So
I guess in C it could look something like:

    s16 s=sample;
    if (unlikely(sample != (s32)s))

The cast is just there for clarity; I believe it would be done
implicitly anyway. But don't take my word for it - I did not
test this.
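
Spelled out a bit more, the round-trip check could be sketched as below. The
names fits_in_s16 and saturate_s16 are made up, and narrowing an out-of-range
value to 16 bits is implementation-defined in C; the sketch assumes the usual
behaviour of keeping the low half, as gcc on x86 does.

```c
#include <stdint.h>

/* Hypothetical helper: 1 if sample survives a round trip through 16 bits. */
static int fits_in_s16(int32_t sample)
{
	int16_t s = (int16_t)sample;	/* keep the low half only */
	return sample == (int32_t)s;	/* sign-extend and compare */
}

/* Hypothetical helper: clamp only when the round trip changed the value. */
static int16_t saturate_s16(int32_t sample)
{
	if (fits_in_s16(sample))
		return (int16_t)sample;
	return sample > 0 ? 0x7fff : -0x8000;
}
```

For the four example values above, fits_in_s16() gives 1, 0, 1 and 0 in that
order, matching the "no overflow" / "overflow" annotations.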

--------------
Fycio (J.Sobierski)
 fycio@gucio.com



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Re: dmix plugin
  2003-02-20  8:30                 ` Jaroslaw Sobierski
@ 2003-02-20  8:48                   ` Abramo Bagnara
  0 siblings, 0 replies; 50+ messages in thread
From: Abramo Bagnara @ 2003-02-20  8:48 UTC (permalink / raw)
  To: Jaroslaw Sobierski; +Cc: Jaroslav Kysela, alsa-devel

Jaroslaw Sobierski wrote:
> 

> 
>     s16 s=sample;
>     if (unlikely(sample != (s32)s))
> 

I verified exactly this yesterday evening, but it's less efficient
than an ordinary boundary check.

-- 
Abramo Bagnara                       mailto:abramo.bagnara@libero.it

Opera Unica                          Phone: +39.546.656023
Via Emilia Interna, 140
48014 Castel Bolognese (RA) - Italy



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Re: dmix plugin
  2003-02-19 18:34               ` Jaroslav Kysela
                                   ` (2 preceding siblings ...)
  2003-02-20  8:30                 ` Jaroslaw Sobierski
@ 2003-02-20  8:53                 ` Abramo Bagnara
  2003-02-20 16:49                   ` Jaroslav Kysela
  3 siblings, 1 reply; 50+ messages in thread
From: Abramo Bagnara @ 2003-02-20  8:53 UTC (permalink / raw)
  To: Jaroslav Kysela; +Cc: Jaroslaw Sobierski, alsa-devel@lists.sourceforge.net

[-- Attachment #1: Type: text/plain, Size: 893 bytes --]

Jaroslav Kysela wrote:
> 
> On Wed, 19 Feb 2003, Abramo Bagnara wrote:
> 
> > The results are amazing and I'd say Jaroslav has done some mistakes in
> > his handmade asm.
> 
> I don't think so. It seems that my brain still remembers assembler ;-)
> You passed wrong values to my code so it did unaligned accesses.
> 
> Fixes to make things same:

I've made the needed changes in my version of sum.c to get correct
results from the asm version, but I'm still unable to get good
performance numbers from it.

I'm puzzled...

$ ./sum 2048 8 32768
CPU clock: 1460474444.671998
mix_areas0: 90773 0.033459%
mix_areas1: 141173 0.052036% (1103)
mix_areas2: 870134 0.320731% (0)
mix_areas3: 343792 0.126722% (0)
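
For reference, the percentage column is the share of one CPU that mixing a
stereo 44.1 kHz stream in real time would take, using the expression from the
printf calls in the attached sum.c. Pulled out into a helper it could look
like the sketch below; cpu_share_percent is a made-up name, not a function in
the attachment.

```c
/* Hypothetical helper mirroring the printf expression in sum.c:
 * 100 * 2 * 44100.0 * ticks / (size * n * cpu_clock). */
static double cpu_share_percent(unsigned long long ticks, unsigned int size,
				unsigned int n, double cpu_clock_hz)
{
	const double samples_per_second = 2 * 44100.0;	/* stereo, 44.1 kHz */
	double ticks_per_sample = (double)ticks / ((double)size * n);

	return 100.0 * samples_per_second * ticks_per_sample / cpu_clock_hz;
}
```

Feeding it the mix_areas0 line above (90773 ticks, size 2048, n 8 and the
detected CPU clock) reproduces the printed value of about 0.0335%.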


-- 
Abramo Bagnara                       mailto:abramo.bagnara@libero.it

Opera Unica                          Phone: +39.546.656023
Via Emilia Interna, 140
48014 Castel Bolognese (RA) - Italy

[-- Attachment #2: sum.c --]
[-- Type: text/plain, Size: 7213 bytes --]

#include <stdlib.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/time.h>

#define rdtscll(val) \
     __asm__ __volatile__("rdtsc" : "=A" (val))

#define likely(x)       __builtin_expect((x),1)
#define unlikely(x)     __builtin_expect((x),0)

typedef short int s16;
typedef int s32;

#ifdef CONFIG_SMP
#define LOCK_PREFIX "lock ; "
#else
#define LOCK_PREFIX ""
#endif

struct __xchg_dummy { unsigned long a[100]; };
#define __xg(x) ((struct __xchg_dummy *)(x))

static inline unsigned long __cmpxchg(volatile void *ptr, unsigned long old,
				      unsigned long new, int size)
{
	unsigned long prev;
	switch (size) {
	case 1:
		__asm__ __volatile__(LOCK_PREFIX "cmpxchgb %b1,%2"
				     : "=a"(prev)
				     : "q"(new), "m"(*__xg(ptr)), "0"(old)
				     : "memory");
		return prev;
	case 2:
		__asm__ __volatile__(LOCK_PREFIX "cmpxchgw %w1,%2"
				     : "=a"(prev)
				     : "q"(new), "m"(*__xg(ptr)), "0"(old)
				     : "memory");
		return prev;
	case 4:
		__asm__ __volatile__(LOCK_PREFIX "cmpxchgl %1,%2"
				     : "=a"(prev)
				     : "q"(new), "m"(*__xg(ptr)), "0"(old)
				     : "memory");
		return prev;
	}
	return old;
}

#define cmpxchg(ptr,o,n)\
	((__typeof__(*(ptr)))__cmpxchg((ptr),(unsigned long)(o),\
					(unsigned long)(n),sizeof(*(ptr))))

static inline void atomic_add(volatile int *dst, int v)
{
	__asm__ __volatile__(
		LOCK_PREFIX "addl %1,%0"
		:"=m" (*dst)
		:"ir" (v), "m" (*dst));
}


static double
detect_cpu_clock()
{
        struct timeval tm_begin, tm_end;
        unsigned long long tsc_begin, tsc_end;

        /* Warm cache */
        gettimeofday(&tm_begin, 0);

        rdtscll(tsc_begin);
        gettimeofday(&tm_begin, 0);

        usleep(1000000);

        rdtscll(tsc_end);
        gettimeofday(&tm_end, 0);

        return (tsc_end - tsc_begin) / (tm_end.tv_sec - tm_begin.tv_sec + (tm_end.tv_usec - tm_begin.tv_usec) / 1e6);
}

void mix_areas0(unsigned int size,
		const s16 *src,
		volatile s32 *sum,
		unsigned int src_step)
{
	while (size-- > 0) {
		atomic_add(sum, *src);
		(char*)src += src_step;
		sum++;
	}
}

void saturate(unsigned int size,
	      s16 *dst, const s32 *sum,
	      unsigned int dst_step)
{
	while (size-- > 0) {
		s32 sample = *sum;
		if (unlikely(sample < -0x8000))
			*dst = -0x8000;
		else if (unlikely(sample > 0x7fff))
			*dst = 0x7fff;
		else
			*dst = sample;
		(char*)dst += dst_step;
		sum++;
	}
}

void mix_areas1(unsigned int size,
		volatile s16 *dst, const s16 *src,
		unsigned int dst_step, unsigned int src_step)
{
	while (size-- > 0) {
		s32 sample = *dst + *src;
		if (unlikely(sample < -0x8000))
			*dst = -0x8000;
		else if (unlikely(sample > 0x7fff))
			*dst = 0x7fff;
		else
			*dst = sample;
		(char*)dst += dst_step;
		(char*)src += src_step;
	}
}

void mix_areas2(unsigned int size,
		volatile s16 *dst, const s16 *src,
		volatile s32 *sum, unsigned int dst_step,
		unsigned int src_step, unsigned int sum_step)
{
	/*
	 *  ESI - src
	 *  EDI - dst
	 *  EBX - sum
	 *  ECX - old sample
	 *  EAX - sample / temporary
	 *  EDX - size
	 */
	__asm__ __volatile__ (
		"\n"

		/*
		 *  initialization, load EDX, ESI, EDI, EBX registers
		 */
		"\tmovl %0, %%edx\n"
		"\tmovl %1, %%edi\n"
		"\tmovl %2, %%esi\n"
		"\tmovl %3, %%ebx\n"

		/*
		 * while (size-- > 0) {
		 */
		"\tcmp $0, %%edx\n"
		"jz 6f\n"

		"1:"

		/*
		 *   sample = *src;
		 *   if (cmpxchg(*dst, 0, 1) == 0)
		 *     sample -= *sum;
		 *   xadd(*sum, sample);
		 */
		"\tmovw $0, %%ax\n"
		"\tmovw $1, %%cx\n"
		"\tlock; cmpxchgw %%cx, (%%edi)\n"
		"\tmovswl (%%esi), %%ecx\n"
		"\tjnz 2f\n"
		"\tsubl (%%ebx), %%ecx\n"
		"2:"
		"\tlock; addl %%ecx, (%%ebx)\n"

		/*
		 *   do {
		 *     sample = old_sample = *sum;
		 *     saturate(sample);
		 *     *dst = sample;
		 *   } while (old_sample != *sum);
		 */

		"3:"
		"\tmovl (%%ebx), %%ecx\n"
		"\tcmpl $0x7fff,%%ecx\n"
		"\tjg 4f\n"
		"\tcmpl $-0x8000,%%ecx\n"
		"\tjl 5f\n"
		"\tmovw %%cx, (%%edi)\n"
		"\tcmpl %%ecx, (%%ebx)\n"
		"\tjnz 3b\n"

		/*
		 * while (size-- > 0)
		 */
		"\tadd %4, %%edi\n"
		"\tadd %5, %%esi\n"
		"\tadd %6, %%ebx\n"
		"\tdecl %%edx\n"
		"\tjnz 1b\n"
		"\tjmp 6f\n"

		/*
		 *  sample > 0x7fff
		 */

		"4:"
		"\tmovw $0x7fff, %%ax\n"
		"\tmovw %%ax, (%%edi)\n"
		"\tcmpl %%ecx,(%%ebx)\n"
		"\tjnz 3b\n"
		"\tadd %4, %%edi\n"
		"\tadd %5, %%esi\n"
		"\tadd %6, %%ebx\n"
		"\tdecl %%edx\n"
		"\tjnz 1b\n"
		"\tjmp 6f\n"

		/*
		 *  sample < -0x8000
		 */

		"5:"
		"\tmovw $-0x8000, %%ax\n"
		"\tmovw %%ax, (%%edi)\n"
		"\tcmpl %%ecx, (%%ebx)\n"
		"\tjnz 3b\n"
		"\tadd %4, %%edi\n"
		"\tadd %5, %%esi\n"
		"\tadd %6, %%ebx\n"
		"\tdecl %%edx\n"
		"\tjnz 1b\n"
		// "\tjmp 6f\n"
		
		"6:"

		: /* no output regs */
		: "m" (size), "m" (dst), "m" (src), "m" (sum), "m" (dst_step), "m" (src_step), "m" (sum_step)
		: "esi", "edi", "edx", "ecx", "ebx", "eax", "memory", "cc"
	);
}


void mix_areas3(unsigned int size,
		volatile s16 *dst, const s16 *src,
		volatile s32 *sum,
		unsigned int dst_step, unsigned int src_step)
{
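	while (size-- > 0) {
		s32 sample = *src;
		/* dst still 0: nobody has mixed into this sample yet, so *sum is
		 * stale; subtracting it makes the add below store only our sample */
		if (cmpxchg(dst, 0, 1) == 0)
			sample -= *sum;
		atomic_add(sum, sample);
		/* redo the saturated store if another writer changed *sum meanwhile */
		do {
			sample = *sum;
			if (unlikely(sample < -0x8000))
				*dst = -0x8000;
			else if (unlikely(sample > 0x7fff))
				*dst = 0x7fff;
			else
				*dst = sample;
		} while (unlikely(sample != *sum));
		sum++;
		/* steps are in bytes; a cast used as an lvalue is an old GCC
		 * extension that newer compilers reject */
		dst = (volatile s16 *)((volatile char *)dst + dst_step);
		src = (const s16 *)((const char *)src + src_step);
	}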
}

int compare(const s16* b1, const s16 *b2, unsigned int size)
{
	unsigned int c = 0;
	while (size-- > 0) {
		if (*b1 != *b2)
			c++;
		b1++;
		b2++;
	}
	return c;
}

int main(int argc, char **argv)
{
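	int size = argc > 3 ? atoi(argv[1]) : 0;
	int n = argc > 3 ? atoi(argv[2]) : 0;
	int max = argc > 3 ? atoi(argv[3]) : 0;
	int i;
	unsigned long long begin, end;
	s16 *dst, *check, **srcs;
	s32 *sum;
	double cpu_clock;

	/* all three arguments are required; max must be positive because
	 * the sources are filled with rand() % (max * 2) */
	if (size <= 0 || n <= 0 || max <= 0) {
		fprintf(stderr, "usage: %s <size> <n> <max>\n", argv[0]);
		return 1;
	}
	dst = malloc(sizeof(*dst) * size);
	check = malloc(sizeof(*check) * size);
	sum = malloc(sizeof(*sum) * size);
	srcs = malloc(sizeof(*srcs) * n);
	cpu_clock = detect_cpu_clock();
	printf("CPU clock: %f\n", cpu_clock);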
	for (i = 0; i < n; i++) {
		int k;
		s16 *s;
		srcs[i] = s = malloc(sizeof(s16) * size);
		for (k = 0; k < size; ++k, ++s) {
			*s = (rand() % (max * 2)) - max;
		}
	}

	memset(sum, 0, sizeof(*sum) * size);
	rdtscll(begin);
	for (i = 0; i < n; i++) {
		mix_areas0(size, srcs[i], sum, 2);
	}
	saturate(size, check, sum, 2);
	rdtscll(end);
	printf("mix_areas0: %llu %f%%\n", end - begin, 100*2*44100.0*(end - begin)/(size*n*cpu_clock));

	memset(dst, 0, sizeof(*dst) * size);
	rdtscll(begin);
	for (i = 0; i < n; i++) {
		mix_areas1(size, dst, srcs[i], 2, 2);
	}
	rdtscll(end);
	printf("mix_areas1: %llu %f%% (%d)\n", end - begin, 100*2*44100.0*(end - begin)/(size*n*cpu_clock), compare(dst, check, size));

	memset(sum, 0, sizeof(*sum) * size);
	rdtscll(begin);
	for (i = 0; i < n; i++) {
		mix_areas2(size, dst, srcs[i], sum, 2, 2, 4);
	}
	rdtscll(end);
	printf("mix_areas2: %llu %f%% (%d)\n", end - begin, 100*2*44100.0*(end - begin)/(size*n*cpu_clock), compare(dst, check, size));

	memset(sum, 0, sizeof(*sum) * size);
	rdtscll(begin);
	for (i = 0; i < n; i++) {
		mix_areas3(size, dst, srcs[i], sum, 2, 2);
	}
	rdtscll(end);
	printf("mix_areas3: %llu %f%% (%d)\n", end - begin, 100*2*44100.0*(end - begin)/(size*n*cpu_clock), compare(dst, check, size));
	return 0;
}
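
As a reading aid (not part of the original sum.c): the percentage printed
next to each tick count appears to be the share of one CPU needed to mix a
single 44.1 kHz stereo stream in real time. The helper name cpu_percent is
hypothetical; it only restates the expression used in the printf calls above.

```c
/*
 * Hypothetical helper mirroring the expression used in main():
 * ticks / (size * n) is the cost of mixing one sample from one source,
 * and 2 * 44100 is the number of samples per second in a 44.1 kHz stereo
 * stream, so the result is a percentage of one CPU running at cpu_clock Hz.
 */
static double cpu_percent(unsigned long long ticks, int size, int n,
			  double cpu_clock)
{
	return 100 * 2 * 44100.0 * ticks / ((double)size * n * cpu_clock);
}
```

With ticks = 90773, size = 2048, n = 8 and a clock of 1460474444.671998 Hz
this gives roughly 0.0335%, in line with the mix_areas0 figure reported for
"./sum 2048 8 32768" in this thread.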

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Re: dmix plugin
  2003-02-20  8:53                 ` Abramo Bagnara
@ 2003-02-20 16:49                   ` Jaroslav Kysela
  2003-02-20 17:57                     ` Abramo Bagnara
  0 siblings, 1 reply; 50+ messages in thread
From: Jaroslav Kysela @ 2003-02-20 16:49 UTC (permalink / raw)
  To: Abramo Bagnara; +Cc: Jaroslaw Sobierski, alsa-devel@lists.sourceforge.net

On Thu, 20 Feb 2003, Abramo Bagnara wrote:

> Jaroslav Kysela wrote:
> > 
> > On Wed, 19 Feb 2003, Abramo Bagnara wrote:
> > 
> > > The results are amazing and I'd say Jaroslav has done some mistakes in
> > > his handmade asm.
> > 
> > I don't think so. It seems that my brain still remembers assembler ;-)
> > You passed wrong values to my code so it did unaligned accesses.
> > 
> > Fixes to make things same:
> 
> I've done the needed changes in my version of sum.c to get correct
> results from asm version, but I'm still unable to get from it good
> performance numbers.
> 
> I'm puzzled...
> 
> $ ./sum 2048 8 32768
> CPU clock: 1460474444.671998
> mix_areas0: 90773 0.033459%
> mix_areas1: 141173 0.052036% (1103)
> mix_areas2: 870134 0.320731% (0)
> mix_areas3: 343792 0.126722% (0)

1) my asm code uses the lock prefix, so its timing differs hugely between
   UP and MP on i386
2) we need to clear the dst and sum buffers so that all routines work with
   the same values
3) we need to clear the CPU caches

I've committed an updated alsa-lib/test/code.c which solves all these
problems, and I've added further optimizations to my asm routine. The
results are below (not impressive, but I'm better than GCC, especially
when using the MMX saturation instruction):

pnote:/home/perex/alsa/alsa-lib/test # ./code 2048 8 32768
Scheduler set to Round Robin with priority 99...
CPU clock: 847.293134Mhz (UP)

Summary (the best times):
mix_areas0    : 548456
mix_areas1    : 863636
mix_areas1_mmx: 629765
mix_areas2    : 910819

pnote:/home/perex/alsa/alsa-lib/test # ./code 2048 8 32768
Scheduler set to Round Robin with priority 99...
CPU clock: 847.293395Mhz (SMP)

Summary (the best times):
mix_areas0    : 562342
mix_areas1    : 1705274
mix_areas1_mmx: 1565539
mix_areas2    : 1735491

						Jaroslav

-----
Jaroslav Kysela <perex@suse.cz>
Linux Kernel Sound Maintainer
ALSA Project, SuSE Labs



-------------------------------------------------------
This SF.net email is sponsored by: SlickEdit Inc. Develop an edge.
The most comprehensive and flexible code editor you can use.
Code faster. C/C++, C#, Java, HTML, XML, many more. FREE 30-Day Trial.
www.slickedit.com/sourceforge

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Re: dmix plugin
  2003-02-20 16:49                   ` Jaroslav Kysela
@ 2003-02-20 17:57                     ` Abramo Bagnara
  2003-02-20 18:26                       ` Paul Davis
  2003-02-20 19:55                       ` Jaroslav Kysela
  0 siblings, 2 replies; 50+ messages in thread
From: Abramo Bagnara @ 2003-02-20 17:57 UTC (permalink / raw)
  To: Jaroslav Kysela; +Cc: Jaroslaw Sobierski, alsa-devel@lists.sourceforge.net

Jaroslav Kysela wrote:
> 
> On Thu, 20 Feb 2003, Abramo Bagnara wrote:
> 
> > Jaroslav Kysela wrote:
> > >
> > > On Wed, 19 Feb 2003, Abramo Bagnara wrote:
> > >
> > > > The results are amazing and I'd say Jaroslav has done some mistakes in
> > > > his handmade asm.
> > >
> > > I don't think so. It seems that my brain still remembers assembler ;-)
> > > You passed wrong values to my code so it did unaligned accesses.
> > >
> > > Fixes to make things same:
> >
> > I've done the needed changes in my version of sum.c to get correct
> > results from asm version, but I'm still unable to get from it good
> > performance numbers.
> >
> > I'm puzzled...
> >
> > $ ./sum 2048 8 32768
> > CPU clock: 1460474444.671998
> > mix_areas0: 90773 0.033459%
> > mix_areas1: 141173 0.052036% (1103)
> > mix_areas2: 870134 0.320731% (0)
> > mix_areas3: 343792 0.126722% (0)
> 
> 1) my asm code used lock prefix so there are huge differences in code for
>    UP and MP on i386

Indeed, this made the difference.

> 2) we need to clear dst and sum buffers to work with same values for all
>    routines

This was present in sum.c

> 3) we need to clear the CPU caches

This has a negligible impact in sum.c.

> I've commited updated alsa-lib/test/code.c which solves all these troubles
> and I've added next optimizations to my asm routine and results are (not
> impressive, but I'm better than GCC, especially using MMX
> saturation instruction):

Now I'm able to get the same results you see.

However, I think that we need to draw some conclusions from this data.

I'll leave the MMX optimizations aside because I want to compare apples with
apples.

The distributed saturation (even when it lacks the check/repeat part needed
for concurrency correctness) costs more than 4 times the ticks needed for a
saturate-once approach (which is fully correct wrt concurrency) for the
case 2048 8 32768.

CPU clock: 1460477150.884593
mix_areas0: 86747 0.031975%
mix_areas1: 259424 0.095623% (0)
mix_areas1_mmx: 253894 0.093585% (0)
mix_areas2: 132321 0.048773% (365)
mix_areas3: 332411 0.122526% (0)

The server-based approach has the added cost of an extra context switch
every period (about 1500 cycles on my machine, for example), but this is
fully amortized by such a huge difference.

What's your opinion?

-- 
Abramo Bagnara                       mailto:abramo.bagnara@libero.it

Opera Unica                          Phone: +39.546.656023
Via Emilia Interna, 140
48014 Castel Bolognese (RA) - Italy



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Re: dmix plugin
  2003-02-20 17:57                     ` Abramo Bagnara
@ 2003-02-20 18:26                       ` Paul Davis
  2003-02-20 22:14                         ` Abramo Bagnara
  2003-02-20 19:55                       ` Jaroslav Kysela
  1 sibling, 1 reply; 50+ messages in thread
From: Paul Davis @ 2003-02-20 18:26 UTC (permalink / raw)
  To: alsa-devel@lists.sourceforge.net

>The server based approach has an added cost of an extra context switch
>every period (about 1500 cycles on my machine i.e.), but this is fully
>amortized by such an huge difference.

recall that (1) the context switch time is not a fixed cost but
depends on the memory behaviour between switches and (2) isn't it
either two switches per participating client/application, or if they
are chained (as in JACK), N+2 switches, where N is the number of
clients/applications ?




^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Re: dmix plugin
  2003-02-20 17:57                     ` Abramo Bagnara
  2003-02-20 18:26                       ` Paul Davis
@ 2003-02-20 19:55                       ` Jaroslav Kysela
  2003-02-20 21:19                         ` tomasz motylewski
                                           ` (2 more replies)
  1 sibling, 3 replies; 50+ messages in thread
From: Jaroslav Kysela @ 2003-02-20 19:55 UTC (permalink / raw)
  To: Abramo Bagnara; +Cc: Jaroslaw Sobierski, alsa-devel@lists.sourceforge.net

On Thu, 20 Feb 2003, Abramo Bagnara wrote:

> Now I'm able to get the same results you see.
> 
> However I think that we need to extract some results from this data.
> 
> I'll leave alone MMX optimizations because I want to compare apples with
> apples.
> 
> The distributed saturation (also when it's missing the check/repeat
> concurrency correctness part) costs more than 4 times the ticks needed
> for a (fully correct wrt concurrency) saturate once approach for the
> case 2048 8 32768.
> 
> CPU clock: 1460477150.884593
> mix_areas0: 86747 0.031975%
> mix_areas1: 259424 0.095623% (0)
> mix_areas1_mmx: 253894 0.093585% (0)
> mix_areas2: 132321 0.048773% (365)
> mix_areas3: 332411 0.122526% (0)
> 
> The server based approach has an added cost of an extra context switch
> every period (about 1500 cycles on my machine i.e.), but this is fully
> amortized by such an huge difference.
> 
> What's your opinion?

Interestingly, my Intel P3 CPU gives slightly different times:

pnote:/home/perex/alsa/alsa-lib/test # ./code 2048 8 32768
Scheduler set to Round Robin with priority 99...
CPU clock: 847.292487Mhz (UP)

Summary (the best times):
mix_areas_srv : 576382 0.366206%
mix_areas0    : 556852 0.353798%
mix_areas1    : 867989 0.551480%
mix_areas1_mmx: 625144 0.397187%
mix_areas2    : 903335 0.573937%

areas1/srv ratio     : 1.505927
areas1_mmx/srv ratio : 1.084600

I think that we can lose more in the client/server model. Also, note that
we could even use futexes (if the possible context switch is acceptable);
then we could remove the cmpxchg trick and the write-retry trick and use
MMX for parallel saturation of two samples (the latter can be used in the
client/server model, too).
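
For readers who have not followed the earlier messages, the cmpxchg trick
and the write-retry trick can be sketched in portable C11 atomics roughly as
follows. This is only an illustration modelled on the mix_areas3 pseudocode
from this thread, not the code in alsa-lib; the names clamp16 and
mix_lockfree are hypothetical, and contiguous buffers are assumed instead of
byte steps.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp a 32-bit sum to the signed 16-bit sample range. */
static int16_t clamp16(int32_t v)
{
	if (v > 0x7fff)
		return 0x7fff;
	if (v < -0x8000)
		return -0x8000;
	return (int16_t)v;
}

/* Hypothetical lock-free mix of one source into shared dst/sum buffers. */
static void mix_lockfree(size_t size, _Atomic int16_t *dst,
			 const int16_t *src, _Atomic int32_t *sum)
{
	size_t i;

	for (i = 0; i < size; i++) {
		int32_t sample = src[i];
		int32_t seen;
		int16_t expected = 0;

		/* cmpxchg trick: if dst is still 0, nobody has mixed into
		 * this sample yet, so the value in sum is stale; subtracting
		 * it makes the add below leave only our own sample there. */
		if (atomic_compare_exchange_strong(&dst[i], &expected, (int16_t)1))
			sample -= atomic_load(&sum[i]);
		atomic_fetch_add(&sum[i], sample);

		/* write-retry trick: store the saturated sum, and repeat if
		 * another writer changed the sum in the meantime. */
		do {
			seen = atomic_load(&sum[i]);
			atomic_store(&dst[i], clamp16(seen));
		} while (seen != atomic_load(&sum[i]));
	}
}
```

Mixing 0x7500, 0x7400 and -0x7d00 into a zeroed dst sample in that order
leaves dst at 0x7500, then 0x7fff, then 0x6c00, regardless of what the sum
buffer contained beforehand.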

						Jaroslav

-----
Jaroslav Kysela <perex@suse.cz>
Linux Kernel Sound Maintainer
ALSA Project, SuSE Labs




^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Re: dmix plugin
  2003-02-20 19:55                       ` Jaroslav Kysela
@ 2003-02-20 21:19                         ` tomasz motylewski
  2003-02-20 21:27                           ` Jaroslav Kysela
  2003-02-21 10:25                         ` Abramo Bagnara
  2003-02-21 14:08                         ` Jaroslaw Sobierski
  2 siblings, 1 reply; 50+ messages in thread
From: tomasz motylewski @ 2003-02-20 21:19 UTC (permalink / raw)
  To: Jaroslav Kysela
  Cc: Abramo Bagnara, Jaroslaw Sobierski,
	alsa-devel@lists.sourceforge.net


Jaroslav:
> I think that we can lose more in the client/server model. Also, note that

Client/server will have higher latency. The server has to copy the samples
to the DMA buffer at the "last minute", and the client has to finish before
the server copies the data. In the direct model only the client's timing has
to be within the typical (maximum) system latency.

Please note that on many cards supporting DMA, if the client is just a few
samples late but still adds the whole period, only those few samples will be
silence. The "nondestructive underrun detection" is the beauty here. The client
knows it is late (by comparing its pointer with the HW pointer) but may
continue nevertheless if it knows the next data will arrive on time. You know,
throwing out all samples or stopping the card in case of a small underrun is
like pulling the emergency brake because the train is a bit late. It only
makes things worse.

With client/server, either all is good or the whole period is lost.
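
A minimal sketch of the pointer comparison described above, assuming both
pointers are free-running frame counters that only ever increase. The name
late_frames and its parameters are hypothetical and do not correspond to any
ALSA call.

```c
#include <stdint.h>

/*
 * Hypothetical helper: how many frames of the period the client is about
 * to write at appl_ptr have already been passed by the hardware pointer.
 * Those frames play as silence; the rest of the period is still heard.
 */
static uint64_t late_frames(uint64_t hw_ptr, uint64_t appl_ptr,
			    uint64_t period)
{
	uint64_t behind;

	if (hw_ptr <= appl_ptr)
		return 0;		/* the client is on time */
	behind = hw_ptr - appl_ptr;
	return behind < period ? behind : period;
}
```

A client that is 5 frames late on a 256-frame period would lose 5 frames
and still deliver the remaining 251, instead of dropping the whole period.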

Do I understand correctly that the server stores the data in a 32-bit buffer
and then puts it in the card's 16-bit DMA buffer? This is one operation more
than mixing directly in the DMA buffer.

Best regards,
--
Tomek




^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Re: dmix plugin
  2003-02-20 21:19                         ` tomasz motylewski
@ 2003-02-20 21:27                           ` Jaroslav Kysela
  0 siblings, 0 replies; 50+ messages in thread
From: Jaroslav Kysela @ 2003-02-20 21:27 UTC (permalink / raw)
  To: tomasz motylewski
  Cc: Abramo Bagnara, Jaroslaw Sobierski,
	alsa-devel@lists.sourceforge.net

On Thu, 20 Feb 2003, tomasz motylewski wrote:

> Do I understand it correctly that the server stores data in 32 bit buffer and
> then puts it in 16 bit DMA buffer of the card? This is one operation more
> compared with mixing directly in DMA buffer.

There is no server; the 32-bit buffer holds the total sum of the samples
from all clients. Otherwise you'll get saturation errors (wrong clipping),
as described in the previous discussion.
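
The wrong clipping can be reproduced with the three samples used earlier in
this thread (0x7500, 0x7400 and 0x8300, the last one being -0x7d00 as a
signed 16-bit value). The sketch below is only an illustration; the function
names are hypothetical.

```c
#include <stdint.h>

/* Clamp a 32-bit value to the signed 16-bit sample range. */
static int16_t saturate16(int32_t v)
{
	if (v > 0x7fff)
		return 0x7fff;
	if (v < -0x8000)
		return -0x8000;
	return (int16_t)v;
}

/* Clip after every addition, as a 16-bit-only mixing buffer would force. */
static int16_t mix_clip_each_step(const int16_t *samples, int n)
{
	int16_t acc = 0;
	int i;

	for (i = 0; i < n; i++)
		acc = saturate16((int32_t)acc + samples[i]);
	return acc;
}

/* Keep the running total in 32 bits and clip only the final result. */
static int16_t mix_clip_once(const int16_t *samples, int n)
{
	int32_t sum = 0;
	int i;

	for (i = 0; i < n; i++)
		sum += samples[i];
	return saturate16(sum);
}
```

For those three samples, clipping at every step yields 0x02ff, while keeping
the 32-bit sum and clipping once yields the correct 0x6c00.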

						Jaroslav

-----
Jaroslav Kysela <perex@suse.cz>
Linux Kernel Sound Maintainer
ALSA Project, SuSE Labs




^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Re: dmix plugin
  2003-02-20 18:26                       ` Paul Davis
@ 2003-02-20 22:14                         ` Abramo Bagnara
  0 siblings, 0 replies; 50+ messages in thread
From: Abramo Bagnara @ 2003-02-20 22:14 UTC (permalink / raw)
  To: Paul Davis; +Cc: alsa-devel@lists.sourceforge.net

Paul Davis wrote:
> 
> >The server based approach has an added cost of an extra context switch
> >every period (about 1500 cycles on my machine i.e.), but this is fully
> >amortized by such an huge difference.
> 
> recall that (1) the context switch time is not a fixed cost but

Mine was only a very rough approximation for trivial audio generating
processes.

> depends on the memory behaviour between switches and (2) isn't it
> either two switches per participating client/application, or if they
> are chained (as in JACK), N+2 switches, where N is the number of
> clients/applications ?

I don't understand why...

Suppose that on an otherwise idle UP system we have 3 applications
generating output for the current pcm_dmix.
In this case we have something like ABCABCABCABC... etc.

In the pcm_mix case we use a saturate/transfer/zero thread called M, and
then we'll have something like ABCMABCMABCMABCM... etc.

Do you agree?
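
A rough sketch of what one ABCM round could look like, assuming the clients
add into a shared 32-bit sum buffer and M runs once per period. The names
client_add and mixer_period are hypothetical, and the plain "+=" stands in
for whatever atomic add a real multi-process version would need.

```c
#include <stddef.h>
#include <stdint.h>

/* Clients A, B, C: add samples into the shared 32-bit sum, no clipping. */
static void client_add(size_t size, int32_t *sum, const int16_t *src)
{
	size_t i;

	for (i = 0; i < size; i++)
		sum[i] += src[i];
}

/* Thread M, once per period: saturate, transfer, zero. */
static void mixer_period(size_t size, int16_t *out, int32_t *sum)
{
	size_t i;

	for (i = 0; i < size; i++) {
		int32_t v = sum[i];

		if (v > 0x7fff)
			v = 0x7fff;
		else if (v < -0x8000)
			v = -0x8000;
		out[i] = (int16_t)v;	/* saturate and transfer */
		sum[i] = 0;		/* ready for the next period */
	}
}
```

In this arrangement the saturation cost is paid once per sample per period,
no matter how many clients contributed.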

-- 
Abramo Bagnara                       mailto:abramo.bagnara@libero.it

Opera Unica                          Phone: +39.546.656023
Via Emilia Interna, 140
48014 Castel Bolognese (RA) - Italy



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Re: dmix plugin
  2003-02-20 19:55                       ` Jaroslav Kysela
  2003-02-20 21:19                         ` tomasz motylewski
@ 2003-02-21 10:25                         ` Abramo Bagnara
  2003-02-21 14:08                         ` Jaroslaw Sobierski
  2 siblings, 0 replies; 50+ messages in thread
From: Abramo Bagnara @ 2003-02-21 10:25 UTC (permalink / raw)
  To: Jaroslav Kysela; +Cc: Jaroslaw Sobierski, alsa-devel@lists.sourceforge.net

Jaroslav Kysela wrote:
> 
> On Thu, 20 Feb 2003, Abramo Bagnara wrote:
> 
> > Now I'm able to get the same results you see.
> >
> > However I think that we need to extract some results from this data.
> >
> > I'll leave alone MMX optimizations because I want to compare apples with
> > apples.
> >
> > The distributed saturation (also when it's missing the check/repeat
> > concurrency correctness part) costs more than 4 times the ticks needed
> > for a (fully correct wrt concurrency) saturate once approach for the
> > case 2048 8 32768.
> >
> > CPU clock: 1460477150.884593
> > mix_areas0: 86747 0.031975%
> > mix_areas1: 259424 0.095623% (0)
> > mix_areas1_mmx: 253894 0.093585% (0)
> > mix_areas2: 132321 0.048773% (365)
> > mix_areas3: 332411 0.122526% (0)
> >
> > The server based approach has an added cost of an extra context switch
> > every period (about 1500 cycles on my machine i.e.), but this is fully
> > amortized by such an huge difference.
> >
> > What's your opinion?
> 
> Interesting is that my Intel P3 CPU has slightly different times:
> 
> pnote:/home/perex/alsa/alsa-lib/test # ./code 2048 8 32768
> Scheduler set to Round Robin with priority 99...
> CPU clock: 847.292487Mhz (UP)
> 
> Summary (the best times):
> mix_areas_srv : 576382 0.366206%
> mix_areas0    : 556852 0.353798%
> mix_areas1    : 867989 0.551480%
> mix_areas1_mmx: 625144 0.397187%
> mix_areas2    : 903335 0.573937%
> 
> areas1/srv ratio     : 1.505927
> areas1_mmx/srv ratio : 1.084600

This is due to the cache poisoning effect, which is quite surprising to me.
With a warm cache mix_areas_srv is 3 times faster than with a cold cache,
while the difference is smaller for the other alternatives.

I've modified code.c so that you can test this effect, too.

However, I think that a realistic scenario is neither 0 nor 1024KB of
cache poisoning.

> I think that we can lose more in the client/server model. Also, note that
> we can use even futexes (if there's a hope that the possible context
> switch is acceptable) and then we can remove the cmpxchg trick and
> write-retry trick and use MMX for parallel saturation of two samples (this
> last can be used in the client/server model, too, indeed).

I really doubt that futexes would be of much help, as it's very difficult
to choose the unit a futex protects. Also, I very much like the fact that
the concurrent processes are totally independent. With futexes, if one
process exits badly you're screwed.

What seems more interesting to my eyes in the dmix approach is (as Tomasz
has pointed out) the exceptionally good latency (which is the flip side
of the repeated saturation cost).

However, we will enjoy this benefit *only* if pcm_dmix is the last PCM in
the chain.

-- 
Abramo Bagnara                       mailto:abramo.bagnara@libero.it

Opera Unica                          Phone: +39.546.656023
Via Emilia Interna, 140
48014 Castel Bolognese (RA) - Italy



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Re: dmix plugin
  2003-02-20 19:55                       ` Jaroslav Kysela
  2003-02-20 21:19                         ` tomasz motylewski
  2003-02-21 10:25                         ` Abramo Bagnara
@ 2003-02-21 14:08                         ` Jaroslaw Sobierski
  2 siblings, 0 replies; 50+ messages in thread
From: Jaroslaw Sobierski @ 2003-02-21 14:08 UTC (permalink / raw)
  To: Jaroslav Kysela
  Cc: Abramo Bagnara, Tomasz Motylewski,
	alsa-devel@lists.sourceforge.net

Quoting Jaroslav Kysela <perex@suse.cz>:

> On Thu, 20 Feb 2003, Abramo Bagnara wrote:
> 
> > Now I'm able to get the same results you see.
> > 
> > However I think that we need to extract some results from this data.
> > 
> > I'll leave alone MMX optimizations because I want to compare apples with
> > apples.
> > 
> > The distributed saturation (also when it's missing the check/repeat
> > concurrency correctness part) costs more than 4 times the ticks needed
> > for a (fully correct wrt concurrency) saturate once approach for the
> > case 2048 8 32768.
> > 
> > CPU clock: 1460477150.884593
> > mix_areas0: 86747 0.031975%
> > mix_areas1: 259424 0.095623% (0)
> > mix_areas1_mmx: 253894 0.093585% (0)
> > mix_areas2: 132321 0.048773% (365)
> > mix_areas3: 332411 0.122526% (0)
> > 
> > The server based approach has an added cost of an extra context switch
> > every period (about 1500 cycles on my machine i.e.), but this is fully
> > amortized by such an huge difference.
> > 
> > What's your opinion?
> 
> Interesting is that my Intel P3 CPU has slightly different times:
> 
> pnote:/home/perex/alsa/alsa-lib/test # ./code 2048 8 32768
> Scheduler set to Round Robin with priority 99...
> CPU clock: 847.292487Mhz (UP)
> 
> Summary (the best times):
> mix_areas_srv : 576382 0.366206%
> mix_areas0    : 556852 0.353798%
> mix_areas1    : 867989 0.551480%
> mix_areas1_mmx: 625144 0.397187%
> mix_areas2    : 903335 0.573937%
> 
> areas1/srv ratio     : 1.505927
> areas1_mmx/srv ratio : 1.084600
> 
> I think that we can lose more in the client/server model. Also, note that
> we can use even futexes (if there's a hope that the possible context
> switch is acceptable) and then we can remove the cmpxchg trick and
> write-retry trick and use MMX for parallel saturation of two samples (this
> last can be used in the client/server model, too, indeed).
> 
> 						Jaroslav
> 

I'm not sure exactly what solution you're proposing here, but it seems to be
in line with my train of thought after seeing the results of these tests.
It seems that a fast, thread-unsafe implementation could have such a huge
speed advantage that it would still compensate for the waiting imposed on
other processes by global locking. To give an example, if we can have a
4 times quicker mixing procedure, then instead of having 3 threads write
concurrently for 12 seconds (that's 4 seconds of CPU time per thread), they
would write in turns, 1 second each, for a total of 3 seconds. So the
1st thread to gain access could return after 1 second, the 2nd thread after
2 seconds and the 3rd after 3. That's still better than one thread writing
alone (for 4 seconds)! Yes, there is greater latency, but it seems well
compensated, at least for a reasonable number of connected sound sources.
Anything above 4 doesn't make much sense anyway if our approach is to
saturate rather than average: above this, distortion will be very
audible.

And if we devise a smart locking mechanism, this latency problem can
be reduced to a minimum. The locking and unlocking code would live inside
the mixing function, which prevents a badly coded application from
blocking the others indefinitely.

A simple locking mechanism I'm considering is the following:
- we maintain a short table of the ranges locked by the clients (one range
  for each client)
- access to the table is synchronized with a single mutex
- a request to lock a region can be granted partially, e.g.
  if thread 1 has locked offsets 300-500 and thread 2 wants 200-400,
  it will get access to 200-300, can mix there and then ask for the
  rest.
Additionally, the mixing function could be implemented to break the
buffer passed in into chunks of, say, 1024 bytes and would try to
lock and mix those segments in sequence. This would minimize the
time spent waiting for other threads. It means a sound compromise
(excuse the pun) between the convenience of not waiting for other
threads by effectively synchronizing on a per-sample basis and the
speed afforded by code which doesn't need to care about synchronization,
yet is not hindered by global blocking.
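
To make the idea concrete, here is a rough sketch of such a table with
partial grants. All names (lock_table, lock_range, unlock_range,
MAX_CLIENTS) are made up for illustration, the ranges are half-open
[start, end), and sharing the table between processes is left out.

```c
#include <pthread.h>
#include <stddef.h>
#include <string.h>

#define MAX_CLIENTS 8	/* hypothetical upper bound on mixing clients */

struct range {
	size_t start, end;	/* half-open range [start, end) */
	int used;
};

struct lock_table {
	pthread_mutex_t mutex;		/* the single mutex guarding the table */
	struct range held[MAX_CLIENTS];	/* one entry per client */
};

static void lock_table_init(struct lock_table *t)
{
	memset(t->held, 0, sizeof(t->held));
	pthread_mutex_init(&t->mutex, NULL);
}

/*
 * Try to lock [start, end) for `client` (start < end is assumed). The
 * request may be granted only partially: the return value is the end of
 * the granted prefix, and equals `start` when nothing could be granted.
 */
static size_t lock_range(struct lock_table *t, int client,
			 size_t start, size_t end)
{
	size_t granted_end = end;
	int i;

	pthread_mutex_lock(&t->mutex);
	for (i = 0; i < MAX_CLIENTS; i++) {
		const struct range *r = &t->held[i];

		if (i == client || !r->used)
			continue;
		if (r->end <= start || r->start >= granted_end)
			continue;	/* no overlap with what is still on offer */
		/* Cut the grant where the other client's range begins. */
		granted_end = r->start > start ? r->start : start;
	}
	if (granted_end > start) {
		t->held[client].start = start;
		t->held[client].end = granted_end;
		t->held[client].used = 1;
	}
	pthread_mutex_unlock(&t->mutex);
	return granted_end;
}

/* Release whatever range `client` currently holds. */
static void unlock_range(struct lock_table *t, int client)
{
	pthread_mutex_lock(&t->mutex);
	t->held[client].used = 0;
	pthread_mutex_unlock(&t->mutex);
}
```

With thread 1 holding 300-500, a request from thread 2 for 200-400 returns
300: thread 2 mixes 200-300, releases it, and asks again for 300-400, which
is granted once thread 1 has released its range.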

Am I making myself clear or does this sound totally convoluted?

--------------
Fycio (J.Sobierski)
 fycio@gucio.com



^ permalink raw reply	[flat|nested] 50+ messages in thread

end of thread, other threads:[~2003-02-21 14:08 UTC | newest]

Thread overview: 50+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2003-02-17 13:12 Re: dmix plugin Jaroslaw Sobierski
2003-02-17 13:22 ` Jaroslav Kysela
2003-02-17 18:15   ` Paul Davis
2003-02-18 22:36     ` Abramo Bagnara
2003-02-17 13:24 ` Jaroslav Kysela
  -- strict thread matches above, loose matches on Subject: below --
2003-02-17 22:28 Jaroslaw Sobierski
2003-02-17 16:18 Jaroslaw Sobierski
2003-02-17 15:32 Jaroslaw Sobierski
2003-02-17 19:45 ` Jaroslav Kysela
2003-02-17 20:44   ` tomasz motylewski
2003-02-17 20:59     ` Jaroslav Kysela
2003-02-18 10:00   ` Abramo Bagnara
2003-02-18 12:52     ` Jaroslav Kysela
2003-02-18 13:10       ` Jaroslaw Sobierski
2003-02-18 13:19         ` Jaroslav Kysela
2003-02-18 14:51       ` Paul Davis
2003-02-18 16:51         ` Jaroslav Kysela
2003-02-18 21:07     ` Jaroslav Kysela
2003-02-19 10:20       ` Abramo Bagnara
2003-02-19 11:01         ` Jaroslav Kysela
2003-02-19 11:17           ` Abramo Bagnara
2003-02-19 13:49             ` Abramo Bagnara
2003-02-19 15:45               ` Jaroslaw Sobierski
2003-02-19 20:39                 ` Abramo Bagnara
2003-02-19 18:34               ` Jaroslav Kysela
2003-02-19 21:24                 ` Jaroslav Kysela
2003-02-20  8:28                 ` Abramo Bagnara
2003-02-20  8:30                 ` Jaroslaw Sobierski
2003-02-20  8:48                   ` Abramo Bagnara
2003-02-20  8:53                 ` Abramo Bagnara
2003-02-20 16:49                   ` Jaroslav Kysela
2003-02-20 17:57                     ` Abramo Bagnara
2003-02-20 18:26                       ` Paul Davis
2003-02-20 22:14                         ` Abramo Bagnara
2003-02-20 19:55                       ` Jaroslav Kysela
2003-02-20 21:19                         ` tomasz motylewski
2003-02-20 21:27                           ` Jaroslav Kysela
2003-02-21 10:25                         ` Abramo Bagnara
2003-02-21 14:08                         ` Jaroslaw Sobierski
2003-02-19 10:33       ` Jaroslaw Sobierski
2003-02-19 11:08         ` Jaroslav Kysela
2003-02-17 11:18 Jaroslaw Sobierski
2003-02-17 11:53 ` Jaroslav Kysela
2003-02-17 10:04 Jaroslaw Sobierski
2003-02-17 10:15 ` Jaroslav Kysela
2003-02-17 12:15   ` Abramo Bagnara
2003-02-17 13:12     ` Jaroslav Kysela
2003-02-17 13:29       ` Abramo Bagnara
2003-02-17 15:00         ` Jaroslav Kysela
2003-02-17 15:21           ` Abramo Bagnara
2003-02-17 10:32 ` tomasz motylewski
