* Re: Re: dmix plugin
@ 2003-02-17 15:32 Jaroslaw Sobierski
2003-02-17 19:45 ` Jaroslav Kysela
0 siblings, 1 reply; 57+ messages in thread
From: Jaroslaw Sobierski @ 2003-02-17 15:32 UTC (permalink / raw)
To: abramo.bagnara; +Cc: perex, alsa-devel
>> I see, the read/saturate/write must be atomic, too. In this case, it would
>> be better to use a global (or a set of) mutex(es) to lock the hardware
>> ring buffer. The futexes are nice.
>
>They are nice indeed, but definitely not the right solution here.
>
>Although I don't know if it's the absolute best solution, the 'retry'
>approach I've proposed is far better and much more efficient.
I have to agree with Abramo. A global mutex would cause long and unnecessary
waits for the processes trying to write to the plugin. Locking access to
individual parts of the buffer is messy. Notice that concurrent writes
to the same sample in the buffer will occur sporadically, and the "re-read"
in the loop costs almost nothing, while synchronization mechanisms could
block often.
--------------
Fycio (J.Sobierski)
fycio@gucio.com
-------------------------------------------------------
This sf.net email is sponsored by:ThinkGeek
Welcome to geek heaven.
http://thinkgeek.com/sf
* Re: Re: dmix plugin
2003-02-17 15:32 Re: dmix plugin Jaroslaw Sobierski
@ 2003-02-17 19:45 ` Jaroslav Kysela
2003-02-17 20:44 ` tomasz motylewski
2003-02-18 10:00 ` Abramo Bagnara
0 siblings, 2 replies; 57+ messages in thread
From: Jaroslav Kysela @ 2003-02-17 19:45 UTC (permalink / raw)
To: Jaroslaw Sobierski
Cc: abramo.bagnara@libero.it, alsa-devel@lists.sourceforge.net

On Mon, 17 Feb 2003, Jaroslaw Sobierski wrote:

> >> I see, the read/saturate/write must be atomic, too. In this case, it would
> >> be better to use a global (or a set of) mutex(es) to lock the hardware
> >> ring buffer. The futexes are nice.
> >
> >They are nice indeed, but definitely not the right solution here.
> >
> >Although I don't know if it's the absolute best solution, the 'retry'
> >approach I've proposed is far better and much more efficient.
>
> I have to agree with Abramo. A global mutex would cause long and unnecessary
> waits for the processes trying to write to the plugin. Locking access to
> individual parts of the buffer is messy. Notice that concurrent writes
> to the same sample in the buffer will occur sporadically, and the "re-read"
> in the loop costs almost nothing, while synchronization mechanisms could
> block often.

Note that all your nice ideas lead into a blind alley. Who will silence the
sum buffer? The driver silences only the hardware buffer, which will not be
used for the calculation in your algorithm.

Jaroslav
-----
Jaroslav Kysela <perex@suse.cz>
Linux Kernel Sound Maintainer
ALSA Project, SuSE Labs
* Re: Re: dmix plugin
2003-02-17 19:45 ` Jaroslav Kysela
@ 2003-02-17 20:44 ` tomasz motylewski
2003-02-17 20:59 ` Jaroslav Kysela
2003-02-18 10:00 ` Abramo Bagnara
1 sibling, 1 reply; 57+ messages in thread
From: tomasz motylewski @ 2003-02-17 20:44 UTC (permalink / raw)
To: Jaroslav Kysela
Cc: Jaroslaw Sobierski, abramo.bagnara@libero.it, alsa-devel@lists.sourceforge.net

On Mon, 17 Feb 2003, Jaroslav Kysela wrote:

> Note that all your nice ideas lead into a blind alley. Who will silence the
> sum buffer? The driver silences only the hardware buffer, which will not be
> used for the calculation in your algorithm.

Silencing is not time critical: if the buffer is big enough, it does not
matter whether it is done 1 ms or 100 ms after the card has played the data.
Therefore it may be done by a separate thread/process/kernel task without any
interference with other processes writing to the buffer.

Anyway, I strongly support writing/adding directly to the DMA buffer: lowest
latency possible. Precise information about the current position of the HW
pointer should be available to each application, so that it can tune the delay
(synchronize the data coming from a source with a slightly different clock
frequency!) by adding/deleting single samples (with interpolation).

Mutexes optional.

Best regards,
--
Tomasz Motylewski
BFAD GmbH & Co. KG
* Re: Re: dmix plugin
2003-02-17 20:44 ` tomasz motylewski
@ 2003-02-17 20:59 ` Jaroslav Kysela
0 siblings, 0 replies; 57+ messages in thread
From: Jaroslav Kysela @ 2003-02-17 20:59 UTC (permalink / raw)
To: tomasz motylewski
Cc: Jaroslaw Sobierski, abramo.bagnara@libero.it, alsa-devel@lists.sourceforge.net

On Mon, 17 Feb 2003, tomasz motylewski wrote:

> On Mon, 17 Feb 2003, Jaroslav Kysela wrote:
>
> > Note that all your nice ideas lead into a blind alley. Who will silence the
> > sum buffer? The driver silences only the hardware buffer, which will not be
> > used for the calculation in your algorithm.
>
> Silencing is not time critical: if the buffer is big enough, it does not
> matter whether it is done 1 ms or 100 ms after the card has played the data.
> Therefore it may be done by a separate thread/process/kernel task without any
> interference with other processes writing to the buffer.

It is time critical for the dmix plugin, because other processes might write
new samples to "empty" areas.

Jaroslav
-----
Jaroslav Kysela <perex@suse.cz>
Linux Kernel Sound Maintainer
ALSA Project, SuSE Labs
* Re: Re: dmix plugin
2003-02-17 19:45 ` Jaroslav Kysela
2003-02-17 20:44 ` tomasz motylewski
@ 2003-02-18 10:00 ` Abramo Bagnara
2003-02-18 12:52 ` Jaroslav Kysela
2003-02-18 21:07 ` Jaroslav Kysela
1 sibling, 2 replies; 57+ messages in thread
From: Abramo Bagnara @ 2003-02-18 10:00 UTC (permalink / raw)
To: Jaroslav Kysela; +Cc: Jaroslaw Sobierski, alsa-devel@lists.sourceforge.net

Jaroslav Kysela wrote:
>
> On Mon, 17 Feb 2003, Jaroslaw Sobierski wrote:
>
> > >> I see, the read/saturate/write must be atomic, too. In this case, it would
> > >> be better to use a global (or a set of) mutex(es) to lock the hardware
> > >> ring buffer. The futexes are nice.
> > >
> > >They are nice indeed, but definitely not the right solution here.
> > >
> > >Although I don't know if it's the absolute best solution, the 'retry'
> > >approach I've proposed is far better and much more efficient.
> >
> > I have to agree with Abramo. A global mutex would cause long and unnecessary
> > waits for the processes trying to write to the plugin. Locking access to
> > individual parts of the buffer is messy. Notice that concurrent writes
> > to the same sample in the buffer will occur sporadically, and the "re-read"
> > in the loop costs almost nothing, while synchronization mechanisms could
> > block often.
>
> Note that all your nice ideas lead into a blind alley. Who will silence the
> sum buffer? The driver silences only the hardware buffer, which will not be
> used for the calculation in your algorithm.

Not so blind ;-)

    v = *src;
    if (cmpxchg(hw, 0, 1) == 0)
        v -= *sw;
    xadd(sw, v);
    do {
        v = *sw;
        if (v > 0x7fff)
            s = 0x7fff;
        else if (v < -0x8000)
            s = -0x8000;
        else
            s = v;
        *hw = s;
    } while (unlikely(v != *sw));

Have I convinced you?

However, as I wrote in my first message, the evil of the dmix approach lies
in the details: they might destroy the efficiency of the approach rather
easily.

--
Abramo Bagnara mailto:abramo.bagnara@libero.it
Opera Unica Phone: +39.546.656023
Via Emilia Interna, 140
48014 Castel Bolognese (RA) - Italy
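For readers who want to experiment with the per-sample update posted above, here is a minimal sketch that transcribes it into portable C11 atomics. It is not the alsa-lib code: the names `mix_sample` and `saturate16` are invented for this illustration, `cmpxchg` is mapped to `atomic_compare_exchange_strong`, and `xadd` is mapped to `atomic_fetch_add`. Like the original snippet, it assumes that a zero in the hardware slot means "silenced by the driver".

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Clamp a 32-bit running sum to the signed 16-bit sample range. */
int16_t saturate16(int32_t v)
{
    if (v > INT16_MAX)
        return INT16_MAX;
    if (v < INT16_MIN)
        return INT16_MIN;
    return (int16_t)v;
}

/*
 * Mix one source sample into one buffer slot (hypothetical helper).
 *   hw: 16-bit slot in the hardware ring buffer (zeroed by the driver)
 *   sw: 32-bit slot in the shared sum buffer (never zeroed by the driver)
 */
void mix_sample(_Atomic int16_t *hw, _Atomic int32_t *sw, int16_t src)
{
    int32_t v = src;
    int16_t expected = 0;

    assert(hw && sw);

    /* A zero hw slot is taken to mean "silenced": the writer that flips
     * it from 0 to 1 also cancels whatever stale sum is left in sw. */
    if (atomic_compare_exchange_strong(hw, &expected, (int16_t)1))
        v -= atomic_load(sw);
    atomic_fetch_add(sw, v);

    /* Publish the saturated sum; retry if another writer changed sw
     * between our read and our store. */
    do {
        v = atomic_load(sw);
        atomic_store(hw, saturate16(v));
    } while (v != atomic_load(sw));
}
```

Calling `mix_sample(&hw, &sw, 100)` on a slot with `hw == 0` and a stale `sw` leaves both holding 100, which is the "who silences the sum buffer" case raised earlier in the thread. The sketch only exercises the logic in a single thread; it makes no claim about behaviour under real contention.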
* Re: Re: dmix plugin
2003-02-18 10:00 ` Abramo Bagnara
@ 2003-02-18 12:52 ` Jaroslav Kysela
2003-02-18 13:10 ` Jaroslaw Sobierski
2003-02-18 14:51 ` Paul Davis
2003-02-18 21:07 ` Jaroslav Kysela
1 sibling, 2 replies; 57+ messages in thread
From: Jaroslav Kysela @ 2003-02-18 12:52 UTC (permalink / raw)
To: Abramo Bagnara; +Cc: Jaroslaw Sobierski, alsa-devel@lists.sourceforge.net

On Tue, 18 Feb 2003, Abramo Bagnara wrote:

> Jaroslav Kysela wrote:
> >
> > On Mon, 17 Feb 2003, Jaroslaw Sobierski wrote:
> >
> > > >> I see, the read/saturate/write must be atomic, too. In this case, it would
> > > >> be better to use a global (or a set of) mutex(es) to lock the hardware
> > > >> ring buffer. The futexes are nice.
> > > >
> > > >They are nice indeed, but definitely not the right solution here.
> > > >
> > > >Although I don't know if it's the absolute best solution, the 'retry'
> > > >approach I've proposed is far better and much more efficient.
> > >
> > > I have to agree with Abramo. A global mutex would cause long and unnecessary
> > > waits for the processes trying to write to the plugin. Locking access to
> > > individual parts of the buffer is messy. Notice that concurrent writes
> > > to the same sample in the buffer will occur sporadically, and the "re-read"
> > > in the loop costs almost nothing, while synchronization mechanisms could
> > > block often.
> >
> > Note that all your nice ideas lead into a blind alley. Who will silence the
> > sum buffer? The driver silences only the hardware buffer, which will not be
> > used for the calculation in your algorithm.
>
>
> Not so blind ;-)
>
>     v = *src;
>     if (cmpxchg(hw, 0, 1) == 0)
>         v -= *sw;
>     xadd(sw, v);
>     do {
>         v = *sw;
>         if (v > 0x7fff)
>             s = 0x7fff;
>         else if (v < -0x8000)
>             s = -0x8000;
>         else
>             s = v;

A small correction (we have to avoid zero results in the hw buffer):

        else if (v == 0)
            s = 1;
        else
            s = v;

>         *hw = s;
>     } while (unlikely(v != *sw));
>
> Have I convinced you?
>
> However, as I wrote in my first message, the evil of the dmix approach lies
> in the details: they might destroy the efficiency of the approach rather
> easily.

Yes, but it seems that we can still do the job properly without global locks,
which seems pretty nice. Thank you for your help.

Jaroslav
-----
Jaroslav Kysela <perex@suse.cz>
Linux Kernel Sound Maintainer
ALSA Project, SuSE Labs
* Re: Re: dmix plugin
2003-02-18 12:52 ` Jaroslav Kysela
@ 2003-02-18 13:10 ` Jaroslaw Sobierski
2003-02-18 13:19 ` Jaroslav Kysela
2003-02-18 14:51 ` Paul Davis
1 sibling, 1 reply; 57+ messages in thread
From: Jaroslaw Sobierski @ 2003-02-18 13:10 UTC (permalink / raw)
To: Jaroslav Kysela; +Cc: Abramo Bagnara, alsa-devel@lists.sourceforge.net

Quoting Jaroslav Kysela:
[...]
> >
> >     v = *src;
> >     if (cmpxchg(hw, 0, 1) == 0)
> >         v -= *sw;
> >     xadd(sw, v);
> >     do {
> >         v = *sw;
> >         if (v > 0x7fff)
> >             s = 0x7fff;
> >         else if (v < -0x8000)
> >             s = -0x8000;
> >         else
> >             s = v;
>
> A small correction (we have to avoid zero results in the hw buffer):
>
>         else if (v == 0)
>             s = 1;
>         else
>             s = v;
>

Why?! It's as I wrote yesterday: even if the resulting sample is zero, we
can still treat the hw buffer as cleared. It makes no difference whether it
was reset by the driver or the samples just added up to zero. If we have
zero in the hw buffer for a reason other than a reset, we must also have 0
in sw, so the clearing code will have no effect.

--------------
Fycio (J.Sobierski)
fycio@gucio.com
* Re: Re: dmix plugin
2003-02-18 13:10 ` Jaroslaw Sobierski
@ 2003-02-18 13:19 ` Jaroslav Kysela
0 siblings, 0 replies; 57+ messages in thread
From: Jaroslav Kysela @ 2003-02-18 13:19 UTC (permalink / raw)
To: Jaroslaw Sobierski; +Cc: Abramo Bagnara, alsa-devel@lists.sourceforge.net

On Tue, 18 Feb 2003, Jaroslaw Sobierski wrote:

> Quoting Jaroslav Kysela:
> [...]
> > >
> > >     v = *src;
> > >     if (cmpxchg(hw, 0, 1) == 0)
> > >         v -= *sw;
> > >     xadd(sw, v);
> > >     do {
> > >         v = *sw;
> > >         if (v > 0x7fff)
> > >             s = 0x7fff;
> > >         else if (v < -0x8000)
> > >             s = -0x8000;
> > >         else
> > >             s = v;
> >
> > A small correction (we have to avoid zero results in the hw buffer):
> >
> >         else if (v == 0)
> >             s = 1;
> >         else
> >             s = v;
> >
>
> Why?! It's as I wrote yesterday: even if the resulting sample is zero, we
> can still treat the hw buffer as cleared. It makes no difference whether it
> was reset by the driver or the samples just added up to zero. If we have
> zero in the hw buffer for a reason other than a reset, we must also have 0
> in sw, so the clearing code will have no effect.

Thanks for the correction. Some things are not visible at first glance.

Jaroslav
-----
Jaroslav Kysela <perex@suse.cz>
Linux Kernel Sound Maintainer
ALSA Project, SuSE Labs
* Re: Re: dmix plugin
2003-02-18 12:52 ` Jaroslav Kysela
2003-02-18 13:10 ` Jaroslaw Sobierski
@ 2003-02-18 14:51 ` Paul Davis
2003-02-18 16:51 ` Jaroslav Kysela
1 sibling, 1 reply; 57+ messages in thread
From: Paul Davis @ 2003-02-18 14:51 UTC (permalink / raw)
To: alsa-devel@lists.sourceforge.net

>>     v = *src;
>>     if (cmpxchg(hw, 0, 1) == 0)
>>         v -= *sw;
>>     xadd(sw, v);
>>     do {
>>         v = *sw;
>>         if (v > 0x7fff)
>>             s = 0x7fff;
>>         else if (v < -0x8000)
>>             s = -0x8000;
>>         else
>>             s = v;
>
>A small correction (we have to avoid zero results in the hw buffer):
>
>        else if (v == 0)
>            s = 1;
>        else
>            s = v;
>
>>         *hw = s;
>>     } while (unlikely(v != *sw));

Help me out here. Is this the code path that has to be followed to write
a single sample to the buffer?
* Re: Re: dmix plugin
2003-02-18 14:51 ` Paul Davis
@ 2003-02-18 16:51 ` Jaroslav Kysela
0 siblings, 0 replies; 57+ messages in thread
From: Jaroslav Kysela @ 2003-02-18 16:51 UTC (permalink / raw)
To: Paul Davis; +Cc: alsa-devel@lists.sourceforge.net

On Tue, 18 Feb 2003, Paul Davis wrote:

> >>     v = *src;
> >>     if (cmpxchg(hw, 0, 1) == 0)
> >>         v -= *sw;
> >>     xadd(sw, v);
> >>     do {
> >>         v = *sw;
> >>         if (v > 0x7fff)
> >>             s = 0x7fff;
> >>         else if (v < -0x8000)
> >>             s = -0x8000;
> >>         else
> >>             s = v;
> >
> >A small correction (we have to avoid zero results in the hw buffer):
> >
> >        else if (v == 0)
> >            s = 1;
> >        else
> >            s = v;
> >
> >>         *hw = s;
> >>     } while (unlikely(v != *sw));
>
> Help me out here. Is this the code path that has to be followed to write
> a single sample to the buffer?

Yes, this code updates one sample in the hardware buffer.

Jaroslav
-----
Jaroslav Kysela <perex@suse.cz>
Linux Kernel Sound Maintainer
ALSA Project, SuSE Labs
* Re: Re: dmix plugin
2003-02-18 10:00 ` Abramo Bagnara
2003-02-18 12:52 ` Jaroslav Kysela
@ 2003-02-18 21:07 ` Jaroslav Kysela
2003-02-19 10:20 ` Abramo Bagnara
2003-02-19 10:33 ` Jaroslaw Sobierski
1 sibling, 2 replies; 57+ messages in thread
From: Jaroslav Kysela @ 2003-02-18 21:07 UTC (permalink / raw)
To: Abramo Bagnara; +Cc: Jaroslaw Sobierski, alsa-devel@lists.sourceforge.net

On Tue, 18 Feb 2003, Abramo Bagnara wrote:

> Jaroslav Kysela wrote:
> >
> > On Mon, 17 Feb 2003, Jaroslaw Sobierski wrote:
> >
> > > >> I see, the read/saturate/write must be atomic, too. In this case, it would
> > > >> be better to use a global (or a set of) mutex(es) to lock the hardware
> > > >> ring buffer. The futexes are nice.
> > > >
> > > >They are nice indeed, but definitely not the right solution here.
> > > >
> > > >Although I don't know if it's the absolute best solution, the 'retry'
> > > >approach I've proposed is far better and much more efficient.
> > >
> > > I have to agree with Abramo. A global mutex would cause long and unnecessary
> > > waits for the processes trying to write to the plugin. Locking access to
> > > individual parts of the buffer is messy. Notice that concurrent writes
> > > to the same sample in the buffer will occur sporadically, and the "re-read"
> > > in the loop costs almost nothing, while synchronization mechanisms could
> > > block often.
> >
> > Note that all your nice ideas lead into a blind alley. Who will silence the
> > sum buffer? The driver silences only the hardware buffer, which will not be
> > used for the calculation in your algorithm.
>
>
> Not so blind ;-)
>
>     v = *src;
>     if (cmpxchg(hw, 0, 1) == 0)
>         v -= *sw;
>     xadd(sw, v);
>     do {
>         v = *sw;
>         if (v > 0x7fff)
>             s = 0x7fff;
>         else if (v < -0x8000)
>             s = -0x8000;
>         else
>             s = v;
>         *hw = s;
>     } while (unlikely(v != *sw));
>
> Have I convinced you?
>
> However, as I wrote in my first message, the evil of the dmix approach lies
> in the details: they might destroy the efficiency of the approach rather
> easily.

I've implemented the whole transfer and mix loop in assembly, and it works
without any drastic impact on CPU usage. I tried to optimize the assembler
part as much as I could, but if some assembler guru wants to take a look,
I'd appreciate it. The function is named mix_areas1() in
alsa-lib/src/pcm/pcm_dmix.c.

Jaroslav
-----
Jaroslav Kysela <perex@suse.cz>
Linux Kernel Sound Maintainer
ALSA Project, SuSE Labs
* Re: Re: dmix plugin
2003-02-18 21:07 ` Jaroslav Kysela
@ 2003-02-19 10:20 ` Abramo Bagnara
2003-02-19 11:01 ` Jaroslav Kysela
2003-02-19 10:33 ` Jaroslaw Sobierski
1 sibling, 1 reply; 57+ messages in thread
From: Abramo Bagnara @ 2003-02-19 10:20 UTC (permalink / raw)
To: Jaroslav Kysela; +Cc: Jaroslaw Sobierski, alsa-devel@lists.sourceforge.net

Jaroslav Kysela wrote:
>
> I've implemented the whole transfer and mix loop in assembly, and it works
> without any drastic impact on CPU usage. I tried to optimize the assembler
> part as much as I could, but if some assembler guru wants to take a look,
> I'd appreciate it. The function is named mix_areas1() in
> alsa-lib/src/pcm/pcm_dmix.c.

One comment:

It's better to execute the interleaved check once and not in mix_areas.

One objection:

I doubt very much that you gain anything by coding the mixing loop in
assembler. Do you have data showing that?

--
Abramo Bagnara mailto:abramo.bagnara@libero.it
Opera Unica Phone: +39.546.656023
Via Emilia Interna, 140
48014 Castel Bolognese (RA) - Italy
-------------------------------------------------------
This SF.net email is sponsored by: SlickEdit Inc. Develop an edge.
The most comprehensive and flexible code editor you can use.
Code faster. C/C++, C#, Java, HTML, XML, many more. FREE 30-Day Trial.
www.slickedit.com/sourceforge
* Re: Re: dmix plugin
2003-02-19 10:20 ` Abramo Bagnara
@ 2003-02-19 11:01 ` Jaroslav Kysela
2003-02-19 11:17 ` Abramo Bagnara
0 siblings, 1 reply; 57+ messages in thread
From: Jaroslav Kysela @ 2003-02-19 11:01 UTC (permalink / raw)
To: Abramo Bagnara; +Cc: Jaroslaw Sobierski, alsa-devel@lists.sourceforge.net

On Wed, 19 Feb 2003, Abramo Bagnara wrote:

> Jaroslav Kysela wrote:
> >
> > I've implemented the whole transfer and mix loop in assembly, and it works
> > without any drastic impact on CPU usage. I tried to optimize the assembler
> > part as much as I could, but if some assembler guru wants to take a look,
> > I'd appreciate it. The function is named mix_areas1() in
> > alsa-lib/src/pcm/pcm_dmix.c.
>
> One comment:
>
> It's better to execute the interleaved check once and not in mix_areas.

Done. I was too tired yesterday to bother with these details.

> One objection:
>
> I doubt very much that you gain anything by coding the mixing loop in
> assembler. Do you have data showing that?

I think that I saved some ticks by duplicating the code for saturation, and
the main while{} loop is also more efficient than what GCC generates. But
that's only a guess.

Jaroslav
-----
Jaroslav Kysela <perex@suse.cz>
Linux Kernel Sound Maintainer
ALSA Project, SuSE Labs
* Re: Re: dmix plugin
2003-02-19 11:01 ` Jaroslav Kysela
@ 2003-02-19 11:17 ` Abramo Bagnara
2003-02-19 13:49 ` Abramo Bagnara
0 siblings, 1 reply; 57+ messages in thread
From: Abramo Bagnara @ 2003-02-19 11:17 UTC (permalink / raw)
To: Jaroslav Kysela; +Cc: Jaroslaw Sobierski, alsa-devel@lists.sourceforge.net

Jaroslav Kysela wrote:
>
> On Wed, 19 Feb 2003, Abramo Bagnara wrote:
>
> > Jaroslav Kysela wrote:
> > >
> > > I've implemented the whole transfer and mix loop in assembly, and it works
> > > without any drastic impact on CPU usage. I tried to optimize the assembler
> > > part as much as I could, but if some assembler guru wants to take a look,
> > > I'd appreciate it. The function is named mix_areas1() in
> > > alsa-lib/src/pcm/pcm_dmix.c.
> >
> > One comment:
> >
> > It's better to execute the interleaved check once and not in mix_areas.
>
> Done. I was too tired yesterday to bother with these details.
>
> > One objection:
> >
> > I doubt very much that you gain anything by coding the mixing loop in
> > assembler. Do you have data showing that?
>
> I think that I saved some ticks by duplicating the code for saturation, and
> the main while{} loop is also more efficient than what GCC generates. But
> that's only a guess.

I hope to find the time to check it this evening.

--
Abramo Bagnara mailto:abramo.bagnara@libero.it
Opera Unica Phone: +39.546.656023
Via Emilia Interna, 140
48014 Castel Bolognese (RA) - Italy
* Re: Re: dmix plugin
2003-02-19 11:17 ` Abramo Bagnara
@ 2003-02-19 13:49 ` Abramo Bagnara
2003-02-19 15:45 ` Jaroslaw Sobierski
2003-02-19 18:34 ` Jaroslav Kysela
0 siblings, 2 replies; 57+ messages in thread
From: Abramo Bagnara @ 2003-02-19 13:49 UTC (permalink / raw)
To: Jaroslav Kysela, Jaroslaw Sobierski, alsa-devel@lists.sourceforge.net

[-- Attachment #1: Type: text/plain, Size: 2816 bytes --]

Abramo Bagnara wrote:
>
> Jaroslav Kysela wrote:
> >
> > On Wed, 19 Feb 2003, Abramo Bagnara wrote:
> >
> > > Jaroslav Kysela wrote:
> > > >
> > > > I've implemented the whole transfer and mix loop in assembly, and it works
> > > > without any drastic impact on CPU usage. I tried to optimize the assembler
> > > > part as much as I could, but if some assembler guru wants to take a look,
> > > > I'd appreciate it. The function is named mix_areas1() in
> > > > alsa-lib/src/pcm/pcm_dmix.c.
> > >
> > > One comment:
> > >
> > > It's better to execute the interleaved check once and not in mix_areas.
> >
> > Done. I was too tired yesterday to bother with these details.
> >
> > > One objection:
> > >
> > > I doubt very much that you gain anything by coding the mixing loop in
> > > assembler. Do you have data showing that?
> >
> > I think that I saved some ticks by duplicating the code for saturation, and
> > the main while{} loop is also more efficient than what GCC generates. But
> > that's only a guess.
>
> I hope to find the time to check it this evening.

I've stolen some time from paid work.

The results are amazing, and I'd say Jaroslav has made some mistakes in
his handmade asm.

$ cat /proc/cpuinfo
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 6
model           : 6
model name      : AMD Athlon(tm) XP 1700+
stepping        : 2
cpu MHz         : 1460.471
cache size      : 256 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow
bogomips        : 2916.35

$ gcc -v
Reading specs from /usr/lib/gcc-lib/i386-redhat-linux/3.2.1/specs
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --enable-shared --enable-threads=posix --disable-checking --with-system-zlib --enable-__cxa_atexit --host=i386-redhat-linux
Thread model: posix
gcc version 3.2.1 20021125 (Red Hat Linux 8.0 3.2.1-1)

$ make
gcc -O6 -W -Wall -c -o sum.o sum.c
sum.c: In function `main':
sum.c:242: warning: implicit declaration of function `printf'
sum.c:219: warning: unused parameter `argc'
sum.c:255: warning: control reaches end of non-void function
sum.c: In function `mix_areas0':
sum.c:64: warning: unused parameter `sum'
gcc sum.o -o sum

$ ./sum 2048 4 32767
mix_areas0: 110603
mix_areas1: 1512610
mix_areas2: 157597

mix_areas0 is the naive, incorrect version
mix_areas1 is Jaroslav's asm
mix_areas2 is my best attempt

Times are in clock ticks.

--
Abramo Bagnara mailto:abramo.bagnara@libero.it
Opera Unica Phone: +39.546.656023
Via Emilia Interna, 140
48014 Castel Bolognese (RA) - Italy

[-- Attachment #2: sum.c --]
[-- Type: text/plain, Size: 5168 bytes --]

#include <stdlib.h>
#include <stdlib.h>
#include <string.h>

#define rdtscll(val) \
    __asm__ __volatile__("rdtsc" : "=A" (val))

#define likely(x) __builtin_expect((x),1)
#define unlikely(x) __builtin_expect((x),0)

typedef short int s16;
typedef int s32;

#ifdef CONFIG_SMP
#define LOCK_PREFIX "lock ; "
#else
#define LOCK_PREFIX ""
#endif

struct __xchg_dummy { unsigned long a[100]; };
#define __xg(x) ((struct __xchg_dummy *)(x))

static inline unsigned long __cmpxchg(volatile void *ptr, unsigned long old,
                                      unsigned long new, int size)
{
    unsigned long prev;
    switch (size) {
    case 1:
        __asm__ __volatile__(LOCK_PREFIX "cmpxchgb %b1,%2"
                             : "=a"(prev)
                             : "q"(new), "m"(*__xg(ptr)), "0"(old)
                             : "memory");
        return prev;
    case 2:
        __asm__ __volatile__(LOCK_PREFIX "cmpxchgw %w1,%2"
                             : "=a"(prev)
                             : "q"(new), "m"(*__xg(ptr)), "0"(old)
                             : "memory");
        return prev;
    case 4:
        __asm__ __volatile__(LOCK_PREFIX "cmpxchgl %1,%2"
                             : "=a"(prev)
                             : "q"(new), "m"(*__xg(ptr)), "0"(old)
                             : "memory");
        return prev;
    }
    return old;
}

#define cmpxchg(ptr,o,n)\
    ((__typeof__(*(ptr)))__cmpxchg((ptr),(unsigned long)(o),\
                                   (unsigned long)(n),sizeof(*(ptr))))

static inline void atomic_add(volatile int *dst, int v)
{
    __asm__ __volatile__(
        LOCK_PREFIX "addl %0,%1"
        :"=m" (*dst)
        :"ir" (v));
}

void mix_areas0(unsigned int size,
                volatile s16 *dst, s16 *src,
                volatile s32 *sum,
                unsigned int dst_step, unsigned int src_step)
{
    while (size-- > 0) {
        s32 sample = *dst + *src;
        if (unlikely(sample & 0xffff0000))
            *dst = sample > 0 ? 0x7fff : -0x8000;
        else
            *dst = sample;
        dst += dst_step;
        src += src_step;
    }
}

void mix_areas1(unsigned int size,
                volatile s16 *dst, s16 *src,
                volatile s32 *sum,
                unsigned int dst_step,
                unsigned int src_step,
                unsigned int sum_step)
{
    /*
     *  ESI - src
     *  EDI - dst
     *  EBX - sum
     *  ECX - old sample
     *  EAX - sample / temporary
     *  EDX - size
     */
    __asm__ __volatile__ (
        "\n"
        /*
         *  initialization, load EDX, ESI, EDI, EBX registers
         */
        "\tmovl %0, %%edx\n"
        "\tmovl %1, %%edi\n"
        "\tmovl %2, %%esi\n"
        "\tmovl %3, %%ebx\n"
        /*
         * while (size-- > 0) {
         */
        "\tcmp $0, %%edx\n"
        "jz 6f\n"
        "1:"
        /*
         *   sample = *src;
         *   if (cmpxchg(*dst, 0, 1) == 0)
         *     sample -= *sum;
         *   xadd(*sum, sample);
         */
        "\tmovw $0, %%ax\n"
        "\tmovw $1, %%cx\n"
        "\tlock; cmpxchgw %%cx, (%%edi)\n"
        "\tmovswl (%%esi), %%ecx\n"
        "\tjnz 2f\n"
        "\tsubl (%%ebx), %%ecx\n"
        "2:"
        "\tlock; addl %%ecx, (%%ebx)\n"
        /*
         *   do {
         *     sample = old_sample = *sum;
         *     saturate(v);
         *     *dst = sample;
         *   } while (v != *sum);
         */
        "3:"
        "\tmovl (%%ebx), %%ecx\n"
        "\tcmpl $0x7fff,%%ecx\n"
        "\tjg 4f\n"
        "\tcmpl $-0x8000,%%ecx\n"
        "\tjl 5f\n"
        "\tmovw %%cx, (%%edi)\n"
        "\tcmpl %%ecx, (%%ebx)\n"
        "\tjnz 3b\n"
        /*
         * while (size-- > 0)
         */
        "\tadd %4, %%edi\n"
        "\tadd %5, %%esi\n"
        "\tadd %6, %%ebx\n"
        "\tdecl %%edx\n"
        "\tjnz 1b\n"
        "\tjmp 6f\n"
        /*
         *  sample > 0x7fff
         */
        "4:"
        "\tmovw $0x7fff, %%ax\n"
        "\tmovw %%ax, (%%edi)\n"
        "\tcmpl %%ecx,(%%ebx)\n"
        "\tjnz 3b\n"
        "\tadd %4, %%edi\n"
        "\tadd %5, %%esi\n"
        "\tadd %6, %%ebx\n"
        "\tdecl %%edx\n"
        "\tjnz 1b\n"
        "\tjmp 6f\n"
        /*
         *  sample < -0x8000
         */
        "5:"
        "\tmovw $-0x8000, %%ax\n"
        "\tmovw %%ax, (%%edi)\n"
        "\tcmpl %%ecx, (%%ebx)\n"
        "\tjnz 3b\n"
        "\tadd %4, %%edi\n"
        "\tadd %5, %%esi\n"
        "\tadd %6, %%ebx\n"
        "\tdecl %%edx\n"
        "\tjnz 1b\n"
        // "\tjmp 6f\n"
        "6:"
        : /* no output regs */
        : "m" (size), "m" (dst), "m" (src), "m" (sum), "m" (dst_step), "m" (src_step), "m" (sum_step)
        : "esi", "edi", "edx", "ecx", "ebx", "eax"
    );
}

void mix_areas2(unsigned int size,
                volatile s16 *dst, s16 *src,
                volatile s32 *sum,
                unsigned int dst_step, unsigned int src_step)
{
    while (size-- > 0) {
        s32 sample = *src;
        if (cmpxchg(dst, 0, 1) == 0)
            sample -= *sum;
        atomic_add(sum, sample);
        do {
            sample = *sum;
            s16 s;
            if (unlikely(sample & 0xffff0000))
                s = sample > 0 ? 0x7fff : -0x8000;
            else
                s = sample;
            *dst = s;
        } while (unlikely(sample != *sum));
        sum++;
        dst += dst_step;
        src += src_step;
    }
}

int main(int argc, char **argv)
{
    int size = atoi(argv[1]);
    int n = atoi(argv[2]);
    int max = atoi(argv[3]);
    int i;
    unsigned long long begin, end;
    s16 *dst = malloc(sizeof(*dst) * size);
    s32 *sum = calloc(size, sizeof(*sum));
    s16 **srcs = malloc(sizeof(*srcs) * n);
    for (i = 0; i < n; i++) {
        int k;
        s16 *s;
        srcs[i] = s = malloc(sizeof(s16) * size);
        for (k = 0; k < size; ++k, ++s) {
            *s = (rand() % (max * 2)) - max;
        }
    }
    rdtscll(begin);
    for (i = 0; i < n; i++) {
        mix_areas0(size, dst, srcs[i], sum, 1, 1);
    }
    rdtscll(end);
    printf("mix_areas0: %lld\n", end - begin);
    rdtscll(begin);
    for (i = 0; i < n; i++) {
        mix_areas1(size, dst, srcs[i], sum, 1, 1, 1);
    }
    rdtscll(end);
    printf("mix_areas1: %lld\n", end - begin);
    rdtscll(begin);
    for (i = 0; i < n; i++) {
        mix_areas2(size, dst, srcs[i], sum, 1, 1);
    }
    rdtscll(end);
    printf("mix_areas2: %lld\n", end - begin);
}
* Re: Re: dmix plugin
2003-02-19 13:49 ` Abramo Bagnara
@ 2003-02-19 15:45 ` Jaroslaw Sobierski
2003-02-19 20:39 ` Abramo Bagnara
2003-02-19 18:34 ` Jaroslav Kysela
1 sibling, 1 reply; 57+ messages in thread
From: Jaroslaw Sobierski @ 2003-02-19 15:45 UTC (permalink / raw)
To: Abramo Bagnara; +Cc: Jaroslav Kysela, alsa-devel@lists.sourceforge.net

Quoting Abramo Bagnara <abramo.bagnara@libero.it>:
>
> The results are amazing and I'd say Jaroslav has done some mistakes in
> his handmade asm.
>

This may be true, but I think you're trying to be a little too quick yourself.
Did you *test* your code? I only had time to take a short glance at it, but
to me it seems that this is not the correct check for overflow on signed
numbers:

> if (unlikely(sample & 0xffff0000))
>	s = sample > 0 ? 0x7fff : -0x8000;
> else
>	s = sample;

I noticed it because this is the first thought I had, but it only works
for unsigned. Notice that -1 will be 0xffffffff in a 32 bit sample. So
your code will "saturate" all negative samples to -0x8000, effectively
killing half of the wave, the way a diode does. I'm pretty sure this
would not sound good ;-). Still, even if you change this to two normal
ifs, I assume the speed will not be affected by an order of magnitude.

Secondly, the test code is hardly a good representation of our "working"
environment because we're expecting multiple processes to write
concurrently to the buffer. I think you should have a "verification"
procedure which carefully mixes the waves one by one, and then the
n test mixes should be run in m processes concurrently, and the result
compared to the "verification" table.

--------------
Fycio (J.Sobierski)
fycio@gucio.com
-------------------------------------------------------
This SF.net email is sponsored by: SlickEdit Inc. Develop an edge.
The most comprehensive and flexible code editor you can use.
Code faster. C/C++, C#, Java, HTML, XML, many more. FREE 30-Day Trial.
www.slickedit.com/sourceforge
^ permalink raw reply [flat|nested] 57+ messages in thread
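[Editorial illustration, not part of the original message.] The objection to the mask test can be checked with a few lines of C. The sketch below is a minimal one, assuming 16-bit samples summed into a 32-bit value as in the thread; the function names `mask_says_overflow`, `quoted_branch` and `saturate_s16` are hypothetical and do not appear in sum.c.

```c
#include <stdint.h>

/* The mask test from the quoted code: it is nonzero whenever any of the
 * upper 16 bits is set, which is the case for every negative value. */
int mask_says_overflow(int32_t sample)
{
	return ((uint32_t)sample & 0xffff0000u) != 0;
}

/* What the quoted branch would store in *dst for a given 32-bit sum. */
int16_t quoted_branch(int32_t sample)
{
	if (mask_says_overflow(sample))
		return sample > 0 ? 0x7fff : -0x8000;
	return (int16_t)sample;
}

/* The "two normal ifs" alternative: a plain signed range check. */
int16_t saturate_s16(int32_t sample)
{
	if (sample > 0x7fff)
		return 0x7fff;
	if (sample < -0x8000)
		return -0x8000;
	return (int16_t)sample;
}
```

With these definitions, `quoted_branch(-1)` yields -0x8000 while `saturate_s16(-1)` yields -1, which is the "diode" effect described in the message.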
* Re: Re: dmix plugin
2003-02-19 15:45 ` Jaroslaw Sobierski
@ 2003-02-19 20:39 ` Abramo Bagnara
0 siblings, 0 replies; 57+ messages in thread
From: Abramo Bagnara @ 2003-02-19 20:39 UTC (permalink / raw)
To: Jaroslaw Sobierski; +Cc: Jaroslav Kysela, alsa-devel@lists.sourceforge.net

Jaroslaw Sobierski wrote:
>
> Quoting Abramo Bagnara <abramo.bagnara@libero.it>:
> >
> > The results are amazing and I'd say Jaroslav has done some mistakes in
> > his handmade asm.
> >
>
> This may be true, but I think you're trying to be a little too quick yourself.

No doubt about that, I was in a hurry.

> Did you *test* your code? I only had time to take a short glance at it, but
> to me it seems that this is not the correct check for overflow on signed
> numbers:
>
> > if (unlikely(sample & 0xffff0000))
> >	s = sample > 0 ? 0x7fff : -0x8000;
> > else
> >	s = sample;
>
> I noticed it because this is the first thought I had, but it only works
> for unsigned. Notice that -1 will be 0xffffffff in a 32 bit sample. So
> your code will "saturate" all negative samples to -0x8000, effectively
> killing half of the wave, the way a diode does. I'm pretty sure this
> would not sound good ;-). Still, even if you change this to two normal
> ifs, I assume the speed will not be affected by an order of magnitude.
>
> Secondly, the test code is hardly a good representation of our "working"
> environment because we're expecting multiple processes to write
> concurrently to the buffer. I think you should have a "verification"
> procedure which carefully mixes the waves one by one, and then the
> n test mixes should be run in m processes concurrently, and the result
> compared to the "verification" table.

This is best tested on an SMP machine, and I don't have easy access to one.

That apart, you're perfectly right, and this was exactly my intention.

--
Abramo Bagnara                       mailto:abramo.bagnara@libero.it

Opera Unica                          Phone: +39.546.656023
Via Emilia Interna, 140
48014 Castel Bolognese (RA) - Italy
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: Re: dmix plugin
2003-02-19 13:49 ` Abramo Bagnara
2003-02-19 15:45 ` Jaroslaw Sobierski
@ 2003-02-19 18:34 ` Jaroslav Kysela
2003-02-19 21:24 ` Jaroslav Kysela
` (3 more replies)
1 sibling, 4 replies; 57+ messages in thread
From: Jaroslav Kysela @ 2003-02-19 18:34 UTC (permalink / raw)
To: Abramo Bagnara; +Cc: Jaroslaw Sobierski, alsa-devel@lists.sourceforge.net

On Wed, 19 Feb 2003, Abramo Bagnara wrote:

> The results are amazing and I'd say Jaroslav has done some mistakes in
> his handmade asm.

I don't think so. It seems that my brain still remembers assembler ;-)
You passed wrong values to my code so it did unaligned accesses.

Fixes to make things same:

--- sum.c	2003-02-19 18:55:20.000000000 +0100
+++ a.c	2003-02-19 19:31:00.000000000 +0100
@@ -11,6 +11,8 @@
 typedef short int s16;
 typedef int s32;
 
+#define CONFIG_SMP
+
 #ifdef CONFIG_SMP
 #define LOCK_PREFIX "lock ; "
 #else
@@ -54,7 +56,7 @@
 static inline void atomic_add(volatile int *dst, int v)
 {
 	__asm__ __volatile__(
-		LOCK_PREFIX "addl %0,%1"
+		LOCK_PREFIX "addl %1,%0"
 		:"=m" (*dst)
 		:"ir" (v));
 }
@@ -62,7 +64,9 @@
 void mix_areas0(unsigned int size,
 		volatile s16 *dst, s16 *src,
 		volatile s32 *sum,
-		unsigned int dst_step, unsigned int src_step)
+		unsigned int dst_step,
+		unsigned int src_step,
+		unsigned int sum_step)
 {
 	while (size-- > 0) {
 		s32 sample = *dst + *src;
@@ -70,8 +74,8 @@
 			*dst = sample > 0 ? 0x7fff : -0x8000;
 		else
 			*dst = sample;
-		dst += dst_step;
-		src += src_step;
+		((char *)dst) += dst_step;
+		((char *)src) += src_step;
 	}
 }
 
@@ -194,7 +198,9 @@
 void mix_areas2(unsigned int size,
 		volatile s16 *dst, s16 *src,
 		volatile s32 *sum,
-		unsigned int dst_step, unsigned int src_step)
+		unsigned int dst_step,
+		unsigned int src_step,
+		unsigned int sum_step)
 {
 	while (size-- > 0) {
 		s32 sample = *src;
@@ -204,15 +210,15 @@
 		do {
 			sample = *sum;
 			s16 s;
-			if (unlikely(sample & 0xffff0000))
+			if (unlikely(sample & 0x7fff0000))
 				s = sample > 0 ? 0x7fff : -0x8000;
 			else
 				s = sample;
 			*dst = s;
 		} while (unlikely(sample != *sum));
-		sum++;
-		dst += dst_step;
-		src += src_step;
+		((char *)sum) += sum_step;
+		((char *)dst) += dst_step;
+		((char *)src) += src_step;
 	}
 }
 
@@ -236,19 +242,19 @@
 	}
 	rdtscll(begin);
 	for (i = 0; i < n; i++) {
-		mix_areas0(size, dst, srcs[i], sum, 1, 1);
+		mix_areas0(size, dst, srcs[i], sum, 2, 2, 4);
 	}
 	rdtscll(end);
 	printf("mix_areas0: %lld\n", end - begin);
 	rdtscll(begin);
 	for (i = 0; i < n; i++) {
-		mix_areas1(size, dst, srcs[i], sum, 1, 1, 1);
+		mix_areas1(size, dst, srcs[i], sum, 2, 2, 4);
 	}
 	rdtscll(end);
 	printf("mix_areas1: %lld\n", end - begin);
 	rdtscll(begin);
 	for (i = 0; i < n; i++) {
-		mix_areas2(size, dst, srcs[i], sum, 1, 1);
+		mix_areas2(size, dst, srcs[i], sum, 2, 2, 4);
 	}
 	rdtscll(end);
 	printf("mix_areas2: %lld\n", end - begin);

perex@pnote:~> cat /proc/cpuinfo
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 8
model name	: Pentium III (Coppermine)
stepping	: 6
cpu MHz		: 847.473
cache size	: 256 KB
fdiv_bug	: no
hlt_bug		: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 2
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr sse
bogomips	: 1679.36

perex@pnote:~> ./a.out 2048 4 32267
mix_areas0: 170691
mix_areas1: 675795
mix_areas2: 708995

Have fun,
Jaroslav

-----
Jaroslav Kysela <perex@suse.cz>
Linux Kernel Sound Maintainer
ALSA Project, SuSE Labs
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: Re: dmix plugin
2003-02-19 18:34 ` Jaroslav Kysela
@ 2003-02-19 21:24 ` Jaroslav Kysela
2003-02-20 8:28 ` Abramo Bagnara
` (2 subsequent siblings)
3 siblings, 0 replies; 57+ messages in thread
From: Jaroslav Kysela @ 2003-02-19 21:24 UTC (permalink / raw)
To: Abramo Bagnara; +Cc: Jaroslaw Sobierski, alsa-devel@lists.sourceforge.net

On Wed, 19 Feb 2003, Jaroslav Kysela wrote:

> perex@pnote:~> cat /proc/cpuinfo
> processor	: 0
> vendor_id	: GenuineIntel
> cpu family	: 6
> model		: 8
> model name	: Pentium III (Coppermine)
> stepping	: 6
> cpu MHz		: 847.473
> cache size	: 256 KB
> fdiv_bug	: no
> hlt_bug		: no
> f00f_bug	: no
> coma_bug	: no
> fpu		: yes
> fpu_exception	: yes
> cpuid level	: 2
> wp		: yes
> flags		: fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov
> pat pse36 mmx fxsr sse
> bogomips	: 1679.36
>
> perex@pnote:~> ./a.out 2048 4 32267
> mix_areas0: 170691
> mix_areas1: 675795
> mix_areas2: 708995

More results (with MMX code):

perex@pnote:~/alsa/alsa-lib/test> ./code 2048 4 32767
mix_areas0    : 172345
mix_areas1    : 677021
mix_areas1_mmx: 620597
mix_areas2    : 702227

Note - the test utility is in CVS - alsa-lib/test/code.c - now.

Jaroslav

-----
Jaroslav Kysela <perex@suse.cz>
Linux Kernel Sound Maintainer
ALSA Project, SuSE Labs
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: Re: dmix plugin
2003-02-19 18:34 ` Jaroslav Kysela
2003-02-19 21:24 ` Jaroslav Kysela
@ 2003-02-20 8:28 ` Abramo Bagnara
2003-02-20 8:30 ` Jaroslaw Sobierski
2003-02-20 8:53 ` Re: dmix plugin Abramo Bagnara
3 siblings, 0 replies; 57+ messages in thread
From: Abramo Bagnara @ 2003-02-20 8:28 UTC (permalink / raw)
To: Jaroslav Kysela; +Cc: Jaroslaw Sobierski, alsa-devel@lists.sourceforge.net

Jaroslav Kysela wrote:
>
> On Wed, 19 Feb 2003, Abramo Bagnara wrote:
>
> > The results are amazing and I'd say Jaroslav has done some mistakes in
> > his handmade asm.
>
> I don't think so. It seems that my brain still remembers assembler ;-)

I have no doubt about that ;-)

> You passed wrong values to my code so it did unaligned accesses.

I guessed that, but I was too lazy to analyze your asm in depth.

> Fixes to make things same:

>                 volatile s32 *sum,
> -               unsigned int dst_step, unsigned int src_step)
> +               unsigned int dst_step,
> +               unsigned int src_step,
> +               unsigned int sum_step)

sum_step is useless; I deliberately removed it. Please do the same in
your code.

> +               ((char *)dst) += dst_step;
> +               ((char *)src) += src_step;

IMHO it's a sane assumption that the step is a multiple of the sample
size. However, this should not have any impact on efficiency (at least I
believe so).

> -                       if (unlikely(sample & 0xffff0000))
> +                       if (unlikely(sample & 0x7fff0000))

As Jaroslaw has written, this is a mistake, and I've verified that the
right version has no speed benefits.

--
Abramo Bagnara                       mailto:abramo.bagnara@libero.it

Opera Unica                          Phone: +39.546.656023
Via Emilia Interna, 140
48014 Castel Bolognese (RA) - Italy
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: Re: dmix plugin
2003-02-19 18:34 ` Jaroslav Kysela
2003-02-19 21:24 ` Jaroslav Kysela
2003-02-20 8:28 ` Abramo Bagnara
@ 2003-02-20 8:30 ` Jaroslaw Sobierski
2003-02-20 8:48 ` Abramo Bagnara
2003-02-20 9:17 ` Echoaudio drivers Giuliano Pochini
2003-02-20 8:53 ` Re: dmix plugin Abramo Bagnara
3 siblings, 2 replies; 57+ messages in thread
From: Jaroslaw Sobierski @ 2003-02-20 8:30 UTC (permalink / raw)
To: Jaroslav Kysela; +Cc: Abramo Bagnara, alsa-devel@lists.sourceforge.net

Quoting Jaroslav Kysela <perex@suse.cz>:

> I don't think so. It seems that my brain still remembers assembler ;-)

...

>                         sample = *sum;
>                         s16 s;
> -                       if (unlikely(sample & 0xffff0000))
> +                       if (unlikely(sample & 0x7fff0000))
>                                 s = sample > 0 ? 0x7fff : -0x8000;
>                         else
>                                 s = sample;

I think I remember some of the x86 assembly myself, and this correction
does not fix the problem. This code will still "saturate" all negative
samples to -0x8000. You cannot detect an overflow into the upper half of
the register with a simple bitwise and. The actual test should be as follows:

- extend the sign of the lower half
- check if the upper half is the same as the result of the extension
  if it is     - there is no overflow
  if it differs - there was overflow and you need to saturate.

examples:

value 0x 0000 0335    ext 0x 0000 0335   -> no overflow
value 0x 0002 43b1    ext 0x 0000 43b1   -> overflow
value 0x ffff f25b    ext 0x ffff f25b   -> no overflow
value 0x ff1c 35c9    ext 0x 0000 35c9   -> overflow

to put it in asm:

	mov ebx,eax
	cwde
	cmp eax,ebx

The problem is that cwde operates only on ax/eax.

This may sound complicated, but in fact it amounts to a very simple
question: does the sample fit in a 16 bit int, or does it not? So I
guess in C it could look something like:

	s16 s=sample;
	if (unlikely(sample != (s32)s))

The cast is just there for clarity; I believe it would be done implicitly
anyway. But don't take my word for it - I did not test this.

--------------
Fycio (J.Sobierski)
fycio@gucio.com
^ permalink raw reply [flat|nested] 57+ messages in thread
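[Editorial illustration, not part of the original message.] The sign-extension test described above can be written out in portable C and run against the four example values. This is a sketch only: the helper names `sign_extend_low16`, `fits_s16` and `saturate_via_sign_extend` are hypothetical, and the sign extension is spelled with xor and subtract rather than with the `s16` cast from the message, so that it does not depend on how the compiler narrows out-of-range values.

```c
#include <stdint.h>

/* Sign-extend the low 16 bits of a 32-bit value, as cwde does for ax. */
int32_t sign_extend_low16(int32_t sample)
{
	int32_t low = (int32_t)((uint32_t)sample & 0xffffu);	/* 0..65535 */
	return (low ^ 0x8000) - 0x8000;			/* -32768..32767 */
}

/* The sample fits in 16 bits exactly when extending its low half
 * reproduces the original 32-bit value. */
int fits_s16(int32_t sample)
{
	return sign_extend_low16(sample) == sample;
}

/* Saturate using the fit test instead of a bit mask. */
int16_t saturate_via_sign_extend(int32_t sample)
{
	if (fits_s16(sample))
		return (int16_t)sample;
	return sample > 0 ? 0x7fff : -0x8000;
}
```

In signed notation the four examples are 0x335, 0x243b1, -0xda5 (0xfffff25b) and -0xe3ca37 (0xff1c35c9); the first and third fit, the second and fourth do not.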
* Re: Re: dmix plugin
2003-02-20 8:30 ` Jaroslaw Sobierski
@ 2003-02-20 8:48 ` Abramo Bagnara
2003-02-20 9:17 ` Echoaudio drivers Giuliano Pochini
1 sibling, 0 replies; 57+ messages in thread
From: Abramo Bagnara @ 2003-02-20 8:48 UTC (permalink / raw)
To: Jaroslaw Sobierski; +Cc: Jaroslav Kysela, alsa-devel

Jaroslaw Sobierski wrote:
>
>
>	s16 s=sample;
>	if (unlikely(sample != (s32)s))
>

I verified exactly this yesterday evening, but it's less efficient than
an ordinary boundary check.

--
Abramo Bagnara                       mailto:abramo.bagnara@libero.it

Opera Unica                          Phone: +39.546.656023
Via Emilia Interna, 140
48014 Castel Bolognese (RA) - Italy
^ permalink raw reply [flat|nested] 57+ messages in thread
* Echoaudio drivers
2003-02-20 8:30 ` Jaroslaw Sobierski
2003-02-20 8:48 ` Abramo Bagnara
@ 2003-02-20 9:17 ` Giuliano Pochini
2003-02-20 14:37 ` David Olofson
1 sibling, 1 reply; 57+ messages in thread
From: Giuliano Pochini @ 2003-02-20 9:17 UTC (permalink / raw)
To: alsa-devel

Is anyone writing drivers for Echoaudio cards? Perhaps
I'll buy one soon, and I can try to write drivers if nobody
else is working on it.

Bye.
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: Echoaudio drivers
2003-02-20 9:17 ` Echoaudio drivers Giuliano Pochini
@ 2003-02-20 14:37 ` David Olofson
2003-02-20 15:40 ` Giuliano Pochini
0 siblings, 1 reply; 57+ messages in thread
From: David Olofson @ 2003-02-20 14:37 UTC (permalink / raw)
To: alsa-devel

On Thursday 20 February 2003 10.17, Giuliano Pochini wrote:
> Is anyone writing drivers for Echoaudio cards? Perhaps
> I'll buy one soon, and I can try to write drivers if nobody
> else is working on it.

I have an old Layla20 and intend to write a driver for it. It
shouldn't be too much work to get the other cards working, I think,
but I can't test on anything but Layla20 myself.

Anyway, I'm short on hacking time these days, and I have some other
projects I need to deal with first.

//David Olofson - Programmer, Composer, Open Source Advocate

.- The Return of Audiality! --------------------------------.
| Free/Open Source Audio Engine for use in Games or Studio. |
| RT and off-line synth. Scripting. Sample accurate timing. |
`---------------------------> http://olofson.net/audiality -'
   --- http://olofson.net --- http://www.reologica.se ---
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: Echoaudio drivers
2003-02-20 14:37 ` David Olofson
@ 2003-02-20 15:40 ` Giuliano Pochini
2003-02-20 16:03 ` David Olofson
0 siblings, 1 reply; 57+ messages in thread
From: Giuliano Pochini @ 2003-02-20 15:40 UTC (permalink / raw)
To: David Olofson; +Cc: alsa-devel

On 20-Feb-2003 David Olofson wrote:
> On Thursday 20 February 2003 10.17, Giuliano Pochini wrote:
>> Is anyone writing drivers for Echoaudio cards? Perhaps
>> I'll buy one soon, and I can try to write drivers if nobody
>> else is working on it.
>
> I have an old Layla20 and intend to write a driver for it. It
> shouldn't be too much work to get the other cards working, I think,
> but I can't test on anything but Layla20 myself.

According to the official docs, the 20 and 24 bit versions differ only
in the DAC. The driver always sends data in the same format.
I've not looked at the sources yet.

Bye.
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: Echoaudio drivers
2003-02-20 15:40 ` Giuliano Pochini
@ 2003-02-20 16:03 ` David Olofson
0 siblings, 0 replies; 57+ messages in thread
From: David Olofson @ 2003-02-20 16:03 UTC (permalink / raw)
To: alsa-devel

On Thursday 20 February 2003 16.40, Giuliano Pochini wrote:
[...]
> According to official docs, 20 and 24bit versions have only a
> different DAC. The driver always sends data in the same format.
> I've not looked at the sources yet.

Yes, that seems to be the case. All models use 24 bit signal paths
internally, and they seem to use the same multichannel DMA engine and
stuff as well. There's specific firmware for pretty much every model
in their driver, but on the host side, it seems like it's mostly
about configurations and feature sets.

//David Olofson - Programmer, Composer, Open Source Advocate

.- The Return of Audiality! --------------------------------.
| Free/Open Source Audio Engine for use in Games or Studio. |
| RT and off-line synth. Scripting. Sample accurate timing. |
`---------------------------> http://olofson.net/audiality -'
   --- http://olofson.net --- http://www.reologica.se ---
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: Re: dmix plugin
2003-02-19 18:34 ` Jaroslav Kysela
` (2 preceding siblings ...)
2003-02-20 8:30 ` Jaroslaw Sobierski
@ 2003-02-20 8:53 ` Abramo Bagnara
2003-02-20 16:49 ` Jaroslav Kysela
3 siblings, 1 reply; 57+ messages in thread
From: Abramo Bagnara @ 2003-02-20 8:53 UTC (permalink / raw)
To: Jaroslav Kysela; +Cc: Jaroslaw Sobierski, alsa-devel@lists.sourceforge.net

[-- Attachment #1: Type: text/plain, Size: 893 bytes --]

Jaroslav Kysela wrote:
>
> On Wed, 19 Feb 2003, Abramo Bagnara wrote:
>
> > The results are amazing and I'd say Jaroslav has done some mistakes in
> > his handmade asm.
>
> I don't think so. It seems that my brain still remembers assembler ;-)
> You passed wrong values to my code so it did unaligned accesses.
>
> Fixes to make things same:

I've done the needed changes in my version of sum.c to get correct
results from asm version, but I'm still unable to get from it good
performance numbers.

I'm puzzled...

$ ./sum 2048 8 32768
CPU clock: 1460474444.671998
mix_areas0: 90773 0.033459%
mix_areas1: 141173 0.052036% (1103)
mix_areas2: 870134 0.320731% (0)
mix_areas3: 343792 0.126722% (0)

--
Abramo Bagnara                       mailto:abramo.bagnara@libero.it

Opera Unica                          Phone: +39.546.656023
Via Emilia Interna, 140
48014 Castel Bolognese (RA) - Italy

[-- Attachment #2: sum.c --]
[-- Type: text/plain, Size: 7213 bytes --]

#include <stdlib.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/time.h>

#define rdtscll(val) \
	__asm__ __volatile__("rdtsc" : "=A" (val))

#define likely(x)	__builtin_expect((x),1)
#define unlikely(x)	__builtin_expect((x),0)

typedef short int s16;
typedef int s32;

#ifdef CONFIG_SMP
#define LOCK_PREFIX "lock ; "
#else
#define LOCK_PREFIX ""
#endif

struct __xchg_dummy { unsigned long a[100]; };
#define __xg(x) ((struct __xchg_dummy *)(x))

static inline unsigned long __cmpxchg(volatile void *ptr, unsigned long old,
				      unsigned long new, int size)
{
	unsigned long prev;
	switch (size) {
	case 1:
		__asm__ __volatile__(LOCK_PREFIX "cmpxchgb %b1,%2"
				     : "=a"(prev)
				     : "q"(new), "m"(*__xg(ptr)), "0"(old)
				     : "memory");
		return prev;
	case 2:
		__asm__ __volatile__(LOCK_PREFIX "cmpxchgw %w1,%2"
				     : "=a"(prev)
				     : "q"(new), "m"(*__xg(ptr)), "0"(old)
				     : "memory");
		return prev;
	case 4:
		__asm__ __volatile__(LOCK_PREFIX "cmpxchgl %1,%2"
				     : "=a"(prev)
				     : "q"(new), "m"(*__xg(ptr)), "0"(old)
				     : "memory");
		return prev;
	}
	return old;
}

#define cmpxchg(ptr,o,n)\
	((__typeof__(*(ptr)))__cmpxchg((ptr),(unsigned long)(o),\
					(unsigned long)(n),sizeof(*(ptr))))

static inline void atomic_add(volatile int *dst, int v)
{
	__asm__ __volatile__(
		LOCK_PREFIX "addl %1,%0"
		:"=m" (*dst)
		:"ir" (v), "m" (*dst));
}

static double detect_cpu_clock()
{
	struct timeval tm_begin, tm_end;
	unsigned long long tsc_begin, tsc_end;

	/* Warm cache */
	gettimeofday(&tm_begin, 0);

	rdtscll(tsc_begin);
	gettimeofday(&tm_begin, 0);

	usleep(1000000);

	rdtscll(tsc_end);
	gettimeofday(&tm_end, 0);

	return (tsc_end - tsc_begin) / (tm_end.tv_sec - tm_begin.tv_sec + (tm_end.tv_usec - tm_begin.tv_usec) / 1e6);
}

void mix_areas0(unsigned int size,
		const s16 *src,
		volatile s32 *sum,
		unsigned int src_step)
{
	while (size-- > 0) {
		atomic_add(sum, *src);
		(char*)src += src_step;
		sum++;
	}
}

void saturate(unsigned int size,
	      s16 *dst, const s32 *sum,
	      unsigned int dst_step)
{
	while (size-- > 0) {
		s32 sample = *sum;
		if (unlikely(sample < -0x8000))
			*dst = -0x8000;
		else if (unlikely(sample > 0x7fff))
			*dst = 0x7fff;
		else
			*dst = sample;
		(char*)dst += dst_step;
		sum++;
	}
}

void mix_areas1(unsigned int size,
		volatile s16 *dst, const s16 *src,
		unsigned int dst_step, unsigned int src_step)
{
	while (size-- > 0) {
		s32 sample = *dst + *src;
		if (unlikely(sample < -0x8000))
			*dst = -0x8000;
		else if (unlikely(sample > 0x7fff))
			*dst = 0x7fff;
		else
			*dst = sample;
		(char*)dst += dst_step;
		(char*)src += src_step;
	}
}

void mix_areas2(unsigned int size,
		volatile s16 *dst, const s16 *src,
		volatile s32 *sum,
		unsigned int dst_step,
		unsigned int src_step,
		unsigned int sum_step)
{
	/*
	 *  ESI - src
	 *  EDI - dst
	 *  EBX - sum
	 *  ECX - old sample
	 *  EAX - sample / temporary
	 *  EDX - size
	 */
	__asm__ __volatile__ (
		"\n"

		/*
		 *  initialization, load EDX, ESI, EDI, EBX registers
		 */
		"\tmovl %0, %%edx\n"
		"\tmovl %1, %%edi\n"
		"\tmovl %2, %%esi\n"
		"\tmovl %3, %%ebx\n"

		/*
		 * while (size-- > 0) {
		 */
		"\tcmp $0, %%edx\n"
		"jz 6f\n"

		"1:"

		/*
		 *   sample = *src;
		 *   if (cmpxchg(*dst, 0, 1) == 0)
		 *     sample -= *sum;
		 *   xadd(*sum, sample);
		 */
		"\tmovw $0, %%ax\n"
		"\tmovw $1, %%cx\n"
		"\tlock; cmpxchgw %%cx, (%%edi)\n"
		"\tmovswl (%%esi), %%ecx\n"
		"\tjnz 2f\n"
		"\tsubl (%%ebx), %%ecx\n"
		"2:"
		"\tlock; addl %%ecx, (%%ebx)\n"

		/*
		 *   do {
		 *     sample = old_sample = *sum;
		 *     saturate(v);
		 *     *dst = sample;
		 *   } while (v != *sum);
		 */

		"3:"
		"\tmovl (%%ebx), %%ecx\n"
		"\tcmpl $0x7fff,%%ecx\n"
		"\tjg 4f\n"
		"\tcmpl $-0x8000,%%ecx\n"
		"\tjl 5f\n"
		"\tmovw %%cx, (%%edi)\n"
		"\tcmpl %%ecx, (%%ebx)\n"
		"\tjnz 3b\n"

		/*
		 * while (size-- > 0)
		 */
		"\tadd %4, %%edi\n"
		"\tadd %5, %%esi\n"
		"\tadd %6, %%ebx\n"
		"\tdecl %%edx\n"
		"\tjnz 1b\n"
		"\tjmp 6f\n"

		/*
		 *  sample > 0x7fff
		 */

		"4:"
		"\tmovw $0x7fff, %%ax\n"
		"\tmovw %%ax, (%%edi)\n"
		"\tcmpl %%ecx,(%%ebx)\n"
		"\tjnz 3b\n"
		"\tadd %4, %%edi\n"
		"\tadd %5, %%esi\n"
		"\tadd %6, %%ebx\n"
		"\tdecl %%edx\n"
		"\tjnz 1b\n"
		"\tjmp 6f\n"

		/*
		 *  sample < -0x8000
		 */

		"5:"
		"\tmovw $-0x8000, %%ax\n"
		"\tmovw %%ax, (%%edi)\n"
		"\tcmpl %%ecx, (%%ebx)\n"
		"\tjnz 3b\n"
		"\tadd %4, %%edi\n"
		"\tadd %5, %%esi\n"
		"\tadd %6, %%ebx\n"
		"\tdecl %%edx\n"
		"\tjnz 1b\n"
		// "\tjmp 6f\n"

		"6:"

		: /* no output regs */
		: "m" (size), "m" (dst), "m" (src), "m" (sum), "m" (dst_step), "m" (src_step), "m" (sum_step)
		: "esi", "edi", "edx", "ecx", "ebx", "eax"
	);
}

void mix_areas3(unsigned int size,
		volatile s16 *dst, const s16 *src,
		volatile s32 *sum,
		unsigned int dst_step, unsigned int src_step)
{
	while (size-- > 0) {
		s32 sample = *src;
		if (cmpxchg(dst, 0, 1) == 0)
			sample -= *sum;
		atomic_add(sum, sample);
		do {
			sample = *sum;
			if (unlikely(sample < -0x8000))
				*dst = -0x8000;
			else if (unlikely(sample > 0x7fff))
				*dst = 0x7fff;
			else
				*dst = sample;
		} while (unlikely(sample != *sum));
		sum++;
		(char*)dst += dst_step;
		(char*)src += src_step;
	}
}

int compare(const s16* b1, const s16 *b2, unsigned int size)
{
	unsigned int c = 0;
	while (size-- > 0) {
		if (*b1 != *b2)
			c++;
		b1++;
		b2++;
	}
	return c;
}

int main(int argc, char **argv)
{
	int size = atoi(argv[1]);
	int n = atoi(argv[2]);
	int max = atoi(argv[3]);
	int i;
	unsigned long long begin, end;
	s16 *dst = malloc(sizeof(*dst) * size);
	s16 *check = malloc(sizeof(*check) * size);
	s32 *sum = malloc(sizeof(*sum) * size);
	s16 **srcs = malloc(sizeof(*srcs) * n);
	double cpu_clock = detect_cpu_clock();
	printf("CPU clock: %f\n", cpu_clock);
	for (i = 0; i < n; i++) {
		int k;
		s16 *s;
		srcs[i] = s = malloc(sizeof(s16) * size);
		for (k = 0; k < size; ++k, ++s) {
			*s = (rand() % (max * 2)) - max;
		}
	}

	memset(sum, 0, sizeof(*sum) * size);
	rdtscll(begin);
	for (i = 0; i < n; i++) {
		mix_areas0(size, srcs[i], sum, 2);
	}
	saturate(size, check, sum, 2);
	rdtscll(end);
	printf("mix_areas0: %lld %f%%\n", end - begin, 100*2*44100.0*(end - begin)/(size*n*cpu_clock));

	memset(dst, 0, sizeof(*dst) * size);
	rdtscll(begin);
	for (i = 0; i < n; i++) {
		mix_areas1(size, dst, srcs[i], 2, 2);
	}
	rdtscll(end);
	printf("mix_areas1: %lld %f%% (%d)\n", end - begin, 100*2*44100.0*(end - begin)/(size*n*cpu_clock), compare(dst, check, size));

	memset(sum, 0, sizeof(*sum) * size);
	rdtscll(begin);
	for (i = 0; i < n; i++) {
		mix_areas2(size, dst, srcs[i], sum, 2, 2, 4);
	}
	rdtscll(end);
	printf("mix_areas2: %lld %f%% (%d)\n", end - begin, 100*2*44100.0*(end - begin)/(size*n*cpu_clock), compare(dst, check, size));

	memset(sum, 0, sizeof(*sum) * size);
	rdtscll(begin);
	for (i = 0; i < n; i++) {
		mix_areas3(size, dst, srcs[i], sum, 2, 2);
	}
	rdtscll(end);
	printf("mix_areas3: %lld %f%% (%d)\n", end - begin, 100*2*44100.0*(end - begin)/(size*n*cpu_clock), compare(dst, check, size));
	return 0;
}
^ permalink raw reply [flat|nested] 57+ messages in thread
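[Editorial illustration, not part of the original message.] The percentage column printed by sum.c comes from the expression `100*2*44100.0*(end - begin)/(size*n*cpu_clock)`. One reading, offered here as an assumption rather than something the author states, is that it scales the measured ticks per sample to the share of one CPU needed to mix a single stereo 44.1 kHz stream. The helper name `mix_load_percent` is hypothetical.

```c
/* Same arithmetic as the printf calls in sum.c, split into named steps.
 * ticks:  TSC ticks spent mixing n buffers of `size` samples each
 * cpu_hz: TSC ticks per second, as measured by detect_cpu_clock() */
double mix_load_percent(unsigned long long ticks,
			unsigned int size, unsigned int n, double cpu_hz)
{
	const double samples_per_second = 2 * 44100.0;	/* stereo, 44.1 kHz */
	double ticks_per_sample = (double)ticks / ((double)size * n);

	return 100.0 * samples_per_second * ticks_per_sample / cpu_hz;
}
```

Feeding in the mix_areas0 figures from the run above (90773 ticks, size 2048, n 8, clock 1460474444.671998) reproduces the reported 0.033459%.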
* Re: Re: dmix plugin
2003-02-20 8:53 ` Re: dmix plugin Abramo Bagnara
@ 2003-02-20 16:49 ` Jaroslav Kysela
2003-02-20 17:57 ` Abramo Bagnara
0 siblings, 1 reply; 57+ messages in thread
From: Jaroslav Kysela @ 2003-02-20 16:49 UTC (permalink / raw)
To: Abramo Bagnara; +Cc: Jaroslaw Sobierski, alsa-devel@lists.sourceforge.net

On Thu, 20 Feb 2003, Abramo Bagnara wrote:

> Jaroslav Kysela wrote:
> >
> > On Wed, 19 Feb 2003, Abramo Bagnara wrote:
> >
> > > The results are amazing and I'd say Jaroslav has done some mistakes in
> > > his handmade asm.
> >
> > I don't think so. It seems that my brain still remembers assembler ;-)
> > You passed wrong values to my code so it did unaligned accesses.
> >
> > Fixes to make things same:
>
> I've done the needed changes in my version of sum.c to get correct
> results from asm version, but I'm still unable to get from it good
> performance numbers.
>
> I'm puzzled...
>
> $ ./sum 2048 8 32768
> CPU clock: 1460474444.671998
> mix_areas0: 90773 0.033459%
> mix_areas1: 141173 0.052036% (1103)
> mix_areas2: 870134 0.320731% (0)
> mix_areas3: 343792 0.126722% (0)

1) my asm code used the lock prefix, so there are huge differences in the
   code for UP and MP on i386
2) we need to clear the dst and sum buffers to work with the same values
   for all routines
3) we need to clear the CPU caches

I've committed an updated alsa-lib/test/code.c which solves all these
troubles, and I've added further optimizations to my asm routine. The
results are (not impressive, but I'm better than GCC, especially using
the MMX saturation instruction):

pnote:/home/perex/alsa/alsa-lib/test # ./code 2048 8 32768
Scheduler set to Round Robin with priority 99...
CPU clock: 847.293134Mhz (UP)

Summary (the best times):
mix_areas0    : 548456
mix_areas1    : 863636
mix_areas1_mmx: 629765
mix_areas2    : 910819

pnote:/home/perex/alsa/alsa-lib/test # ./code 2048 8 32768
Scheduler set to Round Robin with priority 99...
CPU clock: 847.293395Mhz (SMP)

Summary (the best times):
mix_areas0    : 562342
mix_areas1    : 1705274
mix_areas1_mmx: 1565539
mix_areas2    : 1735491

Jaroslav

-----
Jaroslav Kysela <perex@suse.cz>
Linux Kernel Sound Maintainer
ALSA Project, SuSE Labs
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: Re: dmix plugin
2003-02-20 16:49 ` Jaroslav Kysela
@ 2003-02-20 17:57 ` Abramo Bagnara
2003-02-20 18:26 ` Paul Davis
2003-02-20 19:55 ` Jaroslav Kysela
0 siblings, 2 replies; 57+ messages in thread
From: Abramo Bagnara @ 2003-02-20 17:57 UTC (permalink / raw)
To: Jaroslav Kysela; +Cc: Jaroslaw Sobierski, alsa-devel@lists.sourceforge.net

Jaroslav Kysela wrote:
>
> On Thu, 20 Feb 2003, Abramo Bagnara wrote:
>
> > Jaroslav Kysela wrote:
> > >
> > > On Wed, 19 Feb 2003, Abramo Bagnara wrote:
> > >
> > > > The results are amazing and I'd say Jaroslav has done some mistakes in
> > > > his handmade asm.
> > >
> > > I don't think so. It seems that my brain still remembers assembler ;-)
> > > You passed wrong values to my code so it did unaligned accesses.
> > >
> > > Fixes to make things same:
> >
> > I've done the needed changes in my version of sum.c to get correct
> > results from asm version, but I'm still unable to get from it good
> > performance numbers.
> >
> > I'm puzzled...
> >
> > $ ./sum 2048 8 32768
> > CPU clock: 1460474444.671998
> > mix_areas0: 90773 0.033459%
> > mix_areas1: 141173 0.052036% (1103)
> > mix_areas2: 870134 0.320731% (0)
> > mix_areas3: 343792 0.126722% (0)
>
> 1) my asm code used lock prefix so there are huge differences in code for
>    UP and MP on i386

Indeed, this made the difference.

> 2) we need to clear dst and sum buffers to work with same values for all
>    routines

This was already present in sum.c.

> 3) we need to clear the CPU caches

This has a negligible impact in sum.c.

> I've commited updated alsa-lib/test/code.c which solves all these troubles
> and I've added next optimizations to my asm routine and results are (not
> impressive, but I'm better than GCC, especially using MMX
> saturation instruction):

Now I'm able to get the same results you see.

However, I think that we need to draw some conclusions from this data.
I'll leave the MMX optimizations aside because I want to compare apples
with apples.

The distributed saturation (also when it's missing the check/repeat
concurrency correctness part) costs more than 4 times the ticks needed
for a (fully correct wrt concurrency) saturate-once approach for the case
2048 8 32768.

CPU clock: 1460477150.884593
mix_areas0: 86747 0.031975%
mix_areas1: 259424 0.095623% (0)
mix_areas1_mmx: 253894 0.093585% (0)
mix_areas2: 132321 0.048773% (365)
mix_areas3: 332411 0.122526% (0)

The server based approach has the added cost of an extra context switch
every period (about 1500 cycles on my machine, for example), but this is
fully amortized by such a huge difference.

What's your opinion?

--
Abramo Bagnara                       mailto:abramo.bagnara@libero.it

Opera Unica                          Phone: +39.546.656023
Via Emilia Interna, 140
48014 Castel Bolognese (RA) - Italy
^ permalink raw reply [flat|nested] 57+ messages in thread
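[Editorial illustration, not part of the original message.] The "saturate once" approach compared above has two steps: every writer adds its samples into a shared 32-bit sum buffer, and a single pass clamps the sums to 16 bits. The sketch below follows the shape of `mix_areas0()` and `saturate()` from sum.c but swaps the hand-written `lock addl` for C11 `<stdatomic.h>`, which is an assumption of this example and not what the thread used; the names `mix_into_sum` and `saturate_once` are hypothetical, and sharing the buffer between processes is left out.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Step 1, run by every writer: add the source samples into the shared
 * sum buffer. The atomic add keeps concurrent writers from losing
 * updates; no saturation happens here. */
void mix_into_sum(size_t size, const int16_t *src, _Atomic int32_t *sum)
{
	for (size_t i = 0; i < size; i++)
		atomic_fetch_add(&sum[i], src[i]);
}

/* Step 2, run once per period by a single party: clamp each 32-bit sum
 * into the 16-bit output buffer. */
void saturate_once(size_t size, int16_t *dst, _Atomic int32_t *sum)
{
	for (size_t i = 0; i < size; i++) {
		int32_t s = atomic_load(&sum[i]);

		if (s > 0x7fff)
			dst[i] = 0x7fff;
		else if (s < -0x8000)
			dst[i] = -0x8000;
		else
			dst[i] = (int16_t)s;
	}
}
```

Because the clamp runs only after all writers have added their data, there is no per-sample retry loop, which is where the timings quoted above differ most.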
* Re: Re: dmix plugin
2003-02-20 17:57 ` Abramo Bagnara
@ 2003-02-20 18:26 ` Paul Davis
2003-02-20 19:23 ` unterminated conditionals: @HAVE_JACK_TRUE@ tomasz motylewski
2003-02-20 22:14 ` Re: dmix plugin Abramo Bagnara
2003-02-20 19:55 ` Jaroslav Kysela
1 sibling, 2 replies; 57+ messages in thread
From: Paul Davis @ 2003-02-20 18:26 UTC (permalink / raw)
To: alsa-devel@lists.sourceforge.net

>The server based approach has the added cost of an extra context switch
>every period (about 1500 cycles on my machine, for example), but this is
>fully amortized by such a huge difference.

recall that (1) the context switch time is not a fixed cost but
depends on the memory behaviour between switches and (2) isn't it
either two switches per participating client/application, or if they
are chained (as in JACK), N+2 switches, where N is the number of
clients/applications?
^ permalink raw reply [flat|nested] 57+ messages in thread
* unterminated conditionals: @HAVE_JACK_TRUE@
2003-02-20 18:26 ` Paul Davis
@ 2003-02-20 19:23 ` tomasz motylewski
2003-02-20 19:57 ` Jaroslav Kysela
2003-02-20 22:14 ` Re: dmix plugin Abramo Bagnara
1 sibling, 1 reply; 57+ messages in thread
From: tomasz motylewski @ 2003-02-20 19:23 UTC (permalink / raw)
To: alsa-devel@lists.sourceforge.net

Debian woody, current cvs:

./build prep
Pre-configuring alsa-driver
make: Nothing to be done for `all-deps'.
Pre-configuring alsa-lib
src/pcm/Makefile.am:6: JACK_PLUGIN multiply defined in condition
automake: src/pcm/Makefile.am: unterminated conditionals: @HAVE_JACK_TRUE@
src/pcm/Makefile.am:9: warning: automake does not support conditional
definition of JACK_PLUGIN in libpcm_la_SOURCES

Then after ./build config I get in alsa-lib/src/pcm/Makefile

@HAVE_JACK_TRUE@else !HAVE_JACK
@HAVE_JACK_TRUE@endif !HAVE_JACK

@HAVE_JACK_TRUE@all: libpcm.la

Best regards,
--
Tomasz Motylewski

^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: unterminated conditionals: @HAVE_JACK_TRUE@
2003-02-20 19:23 ` unterminated conditionals: @HAVE_JACK_TRUE@ tomasz motylewski
@ 2003-02-20 19:57 ` Jaroslav Kysela
2003-02-20 20:30 ` tomasz motylewski
0 siblings, 1 reply; 57+ messages in thread
From: Jaroslav Kysela @ 2003-02-20 19:57 UTC (permalink / raw)
To: tomasz motylewski; +Cc: alsa-devel@lists.sourceforge.net

On Thu, 20 Feb 2003, tomasz motylewski wrote:

>
> Debian woody, current cvs:
>
> ./build prep
> Pre-configuring alsa-driver
> make: Nothing to be done for `all-deps'.
> Pre-configuring alsa-lib
> src/pcm/Makefile.am:6: JACK_PLUGIN multiply defined in condition
> automake: src/pcm/Makefile.am: unterminated conditionals: @HAVE_JACK_TRUE@
> src/pcm/Makefile.am:9: warning: automake does not support conditional
> definition of JACK_PLUGIN in libpcm_la_SOURCES
>
> Then after ./build config I get in alsa-lib/src/pcm/Makefile
>
> @HAVE_JACK_TRUE@else !HAVE_JACK
> @HAVE_JACK_TRUE@endif !HAVE_JACK
>
> @HAVE_JACK_TRUE@all: libpcm.la

Could you try to remove !HAVE_JACK string?

						Jaroslav

-----
Jaroslav Kysela <perex@suse.cz>
Linux Kernel Sound Maintainer
ALSA Project, SuSE Labs

^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: unterminated conditionals: @HAVE_JACK_TRUE@
2003-02-20 19:57 ` Jaroslav Kysela
@ 2003-02-20 20:30 ` tomasz motylewski
0 siblings, 0 replies; 57+ messages in thread
From: tomasz motylewski @ 2003-02-20 20:30 UTC (permalink / raw)
To: Jaroslav Kysela; +Cc: alsa-devel@lists.sourceforge.net

On Thu, 20 Feb 2003, Jaroslav Kysela wrote:

> > Then after ./build config I get in alsa-lib/src/pcm/Makefile
> >
> > @HAVE_JACK_TRUE@else !HAVE_JACK
> > @HAVE_JACK_TRUE@endif !HAVE_JACK
> >
> > @HAVE_JACK_TRUE@all: libpcm.la
>
> Could you try to remove !HAVE_JACK string?

From where? I have just removed those 3 lines from that Makefile and run
./build all again. This time:

Making all in alsamixer
make[1]: Entering directory `/root/ALSA/alsa-utils/alsamixer'
cd .. && automake --foreign alsamixer/Makefile
cd .. \
&& CONFIG_FILES=alsamixer/Makefile CONFIG_HEADERS= /bin/sh ./config.status
creating alsamixer/Makefile
make[1]: Leaving directory `/root/ALSA/alsa-utils/alsamixer'
make[1]: Entering directory `/root/ALSA/alsa-utils/alsamixer'
gcc -DHAVE_CONFIG_H -I. -I. -I../include -g -O2 -c alsamixer.c
gcc -g -O2 -o alsamixer alsamixer.o -lncurses -lasound -lm -ldl -lpthread
alsamixer.o: In function `update_enum_list':
/root/ALSA/alsa-utils/alsamixer/alsamixer.c:513: undefined reference to `snd_mixer_selem_get_enum_item'
/root/ALSA/alsa-utils/alsamixer/alsamixer.c:520: undefined reference to `snd_mixer_selem_get_enum_items'
/root/ALSA/alsa-utils/alsamixer/alsamixer.c:527: undefined reference to `snd_mixer_selem_set_enum_item'
alsamixer.o: In function `display_enum_list':
/root/ALSA/alsa-utils/alsamixer/alsamixer.c:696: undefined reference to `snd_mixer_selem_get_enum_item'
/root/ALSA/alsa-utils/alsamixer/alsamixer.c:699: undefined reference to `snd_mixer_selem_get_enum_item_name'
alsamixer.o: In function `mixer_reinit':
/root/ALSA/alsa-utils/alsamixer/alsamixer.c:1512: undefined reference to `snd_mixer_selem_is_enumerated'
collect2: ld returned 1 exit status
make[1]: *** [alsamixer] Error 1
make[1]: Leaving directory `/root/ALSA/alsa-utils/alsamixer'
make: *** [all-recursive] Error 1

The problem is that I have a (previous version?) of
/usr/lib/libasound.so.2
/usr/lib/libasound.so.2.0.0

But shouldn't the build script take care of it by linking against
../../alsa-lib/ ?

I ran "make install" in alsa-lib and then ./build all again, and it went
through OK.

Best regards,
--
Tomek

^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: Re: dmix plugin
2003-02-20 18:26 ` Paul Davis
2003-02-20 19:23 ` unterminated conditionals: @HAVE_JACK_TRUE@ tomasz motylewski
@ 2003-02-20 22:14 ` Abramo Bagnara
1 sibling, 0 replies; 57+ messages in thread
From: Abramo Bagnara @ 2003-02-20 22:14 UTC (permalink / raw)
To: Paul Davis; +Cc: alsa-devel@lists.sourceforge.net

Paul Davis wrote:
>
> >The server based approach has an added cost of an extra context switch
> >every period (about 1500 cycles on my machine i.e.), but this is fully
> >amortized by such an huge difference.
>
> recall that (1) the context switch time is not a fixed cost but

Mine was only a very rough approximation for trivial audio-generating
processes.

> depends on the memory behaviour between switches and (2) isn't it
> either two switches per participating client/application, or if they
> are chained (as in JACK), N+2 switches, where N is the number of
> clients/applications ?

I don't understand why...

Suppose that on an otherwise idle UP system we have 3 applications
generating output for the current pcm_dmix.

In this case we have something like ABCABCABCABC... etc.

In the pcm_mix case we use a saturate/transfer/zero thread called M and
then we'll have something like ABCMABCMABCMABCM... etc.

Do you agree?

--
Abramo Bagnara                       mailto:abramo.bagnara@libero.it

Opera Unica                          Phone: +39.546.656023
Via Emilia Interna, 140
48014 Castel Bolognese (RA) - Italy

^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: Re: dmix plugin
2003-02-20 17:57 ` Abramo Bagnara
2003-02-20 18:26 ` Paul Davis
@ 2003-02-20 19:55 ` Jaroslav Kysela
2003-02-20 21:19 ` tomasz motylewski
` (2 more replies)
1 sibling, 3 replies; 57+ messages in thread
From: Jaroslav Kysela @ 2003-02-20 19:55 UTC (permalink / raw)
To: Abramo Bagnara; +Cc: Jaroslaw Sobierski, alsa-devel@lists.sourceforge.net

On Thu, 20 Feb 2003, Abramo Bagnara wrote:

> Now I'm able to get the same results you see.
>
> However I think that we need to extract some results from this data.
>
> I'll leave alone MMX optimizations because I want to compare apples with
> apples.
>
> The distributed saturation (also when it's missing the check/repeat
> concurrency correctness part) costs more than 4 times the ticks needed
> for a (fully correct wrt concurrency) saturate once approach for the
> case 2048 8 32768.
>
> CPU clock: 1460477150.884593
> mix_areas0:      86747  0.031975%
> mix_areas1:     259424  0.095623% (0)
> mix_areas1_mmx: 253894  0.093585% (0)
> mix_areas2:     132321  0.048773% (365)
> mix_areas3:     332411  0.122526% (0)
>
> The server based approach has an added cost of an extra context switch
> every period (about 1500 cycles on my machine i.e.), but this is fully
> amortized by such an huge difference.
>
> What's your opinion?

Interestingly, my Intel P3 CPU gives slightly different times:

pnote:/home/perex/alsa/alsa-lib/test # ./code 2048 8 32768
Scheduler set to Round Robin with priority 99...
CPU clock: 847.292487Mhz (UP)

Summary (the best times):
mix_areas_srv : 576382  0.366206%
mix_areas0    : 556852  0.353798%
mix_areas1    : 867989  0.551480%
mix_areas1_mmx: 625144  0.397187%
mix_areas2    : 903335  0.573937%

areas1/srv ratio     : 1.505927
areas1_mmx/srv ratio : 1.084600

I think that we can lose more in the client/server model. Also, note that
we can even use futexes (if there's a hope that the possible context
switch is acceptable) and then we can remove the cmpxchg trick and the
write-retry trick and use MMX for parallel saturation of two samples (this
last can be used in the client/server model, too, indeed).

						Jaroslav

-----
Jaroslav Kysela <perex@suse.cz>
Linux Kernel Sound Maintainer
ALSA Project, SuSE Labs

^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: Re: dmix plugin
2003-02-20 19:55 ` Jaroslav Kysela
@ 2003-02-20 21:19 ` tomasz motylewski
2003-02-20 21:27 ` Jaroslav Kysela
2003-02-21 10:25 ` Abramo Bagnara
2003-02-21 14:08 ` Jaroslaw Sobierski
2 siblings, 1 reply; 57+ messages in thread
From: tomasz motylewski @ 2003-02-20 21:19 UTC (permalink / raw)
To: Jaroslav Kysela
Cc: Abramo Bagnara, Jaroslaw Sobierski, alsa-devel@lists.sourceforge.net

Jaroslav:
> I think that we can lose more in the client/server model.

Also, note that client/server will have higher latency. The server has to
copy the samples "last minute" to the DMA buffer, and the client has to
finish before the server copies the data. In the direct model only the
client's timing has to be within the typical (maximum) system latency.

Please note that on many cards supporting DMA, if the client is late by
just a few samples but still adds the whole period, only these few samples
will be silence.

The "nondestructive underrun detection" is the beauty here. The client
knows it is late (by comparing its pointer with the HW pointer) but may
continue nevertheless if it knows the next data will be coming on time.

You know, throwing out all samples or stopping the card in case of a small
underrun is like pulling the emergency brake because the train is a bit
late. It only makes things worse.

With client/server either all is good, or the whole period is lost.

Do I understand correctly that the server stores data in a 32-bit buffer
and then puts it in the 16-bit DMA buffer of the card? This is one
operation more compared with mixing directly in the DMA buffer.

Best regards,
--
Tomek

^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: Re: dmix plugin
2003-02-20 21:19 ` tomasz motylewski
@ 2003-02-20 21:27 ` Jaroslav Kysela
0 siblings, 0 replies; 57+ messages in thread
From: Jaroslav Kysela @ 2003-02-20 21:27 UTC (permalink / raw)
To: tomasz motylewski
Cc: Abramo Bagnara, Jaroslaw Sobierski, alsa-devel@lists.sourceforge.net

On Thu, 20 Feb 2003, tomasz motylewski wrote:

> Do I understand it correctly that the server stores data in 32 bit buffer and
> then puts it in 16 bit DMA buffer of the card? This is one operation more
> compared with mixing directly in DMA buffer.

There is no server and 32-bit buffer is used for total sum of samples from
all clients. Otherwise you'll get saturation errors (wrong clipping) as
described in the previous discussion.

						Jaroslav

-----
Jaroslav Kysela <perex@suse.cz>
Linux Kernel Sound Maintainer
ALSA Project, SuSE Labs

^ permalink raw reply [flat|nested] 57+ messages in thread
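To illustrate the "wrong clipping" point from the message above, here is a small sketch in C. It is not code from alsa-lib; the function names are made up for the example. It compares clipping once on an exact 32-bit total with clipping after every addition, which is what happens when only a 16-bit buffer holds the intermediate result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp a 32-bit value to the signed 16-bit sample range. */
int16_t saturate16(int32_t v)
{
    if (v > 0x7fff)
        return 0x7fff;
    if (v < -0x8000)
        return -0x8000;
    return (int16_t)v;
}

/* Keep the exact total in 32 bits and clip once at the end. */
int16_t mix_with_sum(const int16_t *samples, size_t n)
{
    int32_t sum = 0;

    assert(samples != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        sum += samples[i];
    return saturate16(sum);
}

/* Clip after every addition; the intermediate result never leaves 16 bits. */
int16_t mix_clip_each_step(const int16_t *samples, size_t n)
{
    int16_t acc = 0;

    assert(samples != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        acc = saturate16((int32_t)acc + samples[i]);
    return acc;
}
```

With the three samples 0x7000, 0x7000 and -0x7000 the exact total is 0x7000, which fits in 16 bits. Clipping after each step instead loses the overshoot at the second addition and ends at 0x0fff.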
* Re: Re: dmix plugin
2003-02-20 19:55 ` Jaroslav Kysela
2003-02-20 21:19 ` tomasz motylewski
@ 2003-02-21 10:25 ` Abramo Bagnara
2003-02-21 14:08 ` Jaroslaw Sobierski
2 siblings, 0 replies; 57+ messages in thread
From: Abramo Bagnara @ 2003-02-21 10:25 UTC (permalink / raw)
To: Jaroslav Kysela; +Cc: Jaroslaw Sobierski, alsa-devel@lists.sourceforge.net

Jaroslav Kysela wrote:
>
> On Thu, 20 Feb 2003, Abramo Bagnara wrote:
>
> > Now I'm able to get the same results you see.
> >
> > However I think that we need to extract some results from this data.
> >
> > I'll leave alone MMX optimizations because I want to compare apples with
> > apples.
> >
> > The distributed saturation (also when it's missing the check/repeat
> > concurrency correctness part) costs more than 4 times the ticks needed
> > for a (fully correct wrt concurrency) saturate once approach for the
> > case 2048 8 32768.
> >
> > CPU clock: 1460477150.884593
> > mix_areas0:      86747  0.031975%
> > mix_areas1:     259424  0.095623% (0)
> > mix_areas1_mmx: 253894  0.093585% (0)
> > mix_areas2:     132321  0.048773% (365)
> > mix_areas3:     332411  0.122526% (0)
> >
> > The server based approach has an added cost of an extra context switch
> > every period (about 1500 cycles on my machine i.e.), but this is fully
> > amortized by such an huge difference.
> >
> > What's your opinion?
>
> Interesting is that my Intel P3 CPU has slightly different times:
>
> pnote:/home/perex/alsa/alsa-lib/test # ./code 2048 8 32768
> Scheduler set to Round Robin with priority 99...
> CPU clock: 847.292487Mhz (UP)
>
> Summary (the best times):
> mix_areas_srv : 576382  0.366206%
> mix_areas0    : 556852  0.353798%
> mix_areas1    : 867989  0.551480%
> mix_areas1_mmx: 625144  0.397187%
> mix_areas2    : 903335  0.573937%
>
> areas1/srv ratio     : 1.505927
> areas1_mmx/srv ratio : 1.084600

This is due to the cache poisoning effect. This is quite surprising to me.
With a warm cache mix_areas_srv is 3 times faster than with a cold cache,
while there's a smaller difference with the other alternatives.

I've modified code.c so that you can also test this effect.

However I think that the realistic scenario is neither 0 nor 1024KB of
cache poison.

> I think that we can lose more in the client/server model. Also, note that
> we can use even futexes (if there's a hope that the possible context
> switch is acceptable) and then we can remove the cmpxchg trick and
> write-retry trick and use MMX for parallel saturation of two samples (this
> last can be used in the client/server model, too, indeed).

I really doubt that futexes might be of some help, as it's very difficult
to choose the unit a futex protects.

Also I like very much the fact that concurring processes are totally
independent. With futexes, if one process exits badly you're screwed.

What seems more interesting to my eyes in the dmix approach is (as Tomasz
has pointed out) the exceptionally good latency (which is the other side
of the repeated saturation cost).

However we will enjoy this benefit *only* if pcm_dmix is the last PCM of
the chain.

--
Abramo Bagnara                       mailto:abramo.bagnara@libero.it

Opera Unica                          Phone: +39.546.656023
Via Emilia Interna, 140
48014 Castel Bolognese (RA) - Italy

^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: Re: dmix plugin
2003-02-20 19:55 ` Jaroslav Kysela
2003-02-20 21:19 ` tomasz motylewski
2003-02-21 10:25 ` Abramo Bagnara
@ 2003-02-21 14:08 ` Jaroslaw Sobierski
2 siblings, 0 replies; 57+ messages in thread
From: Jaroslaw Sobierski @ 2003-02-21 14:08 UTC (permalink / raw)
To: Jaroslav Kysela
Cc: Abramo Bagnara, Tomasz Motylewski, alsa-devel@lists.sourceforge.net

Quoting Jaroslav Kysela <perex@suse.cz>:

> On Thu, 20 Feb 2003, Abramo Bagnara wrote:
>
> > Now I'm able to get the same results you see.
> >
> > However I think that we need to extract some results from this data.
> >
> > I'll leave alone MMX optimizations because I want to compare apples with
> > apples.
> >
> > The distributed saturation (also when it's missing the check/repeat
> > concurrency correctness part) costs more than 4 times the ticks needed
> > for a (fully correct wrt concurrency) saturate once approach for the
> > case 2048 8 32768.
> >
> > CPU clock: 1460477150.884593
> > mix_areas0:      86747  0.031975%
> > mix_areas1:     259424  0.095623% (0)
> > mix_areas1_mmx: 253894  0.093585% (0)
> > mix_areas2:     132321  0.048773% (365)
> > mix_areas3:     332411  0.122526% (0)
> >
> > The server based approach has an added cost of an extra context switch
> > every period (about 1500 cycles on my machine i.e.), but this is fully
> > amortized by such an huge difference.
> >
> > What's your opinion?
>
> Interesting is that my Intel P3 CPU has slightly different times:
>
> pnote:/home/perex/alsa/alsa-lib/test # ./code 2048 8 32768
> Scheduler set to Round Robin with priority 99...
> CPU clock: 847.292487Mhz (UP)
>
> Summary (the best times):
> mix_areas_srv : 576382  0.366206%
> mix_areas0    : 556852  0.353798%
> mix_areas1    : 867989  0.551480%
> mix_areas1_mmx: 625144  0.397187%
> mix_areas2    : 903335  0.573937%
>
> areas1/srv ratio     : 1.505927
> areas1_mmx/srv ratio : 1.084600
>
> I think that we can lose more in the client/server model. Also, note that
> we can use even futexes (if there's a hope that the possible context
> switch is acceptable) and then we can remove the cmpxchg trick and
> write-retry trick and use MMX for parallel saturation of two samples (this
> last can be used in the client/server model, too, indeed).
>
> Jaroslav
>

I'm not sure exactly what solution you're proposing here, but it seems to
be in line with my train of thought after seeing the results of these
tests. It seems that a fast, thread-unsafe implementation could have such a
huge speed advantage that the waiting imposed on other processes by global
locking would still be compensated. To give an example, if we can have a
mixing procedure that is 4 times quicker, then instead of having 3 threads
write concurrently for 12 seconds (that's 4 seconds of cpu time per
thread), they would write in turns, 1 second each, giving a total of 3
seconds. So the 1st thread to gain access could return after 1 second, the
2nd thread after 2 seconds and the 3rd after 3. That's still better than
one thread writing alone (for 4 seconds)! Yes, there is greater latency,
but it seems well compensated, at least for a reasonable number of
connected sound sources. Anything above 4 doesn't make much sense anyway if
our approach is to saturate rather than average - above this, distortions
will be very audible.

And if we devise a smart locking mechanism, this latency problem can be
reduced to a minimum. The locking and unlocking code would be within the
mixing function, thus preventing a badly coded application from blocking
indefinitely. A simple locking mechanism I'm considering is the following:

- we maintain a short table of ranges locked by each client (one for each).
- access to the table is synchronized with a single mutex
- a request to lock a region could be partially realized, i.e. if thread 1
  has locked offsets 300-500 and thread 2 wants 200-400 it will get access
  to 200-300, can mix there and then ask for the rest.

Additionally, the mixing function could be implemented to break the buffer
sent in into chunks of, say, 1024 bytes and would try to lock and mix those
segments in sequence. This would minimize the time spent waiting for other
threads.

It means a sound compromise (excuse the pun) between the convenience of
not waiting for other threads by effectively synchronizing on a per-sample
basis and the speed afforded by code which doesn't need to care about
synchronization, yet is not hindered by global blocking.

Am I making myself clear or does this sound totally convoluted?

--------------
Fycio (J.Sobierski)
fycio@gucio.com

^ permalink raw reply [flat|nested] 57+ messages in thread
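The range table proposed in the message above could be sketched roughly as follows. This is only an illustration under several assumptions: `MAX_CLIENTS`, the struct layout and all function names are hypothetical, ranges are half-open `[start, end)`, ring-buffer wrap-around is ignored, and a plain `pthread_mutex_t` stands in for whatever process-shared lock a real multi-process plugin would need.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_CLIENTS 8                 /* hypothetical limit for the sketch */

struct range {
    size_t start, end;                /* half-open [start, end); empty if equal */
};

struct range_table {
    pthread_mutex_t lock;             /* the single mutex guarding the table */
    struct range held[MAX_CLIENTS];   /* one locked range per client */
};

void range_table_init(struct range_table *t)
{
    pthread_mutex_init(&t->lock, NULL);
    for (int i = 0; i < MAX_CLIENTS; i++)
        t->held[i].start = t->held[i].end = 0;
}

/* Try to lock [start, end) for client `id`. The request may be granted only
 * partially: the return value is the end of the granted prefix [start, ret).
 * ret == start means nothing was granted and the caller should retry. */
size_t range_try_lock(struct range_table *t, int id, size_t start, size_t end)
{
    size_t granted_end = end;

    assert(id >= 0 && id < MAX_CLIENTS && start <= end);
    pthread_mutex_lock(&t->lock);
    for (int i = 0; i < MAX_CLIENTS; i++) {
        const struct range *r = &t->held[i];

        if (i == id || r->start == r->end)
            continue;                 /* skip ourselves and empty slots */
        if (r->start <= start && start < r->end) {
            granted_end = start;      /* our first offset is already taken */
            break;
        }
        if (r->start > start && r->start < granted_end)
            granted_end = r->start;   /* stop just before the other range */
    }
    if (granted_end > start) {
        t->held[id].start = start;
        t->held[id].end = granted_end;
    }
    pthread_mutex_unlock(&t->lock);
    return granted_end;
}

/* Release whatever range client `id` currently holds. */
void range_unlock(struct range_table *t, int id)
{
    assert(id >= 0 && id < MAX_CLIENTS);
    pthread_mutex_lock(&t->lock);
    t->held[id].start = t->held[id].end = 0;
    pthread_mutex_unlock(&t->lock);
}
```

With thread 1 holding 300-500, a request by thread 2 for 200-400 returns 300, so thread 2 can mix 200-300, unlock, and then ask for the rest. The mutex is held only while the small table is scanned, never while mixing.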
* Re: Re: dmix plugin
2003-02-18 21:07 ` Jaroslav Kysela
2003-02-19 10:20 ` Abramo Bagnara
@ 2003-02-19 10:33 ` Jaroslaw Sobierski
2003-02-19 11:08 ` Jaroslav Kysela
1 sibling, 1 reply; 57+ messages in thread
From: Jaroslaw Sobierski @ 2003-02-19 10:33 UTC (permalink / raw)
To: Jaroslav Kysela; +Cc: Abramo Bagnara, alsa-devel@lists.sourceforge.net

Quoting Jaroslav Kysela <perex@suse.cz>:
>
> I've implemented the whole transfer and mix loop in assembly and it works
> without any drastic impact on CPU usage. I tried to optimize the assembler
> part as much as I can, but if some assembler guru want to give a glance,
> I'll appreciate it. The function is named mix_areas1() in
> alsa-lib/src/pcm/pcm_dmix.c.
>

It seems to me it would make sense to code it for MMX (to use the
saturation it offers, for example). If you go for pure 386 there's little
to win. Did you look at the assembly generated by gcc when compiling with
optimizations? I usually make this a starting point when moving
time-critical code to assembly, and if it looks optimized enough I leave it
at that, unless I can use tricks not available to the compiler - like,
again, MMX.

I don't know how well gcc is optimized for Intel CPUs, but I remember that
you really had to work your ass off to beat inner loops optimized by Watcom
compilers (BTW I heard they're coming back with open source compilers :-).
Not to mention proprietary Intel compilers, which can take into account
things like word alignment for data and code, cache hit / miss situations,
branch prediction and all kinds of magical stuff.

I'll take a closer look at the code when I have more time though.

--------------
Fycio (J.Sobierski)
fycio@gucio.com

^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: Re: dmix plugin
2003-02-19 10:33 ` Jaroslaw Sobierski
@ 2003-02-19 11:08 ` Jaroslav Kysela
0 siblings, 0 replies; 57+ messages in thread
From: Jaroslav Kysela @ 2003-02-19 11:08 UTC (permalink / raw)
To: Jaroslaw Sobierski; +Cc: Abramo Bagnara, alsa-devel@lists.sourceforge.net

On Wed, 19 Feb 2003, Jaroslaw Sobierski wrote:

> Quoting Jaroslav Kysela <perex@suse.cz>:
> >
> > I've implemented the whole transfer and mix loop in assembly and it works
> > without any drastic impact on CPU usage. I tried to optimize the assembler
> > part as much as I can, but if some assembler guru want to give a glance,
> > I'll appreciate it. The function is named mix_areas1() in
> > alsa-lib/src/pcm/pcm_dmix.c.
> >
>
> It seems to me it would make sens to code it for mmx (to use the saturation
> it offers for example). If you go for pure 386 there's little to win.

Yes and no. I don't think that there will be enough need for the
saturations, so the saturation code path mostly takes 4 instructions
(two compare, two skipped conditional jumps).

> Did you look at the assembly generated by gcc when compiling with
> optimiazations? I usually make this a start point when moving time-critical

Yes, my code is based on the code from GCC.

> code to assembly, and if it looks optimized enough - I leave it at that,
> unless I can use tricks not available to the compiler - like, again, mmx.
>
> I don't know how well gcc is optimized for intels, but I remember that you
> really had to work your ass of to beat inner loops optimized by Watcomm
> compilers (BTW I heard they're coming back with open source compilers :-).
> Not to mention proprietary Intel compilers which can take into
> account things like word alignment for data and code, cache hit / miss
> situations, branch preditiction and all kinds of magical stuff.

Yes, of course. I've not claimed that I wrote the best code in the world
;-) But something we can start with.

						Jaroslav

-----
Jaroslav Kysela <perex@suse.cz>
Linux Kernel Sound Maintainer
ALSA Project, SuSE Labs

^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: Re: dmix plugin
@ 2003-02-17 22:28 Jaroslaw Sobierski
0 siblings, 0 replies; 57+ messages in thread
From: Jaroslaw Sobierski @ 2003-02-17 22:28 UTC (permalink / raw)
To: perex; +Cc: T.Motylewski, abramo.bagnara, alsa-devel
>> On Mon, 17 Feb 2003, Jaroslav Kysela wrote:
>>
>> > Note that your all nice ideas go to some blind alley. Who will silence the
>> > sum buffer? Driver silences only hardware buffer which will not be used
>> > for the calculation in your algorithm.
>>
>> Silencing is not time critical, if buffer is big enough it does not matter
>> whether is it done 1 ms or 100 ms after the card has played the data. Therefore
>> it may be done by a separate thread/process/kernel task without any
>> interference with other processes writing to the buffer.
>
>It is time critical for the dmix plugin, because other processes might
>write new samples to "empty" areas.
>
Clearing the sum buffer would be a task analogous, or I should probably say
reverse, to the saturation operation. You see, before you take the value in
the sum buffer and add your sample and so forth, you can check if the
destination sample in the DMA buffer is zero. If it is, you disregard the
value in the sum (it is now considered stale), overwrite it with your sample
and proceed to saturate it normally. If another thread has already written
something there - the final buffer will be non-zero, and you proceed as
discussed before. If another thread has written zeroes, or the result has
summed up to zero, it still doesn't matter, because then the sum buffer
would also have to contain a zero, so it is right to disregard its value.
And that's it. OK, some synchronization would be in order so that you don't
kill a sample just written by some other thread as in:
    A                            B
    check hw buff 0? yes
                                 check hw buff 0? yes
                                 write B sample to sum/hw
    write A sample to sum/hw
A re-read after the write does not solve the problem this time, because
thread B could (though it is very unlikely) have the same sample value.
But I'm sure we can come up with something for this.
That said, I still think it would be a better solution altogether to have
a buffer in an alsa-native not hardware-native format and have the driver
do the translation / saturation and the like. Yeah, I know that's not what
you want, I got it ;-).
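A minimal single-threaded sketch of the zero check described above, just to make the idea concrete. The names are made up for the example, and it deliberately ignores the A/B race shown in the interleaving: it only shows how a zero in the hardware buffer marks the sum as stale.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp a 32-bit running sum to the signed 16-bit sample range. */
int16_t clamp_s16(int32_t v)
{
    if (v > 0x7fff)
        return 0x7fff;
    if (v < -0x8000)
        return -0x8000;
    return (int16_t)v;
}

/* Mix one sample. `sum` is the 32-bit total for this position and `hw` is
 * the matching sample in the DMA buffer, which the driver silences after
 * playback. A zero in `hw` is taken to mean that `sum` is stale. */
void mix_sample_stale_check(int32_t *sum, int16_t *hw, int16_t sample)
{
    assert(sum != 0 && hw != 0);
    if (*hw == 0)
        *sum = sample;        /* silenced: overwrite the stale total */
    else
        *sum += sample;       /* live data: accumulate as usual */
    *hw = clamp_s16(*sum);
}
```

Starting from a stale total and a silenced hardware sample, the first write replaces the total and later writes accumulate on top of it.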
--------------
Fycio (J.Sobierski)
fycio@gucio.com
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: Re: dmix plugin
@ 2003-02-17 16:18 Jaroslaw Sobierski
0 siblings, 0 replies; 57+ messages in thread
From: Jaroslaw Sobierski @ 2003-02-17 16:18 UTC (permalink / raw)
To: T.Motylewski; +Cc: perex, abramo.bagnara, alsa-devel
>
>Well, but when adding a+b we have no idea that that overflow will be compensated
>by next very big negative sample. Also mixing signals which already fill 90% of
>dynamic range is not a good idea. My "fix" is heuristic - it works for
>occasional _small_ overflows like 0x4100+0x4000 -> 0x7fff is much better than
>0x8100.
>
>The idea of dmix as I understand it is that buffer is already in the native
>format for the sound card. So if sound card supports 24 bit, OK. But then
>people will start mixing 24 bit samples :-)
>
>> AFAIK most hardware does not mix by reducing volume before the sum. On the
>> contrary, it is usually summed "as is" to a wider register, and often even so
>
>And here our "wider register" is 16bit. That means end users should not expect
>too much if they mix full power signals on it.
>
>BTW. If you have uncorrelated signals, then to mix 4 signals it may be good
>enough to reduce the amplitude of them just factor 2, because power will drop
>factor 4. Occasionally there will be overruns, but 0x7fff limit will make it
>almost not hearable. Not a correct fix, but I can assure you that it works in
>standard cases :-)
That's a good point. As long as we're dealing with 2 or 3 channels we probably
can do with saturating. But we should consider adding a shift right by one
(after adding, before saturation) once we have 4 channels, by two at 8
channels, or something similar.
Otherwise we will start getting some ugly clipping artifacts. The problem is,
this can cause a (noticeable) sudden drop in volume when a "threshold" client
connects/disconnects. We could ramp, but that's a multiplication...
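A rough sketch of that threshold scheme, assuming the thresholds are exactly 4 and 8 clients as suggested above; the function names are made up. The scaling is written as a division by a power of two, which does the job of the shift but stays well defined for negative sums in C.

```c
#include <assert.h>
#include <stdint.h>

/* Headroom shift by client count: none below 4, one bit at 4, two at 8. */
unsigned headroom_shift(unsigned clients)
{
    if (clients >= 8)
        return 2;
    if (clients >= 4)
        return 1;
    return 0;
}

/* Sum all client samples in 32 bits, scale down, then saturate to 16 bits. */
int16_t mix_scaled(const int16_t *samples, unsigned clients)
{
    int32_t sum = 0;

    assert(samples != 0 || clients == 0);
    for (unsigned i = 0; i < clients; i++)
        sum += samples[i];

    /* Division by 2^shift instead of >>, which is implementation-defined
     * for negative values; note that it rounds toward zero. */
    sum /= (int32_t)1 << headroom_shift(clients);

    if (sum > 0x7fff)
        return 0x7fff;
    if (sum < -0x8000)
        return -0x8000;
    return (int16_t)sum;
}
```

Three clients at 0x3000 each still clip at 0x7fff, while four clients at the same level are halved to 0x6000. That jump at the fourth client is the sudden volume drop mentioned above.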
--------------
Fycio (J.Sobierski)
fycio@gucio.com
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: Re: dmix plugin
@ 2003-02-17 13:12 Jaroslaw Sobierski
2003-02-17 13:22 ` Jaroslav Kysela
2003-02-17 13:24 ` Jaroslav Kysela
0 siblings, 2 replies; 57+ messages in thread
From: Jaroslaw Sobierski @ 2003-02-17 13:12 UTC (permalink / raw)
To: abramo.bagnara; +Cc: perex, alsa-devel

Abramo Bagnara wrote:

>If we'd need to use an intermediate buffer and a mixing thread, the dmix
>approach lose our interest.
>
>A solution might be to have a shared parallel sw ring buffer where to
>store the exact value:
>
>	xadd(sw, *src);
>	do {
>		v = *sw;
>		if (v > 0x7fff)
>			s = 0x7fff;
>		else if (v < -0x8000)
>			s = -0x8000;
>		else
>			s = v;
>		*hw = s;
>	} while (unlikely(v != *sw));
>
>This should solve also the atomicity update.

Very true, and it is consistent with what Jaroslav Kysela wrote:

> My point was that all processes operates simultaneously and independently.
> So if one process updates area in the "sum" ring buffer, then it MUST
> transfer changed area (with saturation) to the DMA buffer. So there is no
> "once saturation" as you think. Anyway, the current implementation uses
> also saturation for all clients (processes) so the only drawback is the
> additional access to the "sum" ring buffer memory area.

So it seems like a good compromise to solve all our problems :-).

Still, don't we already *have* a feeding thread for the sound card? I mean
it doesn't just grab the memory buffer all by itself whenever it wants?
Excuse my ignorance on this topic, I'm only just starting with ALSA and I
have not yet had the time to go through the entire source code ;-).
I remember when I was writing a driver for an mpeg2 decoder card that I had
to create 2 threads, one for feeding video and one for audio. The FIFO
level was checked either by polling or via interrupt handlers, but I still
had control over what is transferred and when. I could let the card pull
the data via DMA using bus mastering, but I still knew what would be sent
and from where... Does the problem lie in the fact that it is actually a
plugin and has no control over the transfer? Maybe it would be worth
considering a callback for the plugin from the main alsa module to inform
it that a new piece of the DMA buffer must be "prepared", whatever that
could mean for a particular plugin. Anyway, just a thought.

--------------
Fycio (J.Sobierski)
fycio@gucio.com

^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: Re: dmix plugin
2003-02-17 13:12 Jaroslaw Sobierski
@ 2003-02-17 13:22 ` Jaroslav Kysela
2003-02-17 18:15 ` Paul Davis
2003-02-17 13:24 ` Jaroslav Kysela
1 sibling, 1 reply; 57+ messages in thread
From: Jaroslav Kysela @ 2003-02-17 13:22 UTC (permalink / raw)
To: Jaroslaw Sobierski
Cc: abramo.bagnara@libero.it, alsa-devel@lists.sourceforge.net
On Mon, 17 Feb 2003, Jaroslaw Sobierski wrote:
> Still, don't we already *have* a feeding thread for the sound card? I mean
> it doesn't just grab the memory buffer all by itself whenever it wants?
Nope. The idea for the dmix plugin is that we share the DMA ring buffer
with more threads (processes). There is no "master" thread which operates
exclusively with the DMA buffer.
Jaroslav
-----
Jaroslav Kysela <perex@suse.cz>
Linux Kernel Sound Maintainer
ALSA Project, SuSE Labs
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: Re: dmix plugin
2003-02-17 13:22 ` Jaroslav Kysela
@ 2003-02-17 18:15 ` Paul Davis
2003-02-18 22:36 ` Abramo Bagnara
0 siblings, 1 reply; 57+ messages in thread
From: Paul Davis @ 2003-02-17 18:15 UTC (permalink / raw)
To: Jaroslav Kysela
Cc: Jaroslaw Sobierski, abramo.bagnara@libero.it, alsa-devel@lists.sourceforge.net
>> Still, don't we already *have* a feeding thread for the sound card? I mean
>> it doesn't just grab the memory buffer all by itself whenever it wants?
>
>Nope. The idea for the dmix plugin is that we share the DMA ring buffer
>with more threads (processes). There is no "master" thread which operates
>exclusively with the DMA buffer.
that would be called "JACK", right ?
--p
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: Re: dmix plugin
2003-02-17 18:15 ` Paul Davis
@ 2003-02-18 22:36 ` Abramo Bagnara
0 siblings, 0 replies; 57+ messages in thread
From: Abramo Bagnara @ 2003-02-18 22:36 UTC (permalink / raw)
To: alsa-devel
Paul Davis wrote:
>
> >> Still, don't we already *have* a feeding thread for the sound card? I mean
> >> it doesn't just grab the memory buffer all by itself whenever it wants?
> >
> >Nope. The idea for the dmix plugin is that we share the DMA ring buffer
> >with more threads (processes). There is no "master" thread which operates
> >exclusively with the DMA buffer.
>
> that would be called "JACK", right ?
Not necessarily, sorry. I've just explained in many ways that IMO the
callback-only model choice will doom Jack to remain in a niche.
And I say this with grief: Jack is the nicest acronym I've ever heard
;-)
--
Abramo Bagnara mailto:abramo.bagnara@libero.it
Opera Unica Phone: +39.546.656023
Via Emilia Interna, 140
48014 Castel Bolognese (RA) - Italy
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: Re: dmix plugin
2003-02-17 13:12 Jaroslaw Sobierski
2003-02-17 13:22 ` Jaroslav Kysela
@ 2003-02-17 13:24 ` Jaroslav Kysela
1 sibling, 0 replies; 57+ messages in thread
From: Jaroslav Kysela @ 2003-02-17 13:24 UTC (permalink / raw)
To: Jaroslaw Sobierski
Cc: abramo.bagnara@libero.it, alsa-devel@lists.sourceforge.net
On Mon, 17 Feb 2003, Jaroslaw Sobierski wrote:
> Does the problem lie in the fact that it is actually a plugin and has
> no control of the transfer? Maybe it would be worth considering a callback
> for the plugin from the main alsa module to inform it that a new piece
> of the DMA buffer must be "prepared" whatever that could mean for a
> particular plugin. Anyway, just a thought.
We use the poll and slave timer source which generates ticks when an
interrupt from the PCM hardware arrives. It's sufficient for our purpose.
Jaroslav
-----
Jaroslav Kysela <perex@suse.cz>
Linux Kernel Sound Maintainer
ALSA Project, SuSE Labs
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: Re: dmix plugin
@ 2003-02-17 11:18 Jaroslaw Sobierski
2003-02-17 11:53 ` Jaroslav Kysela
0 siblings, 1 reply; 57+ messages in thread
From: Jaroslaw Sobierski @ 2003-02-17 11:18 UTC (permalink / raw)
To: perex; +Cc: alsa-devel
>> In our case, such "solution" would have to affect the whole buffer, meaning
>> we would need 3 (or better yet 4) bytes per sample, which would eventually get
>> reduced back to 2 bytes on the way out to the sound card. This seems neither
>> elegant nor memory efficient but would work, and also solves the "a)" problem
>> because we don't need to saturate so an atomic add can be performed on each
>> sample.
>
>Yes, this solution is good. I've though about it, too. Unfortunately, it
>adds additional transfers including saturation from the "sum" ring buffer
>to the DMA buffer of hardware.
Hmmm... Not exactly. This is not a problem. First of all: it is way
better to saturate once (i.e. just before the transfer) since this is
a costly operation involving a conditional jump (unless you optimize for
mmx) than do it for every channel individually. If you're mixing 4
channels you do it once, not 4 times. Just because you need to store the
result in a different buffer, rather than putting it in its original
place seems hardly a big difference (except for cache hits maybe).
Also, if you insist on sparing memory (the buffer is not *that*
big is it?) you can lay it out as two separate (ring) buffers, one
holding upper words, the other holding lower words. Now instead of
shifting the samples right n-bits before adding to the buffer, you
shift them left 16-n. In effect you will get a buffer (the upper part)
which can be directly sent to the audio hw, and which was summed and
divided without losing precision. The drawback is you lose the atomic
add. If you don't shift, you can still saturate into the "upper" buffer
and DMA from there.
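The shift-left layout can be sketched in C as follows. This is only a minimal sketch under assumptions that are not in the thread: one 32-bit accumulator per sample stands in for the paired upper/lower word buffers, n is fixed at 2 (four clients) through the hypothetical constant `MIX_SHIFT`, and the names `mix_add` and `mix_output` are made up for illustration. The add is a plain, non-atomic one, which matches the drawback mentioned above.

```c
#include <stdint.h>

/* n: the mix is divided by 2^n, so at most 2^n = 4 clients (assumed value). */
#define MIX_SHIFT 2

/* Add one 16-bit sample into the 32-bit accumulator, pre-scaled by
 * 2^(16-n). Multiplying avoids left-shifting a negative value.
 * The caller must not add more than 2^n samples, or the sum may overflow. */
void mix_add(int32_t *acc, int16_t sample)
{
    *acc += (int32_t)sample * (1 << (16 - MIX_SHIFT));
}

/* Return the upper 16 bits of the accumulator: the sum of the samples
 * divided by 2^n, rounded toward minus infinity. Written as an explicit
 * floor division so it does not rely on arithmetic right shift. */
int16_t mix_output(int32_t acc)
{
    int32_t q = acc / 65536;

    if (acc % 65536 < 0)
        q--;
    return (int16_t)q;
}
```

With four clients at full scale the accumulator still fits in 32 bits, so no saturation step is needed before the upper word goes to the hardware.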
--------------
Fycio (J.Sobierski)
fycio@gucio.com
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: Re: dmix plugin
2003-02-17 11:18 Jaroslaw Sobierski
@ 2003-02-17 11:53 ` Jaroslav Kysela
0 siblings, 0 replies; 57+ messages in thread
From: Jaroslav Kysela @ 2003-02-17 11:53 UTC (permalink / raw)
To: Jaroslaw Sobierski; +Cc: alsa-devel@lists.sourceforge.net
On Mon, 17 Feb 2003, Jaroslaw Sobierski wrote:
> >> In our case, such "solution" would have to affect the whole buffer, meaning
> >> we would need 3 (or better yet 4) bytes per sample, which would eventually get
> >> reduced back to 2 bytes on the way out to the sound card. This seems neither
> >> elegant nor memory efficient but would work, and also solves the "a)" problem
> >> because we don't need to saturate so an atomic add can be performed on each
> >> sample.
> >
> >Yes, this solution is good. I've thought about it, too. Unfortunately, it
> >adds additional transfers including saturation from the "sum" ring buffer
> >to the DMA buffer of hardware.
>
> Hmmm... Not exactly. This is not a problem. First of all: it is way
> better to saturate once (i.e. just before the transfer) since this is
> a costly operation involving a conditional jump (unless you optimize for
> mmx) than do it for every channel individually. If you're mixing 4
> channels you do it once, not 4 times. Just because you need to store the
> result in a different buffer, rather than putting it in its original
> place seems hardly a big difference (except for cache hits maybe).
>
> Also, if you insist on sparing memory (the buffer is not *that*
> big is it?) you can lay it out as two separate (ring) buffers, one
> holding upper words, the other holding lower words. Now instead of
> shifting the samples right n-bits before adding to the buffer, you
> shift them left 16-n. In effect you will get a buffer (the upper part)
> which can be directly sent to the audio hw, and which was summed and
> divided without losing precision. The drawback is you lose the atomic
> add. If you don't shift, you can still saturate into the "upper" buffer
> and DMA from there.
My point was that all processes operate simultaneously and independently.
So if one process updates an area in the "sum" ring buffer, then it MUST
transfer the changed area (with saturation) to the DMA buffer. So there is no
"once saturation" as you think. Anyway, the current implementation uses
also saturation for all clients (processes) so the only drawback is the
additional access to the "sum" ring buffer memory area.
Jaroslav
-----
Jaroslav Kysela <perex@suse.cz>
Linux Kernel Sound Maintainer
ALSA Project, SuSE Labs
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: dmix plugin
@ 2003-02-17 10:04 Jaroslaw Sobierski
2003-02-17 10:15 ` Jaroslav Kysela
2003-02-17 10:32 ` tomasz motylewski
0 siblings, 2 replies; 57+ messages in thread
From: Jaroslaw Sobierski @ 2003-02-17 10:04 UTC (permalink / raw)
To: alsa-devel
> > b) sum overflow: we can lower volume of samples before sum; I think that
> > hardware works in this way, too
>
> Here I don't understand you. Suppose we have 3 samples to mix:
> a = 0x7500
> b = 0x7400
> c = 0x8300
>
> If you do a + b + c (in this order) you obtain:
> d=0
> d += a -> 7500
> d += b -> 0xe900 -> 0x7fff
> d += c -> 0x02ff
>
> while the correct result is 0x6c00. You see?
AFAIK most hardware does not mix by reducing volume before the sum. On the
contrary, it is usually summed "as is" to a wider register, and often even so
used. For example, a sound card able to mix 16 channels of 16 bits would have
a 16+4 bits or 24 bit register where the channels are added and no saturation
can occur. In good hardware this would not even be downscaled back to 16 bits,
but a 24 bit D/A converter would be used instead. In older times (Gravis Ultra
Sound and I think older SB AWE) this could easily be spotted by the difference
in supported "hardware" channels and "software" channels. A card with a 32 bit
sum register and 24 bit DA could support (as above) 16 hardware channels and
for example 64 software channels (mixed together in quadruplets to the 16 hw).
In our case, such "solution" would have to affect the whole buffer, meaning
we would need 3 (or better yet 4) bytes per sample, which would eventually get
reduced back to 2 bytes on the way out to the sound card. This seems neither
elegant nor memory efficient but would work, and also solves the "a)" problem
because we don't need to saturate so an atomic add can be performed on each
sample.
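The quoted three-sample example can be worked through in C to show why the wider sum matters. This is an illustrative sketch only; `saturate16`, `mix_saturate_each` and `mix_saturate_once` are hypothetical names, and 0x8300 is taken as the signed 16-bit value -32000.

```c
#include <stddef.h>
#include <stdint.h>

/* Clamp a 32-bit value to the signed 16-bit range. */
static int16_t saturate16(int32_t v)
{
    if (v > INT16_MAX)
        return INT16_MAX;
    if (v < INT16_MIN)
        return INT16_MIN;
    return (int16_t)v;
}

/* Saturate after every addition, as happens when the running sum
 * lives in a 16-bit buffer. */
int16_t mix_saturate_each(const int16_t *s, size_t n)
{
    int32_t d = 0;

    for (size_t i = 0; i < n; i++)
        d = saturate16(d + s[i]);
    return (int16_t)d;
}

/* Keep the exact sum in a wider accumulator and saturate only once,
 * on the way out to the 16-bit hardware format. */
int16_t mix_saturate_once(const int16_t *s, size_t n)
{
    int32_t d = 0;

    for (size_t i = 0; i < n; i++)
        d += s[i];
    return saturate16(d);
}
```

For the samples 0x7500, 0x7400 and -32000 the first function gives 0x02ff and the second gives 0x6c00, matching the numbers in the quoted example.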
--------------
Fycio (J.Sobierski)
fycio@gucio.com
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: Re: dmix plugin
2003-02-17 10:04 Jaroslaw Sobierski
@ 2003-02-17 10:15 ` Jaroslav Kysela
2003-02-17 12:15 ` Abramo Bagnara
2003-02-17 10:32 ` tomasz motylewski
1 sibling, 1 reply; 57+ messages in thread
From: Jaroslav Kysela @ 2003-02-17 10:15 UTC (permalink / raw)
To: Jaroslaw Sobierski; +Cc: alsa-devel@lists.sourceforge.net
On Mon, 17 Feb 2003, Jaroslaw Sobierski wrote:
> > > b) sum overflow: we can lower volume of samples before sum; I think that
> > > hardware works in this way, too
> >
> > Here I don't understand you. Suppose we have 3 samples to mix:
> > a = 0x7500
> > b = 0x7400
> > c = 0x8300
> >
> > If you do a + b + c (in this order) you obtain:
> > d=0
> > d += a -> 7500
> > d += b -> 0xe900 -> 0x7fff
> > d += c -> 0x02ff
> >
> > while the correct result is 0x6c00. You see?
>
> AFAIK most hardware does not mix by reducing volume before the sum. On the
> contrary, it is usually summed "as is" to a wider register, and often even so
> used. For example, a sound card able to mix 16 channels of 16 bits would have
> a 16+4 bits or 24 bit register where the channels are added and no saturation
> can occur. In good hardware this would not even be downscaled back to 16 bits,
> but a 24 bit D/A converter would be used instead. In older times (Gravis Ultra
> Sound and I think older SB AWE) this could easily be spotted by the difference
> in supported "hardware" channels and "software" channels. A card with a 32 bit
> sum register and 24 bit DA could support (as above) 16 hardware channels and
> for example 64 software channels (mixed together in quadruplets to the 16 hw).
>
> In our case, such "solution" would have to affect the whole buffer, meaning
> we would need 3 (or better yet 4) bytes per sample, which would eventually get
> reduced back to 2 bytes on the way out to the sound card. This seems neither
> elegant nor memory efficient but would work, and also solves the "a)" problem
> because we don't need to saturate so an atomic add can be performed on each
> sample.
Yes, this solution is good. I've thought about it, too. Unfortunately, it
adds additional transfers including saturation from the "sum" ring buffer
to the DMA buffer of hardware.
Jaroslav
-----
Jaroslav Kysela <perex@suse.cz>
Linux Kernel Sound Maintainer
ALSA Project, SuSE Labs
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: Re: dmix plugin
2003-02-17 10:15 ` Jaroslav Kysela
@ 2003-02-17 12:15 ` Abramo Bagnara
2003-02-17 13:12 ` Jaroslav Kysela
0 siblings, 1 reply; 57+ messages in thread
From: Abramo Bagnara @ 2003-02-17 12:15 UTC (permalink / raw)
To: Jaroslav Kysela; +Cc: Jaroslaw Sobierski, alsa-devel@lists.sourceforge.net
Jaroslav Kysela wrote:
>
> On Mon, 17 Feb 2003, Jaroslaw Sobierski wrote:
> >
> > > > b) sum overflow: we can lower volume of samples before sum; I think that
> > > > hardware works in this way, too
> > >
> > > Here I don't understand you. Suppose we have 3 samples to mix:
> > > a = 0x7500
> > > b = 0x7400
> > > c = 0x8300
> > >
> > > If you do a + b + c (in this order) you obtain:
> > > d=0
> > > d += a -> 7500
> > > d += b -> 0xe900 -> 0x7fff
> > > d += c -> 0x02ff
> > >
> > > while the correct result is 0x6c00. You see?
> >
> > AFAIK most hardware does not mix by reducing volume before the sum. On the
> > contrary, it is usually summed "as is" to a wider register, and often even so
> > used. For example, a sound card able to mix 16 channels of 16 bits would have
> > a 16+4 bits or 24 bit register where the channels are added and no saturation
> > can occur. In good hardware this would not even be downscaled back to 16 bits,
> > but a 24 bit D/A converter would be used instead. In older times (Gravis Ultra
> > Sound and I think older SB AWE) this could easily be spotted by the difference
> > in supported "hardware" channels and "software" channels. A card with a 32 bit
> > sum register and 24 bit DA could support (as above) 16 hardware channels and
> > for example 64 software channels (mixed together in quadruplets to the 16 hw).
> >
> > In our case, such "solution" would have to affect the whole buffer, meaning
> > we would need 3 (or better yet 4) bytes per sample, which would eventually get
> > reduced back to 2 bytes on the way out to the sound card. This seems neither
> > elegant nor memory efficient but would work, and also solves the "a)" problem
> > because we don't need to saturate so an atomic add can be performed on each
> > sample.
>
> Yes, this solution is good. I've thought about it, too. Unfortunately, it
> adds additional transfers including saturation from the "sum" ring buffer
> to the DMA buffer of hardware.
I remind you that the main point of dmix existence is the "direct"
part.
If we'd need to use an intermediate buffer and a mixing thread, the dmix
approach loses our interest.
A solution might be to have a shared parallel sw ring buffer where to
store the exact value:
    xadd(sw, *src);
    do {
        v = *sw;
        if (v > 0x7fff)
            s = 0x7fff;
        else if (v < -0x8000)
            s = -0x8000;
        else
            s = v;
        *hw = s;
    } while (unlikely(v != *sw));
This should solve also the atomicity update.
Comments?
--
Abramo Bagnara mailto:abramo.bagnara@libero.it
Opera Unica Phone: +39.546.656023
Via Emilia Interna, 140
48014 Castel Bolognese (RA) - Italy
^ permalink raw reply [flat|nested] 57+ messages in thread
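For readers who want to try the retry approach from the message above, here is a minimal sketch in portable C. It rests on assumptions that are not in the thread: C11 `<stdatomic.h>` stands in for the `xadd()` primitive, the hardware sample is modelled as an `_Atomic int16_t` rather than a real DMA buffer, and `saturate16` and `mix_sample` are hypothetical names. It does not address who clears the sum buffer after playback.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Clamp a 32-bit value to the signed 16-bit range. */
static int16_t saturate16(int32_t v)
{
    if (v > INT16_MAX)
        return INT16_MAX;
    if (v < INT16_MIN)
        return INT16_MIN;
    return (int16_t)v;
}

/* Mix one client sample: add it atomically to the exact 32-bit sum,
 * then publish the saturated sum to the 16-bit hardware sample.
 * If another client changed the sum in the meantime, the value just
 * written may be stale, so the read/saturate/write step is repeated
 * until the sum is seen unchanged after the write. */
void mix_sample(_Atomic int32_t *sw, _Atomic int16_t *hw, int16_t src)
{
    int32_t v;

    atomic_fetch_add(sw, (int32_t)src);
    do {
        v = atomic_load(sw);
        atomic_store(hw, saturate16(v));
    } while (v != atomic_load(sw));
}
```

In the uncontended case the loop body runs once, so the extra cost over a plain add-and-store is a single re-read of the sum.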
* Re: Re: dmix plugin
2003-02-17 12:15 ` Abramo Bagnara
@ 2003-02-17 13:12 ` Jaroslav Kysela
2003-02-17 13:29 ` Abramo Bagnara
0 siblings, 1 reply; 57+ messages in thread
From: Jaroslav Kysela @ 2003-02-17 13:12 UTC (permalink / raw)
To: Abramo Bagnara; +Cc: Jaroslaw Sobierski, alsa-devel@lists.sourceforge.net
On Mon, 17 Feb 2003, Abramo Bagnara wrote:
> Jaroslav Kysela wrote:
> >
> > On Mon, 17 Feb 2003, Jaroslaw Sobierski wrote:
> > >
> > > > > b) sum overflow: we can lower volume of samples before sum; I think that
> > > > > hardware works in this way, too
> > > >
> > > > Here I don't understand you. Suppose we have 3 samples to mix:
> > > > a = 0x7500
> > > > b = 0x7400
> > > > c = 0x8300
> > > >
> > > > If you do a + b + c (in this order) you obtain:
> > > > d=0
> > > > d += a -> 7500
> > > > d += b -> 0xe900 -> 0x7fff
> > > > d += c -> 0x02ff
> > > >
> > > > while the correct result is 0x6c00. You see?
> > >
> > > AFAIK most hardware does not mix by reducing volume before the sum. On the
> > > contrary, it is usually summed "as is" to a wider register, and often even so
> > > used. For example, a sound card able to mix 16 channels of 16 bits would have
> > > a 16+4 bits or 24 bit register where the channels are added and no saturation
> > > can occur. In good hardware this would not even be downscaled back to 16 bits,
> > > but a 24 bit D/A converter would be used instead. In older times (Gravis Ultra
> > > Sound and I think older SB AWE) this could easily be spotted by the difference
> > > in supported "hardware" channels and "software" channels. A card with a 32 bit
> > > sum register and 24 bit DA could support (as above) 16 hardware channels and
> > > for example 64 software channels (mixed together in quadruplets to the 16 hw).
> > >
> > > In our case, such "solution" would have to affect the whole buffer, meaning
> > > we would need 3 (or better yet 4) bytes per sample, which would eventually get
> > > reduced back to 2 bytes on the way out to the sound card. This seems neither
> > > elegant nor memory efficient but would work, and also solves the "a)" problem
> > > because we don't need to saturate so an atomic add can be performed on each
> > > sample.
> >
> > Yes, this solution is good. I've thought about it, too. Unfortunately, it
> > adds additional transfers including saturation from the "sum" ring buffer
> > to the DMA buffer of hardware.
>
> I remind you that the main point of dmix existence is the "direct"
> part.
>
> If we'd need to use an intermediate buffer and a mixing thread, the dmix
> approach loses our interest.
>
> A solution might be to have a shared parallel sw ring buffer where to
> store the exact value:
>
>     xadd(sw, *src);
>     do {
>         v = *sw;
>         if (v > 0x7fff)
>             s = 0x7fff;
>         else if (v < -0x8000)
>             s = -0x8000;
>         else
>             s = v;
>         *hw = s;
>     } while (unlikely(v != *sw));
>
> This should solve also the atomicity update.
>
> Comments?
We probably talk about the same thing, but in different words. I also don't
think that atomicity is a problem when xadd() is atomic (and it is atomic
for i386).
Then you need to do the saturation and store to the hardware ring buffer,
but if this operation is after xadd() then we don't care about atomicity,
because we are 100% sure that we have a valid result.
Algorithm:
    while (count-- > 0) {
        atomic_xadd(sum_ring_buffer[idx], local_buffer[idx]);
        hw_ring_buffer[idx] = saturate(sum_ring_buffer[idx]);
        idx++;
    }
Jaroslav
-----
Jaroslav Kysela <perex@suse.cz>
Linux Kernel Sound Maintainer
ALSA Project, SuSE Labs
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: Re: dmix plugin
2003-02-17 13:12 ` Jaroslav Kysela
@ 2003-02-17 13:29 ` Abramo Bagnara
2003-02-17 15:00 ` Jaroslav Kysela
0 siblings, 1 reply; 57+ messages in thread
From: Abramo Bagnara @ 2003-02-17 13:29 UTC (permalink / raw)
To: Jaroslav Kysela; +Cc: Jaroslaw Sobierski, alsa-devel@lists.sourceforge.net
Jaroslav Kysela wrote:
>
> On Mon, 17 Feb 2003, Abramo Bagnara wrote:
>
> > Jaroslav Kysela wrote:
> > >
> > > On Mon, 17 Feb 2003, Jaroslaw Sobierski wrote:
> > > >
> > > > > > b) sum overflow: we can lower volume of samples before sum; I think that
> > > > > > hardware works in this way, too
> > > > >
> > > > > Here I don't understand you. Suppose we have 3 samples to mix:
> > > > > a = 0x7500
> > > > > b = 0x7400
> > > > > c = 0x8300
> > > > >
> > > > > If you do a + b + c (in this order) you obtain:
> > > > > d=0
> > > > > d += a -> 7500
> > > > > d += b -> 0xe900 -> 0x7fff
> > > > > d += c -> 0x02ff
> > > > >
> > > > > while the correct result is 0x6c00. You see?
> > > >
> > > > AFAIK most hardware does not mix by reducing volume before the sum. On the
> > > > contrary, it is usually summed "as is" to a wider register, and often even so
> > > > used. For example, a sound card able to mix 16 channels of 16 bits would have
> > > > a 16+4 bits or 24 bit register where the channels are added and no saturation
> > > > can occur. In good hardware this would not even be downscaled back to 16 bits,
> > > > but a 24 bit D/A converter would be used instead. In older times (Gravis Ultra
> > > > Sound and I think older SB AWE) this could easily be spotted by the difference
> > > > in supported "hardware" channels and "software" channels. A card with a 32 bit
> > > > sum register and 24 bit DA could support (as above) 16 hardware channels and
> > > > for example 64 software channels (mixed together in quadruplets to the 16 hw).
> > > >
> > > > In our case, such "solution" would have to affect the whole buffer, meaning
> > > > we would need 3 (or better yet 4) bytes per sample, which would eventually get
> > > > reduced back to 2 bytes on the way out to the sound card. This seems neither
> > > > elegant nor memory efficient but would work, and also solves the "a)" problem
> > > > because we don't need to saturate so an atomic add can be performed on each
> > > > sample.
> > >
> > > Yes, this solution is good. I've thought about it, too. Unfortunately, it
> > > adds additional transfers including saturation from the "sum" ring buffer
> > > to the DMA buffer of hardware.
> >
> > I remind you that the main point of dmix existence is the "direct"
> > part.
> >
> > If we'd need to use an intermediate buffer and a mixing thread, the dmix
> > approach loses our interest.
> >
> > A solution might be to have a shared parallel sw ring buffer where to
> > store the exact value:
> >
> >     xadd(sw, *src);
> >     do {
> >         v = *sw;
> >         if (v > 0x7fff)
> >             s = 0x7fff;
> >         else if (v < -0x8000)
> >             s = -0x8000;
> >         else
> >             s = v;
> >         *hw = s;
> >     } while (unlikely(v != *sw));
> >
> > This should solve also the atomicity update.
> >
> > Comments?
>
> We probably talk about the same thing, but in different words. I also don't
> think that atomicity is a problem when xadd() is atomic (and it is atomic
> for i386).
>
> Then you need to do the saturation and store to the hardware ring buffer,
> but if this operation is after xadd() then we don't care about atomicity,
> because we are 100% sure that we have a valid result.
>
> Algorithm:
>
>     while (count-- > 0) {
>         atomic_xadd(sum_ring_buffer[idx], local_buffer[idx]);
>         hw_ring_buffer[idx] = saturate(sum_ring_buffer[idx]);
>         idx++;
>     }
You're wrong: xadd is atomic but xadd/read/saturation/write is not.
Without the loop I've added you risk writing an *old* value to
hw_ring_buffer:
A:              B:
xadd
read
                xadd
                read
                saturate
                write
saturate
write
--
Abramo Bagnara mailto:abramo.bagnara@libero.it
Opera Unica Phone: +39.546.656023
Via Emilia Interna, 140
48014 Castel Bolognese (RA) - Italy
^ permalink raw reply [flat|nested] 57+ messages in thread
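The interleaving in the diagram can be replayed deterministically in plain C to see the stale write. This is a single-threaded illustration, not real concurrent code: the steps of clients A and B are simply written out in the problematic order, and `saturate16`, `replay_without_retry` and `replay_with_retry` are hypothetical names.

```c
#include <stdint.h>

/* Clamp a 32-bit value to the signed 16-bit range. */
static int16_t saturate16(int32_t v)
{
    if (v > INT16_MAX)
        return INT16_MAX;
    if (v < INT16_MIN)
        return INT16_MIN;
    return (int16_t)v;
}

/* Replay the interleaving with no retry: A's late write overwrites
 * B's correct value with a sum that does not include B's sample. */
int16_t replay_without_retry(int16_t a, int16_t b)
{
    int32_t sum = 0, va, vb;
    int16_t hw;

    sum += a;                /* A: xadd */
    va = sum;                /* A: read */
    sum += b;                /* B: xadd */
    vb = sum;                /* B: read */
    hw = saturate16(vb);     /* B: saturate, write */
    hw = saturate16(va);     /* A: saturate, write (stale) */
    return hw;
}

/* Same interleaving, but A re-reads the sum after writing and
 * repeats the read/saturate/write step when it has changed. */
int16_t replay_with_retry(int16_t a, int16_t b)
{
    int32_t sum = 0, va, vb;
    int16_t hw;

    sum += a;                /* A: xadd */
    va = sum;                /* A: read */
    sum += b;                /* B: xadd */
    vb = sum;                /* B: read */
    hw = saturate16(vb);     /* B: saturate, write; re-read sees no change */
    hw = saturate16(va);     /* A: saturate, write (stale) */
    while (va != sum) {      /* A: re-read detects B's update */
        va = sum;
        hw = saturate16(va);
    }
    return hw;
}
```

With a = 1000 and b = 2000 the first replay leaves 1000 in the hardware sample, while the second ends with the correct 3000.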
* Re: Re: dmix plugin
2003-02-17 13:29 ` Abramo Bagnara
@ 2003-02-17 15:00 ` Jaroslav Kysela
2003-02-17 15:21 ` Abramo Bagnara
0 siblings, 1 reply; 57+ messages in thread
From: Jaroslav Kysela @ 2003-02-17 15:00 UTC (permalink / raw)
To: Abramo Bagnara; +Cc: Jaroslaw Sobierski, alsa-devel@lists.sourceforge.net
On Mon, 17 Feb 2003, Abramo Bagnara wrote:
> You're wrong: xadd is atomic but xadd/read/saturation/write is not.
>
> Without the loop I've added you risk writing an *old* value to
> hw_ring_buffer:
>
> A:              B:
> xadd
> read
>                 xadd
>                 read
>                 saturate
>                 write
> saturate
> write
I see, the read/saturate/write must be atomic, too. In this case, it would
be better to use a global (or a set of) mutex(es) to lock the hardware
ring buffer. The futexes are nice.
Jaroslav
-----
Jaroslav Kysela <perex@suse.cz>
Linux Kernel Sound Maintainer
ALSA Project, SuSE Labs
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: Re: dmix plugin
2003-02-17 15:00 ` Jaroslav Kysela
@ 2003-02-17 15:21 ` Abramo Bagnara
0 siblings, 0 replies; 57+ messages in thread
From: Abramo Bagnara @ 2003-02-17 15:21 UTC (permalink / raw)
To: Jaroslav Kysela; +Cc: Jaroslaw Sobierski, alsa-devel@lists.sourceforge.net
Jaroslav Kysela wrote:
>
> On Mon, 17 Feb 2003, Abramo Bagnara wrote:
>
> > You're wrong: xadd is atomic but xadd/read/saturation/write is not.
> >
> > Without the loop I've added you risk writing an *old* value to
> > hw_ring_buffer:
> >
> > A:              B:
> > xadd
> > read
> >                 xadd
> >                 read
> >                 saturate
> >                 write
> > saturate
> > write
>
> I see, the read/saturate/write must be atomic, too. In this case, it would
> be better to use a global (or a set of) mutex(es) to lock the hardware
> ring buffer. The futexes are nice.
They are nice indeed, but definitely not the right solution here.
Although I don't know if it's the absolute best solution, the 'retry'
approach I've proposed is far better and much more efficient.
--
Abramo Bagnara mailto:abramo.bagnara@libero.it
Opera Unica Phone: +39.546.656023
Via Emilia Interna, 140
48014 Castel Bolognese (RA) - Italy
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: Re: dmix plugin
2003-02-17 10:04 Jaroslaw Sobierski
2003-02-17 10:15 ` Jaroslav Kysela
@ 2003-02-17 10:32 ` tomasz motylewski
1 sibling, 0 replies; 57+ messages in thread
From: tomasz motylewski @ 2003-02-17 10:32 UTC (permalink / raw)
To: Jaroslaw Sobierski; +Cc: alsa-devel
On Mon, 17 Feb 2003, Jaroslaw Sobierski wrote:
> > Here I don't understand you. Suppose we have 3 samples to mix:
> > a = 0x7500
> > b = 0x7400
> > c = 0x8300
> >
> > If you do a + b + c (in this order) you obtain:
> > d=0
> > d += a -> 7500
> > d += b -> 0xe900 -> 0x7fff
> > d += c -> 0x02ff
> >
> > while the correct result is 0x6c00. You see?
Well, but when adding a+b we have no idea that the overflow will be
compensated by the next very big negative sample. Also, mixing signals which
already fill 90% of the dynamic range is not a good idea. My "fix" is
heuristic: it works for occasional _small_ overflows, e.g.
0x4100+0x4000 -> 0x7fff is much better than 0x8100.
The idea of dmix as I understand it is that the buffer is already in the
native format for the sound card. So if the sound card supports 24 bit, OK.
But then people will start mixing 24 bit samples :-)
> AFAIK most hardware does not mix by reducing volume before the sum. On the
> contrary, it is usually summed "as is" to a wider register, and often even so
And here our "wider register" is 16bit. That means end users should not
expect too much if they mix full power signals on it.
BTW. If you have uncorrelated signals, then to mix 4 signals it may be good
enough to reduce their amplitude by just a factor of 2, because power will
drop by a factor of 4. Occasionally there will be overruns, but the 0x7fff
limit will make them almost inaudible.
Not a correct fix, but I can assure you that it works in standard cases :-)
Best regards,
--
Tomasz Motylewski
BFAD GmbH & Co. KG
^ permalink raw reply [flat|nested] 57+ messages in thread
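The heuristic described in the message above (attenuate each client, then add with a clamp directly in the 16-bit buffer) might look like this in C. This is a rough sketch under assumptions: the function name `mix_attenuated` and its `shift` parameter are hypothetical, attenuation is done by integer division, and nothing here is atomic.

```c
#include <stdint.h>

/* Mix one more client sample into a 16-bit slot: divide the sample by
 * 2^shift, add it to the current value in a wider temporary, and clamp
 * the result to the signed 16-bit range. */
int16_t mix_attenuated(int16_t current, int16_t sample, unsigned shift)
{
    int32_t sum = (int32_t)current + (int32_t)sample / (1 << shift);

    if (sum > INT16_MAX)
        return INT16_MAX;
    if (sum < INT16_MIN)
        return INT16_MIN;
    return (int16_t)sum;
}
```

With `shift` set to 1, four clients at half of full scale still fit without clipping, and a small overflow such as 0x4100 + 0x4000 with `shift` set to 0 is clamped to 0x7fff instead of wrapping to a negative value.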