* can I make this work… (Foundation for accessibility project)
@ 2014-11-18  6:48 Eric S. Johansson
  2014-11-18 13:50 ` Paolo Bonzini
  0 siblings, 1 reply; 13+ messages in thread
From: Eric S. Johansson @ 2014-11-18  6:48 UTC (permalink / raw)
  To: kvm

This is a rather different use case than what you've been thinking of
for KVM. It could mean a significant improvement in the quality of life
of disabled programmers like myself. It's difficult to convey what it's
like to try to use computers with speech recognition for something other
than writing, so bear with me when I say something is real but I can't
quite prove it yet. Also, please take it as read that the only really
usable speech recognition environment out there is NaturallySpeaking,
with Google close behind in terms of accuracy but not even on the same
planet in its ability to be extended for speech-enabled applications.

I'm trying to figure out ways of making it possible to drive Linux from
Windows speech recognition (NaturallySpeaking). The goal is a system
where Windows runs in a virtual machine (Linux host), audio is passed
through from a USB headset to the Windows environment, and the output
of the recognition engine is piped, by some magic, back to the Linux
host.

The hardest part of all of this, without question, is getting clean,
uninterrupted audio from the USB device all the way through to the
Windows virtual machine. VirtualBox and VMware mostly fail at delivering
reliable audio to the virtual machine.

I expect KVM not to work right with regard to clean audio/real-time
USB, but I'm asking in case I'm wrong. If it doesn't work, or can't
work yet, what would it take to make it possible for clean audio to be
passed through to a guest?
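
To make the question concrete, this is roughly the kind of invocation I
have in mind (just a sketch, untested; the disk image name and the USB
vendor/product IDs are placeholders for whatever lsusb reports for the
headset):

    # Sketch: boot the Windows guest under KVM with the USB headset handed
    # straight to the guest via QEMU's usb-host device.  IDs are placeholders.
    import subprocess

    HEADSET_VENDOR = "0x1234"    # placeholder -- real vendor ID from lsusb
    HEADSET_PRODUCT = "0x5678"   # placeholder -- real product ID from lsusb

    subprocess.run([
        "qemu-system-x86_64",
        "-enable-kvm",
        "-cpu", "host",
        "-m", "4096",
        "-drive", "file=windows.img,format=qcow2",   # placeholder disk image
        "-usb",
        "-device",
        "usb-host,vendorid=%s,productid=%s" % (HEADSET_VENDOR, HEADSET_PRODUCT),
    ], check=True)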

--- Why this is important, approaches that failed, and why I think this
will work. Boring accessibility info ---

Trying to make Windows- or DOS-based speech recognition drive Linux has
a long and tortured history. Almost all of the attempts involve some
form of open-loop system that ignores system context and counts on the
grammar to specify the context and the subsequent keystrokes injected
into the target system.

This model fails because it is effectively speaking keystrokes, which
wastes the majority of the power of a good grammar in a speech
recognition environment.

The most common configuration for speech recognition in a virtualized
environment today is Windows as the host running the speech recognition
and Linux as the guest. It's just a reimplementation of the open-loop
system described above: dictation results become keystrokes injected
into the virtual machine console window. Sometimes it works, sometimes
it drops characters.

One big failing of the Windows host / Linux guest setup is that, in
addition to dropping characters, it seems to drop segments of the audio
stream on the Windows side. This happens now and then whenever Windows
is under any sort of CPU load, but it's almost guaranteed as soon as a
virtual machine starts up.

Another failing is that the only context the recognition application is
aware of is the console window itself. It knows nothing about the
internal context of the virtual machine (which application has focus).
And unfortunately it can't know anything more, because of the way
NaturallySpeaking uses the local Windows context.

Inverting the relationship between guest and host, so that Linux is the
host and Windows is the guest, solves at least the focus problem. In
the virtual machine you have a portal application that can control the
perception of context and tunnels the character stream from the
recognition engine into the host OS to drive it open loop. The portal
application[1] can also communicate which grammar sequence has been
parsed and what action should be taken on the host side. At that point
we have the capabilities of a closed-loop speech recognition
environment, where a grammar can read context to generate a new grammar
that fits the application's state. This means smaller utterances which
can be disambiguated, versus the more traditional large-utterance
disambiguation technique.
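
To make that loop concrete, here is a rough sketch of the host side of
the tunnel (the port number, message format and helper commands are all
hypothetical, nothing is implemented yet): the portal in the guest
sends one JSON message per recognized utterance, and the host either
injects the text open loop or runs the parsed action directly.

    # Sketch of a host-side listener for the portal tunnel.  One JSON object
    # per line arrives from the guest portal; format and port are made up.
    import json
    import socket
    import subprocess

    HOST, PORT = "0.0.0.0", 7357    # hypothetical port the portal connects to

    def handle(msg):
        """Dispatch one recognition result sent by the guest portal."""
        if msg.get("type") == "keys":
            # Open-loop path: type the literal dictation into the focused window.
            subprocess.run(["xdotool", "type", "--delay", "0", msg["text"]])
        elif msg.get("type") == "action":
            # Closed-loop path: the grammar already parsed a command, run it.
            subprocess.run(msg["command"])

    with socket.create_server((HOST, PORT)) as server:
        conn, _addr = server.accept()
        with conn, conn.makefile("r", encoding="utf-8") as stream:
            for line in stream:
                handle(json.loads(line))

A real version would also push the host's focus and other context back
into the guest so the grammar can adapt, but that's the general shape
of the loop.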

A couple of other advantages of Windows as a guest: it only runs speech
recognition and the portal. There are no browsers, no Flash,
JavaScript, viruses or other "stuff" taking up resources and
distracting from speech recognition working as well as possible. The
downside is that the host running the virtual machine needs to give the
VM very high, almost real-time priority[2] so that it doesn't stall and
speech recognition works as quickly and as accurately as possible.
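
Concretely, "very high priority" would be something along these lines
(a sketch, needs root or CAP_SYS_NICE, and finding the QEMU process by
name is a simplification):

    # Sketch: move the running QEMU/KVM process onto the SCHED_FIFO real-time
    # scheduler so audio capture in the guest doesn't get starved.
    import os
    import subprocess

    pids = subprocess.run(["pgrep", "-f", "qemu-system-x86_64"],
                          capture_output=True, text=True, check=True)

    for pid in pids.stdout.split():
        # Priority 50 is mid-range for SCHED_FIFO: above every normal task,
        # below the kernel's own real-time threads.
        os.sched_setscheduler(int(pid), os.SCHED_FIFO, os.sched_param(50))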

Hope I didn't bore you too badly. Thank you for reading and I hope we 
can make this work.
--- eric



[1] should I call it cake?
[2] I'm looking at you, Firefox, sucking down 30% of the CPU doing nothing

Thread overview: 13+ messages
2014-11-18  6:48 can I make this work… (Foundation for accessibility project) Eric S. Johansson
2014-11-18 13:50 ` Paolo Bonzini
2014-11-18 13:53   ` Hans de Goede
2014-11-18 14:57     ` Eric S. Johansson
2014-11-20 16:28       ` Eric S. Johansson
2014-11-20 21:48         ` Paolo Bonzini
2014-11-20 22:22           ` Eric S. Johansson
2014-11-21 14:06             ` Paolo Bonzini
2014-11-21 16:52               ` Eric S. Johansson
2014-11-21 18:22                 ` Paolo Bonzini
2014-11-21 18:24                 ` next puzzle: " Eric S. Johansson
2014-11-21 18:47                   ` Eric S. Johansson
2014-11-18 14:51   ` Eric S. Johansson
