I've been having some issues with blktrace on large(-ish) systems that are primarily due to the N(devs) x N(cpus) nature of blktrace: we fire off one thread for each device on each CPU. With (say) 32 cores and (say) 100 devices, we're seeing 3,200 threads competing for cores just to handle the data transfer from the relay interface out to long-term storage.

Today I whipped up a prototype of a new blktrace implementation (it only handles the "standard" read-from-relay and write-to-file mode, *not* the network or piped modes). This implementation fires off a single thread per CPU; each thread then manages all the relay files for its CPU (opens them all, and uses poll() to determine which ones need processing). This cuts down a lot on the scheduling overhead, and hopefully will have less of an impact on the overall system.

I've done some testing on a small-ish machine - using valgrind to ensure proper handling of all memory - as well as on a 16-way with >100 disks, and things are working as expected. (At least blkrawverify, blkparse, btt, ... are all working fine for some test cases.)

I've attached the current sources - a lot of this is just ripped out of the blktrace.c currently in the tree, and some of it is reformatted...

Comments?

Thanks,
Alan