I've been having some issues with blktrace on large(-ish) systems that are primarily due to the N(devs) x N(cpus) nature of blktrace: we fire off one thread for each device on each CPU. With (say) 32 cores and (say) 100 devices, we're seeing 3,200 threads competing for cores just to handle the data transfer from the relay interface out to long-term storage.

Today I whipped up a prototype of a new blktrace implementation (it only handles the "standard" read-from-relay and write-to-file mode, *not* the network or piped modes). This implementation fires off a single thread per CPU; each thread then manages all the relay files for its CPU (opens them all, and uses poll() to determine which ones need processing). This cuts down a lot on the scheduling overhead, and hopefully will have less of an impact on the overall system.

I've done some testing on a small-ish machine - using valgrind to ensure proper handling of all memory - as well as on a 16-way with >100 disks, and things are working as expected. (At least blkrawverify, blkparse, btt, ... are all working fine for some test cases.)

I've attached the current sources - a lot of this is just ripped out of the blktrace.c currently in the tree, and some of it is reformatted...

Comments?

Thanks,
Alan