The bulk tester certainly could do with some polishing. It was mostly code copied from the Stack plugin, but I thought of it as primarily something for administrators and never put much effort into making it look nice. However, I find it's getting used increasingly by question authors, too, so it would be worth investing some effort.
I like the sound of all your suggestions. However, I'm in the middle of a busy teaching spell at present, so I won't be able to put any time into CodeRunner development for a while, possibly not until the end of the teaching year in November. I'd certainly be grateful for any pull requests.
I've also seen cases in the past where a question appeared to fail in the bulk run but was OK when I went to look at it, though I haven't had any recently. The failures were never repeatable, so debugging was problematic. The questions that failed were always fine in production use, so whatever caused the problem was specific to the bulk tester. Let me know if it recurs for you.