Thanks Sam. Some interesting questions.
Let me explain how we deal with some of the issues you raise, and then we can discuss whether CodeRunner itself needs any extra functionality to provide better feedback.
Firstly we often put comments within the test code to explain to the student just what we're testing for. For example we might have a test
# Check that function returns the
# correct type (should be bool not str)
print(type(is_allowed_to_drink(18)))
Since such tests tend to just clutter the result table for most students, whose functions do return the right type, we would probably set this test to Hide if succeed. That way, only those students whose functions fail the test will see it, and we hope the comment on the test is sufficient feedback for them to fix their code. Even without the comment, the test code itself certainly shows what went wrong, and some students might actually learn more without it, as they then have to figure out the test code for themselves.
In tests and assignments we usually set both the Hide if succeed and the Hide rest if fail checkboxes on all except the "For example" tests, so that students see either a green result table or a red one showing the example tests (which you would hope they had already tested with) and the first failing test.
When the test code really does get too complex to expose to students, e.g. with tkinter questions, I do tricks like the one you describe: the Test column of the result table describes what I'm testing for and the actual test code is in the (hidden) TEST.extra field, using a template that runs both the TEST.testcode and the TEST.extra code for each test case. The output from the code in the TEST.extra field is then either just "OK" or a specific message saying what failed, and the TEST.expected field is simply set to "OK".
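In rough outline, such a per-test template looks something like this (sketched for a Python question, with the visible testcode assumed to be just a descriptive comment; the details vary from question to question):

{{ STUDENT_ANSWER }}

# The visible test code, which in these questions is usually just a
# comment describing what is being checked.
{{ TEST.testcode }}

# The hidden real test, which prints "OK" or a message saying what failed.
{{ TEST.extra }}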
Another trick I have is to use a template that first checks for the existence of a file _prefix.py among the support files; if this is found, its contents are inserted into the test run before the test code. That file can include quite complex test functions that you can call from the per-test-case test code. This saves you having to repeat the complex code for each test or customise the template to include it.
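One simple way to do that is to have the generated test program look for the file at run time (support files sit in the working directory of the sandboxed run), along these lines:

{{ STUDENT_ANSWER }}

import os

# Insert the contents of _prefix.py (if it was uploaded as a support file)
# into the run before the test code, so its helper test functions can be
# called from the per-test-case test code.
if os.path.isfile('_prefix.py'):
    exec(open('_prefix.py').read())

{{ TEST.testcode }}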
Testing code with floating point numbers is always problematic. We teach the format method of a string quite early and use that in tests. For example
print("Average speed {:.2f}".format(student_answer))
You still need to ensure that your test data avoids situations where small errors might change the rounded output. In the example above, if the correct answer were 23.155, the computed answer might be printed as either 23.15 or 23.16.
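To see how fine that edge is, compare a value stored just below a rounding boundary with the same value nudged fractionally above it (my own numbers here, not the 23.155 example):

x = 2.675                          # actually stored as 2.67499999999999982...
print("{:.2f}".format(x))          # prints 2.67
print("{:.2f}".format(x + 1e-10))  # prints 2.68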
A much better solution is used by my colleague Jenny Harlow in Matlab courses. She has a sophisticated combinator template grader that extracts the numbers from both the expected answer (generated by running the sample answer) and the student answer and compares them all to a given tolerance. With this approach you have complete control over the feedback but at the cost of a huge increase in complexity.
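The core idea is easy enough to sketch in Python (this is my own rough sketch, not Jenny's actual grader, which is in Matlab and also has to assemble the grading result that CodeRunner expects from a template grader):

import re
import math

NUMBER = re.compile(r'-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?')

def numbers_in(text):
    """Return all the numeric tokens in text, as floats."""
    return [float(tok) for tok in NUMBER.findall(text)]

def outputs_match(expected, got, rel_tol=1e-4, abs_tol=1e-6):
    """True if the two outputs contain the same numbers, to within tolerance."""
    exp_nums, got_nums = numbers_in(expected), numbers_in(got)
    if len(exp_nums) != len(got_nums):
        return False
    return all(math.isclose(e, g, rel_tol=rel_tol, abs_tol=abs_tol)
               for e, g in zip(exp_nums, got_nums))

A real grader would also want to check the non-numeric parts of the output and build whatever feedback message you want the student to see.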
Perhaps those ideas will help with some of your problems?
Richard