In fact, you just need a well-made stereo headset. Give it a try: http://www.youtube.com/watch?v=0M6SELmKn1w
Audio systems already process the 3D position of the sound source.
First-person shooters use your position for processing the sound; moving it to screen depth would make it feel as if your ears were behind your head.
Many third-person games use the camera position as the listening position; that would also work for your bouncing box...
I think distance is perceived mostly through loudness, and only the direction through the arrival-time difference between the ears.
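To make those two cues concrete, here is a rough sketch in Python. It is only an illustration, not what any particular audio system does: the inverse-distance falloff, the simple sine model for the time difference, and the 0.2 m ear spacing are all my assumptions, and the function names are made up.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 degrees C
EAR_SPACING = 0.2       # m, assumed distance between the ears


def distance_gain(distance, reference=1.0):
    """Loudness cue: inverse-distance attenuation, clamped to at most 1."""
    return reference / max(distance, reference)


def interaural_time_difference(azimuth_deg):
    """Direction cue: extra path to the far ear divided by the speed of sound.

    azimuth_deg is 0 straight ahead and 90 directly to one side
    (simple sine model, ignores the shape of the head).
    """
    extra_path = EAR_SPACING * math.sin(math.radians(azimuth_deg))
    return extra_path / SPEED_OF_SOUND


if __name__ == "__main__":
    for dist in (1.0, 2.0, 4.0):
        print(f"distance {dist} m -> gain {distance_gain(dist):.2f}")
    for az in (0, 30, 90):
        itd_ms = interaural_time_difference(az) * 1000.0
        print(f"azimuth {az} deg -> time difference {itd_ms:.3f} ms")
```

With these assumptions the time difference stays well below a millisecond even for a source directly to the side, while the gain halves each time the distance doubles.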
As for your positioning problem: ask the user for the viewing distance, read out the separation of an example object, and then work with the z-buffer.
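One way to read that suggestion, sketched in Python: from the viewing distance and the on-screen separation of the two images of an object, plain similar-triangle geometry gives the distance at which the object appears to the viewer. The 0.065 m eye separation default and the function name are my assumptions, and mapping the result onto actual z-buffer values depends on your projection setup, so that part is left out.

```python
def perceived_distance(separation, viewing_distance, eye_separation=0.065):
    """Distance from the viewer at which a stereo object appears.

    separation: horizontal offset between the right-eye and left-eye image
        on the screen, in metres. Positive means the object appears behind
        the screen, negative in front of it, zero on the screen plane.
    viewing_distance: distance from the eyes to the screen, in metres.
    eye_separation: assumed distance between the eyes, in metres.
    """
    if separation >= eye_separation:
        # The lines of sight no longer converge in front of the viewer.
        raise ValueError("separation must be smaller than eye_separation")
    return viewing_distance * eye_separation / (eye_separation - separation)


if __name__ == "__main__":
    viewing_distance = 0.6  # metres, as entered by the user
    for sep in (-0.065, 0.0, 0.0325):
        depth = perceived_distance(sep, viewing_distance)
        print(f"separation {sep:+.4f} m -> appears at {depth:.2f} m")
```

The resulting distance could then be used to place the sound source relative to the listener, so that what you hear matches where the object seems to be.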