One of the major challenges in the development of an immersive system is handling the delay between the tracking of the user’s head position and the updated projection of a 3D image or auralised sound, also called end-to-end delay. Excessive end-to-end delay can result in the general decrement of the “feeling of presence”, the occurrence of motion sickness and poor performance in perception-action tasks. These latencies must be known in order to provide insights on the technological (hardware/software optimization) or psychophysical (recalibration sessions) strategies to deal with them. Our goal was to develop a new measurement method of end-to-end delay that is both precise and easily replicated. We used a Head and Torso simulator (HATS) as an auditory signal sensor, a fast response photo-sensor to detect a visual stimulus response from a Motion Capture System, and a voltage input trigger as real-time event. The HATS was mounted in a turntable which allowed us to precisely change the 3D sound relative to the head position. When the virtual sound source was at 90º azimuth, the correspondent HRTF would set all the intensity values to zero, at the same time a trigger would register the real-time event of turning the HATS 90º azimuth. Furthermore, with the HATS turned 90º to the left, the motion capture marker visualization would fell exactly in the photo-sensor receptor. This method allowed us to precisely measure the delay from tracking to displaying. Moreover, our results show that the method of tracking, its tracking frequency, and the rendering of the sound reflections are the main predictors of end-to-end delay.