Imitation is a powerful capability of infants, relevant for bootstrapping many cognitive capabilities such as communication, language, and learning under supervision. In infants, this skill relies on establishing a joint attentional link with the teaching party. In this work we propose a method for establishing joint attention between an experimenter and an embodied agent. The agent first estimates the head pose of the experimenter by tracking the head with a cylindrical head model. Two separate neural network regressors then interpolate the gaze direction and the target object depth from the computed head pose estimates. A bottom-up, feature-based saliency model is used to select and attend to objects in a restricted visual field indicated by the gaze direction. We demonstrate our system on a number of recordings in which the experimenter selects and attends to an object among several alternatives. Our results suggest that rapid gaze estimation can be achieved for establishing joint attention in interaction-driven robot training, a promising testbed for hypotheses about cognitive development and the genesis of visual communication.
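The final selection step, attending to the most salient location inside a visual field restricted by the gaze direction, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the described system: the gaze is treated as a 2D ray in the image plane, the restricted field is a cone with a fixed half-angle, the saliency map is a plain array, and the names `gaze_cone_mask`, `attend`, and `half_angle_deg` are hypothetical.

```python
import numpy as np


def gaze_cone_mask(shape, origin, direction, half_angle_deg=15.0):
    """Boolean mask of pixels lying within a cone around the gaze direction.

    Assumptions (illustrative only): `origin` is the (x, y) pixel of the
    head, `direction` is a 2D gaze vector in the image plane.
    """
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Vector from the gaze origin to every pixel.
    dx = xs - origin[0]
    dy = ys - origin[1]
    dist = np.hypot(dx, dy)
    d = np.asarray(direction, dtype=float)
    d = d / np.linalg.norm(d)
    # Cosine of the angle between the gaze direction and each pixel direction.
    with np.errstate(invalid="ignore", divide="ignore"):
        cos_angle = (dx * d[0] + dy * d[1]) / dist
    # The origin pixel has no direction; treat it as inside the cone.
    cos_angle = np.where(dist > 0, cos_angle, 1.0)
    return cos_angle >= np.cos(np.deg2rad(half_angle_deg))


def attend(saliency, origin, direction, half_angle_deg=15.0):
    """Return the (x, y) pixel with the highest saliency inside the gaze cone."""
    mask = gaze_cone_mask(saliency.shape, origin, direction, half_angle_deg)
    restricted = np.where(mask, saliency, -np.inf)
    y, x = np.unravel_index(np.argmax(restricted), saliency.shape)
    return int(x), int(y)


# Toy saliency map with two candidate objects.
saliency = np.zeros((100, 100))
saliency[50, 80] = 1.0  # object at (x=80, y=50), along the gaze
saliency[10, 20] = 2.0  # more salient object at (x=20, y=10), off the gaze
target = attend(saliency, origin=(10, 50), direction=(1.0, 0.0))
```

In this toy case the globally most salient object lies outside the cone, so the weaker object along the gaze is selected. A depth estimate, as produced by the second regressor in the abstract, could narrow the cone further to a range band along the ray; that extension is not sketched here.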