To determine the meanings of words in their native language, learners must extract information from noisy, complex environments. To succeed, they must direct their attention moment by moment to the most informative subset of these data. Using eye-tracking data from participants engaged in a cross-situational word-learning task, we ask how attention and learning are dynamically coupled in real time. In the cross-situational word-learning paradigm, learners are exposed to a series of trials, each containing multiple objects and multiple words. Although each trial is individually ambiguous (it contains many potential word-object mappings), the correct mappings can be determined over time by computing association frequencies between words and objects across the whole training set. Synchrony between spoken word onsets and object fixations informs a model in which word-object associations are strengthened as a function of visual attention. This model is then used to predict the learning outcomes of individual participants.
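The two ideas described above, association frequencies accumulated across ambiguous trials and associations strengthened as a function of visual attention, can be illustrated with a minimal sketch. It is not the model reported here: the additive update rule, the use of looking proportions as weights, the function names, and the toy words, objects, and fixation values are all assumptions made for illustration.

```python
from collections import defaultdict

def cooccurrence_counts(trials):
    """Count how often each word appears on the same trial as each object."""
    counts = defaultdict(float)
    for words, objects in trials:
        for w in words:
            for o in objects:
                counts[(w, o)] += 1.0
    return counts

def attention_weighted_counts(trials, fixations):
    """Increase each word-object association by an attention weight.

    fixations[i] maps (word, object) to the assumed proportion of time the
    learner fixated that object in a window following that word's onset on
    trial i. Pairs that are missing from the mapping contribute nothing.
    """
    counts = defaultdict(float)
    for (words, objects), looks in zip(trials, fixations):
        for w in words:
            for o in objects:
                counts[(w, o)] += looks.get((w, o), 0.0)
    return counts

def best_referent(counts, word, objects):
    """Return the object with the strongest association to the word."""
    return max(objects, key=lambda o: counts.get((word, o), 0.0))

# Toy training set (hypothetical): each trial is ambiguous on its own.
trials = [
    (["bosa", "gasser"], ["dog", "ball"]),
    (["bosa", "manu"], ["dog", "cup"]),
    (["gasser", "manu"], ["ball", "cup"]),
]
all_objects = ["dog", "ball", "cup"]

# Hypothetical looking proportions for one learner, one dict per trial.
fixations = [
    {("bosa", "dog"): 0.1, ("bosa", "ball"): 0.9,
     ("gasser", "dog"): 0.5, ("gasser", "ball"): 0.5},
    {("bosa", "dog"): 0.5, ("bosa", "cup"): 0.5,
     ("manu", "dog"): 0.2, ("manu", "cup"): 0.8},
    {("gasser", "ball"): 0.7, ("gasser", "cup"): 0.3,
     ("manu", "ball"): 0.4, ("manu", "cup"): 0.6},
]

unweighted = cooccurrence_counts(trials)
weighted = attention_weighted_counts(trials, fixations)

for word in ["bosa", "gasser", "manu"]:
    print(word,
          best_referent(unweighted, word, all_objects),
          best_referent(weighted, word, all_objects))
```

In this toy case the unweighted counts recover the intended mapping for every word, while the attention-weighted counts predict that this particular learner maps "bosa" onto the ball, because the learner mostly looked at the ball when "bosa" was first heard. This is the sense in which an attention-weighted model can yield predictions for individual participants rather than for the training set as a whole.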