In the present study, we hypothesized that the gist representation of a picture, extracted during a brief initial inspection, supports inference generation from subsequent text, which in turn should foster comprehension. Moreover, we proposed that longer inspection of a picture is necessary to provide learners with an alternative representation that fosters mental animation and recall. Participants (N = 76) learned from a text about pulley systems and, in three of four conditions, from an additional picture of a pulley system. Students saw either the text only, the picture for 150 ms or 2 s before the text, or a self-paced presentation of the picture before the text. The results confirm our assumptions: presenting the picture before the text for the time needed to extract its gist (2 s) fostered comprehension, whereas only the self-paced presentation of the picture fostered mental animation and recall.