MDLChunker: a MDL-based Model of Word Segmentation

Abstract

This paper applies an MDL-based (Minimum Description Length) computational model of inductive learning to the problem of word segmentation. The main idea is that syllables are grouped into words as soon as this operation decreases the size of the overall representation of the data, that is, the codelength of the information. When exposed to a stream of artificial words, our model (MDLChunker) reproduces the effect reported by Giroud & Rey (in press): humans initially learn sub-words as well as real words, but with further exposure they learn real words better than sub-words. To better mimic human learning, a limited-size short-term memory was added to the model, and estimates of its size are given.
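The grouping criterion can be illustrated with a minimal sketch. This is not the MDLChunker implementation: it is a batch, greedy approximation with no short-term memory, and the two-part cost (a Shannon code for the data plus a fixed number of bits per syllable for the lexicon) is an assumption chosen for simplicity. The function names `codelength`, `merge_pair`, and `chunk` are hypothetical.

```python
import math
from collections import Counter


def codelength(tokens, bits_per_syllable):
    """Assumed two-part cost: bits to encode the data plus bits to spell out the lexicon."""
    counts = Counter(tokens)
    total = len(tokens)
    # Data cost: each unit costs -log2(relative frequency) bits per occurrence.
    data_bits = -sum(c * math.log2(c / total) for c in counts.values())
    # Lexicon cost: every distinct unit is spelled out syllable by syllable.
    lexicon_bits = sum(len(unit) for unit in counts) * bits_per_syllable
    return data_bits + lexicon_bits


def merge_pair(tokens, left, right):
    """Replace each non-overlapping adjacent (left, right) pair with one merged unit."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
            out.append(left + right)  # tuple concatenation builds the longer unit
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def chunk(syllables):
    """Greedily merge adjacent units while a merge shortens the total codelength."""
    tokens = [(s,) for s in syllables]
    # Assumed uniform code over the syllable inventory (at least 1 bit per syllable).
    bits = math.log2(max(2, len(set(syllables))))
    best = codelength(tokens, bits)
    while True:
        candidates = set(zip(tokens, tokens[1:]))
        scored = [(codelength(merge_pair(tokens, l, r), bits), l, r)
                  for l, r in candidates]
        if not scored:
            break
        cost, l, r = min(scored)
        if cost >= best:  # no merge decreases the codelength: stop
            break
        tokens, best = merge_pair(tokens, l, r), cost
    return tokens


# Toy stream: the artificial word "ba bi" repeated eight times.
stream = ["ba", "bi"] * 8
print(chunk(stream))
```

On this toy stream the sketch groups every "ba bi" pair into a single unit and then stops, because merging two copies of the word into a longer unit would enlarge the lexicon without shortening the data. The model described in the abstract differs in that it applies the criterion as the stream unfolds and under a memory limit.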

