Information, Geometry, and Physics Seminar
Human language is a unique form of communication in the natural world. Most fundamentally, it has systematic structure: signals can be broken down into component parts that are individually meaningful -- roughly, words -- and these parts are combined in a regular, hierarchical way to form sentences. Furthermore, the way the parts are combined maintains a kind of locality: words are usually concatenated one after another, and they form contiguous phrases. I argue that natural-language-like systematicity arises in codes that minimize predictive information, a measure of statistical complexity equal to the minimum amount of information needed to predict the future of a sequence from its past (Bialek, Nemenman & Tishby, 2001). In simulations, I show that codes minimizing predictive information (equivalently, excess entropy) factorize their source distributions into groups of approximately independent components, which are expressed systematically and locally, corresponding to words and phrases. Next, drawing on large corpora of naturalistic text, I show that human languages are structured in ways that reduce predictive information at the levels of phonology, morphology, syntax, and semantics. Together, these results establish a link between the statistical and the algebraic structure of human language.
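For readers unfamiliar with the quantity, the following is a minimal statement of the definition given by Bialek, Nemenman & Tishby (2001); the block-entropy notation below is standard and supplied here for reference, not taken from the talk itself:

\[
I_{\mathrm{pred}}(T, T') \;=\; H\!\left(X_{-T+1}^{\,0}\right) \;+\; H\!\left(X_{1}^{\,T'}\right) \;-\; H\!\left(X_{-T+1}^{\,T'}\right),
\qquad
E \;=\; \lim_{T,\,T' \to \infty} I_{\mathrm{pred}}(T, T'),
\]

where H denotes the entropy of a block of symbols. The limit E, the excess entropy, is the mutual information between the sequence's infinite past and infinite future, i.e., the total amount of information the past carries about the future.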