GDR Adyloc Scientific Days on Language

Corpora, tools, and statistical models for language acquisition research

On 12 and 13 November 2013, the GDR Adyloc organized two days devoted to research based on language acquisition corpora.

The event took place at the Salle CNRS Pouchet, 59 rue Pouchet, 75017 Paris (see map).

Download the programme
Download the training materials
Link to the IRCOM training days page: video recording and use of ELAN.

The two mornings were devoted to a training workshop on acquisition corpora:
12 November: CLAN, transcription and use of corpora.
13 November: Corpus analysis tools: Excel, CLAN commands, ELAN, and a quick introduction to the textometry software TXM and the statistics software R.
-Participation in the two workshops required prior registration (20 places per workshop) with Christophe Parisse :

The two afternoons, open to the public, were devoted to talks on methods and research concerning language acquisition corpora. The speakers were:
Anna Theakston (University of Manchester)
Stefan Gries (University of California, Santa Barbara)
Colin Bannard (University of Texas at Austin)
Thomas Hills (University of Warwick)
Dylan Glynn (Université Paris 8)


Abstracts of the presentations

Colin Bannard: Rethinking the role of imitation in language development

This talk is concerned with children's use of imitation as a strategy in language learning. I will describe a range of different studies, all of which are concerned in some way with understanding when children will choose to imitate language they have heard others use, and when they will choose to be selective or creative in their productions. I will explain the utility of imitation via a statistical analysis of the language that children hear, and particularly a discussion of the bias-variance problem, a central concept in machine learning. I will use this to motivate corpus-derived statistical models that allow us to predict when children will imitate and when they will innovate, and will discuss the utility of such models in accounting for grammatical development.

Dylan Glynn: Correspondence Analysis. Exploring categorical data and identifying patterns

Correspondence analysis is an exploratory technique that reveals frequency-based associations in complex categorical data. The technique visualises these associations in ‘biplots’, or maps that depict degrees of correlation and variation through the relative proximity of data points (which represent linguistic usage features and/or the actual examples of use). Linguists often wish to find relations between given linguistic forms, their meanings, and the situations in which those forms and meanings are used. Correspondence analysis is designed specifically for identifying such usage patterns.

Stefan Gries: Some suggestions for better statistics in corpus-based language acquisition research

Compared to many other areas in linguistics (especially formal linguistics), research in language acquisition has a history of being based on empirical data from observational and/or experimental approaches. This in turn has resulted in language acquisition research featuring more, and more advanced, statistical analyses than many other sub-disciplines of linguistics. However, given recent developments in quantitative corpus linguistics and statistical methodology, it may be time to take stock, to discuss some common methodological choices in corpus-based language acquisition research, and to explore options for improving on them conceptually and methodologically. In this talk, I will discuss a few methodological choices that have frequently been made or that, more or less implicitly, underlie much corpus-based language acquisition research, with an eye to then proposing alternative ways to think about such data and methods. Among other things, I will be concerned with the multifactoriality of corpus-based language acquisition data, the question of acquisitional stages, and the identification of trends.

Thomas Hills: Word learning: Growing Semantic Networks on the Statistical Structure of Language

Children learn language in a sea of words. In this talk, I will report on my recent research attempting to predict word learning based on the structure of child-directed speech. This work is based on a new theory of language acquisition called the associative structure of language, which posits that word associations and contextual diversity work hand-in-hand to help children learn language. This work uses network analysis to pit computational models of language acquisition against one another, using large corpora of adult- and child-directed speech.

Anna Theakston: Learning grammatical constructions: insights from corpus data

From a constructivist perspective, children are thought to acquire the grammatical constructions of their language from the language to which they are exposed, that is, caregiver input. From this perspective, first, the distributional properties of the input are centrally important in determining the pattern of acquisition observed in early child speech and, second, the development of adult-like grammatical knowledge is assumed to emerge gradually as a function of a growing and increasingly connected network of representations. In this talk I will give an overview of a number of research projects in which we have investigated the acquisition of grammatical constructions of different kinds through the analysis of corpus data, including a consideration of grammatical constructions (the transitive), grammatical errors (case marking) and morphological systems (the past tense).

Round table: Using corpora of spontaneous child-adult interaction: How much does the child's production match the child's input?

Corpora of child language interaction are very often used as a means to study the child's production in a natural setting and to evaluate the degree of correspondence between the child's production and the child's input. The most recent dense corpora provide an even better picture of the correspondence between the child and the adult. However, there is always a part of the data that is missing in such corpora. Is it possible to estimate this part and to evaluate how far we can expect to find valid correspondences and how far we should expect the child to be creative or to remember what she heard in the input?

Downloads, tools

Modyco UMR CNRS - PARIS OUEST Nanterre