Processing linguistic corpus: Tools and methods

The conference will take place on Thursday and Friday, 4th and 5th October 2012
Amphithéâtre Durkheim, Université Paris Descartes, 7 rue de la Sorbonne, Paris.

Deadline for abstracts: 21st May 2012 (12:00 GMT)
Notification to authors: 15th June
Final program: July

Final paper submission deadline: 3rd September 2012


Proposals are to be sent by e-mail [address hidden on the source page by spam protection]. Both papers and posters can be presented in English or, preferably, in French.

  • Papers: Please send a two-page proposal, including a title, an abstract, five bibliographical references and a list of five keywords (font size 12, margins 2.5, line spacing 1.5). Each paper will have twenty minutes of oral presentation followed by ten minutes of discussion.

  • Posters: There will also be a poster session (A1 format) for shorter or work-in-progress presentations. Please send a one-page proposal, including a title, an abstract, five bibliographical references and a list of five keywords (font size 12, margins 2.5, single line spacing).

Call for papers

COLDOC is a conference organized every year by postgraduate students and young researchers of the MoDyCo laboratory (UMR 7114 – CNRS/Université Paris Ouest Nanterre/Université Paris Descartes). This year our aim is to explore the tools and methods which have emerged around corpus-based studies. Over the last decades, linguistics has undergone a considerable evolution in its object of study: it tends to focus less on language itself (as an a priori unlimited object, approached through introspection) and more on the corpus (as an attested sample of language). Today, the central position of the corpus in linguistic research affects the majority of linguistic studies, whether carried out by established linguists or by postgraduate students.

This rise of corpus-related issues fuels a latent, informal debate: the new perspective is often presented either in an exaggeratedly negative light (as a simplistic "fashion" that inhibits theoretical studies) or in an overly positive one (as a revolution that makes linguistics more "scientific" and "real").

We would like to move beyond this reductive conflict and invite all interested postgraduate students and young researchers to examine the range of methods and tools that have emerged with this "new age" of corpus studies. We hope to highlight the connection between observation and analysis, following the idea that the empirical and theoretical approaches are complementary, as Francis Bacon already emphasized in his time:

Those who have handled sciences have been either men of experiment or men of dogmas. The men of experiment are like the ant, they only collect and use; the reasoners resemble spiders, who make cobwebs out of their own substance. But the bee takes a middle course: it gathers its material from the flowers of the garden and of the field, but transforms and digests it by a power of its own. 

Novum Organum (1620), Book I, 95

The heart of our discussion will be this metaphorical "art of the bee" in working with corpora. From the collection of utterances or texts to the final theoretical interpretation and its applications, "processing" a corpus does indeed resemble a phase of "digestion" of empirical data.

More precisely, this evolution seems intrinsically linked to the development of tools in computing and computer science (text navigation, online corpora, transcription tools, analysis tools), which have dramatically changed our access to sources and affected the procedures of linguistic study. We assume that these technological developments have influenced not only linguistics but also, in an interdisciplinary way, the other social sciences. In these fields too, "experimental" and "data processing" approaches seem to have grown rapidly in recent years. The development of the internet and of computers has opened up a whole range of possibilities in corpus exploration. Part of the linguistic community works on corpora as such, providing ever more detailed analyses, whereas others investigate the development of instruments through NLP. In both cases, the central problem is how to pool the findings. The situation is rather complex because of the great variety of approaches, which depend on the topics and orientations chosen and on the trends that accompany them (constitution of "big" corpora, annotation workshops).

Following the COLDOC tradition of tackling methodological issues or broader problems of the linguistic field, we are calling for papers that investigate the examination of linguistic corpora, from conception to results. The issues at hand include the following topics:

  • perspectives on texts and utterances in different domains of linguistics,

  • levels of linguistic analysis and nature of corpus:

    • oral corpus in phonology, syntax, prosody, speech development problems, etc.

    • textual corpus in lexicometry, discourse analysis, syntax, communication studies,

    • multimodal corpus in acquisition, etc.

  • corpus design: closed vs. open corpus, representativeness, size of corpus,

  • corpus transcription, alignment, structuring and organisation,

  • problem definition (linguistic phenomena and procedures),

  • annotations and their processing (counts or measures, and their accuracy),

  • choice of input for the study: occurrences, constructions, categories, context, etc.

  • presenting results: statistical tables, graphs, diagrams, typologies, etc.

  • interpretation of results (regarding the hypothesis),

  • extractions, formal models, machine learning,

  • pooling of corpus analyses and results:

    • exploring existing bases (available corpora),

    • beyond publication, towards sharing data and results.

We are pleased to invite postgraduate students and young researchers to present their thoughts on one or several of these topics, originating from their own practical research, regardless of the stage of their studies.


Keynote speakers

Bernard COMBETTES (ATILF/CNRS, Université de Lorraine)

Anne CONDAMINES (CLLE-ERSS/CNRS, Université Toulouse Le Mirail)


Review Committee

Jean-Michel ADAM (Université de Lausanne)

Delphine BATTISTELLI (STIH, Université Paris Sorbonne)

Annie BERTIN (MoDyCo/CNRS, Université Paris Ouest Nanterre)

Caroline BOGLIOTTI (MoDyCo/CNRS, Université Paris Ouest Nanterre)

Bernard COMBETTES (ATILF/CNRS, Université de Lorraine)

Anne CONDAMINES (CLLE-ERSS/CNRS, Université Toulouse Le Mirail)

Marcel CORI (MoDyCo/CNRS, Université Paris Ouest Nanterre)

Flore COULOUMA (CREA-EA 370, Université Paris Ouest Nanterre)

Guillaume DESAGULIER (MoDyCo/CNRS, Université Paris Ouest Nanterre, Université Paris 8)

Brigitte JUANALS (MoDyCo/CNRS, Université Paris Ouest Nanterre)

Simon KREK (Institut Jozef Stefan, Ljubljana)

Anne LACHERET (MoDyCo/CNRS, Université Paris Ouest Nanterre)

Bernard LAKS (MoDyCo/CNRS, Université Paris Ouest Nanterre)

Denis LE PESANT (MoDyCo/CNRS, Université Paris Ouest Nanterre)

Danielle LEEMAN (MoDyCo/CNRS, Université Paris Ouest Nanterre)

Sabine LEHMANN (MoDyCo/CNRS, Université Paris Ouest Nanterre)

Sarah LEROY (MoDyCo/CNRS, Université Paris Ouest Nanterre)

Sylvain LOISEAU (LDI/CNRS, Université Paris 13 Nord)

Dominique MAINGUENEAU (CEDITEC/EA 3119, Université Paris Est Créteil, IUF)

Philippe MARTIN (UFRL, Paris 7)

Sylvie MELLET (BCL/CNRS, Université Nice Sophia Antipolis)

Jean-Luc MINEL (MoDyCo/CNRS, Université Paris Ouest Nanterre)

Colette NOYAU (MoDyCo/CNRS, Université Paris Ouest Nanterre)

Christophe PARISSE (MoDyCo/CNRS, INSERM, Université Paris Ouest Nanterre La Défense)

Christiane PRENERON (MoDyCo/CNRS, Université Paris Ouest Nanterre)

Sandrine REBOUL-TOURE (SYLED/CEDISCOR, Université Paris III Sorbonne Nouvelle)

Fanny RINCK (MoDyCo/CNRS, Université Paris Ouest Nanterre)

Clara ROMERO (MoDyCo/CNRS, Université Paris Ouest Nanterre)

Frédérique SITRI (SYLED, Université Paris III Sorbonne Nouvelle)

Ana ZWITTER VITEZ (Institut de Linguistique Slovène Appliquée Trojina, Ljubljana)

