READ ME file

Title of dataset: "Wannabe Approximatives: datasets"
Authors: Muriel Norde (Humboldt-Universität zu Berlin); Francesca Masini (Università di Bologna); Kristel Van Goethem (Université catholique de Louvain); Daniel Ebner (Humboldt-Universität zu Berlin)
Contact email: francesca.masini@unibo.it
License: Creative Commons: Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/


------------------------
Abstract

This item contains six datasets of annotated concordances of "wannabe" collocations, one for each of six languages: Danish, Dutch, English, Finnish, French, and Italian. The concordances come from the TenTen corpora on SketchEngine (https://www.sketchengine.eu/), specifically: daTenTen20 for Danish; nlTenTen20 for Dutch; enTenTen20 for English; fiTenTen14 for Finnish; frTenTen20 for French; itTenTen20 for Italian. All datasets share the same structure, which allows for cross-linguistic comparison. They constitute the underlying data of the study by Norde et al. (2025), listed in the References.

------------------------
Content

The dataset includes the following files:

• README_wannabe.txt (this file)
• wannabe_Danish_datenten20_repository.csv
• wannabe_Danish_datenten20_repository.xlsx
• wannabe_Dutch_nltenten20_repository.csv
• wannabe_Dutch_nltenten20_repository.xlsx
• wannabe_English_ententen20_repository.csv
• wannabe_English_ententen20_repository.xlsx
• wannabe_Finnish_fitenten14_repository.csv
• wannabe_Finnish_fitenten14_repository.xlsx
• wannabe_French_frtenten2020_repository.csv
• wannabe_French_frtenten2020_repository.xlsx
• wannabe_Italian_ittenten20_repository.csv
• wannabe_Italian_ittenten20_repository.xlsx

------------------------
Details

For each language, one dataset is provided in two formats (csv and Excel). It contains 500 concordance lines sampled from the corresponding TenTen corpus. Each concordance line is annotated for a set of parameters that are fully described in Norde et al. (2025).

Each Excel/csv file contains the following information:

• "Reference" = the (partial) url provided by SketchEngine
• "Left" = left context of the Kwic (keyword in context)
• "Kwic" = the keyword in context, corresponding to the wannabe construction
• "Right" = right context of the Kwic
• "Cxn" = the type of construction; possible values: "collocation", "derivation", "embedded collocation", "predicative", "wannabe_ADJ", "wannabe_ADV", "wannabe_N", "wannabe_V"
• "Bonding" = the type of bonding; possible values: "//" (NA), "bound", "free", "hyphen"
• "Head" = the head of the wannabe construction (if wannabe is not the head)
• "PoShead" = the part of speech of the head; possible values: "//" (NA), "ADJ", "N", "N-prop", "NP", "PRO"
• "Order" = the position of wannabe relative to the head; possible values: "//" (NA), "wannabe-X", "X-wannabe", "X-wannabe-X"
• "Inflection" = the presence/type of inflection; possible values: "no", "yes_EN", "yes_native"
• "Semcat" = the semantic/ontological class of the head; possible values: "//" (NA), "animate", "human", "inanimate"
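
The closed value sets listed above can be written down as a small lookup table, for instance to spot typos in a row of annotations. The following Python sketch is only an illustration: the names ALLOWED_VALUES and unexpected_values are hypothetical, the example row is invented and not taken from the datasets, and it assumes that each row is available as a mapping from the field names above to strings.

```python
# Allowed values per annotated field, copied from the list above.
# "//" stands for NA. Free-text fields (Reference, Left, Kwic, Right, Head)
# are not checked.
ALLOWED_VALUES = {
    "Cxn": {"collocation", "derivation", "embedded collocation", "predicative",
            "wannabe_ADJ", "wannabe_ADV", "wannabe_N", "wannabe_V"},
    "Bonding": {"//", "bound", "free", "hyphen"},
    "PoShead": {"//", "ADJ", "N", "N-prop", "NP", "PRO"},
    "Order": {"//", "wannabe-X", "X-wannabe", "X-wannabe-X"},
    "Inflection": {"no", "yes_EN", "yes_native"},
    "Semcat": {"//", "animate", "human", "inanimate"},
}


def unexpected_values(row):
    """Return {field: value} for every checked field whose value is not listed."""
    return {
        field: row[field]
        for field, allowed in ALLOWED_VALUES.items()
        if field in row and row[field] not in allowed
    }


# Invented example row (assumption: not an actual line from the datasets).
example_row = {
    "Cxn": "collocation",
    "Bonding": "free",
    "Head": "star",
    "PoShead": "N",
    "Order": "wannabe-X",
    "Inflection": "no",
    "Semcat": "human",
}

problems = unexpected_values(example_row)
```

An empty result means that every checked field of the row holds one of the listed values.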

CSV files use the Western Europe (Windows-1252) character set. The field separator is a semicolon (;) and the string delimiter is a double quote (").
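
These settings matter when the csv files are opened programmatically, since many tools default to UTF-8 and comma separators. Below is a minimal Python sketch using only the standard library. It assumes that the first line of each csv file holds the field names listed above; the function name read_concordances is hypothetical and the sample bytes are invented for illustration, not taken from the datasets.

```python
import csv
import io


def read_concordances(binary_stream):
    """Decode a Windows-1252 byte stream and parse it as semicolon-separated csv."""
    # newline="" lets the csv module handle line endings inside quoted fields.
    text = io.TextIOWrapper(binary_stream, encoding="cp1252", newline="")
    reader = csv.DictReader(text, delimiter=";", quotechar='"')
    return list(reader)


# Invented two-line sample, encoded as Windows-1252 to mimic the files.
sample = (
    "Reference;Left;Kwic;Right\n"
    '"example.org";"un";"wannabe café";"; ou presque"\n'
).encode("cp1252")

rows = read_concordances(io.BytesIO(sample))

# For one of the actual files, open it in binary mode, e.g.:
# with open("wannabe_English_ententen20_repository.csv", "rb") as f:
#     rows = read_concordances(f)
```

Because the string delimiter is honoured, a semicolon inside a quoted field (as in the "Right" value of the sample) stays part of that field instead of starting a new one.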

------------------------
References

Muriel Norde, Francesca Masini, Kristel Van Goethem & Daniel Ebner. 2025. Wannabe Approximatives. Creativity, Routinization or Both? In Sabine Arndt-Lappe & Natalia Filatkina (eds.), Dynamics at the Lexicon-Syntax Interface. Creativity and Routine in Word-Formation and Multi-Word Expressions. De Gruyter Mouton.