Robust assessment
Given the very rapid pace of recent developments, careful reflection on standards for the methodology of testing and assessment is lagging behind. What is required is a joint effort to converge on proper standards for the robust assessment of language models. A methodology is robust, in the sense intended here, if its results are generalisable (carrying over with sufficient certainty to other models and data sets), transferable (insightful beyond the purpose of understanding a single type of computational model), and reproducible (with the same or different models and data sets). Robust methodology also aspires to be as future-proof as possible, i.e. likely to remain relevant to the next generation of models or the next set of adversarial examples.
Safe applicability
As language technology is applied ever more widely, concerns of safe applicability become ever more important. Safe applicability subsumes critical aspects such as being conceptually sound (e.g. anchored in “first principles” or established empirical knowledge), validated (e.g. by mathematical proof or other rigorous derivation) or at least stress-tested across a near-exhaustive range of possible conditions of use, ethical (e.g. bias- and harm-free, or privacy-respecting), and also economical (i.e. minimising data requirements and energy consumption). Issues of safe applicability loom particularly large in high-stakes contexts, of which application in the scientific process is a special case. The Priority Programme LaSTing therefore also particularly invites contributions that reflect on the safe applicability of language technology for knowledge gain in the cognitive language sciences.
Foundational questions
Progress on understanding the behaviour of language models and their safe applicability is inextricably tied to a better understanding of their core mechanisms and of the impact of their training data and training objectives. But just as relevant are deep foundational questions concerning the nature of language models (e.g. what are LMs models of?) and their proper role in scientific research into human language (e.g. how could LMs be used as explanatory tools for understanding human language?). In response to these issues, the Priority Programme especially welcomes foundational work addressing general properties or potential limits of particular classes of language models, e.g. by means of mathematical arguments, simulation studies, tight conceptual argumentation, or a mixture of such methods.
Examples of more concrete research questions that fall under these three core issues include:
- Behavioural Assessment: What are adequate, robust methods of experimentally assessing the (abstracted, linguistic) capability of an LM based on its input-output behaviour? What constitutes a valid comparison of machine predictions with human behaviour?
- Representations & Mechanisms: What information is reliably retrievable from LMs’ latent representations (embeddings) for linguistic or explanatory purposes, or for understanding the inner workings of LMs? How can we distil the abstract computational processes that generate an LM’s behaviour?
- Training & Optimisation: How can we understand LMs in terms of their optimisation, e.g. via properties of the training data, their internal inductive biases, or the training objective? How does this compare with human language learning?
- Task Decomposition Models: What are best practices for using LMs as part of a larger (theoretically informed) decomposition of the task to be solved (e.g. in agent models, applications such as RAG, or explanatory, neuro-symbolic (cognitive) models)?
- Resource Efficiency: How can we address the problems of data hunger and computational cost (in training and inference), e.g. by taking human-like inductive biases into account or by using more informative, curated data? How can we use synthetic data and machine judgements to address theoretical issues?
- Alternative Models: How can language science benefit from alternative models beyond text-to-text LMs, e.g. by embracing multi-modality, interaction, dialogue or more cognitively plausible model architectures?
- Ontological Status: Are LMs models or theories of language? What exactly does an LM predict (occurrence frequencies, behaviour of an idealised speaker, aggregated behaviour of a population of speakers …)?
- Explanatory Potential: How can novel language technology be used as, or in support of, explanations, e.g. of linguistic phenomena or of empirical and experimental data in the language sciences?
- LM Capabilities: What are the limits of LM capabilities, and why? How can we systematically identify them, including for future generations of language modelling and technology?
Examples of work outside the scope of this Priority Programme are efforts geared mainly at improving system performance (e.g. on some benchmark score). Also out of scope are projects that merely seek new areas of application for established tools, offering little or no reflection on methods or concepts, or any other bearing on the knowledge-oriented cognitive language sciences.

In order to achieve its goals, LaSTing requires broad and deep interdisciplinary collaboration. The Priority Programme therefore implements an extensive suite of individual measures to support diversity, networking and dissemination, and to ensure the success of early career researchers and of scholars with backgrounds underrepresented in academic research. Early career researchers are explicitly encouraged to submit their own proposals.