
Research Report

Word Predictability in Portuguese: Cloze Norming Study vs. LLMs

Jane Aristia

Université Paris Nanterre

https://orcid.org/0000-0002-7668-7043

j.aristia@gmail.com


Keywords

Word Probability
Cloze Study
LLM

Abstract

With the rise of large language models (LLMs), they have been proposed as a possible alternative to human participants in many scientific domains, including a common tool in linguistic research: the cloze study. Cloze probability informs researchers how predictable a word is within a given sentential context, and it is widely used in linguistic studies to understand language production and processing. Previous studies (e.g., Jacobs et al., 2022; Lopes Rego et al., 2024) have compared LLM performance with traditional cloze studies, with promising results. Nonetheless, these studies were conducted in English. Hence, we examined LLM performance in Portuguese. Here, we compared results from a traditional cloze study with two LLMs, Grevásio (Santos et al., 2024) and Tucano (Corrêa et al., 2024), and then performed a correlation analysis to assess their performance. The results show weak and moderate correlations between the cloze probabilities from human participants and those from the LLMs. These results highlight the gap between human and LLM performance, specifically in cloze probability.

Lay Summary

The use of large language models (LLMs) offers the possibility of obtaining data that may mimic human responses, given their ability to produce human-like output. In research, we observe a growing number of studies using these methods, as they provide time and financial benefits by allowing researchers to obtain responses faster than collecting data from human participants. This trend is also evident in the linguistic domain; one way LLMs are used is for estimating cloze probability. However, previous studies arguing that LLM responses can serve as an alternative to human responses have mostly been conducted in English. We still know little about whether this finding also applies to Portuguese. Therefore, this study examines whether LLM responses are comparable to those produced by humans. The results show that there is still a gap between human and LLM performance.

Introduction

The growing interest in understanding language prediction is accompanied by an increased need for cloze norming studies, as they give us information about the predictability of a word within a sentence. Traditionally, such a norming study requires participants to complete a sentence (e.g., “The students are studying hard because tomorrow they will have an ___”). It measures the predictability of a target word based on the sentential context. The cloze probability is obtained by calculating the proportion of participants’ word responses for each sentence. However, with the advancement of large language models (LLMs), we observe a shift towards LLMs as an alternative. Some researchers argue that LLM responses are comparable to human responses and can be used to understand human cognition (Hu et al., 2022, 2024). However, this idea is still debated, as opponents argue that LLMs use different mechanisms and that it is premature to use them to understand human cognition (Katzir, 2023; Leivada et al., 2024).
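The proportion-based computation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' own pipeline; the responses are hypothetical.

```python
from collections import Counter

def cloze_probabilities(responses):
    """Cloze probability of each completion: the proportion of
    participants who produced that word for the sentence frame."""
    counts = Counter(w.strip().lower() for w in responses)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

# Hypothetical responses to "... tomorrow they will have an ___":
responses = ["exam", "exam", "exam", "exam", "interview"]
print(cloze_probabilities(responses)["exam"])  # 0.8
```

The cloze probability of the most frequent completion ("exam" here, produced by 4 of 5 hypothetical participants) is simply its response proportion.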

Despite these debates, researchers still try to use LLMs to understand linguistic processing. For instance, they have been used for linguistic tasks such as the cloze norming study (Jacobs et al., 2022; Lopes Rego et al., 2024). A traditional cloze norming study requires collecting human participant data, which takes time, and we sometimes also need to provide financial compensation for participants. In short, the traditional method is time-consuming and costly. LLMs, on the other hand, enable us to obtain data faster and more efficiently: they allow us to skip the data-collection step, because the cloze probability can be calculated from the sub-word tokens of each word in a sentence. Therefore, LLMs could be deemed a promising alternative to the traditional cloze norming study.

To make sure that LLMs can provide good word probabilities, studies have compared cloze norming studies with LLMs (e.g., Jacobs et al., 2022; Lopes Rego et al., 2024), and the results suggest that they are indeed comparable. For instance, Lopes Rego et al. (2024) compared a traditional cloze norming study with cloze estimates from LLMs such as GPT-2 and Llama to inform reading models such as the OB1-reader (Snell et al., 2018). Their study showed that the LLMs' estimates predicted human eye movements towards anticipated words more accurately than the traditional cloze study did. These studies suggest that LLMs are a promising alternative to the traditional cloze study. Nonetheless, we need to be cautious about the possibility of overfitting in LLMs.

Given how comparable LLMs appear to be to traditional cloze studies, we were interested in whether this also applies to European Portuguese, because most previous studies were done in English. So, this study aims to determine whether cloze probabilities from LLMs are comparable to Portuguese human cloze probabilities. For this purpose, we used cloze norming datasets from Aristia et al. (submitted); for the LLMs, we used two Portuguese text generation models, Grevásio (Santos et al., 2024) and Tucano (Corrêa et al., 2024), the most recent openly accessible text generation models we could find.

1. Methods

1.1. Cloze norming study

The data were taken from the cloze norming studies in Aristia et al. (submitted), which were adapted from the sentence pool in Frade et al. (2022). Aristia et al. (submitted) conducted two cloze studies: the first to obtain the probability of the article in each sentence, and the second to obtain the probability of the noun. In the present study, we used only the 117 sentences from the second cloze study that were used in their experiment. The average age of the participants was 27.69 years (range: 20-50 years); 45 of the 125 participants were female.

1.2. Cloze probability with LLM

For the LLMs, we used Grevásio (Santos et al., 2024) and Tucano (Corrêa et al., 2024). Grevásio is an open-source decoder model from the LLaMA family. It was trained with a supervised fine-tuning method on labelled data, namely the GLUE and SuperGLUE datasets machine-translated into European Portuguese. The other model, Tucano, is a Transformer-based model pre-trained in Portuguese; here we used its supervised fine-tuned variant, Tucano-2b4-Instruct. It was trained on several datasets, such as the Portuguese version of one million GPT-4-generated Orca math word problems (Mitra et al., 2024) and the Aira dataset (Corrêa, 2024). Unlike Grevásio, Tucano was trained not only on European Portuguese but also on Brazilian Portuguese.

Further, to obtain the target word’s probability from the LLMs, we adapted the code from GPT-2-for-Psycholinguistic-Applications, developed by Samer Nour Eddine, which allows us to use LLMs for the Portuguese language. The sentences were parsed into sub-word tokens and each word position was marked. The cloze probability of each word was obtained by calculating the conditional probabilities of that word's sub-word tokens.
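The chaining of sub-word conditional probabilities can be sketched as follows. This is an illustrative toy, not the adapted script: a small dictionary stands in for the LLM's next-token distribution, and the tokenization and probabilities are invented for the example.

```python
import math

# Toy stand-in for an LLM's next-token distribution: maps a context
# (tuple of tokens already seen) to P(next_token | context).
# A real implementation would query a causal LM instead.
TOY_MODEL = {
    ("os", "alunos", "terão", "um"): {"ex": 0.6, "teste": 0.3},
    ("os", "alunos", "terão", "um", "ex"): {"ame": 0.9},
}

def word_probability(context, subword_tokens, model=TOY_MODEL):
    """P(word | context) via the chain rule over the word's sub-word
    tokens: the product of each token's conditional probability."""
    log_p = 0.0
    ctx = tuple(context)
    for tok in subword_tokens:
        p = model.get(ctx, {}).get(tok, 0.0)
        if p == 0.0:
            return 0.0
        log_p += math.log(p)  # sum log-probs for numerical stability
        ctx = ctx + (tok,)    # extend the context with this token
    return math.exp(log_p)

# Pretend "exame" is tokenized as ["ex", "ame"]:
p = word_probability(["os", "alunos", "terão", "um"], ["ex", "ame"])
print(round(p, 2))  # 0.54
```

The target word's probability is P(first sub-token | sentence context) times P(each subsequent sub-token | context plus the preceding sub-tokens), which is what the loop accumulates in log space.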

Figure 1. Average cloze probability from the cloze norming study, Grevásio PT, and Tucano BR. Bars reflect the standard error (SE).

2. Results

2.1. Cloze probability

2.1.1. Cloze norming study with participants

The average cloze probability of the sentences was .73 (range = .41 to 1, SD = .17), as depicted in Figure 1.

2.1.2. Cloze obtained through LLM

The average target word probability (see Figure 1) was .23 for Grevásio (range = 7.79E-08 to .99, SD = .32) and .24 for Tucano (range = 8.55E-05 to .85, SD = .19).

2.2. Data similarity analysis

Figure 1 illustrates the distribution of the data. To verify the similarity between the distributions obtained from human participants and from the LLMs, a distance analysis using the Jensen-Shannon method (Cha, 2007) was performed with the ‘philentropy’ package (Drost, 2018) in R (R Core Team, 2000). The results showed that the data from both Grevásio (p < .001) and Tucano (p < .001) were significantly different from those of the traditional cloze study with human participants.
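The paper ran this analysis in R with ‘philentropy’; purely as an illustration of the measure itself, here is a dependency-free Python sketch of the Jensen-Shannon divergence between two discrete distributions (applying it to cloze values would first require binning them into probability histograms).

```python
import math

def _kl(p, q):
    # Kullback-Leibler divergence in bits (log base 2); 0·log(0) := 0.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon(p, q):
    """Jensen-Shannon divergence between two discrete probability
    distributions; with log base 2 it is bounded in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)

print(jensen_shannon([0.5, 0.5, 0.0], [0.5, 0.5, 0.0]))  # 0.0 (identical)
print(jensen_shannon([1.0, 0.0], [0.0, 1.0]))            # 1.0 (disjoint)
```

Unlike plain KL divergence, the JS divergence is symmetric and always finite, which is why it is a common choice for comparing empirical distributions.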

2.3. Correlation analysis

To investigate whether there are correlations between the traditional cloze study and the LLMs, we conducted Spearman correlations (Wissler, 1905) in R (R Core Team, 2000). This non-parametric test assumes only a monotonic relationship, which is less restrictive than assuming linearity, and it measures the strength and direction of the observed correlation. Through this analysis, we found that Tucano performed better than Grevásio when compared with the traditional cloze study. We observed a significant but weak positive correlation (see Figure 2) between the word probabilities from the traditional cloze study and Grevásio, r(115) = .25, p = .007, 95% CI [.10, .45], and a moderate significant correlation with Tucano, r(115) = .36, p < .001, 95% CI [.19, .51]. These confidence intervals (CIs) were obtained through bootstrapping, a technique that estimates the uncertainty in the data through random resampling without assuming a normal distribution. Figure 3 shows that the correlation values for both Grevásio and Tucano fall within their CI ranges. Although the intervals overlap, Tucano performed slightly better than Grevásio, as the lower bound of its CI was higher.

Figure 2. Plots depicting the correlation between word probabilities from the cloze study: a) between participants’ data and Grevásio; b) between participants’ data and Tucano.

Figure 3. Bootstrap distribution of the Spearman correlation analysis. The dashed lines depict the 95% confidence interval (CI) for each model, blue for Grevásio and red for Tucano, while the solid lines depict the mean correlation of each model. For Grevásio, the CI is between .10 and .45; for Tucano, it is between .19 and .51.

3. Discussion

The purpose of this study was to assess whether LLMs can be used as an alternative to traditional cloze studies. Visually, Figure 1 shows large differences between the distribution of cloze probabilities obtained from human participants and those from the LLMs. This observation is confirmed by the similarity analysis, in which the cloze probabilities obtained through the LLMs are significantly different from the human participants' data. Despite that, the statistical results show significant correlations between them. Nonetheless, it should be noted that the correlations are not strong, as seen in Figure 2. This indicates that, for Portuguese, LLMs are not yet ready to be used as an alternative to traditional cloze studies with human participants.

The differences in mean probability between the human cloze probabilities and the LLMs, seen in Figure 1, and the weak to moderate correlations in Figure 2, are in line with the findings of Jacobs et al. (2024). They conducted four experiments comparing the cloze study of Peelle et al. (2020) with three LLMs: GPT-2 (Radford et al., 2019), RoBERTa (Liu et al., 2019), and Pythia (Biderman et al., 2023). In the first experiment, they compared probabilities from the cloze study with the model probabilities and found no linear relationship. In the second, they used the ranks of probable responses from both the cloze study and the LLMs, and the results again showed a weak correlation. In the third, they assessed whether model training affects the fit to human data. In the fourth, they conducted a clustering analysis to evaluate the semantic production of humans and LLMs. In short, their first two experiments focused on showing differences between human cloze probability and LLM probability.

LLMs calculate the probability of words in a sentence differently, which is why we observed differences between the human participant data and the LLM data in the present study. Günther and Cassani (2025) argued that LLMs use probabilities between tokens to predict upcoming words. Therefore, in a sentence-completion task like a cloze study, an LLM tends to prefer a frequent word that may be less grammatical over a less probable word that may be more grammatical (Katzir, 2023). For instance, Katzir (2023) showed that to continue a sentence such as “The little duck that met the horses with the blue spots yesterday __”, a GPT model preferred ‘are’ over ‘destroys’. Human participants, on the other hand, are able to produce grammatical sentences with a less probable word because they can make use of the sentential context. In the same vein, Cai et al. (2023) showed that LLMs do not really rely on context to resolve syntactic ambiguity in a sentence, relying more on sub-word tokens. Another possible reason for these weak correlations is the way the LLMs are fine-tuned, as they tend to perform well on tasks related to their fine-tuning focus (Denning et al., 2025).

Nevertheless, the weak to moderate correlations between human participants' cloze probabilities and the LLMs could also be interpreted as indicating that the models could be used as data complementary to human participants rather than as an alternative. Caution against using LLMs as an alternative also comes from a recent event-related potentials (ERP) study (Arkhipova et al., 2025). ERP is a common method in psycho- and neurolinguistic studies that investigates the linguistic information used by the brain during language comprehension.

Apart from that, using LLMs as an alternative to human data is not impossible in the future, but improvements are needed. For instance, additional training data and fine-tuning on Portuguese text are needed, specifically for the cloze task. Variation within Portuguese also needs to be taken into account; for instance, there are differences in word choice between Brazilian and European Portuguese. Taking this into account will increase the accuracy of LLMs.

4. Conclusion

All in all, this study is an early attempt to compare a traditional cloze study with human participants against Portuguese LLMs. Portuguese LLMs have not yet achieved the level of performance on traditional cloze tests seen in English. Nevertheless, they did show a similar trend in the correlation results: LLM probability increases for words with higher cloze probability, with particularly more stable and promising results from Tucano than from Grevásio. Hence, we recommend LLMs as data complementary to human participants' data rather than as a replacement, keeping in mind that LLMs compute word probabilities differently than humans. To enhance the human-like performance of LLMs, further studies and improvements are required.

Acknowledgments

The cloze data from this study were part of an EEG study (Aristia et al., submitted) conducted during the author's postdoctoral fellowship (funded by CICPSI, Faculdade de Psicologia, Universidade de Lisboa) at Prof. AP's lab (VoicES lab, Universidade de Lisboa), in collaboration with SF (CIS-Iscte). The author also thanks SNE for his advice on the comparison between human cloze data and LLMs; PZ, who proofread the manuscript and checked its grammar; A, who proofread and checked the Portuguese grammar of the resumo; and the reviewers for their comments, which improved this manuscript.

Additional Information

Conflict of Interest

The author declares that there are no competing interests.

Statement of Data Availability

The data, code, and materials that support this study are available at https://osf.io/f8jrv/.

AI Usage Statement

The author declares that ChatGPT was used, under the author's supervision, for information search, grammar checking, and rephrasing sentences for clarity. It was also used to assist with coding.

Ethics and Consent

Data from the cloze study involving human participants were part of an EEG study (Aristia et al., submitted) approved by the ethics committee of the Faculdade de Psicologia, Universidade de Lisboa.

References

Aristia, J., Frade, S., & Pinheiro, A. (submitted). Prediction by production in spoken sentence processing.

Arkhipova, Y., Lopopolo, A., Vasishth, S., & Rabovsky, M. (2025). When Meaning Matters Most: Rethinking Cloze Probability in N400 Research. bioRxiv, 2025-04.

Cai, Z. G., Duan, X., Haslett, D. A., Wang, S., & Pickering, M. J. (2023). Do large language models resemble humans in language use?. arXiv preprint arXiv:2303.08014.

Cha, S. H. (2007). Comprehensive survey on distance/similarity measures between probability density functions. City, 1(2), 1.

Corrêa, N. K. (2024). Dynamic normativity: Necessary and sufficient conditions for value alignment. arXiv preprint arXiv:2406.11039.

Corrêa, N. K., Sen, A., Falk, S., & Fatimah, S. (2024). Tucano: Advancing Neural Text Generation for Portuguese. arXiv preprint arXiv:2411.07854.

Denning, J. M., Snefjella, B., & Blank, I. A. (2025). Do Large Language Models know who did what to whom?. arXiv preprint arXiv:2504.16884.

Drost, H. G. (2018). Philentropy: information theory and distance quantification with R. Journal of Open Source Software, 3(26), 765.

Frade, S., Pinheiro, A. P., Santi, A., & Raposo, A. (2022). Is second best good enough? An EEG study on the effects of word expectancy in sentence comprehension. Language, Cognition and Neuroscience, 37(2), 209-223.

Günther, F., & Cassani, G. (2025). Large Language Models in psycholinguistic studies.

Hu, J., Floyd, S., Jouravlev, O., Fedorenko, E., & Gibson, E. (2022). A fine-grained comparison of pragmatic language understanding in humans and language models. arXiv preprint arXiv:2212.06801.

Hu, J., Mahowald, K., Lupyan, G., Ivanova, A., & Levy, R. (2024). Language models align with human judgments on key grammatical constructions. Proceedings of the National Academy of Sciences, 121(36), e2400917121.

Jacobs, C. L., Hubbard, R. J., & Federmeier, K. D. (2022, February). Masked language models directly encode linguistic uncertainty. In Proceedings of the Society for Computation in Linguistics 2022 (pp. 225-228).

Jacobs, C. L., Grobol, L., & Tsang, A. (2024). Large-scale cloze evaluation reveals that token prediction tasks are neither lexically nor semantically aligned. arXiv preprint arXiv:2410.12057.

Katzir, R. (2023). Why large language models are poor theories of human linguistic cognition: A reply to Piantadosi. Biolinguistics, 17, 1-12.

Leivada, E., Dentella, V., & Günther, F. (2024). Evaluating the language abilities of Large Language Models vs. humans: Three caveats. Biolinguistics, 18, 1-12.

Lopes Rego, A. T., Snell, J., & Meeter, M. (2024). Language models outperform cloze predictability in a cognitive model of reading. PLOS Computational Biology, 20(9), e1012117.

Mitra, A., Khanpour, H., Rosset, C., & Awadallah, A. (2024). Orca-math: Unlocking the potential of slms in grade school math. arXiv preprint arXiv:2402.14830.

Peelle, J. E., Miller, R. L., Rogers, C. S., Spehar, B., Sommers, M. S., & Van Engen, K. J. (2020). Completion norms for 3085 English sentence contexts. Behavior Research Methods, 52, 1795-1799.

R Core Team. (2000). R language definition. Vienna, Austria: R foundation for statistical computing, 3(1), 116.

Santos, R., Silva, J., Gomes, L., Rodrigues, J., & Branco, A. (2024). Advancing Generative AI for Portuguese with Open Decoder Gervásio PT. arXiv preprint arXiv:2402.18766.

Snell, J., van Leipsig, S., Grainger, J., & Meeter, M. (2018). OB1-reader: A model of word recognition and eye movements in text reading. Psychological review, 125(6), 969.

Wissler, C. (1905). The Spearman correlation formula. Science, 22(558), 309-311.

Review

DOI: https://doi.org/10.25189/2675-4916.2027.V7.N2.ID865.R

Editorial Decision

EDITOR: Sandro Marcío Drumond Alves Marengo

ORCID: https://orcid.org/0000-0003-4658-004X

AFFILIATION: Universidade Federal de Sergipe, Sergipe, Brasil.

-

ASSESSMENT: The article “Word predictability in Portuguese: Cloze norming study vs. LLMs” investigates the extent to which large language models can approximate human behavior in predicting words in context. The focus is on European Portuguese, a language with fewer computational resources than English, in which most comparative studies have been conducted. The starting point is the cloze test, a well-established psycholinguistic methodology for measuring lexical predictability from the responses of human participants. Building such norms demands time, extensive data collection, and careful validation, which motivates the search for faster alternatives. In this scenario, language models emerge as a promising possibility, since they produce word-occurrence probabilities at scale. Studies in English suggest that these probabilities can approximate human patterns and even anticipate reading tendencies. However, doubt remained about the validity of this correspondence in other languages, especially in contexts with distinct grammatical structures and less mature models.

To address this question, the study compares the probabilities obtained in cloze tests with 125 native speakers of European Portuguese to the probabilities generated by two models trained for Portuguese: Grevásio and Tucano. The two models make it possible to observe the effects of distinct training trajectories. Grevásio derives from a LLaMA architecture fine-tuned on data translated into Portuguese, whereas Tucano was trained from the start in Portuguese and then tuned to follow instructions. This difference makes it possible to evaluate the extent to which the type of training influences the approximation to human responses.

The methodology combines two analyses. The first uses Jensen-Shannon divergence to verify the degree of similarity between the probability distributions of humans and models. The second applies Spearman correlation to analyze whether the ordering of words by predictability is similar across the two sources. The results reveal a complex but informative picture. In the divergence analysis, the distributions are not equivalent. The mean human probability was high (0.73), indicating strong convergence of responses, whereas the models showed much lower means (0.23 and 0.24), with large variation. This suggests that the processes involved are not the same: humans appear to mobilize more integrated semantic and pragmatic constraints, whereas the models operate in a broader lexical space with less contextual restriction.

On the other hand, the correlation analysis shows some approximation. The correlations were positive and significant, though at different levels: lower for Grevásio (r = 0.25) and more consistent for Tucano (r = 0.36). This indicates that, even with discrepancies in absolute values, there is a common tendency: contexts that lead humans to predict certain words more frequently also tend to receive higher probabilities from the models. Tucano's superior performance suggests that training entirely in the target language contributes to greater alignment with human patterns.

For psycholinguistics, the results show that, at present, language models for Portuguese do not replace human cloze norms, since the distributions differ in relevant ways. Even so, the existence of some correlation points to a possible complementary use: models can assist in the exploratory stage of hypothesis generation or initial stimulus selection, provided they are validated with real participants.

For computational linguistics, the findings serve as both diagnosis and guidance. They show that results obtained in English do not transfer automatically to other languages and reinforce the importance of training specifically in the language under study. They also point to future directions, such as using human normative data in model fine-tuning and taking internal variation of Portuguese into account.

Taken together, the article offers an evidence-based assessment of the potential and limits of language models in European Portuguese, contributing both to methodological decisions in psycholinguistics and to the responsible development of these technologies. I am therefore in favor of publication.

Rounds of Review

REVIEWER 1: Túlio Sousa de Gois

ORCID: https://orcid.org/0009-0000-5270-8033

AFFILIATION: Universidade Federal de Sergipe, Sergipe, Brasil.

-

REVIEWER 2: Daniela Cid de Garcia

ORCID: https://orcid.org/0000-0003-2134-1069

AFFILIATION: Universidade Federal do Rio de Janeiro, Rio de Janeiro, Brasil.

-

REVIEWER 3: Elisângela Nogueira Teixeira

ORCID: https://orcid.org/0000-0003-3924-3985

AFFILIATION: Universidade Federal do Ceará, Ceará, Brasil.

-

ROUND 1

REVIEWER 1

2025-07-23 | 05:07 PM

The paper investigates the comparability between cloze probability generated by human participants and that generated by two LLMs trained on Portuguese data (Grevásio and Tucano). The cloze task is used to measure the predictability of a word in a given sentential context. The correlation analysis indicates that while a statistically significant relationship exists, the models' performance still differs from human performance. The work is relevant to the fields of Psycholinguistics and Computational Linguistics.

This paper assesses the feasibility of using Large Language Models (LLMs) as an alternative to human-based cloze norming studies for Portuguese. The authors compare the results of a traditional cloze test with the probabilities generated by two models, Grevásio and Tucano. The main contribution of the paper is its empirical analysis, which demonstrates, through weak to moderate correlations, the performance gap that still exists between the tested LLMs and human data.

One point that requires the reader's attention is the manuscript's structure. The quantitative results of the research (means, correlations, and p-values) are presented within the Methodology section, which may make it difficult to distinguish between the experimental design and the study's findings. Additionally, the conclusion that LLMs need improvement to reach human-level performance is a broad inference that should be interpreted with caution, given that it is based on the specific results of the models and the task presented here. Despite these structural issues, the study achieves its objective of comparing the probabilities and presents important data for the community.

-

REVIEWER 2

2025-07-14 | 12:07 PM

This is a valuable contribution to the ongoing debate on the role of computational models in linguistic research. Its originality, methodological clarity, and critical discussion make it suitable for publication, pending minor revisions regarding form and depth in a few areas. Accept with minor revisions.

This article investigates the correspondence between traditional cloze tasks with human participants and predictions made by large language models (LLMs) in European Portuguese. The topic is timely and highly relevant to the fields of psycholinguistics, computational linguistics, and cognitive science.

-

ROUND 2

REVIEWER 1

2025-11-10 | 10:14 AM

This paper assesses the comparability between cloze probability generated by human participants and that generated by two LLMs (Grevásio and Tucano) for Portuguese. The author compares the results of a traditional cloze test with the probabilities generated by the models, using correlation analysis to measure the relationship between the different data sources. The manuscript has been significantly revised and improved from the previous version. The paper's structure has been reconfigured, notably through the creation of a dedicated "Results" section, which resolved the prior overlap with the Methodology section and enhanced the clarity of the presentation. Furthermore, the argumentation has been strengthened by the inclusion of new statistical analyses (such as the distribution similarity analysis) and the addition of relevant recent work, which better situates the findings within the psycholinguistic context.

-

REVIEWER 3

2026-01-18 | 02:25 PM

This study provides the first comparison between human sentence-completion behavior (cloze norming study) and predictions generated by Portuguese language models (LLM). The results are well supported and indicate that such models can serve as a complementary tool, though not a full substitute for humans, which is relevant for researchers working on psycholinguistics, language processing, computational linguistics, and future applications of artificial intelligence.

The manuscript addresses an important and timely question regarding whether probabilistic predictions generated by large language models can approximate or replace human behavior in cloze tasks. This is an original contribution in the context of Portuguese, where few empirical comparisons of this kind have been conducted. The statistical analyses are transparent and clearly presented, and the author adopts a responsible position by framing language models as potential complements rather than full substitutes for human participants. The work thus has the potential to encourage further research at the intersection of language modeling and psycholinguistics.

At the same time, the manuscript opens space for a broader discussion about the conceptual and methodological landscape in which such comparisons occur. One central point concerns sociolinguistic and ecological asymmetries between the environments from which data are drawn. Human participants form a relatively homogeneous speech community, sharing linguistic variety, cultural expectations, lexical exposure, and pragmatic conventions. Language models, by contrast, aggregate data from highly heterogeneous corpora spanning multiple geographic regions, genres, registers, and historical periods. Divergences between human and model predictions may thus arise not solely from limitations of current artificial systems but also from the broader ecological diversity present in the corpora that train them.

A related point concerns variation within Portuguese itself. The manuscript notes that Tucano was trained on both European and Brazilian Portuguese, and that Grevásio uses automatically translated European Portuguese datasets, but the consequences of these differences are not explored in depth. Variation affects lexical frequency, discourse expectations, morphosyntax, information structure, and the distribution of articles and other functional items—all categories relevant to this study. Since the cloze task here involves nouns and articles, which are particularly sensitive to sociolinguistic variation, it would be informative to consider how such variation may influence model performance.

The comparison also raises interesting questions about task equivalence. Human participants generate a continuation, whereas the models estimate the likelihood of a pre-specified target word. These are not computationally or cognitively identical tasks: one involves production, prediction, and intersubjective inference; the other involves recognition and probability estimation. This distinction does not detract from the value of the empirical comparison, but acknowledging it can help readers interpret the observed correlations.

Differences in genre and register also deserve mention. Human cloze responses are grounded in expectations emerging from everyday discourse, whereas training corpora for language models tend to feature written, technical, and media-oriented texts with distinct frequency profiles and stylistic constraints. This, too, may contribute to divergence between humans and models.

Finally, the manuscript adopts human cloze performance as the implicit “gold standard.” In doing so, it aligns with most work in the field, but a brief discussion of how this standard is defined and what dimension of linguistic behavior it is intended to capture would further clarify the theoretical framing. It would also open a productive avenue for reflection on whether the relevant point of comparison in future work should be human expectation, corpus-based frequency, or some interplay between the two.

Overall, the manuscript provides a valuable starting point for these discussions, and its publication may stimulate further debate on how best to integrate language models into experimental and theoretical research on Portuguese and other languages.

Author Reply

DOI: https://doi.org/10.25189/2675-4916.2026.V7.N2.ID865.A

-

ROUND 1

2025-10-10

Dear Reviewers,

Thank you very much for your feedback, I greatly appreciate them as they improve the clarity of the manuscript.

Hence, I tried to adjust the manuscript following your feedback. I have dedicated a separate section for the methods and results. I have also added details about the human data, such as the number of participants and their age range. Regarding LLM, I have added a link to the Tucano model that was used in my study.

In terms of statistical analysis, I have added reference for the methods that are used (e.g., Spearman correlation). I have also added statistical analysis such as data similarity to analyze the similarity between human data and LLMs data.

In terms of language and styling, I reworded and rephrased sentences to improve the clarity of the text. Another change I made concerns my affiliation: as I have moved institutions, I updated it to the current one. Should you have more comments or require more information, please do not hesitate to contact me.

How to Cite

ARISTIA, J. Word Predictability in Portuguese: Cloze Norming Study vs. LLMs. Cadernos de Linguística, Campinas, SP, Brasil, v. 7, n. 2, p. e865, 2026. DOI: 10.25189/2675-4916.2026.v7.n2.id865. Available at: https://cadernos.abralin.org/index.php/cadernos/article/view/865. Accessed: 25 Mar. 2026.


Copyright

© All Rights Reserved to the Authors
