Experience Report

Building Carolina: Metadata for Provenance and Typology in a Corpus of Contemporary Brazilian Portuguese

Marcelo Finger

University of São Paulo

https://orcid.org/0000-0002-1391-1175

Maria Clara Paixão de Sousa

University of São Paulo

https://orcid.org/0000-0002-8422-417X

Cristiane Namiuti

State University of Southwestern Bahia

https://orcid.org/0000-0002-1451-8391

Vanessa Martins do Monte

University of São Paulo

https://orcid.org/0000-0002-4929-5298

Aline Silva Costa

Federal Institute of Education, Science and Technology of Bahia

https://orcid.org/0000-0003-1434-3242

Felipe Ribas Serras

University of São Paulo

https://orcid.org/0000-0003-1683-167X

Mariana Lourenço Sturzeneker

University of São Paulo

https://orcid.org/0000-0002-0878-3463

Miguel de Mello Carpi

University of São Paulo

https://orcid.org/0009-0007-5877-6716

Mayara Feliciano Palma

University of São Paulo

https://orcid.org/0000-0002-2869-4484

Gabriela Alves Lachi

University of São Paulo

https://orcid.org/0009-0005-3020-4543


Keywords

Brazilian Portuguese
Open corpus
WaC
Typology
Provenance
WaC-wiPT

Abstract

This paper presents the challenges of building Carolina, a large open corpus of Brazilian Portuguese texts developed since 2020 using the Web as Corpus methodology enhanced with concerns about provenance and typology (WaC-wiPT). The corpus aims to serve both as a reliable source for research in Linguistics and as an important resource for Computer Science research on language models. Above all, this endeavor aims at removing Portuguese from the set of “low-resource languages”. This paper details the construction methodology of Carolina, with special attention to the issue of describing provenance and typology according to international standards, while briefly describing its relationship with other existing corpora, its current state of development, and its future directions.

Lay Summary

The Carolina corpus is a large collection of texts written in Brazilian Portuguese between 1970 and 2024. It has been under construction since 2020 and aims to serve both as a reliable source for research in Linguistics and as an important resource for Computer Science research on language models. This article presents the challenges of building this corpus, emphasizing the construction methodology, with special attention to the problem of describing the types of texts and their origin.

Introduction

Carolina is an open corpus for Linguistics and Artificial Intelligence, with a robust and unprecedented volume of texts of varied typology in Brazilian Portuguese. The current version, Carolina 1.3 Ada, comprises 802 million tokens in 2 million texts, totaling more than 11 GB. All the texts were originally written in Brazilian Portuguese between 1970 and 2024, and are available for free, open-access download at https://sites.usp.br/corpuscarolina. Carolina was conceived and is currently being developed by a multidisciplinary research team at the Digital Humanities Virtual Lab (‘Laboratório Virtual de Humanidades Digitais’, LaViHD) as part of the Natural Language Processing of Portuguese (NLP2) Project of the Center for Artificial Intelligence (C4AI) of the University of São Paulo (USP).

C4AI-USP endeavors to produce advanced research in Artificial Intelligence in Brazil, disseminate and debate the main results, train students and professionals, and transfer technology to society. The NLP2 Project, one of C4AI’s challenges, seeks to develop systems that advance the state of the art of Natural Language Processing (NLP) for Brazilian Portuguese, targeting a new level of quality and performance compared to existing solutions. In this process, the Center aims to create opportunities for developing state-of-the-art language models and to distance Portuguese from the group of languages with “low NLP resources”. With this aim, C4AI-USP, via the NLP2 Project, is currently building several Brazilian Portuguese corpora, including CORAA, the Corpus of Annotated Audios of spoken Portuguese, and the Portinari annotated corpus of Portuguese. Carolina is C4AI’s “mother ship” corpus and will incorporate CORAA’s audio transcripts, Portinari’s raw unlabelled texts, and other corpora in the future.

The corpus team, composed of computer science, linguistics, and philology researchers, has worked to develop a methodology for building corpora that can be used in a variety of ways, complying with rigorous data-control criteria regarding origin/provenance and typology, a fundamental requirement for computer science and linguistic research, among other fields. We aim to build a robust resource with state-of-the-art features for research both in Artificial Intelligence and in Linguistics, focusing on the importance of provenance and a rich typology of information as fundamental assets in modern data availability.

Carolina was named in honor of Carolina Michaëlis de Vasconcelos (1851-1925), a German philologist and linguist based in Portugal, and the first woman to be named a professor at the Faculty of Letters of the University of Lisbon, in 1911[1]. This tribute symbolizes the aims of our team: to advance knowledge of the Portuguese language and its history, and to encourage scientific research by women[2].

The aim of this paper is to present the challenges of building Carolina. The information presented here should be useful for creating future datasets, and it is also an invitation to use the corpus for research in Linguistics and other areas. For this purpose, section 1 presents the foundations on which this construction was based; section 2 shows how and why we developed a new methodology to build a giant corpus, highlighting the problems involved in the “Web as Corpus” idea; section 3 presents the current stage of the project; and section 4 concludes the paper with some final considerations and an indication of future steps.

1. Fundamentals

Since long before the emergence of the digital world, humanity has developed means and techniques to organize, locate, and retrieve documents and information. Knowledge of a text's source is directly related to the trustworthiness of its content. Thus, the provenance and typology of documents are among the essential information for research in the Humanities and for data reliability in Computer Science, especially in the construction of large collections of documents that store knowledge in a recoverable, searchable, and accessible way.

Linguistics research has largely benefited from digital technology, given that automation in the processing of large volumes of data strongly supports formulating hypotheses about grammars. In addition, the reliability of linguistic studies has been enhanced by the development of scientific techniques and methods for annotation and data control from the sources of a natural language corpus. Paixão de Sousa (2014) grounds linguistic studies based on electronic corpora in a global approach to the text, in conceptual and technological terms, which is reflected in an interaction between different levels of analysis. Based on this global approach, Carolina has the potential to contribute to the development of research on Brazilian Portuguese, since it is being built with reliability guarantees, assured by the provenance control provided by structured metadata.

According to Santos and Namiuti (2019), scientific metadata control, such as information about the provenance and typology of documents, constitutes essential information for research in the Humanities and for data reliability in Computer Science; in addition, it serves other areas, such as History and Social Memory. To this end, the authors advocate a structured metadata apparatus (AME, ‘Aparato de metadados estruturados’) as a solution for reliability.

In Computer Science, NLP research has been dominated in recent years by a succession of language models, that is, machine-learning architectures mostly based on neural networks. Starting from sequence-to-sequence encoder-decoder configurations (Kalchbrenner; Blunsom, 2013), the field incorporated neural attention (Bahdanau et al., 2014), leading to the attention-only Transformer architecture (Vaswani et al., 2017). Different ways of assembling and training transformers have led to a multitude of very successful language models, such as BERT (Devlin et al., 2019) and its derivatives (Liu et al., 2019; Sanh et al., 2019; Lan et al., 2019) for text classification, GPT (Radford et al., 2018) for text generation, and T5 (Raffel et al., 2019) for machine translation. In such a rapidly changing environment, in which today’s best model is condemned to short-term obsolescence, one must put forward a language-model training pipeline so as to be ready to generate “the next” proposed model. And the fundamental raw material for this production pipeline is a large, open, and reliable general corpus such as ours.

Carolina was conceived within the Web as Corpus view (Baroni et al., 2009), extended with provenance and typology information, which we call the WaC-wiPT view[1]. The Web as Corpus (WaC) view of corpus building (Fletcher, 2007) has been dominant in recent developments in linguistic resource building, but future applications may require a more cautious approach to data collecting. Data provenance refers to the process of tracing and recording the origins of data and, thus, its movement. It allows one to answer questions such as “where did this piece of text come from?”, and “is it a part or the whole of one document?”. Therefore, if a future application reveals that a corpus may carry some open or hidden biases, provenance is the mechanism that allows us to trace back the origins of the bias. It is also important to know if the data was transferred in total, or in part, and the size of the part. Data provenance provides the audit trail of the data and thus it is a source of reliability on data and on applications derived from it. Additionally, this work understands typology in a broad sense, as free from theoretical commitment as possible, and as a crucial methodological tool in the development of such a large collection of texts, organizing the search, selection and balancing of texts, as will be shown in the Methodology section.
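
To make the notion concrete, the sketch below (in Python, with hypothetical field names; it is not the project's actual data model) shows the kind of record a provenance audit trail can be built from: the source of a text, when it was retrieved, whether it is a whole document or only a part, and the size of that part.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ProvenanceRecord:
    """Hypothetical provenance entry for one text in a web-derived corpus."""
    source_url: str       # where did this piece of text come from?
    retrieved_on: date    # when the snapshot was taken
    is_complete: bool     # whole document, or only a part of it?
    extracted_chars: int  # size of the portion kept in the corpus
    source_chars: int     # size of the original document, when known
    license: str          # terms under which the source was published

record = ProvenanceRecord(
    source_url="https://example.gov.br/some/page.html",
    retrieved_on=date(2021, 6, 1),
    is_complete=False,
    extracted_chars=4_210,
    source_chars=9_876,
    license="CC-BY-4.0",
)
# An audit trail is simply a collection of such records: given any text,
# we can trace it back to its origin and to the fraction that was kept.
print(f"{record.extracted_chars / record.source_chars:.0%} of source retained")
```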

Providing an open, large and diverse corpus for Brazilian Portuguese, with provenance and typology information has the potential of directly impacting research both on Linguistics and Computer Science. This is the intended goal of this work. We hope that provenance and typology information will be helpful to researchers. Control of information regarding digital documents produced or posted on the Web is necessary to meet the potential uses of a large collection of documents. This control also makes it possible to cater for a very wide range of research areas of interest, such as Social Memory and History, as well as Linguistics and Computing.

1.1. Related Works

Given the widespread availability of online content in the last decades, many researchers turned to the Web as their main source for corpus building. Examples of corpora that were built using the Web as a source are the Terabyte corpus (Clarke et al., 2002) (53 billion tokens) and the Web Text Corpus (Liu; Curran, 2006) (10 billion tokens), both built using web-crawlers. The Terabyte Corpus targets the English language and is formed by HTML content obtained in mid-2001 from a set of URLs of the main sites of 2,392 universities and other educational organizations. The Web Text Corpus is also an English-language corpus composed of a collection of texts on various subjects. Unlike most corpora created for NLP use, this corpus employs a linguistic search process instead of the traditional use of web search engines, which are based on scores. The general objective of the corpus was to measure the accuracy of NLP-learning software using it in comparison to training using other corpora (Liu; Curran, 2006).

The TenTen Corpus Family is an initiative by Sketch Engine[1] for building Web corpora for all major languages in the world (Jakubíček et al., 2013). The TenTen corpora were also created using web-crawling techniques, presenting texts validated under exclusively linguistic criteria, using specialized technology for this purpose. The very name of the family (TenTen, 10¹⁰) indicates the large size of each corpus that composes it, starting from a minimum size of 10 billion words for each language[2].

In this context, several large corpora were built adopting the WaCky (Web-As-Corpus Kool Yinitiative) methodology, following the emergence of the first WaCky corpora: the ukWaC, deWaC, itWaC (Bernardini et al., 2006; Baroni et al., 2009), and the frWaC (Ferraresi et al., 2010), which target English, German, Italian, and French respectively, each including more than 1 billion words. This methodology comprises four steps: identification of different sets of seed URLs, post-crawl cleaning, removal of duplicate content, and annotation. One of the corpora built following this framework is the Brazilian Portuguese Web as Corpus (brWaC), of great relevance as it was already considered the “biggest Brazilian Portuguese corpus available” during its construction (Boos et al., 2014)[1].

The use of the web as a data source for corpus construction, made possible by the WaC methodology, called for a progressive redefinition of the traditional concept of a corpus — see the differences in corpus definitions between Sardinha (2000) and Kilgarriff and Grefenstette (2003) — while providing linguistic research with direct access to large-scale collections of real-world language samples. This, in turn, facilitated the detection of highly uncommon linguistic patterns at levels of representativeness previously unattainable (Kilgarriff; Grefenstette, 2003).

These advances, however, have not been without challenges. Critics of the approach highlight the high incidence of typographical and extraction errors, the differences in distribution between web-based language and that used in other contexts, and the challenges of defining and capturing the typological distribution of virtual language (Kilgarriff; Grefenstette, 2003).

It is also worth noting that, in its early days, the web was perceived as an easily accessible, copyright-free repository of language, in contrast to the professionally edited sources previously available for corpus construction (Kilgarriff; Grefenstette, 2003). At the present time, however, debates over the right to build datasets from web content, particularly at a time when such resources are used to train large proprietary language models, are at the heart of ethical and legal controversies. Carolina, with its collection policy grounded in license compliance and provenance, has been developed with the explicit aim of addressing this issue.

Regarding other existing Portuguese-language corpora, the virtual organization Linguateca (Santos, 2000) stands out as a center for resources focused on the computational processing of this language. Its objective was to contribute to the development of new computational and linguistic resources, facilitating the access of new researchers to existing tools. Of the corpora available at Linguateca that specifically target Brazilian Portuguese, the ones that stand out as the most significant in size are: Brazilian Corpus (Sardinha et al., 2010), Lácio-Web (Aluísio et al., 2003), and Corpus do Português, subcorpora NOW and Web/Dialects (Davies; Ferreira, 2016; Davies; Ferreira, 2018).

The Brazilian Corpus has approximately one billion words, syntactically annotated with the parser PALAVRAS (Bick, 2000).[1] Lácio-Web was developed by USP as a project whose objective is to make fully and freely available its linguistic and computational tools as well as several corpora of contemporary Brazilian Portuguese. This set of corpora prioritizes whole-content texts and a variety of genres, text typologies, domains of knowledge, and means of distribution (Pinheiro; Aluísio, 2003; Aluísio et al., 2004).[2] Corpus do Português was developed by Brigham Young University and Georgetown University. The NOW (News on the Web) subcorpus has approximately 1.1 billion words of four different Portuguese varieties (Angola, Brazil, Mozambique, and Portugal), gathered from daily searches of articles in magazines and newspapers through Google News between 2012 and 2019. It is not possible to easily retrieve the source and copyright information of the texts, nor to know how much of the data refers specifically to Brazilian Portuguese. The subcorpus Web/Dialects, in turn, has approximately one billion words of the same four Portuguese varieties, of which 656 million words are in Brazilian Portuguese, mainly extracted from Blog-type sites (Davies; Ferreira, 2016)[3].

2. Building a Methodology

The construction of a billion-token corpus requires a considerable amount of preparation and coordination. First of all, we had to define the metadata scheme, which was adjusted after the initial surveys and tests. The goals of the corpus must remain clear at all times, and a mechanism for tracing sources, completion levels, and data balancing must be followed diligently. Such an endeavor required three important stages: a detailed analysis of existing resources, the development of a methodological framework adhering to our goals, and the development of techniques for post-processing. The main methodological decisions in this process are described in sections 2.1 and 2.2 below, with special attention to aspects related to Provenance and Typology, and the processing stages are presented briefly in sections 2.3 and 2.4 further on.

2.1. The issue of Provenance

Initially, significant effort was made to analyze the pre-existing resources for natural language processing in Brazilian Portuguese, with the aim of supporting the development of our methodology and exploring the possibility of incorporating some of these resources into Carolina. That enabled us to assess the benefits and drawbacks of their methodologies, as well as which niches of contemporary text were already corpus-indexed, and which were still fertile sources for us. In doing so, we decided on a web-based corpus, but against the adoption of the WaCky framework.

Although the WaCky method claims to facilitate automatic, unbiased balancing of content, and although brWaC reduced limited-relevance and duplicate content in comparison with other WaCky corpora (Wagner et al., 2018), the methodology presents some drawbacks. As their creators acknowledge, automated methods offer limited control over the content included in the final corpus, thus necessitating post-hoc investigation (Baroni et al., 2009). For example, the brWaC researchers only provide the annotated categories of the 100 most frequent websites (Wagner et al., 2018), and, unlike the other WaCky corpora mentioned in the previous section, the lists of bigrams, seeds, and total URLs used for the brWaC construction are not easily accessible.

These challenges regarding content quality and provenance tracking, as well as rights of use, are central issues in Carolina’s goals, and our methodology was developed around avoiding such problems. In line with Paixão de Sousa (2014) and Santos and Namiuti (2019), we consider that scientific control of the memory processes of building a corpus, of the memory of texts, and of the set of essential metadata for controlling provenance and guaranteeing document reliability is essential for research both in the Humanities and in Computer Science. Furthermore, as we intend to openly distribute the contents of the corpus online, under terms akin to the Apache license and similarly permissive ones, assuring data provenance beforehand is crucial to determine the original terms of use of the content.

For instance, Davies and Ferreira (2018) recognize that the texts used in Corpus do Português may be copyrighted and, for this reason, they rely on American fair use law, which states that copyrighted texts can be used freely as long as their format is transformed and no impact on their potential market by their legal holders is foreseen (Stim, 2016a; Stim, 2016b). Thus, to avoid copyright-related problems, for every 200 words of text, 10 words were replaced by “@”, excluding 5% of the original text. The authors claim that, as the words were removed regardless of context, all words would be affected equally, so frequency and usage counts would not be affected and the corpus would still be suitable for linguistic studies[1].
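
For illustration, the following Python sketch reproduces the removal scheme as described above (10 randomly chosen words replaced by “@” in every block of 200); it is our reading of the published description, not code from the corpus in question.

```python
import random

def mask_for_fair_use(text: str, block: int = 200, removals: int = 10,
                      seed: int | None = None) -> str:
    """In every block of 200 words, replace 10 randomly chosen words
    with '@', removing 5% of the text regardless of context."""
    rng = random.Random(seed)
    words = text.split()
    for start in range(0, len(words), block):
        indices = list(range(start, min(start + block, len(words))))
        for i in rng.sample(indices, k=min(removals, len(indices))):
            words[i] = "@"
    return " ".join(words)

# Example: roughly 5% of the words come back masked.
masked = mask_for_fair_use("palavra " * 400, seed=42)
print(masked.split().count("@"), "of", 400, "words masked")
```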

However, depending on local legislation and on the purpose of the collected material, corpora might still violate the law, even when publishing only fragmented or highly processed versions of texts that include copyrighted content. In addition, when crawling based on seed URLs and search engines, there is no control over the copyright status of texts. According to Cardoso (2007), Brazilian law acknowledges copyright establishment at the exact moment an intellectual work is created, without the need for further legal requests or paperwork. Therefore, being published and openly accessible online is no waiver of copyright limitations. For this reason, while building Carolina we avoided collecting random samples from the web, to ensure both the openness of the information crawled and compliance with Brazil’s recently enforced personal data protection law, the LGPD (‘Lei Geral de Proteção de Dados’)[1].

As for the possibility of incorporating pre-existing corpora into Carolina, there are some obstacles to be considered. Firstly, many corpora listed at Linguateca were discarded from our list for not fitting into our date range or for having content that may carry reproduction and distribution restrictions due to possible copyright over the texts or over the corpus itself. Many corpora that took copyright limitations into consideration chose to work solely with fragments or excerpts of the original texts, choosing greater ease of access to the text at the expense of the completeness of its content, as is the case of Corpus do Português. In the construction of the corpus, we prioritize the use of integral or minimally modified texts, as we understand that fragmentary content can be detrimental to inter-phrase or inter-text associations both in linguistic studies and in software development for Natural Language Processing, where the most recent approaches, such as attention-based algorithms, require processing text sentences in their entirety (Devlin et al., 2019). Another obstacle to corpus incorporation is that a large part of the datasets and smaller corpora gathered at Linguateca use the European variety of the Portuguese language, or more than one variety. This constitutes a limitation to their incorporation, considering that Carolina intends to be an open corpus of contemporary Brazilian Portuguese.

Thus, we concluded that it would be more productive not to include the large existing corpora of Brazilian Portuguese in Carolina, but rather use them as theoretical guidance and control parameters for the development of a new methodology based on provenance, typology and free distribution of the texts: the WaC-wiPT.

However, since Carolina is C4AI’s “mother ship” corpus, we may incorporate additional smaller corpora in the future, expanding upon those already included under our broad type Datasets and Other Corpora. We are mainly interested in corpora whose content is unique or not easily independently recoverable, such as corpora of transcribed spontaneous speech, like those developed by Project TaRSila[1] at C4AI, already described in Santos et al. (2022). We believe that these unique-content corpora will be important sources to guarantee a greater representation of dialectal and typological varieties in the corpus.

2.2. Defining typologies

Having determined the objectives and philosophy for the construction of Carolina, we aimed to build it with reliability guarantees ensured by provenance control through structured metadata. To achieve this, we focused on conducting surveys by broad typologies, which we defined as a way to group related web domains with similar content. After defining a typology methodology, we started the downloading step, followed by a preprocessing phase and, finally, we proposed the categorization of metadata headers and the metadata scheme.

The surveys started from a broad typology (Figure 1), divided into seven types, as detailed below. The seven broad types first defined were categories that allowed us to group all the domains researched up to that point: judicial branch of government; legislative branch of government; datasets and other corpora; public domain works; wikis; university domains; and journalistic texts. As we chose the sources to be surveyed for the broad typology, we gave priority to those with open data and a large volume of files, since the process of requesting rights of use would only take place in later stages of the project. Therefore, sources with copyright-protected data (for instance, the journalistic texts) were not prioritized at first (Crespo et al., 2022).

Figure 1. Surveys by broad typology

The surveys consist of in-depth research into each broad type chosen for the construction of the corpus and investigation of the web domains that comprise them. Thus, we surveyed information about the license of the texts and the basic directory structure of the investigated sites, as well as authorship, date, and other information deemed relevant for each broad type. All of this collected data was of great importance for the download process and has been essential for the insertion and revision of the predefined metadata. The surveys are therefore continuously ongoing, as research must be conducted or supplemented for each new web domain we wish to incorporate into Carolina.

In addition, given that throughout the surveying process we came across various types – often within a single web domain –, we defined a narrow typology, formed by subdivisions of the broad typology that take into account the structural similarity of the extracted files. We thus distinguish between broad and narrow typology: the former is an initial grouping of web domains by similar content, while the latter is a more detailed label for the types of texts found in each section or file of a surveyed web domain. Narrow typology was also included in most surveys and is a metadata category.

2.3. The downloading and preprocessing stages

After the initial survey by broad typology, the downloading and preprocessing stages begin. It is important to note that these procedures, which could be called the ‘final’ stage of the corpus construction, rely heavily on our principles of Provenance and Typology. As discussed in section 2.1, text provenance is the baseline criterion for a text to be selected for the corpus; and as shown in section 2.2, the broad typologies of the texts are the guidelines over which the building process begins. The downloading and preprocessing stages described here were the basis for the production of versions Carolina 1.0 Ada (2021), Carolina 1.1 Ada (2022), Carolina 1.2 Ada (2023) and Carolina 1.3 Ada (2024). Carolina 2.0 Bea is being prepared for publication in 2025.

The files were mostly obtained through Wget, the software chosen for this process. As the Raw Corpus[1] (which precedes text preprocessing) aims at safekeeping the entirety of the selected web domains, thus avoiding any future problems in case they are partially or completely removed from the Internet, the mirror command was used. This command crawls entire web pages, with infinite recursion inside a web domain, creating by default a mirror of its directory structure, in line with our intention of archiving a copy of most of the sources used in the corpus.
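
A minimal sketch of this step, assuming Wget is installed and using a hypothetical target URL; `--mirror` is Wget's shorthand for recursive, infinite-depth crawling with timestamping, and the remaining flags are common courtesy options rather than the project's exact invocation.

```python
import subprocess

def mirror_domain(url: str, dest_dir: str) -> None:
    """Mirror an entire web domain into dest_dir, preserving its
    directory structure, as in the Raw Corpus archiving step above."""
    subprocess.run(
        [
            "wget",
            "--mirror",            # recursive, infinite depth, timestamping
            "--no-parent",         # stay inside the chosen section
            "--adjust-extension",  # save HTML pages with .html suffixes
            "--wait=1",            # be polite to the server
            "--directory-prefix", dest_dir,
            url,
        ],
        check=True,
    )

# Hypothetical example: archive one domain into the Raw Corpus tree.
mirror_domain("https://example.gov.br/", "raw_corpus/example.gov.br")
```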

The detailed inspection of each type in the broad typology facilitated the process of downloading the files. Accordingly, in some cases, pages whose content was irrelevant or outside the proposed frame were ignored, such as public domain works[1] published prior to 1970. In those cases, the files were assessed one by one and downloaded with Wget, by feeding it a URL list in a TXT file.
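
For these hand-curated cases the invocation is simpler, complementing the mirror approach above; a sketch, where `urls.txt` stands for the hypothetical URL list produced during the survey.

```python
import subprocess

# Feed Wget a curated URL list (one URL per line), as described above.
subprocess.run(
    ["wget", "--input-file", "urls.txt",
     "--directory-prefix", "raw_corpus/selected"],
    check=True,
)
```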

The filtering of texts was included in the process of building each corpus version and is based on the surveys of each type within the broad typology. Care was taken to exclude anything outside the proposed time span (1970–present) and pages with little or no textual content. As these previous inspections enable a closer understanding of the structure of the surveyed websites, the desired sections can be easily tracked and selected for preprocessing among the downloaded files. In the preprocessing stage, we extract the text from the Raw Corpus and, after that, the metadata insertion process takes place.
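
A toy version of such a filter might look as follows; the 50-word threshold and the handling of undated texts are illustrative assumptions, not the project's actual criteria.

```python
MIN_WORDS = 50  # hypothetical threshold for "little or no textual content"

def keep(text: str, year: int | None) -> bool:
    """Sketch of the filtering criteria described above: discard texts
    outside the 1970-present time span and pages with too little text."""
    in_range = year is None or year >= 1970  # undated texts resolved elsewhere
    has_content = len(text.split()) >= MIN_WORDS
    return in_range and has_content
```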

That methodology was also relevant when the mirror command did not retrieve all the targeted files of a website. As the initial survey allowed us to learn which pages or directories were desired for the corpus, other download methods had to be employed in the cases where they were not automatically crawled by the mirror command. This difficulty was especially present in the Brazilian Federal Government's public websites, which required alternatives to obtain their content, and many resources were used for that. For different sections of the Brazilian Supreme Court (STF) website[1], for example, we built tools to generate URLs based on file naming patterns, to extract URLs and save pages with an HTML parser, and to access and click links recursively, using the Python[2] library BeautifulSoup and the Selenium WebDriver. In addition to that, a large volume of judiciary documents was kindly provided by Jackson José de Souza, crawled with a tool he developed using Scrapy.[3]
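
The sketch below illustrates the two simpler techniques mentioned (URL generation from file-naming patterns and link extraction with an HTML parser); the pattern shown is hypothetical, and the Selenium-based recursive clicking is omitted for brevity.

```python
import urllib.request
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def generate_urls(pattern: str, first: int, last: int) -> list[str]:
    """Generate URLs from a numeric file-naming pattern, e.g.
    'https://example.jus.br/acordaos/doc{:05d}.pdf' (hypothetical)."""
    return [pattern.format(n) for n in range(first, last + 1)]

def extract_links(page_url: str) -> list[str]:
    """Collect every hyperlink from a page with an HTML parser."""
    with urllib.request.urlopen(page_url) as response:
        soup = BeautifulSoup(response.read(), "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)]
```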

2.4. Defining Metadata

The stage following preprocessing is metadata insertion. The conception and development of appropriate metadata categories have been core tasks in building Carolina. The identification of basic metadata for the objectives of the corpus was guided by the classification of information into two broad categories. The first category groups objective information contained in the source document of the text, which was not generated by any type of analysis. Following Santos and Namiuti (2019), we name this category “Dossier”. The second category, which we name “Carolina”, includes processing information and information generated from the analysis of the text contained in the source document. From these two categories, eight information groups were identified: Primary Identification, Authorship, Dating, Location, Size, Acquisition, Licenses and Typology.

Table 1 lists each piece of metadata identified as necessary for the corpus text header. The first column shows the information category (“Dossier” or “Carolina”); the second column identifies the information group within each category; the third column specifies the item of metadata within each group. The last column specifies the cardinality of each metadata item, determining the minimum and maximum or the exact number of occurrences of that item for each file: a minimum cardinality of “zero” indicates that the item is optional, while a minimum cardinality of “one” (exactly one, or one or more) indicates that it is mandatory.

Table 1. Identification of metadata for Carolina (version 1.3 Ada)
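
A small sketch of how such cardinality constraints can be checked mechanically; the item names and bounds below are illustrative stand-ins, not the actual rows of Table 1.

```python
# Hypothetical cardinality table: (min, max) occurrences per metadata item;
# None stands for "unbounded" (the "one or more" case).
CARDINALITY = {
    "title":      (1, 1),     # mandatory, exactly one
    "author":     (0, None),  # optional, possibly several
    "source_url": (1, 1),     # mandatory for provenance
    "license":    (1, None),  # mandatory, one or more
}

def check_header(header: dict[str, list[str]]) -> list[str]:
    """Return a list of cardinality violations for one text header."""
    errors = []
    for item, (lo, hi) in CARDINALITY.items():
        n = len(header.get(item, []))
        if n < lo or (hi is not None and n > hi):
            errors.append(f"{item}: found {n}, expected {lo}..{hi or 'n'}")
    return errors

print(check_header({"title": ["Example"], "author": []}))
# -> ['source_url: found 0, expected 1..1', 'license: found 0, expected 1..n']
```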

The texts in Carolina are represented as TEI Documents, encoded in XML in accordance with the specifications “TEI P5: TEI Guidelines for Electronic Text Encoding and Interchange”, developed and maintained by the Text Encoding Initiative Consortium (TEI Consortium, 2024). A single XML file encodes several texts of the corpus, with a hierarchy of elements that can be validated against a schema derived from the TEI Guidelines, aiming to ensure greater interoperability.

Each text included in the corpus contains a <TEI> element, which includes the descendant node <teiHeader>, mandatory in a TEI-conformant document. The metadata items listed in Table 1 are encoded in the <teiHeader> element of each text. Figure 2 presents the general hierarchical structure of the XML header of each individual text. Carolina’s text header was structured based on the AnnoTEI Schema, proposed by Costa (2024), which recommends encoding each item of metadata classified in the “Dossier” category within the <sourceDesc> (Source Description) element. Given the importance of the origin of the texts for the Carolina project, the data referring to the source document or file are encoded in <sourceDesc>, even though the texts are born-digital documents. The <fileDesc> (File Description) element is mandatory within the <teiHeader> and is designed for encoding the file description. Since the <sourceDesc> element can contain <fileDesc>, the AnnoTEI Schema defines that this nested element holds a complete bibliographic description of the source document, while the remaining metadata items are inserted into other child elements of <fileDesc> that are not nested in <sourceDesc>, which include information about the distribution of, and work with, the corpus XML file.

Building on this, the final schema for the texts in the corpus was defined observing the specificities of the project. Carolina’s <teiHeader> also contains two elements defined as optional by the TEI Guidelines: <encodingDesc> (Encoding Description) and <profileDesc> (Text-Profile Description). The <encodingDesc> element includes information about the encoding of the text. Finally, <profileDesc> contains the classification of the text according to the typology established by the project team. The decisions about which elements to use and where to place them were based on the objectives of the corpus, creating a schema in accordance with the “corpus” customization provided by the TEI.

Figure 2. General hierarchical structure of the Carolina XML file
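
As a rough illustration of this hierarchy, the Python sketch below assembles a skeletal <teiHeader> with the four elements discussed; it simplifies the AnnoTEI layout (for instance, it uses a plain <bibl> inside <sourceDesc> instead of the nested <fileDesc> the schema prescribes) and omits most of the items in Table 1.

```python
import xml.etree.ElementTree as ET

def build_tei_header(title: str, source_url: str, broad_type: str) -> ET.Element:
    """Skeletal teiHeader: fileDesc (with the source described in
    sourceDesc), plus the optional encodingDesc and profileDesc."""
    tei_header = ET.Element("teiHeader")
    file_desc = ET.SubElement(tei_header, "fileDesc")
    title_stmt = ET.SubElement(file_desc, "titleStmt")
    ET.SubElement(title_stmt, "title").text = title
    ET.SubElement(file_desc, "publicationStmt")   # distribution information
    source_desc = ET.SubElement(file_desc, "sourceDesc")
    bibl = ET.SubElement(source_desc, "bibl")     # source description
    ET.SubElement(bibl, "ref", target=source_url)
    ET.SubElement(tei_header, "encodingDesc")     # how the text was encoded
    profile_desc = ET.SubElement(tei_header, "profileDesc")
    text_class = ET.SubElement(profile_desc, "textClass")
    ET.SubElement(text_class, "catRef", target=f"#{broad_type}")
    return tei_header

header = build_tei_header("Example decision", "https://example.jus.br/doc1", "jud")
print(ET.tostring(header, encoding="unicode"))
```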

Since TEI P5 is a modular and flexible system, whose infrastructure enables users to create a specific encoding schema appropriate to their needs without compromising data interoperability, a customized schema was defined for the Carolina Corpus. The final schema meets the conformance requirements outlined by the TEI standard, ensuring that documents validated by it are “TEI-Conformant”. Therefore, the customized schema follows the TEI Abstract Model and is generated from an ODD (One Document Does It All) file, as recommended by the guidelines. To achieve greater interoperability, the customization is a subset of the “TEI-All” schema, which makes it a “clean modification” according to the guidelines (TEI Consortium, 2024). This conformance ensures that the metadata encoded in the <teiHeader> can be consistently mapped onto widely adopted standards such as Dublin Core and, when required, onto CMDI, the standard adopted by the European CLARIN infrastructure.

The TEI header was thus designed to ensure interoperability with bibliographic and linguistic metadata standards. Since the AnnoTEI schema does not represent an extension of TEI, but rather what the guidelines classify as a clean modification, the <teiHeader> used in Carolina is fully compatible with international metadata standards and can be entirely mapped to Dublin Core; by selecting an appropriate profile, it can also be converted to CMDI, the metadata standard adopted by the European CLARIN infrastructure for language resources. The Carolina corpus is already available in Portulan CLARIN, the Portuguese node of that infrastructure.
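
Such a mapping is essentially a crosswalk table; the sketch below shows the idea with a handful of illustrative TEI paths and Dublin Core terms (the actual Carolina mapping is richer and is not reproduced here).

```python
# Hypothetical crosswalk from TEI header elements (XPath-like paths)
# to Dublin Core terms, illustrating the mapping described above.
TEI_TO_DC = {
    "fileDesc/titleStmt/title":      "dc:title",
    "fileDesc/titleStmt/author":     "dc:creator",
    "fileDesc/publicationStmt/date": "dc:date",
    "fileDesc/sourceDesc/bibl/ref":  "dc:source",
    "profileDesc/textClass/catRef":  "dc:type",
    "profileDesc/langUsage/language": "dc:language",
}

def crosswalk(tei_fields: dict[str, str]) -> dict[str, str]:
    """Translate extracted TEI header values into a Dublin Core record."""
    return {TEI_TO_DC[path]: value
            for path, value in tei_fields.items() if path in TEI_TO_DC}

record = crosswalk({"fileDesc/titleStmt/title": "Example decision",
                    "profileDesc/langUsage/language": "pt-BR"})
print(record)  # {'dc:title': 'Example decision', 'dc:language': 'pt-BR'}
```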

3. Current State

Since 2022, four versions of Carolina Ada have been published, each one with a few updates or corrections in relation to the previous version. Table 2 below shows the schedule and size of each version; more information about all of them can be found on the corpus webpage (https://sites.usp.br/corpuscarolina).

Table 2. Published Carolina versions

The current version of the corpus (Carolina 1.3 Ada), published in October 2024, is organized by the types in the broad typology established up to the present, plus an additional Social Media typology, and it shows the following numbers (Table 3).

Table 3. XML Carolina Corpus (version 1.3 Ada) in numbers

The information presented in Table 3 concerns the XML corpus, which represents the final stage of the Carolina 1.3 Ada version, containing texts extracted from openly licensed web domains, balanced data, and their respective encoded metadata. The texts in the XML version have already been preprocessed, filtered, and deduplicated. It is also worth mentioning that some websites, and even entire broad types (such as the journalistic texts), which require the explicit authorization of their copyright owners to be made available (and are therefore still in the process of rights-of-use requisition), are not accounted for in these numbers.

4. Conclusion and Future Steps

Carolina has an important distinguishing feature: it is conceived with an original methodology developed by the LaViHD-C4AI team, which we call WaC-wiPT (Web as Corpus with Provenance and Typology information). We consider provenance to be a crucial aspect to strive for in web-based corpora, alongside typology and balance management. Apart from facilitating copyright compliance and typology labeling, it allows one to answer questions about the origin of texts and increases the scope of uses for the corpus.

As shown in our state-of-the-art review (a non-exhaustive list of openly available Brazilian Portuguese corpora and other relevant web-based corpora), many recent corpora were built adopting the WaCky methodology. Because this methodology does not envision provenance as we defend here for Carolina, and because most guidelines from other corpora emphasize only “corpus balance”, for which typology serves just as one criterion, most of these corpora were not incorporated into the corpus; instead, they had an important role in the conception of our own methodology.

Therefore, in line with the provenance proposition, the LaViHD team at C4AI, as part of its NLP2 challenge, has built a large corpus with a robust and unprecedented volume of texts of various typologies in Brazilian Portuguese. In this paper, we presented the current state and the next steps of the corpus construction, defending the importance of provenance and of a detailed typology scheme as fundamental assets in modern data availability. As related products developed during the construction of Carolina, we presented the WaC-wiPT methodology, based on provenance and typology, aiming to make as much data as possible openly available online (in its “beta version”). This also includes the building of metadata to describe provenance and typology, forming the first version of Carolina’s header scheme, using “TEI P5” in accordance with the reuse principle. Additionally, the Raw Corpus was created, currently totaling 1,779 GB and 124,084,164,722 tokens.

As Carolina approaches its fifth anniversary, we aim to remain aware of its limitations as well as its progress. In this regard, one of the main challenges of the current phase is balancing the corpus in terms of typologies; in particular, as mentioned before, sources with copyright-protected data (for instance, the journalistic texts) were not prioritized at first. We are aware that this limitation must be overcome, and this is part of our goals for the next versions. Interestingly, this problem stems from our principle of guaranteed Provenance; but rather than compromise on this fundamental aspect, we opted to wait until we can obtain the licenses that will allow us to offer quality whole texts free of copyright liability.

Finally, another important challenge in the current phase of development is the availability of a more user-friendly interface, in particular bearing in mind users outside the realm of Computer Science. In its current state, Carolina is fully available through the main website, which leads to dedicated platforms that allow bulk download[1]; in the near future, we will make it available through a searchable interface, complementing the possibility of downloading the whole corpus.

Acknowledgments

This work was carried out at the Center for Artificial Intelligence (C4AI-USP), with support from the São Paulo Research Foundation (FAPESP grant #2019/07665-4) and the IBM Corporation. This work was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior – Brasil (CAPES) – Finance Code 001 and also by the Ministry of Science, Technology and Innovation, with resources of Law N. 8.248, of October 23, 1991, within the scope of PPI-SOFTEX, coordinated by Softex and published as Residence in TIC 13, DOU 01245.010222/2022-44. Marcelo Finger received partial support from FAPESP (#2023/00488-5, #2022/11254-2) and CNPq 302963/2022-7 (PQ); Cristiane Namiuti received partial support from the Bahia Research Foundation (FAPESB 0007/2016, 0014/2016); Maria Clara Paixão de Sousa and Vanessa Martins do Monte received partial support from the São Paulo Research Foundation (FAPESP grant #2021/15133-2); Felipe R. Serras was supported by the IBM Corporation in a grant managed by FUSP under number 3541 and in a PPI-SOFTEX grant managed by FUSP under number 3970; Mariana L. Sturzeneker received support from the São Paulo Research Foundation (FAPESP grant #2024/13270-0); and Gabriela Alves Lachi received support from the Unified Scholarship Program of the University of São Paulo (PUB-USP), project 2024-5789.

We would like to thank the researchers who were involved in the earlier phases of Carolina but are no longer part of the project today: Maria Clara Ramos Morales Crespo, Maria Lina de Souza Jeannine Rocha, Guilherme Lamartine de Mello, Raquel de Paula Guets, Renata Morais Mesquita, Mariana Marques da Silva and Patrícia Brasil. Their contribution was essential to getting us to where we are now.

Additional Information

Conflict of Interest

The authors declare no conflict of interests.

Statement of Data Availability

This research is conducted as an Open Access Project.

Funding Sources

IBM Corporation.

Brazilian Ministry of Science, Technology and Innovation.

São Paulo Research Foundation (FAPESP), grants 2019/07665-4, 2021/15133-2, 2022/11254-2, 2023/00488-5, 2024/13270-0.

University of São Paulo Support Foundation (FUSP)

USP Unified Scholarship Program (PUB)

Coordination for the Improvement of Higher Education Personnel (CAPES), Finance Code 001.

National Council for Scientific and Technological Development (CNPq), 302963/2022-7 (PQ).

Bahia Research Foundation (FAPESB), grants 0007/2016, 0014/2016.

References

Bick, E. (2000). The parsing system palavras: Automatic grammatical analysis of Portuguese in a constraint grammar framework. Aarhus Universitetsforlag.

Boos, R., Prestes, K., Villavicencio, A., Padró, M. (2014). brWaC: A WaCky Corpus for Brazilian Portuguese. In: Baptista J., Mamede N., Candeias S., Paraboni I., Pardo T.A.S., Volpe Nunes, M.G. (eds). Computational Processing of the Portuguese Language. PROPOR 2014, Lecture Notes in Computer Science, vol 8775. Springer, Cham. (pp. 201-206). https://doi.org/10.1007/978-3-319-09761-9_22

Cardoso, J. A. (2007). Direitos Autorais no Trabalho Acadêmico. REVISTA JURÍDICA DA PRESIDÊNCIA, 9(86), 58-86.

Clarke, C. L., Cormack, G. V., Laszlo, M., Lynam, T. R.; Terra, E. L. (2002). The impact of corpus size on question answering performance. In K. Järvelin, M. Beaulieu, R. Baeza-Yates, S. H. Myaeng (Eds.), Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval (pp. 369-370). Association for Computing Machinery

Costa, A. S. (2024). Um sistema de anotação de múltiplas camadas para corpora históricos da língua portuguesa baseados em manuscritos [Doctoral dissertation, State University of Southwestern Bahia (UESB)]. Department of Linguistic and Literary Studies (DELL), Vitória da Conquista, Bahia, Brazil.

Crespo, M. C. R. M.; Rocha, M. L. S. J.; Sturzeneker, M. L.; Serras, F. R.; Mello, G. L.; Costa, A. S.; Palma, M. F.; Mesquita, R. M.; Guets, R. P.; Silva, M. M.; Finger, M.; Paixão de Sousa, M. C.; Namiuti, C.; Martins do Monte, V. (2022). Carolina: a General Corpus of Contemporary Brazilian Portuguese with Provenance, Typology and Versioning Information. Manuscript, September 2022. Preprint: arXiv:2303.16098v1 [cs.CL], 28 Mar 2023.

Davies, M., Ferreira, M. (2016). Corpus do Português: 1.1 billion words, Web/Dialects. Brigham Young University: Provo, UT. Retrieved May 26, 2021, from https://www.corpusdoportugues.org/web-dial/

Davies, M., Ferreira, M. (2018). Corpus do Português: 1.1 billion words, NOW. Brigham Young University: Provo, UT. Retrieved May 26, 2021, from https://www.corpusdoportugues.org/now/

de Brito, M. G., Valério, R. G., de Almeida, G. P.; de Oliveira, L. P. (2007). CORPOBRAS PUC-RIO: Desenvolvimento e análise de um corpus representativo do português. PUC-Rio. Retrieved May 26, 2021, from http://www.puc-rio.br/pibic/relatorio_resumo2007/resumos/LET/marcia_gonzaga_de_brito_rubiae_guilherme_valerio_e_gabriel_paladino_de_almeida.pdf

de Oliveira, L. P.; Dias, M. C. P. (2009). Compilação de corpus: representatividade e o CORPOBRAS. Calidoscópio, 7(3), 192-198. Unisino. https://www.doi.org/10.4013/cld.2009.73.03

Devlin J., Chang M.; Lee K.; Toutanova K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In J. Burstein, C. Doran; T. Solorio(Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (pp. 4171–4186). Association for Computational Linguistics. http://dx.doi.org/10.18653/v1/N19-1423

Ferraresi, A.; Bernardini, S.; Picci, G.; Baroni, M. (2010). Web corpora for bilingual lexicography: A pilot study of English/French collocation extraction and translation. In: Using Corpora in Contrastive and Translation Studies. Newcastle: Cambridge Scholars Publishing. (pp. 337-362).

Fletcher, W. H. (2007). Concordancing the Web: promise and problems, tools and techniques. In M. Hundt, N. Nesselhaulf; C. Biewer (Eds.), Corpus Linguistics and the Web (pp. 25-45). Rodopi.

Jakubíček, M.; Kilgarriff, A.; Vojtěch K.; Pavel R.; Vít Suchomel. (2013). The TenTen corpus family. In 7th International Corpus Linguistics Conference CL (pp. 125-127).

Kalchbrenner, N.; Blunsom, P. (2013). Two recurrent continuous translation models. In Proceedings of the ACL Conference on Empirical Methods in Natural Language Processing (EMNLP), (pp. 1700–1709). Association for Computational Linguistics.

Kilgarriff, A., & Grefenstette, G. (2003). Introduction to the Special Issue on the Web as Corpus. Computational Linguistics, 29(3), 333-347. https://doi.org/10.1162/089120103322711569

Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; Soricut, R. (2019). Albert: A lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.

Liu, V.; Curran, J. R. (2006). Web text corpus for natural language processing. In D. McCarthy; S. Wintner (Eds.), 11th Conference of the European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics

Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. (2019). Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.

Namiuti, C.; Santos, J. V. (2021). Novos desafios para antigas fontes: a experiência DOViC na nova linguística histórica. In R. M. Pimenta & D. Alves (Orgs.), Humanidades digitais e o mundo lusófono (pp. 69-89). Rio de Janeiro: Editora FGV.

Paixão de Sousa, M. C. (2014). O Corpus Tycho Brahe: contribuições para as humanidades digitais no Brasil. Filologia e Linguística Portuguesa, 16(esp.), 53-93. https://doi.org/10.11606/issn.2176-9419.v16ispep53-93

Pinheiro, G. M.; Aluísio, S. M. (2003). Córpus Nilc: descrição e análise crítica com vistas ao projeto Lacio-Web. Núcleo Interinstitucional de Lingüística Computacional. Retrieved May 27, 2021, from http://143.107.183.175:22180/lacioweb/publicacoes.htm

Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. (2018). Improving language understanding by generative pre-training. OpenAI. Retrieved June 10, 2021, from https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf

Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P. J. (2019). Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683. https://www.jmlr.org/papers/v21/

Sales, Joana; Sales, Teresa. (2025). Carolina Michaëlis de Vasconcelos (1851-1925). Centro de Documentação e Arquivo Feminista Elina Guimarães. Retrieved January 25, 2025, from https://www.cdocfeminista.org/carolina-michaelis-de-vasconcelos-1851-1925/

Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.

Santos, V. G.; Alves, C. A.; Carlotto, B. B.; Papa Dias, B. A.; Stefanel Gris, L. R.; Lima Izaias, R. d.; Azevedo de Morais, M. L.; Marin de Oliveira, P.; Sicoli, R.; Svartman, F. R. F.; Leite, M. Q.; Aluísio, S. M. (2022). CORAA NURC-SP Minimal Corpus: a manually annotated corpus of Brazilian Portuguese spontaneous speech. Proc. IberSPEECH 2022, 161-165. https://doi.org/10.21437/IberSPEECH.2022-33

Santos, D. (2000). O projecto Processamento Computacional do Português: Balanço e perspectivas. In M. das Graças (Ed.), V Encontro para o processamento computacional da língua portuguesa escrita e falada (PROPOR 2000) (pp. 105-113). ICMC/USP

Santos, J. V.; Namiuti, C. (2019). O futuro das humanidades digitais é o passado. In E. Carrilho et al. (Eds.), Estudos Linguísticos e Filológicos oferecidos a Ivo Castro (pp. 1381-1404). Lisboa: Centro de Linguística da Universidade de Lisboa. ISBN 978-989-98666-3-8.

Sardinha, T. B. (2000). Lingüística de Corpus: Histórico e Problemática. DELTA: Documentação de Estudos em Linguística Teórica e Aplicada, 16(2), 323-367. https://doi.org/10.1590/S0102-44502000000200005

Sardinha, T. B.; Filho, J. L. M.; Alambert E.(2010) Manual Córpus Brasileiro. PUCSP, FAPESP. Retrieved May 26, 2021, from https://www.linguateca.pt/Repositorio/manual_cb.pdf

Souza, F.; Nogueira, R.; Lotufo, R. (2020). BERTimbau: Pretrained BERT Models for Brazilian Portuguese. In Brazilian Conference on Intelligent Systems (pp. 403-417). Springer, Cham. https://link.springer.com/chapter/10.1007%2F978-3-030-61377-8_28

Stim, R. (2016a). Fair Use. Stanford Libraries; NOLO. Retrieved May 27, 2021, from https://fairuse.stanford.edu/overview/fair-use/

Stim, R. (2016b). Measuring Fair Use: The Four Factors. Stanford Libraries; NOLO. Retrieved May 27, 2021, from https://fairuse.stanford.edu/overview/fair-use/

Sturzeneker, M. L.; Crespo, M. C. R. M.; Rocha, M. L. S. J.; Finger, M.; Paixão de Sousa, M. C.; Martins do Monte, V.; Namiuti, C. (2022). Carolina's Methodology: building a large corpus with provenance and typology information. In Proceedings of the Second Workshop on Digital Humanities and Natural Language Processing (2nd DHandNLP 2022). CEUR-WS, Vol. 3128. http://ceur-ws.org/Vol-3128

TEI Consortium, Burnard, L.; Sperberg-McQueen, C. M. (2024). TEI P5: Guidelines for electronic text encoding and interchange. Version 4.8.0. Last updated on 8th July 2024, revision f9891a87. Retrieved Sep 4, 2024 from https://tei-c.org/Vault/P5/4.8.0/doc/tei-p5-doc/en/Guidelines.pdf

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems. arXiv:1706.03762.

Vianna, A. E. P. B.; de Oliveira, L. P. (2010). CORPOBRAS PUC-Rio: Análise de corpus e a metáfora gramatical. PUC-Rio. Retrieved May 26, 2021, from http://www.puc-rio.br/ensinopesq/ccpg/Pibic/relatorio_resumo2010/relatorios/ctch/let/LET-%20Ana%20Elisa%20Piani%20Besserman%20Vianna.pdf

Wagner Filho, J. A.; Wilkens, R.; Idiart, M.; Villavicencio, A. (2018). The BrWac corpus: A new open resource for Brazilian Portuguese. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). (pp. 4339-4344), from https://www.aclweb.org/anthology/L18-1686.

Review

DOI: https://doi.org/10.25189/2675-4916.2025.V6.N4.ID812.R

Editorial Decision

EDITOR 1: Raquel Meister Ko Freitag

ORCID: https://orcid.org/0000-0002-4972-4320

AFFILIATION: Universidade Federal de Sergipe, Sergipe, Brasil.

EDITOR 2: Juliana Bertucci Barbosa

ORCID: https://orcid.org/0000-0002-1510-633X

AFFILIATION: Universidade Federal do Triângulo Mineiro, Minas Gerais, Brasil.

EDITOR 3: Marcia dos Santos Machado Vieira

ORCID: https://orcid.org/0000-0002-2320-5055

AFFILIATION: Universidade Federal do Rio de Janeiro, Rio de Janeiro, Brasil.

-

ASSESSMENT: The article "Construindo o Carolina: Metadados de Proveniência e Tipologia em um Corpus do Português Brasileiro Contemporâneo" contributes to the dossier on Linguistic Data Sharing the experience of constructing, and the ethical, methodological and technical challenges of, a large open corpus for Brazilian Portuguese. The focus on provenance and typology metadata stands out, together with methodological transparency and adherence to international standards. The initiative strengthens Open Science, fosters reproducibility, and expands the resources available to the scientific community, promoting Portuguese as a language of interest for large-scale research.

Rounds of Review

REVIEWER 1: Júlio Cesar Galdino

ORCID: https://orcid.org/0000-0001-6378-4648

AFFILIATION: Universidade de São Paulo, São Paulo, Brasil.

REVIEWER 2: Sérgio Manuel Serra da Cruz

ORCID: https://orcid.org/0000-0002-0792-8157

AFFILIATION: Universidade Federal do Rio de Janeiro, Rio de Janeiro, Brasil.

REVIEWER 3: Mariana Gonçalves da Costa

ORCID: https://orcid.org/0000-0002-8088-0794

AFFILIATION: Universidade de São Paulo, São Paulo, Brasil.

-

ROUND 1

REVIEWER 1

2025-05-26 | 10:22 PM

The article describes the methodology of the CAROLINA corpus, presenting the steps necessary for its construction and the information on provenance and typology in a web-as-corpus collection. The study is useful for professionals who work with text data, and its description helps researchers build more resources for Brazilian Portuguese.

The study set out to describe the methodology employed in the CAROLINA corpus. To this end, the authors presented examples of other corpora of the same type, detailed the decisions established regarding the provenance and typology of the data, and discussed the current state of the corpus.

One of the positive points of the article is the presentation of examples of other web corpora, followed by the methodological decisions related to provenance and typology, as well as the processing stages. It also addresses questions concerning the copyright involved in collecting and using the data, and the decisions on whether or not to use certain methods. The introduction contains points essential to the development of the article, such as the description of what the corpus is, its usefulness, and the origin of the name CAROLINA. In the following section, the authors justify the use of information related to the provenance and typology of the data and describe the constitution of CAROLINA as a corpus extracted from the web. The objective of the article was achieved, since the steps necessary for the construction of the dataset were described and the categories of typology and provenance were discussed. The article uses figures and tables that aid the understanding of the text, notably Table 1, which organizes the metadata by group and indicates the cardinality of each category. The text also presents the next steps, the limitations, and the interface problems of the corpus website. There are also points that could be improved. In the introduction, for example, the contributions the article offers could be described right after the objective, before the description of what each section intends to do. Including the contributions is important, as it highlights the relevance of the work. For example, the information contained in the article is very useful for the creation of future datasets. In the section "Fundamentals and related works", there is no reflection on the restrictions of using the web as a corpus. This discussion would be pertinent, not to invalidate the use of the internet for creating datasets, but to present the criticisms raised against this type of approach and then emphasize, with more authority, the benefits this type of resource provides.

In view of the points raised, the article is an important contribution to Computational Linguistics and Corpus Linguistics: by detailing the construction methodology, the authors support the creation of future datasets, expanding the resources for research on Brazilian Portuguese and favoring the reproducibility of new corpora.

-

REVIEWER 2

2025-07-12 | 06:10 PM

The manuscript addresses a topic of great relevance to Computational Linguistics and Artificial Intelligence: the construction of a robust, large-scale corpus for Brazilian Portuguese, with particular emphasis on capturing and recording provenance and typology metadata. The CAROLINA project stands out for its ambition to address the scarcity of resources for Portuguese in NLP, and the WaC-wiPT methodology (Web-as-Corpus enhanced with concerns about provenance and typology) is an innovative strength, though it still requires a more refined description. The multidisciplinary team and the collaboration with C4AI-USP indicate a well-structured project with great potential.

Strong points:

Relevance and originality: The work addresses a critical gap for Brazilian Portuguese in the NLP landscape, namely the lack of large, open, well-documented corpora. The text indicates the importance of provenance and typology metadata as curation criteria, but the discussion of how provenance is captured and represented is superficial; exploring this in greater detail would be a significant differentiator and a promising approach to guaranteeing the reliability and usability of the corpus.

Methodology: The description of the WaC-wiPT methodology, including the survey, download, and pre-processing stages, is clear and well-grounded, but it could be more detailed with regard to the reproducibility of the corpus construction process. The justification for not adopting the WaCky framework (due to limitations in content and provenance control) is convincing and demonstrates rigorous care with data quality, but the text does not make explicit which type of provenance metadata (retrospective? prospective?) is central to the work. The description of provenance is superficial and should be denser, given that provenance appears in the title of the paper.

Concern with copyright and the LGPD: The in-depth discussion of copyright issues and of compliance with the LGPD (Brazil's General Data Protection Law) is commendable and reflects an ethically and legally conscious approach to building the corpus.

Comprehensive but limited metadata structure: The proposal of a detailed metadata scheme, divided into the "Dossiê" and "Carolina" categories, with the cardinality of each item specified, demonstrates meticulous planning for the documentation of the corpus. The use of the TEI P5 standard for encoding the texts is a sound choice, ensuring interoperability and adherence to international standards. However, the scheme is not aligned with any other metadata standard. The fact that the corpus is in Portuguese neither precludes nor requires a scheme circumscribed to the group or to the AI center. On the contrary, an approach based on international metadata standards could be valuable not only for standardization with international efforts but also as a theoretical reference for other initiatives.

Openness and accessibility: The intention to make the corpus available under free, open access is fundamental for the scientific community and for the advancement of NLP research on Brazilian Portuguese.

Multidisciplinary team: The composition of the team, with researchers from computer science, linguistics, and philology, is a strength, since it allows a holistic view and ensures that both the technical and the linguistic aspects are adequately addressed.

Weak points and opportunities for improvement:

1) Deepening the discussion of provenance and of metadata standards in Computing: Although the text mentions the importance of provenance and typology and presents a metadata scheme, the discussion of metadata standards and provenance frameworks widely used in Computer Science could be significantly deepened. The use of the concept of "typology" is not clear; no reference or related work is given. I recommend revisiting the approach in light of metadata standards. Correction and suggestion: in Section 1.1 "Fundamentals" and, more specifically, in Section 2.1 "The issue of Provenance" and Section 2.4 "Defining Metadata", the manuscript could go beyond merely asserting the importance of provenance. Concrete examples of data provenance models and standards used in computational systems and data repositories should be presented. For example, PROV (the W3C provenance model) is a fundamental standard that defines a data model for describing the provenance of information and processes. Discussing how PROV or similar concepts relate to CAROLINA's metadata structure would add technical depth and demonstrate alignment with best practices in Computer Science (see the illustrative sketch following this review). Other relevant metadata standards besides TEI P5 (which is more specific to the Digital Humanities and text encoding) should also be mentioned; for example, Dublin Core, schema.org, or standards specific to research data (e.g., the DataCite Metadata Schema, or ISO 23081 for records management) could be briefly discussed regarding their applicability, or regarding why TEI P5 was the primary choice in this context. This would enrich the discussion of the "international standards" mentioned in the abstract. The description of the XML is overly simple; it should go further, explaining the XML Schema (perhaps in an appendix on the platform's website) rather than only the semantics of the tags. That kind of textual description is reasonable for humans, but what about machines? The manuscript should also make explicit the architecture or technical mechanism by which the provenance metadata are captured (the text does not make clear whether the process is automated or manual), stored, and associated with the texts in the corpus: for example, whether there is a dedicated database or a version-control system for the metadata, or how traceability is technically implemented. This technical gap could be addressed in somewhat more detail in the next versions of the text.

2) Clarity in the distinction between the "Dossier" and "Carolina" metadata categories: The categories are introduced in Section 2.4, but a brief, more didactic explanation of the reasoning behind this nominal distinction (beyond the description of their contents) could benefit readers unfamiliar with the project's internal methodology. Suggestion: add a concise sentence explaining the motivation behind the names or the philosophy that distinguishes them, for example: "The 'Dossier' category captures intrinsic, objective metadata of the original source, while 'Carolina' covers information generated or processed during the construction of the corpus, reflecting the project's active curation."

3) Discussion of scale and performance challenges: Building a corpus entails significant computational challenges in terms of storage, processing, access, provenance, and curation. Although the manuscript mentions the use of Wget and Python tools, a brief discussion of the computational infrastructure used, of the processes and scale challenges encountered, and of the solutions adopted to optimize performance (e.g., distributed processing, database optimization for the metadata) would be valuable for the Computer Science audience.

Suggestion: include a concise paragraph in Section 2.3 or 2.4 addressing practical aspects of the infrastructure and of the scale challenges.

4) Future integration of other corpora: The mention that CAROLINA will be the "mother ship" that incorporates other, smaller corpora in the future is interesting, but it lacks details on how CAROLINA's provenance methodology will guarantee the traceability and consistency of the metadata when integrating external sources that have their own annotation structures.

Suggestion: expand this discussion, perhaps in "Future Directions", or create a "Limitations" subsection on the challenges and the planned approach for harmonizing different metadata schemes and preserving the integrity of retrospective provenance when integrating diverse corpora. I also recommend that the team consider deepening (or indeed initiating) discussions on making the repository compatible with the FAIR data principles. I believe this would be a vigorous contribution to the NLP and Computing communities.

5) Targeted language revision (English): The text is well written and organized, but it presents some nuances that could be polished for more fluent, academic English. Examples of corrections and suggestions:

"variated typology" -> "varied typology" or "diverse typologies".

"hf. NLP" -> "i.e., NLP" or "namely, NLP".

"assigned as a professor" -> "appointed as a professor" or "named a professor".

"Knowing the source of the text is directly related to the trust placed on its content." -> "Knowledge of a text's source is directly related to the trustworthiness of its content."

"figure in the range of essential information" -> "constitute essential information" or "are crucial information".

"As its own creators acknowledge, the automated methods allow for limited control over the contents that end up in the final corpus, and therefore they need to be post-hoc investigated (Baroni et al., 2009)." -> "As their creators acknowledge, automated methods offer limited control over the content included in the final corpus, thus necessitating post-hoc investigation (Baroni et al., 2009)."

"corpus might be violating the law even if publishing only fragmented or highly processed versions of the texts when copyrighted content is included." -> "corpora might still violate the law, even when publishing only fragmented or highly processed versions of texts that include copyrighted content."

Targeted textual corrections: Page 1, Title: be consistent in the use of "Corpus" (capitalized) or "corpus" (lowercase) throughout the text; I suggest keeping "corpus" in lowercase except when it is part of a proper name (e.g., "Corpus CAROLINA"). Page 3, Line 11: "hf. NLP" should be corrected to a more formal abbreviation or expanded. Page 3, Line 24: citations 11 and 12 appear as superscripts, but they should ideally be numbered references in the flow of the text. Page 4, Figure 1: check the legibility of the text inside the diagram; "Univesity Domains" should be "University Domains", and "Journalistc texts" should be "Journalistic texts". Page 9, Table 1: in the "Extent" row, "Measures (File size in bytes, number of pages or number of tokens of the source document) 1+": "or number of tokens" could be "and/or number of tokens", depending on whether all measures are always collected.
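As an illustration of Reviewer 2's first suggestion above, the sketch below shows how the retrospective provenance of a single corpus text might be expressed in the W3C PROV model, here via Python and the rdflib library. All identifiers and the namespace are hypothetical assumptions made for illustration only, not Carolina's actual metadata schema: the sketch merely supposes that each text can be linked to the web page it came from, the download activity that produced it, and the curating team.

# Illustrative sketch only: hypothetical identifiers, not Carolina's schema.
from rdflib import Graph, Namespace, RDF

PROV = Namespace("http://www.w3.org/ns/prov#")
EX = Namespace("https://example.org/carolina/")  # hypothetical namespace

g = Graph()
g.bind("prov", PROV)
g.bind("ex", EX)

text = EX["text/0001"]                # a corpus document (prov:Entity)
source = EX["source/some-web-page"]   # the crawled page it came from
crawl = EX["activity/crawl-2021-06"]  # the download step (prov:Activity)
team = EX["agent/curation-team"]      # the curators (prov:Agent)

for node, cls in [(text, PROV.Entity), (source, PROV.Entity),
                  (crawl, PROV.Activity), (team, PROV.Agent)]:
    g.add((node, RDF.type, cls))

# Retrospective provenance: the corpus text was derived from the source
# page by the crawl activity, for which the curating team is responsible.
g.add((text, PROV.wasDerivedFrom, source))
g.add((text, PROV.wasGeneratedBy, crawl))
g.add((crawl, PROV.used, source))
g.add((crawl, PROV.wasAssociatedWith, team))

print(g.serialize(format="turtle"))

Serialized as Turtle, records of this kind could sit alongside the TEI P5 headers and be queried with standard RDF tools, which is one possible way of aligning the corpus documentation with machine-readable provenance standards.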

-

REVIEWER 3

2025-07-12 | 08:06 PM

The work presented in the "Building Carolina" paper has strong potential to positively impact research carried out in both Linguistics and Computer Science, with special implications for studies in Natural Language Processing of Brazilian Portuguese.

In "Building Carolina", the authors provide a comprehensive description of both the theoretical foundations and the practical methodologies behind the development of the CAROLINA corpus, a large-scale collection of Brazilian Portuguese data. Their discussion focuses on the Web-as-Corpus methodology along with issues of data provenance and typology in corpus construction (WaC-wiPT), comparing the CAROLINA corpus to existing corpora and methodologies in the literature. Although, as the authors recognize, the project faces issues with copyright policies that affect the plurality of typologies represented in the corpus, the paper describes an ongoing project with strong potential to positively impact and contribute to research on Natural Language Processing of Brazilian Portuguese.

Author Reply

DOI: https://doi.org/10.25189/2675-4916.2025.V6.N4.ID812.A

-

ROUND 1

2025-08-26

To the reviewers:

We thank the reviewers for their attentive reading of our article and for the comments, which contributed greatly to improving the text.

Seeking to address the suggestions made in the first review, we added to Section 1 three paragraphs devoted to a reflection on the use of the web as corpus, highlighting some of its challenges; and, briefly, in the introduction, we mentioned some of the article's contributions. In Section 2, we also added and modified text that discusses and clarifies issues raised in the second review, related to provenance, metadata, and integration with other corpora. These topics are, however, developed further in other publications by the team that are cited in this article: on the WaC-PT methodology, we cite the article "Carolina's Methodology: building a large corpus with provenance and typology information" (Sturzeneker et al., 2022); on versioning, we cite the article "Carolina: a General Corpus of Contemporary Brazilian Portuguese with Provenance, Typology and Versioning Information" (Crespo et al., 2023); and on the annotation system and metadata, we cite Aline Silva Costa's doctoral dissertation (Costa, 2024). As for the references Santos and Namiuti (2019) and Namiuti and Santos (2021), mentioned in the third review, both were already cited in the text and already appeared in the reference list.

The specific writing suggestions, normalizations, and corrections indicated in the three reviews were also incorporated.

Thus, in the revised text, we have sought to address the recommendations contained in the three reviews, and we hope to have done so satisfactorily.

Cordially,

The authors

How to Cite

FINGER, M.; SOUSA, M. C. P. de; NAMIUTI, C.; MONTE, V. M. do; COSTA, A. S.; SERRAS, F. R.; STURZENEKER, M. L.; CARPI, M. de M.; PALMA, M. F.; LACHI, G. A. Building Carolina: Metadata for Provenance and Typology in a Corpus of Contemporary Brazilian Portuguese. Cadernos de Linguística, Campinas, SP, Brasil, v. 6, n. 4, p. e812, 2025. DOI: 10.25189/2675-4916.2025.v6.n4.id812. Disponível em: https://cadernos.abralin.org/index.php/cadernos/article/view/812. Acesso em: 31 dec. 2025.


Copyright

© All Rights Reserved to the Authors
