Policy on Data Availability in Special Cases

Some studies rely on data that cannot be shared in full (due to ethical, legal, contractual, or copyright restrictions) and/or on content that is publicly accessible on the internet but may change or disappear (broken links, edits, removals). This policy explains how to write the Data Availability Statement so the study remains verifiable and analyses can be rerun, without violating restrictions.

Rule of thumb

When raw data cannot be redistributed, the Statement must still point to a minimum package (with a DOI/persistent identifier) that enables others to reconstruct exactly what was analyzed and rerun the processing, within the applicable rules.

When this applies

  1. Restricted data: when part (or all) of the data cannot be shared publicly for ethical, legal, contractual, or copyright reasons.
  2. Third-party data: when the study uses material controlled by third parties (licensed, subject to terms of use, or not redistributable).
  3. Unstable web sources: when the material is hosted on pages/platforms that may change, be removed, or become unavailable.

1) When part (or all) of the data cannot be shared publicly

If public sharing is not possible, follow the steps below:

  1. State the restriction and what it affects: specify the nature of the restriction (ethical, legal, contractual, copyright, trade secret) and which components of the dataset are affected (for example: raw audio; full transcripts; identifiable metadata; a licensed corpus; third-party files).
  2. Make publicly available everything needed to replicate the analyses that can legally be shared (with a DOI): deposit a DOI-enabled package in a repository (recommended: Zenodo) containing, where applicable:
    • code/scripts and parameters (including library versions, when applicable);
    • materials and instruments (scripts/prompts, inclusion/exclusion criteria, annotation protocol);
    • a data dictionary, coding scheme, and decision rules;
    • derived/aggregated data that do not violate the restriction and allow results to be recomputed.
  3. Explain how to obtain access to the restricted data: if access can be granted under conditions, describe objectively:
    • who grants access (corresponding author, institution, committee, controlled-access repository);
    • the procedure (institutional email, form, terms of use, confidentiality agreement, academic purpose);
    • which documents are required;
    • a typical response timeframe.
  4. Availability for peer review: where applicable, restricted data must be available for reviewers’ assessment, under a usage agreement and within ethical and legal limits.
  5. When the data are third-party (licensed or subject to terms of use): clearly state who the rights holder is, what the rules/terms are, and how readers can legally obtain the same source (license, purchase, registration). If the material cannot be redistributed, also apply the minimum package described in Section 3.
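
Step 2 above asks for library versions to be deposited alongside the code. The following is a minimal sketch of one way to record them, assuming the analysis is written in Python; the function name environment_record and the package list are hypothetical and should be replaced with whatever the analysis actually imports.

```python
import json
import platform
from importlib import metadata


def environment_record(packages):
    """Return the Python version and the installed version of each package."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return {"python": platform.python_version(), "packages": versions}


if __name__ == "__main__":
    # Hypothetical package list; replace with the libraries the analysis uses.
    record = environment_record(["pandas", "numpy"])
    print(json.dumps(record, indent=2))
```

The printed JSON can be saved as a file in the deposited package so that readers can rebuild a comparable environment.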

What does not meet the standard

  • Statements such as “data available upon request” without justification and without a clear access procedure do not meet the standard.

2) When data are publicly accessible on the internet but unstable

Links break; pages change; content is removed. This can prevent others from retrieving exactly the material analyzed. In these cases, preserve a time-stamped version of the corpus at the time of collection/analysis and deposit a DOI-enabled package.

  1. Capture a time-stamped version of the analyzed material: when possible, preserve it in a web-archiving format (for example: WARC/WACZ). When this is not feasible, use alternatives such as complete HTML, PDF/screenshots, and metadata records.
  2. Deposit the preserved copy (or a complete “manifest”) in a repository with a DOI: at minimum, the package must include:
    • a complete list of URLs;
    • access dates (and, if needed, times);
    • version identifiers/IDs available on the platform itself (when present);
    • checksums (for example: SHA-256) for downloaded files (when applicable);
    • instructions to reconstruct the collection (scripts and parameters, if used).
  3. In the manuscript, cite the original source and the preserved version: in the Statement, provide (a) the original URLs and (b) the DOI of the preserved package.
  4. If redistributing the full content is not permitted: deposit publicly the metadata, manifest, scripts, criteria, and derived data, and include objective instructions on how to access the original content legally.
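
The manifest described in step 2 could be produced along the following lines. This is a sketch under stated assumptions, not a required implementation: the column names (url, access_date, platform_id, file, sha256) and the function names are hypothetical, and each record is assumed to point to a file already saved locally.

```python
import csv
import hashlib

# Hypothetical column names; adapt them to the study's own conventions.
MANIFEST_FIELDS = ["url", "access_date", "platform_id", "file", "sha256"]


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(records, manifest_path):
    """Write one CSV row per collected item, adding a checksum when a file exists.

    `records` is an iterable of dicts with the keys url, access_date and,
    optionally, platform_id and file (path to the downloaded copy).
    """
    with open(manifest_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.DictWriter(out, fieldnames=MANIFEST_FIELDS)
        writer.writeheader()
        for rec in records:
            row = {field: rec.get(field, "") for field in MANIFEST_FIELDS}
            # Items without a local copy keep an empty checksum.
            row["sha256"] = sha256_of(row["file"]) if row["file"] else ""
            writer.writerow(row)
```

Anyone who later retrieves the same URLs can recompute the checksums and compare them with the manifest to see whether the content has changed since collection.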

3) Third-party audiovisual content (YouTube, podcasts, TV, social media): mandatory minimum package

When a study uses third-party audiovisual materials, even if they are publicly accessible on the internet, it may not be possible to redistribute the full files (due to copyright/terms of use). In these cases, Cadernos de Linguística requires, at minimum, depositing in a repository (for example, Zenodo) a DOI-enabled package containing:

3.1 Corpus metadata (mandatory file)

A file (CSV/TSV/XLSX/JSON) that, for each corpus item, includes:

  • content title;
  • publication date (when available);
  • source/outlet/channel/profile;
  • URL;
  • access date;
  • platform item identifier (when present);
  • version/edit notes (when detectable).

Suggested filename: corpus_metadata.csv
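
As an illustration only, the metadata file could be written as follows. The column names are hypothetical, and the item_id column is an assumption added so that the segments file in 3.2 has something to link to; which fields are treated as required here is also an assumption, based on the items above that carry no "when available" qualifier.

```python
import csv

# Hypothetical column names mirroring the fields listed above.
METADATA_COLUMNS = [
    "item_id", "title", "publication_date", "source",
    "url", "access_date", "platform_item_id", "version_notes",
]
# Assumed required: fields listed above without a "when available" qualifier.
REQUIRED_COLUMNS = ["item_id", "title", "source", "url", "access_date"]


def write_corpus_metadata(items, path="corpus_metadata.csv"):
    """Write one row per corpus item, rejecting items that lack required fields."""
    with open(path, "w", newline="", encoding="utf-8") as out:
        writer = csv.DictWriter(out, fieldnames=METADATA_COLUMNS, restval="")
        writer.writeheader()
        for item in items:
            missing = [col for col in REQUIRED_COLUMNS if not item.get(col)]
            if missing:
                raise ValueError(f"item {item.get('item_id')!r} is missing: {missing}")
            writer.writerow(item)  # optional fields left out are written as ""
```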

3.2 Spreadsheet of the units/segments actually analyzed (mandatory file)

A spreadsheet (CSV/TSV/XLSX) listing only the analyzed segments, with exact location:

  • item ID (linking to the metadata file);
  • segment/unit identifier (if any);
  • timecodes (start and end) or an equivalent marker;
  • segment transcript, if produced by the authors and if it can be shared;
  • annotated variable(s) and produced coding(s);
  • decision notes (when there is disagreement/ambiguity).

Suggested filename: analyzed_segments.csv
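
Before depositing, authors may want to confirm that every segment links to an item in the metadata file and that its timecodes are well ordered. The sketch below assumes hypothetical column names (item_id in both files; start and end in the segments file) and timecodes written as HH:MM:SS or HH:MM:SS.mmm; other conventions would need a different parser.

```python
import csv


def timecode_to_seconds(timecode):
    """Convert 'HH:MM:SS' or 'HH:MM:SS.mmm' to a number of seconds."""
    hours, minutes, seconds = timecode.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def check_segments(metadata_path, segments_path):
    """Return a list of human-readable problems found in the segments file."""
    with open(metadata_path, newline="", encoding="utf-8") as handle:
        known_ids = {row["item_id"] for row in csv.DictReader(handle)}

    problems = []
    with open(segments_path, newline="", encoding="utf-8") as handle:
        # Row numbering starts at 2 because row 1 is the header
        # (assumes no field spans several lines).
        for row_no, row in enumerate(csv.DictReader(handle), start=2):
            if row["item_id"] not in known_ids:
                problems.append(f"row {row_no}: unknown item_id {row['item_id']!r}")
            if timecode_to_seconds(row["start"]) >= timecode_to_seconds(row["end"]):
                problems.append(f"row {row_no}: start is not before end")
    return problems
```

An empty list means that every segment points to a known item and has a start time earlier than its end time.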

3.3 Annotations/codings and produced materials (when applicable)

  • annotation files (for example, coding tables, tool files, scripts);
  • coding guide and criteria;
  • logs/versions of the procedure, when applicable.

3.4 Repository README (mandatory)

Include a README.txt or README.md with:

  • a brief description of the corpus and the subset used;
  • how the segments were selected;
  • how segment location should be interpreted (timecodes, IDs, etc.);
  • file structure and column definitions;
  • instructions to reproduce processing/extraction (if scripts exist).

Expected outcome

Even if the video/audio cannot be redistributed, anyone can identify exactly what was analyzed and rerun the analysis from the original source, within the rights holder’s rules.

4) What the Statement must include in these cases

The Data Availability Statement must state objectively:

  • what is available (metadata, analyzed segments, code, materials, protocol);
  • where and how (repository + DOI; version; and, when applicable, access conditions);
  • what is not available, why (type of restriction/terms), and how to access the original source legally;
  • for web sources: access date(s) and the DOI of the deposited package;
  • license: when the package is open, state the license applied to the deposit in the repository;
  • dataset citation: when there is a DOI deposit, cite the dataset in the References, following the journal’s style.

5) Statement templates (copy and adapt)

A) Third-party audiovisual (not redistributable in full)

The corpus metadata and the spreadsheet of the segments actually analyzed (with timecodes), as well as the codings and materials produced in this study, are available in [repository] via DOI http://doi.org/[doi]. The full audiovisual content was obtained from resources publicly accessible on the internet and remains hosted on the original platforms (URLs listed in the package). Due to copyright/terms of use restrictions, the full audiovisual files are not redistributed in the repository.

B) Unstable web source (with preservation)

The corpus was collected from resources publicly accessible on the internet on [dates]. To mitigate changes and unavailability, a preserved version of the material actually analyzed (including the list of URLs and access dates) was deposited in [repository] via DOI http://doi.org/[doi]. The original URLs are: [list resources and URLs].

C) Part of the data is restricted (with described access)

The code, materials, and derived data needed to replicate the analyses are available in [repository] via DOI http://doi.org/[doi]. The raw data [describe] are not publicly available due to [restriction]. Access may be requested from [responsible party/institution] via [procedure], under [conditions], with a typical response timeframe of [timeframe]. Where applicable, restricted data will be available for reviewers’ assessment, under a usage agreement and within ethical and legal limits.

Cadernos de Linguística supports the Open Science movement
