<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.2 20190208//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:ali="http://www.niso.org/schemas/ali/1.0">
  <front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">Cadernos de Linguística</journal-id>
<journal-title-group>
<journal-title>Cadernos de Linguística</journal-title>
</journal-title-group>
<issn pub-type="epub">2675-4916</issn>
<publisher>
<publisher-name>Associação Brasileira de Linguística</publisher-name>
</publisher>
</journal-meta>
    <article-meta>
<article-id pub-id-type="doi">10.25189/2675-4916.2021.V2.N4.ID410</article-id>
      <article-categories>
        <subj-group>
          <subject content-type="Type of Contribution">Tutorial</subject>
        </subj-group>
      </article-categories>
      <title-group>
        <article-title>A GUIDE ON EXTRACTING AND TIDYING TWEETS WITH R</article-title>
      </title-group>
      <contrib-group content-type="author">
        <contrib id="person-3c3c60556173f8d573eb113c2d6e07bd" contrib-type="person" equal-contrib="no" corresp="yes" deceased="no">
          <name>
            <surname>Adams</surname>
            <given-names>Julia Bahia</given-names>
          </name>
          <email>j176760@dac.unicamp.br</email>
          <xref ref-type="aff" rid="affiliation-59e5a1663c487e892bc737b74328acd6" />
        </contrib>
        <contrib id="person-41ea47d41d2b6bfb20533eed3cfb143e" contrib-type="person" equal-contrib="no" corresp="no" deceased="no">
          <name>
            <surname>Chiarelli</surname>
            <given-names>Carlos Augusto Jardim</given-names>
          </name>
          <email>ca.chiarelli.97@gmail.com</email>
          <xref ref-type="aff" rid="affiliation-3c93d58198dd9bc07e752b418e078022" />
        </contrib>
      </contrib-group>
      <contrib-group content-type="editor">
        <contrib id="person-f6e93de22d5a621eea9c13c16a4230ff" contrib-type="person" equal-contrib="no" corresp="no" deceased="no">
          <name>
            <surname>Oliveira</surname>
            <given-names>Miguel</given-names>
            <suffix>Jr</suffix>
          </name>
          <email>miguel@fale.ufal.br</email>
          <xref ref-type="aff" rid="affiliation-cacaa336e1c0380ea054bce3cbeed908" />
        </contrib>
        <contrib id="person-bbf699a10b319dd73c8522b6a904fa4c" contrib-type="person" equal-contrib="no" corresp="no" deceased="no">
          <name>
            <surname>Almeida</surname>
            <given-names>René Alain</given-names>
          </name>
          <email>renealain@hotmail.com</email>
          <xref ref-type="aff" rid="affiliation-cebc39cbb6f1834bd64f5407ca61b830" />
        </contrib>
      </contrib-group>
      <aff id="affiliation-59e5a1663c487e892bc737b74328acd6">
        <institution content-type="orgname">Universidade Estadual de Campinas (UNICAMP)</institution>
        <institution content-type="orgdiv1">Instituto de Estudos da Linguagem</institution>
      </aff>
      <aff id="affiliation-cacaa336e1c0380ea054bce3cbeed908">
        <institution content-type="orgname">Universidade Federal de Alagoas</institution>
      </aff>
      <aff id="affiliation-cebc39cbb6f1834bd64f5407ca61b830">
        <institution content-type="orgname">Universidade Federal de Sergipe</institution>
      </aff>
      <aff id="affiliation-3c93d58198dd9bc07e752b418e078022">
        <institution content-type="orgname">Universidade Estadual de Campinas (UNICAMP)</institution>
        <institution content-type="orgdiv1">Faculdade de Engenharia Mecânica</institution>
      </aff>
      <pub-date date-type="pub" iso-8601-date="2021-12-03" />
      <volume>2</volume>
      <issue>4</issue>
      <issue-title>Linguistics Challenges in Open Science</issue-title>
      <elocation-id>e410</elocation-id>
      <history>
        <date date-type="accepted" iso-8601-date="2021-10-18" />
        <date date-type="received" iso-8601-date="2021-07-31" />
      </history>
      <permissions id="permission">
        <license>
          <ali:license_ref>http://creativecommons.org/licenses/by/4.0/</ali:license_ref>
        </license>
      </permissions>
      <abstract>
        <p id="_paragraph-1">Social media platforms are a rich resource for academic research and offer a wide range of untapped possibilities for linguists (D’ARCY; YOUNG, 2012). This rapidly developing field raises various ethical issues and unique challenges regarding methods for retrieving and analyzing data. This tutorial provides a straightforward guide to harvesting and tidying Twitter data, focused mainly on the Tweets’ text, using the R programming language (R CORE TEAM, 2020) and Twitter APIs. The R code was developed in Adams (2020), based on the <italic id="italic-1">rtweet</italic> package (KEARNEY, 2018), and resulted in a script for corpus compilation. In this tutorial, we discuss limitations, problems, and solutions in our framework for conducting ethical research on this social networking site. Our ethical concerns go beyond what we “agree to” in terms of use and privacy policies; that is, we argue that these documents do not cover all the concerns researchers need to attend to. Additionally, we aim to show that using Twitter as a data source does not require advanced computational skills.</p>
      </abstract>
      <abstract abstract-type="executive-summary">
        <title>Resumo</title>
        <p id="paragraph-02f994ad19f88df58a61dcaca08d9c4a">As plataformas de redes sociais representam uma profunda fonte de dados para pesquisas acadêmicas e um amplo leque de possibilidades para linguistas (D’ARCY; YOUNG, 2012). Este campo em rápido desenvolvimento apresenta diversas questões éticas e desafios únicos no que concerne os métodos de coleta e análise de dados. Esse tutorial oferece um guia direto para extração e mineração de dados do Twitter, voltando-se principalmente para o texto dos Tweets, por meio da linguagem de programação R (R CORE TEAM, 2020) via os Twitter APIs. O código em R foi desenvolvido em Adams (2020), com base no pacote <italic id="italic-441ffd7d759bb517f06bfed1a87372d1">rtweet</italic> (KEARNEY, 2018), e resultou com sucesso em um <italic id="italic-2">script</italic> para compilação de <italic id="italic-3">corpora</italic>. Nesse guia, são discutidas limitações, problemas e soluções na nossa abordagem para a condução ética de pesquisa nessa rede social. Nossas preocupações éticas vão além daquilo com o que “concordamos” nos termos de uso e nas políticas de privacidade, isto é, argumentamos que seu conteúdo não abrange todas as questões a que pesquisadoras(es) devem responder. Ademais, nosso objetivo é demonstrar que utilizar o Twitter como uma fonte de dados não requer habilidades computacionais avançadas.</p>
      </abstract>
      <kwd-group>
        <kwd content-type="">Data-collection methods</kwd>
        <kwd content-type="">Social media</kwd>
        <kwd content-type="">Research ethics</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body id="body">
    <sec id="heading-ac5b9ebaf5848a2aebfab7263fa7a046">
      <title>Introduction</title>
      <p id="paragraph-30a4769a5b5088e9feecf08cb557d476">The realm of social media presents a wide range of possibilities for linguistic research, which raises unique methodological challenges and ethical issues (D’ARCY; YOUNG, 2012). One of these challenges is the creation of guidelines, protocols, or standards for the ethical conduct of social media research. The difficulty has several sources: the differences between digital platforms (for example, between Facebook, Instagram, and Twitter) and the specificities of their community standards, terms of use, and privacy policies; the distinct uses that virtual community members make of these platforms; and the variety of ethical questions that arise depending on the type of work being carried out and its context, as well as the norms that govern these virtual spaces.</p>
      <p id="paragraph-2">This tutorial focuses on Twitter, a data-rich microblogging platform that was launched in 2006 and reached 199 million daily active users in 2021. Our aim is to present a ‘how-to’ guide on harvesting Twitter data and compiling a corpus with R (R CORE TEAM, 2020). In doing so, we discuss limitations, problems, and solutions in our framework for conducting ethical research on this social networking site. Beyond corpus compilation, R is a free software environment that can be used for several computational tasks, such as statistical computing and graphics (see BAAYEN, 2008; GRIES, 2009; LEVSHINA, 2015). Although this guide is not an introduction to R for linguists (see OUSHIRO, 2014), to data science (see WICKHAM; GROLEMUND, 2017), or to the <italic id="italic-172e9c29dd2b1d566214443c334d0426">tidyverse</italic> (see WICKHAM <italic id="italic-8f6cf8d7881e34a3a1154dbd455621c7">et al</italic>., 2019), it intends to show that collecting data via Twitter APIs is not as daunting as it can initially seem and does not require advanced computational skills.</p>
      <p id="paragraph-9b519b2e4ab860fc5805e77a46f9d029">Our <italic id="italic-5b09ac2e1506b51c7894589e9965f94b">researchTwitter</italic> script is a list of commands that provides several functions developed to extract and tidy Twitter data, all of which are examined in detail in the following section. We used the <italic id="italic-4ebfefac67fe48654e645f1839c93bd0">rtweet</italic> package (KEARNEY, 2018), which offers a more approachable way to import data, instead of the <italic id="italic-2c0321c6fe502ad063200a9bf71d9623">twitteR</italic> package (GENTRY, 2013), since the former is up to date and actively maintained whereas the latter is deprecated.</p>
      <p id="paragraph-4">It is worth mentioning that this method design results from a variationist sociolinguistic project about stranded prepositions and syntactic variation in Brazilian Portuguese, which had to address the issue of data scarcity (ADAMS, 2020). This undergraduate research project devoted a great deal of attention to methods, a topic that often goes unnoticed in some academic circles. In line with Schilling’s considerations about sociolinguistic field methods (SCHILLING, 2013), we consider that this other kind of methodology plays a crucial role in shaping our data and, as a result, our findings and conclusions; hence our embrace of open access to our code (see EASTERBROOK, 2014; STODDEN, 2011) and the push towards open science, especially principles ‘A’ and ‘R’, which represent some degree of ‘FAIRness’ in our work (see WILKINSON <italic id="italic-92dc9444494c87338a27c921c1613c35">et al</italic>., 2016, p. 4-5).</p>
      <p id="paragraph-604eb7e7056189486d9d1b34f92be7d3">We tried a variety of approaches before reaching this design, so our initial plan for obtaining the text data changed substantially during testing. This process also allowed us to evaluate the strengths and weaknesses of each approach (SCHILLING, 2013) and to arrive at a more polished and precise method. The data-collection process in Adams (2020) resulted in a corpus of approximately ten million words, consisting of roughly 450,000 Tweets.</p>
      <p id="paragraph-6">The last part of this tutorial discusses the ethical issues that emerged when dealing with Twitter data and the data anonymization strategies developed to address them (ADAMS, 2020). It addresses expectations of privacy (D’ARCY; YOUNG, 2012; ZIMMER, 2010) regarding modern technologies, the nature of informed consent, and the role of scholars when engaging in research within virtual social network platforms.</p>
    </sec>
    <sec id="heading-4bdb8d232dcc1c4009fd4b26e460b88b">
      <title>1. Data harvesting</title>
      <p id="paragraph-cd7ac0eab6d4fca616741fc2af59231f">To compile the Twitter corpus through R (R CORE TEAM, 2020), a script was built based on the <italic id="italic-47b49d3d6668f156729d7d738f7b2b53">rtweet</italic> package (KEARNEY, 2018), which provides several functions designed to extract Twitter data. In outline, the script extracts Tweets through Twitter APIs<xref id="xref-9308e48f1d0b9a678598337fcb9697e2" ref-type="fn" rid="footnote-f0ecd757fd7fc925aa874771fae4684f">1</xref>, as demonstrated in Listing 1; then, as shown in Listing 2, it cleans the data by removing the variables that are lists, a type of object in R. The first time the script is run, it also creates a CSV text file to which the data from subsequent extractions are appended (Listings 3 and 4). The last part of the script contains a function that adds more data to the main file (Listing 4). All blocks of our R code have comments indicating what each command does as part of the script. Each comment starts with “#” and ends on the same numbered line.</p>
      <fig id="figure-panel-365873c45acdceb4d6073579084f67b8">
        <label>Figure 1</label>
        <caption>
          <title><bold id="bold-65a1ea92f30ce519d1cb6e0daeed8028">Listing 1.</bold> R function that extracts Tweets<bold id="bold-ea80d075e7e21c4200e60625ea7679c6"/></title>
          <p id="paragraph-d884918d68375c4b8d26d8c6d82ea0c8" />
        </caption>
        <graphic id="graphic-201813a007e7533135a297e3588b245c" mimetype="image" mime-subtype="png" xlink:href="l1.png" />
      </fig>
      <p id="paragraph-250577afe16902c0ad089a11382f130c">Lines 16-23, which concern access to Twitter’s API, deserve particular attention. The keys, tokens, and other credentials needed to fill out those fields can be obtained through Twitter’s Developer Platform, which requires applying for a developer account. There is an Academic Research application available, which gives access to higher levels of data than the standard application. As stated on the Developer Portal, keys are unique identifiers that authenticate your request, and tokens are a type of authorization to gain specific access to data.</p>
      <p id="paragraph-92165c6d3e7cce9a6a4e3e4805d9b143">Another essential aspect is line 34, since <italic id="italic-70e4b38aaf893527328db0af8cdf9640">q</italic> establishes the query to be searched, which is used to filter and select Tweets to be returned. For further information on the arguments of the <italic id="italic-1c88accae0496ac41ce0d314af39000f">search_tweets</italic> function and on indicating multiple terms in the search query, see <italic id="italic-1e64b2cc60d800caaf271030592603ef">rtweet</italic>’s package description (KEARNEY, 2018).</p>
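      <p>As a minimal sketch of how a multi-term query string could be assembled before it is passed to the <italic>q</italic> argument, consider the base R helper below. The function <italic>build_query</italic> and the example search terms are hypothetical and are not part of the <italic>researchTwitter</italic> script; the sketch only relies on the conventions described in the <italic>rtweet</italic> documentation, namely that terms separated by “OR” match any of the terms, that double quotes request an exact phrase, and that the query must not exceed 500 characters.</p>
      <code code-type="example" language="R"><![CDATA[
# Hypothetical helper (not part of the researchTwitter script): joins several
# search terms into a single string suitable for the q argument.
build_query <- function(terms, exact = FALSE) {
  stopifnot(is.character(terms), length(terms) > 0)
  # Wrap each term in double quotes when an exact-phrase match is wanted.
  if (exact) terms <- paste0('"', terms, '"')
  # "OR" (in capitals) asks for Tweets containing at least one of the terms.
  query <- paste(terms, collapse = " OR ")
  # The rtweet documentation limits q to 500 characters.
  stopifnot(nchar(query) <= 500)
  query
}

# Illustrative terms only; replace them with the structures under study.
q_terms <- build_query(c("de que", "com quem"), exact = TRUE)
print(q_terms)
]]></code>
      <p>The resulting string can then be supplied as the query in a search call; the 500-character check fails early instead of sending a request that the API would reject.</p>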
      <p id="paragraph-4c24bc34538de2c3e2e1e358ea5ef959">The decision to remove variables that were lists, as shown in Listing 2, had two motivations: lists are a more flexible data structure than vectors, and we did not find a way to simplify all of these variables correctly, along the lines of what the function <italic id="italic-ac6f86a0d49d36b0b6732dcd7d5053f2">unlist(x)</italic> does to produce vectors. However, most of these variables did not contain metadata relevant to the research in Adams (2020), and we worked around the few issues that arose.</p>
      <fig id="figure-panel-e74482f0c4dd35eeb049648d3751a7b1">
        <label>Figure 2</label>
        <caption>
          <title><bold id="bold-004f6476a69b8c5c870de78f74738111">Listing 2.</bold> R function that removes lists from the object bound to tweets</title>
          <p id="paragraph-35af04a82dccc7f3ff963110a27f9ad7" />
        </caption>
        <graphic id="graphic-57ec5c1a2ee3486ce2d7c6dcc14c584c" mimetype="image" mime-subtype="png" xlink:href="l2.png" />
      </fig>
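      <p>The idea of dropping list variables can be illustrated with a small, self-contained base R sketch. It is not the code shown in Listing 2: the toy data frame, its column names, and the helper <italic>drop_list_columns</italic> are assumptions made for illustration only.</p>
      <code code-type="example" language="R"><![CDATA[
# Toy data frame standing in for the output of a Tweet search; the column
# names and values are illustrative assumptions.
tweets <- data.frame(
  text = c("first tweet", "second tweet"),
  screen_name = c("user_a", "user_b"),
  stringsAsFactors = FALSE
)
# A list-column: each cell may hold a vector of any length.
tweets$hashtags <- list(c("rstats", "linguistics"), character(0))

# Hypothetical helper: keep only the columns that are not lists.
drop_list_columns <- function(df) {
  is_list_col <- vapply(df, is.list, logical(1))
  df[, !is_list_col, drop = FALSE]
}

tweets_flat <- drop_list_columns(tweets)
names(tweets_flat)
]]></code>
      <p>A data frame without list-columns can be written to a CSV file directly, which is one practical reason for this cleaning step.</p>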
      <p id="paragraph-44fbf7b3c6026099204ec5f8616eb12a">Since removing the variables that were lists meant losing information about the Tweets' language codes and location, such as coordinates and place, we included another <italic id="italic-1fa0ba8ce3317e428d0c6b07716b4ffe">rtweet</italic> function, <italic id="italic-a1e48ba60c54bcb2be05e0978e59792d">lookup_coords( )</italic>, which looks up latitude/longitude coordinate information for a specified location (KEARNEY, 2018, p. 36). This required a valid Google Maps API key, which can be obtained through the Google Cloud Platform Console (see KAHLE; WICKHAM, 2013).<xref id="xref-330a9ab2988a067189840f6ffe39ad02" ref-type="fn" rid="footnote-4f4443c29d6226a91c3b96e7c6cb0dde">2</xref> Although the language variable is removed from the extracted data and we cannot fully guarantee that all extracted Tweets come from native speakers of Brazilian Portuguese, structuring the script this way still ensured that it retrieved Tweets written in Portuguese and published in Brazil (see Listing 1, lines 28-30 and line 39).</p>
      <p id="paragraph-c696bb070a1dbbb11c7240f9ab162e3e">Listing 3 shows the R function that creates the main file, to which the Tweets from all temporary tables with the extractions are appended. One important part of this function is the selection of the first row of the data frame with <italic id="italic-99bf3b95853fdd98cd9ae35220ba21a2">slice(1)</italic>, where the variables’ names are placed. This way, the columns remain correctly named after those variables, such as USER_ID, CREATED_AT, and TEXT.</p>
      <fig id="figure-panel-f5a7c27b35628545fd4f40899053f888">
        <label>Figure 3</label>
        <caption>
          <title><bold id="bold-2132503f743fc92857bf8fe13e980ac9">Listing 3.</bold> R function that creates the main file</title>
          <p id="paragraph-754463a8e1af6bd8b572382fc11a86a6" />
        </caption>
        <graphic id="graphic-6f70e07198c8e930a94c564f3fffb4a1" mimetype="image" mime-subtype="png" xlink:href="l3.png" />
      </fig>
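      <p>A comparable step can be sketched in base R as follows. This is an approximation under stated assumptions rather than the function in Listing 3: the helper <italic>create_main_file</italic>, the file name, and the toy data are hypothetical, and base R indexing stands in for <italic>slice(1)</italic>.</p>
      <code code-type="example" language="R"><![CDATA[
# Toy extraction; the column names follow those mentioned in the text and the
# values are invented for illustration.
extraction <- data.frame(
  TEXT = c("first tweet", "second tweet"),
  CREATED_AT = c("2021-01-01 10:00:00", "2021-01-01 10:05:00"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: start a new main file from the first row of an
# extraction, so that the CSV receives a header row with the variable names.
create_main_file <- function(df, path) {
  first_row <- df[1, , drop = FALSE]  # base R counterpart of slice(1)
  utils::write.csv(first_row, path, row.names = FALSE)
  invisible(path)
}

# A temporary directory is used here so that no existing file is touched.
main_path <- file.path(tempdir(), "main_tweets.csv")
create_main_file(extraction, main_path)

main <- utils::read.csv(main_path, stringsAsFactors = FALSE)
names(main)
]]></code>
      <p>Note that <italic>write.csv</italic> replaces any existing file at the same path, so a helper of this kind should only be run once per corpus.</p>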
      <p id="paragraph-03c990f507b7f1c8fe1563a643e4aac9">Listing 4 shows how the main file is updated with the output of each search through temporary tables. The rows from the temporary tables are bound to the main file; line 50 then removes duplicate rows according to the TEXT variable, which is the Tweet’s text, whilst preserving the remaining data. Duplicate rows can also be removed according to a different variable.</p>
      <fig id="figure-panel-cac561deb1b6ed59ae68976d6bf466e9">
        <label>Figure 4</label>
        <caption>
          <title><bold id="bold-cc4eb71a05f75275e4d81d52baf854d7">Listing 4.</bold> R function that attaches new Tweets to the main file</title>
          <p id="paragraph-b63a9467ca201ff60026927ca0ed2065" />
        </caption>
        <graphic id="graphic-3fa43806201298ff84cbef2a2ad02c9b" mimetype="image" mime-subtype="png" xlink:href="l4.png" />
      </fig>
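      <p>The binding and de-duplication logic can be sketched in base R as shown below. The helper <italic>update_main</italic> and the toy tables are hypothetical and do not reproduce Listing 4; with the <italic>dplyr</italic> package, a comparable result can be obtained with <italic>distinct( )</italic> and its <italic>.keep_all</italic> argument.</p>
      <code code-type="example" language="R"><![CDATA[
# Toy tables: the main file as read from disk and a new temporary extraction.
main <- data.frame(
  TEXT = c("a", "b"),
  SCREEN_NAME = c("u1", "u2"),
  stringsAsFactors = FALSE
)
new_rows <- data.frame(
  TEXT = c("b", "c"),
  SCREEN_NAME = c("u3", "u4"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: append new rows and drop duplicates of one variable,
# keeping all other columns of the first occurrence.
update_main <- function(main, new_rows, by = "TEXT") {
  combined <- rbind(main, new_rows)
  # duplicated() flags the second and later occurrences of each value.
  combined[!duplicated(combined[[by]]), , drop = FALSE]
}

updated <- update_main(main, new_rows)
updated$TEXT
]]></code>
      <p>Passing a different column name through the <italic>by</italic> argument changes the variable used to detect duplicates.</p>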
      <p id="paragraph-c1609cfd1ed76f092ba64c58fd20c8ba">As a result, our CSV file contained one header row and the following variables as columns: TEXT, SCREEN_NAME, CREATED_AT, SOURCE, DISPLAY_TEXT_WIDTH, REPLY_TO_SCREEN_NAME, IS_QUOTE, PLACE_NAME, PLACE_TYPE, LOCATION, FOLLOWERS_COUNT, FRIENDS_COUNT, and FAVOURITES_COUNT. The remaining rows contain the extracted Tweets and their metadata.</p>
      <p id="paragraph-349a225e31ccbde86d5abd1c5c01f2c2">Following the overview of these four functions that form our <italic id="italic-de77ea68776b46189818fb53bcafec79">researchTwitter</italic> script, we highlight precisely what a researcher would have to do to produce their own CSV main file. First, one must open the script (preferably in a software application like RStudio), insert one's credentials in <italic id="italic-8115fd1dfeb920569694aed716ff7c36">create_token( )</italic> (Listing 1, lines 16-23), and run each function separately, so that these objects are saved and made available in the environment. Whenever the query argument (Listing 1, line 34) or any other part of the functions is changed, it is necessary to save the script file and run this command line (Listing 5):</p>
      <fig id="figure-panel-be49e0b40f39d9575fbf0d4af80fb5c1">
        <label>Figure 5</label>
        <caption>
          <title><bold id="bold-f0b8010eb6e69f593df86ec1c74fec5b">Listing 5.</bold> R command line to be run after modifying the <italic id="italic-233cdc80bdf6ababf34b87299cd9d19d">researchTwitter</italic> script</title>
          <p id="paragraph-996d4b5b8ce16d50b2a741f101d0e067" />
        </caption>
        <graphic id="graphic-eff63f58008e7c6f57c2c1f90903799f" mimetype="image" mime-subtype="png" xlink:href="l5.png" />
      </fig>
      <p id="paragraph-628dca9d0dc9b01584cd8b48e2f3bc25">Finally, after loading all four functions, <italic id="italic-90c6124124a505d735d7a03a7b2a8b31">atualizaTWEETS_principal( )</italic> can be run several times; each time, it calls <italic id="italic-f87280cfc386a950878b9625427a95b9">puxaTWEET</italic> and <italic id="italic-a236116cb77f59b643791360313a4a15">limpaTWEET</italic> to, respectively, extract more Tweets and remove lists. In other words, since all three other functions are embedded in <italic id="italic-f45f7c44441bc2ef8eaf6f4a3217480d">atualizaTWEETS_principal( )</italic> (Listing 4, lines 15-25, 27, and 29-31), there is no need to run them individually as command lines after they have been saved in the environment. If the working directory is the same and <italic id="italic-39480da5ee07a821ffa7411e3d5bc7a6">criaTABELA_principal( )</italic> is run as a command line after a corpus compilation process has started, the user’s main file will be overwritten. It is important to stress that there is a rate limit on requests within a given time interval, and that the command line in Listing 5 has to be run after, for example, modifications to the search query.</p>
    </sec>
    <sec id="heading-a4275e36f7ba89754dd654b675596bda">
      <title>2. Ethics of social media research</title>
      <p id="paragraph-3601c4d7583140b5ae50a15cc56002ae">Initially, we also intended to build a corpus of Facebook data (ADAMS, 2020), but recent changes in Facebook Platform Terms and Developer Policies restricted what could be done with the <italic id="italic-5a176d42f072b0983db70a6dbce2759f">Rfacebook</italic> package (BARBERA <italic id="italic-4c4bd433b77bfc345877b4b64621c29d">et al</italic>., 2017), which is similar to the <italic id="italic-f75e9c5779dd28f0b7c3ec52e0ab1ad0">rtweet</italic> package (KEARNEY, 2018); thus, we used Twitter as our primary source of data. Recent data scandals, such as the breach involving Facebook and Cambridge Analytica, have led social media platforms to review and limit what type of data can be extracted through their APIs. Given advances in facial recognition, fingerprint sensors, tracking, and other emerging technologies, it is even more crucial to openly (re)discuss our research conduct and practices on virtual platforms.</p>
      <p id="paragraph-67f18712d02a1809cab5c8103be1f95f">According to Sobo and De Munck, “[r]esearches are also expected to ensure that participants’ rights and interests are always protected” (SOBO; DE MUNCK, 1998, p. 23) and, especially concerning linguists, “[r]esearch on language always involve human agents” (ECKERT, 2014, p. 13). Here we argue that it is not enough to be guided by what is stated in terms of use and privacy policies, which we “consent to” before beginning to use any application, technological device, or social media platform. Assuming that those terms focus largely on legal issues and that ethics are not necessarily taken into account in the foundation of companies’ policies, we advocate that researchers need to carefully and critically consider the overlapping ethical and methodological issues in social media research. It is debatable how far academic research can rely on what terms and policies cover, considering that “a survey once found as few as 18 per cent of users may actually read terms conditions agreements” (ZIMMER; PROFERES, 2014 <italic id="italic-bd91ebbf3641b5d8c0afce59879e3679">apud </italic>AHMED <italic id="italic-a137893a73ec221a5fb728285bd7686c">et al.</italic>, 2017, p. 18), which calls into question whether agreeing to those conditions actually constitutes informed consent.</p>
      <p id="paragraph-6c0075b708a3c0b4890e6ea9618d7e73">Beyond institutional and Research Ethics Committee requirements, “[i]n all aspects of research, transparency is critical” (D’ARCY; YOUNG, 2012, p. 538), a principle that all researchers should support, encourage, and respect when doing science. Directly related to the notion of transparency, Eckert (2014) offers a definition of what embodies free and informed consent, one of the principles of ethical research: “consent should not be a matter of getting a signature on paper, but the establishment of an informed working relationship” (ECKERT, 2014, p. 14). She adds that “[i]nformed consent assumes the ability to grasp the implications of participation in the research and to make decisions for oneself” (ECKERT, 2014, p. 16-17).</p>
      <p id="paragraph-eb01e5ba919e12e8af19f760b8f42e0e">The script built with the <italic id="italic-df3cac487d07c97da491b19948eba476">rtweet</italic> package functions (KEARNEY, 2018) collects only information that was already public. In other words, we do not have access to private information from the users whose Tweets are part of our Twitter corpus. No participants were directly approached by the researchers. Furthermore, we took other measures to ensure the participants’ privacy, which meant removing identifiable information from the corpus data set, as explained in detail below (ADAMS, 2020).</p>
      <p id="paragraph-8175f16fb4a2ed214b53f13bb1a606f2">As disclosed in Twitter’s Privacy Policy and Help Center, Tweets are searchable by anyone around the globe through search engines and other third parties, which can retain copies of public information even if it is deleted from Twitter services or the account is deactivated. Users are given tools and settings to object, restrict, or withdraw consent where applicable for the use of data provided to Twitter; they can also choose to share additional information, like e-mail addresses, phone numbers, address book contacts, and public profiles.</p>
      <p id="paragraph-23354416859c513df943eb1e25d05739">When debating the possibility of acquiring written and informed consent from each of the users who produced the vast number of Tweets in our sample, it was clear that an undergraduate research project (ADAMS, 2020) would not be able to reach all those users with consent forms (see AHMED <italic id="italic-7">et al</italic>., 2017), for several reasons. In consideration of the aspects mentioned above, and especially because obtaining written and informed consent was impracticable, we decided to take certain data anonymization measures when disseminating research findings, for example in oral presentations or journal publications. When linguistic data from Tweets in our corpus are cited as examples of the structures under analysis, any information that could lead to pinpointing a user, such as profile pictures, usernames, and display names, is erased. No screenshots of posts, comments, or Tweets are used either.</p>
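      <p>Part of this anonymization could be automated along the lines of the base R sketch below. It is an illustration only, not the procedure used in Adams (2020): the helper <italic>anonymize_for_citation</italic>, the placeholder strings, the regular expressions, and the toy data are all assumptions, and a pass of this kind does not replace manual checks on each cited example.</p>
      <code code-type="example" language="R"><![CDATA[
# Toy corpus rows; the column names follow the CSV described earlier and the
# contents are invented.
corpus <- data.frame(
  TEXT = c("@some_user olha isso https://t.co/abc123", "sem mencoes aqui"),
  SCREEN_NAME = c("user_a", "user_b"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: remove account-naming columns and mask handles and
# links that appear inside the Tweet text.
anonymize_for_citation <- function(df,
                                   id_cols = c("SCREEN_NAME",
                                               "REPLY_TO_SCREEN_NAME")) {
  # Drop columns that name an account, when they are present.
  out <- df[, setdiff(names(df), id_cols), drop = FALSE]
  # Replace @handles and URLs with neutral placeholders.
  out$TEXT <- gsub("@[A-Za-z0-9_]+", "@USER", out$TEXT)
  out$TEXT <- gsub("https?://[^[:space:]]+", "URL", out$TEXT)
  out
}

anon <- anonymize_for_citation(corpus)
anon$TEXT
]]></code>
      <p>Masking handles and links removes the most direct pointers to an account, but the remaining wording of a Tweet can still be distinctive.</p>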
      <p id="paragraph-a7c2c205d68532b76205797415641bf4">Moreover, users could be identified by searching for precise strings from Tweets, for example by enclosing an entire phrase in quotation marks on Google Search. To prevent this, no exact quotations are used in any material: excerpts are cut or secondary lexical items are replaced with synonyms. In this way, we aim to prevent reverse searches and to keep users anonymous. As a result, not even the only people with access to the corpus are able to trace the original Tweet afterwards.</p>
      <p id="paragraph-8">We also do not use data with controversial content, such as any form of intolerance, discrimination, or prejudice regarding gender, gender identity and expression, age, sexual orientation, disability, physical appearance, body size, race, ethnicity, or religion. This type of content was not removed from the corpus; however, even if part of such a text was coded in the process of analyzing variants of our variable of interest (ADAMS, 2020), those Tweets were not considered in any way as potential examples in material describing our research findings. This decision was made considering that sensitive content, such as political discourse, could lead someone to attempt to find the author of a certain Tweet, which would compromise their privacy and break their anonymity.</p>
      <p id="paragraph-bf30c6bd0946ddf266cb67a1ea3dce43">Although an ideal solution would have been to reach out to those thousands of Twitter users individually with consent forms, the research project (ADAMS, 2020) took the ethical standpoint of not disclosing our Twitter corpus, among other measures previously described, to tackle the ethical and methodological challenges of doing research on a social networking site. In agreement with D’Arcy and Young (2012), in spite of the fact that users are tweeting in a public space, the “content is networked between actors with different privacy expectations” (D’ARCY; YOUNG, 2012, p. 542). As scholars, this expands our concerns over consent and privacy.</p>
      <p id="paragraph-10">Finally, this project aimed to contribute to a critical open dialogue between researchers regarding the emerging and unique challenges of engaging in research within rapidly evolving online social network platforms: “[t]hese include challenges to the traditional nature of consent, properly identifying and respecting expectations of privacy on social network sites, developing sufficient strategies for data anonymization […]” (ZIMMER, 2010, p. 323). This shift from physical to virtual spaces also requires that institutional review boards have a better understanding of these other spheres (see D’ARCY; YOUNG, 2012; ZIMMER, 2010), to make headway and avoid potential shortcomings when overseeing research projects that retrieve data from social media.</p>
    </sec>
    <sec id="heading-2782ece169d0ea35c9e9106a0cbaec44">
      <title>3. Acknowledgments</title>
      <p id="paragraph-d33cb752ce86852f9ce6675eea131f0b">The development of the data-collection method detailed here was possible due to São Paulo Research Foundation’s grant (Process number 2018/24511-8) and Livia Oushiro’s academic advising. Thanks to Gabriel Catani for his (online) feedback on this tutorial and to Natasha Mourão for reviewing the manuscript.</p>
    </sec>
    <sec id="heading-dc876fdbcdda9d9680015e75ca301d0a">
      <title>References</title>
      <p id="paragraph-398b58c4ba3beb0c3c571d6f531045be">ADAMS, Julia Bahia. <bold id="bold-1">Um estudo sobre preposition stranding e orphaning em falantes de português brasileiro</bold>. Final report, grant no. 18/24511-8, São Paulo Research Foundation (FAPESP), 2020.</p>
      <p id="paragraph-3">AHMED, Wasim; BATH, Peter; DEMARTINI, Gianluca. Chapter 4 Using Twitter as a Data Source: An Overview of Ethical, Legal, and Methodological Challenges<italic id="italic-982bd0d7baffd390be9825e7fd5f1280">.</italic> In: WOODFIELD, Kandy (ed.). <bold id="bold-2">The Ethics of Online Research</bold>. <bold id="bold-3">Advances in Research Ethics and Integrity</bold> (2). Emerald, pp. 79-107, 2017. Available at: <ext-link id="external-link-1" xlink:href="http://eprints.whiterose.ac.uk/126729/">http://eprints.whiterose.ac.uk/126729/</ext-link>. Accessed: 30 Jun. 2021.</p>
      <p id="paragraph-5">BAAYEN, R. Harald. <bold id="bold-4">Analyzing Linguistic Data</bold>. New York: Cambridge University Press, 2008.</p>
      <p id="paragraph-7">BARBERA, Pablo; PICCIRILLI, Michael; GEISLER, Andrew; VAN ATTEVELDT, Wouter. <bold id="bold-5">Rfacebook: Access to Facebook API via R</bold>. Comprehensive R Archive Network, 2017. Available at: <ext-link id="external-link-2" xlink:href="https://CRAN.R-project.org/package=Rfacebook">https://CRAN.R-project.org/package=Rfacebook</ext-link>. Accessed: 11 Mar. 2019.</p>
      <p id="paragraph-9">D’ARCY, Alexandra; YOUNG, Taylor Marie. Ethics and social media: Implications for sociolinguistics in the networked public. <bold id="bold-6">Journal of Sociolinguistics</bold>, v. 16, n. 4, p. 532-546, Sep. 2012. DOI <ext-link id="external-link-3" xlink:href="https://doi.org/10.1111/j.1467-9841.2012.00543.x">10.1111/j.1467-9841.2012.00543.x</ext-link>.</p>
      <p id="paragraph-11">EASTERBROOK, Steve. Open code for open science? <bold id="bold-7">Nature Geoscience</bold>, v. 7, n. 11, p. 779-781, Nov. 2014. DOI <ext-link id="external-link-4" xlink:href="https://doi.org/10.1038/ngeo2283">10.1038/ngeo2283</ext-link>.</p>
      <p id="paragraph-13">ECKERT, Penelope. “Ethics in linguistic research”. <italic id="italic-6453ea79bc9a9cd440b3103140947fcf">In:</italic> PODESVA, Robert J.; SHARMA, Devyani. (Eds.). <bold id="bold-8">Research Methods in Linguistics</bold>. New York: Cambridge University Press, 2014. p. 11-26.</p>
      <p id="paragraph-15">GENTRY, Jeff. <bold id="bold-9">twitteR: R Based Twitter Client</bold>. Comprehensive R Archive Network, 2013. Available at: <ext-link id="external-link-5" xlink:href="https://CRAN.R-project.org/package=twitteR">https://CRAN.R-project.org/package=twitteR</ext-link>. Accessed: 4 Jan. 2019.</p>
      <p id="paragraph-17">GRIES, Stefan Thomas. <bold id="bold-10">Quantitative corpus linguistics with R: a practical introduction</bold>. 1. ed. New York: Routledge, 2009.</p>
      <p id="paragraph-19">KAHLE, David; WICKHAM, Hadley. ggmap: Spatial Visualization with ggplot2. <bold id="bold-11">The R Journal</bold>, v. 5, n. 1, p. 144-161, 2013. Available at: <ext-link id="external-link-6" xlink:href="https://journal.r-project.org/archive/2013-1/kahle-wickham.pdf">https://journal.r-project.org/archive/2013-1/kahle-wickham.pdf</ext-link>. Accessed: 30 Mar. 2018.</p>
      <p id="paragraph-21">KEARNEY, Michael W. <bold id="bold-12">rtweet: Collecting Twitter data</bold>. Comprehensive R Archive Network, 2018. DOI <ext-link id="external-link-7" xlink:href="https://doi.org/10.5281/zenodo.2528481">10.5281/zenodo.2528481</ext-link>.</p>
      <p id="paragraph-22">LEVSHINA, Natalia. <bold id="bold-13">How to do Linguistics with R: Data exploration and statistical analysis</bold>. 1. ed. Amsterdam/Philadelphia: John Benjamins Publishing Company, 2015.</p>
      <p id="paragraph-23">OUSHIRO, Livia. “Tratamento de dados com o R para análises sociolinguísticas”. <italic id="italic-23fec7e900ae2e1a80d2b257f165bd2c">In: </italic>FREITAG, Raquel Meister Ko. (Org.). <bold id="bold-14">Metodologia de Coleta e Manipulação de Dados em Sociolinguística</bold>. São Paulo: Editora Edgard Blücher, 2014. p. 134-177. DOI <ext-link id="external-link-8" xlink:href="https://doi.org/10.5151/BlucherOA-MCMDS-10cap">10.5151/BlucherOA-MCMDS-10cap</ext-link>.</p>
      <p id="paragraph-24">OUSHIRO, Livia. <bold id="bold-15">Identidade na pluralidade: avaliação, produção e percepção linguística na cidade de São Paulo</bold>. 2015. Doctoral dissertation (Letters) – Faculdade de Filosofia, Letras e Ciências Humanas, Universidade de São Paulo, São Paulo, 2015. DOI <ext-link id="external-link-9" xlink:href="https://doi.org/10.11606/T.8.2015.tde-15062015-104952">10.11606/T.8.2015.tde-15062015-104952</ext-link>.</p>
      <p id="paragraph-25">R CORE TEAM. <bold id="bold-16">R: A language and environment for statistical computing</bold>. R Foundation for Statistical Computing, Vienna, Austria, 2020. Available at: <ext-link id="external-link-10" xlink:href="https://www.R-project.org">https://www.R-project.org</ext-link>.</p>
      <p id="paragraph-26">SCHILLING, Natalie. <bold id="bold-17">Sociolinguistic Fieldwork</bold>. 1. ed. New York: Cambridge University Press, 2013.</p>
      <p id="paragraph-27">SOBO, Elisa J.; DE MUNCK, Victor C. “The Forest of Methods”. <italic id="italic-4">In: </italic>DE MUNCK, Victor C.; SOBO, Elisa J. (Eds.) <bold id="bold-18">Using Methods in the Field: a practical introduction and casebook</bold>. Walnut Creek: Altamira Press, 1998. p. 13-37.</p>
      <p id="paragraph-29">STODDEN, Victoria. Trust Your Science? Open Your Data and Code. <bold id="bold-19">Amstat News</bold>, Alexandria, 1 Jul. 2011. Available at: <ext-link id="external-link-11" xlink:href="https://magazine.amstat.org/blog/2011/07/01/trust-your-science/">https://magazine.amstat.org/blog/2011/07/01/trust-your-science/</ext-link>. Accessed: 22 Sep. 2021.</p>
      <p id="paragraph-31">WICKHAM, Hadley; GROLEMUND, Garrett. <bold id="bold-20">R for Data Science:</bold> <bold id="bold-21">Import, Tidy, Transform, Visualize, and Model Data</bold>. 1. ed. Sebastopol: O'Reilly Media, Inc., 2017.</p>
      <p id="paragraph-33">WICKHAM, Hadley <italic id="italic-5">et al</italic>. Welcome to the tidyverse. <bold id="bold-22">Journal of Open Source Software</bold>, v. 4, n. 43, p. 1686, 2019. DOI <ext-link id="external-link-12" xlink:href="https://doi.org/10.21105/joss.01686">10.21105/joss.01686</ext-link>.</p>
      <p id="paragraph-35">WILKINSON, Mark D. <italic id="italic-6">et al</italic>. The FAIR Guiding Principles for scientific data management and stewardship. <bold id="bold-23">Scientific Data</bold>, v. 3, n. 1, 2016. DOI <ext-link id="external-link-13" xlink:href="https://doi.org/10.1038/sdata.2016.18">10.1038/sdata.2016.18</ext-link>.</p>
      <p id="paragraph-37">ZIMMER, Michael. “But the data is already public”: on the ethics of research in Facebook. <bold id="bold-24">Ethics and Information Technology</bold>, v. 12, n. 4, p. 313-325, 2010. DOI <ext-link id="external-link-14" xlink:href="https://doi.org/10.1007/s10676-010-9227-5">10.1007/s10676-010-9227-5</ext-link>.</p>
    </sec>
  </body>
  <back>
    <fn-group>
      <fn id="footnote-f0ecd757fd7fc925aa874771fae4684f">
        <label>1</label>
        <p id="paragraph-06f6f820d598643591d92fdf66ddfbdc">An application programming interface (API) is a set of routines and conventions for building software applications and integrating systems; it acts as a bridge that lets applications connect, communicate, and share data.</p>
      </fn>
      <fn id="footnote-4f4443c29d6226a91c3b96e7c6cb0dde">
        <label>2</label>
        <p id="paragraph-8c81d55f717a129c4df52dde0b1834a0">If this type of information is not relevant to a given research topic, lines 30 and 39 of Listing 1 can be commented out without any issue. This also points to the importance of reading package documentation: arguments and functions can be added or removed, which makes it possible to assemble something better suited to a different research design.</p>
      </fn>
    </fn-group>
  </back>
</article>