<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.2 20190208//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:ali="http://www.niso.org/schemas/ali/1.0">
  <front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">Cadernos de Linguística</journal-id>
<journal-title-group>
<journal-title>Cadernos de Linguística</journal-title>
</journal-title-group>
<issn pub-type="epub">2675-4916</issn>
<publisher>
<publisher-name>Associação Brasileira de Linguística</publisher-name>
</publisher>
</journal-meta>
    <article-meta>
<article-id pub-id-type="doi">10.25189/2675-4916.2021.V2.N3.ID399</article-id>
      <article-categories>
        <subj-group>
          <subject content-type="Type of Contribution">Research Report</subject>
        </subj-group>
      </article-categories>
      <title-group>
        <article-title>QUANTIFYING THE DIFFERENCES BETWEEN LEXICAL CATEGORIES</article-title>
        <subtitle>THE CASE OF PRONOUNS AND DETERMINATIVES IN ENGLISH</subtitle>
      </title-group>
      <contrib-group content-type="author">
        <contrib id="person-3c3c60556173f8d573eb113c2d6e07bd" contrib-type="person" equal-contrib="no" corresp="yes" deceased="no">
          <name>
            <surname>Reynolds</surname>
            <given-names>Brett</given-names>
          </name>
          <email>brett.reynolds@humber.ca</email>
          <xref ref-type="aff" rid="affiliation-59e5a1663c487e892bc737b74328acd6" />
        </contrib>
      </contrib-group>
      <contrib-group content-type="editor">
        <contrib id="person-f6e93de22d5a621eea9c13c16a4230ff" contrib-type="person" equal-contrib="no" corresp="no" deceased="no">
          <name>
            <surname>Oliveira</surname>
            <given-names>Miguel</given-names>
            <suffix>Jr</suffix>
          </name>
          <email>miguel@fale.ufal.br</email>
          <xref ref-type="aff" rid="affiliation-cacaa336e1c0380ea054bce3cbeed908" />
        </contrib>
        <contrib id="person-bbf699a10b319dd73c8522b6a904fa4c" contrib-type="person" equal-contrib="no" corresp="no" deceased="no">
          <name>
            <surname>Almeida</surname>
            <given-names>René Alain</given-names>
          </name>
          <email>renealain@hotmail.com</email>
          <xref ref-type="aff" rid="affiliation-cebc39cbb6f1834bd64f5407ca61b830" />
        </contrib>
      </contrib-group>
      <aff id="affiliation-59e5a1663c487e892bc737b74328acd6">
        <institution content-type="orgname">Humber College</institution>
      </aff>
      <aff id="affiliation-cacaa336e1c0380ea054bce3cbeed908">
        <institution content-type="orgname">Universidade Federal de Alagoas</institution>
      </aff>
      <aff id="affiliation-cebc39cbb6f1834bd64f5407ca61b830">
        <institution content-type="orgname">Universidade Federal de Sergipe</institution>
      </aff>
      <pub-date date-type="pub" iso-8601-date="2021-08-27" />
      <volume>2</volume>
      <issue>3</issue>
      <issue-title>Linguistweets</issue-title>
      <elocation-id>e399</elocation-id>
      <history>
        <date date-type="accepted" iso-8601-date="2021-08-17" />
        <date date-type="received" iso-8601-date="2021-07-30" />
      </history>
      <permissions id="permission">
        <license>
          <ali:license_ref>http://creativecommons.org/licenses/by/4.0/</ali:license_ref>
        </license>
      </permissions>
      <abstract>
        <p id="_paragraph-1">The Cambridge grammar of the English language (HUDDLESTON; PULLUM, 2002; hereafter CGEL) attempts to present a comprehensive and rigorous description of Modern Standard English. Much of the book is taken up with describing the properties of the various lexical categories, including determinative and pronoun. The distinction between these categories has been questioned by various authors for English (ABNEY, 1987; CROFT, 2001; HUDSON, 2004; MATTHEWS, 2014; POSTAL, 2014/1966; SOMMERSTEIN, 1972) and for other languages (e.g., NAU, 2016). Here, I employ energy distance, a novel family of non-parametric statistics, to adjudicate between these positions. Following Crystal (1967), I encode the features of the determinatives and pronouns from CGEL as binary values (has/doesn’t have the feature) in a matrix of 138 word forms by 232 features. The results provide support for CGEL’s analysis (<italic id="italic-2">k-</italic>groups produces a 93% correspondence with CGEL’s categorization) and show that energy distance statistics applied to such matrices can help us adjudicate between competing lexical category analyses without resorting to methodological opportunism (CROFT, 2001).</p>
      </abstract>
      <abstract abstract-type="executive-summary">
        <title>Resumo</title>
        <p id="paragraph-02f994ad19f88df58a61dcaca08d9c4a">The Cambridge grammar of the English language (HUDDLESTON; PULLUM, 2002) tenta apresentar uma descrição abrangente e rigorosa do inglês padrão moderno. Muito do livro é dedicado à descrição das propriedades das várias categorias lexicais, incluindo determinativos e pronomes. A distinção entre essas categorias foi questionada por vários autores em inglês (ABNEY, 1987; CROFT, 2001; HUDSON, 2004; MATTHEWS, 2014; POSTAL, 2014/1966; SOMMERSTEIN, 1972) e outras línguas (por exemplo, NAU, 2016). Aqui, eu emprego energy distance, uma nova família de estatísticas não paramétricas, para julgar entre essas posições. Seguindo Crystal (1967), codifico binariamente as características (tem / não tem característica) dos determinativos e pronomes do CGEL em uma forma de 138 palavras por 232 características matriciais. Os resultados fornecem um suporte para a análise do CGEL (<italic id="italic-406fbcf1babcc1367fc2ee0123422e8d">k</italic>-groups produzem uma correspondência de 93% com a categorização do CGEL) e mostram que as estatísticas de energy distance aplicadas a tais matrizes podem nos ajudar a decidir entre análises de categorias lexicais concorrentes sem recorrer ao oportunismo metodológico (CROFT, 2001).</p>
      </abstract>
      <kwd-group>
        <kwd content-type="">Lexical categories</kwd>
        <kwd content-type="">Parts of speech</kwd>
        <kwd content-type="">Determinatives</kwd>
        <kwd content-type="">Pronouns</kwd>
        <kwd content-type="">Energy distance</kwd>
        <kwd content-type="">Methodological opportunism</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body id="body">
    <sec id="heading-ac5b9ebaf5848a2aebfab7263fa7a046">
      <title>Introduction</title>
      <p id="paragraph-aae7e218a9644db3e9011f5b5b383ee6">Lexical categories are a matter of perennial dispute in linguistics. This is true as much for descriptive categories as for comparative concepts (HASPELMATH, 2018) and applies regardless of how carefully studied the language is. In 1967, Crystal concluded that “word classes in English are more complex things than is still generally supposed; and … before we can produce a set of satisfactory definitions, we need to examine the distribution of single words much more thoroughly” (CRYSTAL, 1967, p. 55). Many would say that despite the passage of over half a century, lexical categories are still more complex things than is generally supposed and that we still lack a set of satisfactory definitions, even for the lexical categories of English, one of the most comprehensively studied languages.</p>
      <p id="paragraph-ee4aed4134afe4edcdadb9a5a43b1108">Of the lexical categories, perhaps determinatives and pronouns throw up the most disagreement. “The connection between pronouns and determinatives is striking in many languages. If they are not to be grouped into one class, their relationship must be clarified”<xref id="xref-710ce4786d61b6e2eb3698499e360e9f" ref-type="fn" rid="footnote-f0ecd757fd7fc925aa874771fae4684f">1</xref> (NAU, 2016, p. 96). And this is true even of English.</p>
      <disp-quote>
        <p id="paragraph-7d66de879a4206bdf779a5a6b4469099">Although the term ‘determiner’ [or the term I use here, DETERMINATIVE] has been in common use for nearly half a century, there is no consensus in detail, especially among recent authorities, either as to what words should be assigned to such a part of speech or, if categories and functions are distinguished, what the units are that have a ‘determining’ function. This is true both intensionally, of the features by which determiners have been defined, and extensionally, of the range of individual words included. (MATTHEWS, 2014, p. 69)</p>
      </disp-quote>
      <p id="paragraph-4780b7a9130d43aa097140a442f3d090">Here the greatest confusion is between determinatives and pronouns: is <italic id="italic-04f5b7da9b94871c3941f5feb8c42a6e">my </italic>a determinative because it appears before a noun, helping us to pick out its referent, or is it a pronoun because of the morphological similarities it shares with other pronouns? Is <italic id="italic-590436c615e7dd328439b8be4efd1ca4">many </italic>always a determinative or is it a pronoun when it stands alone as a subject (e.g., <italic id="italic-810b43c4fea3b53e2b75085bb5a02b75"><underline id="underline-1">many</underline> were swayed</italic>)? The Cambridge grammar of the English language (HUDDLESTON; PULLUM, 2002, hereafter CGEL) attempts a detailed and systematic answer to these questions. It describes the determiner function and the categories of determinative and pronoun in exquisite detail over almost 100 pages in chapter 5, starting on page 354. Yet doubt and disagreement remain.</p>
      <p id="paragraph-8a726381461220ef4d43a4cac3bd94f9">Working in early transformational theories, POSTAL (2014/1966) argues “the so-called pronouns <italic id="italic-4">I</italic>, <italic id="italic-5">our</italic>, <italic id="italic-6">they</italic>, etc. are really articles, in fact types of definite article. However, article elements are only introduced as segments in intermediate syntactic structures” (p. 15). SOMMERSTEIN (1972) makes the converse claim, but both deny a fundamental difference between determinatives and pronouns. ABNEY (1987) takes up Postal’s arguments and develops them in his so-called “DP analysis”. HUDSON (2004), in a more modern theory, word grammar (HUDSON, 2007), dispenses with the deep-structure arguments and straightforwardly argues that determinatives “are a subset of pronouns ranging over many of the traditional pronoun types — demonstrative, possessive, interrogative and so on” (p. 10). CROFT (2001) goes a step further and actually calls the whole enterprise into question, accusing researchers attempting to establish lexical categories of “methodological opportunism,” and claiming, “there is no a priori way to decide which of several constructions with mismatching distributions, or which subset of constructions, should be chosen as criteria for identifying the category in question” (p. 41).</p>
      <p id="paragraph-116c51351213bb9d4d1d7cb63e512dbd">CROFT’s observation highlights a difficulty with the way the enterprise of lexical categorization has been conducted, which is through logical argumentation. Such an approach could be definitive if necessary and sufficient conditions were identified, but this has not been possible; exceptions abound, and consensus is far from forthcoming. So, as CRYSTAL says, “the only realistic solution seems to be statistical” (1967, p. 45). This would involve not picking and choosing from among the possible criteria, but rather setting out all the likely<xref id="xref-5a9dcf1798865ca3e9fcfb23ef3faaaa" ref-type="fn" rid="footnote-6af6b26cb08147f20056f24b2e794846">2</xref> criteria and then analysing them.</p>
      <p id="paragraph-46ca7bec7f2e230c0b0fa9b348a89cf4">The first step in such an approach was described by CRYSTAL (1967). An example of what he had in mind can be seen in Table 1, reproduced in part below, where he set out a matrix with rows for “adverb” lexemes and columns for the following criteria:</p>
      <p id="paragraph-67503cd67ab5683e36e388011fd284f5" />
      <p id="paragraph-298c11eef22f15b6a753df68b54ba1a7">(1) ability to occur immediately before or after verb, viz. Subject (Adverb) Verb (Adverb)</p>
      <p id="paragraph-c4c716dc815d96a24938deb24d874f41">(2) ability to take intensifier without preceding determiner</p>
      <p id="paragraph-806847ef34ac1f40dddd37647d77c18e">(3) ability to occur initially (mobility criterion) in sentence</p>
      <p id="paragraph-1cfa71ebd5c56df820594b1ad2ac8b8b" />
      <table-wrap id="table-figure-c16b5c7930d4e39aa10f323a495dfc2e">
        <label>Table 1</label>
        <caption>
          <title>A matrix of words and features with binary categorical coding (reproduced from CRYSTAL, 1967, p. 52)</title>
        </caption>
        <table id="table-f97ea53512261e19dea4e816427e8283">
          <tbody>
            <tr id="table-row-734b87722117ff621d2670f03d4bf38a">
              <th id="table-cell-e9ff31cab5e63beafcc50faaa1c89767" />
              <th id="table-cell-dbd8f092be11f649b93c1e0fc8261504">1</th>
              <th id="table-cell-125aeb4d994dad176d9fa20d817c9ee9">2</th>
              <th id="table-cell-106365943eccd95369f31b420b3e5a03">3</th>
            </tr>
            <tr id="table-row-56823ddc9704dc06708ac30e6f5f9db6">
              <td id="table-cell-494eb2c375418cc3746288e1cce879e1">
                <italic id="italic-bf139cdf7d4ac452c41c4bc4a05e178b">asleep</italic>
              </td>
              <td id="table-cell-df288ef85e5ecfe4737b7463b7b9eea8">+</td>
              <td id="table-cell-7456493f12ce3a97c6449f1c01f110a6">+</td>
              <td id="table-cell-85b2027057dc7958ff33bf858a317714">+</td>
            </tr>
            <tr id="table-row-a6ca0377a9f4e8d5d7736f429beace59">
              <td id="table-cell-8dfe2019165a0040534b9442a647cb14">
                <italic id="italic-8f751246aa0b1cbc59d82cc30a958280">inside, downstairs</italic>
              </td>
              <td id="table-cell-a34533e912422dccb1c4351b00080e00">-</td>
              <td id="table-cell-9fe94d09a8ac2871f38a4f874eca7c7e">-</td>
              <td id="table-cell-da467e8df81cf63d3fee71862088f1c2">+</td>
            </tr>
            <tr id="table-row-ef7c561e40d3f31047efd2e71d1d923a">
              <td id="table-cell-3c9687541b5b6ae565412420ce0f6a0e">
                <italic id="italic-c129997e0b04782e98f4df65095807e3">alike</italic>
              </td>
              <td id="table-cell-3fe99345a4d21667aa85826857ae1e19">+</td>
              <td id="table-cell-6be2dd983a23e7031c56df5045bdf77c">+</td>
              <td id="table-cell-75138f8ddb941283d02911e1bbf6a7ea">?-</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
      <p id="paragraph-6c81512cf6390616e4f60c231d43dcb7">The criteria here are syntactic, but there is no a priori reason to rule out phonological, grammatical, lexical, or semantic criteria (CRYSTAL, 1967).</p>
      <p id="paragraph-6d2acb36dc5388372d34c576d4836c71">What CRYSTAL had in mind, though, is something like a simple tally. “One would always expect a coherent word class to have at least one criterion with 100% applicability, to justify one's intuition of coherence” (p. 45), and then other criteria could be ranked by the number of words to which they apply. I reject that approach as unworkable. There are almost never criteria with 100% applicability when it comes to the lexical categories that linguists, lexicographers, language teachers, and language learners find most useful. But lexical categories are useful to the extent that they help us predict features of their members, even if they are not perfectly reliable. More importantly, the more features that a given lexical category suggests, the more useful that categorization is.</p>
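      <p>The tally Crystal describes can be made concrete with a small sketch. The Python below is an illustration only, not part of the original analysis: it codes the rows of Table 1 as 1 or 0 and computes the share of words to which each criterion applies. The criterion labels are hypothetical shorthand for criteria (1) to (3), the single row for inside and downstairs is split into two rows, and the doubtful “?-” judgement is coded as 0; all three choices are assumptions.</p>
      <code language="python"><![CDATA[
```python
# Rows follow Table 1 (Crystal, 1967). "+" is coded 1 and "-" is coded 0.
# Assumptions: the labels below are hypothetical shorthand for criteria (1)-(3),
# "inside, downstairs" is split into two rows, and the doubtful "?-" is coded 0.
CRITERIA = ["adjacent_to_verb", "takes_intensifier", "sentence_initial"]

TABLE_1 = {
    "asleep": ["+", "+", "+"],
    "inside": ["-", "-", "+"],
    "downstairs": ["-", "-", "+"],
    "alike": ["+", "+", "?-"],
}


def encode(mark):
    """Binary coding: 1 only for an unqualified '+'."""
    return 1 if mark == "+" else 0


matrix = {word: [encode(m) for m in marks] for word, marks in TABLE_1.items()}


def applicability(rows, column):
    """Share of words that exhibit the criterion in the given column."""
    values = [row[column] for row in rows.values()]
    return sum(values) / len(values)


tally = {name: applicability(matrix, i) for i, name in enumerate(CRITERIA)}
for name, share in tally.items():
    print(f"{name}: {share:.0%}")
```
]]></code>
      <p>On these four words, the shares come to 50%, 50%, and 75%, so no criterion applies to every word, which is the situation that makes a tally requiring a fully applicable criterion hard to use.</p>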
      <p id="paragraph-50d3f661cd814d831faa296a6f6d9ffb">Statistics can help us to evaluate our categorizations by asking how likely the observed differences between categories would be if the words had been assigned to those categories at random. This is usually expressed as a <italic id="italic-f3d75d7646dbdb3cf5a1f64deed99a18">p</italic>-value, ranging from 1 (indistinguishable from random) to 0 (impossible under a random distribution). We can then argue that, allowing for some error, a categorization that leads to a lower <italic id="italic-0f3be17fbb5791b7ebb7544adb1293ee">p</italic>-value over a given set of features is better than one that leads to a higher value.</p>
      <p id="paragraph-6048e02d7faddc0eb26ec7e0db69393f">Another approach is to use an unsupervised learning algorithm to analyze the data and create clusters of words. These clusters could then be compared against a hypothesis. Clustering does not produce a <italic id="italic-64bcbe996628dfa782d1c7b46c235371">p</italic>-value, but it can produce plots that can be examined to see if the groupings are like the categories and sub-categories set out in a grammar such as CGEL.</p>
    </sec>
    <sec id="heading-0b2695408c3206933bb7945b07e423da">
      <title>1. Energy distance statistics</title>
      <p id="paragraph-778c289b7fb886e488d4e553988a0fe4">Energy distance is a recently developed family of non-parametric statistics (RIZZO; SZÉKELY, 2016). There is also a related package for R (R CORE TEAM, 2019) called energy (RIZZO; SZÉKELY, 2019), which allows researchers to calculate these statistics relatively easily.</p>
      <p id="paragraph-594772d0969ec87edbd282478f7331bb">Statistics can be classified as parametric or non-parametric. In parametric statistics, a model is selected to reflect some assumptions about your data. An example might be a linear model. Once a model has been selected, the degree of fit between the model and the data is calculated. This is done because “it is generally much easier to estimate a set of parameters … than it is to fit an entirely arbitrary function” (JAMES, 2017, p. 22). Parametric statistics require relatively few observations compared to non-parametric statistics. Because of these advantages, many parametric options are available, and parametric statistics are often preferred by researchers.</p>
      <p id="paragraph-d6fecd009a12086fcf2370e82e7c7374">However, in many cases the assumptions cannot be met. Models commonly assume, for example, that the data is normally distributed or that there is constant variance. In the case of matrices like that in Table 1, the data is in the form of binary categorical variables (e.g., has genitive case), instead of discrete or continuous variables, so the error is neither normally distributed, nor is there constant variance. A case like this requires non-parametric statistics, which “seek an estimate of <italic id="italic-12091bcb3ebcbfa306d2a34e30199f14">f</italic> that gets as close to the data points as possible without being too rough or wiggly” (JAMES, 2017, p. 23). Unfortunately, fewer non-parametric options exist, and they are often less familiar to researchers, but energy distance is now available.</p>
      <p id="paragraph-449cb663ffb1dd76723d7628b28067d6">The analysis I would like to conduct here includes multiple features (e.g., has genitive case, starts with /ð/, marks an NP as definite, etc.), which is to say that it is a multivariate analysis. Typically, MANOVA would be applicable for this kind of situation, but MANOVA won’t work because it is a parametric model requiring “that random error is normally distributed with mean zero and constant variance” (RIZZO; SZÉKELY, 2010, p. 1034). This is precisely the kind of situation where energy statistics will work. Also, the common problem of too few observations underpowering non-parametric statistics is not an issue because the list of words can include the entire universe of CGEL determinatives and pronouns.</p>
      <p id="paragraph-4534f75bf7ae2f4316d7975efdce1632">As far as I can discern, energy statistics have not been previously employed in linguistics research, perhaps because they have not been available for long. This study will then be an evaluation of these statistics for lexical categorization as well as an evaluation of the categories set out in CGEL. This weighing of two approaches to categorization is akin to the epistemology behind statistics like inter-rater reliability estimates. In both cases, we cannot be sure of the validity of any individual rater’s judgement, but if multiple raters converge on similar judgements, then this has been seen as the basis for taking their judgements to be reliable, reliability being one aspect of construct validity (MESSICK, 1995). Similarly, if CGEL categories are well supported by energy statistics, this provides evidence to support the validity of each.</p>
      <sec id="heading-1f9f27a24bebc38d29ae4a7d55faad75">
        <title>1.1. <italic id="italic-1adabd92201b6d6a91d856188d7271b0">K</italic>-groups clustering</title>
        <p id="paragraph-b4a8f0d0c36856cb44f023eed46575f0">The <italic id="italic-84656cc48293e75c6edfc9b8b857b526">k-</italic>groups unsupervised learning algorithm developed by LI (2015) is part of the energy distance family. It is a generalization of the more familiar parametric <italic id="italic-cea878a19befb0d736b546eac72c0dbd">k-</italic>means algorithm. <italic id="italic-54696bc8862471ab0ae89086d440ee9f">K-</italic>groups measures the statistical distance between pairs of clusters and searches for “the best partition which maximizes the total between-clusters energy distance” (LI, 2015). Typically, <italic id="italic-166d8a3bf52c8a39b331eb726e13bcc8">k</italic>-means would be applicable for discovering clusters like lexical categories, but, like MANOVA, <italic id="italic-9e9829220f9ba5097fab56b770dac85d">k-</italic>means will not work with the data used here for the reasons explained above (LI, 2015, p. 49). The <italic id="italic-e4931a093152fc0a7e75b2f2cfb2d60a">k</italic>-groups method overcomes these limitations.</p>
        <p id="paragraph-9128c25ed3f59955ba1c00458b885d31">Once the words have been assigned to clusters by <italic id="italic-7">k-</italic>groups, it is possible to compare those clusters to CGEL’s categories to see how much overlap exists.</p>
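        <p>The idea of searching for the partition with the largest between-cluster energy distance can be illustrated with a toy example. The sketch below is not LI’s (2015) algorithm and not the energy package’s implementation: it simply tries every split of six words into two clusters, which is only feasible for very small data. The feature values are invented for illustration and are not taken from the study’s matrix, and the use of the Hamming distance (the number of differing binary features) is an assumption.</p>
        <code language="python"><![CDATA[
```python
from itertools import combinations

# Invented toy rows (NOT the study's codings): four hypothetical binary features.
TOY = {
    "the": [1, 1, 0, 0],
    "this": [1, 1, 0, 1],
    "some": [1, 1, 0, 0],
    "she": [0, 0, 1, 1],
    "him": [0, 0, 1, 0],
    "them": [0, 0, 1, 1],
}


def hamming(a, b):
    """Number of features on which two binary rows differ (assumed metric)."""
    return sum(x != y for x, y in zip(a, b))


def mean_dist(A, B):
    """Average pairwise distance between the rows of A and the rows of B."""
    return sum(hamming(a, b) for a in A for b in B) / (len(A) * len(B))


def energy_distance(A, B):
    """Two-sample energy distance: 2*E|a-b| - E|a-a'| - E|b-b'|."""
    return 2 * mean_dist(A, B) - mean_dist(A, A) - mean_dist(B, B)


def between_dispersion(A, B):
    """Size-weighted between-cluster energy statistic for two clusters."""
    n1, n2 = len(A), len(B)
    return n1 * n2 / (2 * (n1 + n2)) * energy_distance(A, B)


def best_bipartition(rows):
    """Exhaustively score every split into two non-empty clusters."""
    names = sorted(rows)
    best = None
    for size in range(1, len(names) // 2 + 1):
        for group in combinations(names, size):
            rest = [n for n in names if n not in group]
            score = between_dispersion(
                [rows[n] for n in group], [rows[n] for n in rest]
            )
            if best is None or score > best[0]:
                best = (score, set(group), set(rest))
    return best


score, group_a, group_b = best_bipartition(TOY)
print(sorted(group_a), sorted(group_b), round(score, 3))
```
]]></code>
        <p>On this toy data the search separates the three determinative-like rows from the three pronoun-like rows. With 138 word forms an exhaustive search is out of the question, which is why an iterative procedure such as k-groups is needed.</p>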
      </sec>
      <sec id="heading-96d2b3338da9c2760b6b556ec5bb2a1a">
        <title>1.2. Visualization and dendrograms</title>
        <p id="paragraph-b915d26b28a84580120929d9f8387b6f">The energy algorithm for hierarchical clustering is formally similar to Ward’s method (RIZZO; SZÉKELY, 2016). Where the <italic id="italic-6a6fff0c9484441d7e22145206fb92f8">k</italic>-groups algorithm is designed to maximize the distance between the two groups, the hierarchical algorithm starts with each word as its own cluster and, at each step, merges the two clusters that have the minimum cluster distance (RIZZO; SZÉKELY, 2016, p. 30). This results in a hierarchically structured set of pairwise merges. The resulting structure can be plotted as a dendrogram, which provides a way to see detailed similarities between words. Because of the way this algorithm proceeds, the largest two groups may not be maximally distant in the way that they would be under <italic id="italic-b6c5784d1c8ac15e51d04713ef8003c9">k</italic>-groups. The purpose, therefore, is to observe the detailed structure to see which words and clusters of words are closest together.</p>
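        <p>The bottom-up merging can be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not the energy package’s routine: each word begins as its own cluster, and the two clusters with the smallest size-weighted energy distance are merged until one cluster remains. The toy rows are invented, and the Hamming distance is an assumed metric.</p>
        <code language="python"><![CDATA[
```python
# Invented toy rows (NOT the study's codings): four hypothetical binary features.
TOY = {
    "the": [1, 1, 0, 0],
    "this": [1, 1, 0, 1],
    "some": [1, 1, 0, 0],
    "she": [0, 0, 1, 1],
    "him": [0, 0, 1, 0],
    "them": [0, 0, 1, 1],
}


def hamming(a, b):
    """Number of features on which two binary rows differ (assumed metric)."""
    return sum(x != y for x, y in zip(a, b))


def mean_dist(A, B):
    """Average pairwise distance between the rows of A and the rows of B."""
    return sum(hamming(a, b) for a in A for b in B) / (len(A) * len(B))


def cluster_distance(A, B):
    """Size-weighted energy distance between two clusters of rows."""
    n1, n2 = len(A), len(B)
    e = 2 * mean_dist(A, B) - mean_dist(A, A) - mean_dist(B, B)
    return n1 * n2 / (n1 + n2) * e


def agglomerate(rows):
    """Return the merge history as (members_a, members_b, distance) tuples."""
    clusters = [[name] for name in sorted(rows)]
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of current clusters with the minimum cluster distance.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(
                    [rows[n] for n in clusters[i]],
                    [rows[n] for n in clusters[j]],
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((tuple(clusters[i]), tuple(clusters[j]), d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


merges = agglomerate(TOY)
for left, right, dist in merges:
    print(left, "+", right, f"at distance {dist:.3f}")
```
]]></code>
        <p>The merge history is what a dendrogram draws: identical rows join first at distance zero, and on this toy data the final merge joins the determinative-like and pronoun-like clusters.</p>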
      </sec>
      <sec id="heading-0a099a41fea0203a52ecb7e042e7b488">
        <title>1.3. Distance components (DISCO)</title>
        <p id="paragraph-9cb233cce4de3d8ba3702573d9173135">Distance components (DISCO; RIZZO; SZÉKELY, 2010) is a member of the energy family of non-parametric statistics. Unlike the first two procedures, it doesn’t discover the structure in the data but rather takes whatever two groups it is given and calculates both the between-groups distance and the within-group distances. These two measures can then be used to calculate an “F” ratio, which, according to EVERITT (2006), is calculated as</p>
        <disp-formula id="figure-panel-df1f4bb7dd5fe11ce515e1925b7eb474">
          <graphic id="graphic-26cb09ea94308783b4ad4bfc9ac8e615" mimetype="image" mime-subtype="png" xlink:href="Captura de tela 2021-08-26 233613.png" />
        </disp-formula>
        <p id="paragraph-4404781d9affeee1b77eebf2902a00a8">A larger <italic id="italic-a70dbf292b65b3def5aba13cb3f6b8df">F </italic>indicates that the two categories are more distinct, relative to their internal differences. Unfortunately, there is no commonly known average or minimum <italic id="italic-4954119c36047add973408699099ed0a">F </italic>value when it comes to lexical categories in English or in any language. As a result, an <italic id="italic-4dd917a5ab6fa3e1a15ad30bcb384d7d">F</italic>-ratio calculated on the CGEL pronouns and determinatives is currently uninterpretable on its own. Nevertheless, a <italic id="italic-8c1eb13121cb4382f7d33fe40d3f7b6d">p</italic>-value can be derived from <italic id="italic-3337d8af805b3d1ef49335fccfece1e2">F</italic>. Conventionally, any <italic id="italic-61be982afd829dbf4d8a4efa976c86e4">p-</italic>value smaller than 0.05 is considered to show that the categorization is significantly different from random.</p>
        <p id="paragraph-e6eed4ab9a45c9f1f74604c0fa05b63f">Even a statistically significant <italic id="italic-1ac7eae5aefc545d82b2d1a996d62fb9">p</italic>-value, though, will not be enough to confirm that two groups of words should be considered distinct lexical categories. Even subcategories may produce significantly large <italic id="italic-06e4728d26faa529d81d1333e655b9f8">F</italic>-values.<xref id="xref-6705cfd78d2186f7df4ff3f35f5ceb67" ref-type="fn" rid="footnote-b09f491d898e64cd4c6625a7745b38a3">3</xref></p>
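        <p>A minimal sketch may make the between-groups and within-group logic concrete. It assumes the form of the ratio usually given for distance components, F = [S / (K - 1)] / [W / (N - K)], where S is the between-groups component, W the within-groups component, K the number of groups, and N the number of words; it also assumes the Hamming distance and derives the p-value by randomly reassigning words to groups. The toy rows are invented, and the code is an illustration rather than the energy package’s procedure.</p>
        <code language="python"><![CDATA[
```python
import random

# Invented toy rows (NOT the study's codings): four hypothetical binary features.
DETERMINATIVE_LIKE = [[1, 1, 0, 0], [1, 1, 0, 1], [1, 1, 0, 0]]
PRONOUN_LIKE = [[0, 0, 1, 1], [0, 0, 1, 0], [0, 0, 1, 1]]


def hamming(a, b):
    """Number of features on which two binary rows differ (assumed metric)."""
    return sum(x != y for x, y in zip(a, b))


def mean_dist(A, B):
    """Average pairwise distance between the rows of A and the rows of B."""
    return sum(hamming(a, b) for a in A for b in B) / (len(A) * len(B))


def disco_f(groups):
    """Assumed DISCO ratio: [S / (K - 1)] / [W / (N - K)]."""
    pooled = [row for g in groups for row in g]
    N, K = len(pooled), len(groups)
    total = N / 2 * mean_dist(pooled, pooled)
    within = sum(len(g) / 2 * mean_dist(g, g) for g in groups)
    between = total - within
    return (between / (K - 1)) / (within / (N - K))


def permutation_p(groups, replicates=999, seed=0):
    """Share of random regroupings whose F is at least the observed F."""
    rng = random.Random(seed)
    observed = disco_f(groups)
    pooled = [row for g in groups for row in g]
    sizes = [len(g) for g in groups]
    hits = 0
    for _ in range(replicates):
        rng.shuffle(pooled)
        shuffled, start = [], 0
        for size in sizes:
            shuffled.append(pooled[start:start + size])
            start += size
        if disco_f(shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (replicates + 1)


f_ratio, p_value = permutation_p([DETERMINATIVE_LIKE, PRONOUN_LIKE])
print(f"F = {f_ratio:.2f}, p = {p_value:.3f}")
```
]]></code>
        <p>With only six toy words, there are just 20 ways to form two groups of three, so the p-value cannot fall much below 0.1 however large F is. This is the small-sample limitation that using the full set of word forms avoids.</p>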
      </sec>
    </sec>
    <sec id="heading-39f8ea85fdaabf5e9687efe92cd845dc">
      <title>2. Method</title>
      <sec id="heading-3c0f714c0b7d036df23074be905e55e1">
        <title>2.1. Research questions</title>
        <p id="heading-3db8d290ac5a5e5d0ec719dd67be8dfe">I have five research questions:</p>
        <p id="paragraph-6bb8c2c0a2125a5152acfc4710a366b8" />
        <p id="paragraph-4a709188e83b29e9fa44d974f366c451">(1) How well do CGEL’s pronoun and determinative categories align with the clusters derived from <italic id="italic-eaaa5c851d33bdee70e45eb0791c3f1f">k</italic>-groups?</p>
        <p id="paragraph-1a4e13ad5f4d73a65d43122eb46b88d7">(2) How well does the hierarchical structure of the dendrogram match our intuitive notions of the similarities within the pronoun and determinative categories?</p>
        <p id="paragraph-3c53acaab846230da1cdbdfaadddd1fc">(3) What is the <italic id="italic-0eef1400022b7cd7b0bf57e7b9d80bd5">F</italic>-ratio of CGEL’s pronouns and determinatives resulting from the DISCO test?</p>
        <p id="paragraph-6e27820c4080bfac13d49d0b6cd7b81e">(4) Is the <italic id="italic-71abf12cd541cfa7a9f425a51ce5f5b6">F</italic>-ratio statistically significant?</p>
        <p id="paragraph-4c46781877b7317a6e86d17798bb8187">(5) If the <italic id="italic-b906b31bbf96eb43278f31738a97605a">F</italic>-ratio from (4) is statistically significant, are <italic id="italic-9f51dac02de500253be65247686b5d97">F</italic>-ratios calculated on subsets of CGEL’s pronoun and determinative categories also statistically significant?</p>
      </sec>
      <sec id="heading-6dd7a91e5215cb6657a3b3aeba01c2bb">
        <title>2.2. Preparing the data matrix</title>
        <p id="paragraph-c4d227b1f740481e92ec88377fe732fa">To begin answering these questions, I constructed a matrix like the one in Table 1 (downloadable as REYNOLDS, 2021). It ended up with 138 rows for word forms (73 determinatives and 65 pronouns) and 232 columns for features; the aim was to be as inclusive as possible. The list of word forms is based on the CGEL descriptions of determinatives and pronouns, and it includes all words specifically mentioned as pronouns or determinatives in CGEL. I chose to use word forms rather than lexemes because, as NAU (2016) says, “it is not lexemes that are used in syntactic functions, but word forms”<xref id="xref-350528ae65b37bfde9550d3b500a0bca" ref-type="fn" rid="footnote-c12b848b06f0b18e93e4fc88421b78d8">4</xref> (p. 31).</p>
        <p id="paragraph-bf59e180a09b6e8ed22875f1c8e2111b">Next, I built the list of features. These were grouped as morphological (139), phonological (3), semantic (36), and syntactic (54) features. In the morphological group, there was a column for each word appearing as part of another word (e.g., <italic id="italic-64091856f0611486c941d0a74fadb127">any </italic>appears in <italic id="italic-71c1e80fd2f082639c919ab27ddba02b">any</italic>, <italic id="italic-9c3a41de9807a2b69fe752bb81e03ff6">anybody</italic>, <italic id="italic-6338fa5b93dea00ed64af60937af45b0">anyone</italic>, <italic id="italic-232a092556bdd1a8e0e0b97a7816fe59">anything</italic>, and <italic id="italic-643090563bff5f94958d59eb02a5b2cc">anywhere</italic>). Almost all the other features were taken from CGEL, but other sources were included where a particular concept seemed relevant. Sometimes these came from the literature (e.g., must be outranked by a coindexed element, SAG; WASOW; BENDER, 2003, p. 292) and sometimes they were just features that struck me as possibly relevant (e.g., starts with /ð/).</p>
        <p id="paragraph-4ab18c985487394564f892a04e2abbe2">Finally, I coded each cell in the table as “may exhibit the feature” or “never does”. In many cases, I relied on my own judgement, often informed by corpus queries. With over 30,000 cells to deal with, it was impractical to do anything else. I have made my data available, and any researcher who finds errors or disagreements is welcome to publish their revisions.</p>
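        <p>The dimensions reported in this section can be checked with a few lines of arithmetic. The sketch below only restates the counts given above; the dictionary names are hypothetical labels.</p>
        <code language="python"><![CDATA[
```python
# Counts as reported in the text; the variable names are hypothetical labels.
word_forms = {"determinatives": 73, "pronouns": 65}
features = {"morphological": 139, "phonological": 3, "semantic": 36, "syntactic": 54}

rows = sum(word_forms.values())   # number of word forms
cols = sum(features.values())     # number of features
cells = rows * cols               # number of cells to code

print(rows, cols, cells)
```
]]></code>
        <p>The totals come to 138 rows, 232 columns, and 32,016 cells, consistent with the figure of over 30,000 cells.</p>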
      </sec>
      <sec id="heading-d9820f9dcdb48bd9874ef0a0a17ba659">
        <title>2.3. Data analysis and hypotheses</title>
        <p id="paragraph-e764ddf7ee2528fc1369d9e1264407a1">The first step in the analysis was to cluster the data with <italic id="italic-31625f16c505be74ef2f9bb1a8b7af8e">k</italic>-groups and to compare the resulting clusters with the CGEL categories. I expected to find a high degree of overlap. Second, I created a dendrogram and visually inspected it to understand the hierarchical structure of the data. I expected to find intuitive structure. Third, I ran the DISCO analysis on the full set of data grouped into CGEL determinatives and pronouns. I expected to find a significant difference between the two categories. Fourth, I ran the DISCO analysis on the determinatives grouped into the first 36 determinatives listed alphabetically and the remaining 37. I expected to find a smaller, non-significant <italic id="italic-e4c0665854cb7af7c0b9c246cde1290f">F</italic>-ratio. I used the energy package (RIZZO; SZÉKELY, 2019) in R (R CORE TEAM, 2019) to perform all analyses.<xref id="xref-07c786b6029cc1e291cc9d0c54c00e6a" ref-type="fn" rid="footnote-41a39e38e7832a3c232bc19ad831382f">5</xref></p>
      </sec>
    </sec>
    <sec id="heading-f418024ee41cfa2fce5e291021caab6b">
      <title>3. Results</title>
    </sec>
    <sec id="heading-19adcad6323524eca75bc124686b64ea">
      <title>3.1. <italic id="italic-c34e65cae150b4ac8bc758ec34226e4b">K</italic><italic id="italic-1fc391f9f150ff98055c5c61b2feacd1">-</italic>groups clustering</title>
      <p id="paragraph-d89ed9e8648f614ee19318b2e551bd5d">The clusters resulting from <italic id="italic-90014487fbd19731a0eb7c4938ac8274">k</italic>-groups clustering were very similar to the CGEL categories. Only three CGEL determinatives were assigned to the <italic id="italic-2ab5d9012f7d4db1557eb1455f3364fc">k-</italic>groups pronoun cluster, the personal determinatives <italic id="italic-4928bf243aacdc0ff6935d566d250e67">you</italic>, <italic id="italic-fa7a60bd7da974a04629d9ea107429d4">we</italic>, and <italic id="italic-1fa53181d68275c4c4b5c34f62df9020">us</italic> (as in <italic id="italic-f9770bd0a890ffdb8dd8125ebf25c8e1"><underline id="underline-d1e1443b44e2e30593be6d7618f15dd7">you</underline> kids can come too</italic>). And only six CGEL pronouns were assigned to the <italic id="italic-a770d0a04928cfac09199a133c25fc25">k-</italic>groups determinative cluster. These are one reciprocal pronoun <italic id="italic-8">one another </italic>(though not <italic id="italic-9">each other</italic>), the two dummy pronouns <italic id="italic-10">it </italic>and <italic id="italic-11">there</italic>, and the interrogative and relative pronouns <italic id="italic-12">what</italic> and <italic id="italic-13">whatever</italic>, along with relative <italic id="italic-14">which</italic>. Overall, then, there was agreement on 129 out of 138 words (93.48%). A random assignment would centre on 50%, so, to correct for this, 50 percentage points can be subtracted and the result doubled, which gives an 86.96% “adjusted agreement” value. So, the answer to research question 1 is “very well”, which is in line with my expectations.</p>
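      <p>The agreement figures follow directly from the counts above, as this short worked example shows. The function name is a hypothetical label; the correction it applies is the one described in the text (subtract the 50% expected by chance, then rescale so that perfect agreement is 100%).</p>
      <code language="python"><![CDATA[
```python
def adjusted_agreement(agree, total, chance=0.5):
    """Raw agreement and agreement rescaled so that chance maps to 0 and 1 stays 1."""
    raw = agree / total
    return raw, (raw - chance) / (1 - chance)


total_words = 138
misassigned_determinatives = 3   # you, we, us
misassigned_pronouns = 6         # one another, it, there, what, whatever, which
agree = total_words - misassigned_determinatives - misassigned_pronouns

raw, adjusted = adjusted_agreement(agree, total_words)
print(f"{agree} of {total_words}: raw {raw:.2%}, adjusted {adjusted:.2%}")
```
]]></code>
      <p>With chance fixed at 50%, dividing by 1 - 0.5 is the same as doubling, so the function reproduces the 93.48% and 86.96% values reported here.</p>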
      <sec id="heading-1addbba982d95f2d2fd4b8713e17c746">
        <title>3.2. Visualization with dendrograms</title>
        <p id="paragraph-a4ca6e1cdd80c34067b1b82ef77fa4ad">The dendrogram is reproduced as Figure 1. The CGEL pronouns are enclosed in the upper red rectangle, and the CGEL determinatives are enclosed in the lower blue rectangle. The two dummy pronouns <italic id="italic-66d0b5159425083c6aa2337550a4f07c">it </italic>and <italic id="italic-f26fdcf1209d2bbae7cc40da96cc4fc3">there </italic>are enclosed in a smaller red rectangle inside the blue rectangle.<xref id="xref-e8c4e967317eef00bbfca0b7a8f7f9bb" ref-type="fn" rid="footnote-098f13844e062b070e60576f74dc8d3b">6</xref></p>
        <fig id="figure-panel-0a1334f17960213a2acf191d25a2b820">
          <label>Figure 1</label>
          <caption>
            <title><bold id="bold-79bfe1bcbafa9baa0f505336f5c1daa2">Figure 1. </bold>Dendrogram showing structure of the pronoun and determinative categories.<bold id="bold-d32aaeeee8ff737bc6412cc66e15e5d7"/></title>
            <p id="paragraph-2603086e58c4b8fa12ae44e87d255f9d" />
          </caption>
          <graphic id="graphic-423e933e127bd933a8bad70107a4a76b" mimetype="image" mime-subtype="png" xlink:href="f1.png" />
        </fig>
        <p id="paragraph-83dd28cb10f30fab4304036d79210e93">There is a good deal that is very intuitive about the hierarchical structure here. To demonstrate, I will describe the pronoun structure using CGEL terminology. All the genitive pronouns live together on one branch at the top of the dendrogram. This splits into a branch with seven dependent genitive pronouns (e.g., <italic id="italic-6f7e12a7ee62926fcaae3fbffe349538"><underline id="underline-1019dbbe1250c9354441360e8edcc148">my</underline> time</italic>) and a branch with eight independent genitive pronouns (e.g., <italic id="italic-3ac4367f4bb0071a85f18b7006115c5d">mine</italic>), the odd one out being <italic id="italic-6edfc578a4c8f38c8189044975ac1e3e">his</italic>, which is both dependent and independent. The next group is the reflexive pronouns (e.g., <italic id="italic-de1248632da3595cacd53a5f47d195cf">myself</italic>). Somewhat unintuitively, these share a branch with the temporal pronouns <italic id="italic-6a3bbb5535a9a35d77ef4ef3fd34b201">today</italic>, <italic id="italic-bf56d4deef7ec9468685aa83ff78a5aa">tomorrow</italic>, <italic id="italic-8e909f397f6959a86ba6b0d0703f7e9b">tonight</italic>, and <italic id="italic-d05eb0c8f2a3062e6127a8af9fd69239">yesterday</italic>. These are followed by a branch that splits into the accusative pronouns and the nominative pronouns. This completes the top branch of the dendrogram, but 22 CGEL pronouns are clustered on branches with determinatives. 
The majority of these are interrogative and/or relative pronouns, which all share a branch with the pronouns <italic id="italic-9913b84ef29e6b382b7919ec9089f6d4">oneself</italic>, <italic id="italic-00758b77abbf28c73eb6605746b1a1b1">one</italic>, <italic id="italic-96f103f83ab0ce2f41cb86f438a4fd8b">one’s</italic>, and <italic id="italic-8cdff08c2e396951b4ee44a46d9d0dfb">its</italic> and with the reciprocals <italic id="italic-7e948c4f89248a34fd02e7942000485a">one another </italic>and <italic id="italic-540cfa96a14735d84e3ea5141d2a6c21">each other</italic>. Inside the determinatives group, we find similar levels of structure, with groupings of interrogatives, demonstratives, quantifiers, and others. So, the answer to research question 2 is also “very well”, which is in line with expectations.</p>
      </sec>
      <sec id="heading-b7455ebd152a5d006601b5c85b106840">
        <title>3.3. DISCO</title>
        <p id="paragraph-bb69ee36533950669dee8664be282cd9">The results of the DISCO test conducted with the full set of words and features are presented in Table 2.</p>
        <table-wrap id="table-figure-01befb3724458190bcbf1cae5f87f97e">
          <label>Table 2</label>
          <caption>
            <title><bold id="bold-77c6ec8330e69dfa503877b189938cb2">Table 2</bold><bold id="bold-7f1503cd82282eccdaa610bae448791f">.</bold> DISCO test output</title>
            <p id="paragraph-da8170d0a0aaad1e43a2ddb428bbc86e" />
          </caption>
          <table id="table-cf0c7e1e4a42cce54cf0344c12802a88">
            <tbody>
              <tr id="table-row-d8ddfd901f46db5b6ed8e759d0628be2">
                <td id="table-cell-fa2c9d9c1abb14aa5d281873aa4eadd8">Source</td>
                <td id="table-cell-7408f8e070b6480c3331eb908a8f066f">Df</td>
                <td id="table-cell-69a90614c7708b07328d5e5f13d7f01e">Sum Dist</td>
                <td id="table-cell-68580fc1607583389701f3d242ac9026">Mean Dist</td>
                <td id="table-cell-b76d4dc3184aaf106d74048ae6e7e5c8"><italic id="italic-5ea324cd369e2e671ac2e7f8e71aa5ab">F</italic>-ratio</td>
                <td id="table-cell-e9b082f5304be66a65d0e0c200a39a7b"><italic id="italic-e45a17a79f157f0da17d60a872685a26">p</italic>-value</td>
              </tr>
              <tr id="table-row-6faba84aaa0d09133ae76030387ecb9a">
                <td id="table-cell-a536f9fada4d03bfede3ac99889a99b0">factors</td>
                <td id="table-cell-8db3c952bd197cc66edf9570f032c504">1</td>
                <td id="table-cell-50325a84c44521bdb683f5a1ebdae477">30.58286</td>
                <td id="table-cell-282de059fe27db7db27ee6f06f185742">30.58286</td>
                <td id="table-cell-b5239da2edfac1da8cf96ea5781be098">11.670</td>
                <td id="table-cell-3be7304d7ffe4d727fe046467565d28d">0.001</td>
              </tr>
              <tr id="table-row-65d7062ec9864155d01e7610f0f44812">
                <td id="table-cell-e180a75c44439e7097740bdc091743b0">Within</td>
                <td id="table-cell-a7f87ff0f4b03464a883aee5dd4cb4cc">136</td>
                <td id="table-cell-f253e0bf0ff2de160a16d277d63f74a5">356.41607</td>
                <td id="table-cell-dc0e8cf31e2f6c325c901a148fe8b8f3">2.62071</td>
                <td id="table-cell-f749851b7898c33bd8d2e1a481af19b8" />
                <td id="table-cell-c587dbfcf7c478a7ad2f51267d7835ac" />
              </tr>
              <tr id="table-row-78480bb91a8ffbe46a031bf2ab9de6b6">
                <td id="table-cell-0fc2e5ed9be9ab756f379454c19d0c32">Total</td>
                <td id="table-cell-1b71e5de29040cfb0c6c7f4002c68014">137</td>
                <td id="table-cell-43509ce6d0f57be613c7cb6ace797045">386.99893</td>
                <td id="table-cell-0f416fe6eccfdf42ad2ec4ebb2adc1ad" />
                <td id="table-cell-904b40ec2076d58ff400f6fdf802c39c" />
                <td id="table-cell-4e2ebf6018632db7c5ae08fd26431aa5" />
              </tr>
            </tbody>
          </table>
        </table-wrap>
        <p id="paragraph-86aa790428301f30fe637b6db72e2fe7">The mean distance between the determinatives and the pronouns is 30.58,<xref id="xref-94fe24845df740b3ad0f103059e444d1" ref-type="fn" rid="footnote-fbcd64e496fea59562cd9ffd0fab4896">7</xref> while the mean distance within each group is only 2.62. The <italic id="italic-2ccf8b5e0a46d7ce19f40cbfeff69318">F</italic>-ratio is simply the ratio of these two distances, as described in §2.3, so the mean distance between categories is 11.67 times the mean distance within the categories. This <italic id="italic-1912b2d1fffffaeffe8f0a889baf1d77">F</italic>-ratio is significant at <italic>p</italic> = 0.001 (Table 2), strongly suggesting that CGEL’s determinatives and pronouns are indeed different groups of words. This is in line with expectations.</p>
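        <p id="paragraph-sketch-f-ratio">The relationship between the columns of Table 2 can be checked by hand. The following R sketch is a minimal illustration that copies the sums of distances and degrees of freedom from the table; the variable names are assumptions made for this example.</p>
        <code id="code-sketch-f-ratio" language="R"><![CDATA[
```r
# Values copied from Table 2.
sum_between <- 30.58286
df_between  <- 1
sum_within  <- 356.41607
df_within   <- 136

# Mean distance = sum of distances divided by degrees of freedom.
mean_between <- sum_between / df_between
mean_within  <- sum_within / df_within

# The F-ratio is the between-group mean over the within-group mean.
f_ratio <- mean_between / mean_within

round(c(mean_within = mean_within, f_ratio = f_ratio), 3)
```
]]></code>
        <p id="paragraph-sketch-f-ratio-note">The between and within sums also add up to the total reported in the last row of the table (386.99893).</p>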
        <p id="paragraph-af6695989a5594859cee4c2d94233aa7">When I conducted the DISCO test on the two groups of determinatives, the <italic id="italic-6da89997bd6ec66b8d6e16d6454c8d7e">F</italic>-ratio was much smaller (2.29), as expected, but it was still significant ( ), which was unexpected.</p>
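        <p id="paragraph-sketch-disco">Both tests rest on the same decomposition of total dispersion into between-group and within-group components, with significance assessed by permuting group labels. The following base R sketch illustrates that idea on toy data. It reflects my reading of RIZZO; SZÉKELY (2010) with Euclidean distance and exponent 1, and it is not the implementation in the energy package; the data and the function names dispersion and disco_f are hypothetical.</p>
        <code id="code-sketch-disco" language="R"><![CDATA[
```r
# Toy data (hypothetical): two groups of ten points in two dimensions.
set.seed(1)
x <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
           matrix(rnorm(20, mean = 3), ncol = 2))
group <- rep(c(1, 2), each = 10)

# Dispersion of a set of rows: (n / 2) times the mean pairwise distance,
# where the mean is taken over all n^2 ordered pairs (zeros included).
dispersion <- function(d, idx) {
  n <- length(idx)
  sum(d[idx, idx]) / (2 * n)
}

# F-ratio: between-group dispersion per degree of freedom divided by
# within-group dispersion per degree of freedom.
disco_f <- function(d, group) {
  n <- length(group)
  k <- length(unique(group))
  total   <- dispersion(d, seq_len(n))
  within  <- sum(sapply(unique(group),
                        function(g) dispersion(d, which(group == g))))
  between <- total - within
  (between / (k - 1)) / (within / (n - k))
}

d <- as.matrix(dist(x))
f_obs <- disco_f(d, group)

# Permutation test: shuffle the group labels and recompute the F-ratio.
R <- 199
f_perm <- replicate(R, disco_f(d, sample(group)))
p_value <- (1 + sum(f_perm >= f_obs)) / (R + 1)
```
]]></code>
        <p id="paragraph-sketch-disco-note">With this way of computing the permutation <italic>p</italic>-value, the smallest attainable value is 1/(R + 1), so the number of replicates sets a floor on the reported <italic>p</italic>-value.</p>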
      </sec>
    </sec>
    <sec id="heading-34b4f1939b90842ed6ab4d16cfb55d37">
      <title>4. Conclusions</title>
      <p id="paragraph-db2a4f0d2163b3001ff3e357849d04a4">I used the new energy distance family of non-parametric statistics to evaluate competing claims about the category status of pronouns and determinatives. There is a 93.48% overlap between CGEL’s categories and the <italic id="italic-454874b0728f57ec88a8a22fe4c92116">k</italic>-groups clusters (86.96% “adjusted agreement”). As expected, the results of the <italic id="italic-ec4ebdfc9b7e9af51329755239f97801">k</italic>-groups analysis strongly support not only the position that pronouns and determinatives are distinct groups, but also the position that the particular words CGEL categorizes as pronouns and determinatives closely match the discovered clusters. This result suggests both that <italic id="italic-826e5af799dc7ac1641ba2dd0ab74063">k</italic>-groups can be useful for discovering lexical categories and that CGEL’s categories are strongly motivated by the features of the word forms.</p>
      <p id="paragraph-6acebae1ad3e1a9b1bdddd11cde619ed">The structure of the dendrogram provides further support for CGEL’s categories. As expected, much of the structure reflects subcategories described in CGEL, though there are some unexpected elements too, such as the location of the temporal pronouns.</p>
      <p id="paragraph-fe3d86cda3c2c8f6e9211eda06bf22c3">The results of the DISCO test show, as expected, that the two CGEL categories are significantly different (<italic>p</italic> = 0.001). Unexpectedly, the <italic id="italic-e840da045c82dd1f7aa6e16372d961ac">F</italic>-ratio of the two halves of the determinatives group, while smaller, is also significant ( ), meaning that the DISCO test can’t be used as a simple way to assess category status.</p>
      <p id="paragraph-fbf09ea8a03380a17e5c053fb4021ea2">The significant result with the two halves of the determinatives group may be understood to some degree by observing the dendrogram in Figure 1. The length of the branches from left to right reflects their distance in the high-dimensional matrix (R CORE TEAM, 2019). The leftmost pair of branches are the longest, and mostly reflect the split between determinatives and pronouns, but there is a good deal of structure <bold id="bold-34799a6fe8ab06c061fe48fc6edb18ed">inside</bold> each of those categories. The alphabetical order also imposed some structure, with, for instance, all the interrogative determinatives in the second half of the list.<xref id="xref-79ec7a4d138afcb3af0334cd818877a7" ref-type="fn" rid="footnote-f8ee1b63a2f304f917ca3b7b215f2276">8</xref></p>
      <p id="paragraph-c363c5c0828167f0b8013c3557cfdb31">Taken together, these results appear to be consistent with the analysis set out in CGEL. They are not, however, entirely inconsistent with claims, such as HUDSON’s (2004), that determinatives are a subcategory of pronouns or vice versa. More research is needed to see how these analyses perform with other categories (e.g., verbs &amp; auxiliary verbs; nouns, pronouns, and determinatives; nouns &amp; verbs; etc.) before rejecting or accepting such claims.</p>
      <p id="paragraph-a1a5e7f226d64362455f3afcdfe58b9f">Having found statistical support for CGEL’s analysis, the next step is to see whether it can be improved upon. Several possibilities for recategorization suggest themselves. CGEL categorizes the demonstratives (<italic id="italic-48ccc5f4c6a583dc587e510c86472f95">this</italic>, <italic id="italic-5d02fd60ad16e849e693e9b7e24c092b">that</italic>, <italic id="italic-13a90aa7b6b1a60e0243c9702842e54e">these</italic>, &amp; <italic id="italic-9efcaf38d5313cec0e731899dd6e58d3">those</italic>) with the determinatives, but calls them “borderline cases” (p. 422). It would be interesting to run the analysis with demonstratives categorized as pronouns and compare the <italic id="italic-fc1bf3de3a73e368202e44bf5f8d25da">F-</italic>ratios.</p>
      <p id="paragraph-a0dc801da7a37dbbfa5214416be4ace8">CGEL proposes two relative <italic id="italic-edc8e16bf151c9c61ec87e3dfe72f1bd">which</italic> words, a determinative (<italic id="italic-bb9c6a274743130a94454cdebead44be">during <underline id="underline-78094601e37c55b745eb6ee508549b35">which</underline> time…</italic>) and a pronoun (<italic id="italic-1559330489759ea6a98a6a9bffd39258">the time in <underline id="underline-2">which</underline> we live</italic>). The justification for this is that the pronoun has non-personal gender (it doesn’t apply to persons, e.g., *<italic id="italic-fff21cdd8d8862c8f11bd37c2dbc3626">the person which was there</italic>) and that gender is a feature of pronouns. But CGEL’s determinatives <italic id="italic-15">you</italic> and <italic id="italic-16">us </italic>(<italic id="italic-17">us linguists</italic>) have personal gender too, so this seems poorly motivated.</p>
      <p id="paragraph-bcfa56cf490afc64b63f626fd9f89ae5">A third possibility is to try recategorizing the reciprocal words <italic id="italic-18">each other </italic>and <italic id="italic-19">one another</italic>, which CGEL has as pronouns based on the observation that they don’t take any dependents. Their morphology, though, is much more like the compound determinatives (e.g., <italic id="italic-20">anything</italic> &amp; <italic id="italic-21">nobody</italic>), suggesting they may fit better into that group.</p>
      <p id="paragraph-b46e744a4e0ca488bc676b631bb56eff">A final example is that, in both the <italic id="italic-22">k</italic>-groups and the dendrogram, the so-called dummy pronouns were grouped with the determinatives. It could be useful to discover why this is the case.</p>
      <p id="paragraph-3c993535ffd5905ebffb3a4c27438fb8">Another interesting question that may be amenable to an approach like that presented here is discovering which of the 232 features used (or others that were overlooked) will turn out to be the most reliably predictive of a given category. Conversely, we may be able to identify members of the determinatives and pronouns that are most centrally located within their respective categories. It would be interesting to discover whether <italic id="italic-23">the</italic>, for instance, is really the most prototypical determinative or whether <italic id="italic-24">it </italic>is a more central pronoun than <italic id="italic-25">you</italic>.</p>
      <p id="paragraph-8e1138ca8386901ed8dfccd586c71fd3">The use of energy statistics can also be extended to other categories and to other languages, though there are considerable hurdles to applying this to other categories. Two stand out. The first is that the number of pronouns and determinatives is relatively small. Any attempt to compare determinatives and adjectives, for example, would presumably have to come up with a defensible sampling procedure for the adjectives. The second is that coding the semantic features of adjectives is likely much more complex than coding the features of the pronouns and determinatives. Similar problems would apply to other open categories, though prepositions may be easier. Nevertheless, an existing scheme such as the UCREL Semantic Analysis System (RAYSON <italic id="italic-26">et al.</italic>, 2004) could be adapted to the problem.</p>
      <p id="paragraph-3e4f7cbef8992a36639c696a98f24347">A final general but important point is that the results of this new approach undermine Croft’s (2001) broad accusations of language-internal methodological opportunism and his claim that “if one does choose one construction (or subset of constructions) to define a category, then one still has not accounted for the anomalous distribution pattern of the constructions that have been left out” (p. 41). The results from the three analyses performed here are consistent with CGEL’s framework without the use of definitional features. In fact, this is what we would hope to find. A category defined perfectly by a single feature correlating with no other is of little value. One that has a cluster of generally related features is much more useful, even without any single perfectly reliable criterion.</p>
      <p id="paragraph-206c869560a05811010e8a7093ee1b0b">It seems, then, that, when working in a grammar such as that presented in CGEL, with attention to a wide range of features and careful consideration of many cases, it is possible to discover useful categories that stand up to statistical scrutiny. Of course, this doesn’t mean that methodological opportunism doesn’t happen, or even that it is not the rule. But it does suggest that linguists can discover and describe lexical categories without being methodological opportunists.</p>
    </sec>
    <sec id="heading-4ead81b57329c058d5c4f9d0df56313c">
      <title>5. Acknowledgements</title>
      <p id="paragraph-73c09c947d38f55924217760667ed9b7">I’d like to thank my family for their indulgence as I did the research and writing in my free time without funding. A version of this was presented at Linguistweets 2020. I’d like to thank Matthew Reynolds for help with R, Maria Rizzo for advice on DISCO and the Energy package for R, Bernhard Wälchli for pointing me to some very useful sources, and Aileen Bach and Tiago Luís Delgado for proofreading. Finally, I’d like to thank Dick Hudson for some helpful insights. Of course, the conclusions I draw are not necessarily endorsed by any of these kind folks.</p>
    </sec>
    <sec id="heading-719ca5bdd591987fb481c1f3409233cb">
      <title>References</title>
      <p id="heading-c306a21f31f4f530520d1b896a971517">ABNEY, S. P. <bold id="bold-ff5cae63160124dadd44a17437ef31f9">The English noun phrase in its sentential aspect</bold>. Massachusetts Institute of Technology, 1987.</p>
      <p id="paragraph-3">CROFT, W. A. <bold id="bold-2">Radical construction grammar: Syntactic theory in typological perspective</bold>. Oxford: Oxford University Press, 2001.</p>
      <p id="paragraph-5">CRYSTAL, D. English. <bold id="bold-3">Lingua</bold>, v. 17, n. 1–2, p. 24–56, 1967.</p>
      <p id="paragraph-7">EVERITT, B. S. <bold id="bold-4">The Cambridge dictionary of statistics</bold>. 3. ed. Cambridge: Cambridge University Press, 2006.</p>
      <p id="paragraph-9">HASPELMATH, M. How comparative concepts and descriptive linguistic categories are different. <bold id="bold-5">Aspects of Linguistic Variation</bold>. De Gruyter Mouton, 2018. p. 83–114.</p>
      <p id="paragraph-11">HUDDLESTON, R.; PULLUM, G. K. <bold id="bold-6">The Cambridge grammar of the English language</bold>. Cambridge: Cambridge University Press, 2002.</p>
      <p id="paragraph-13">HUDSON, R. A. Are determiners heads? <bold id="bold-7">Functions of Language</bold>, v. 11, n. 1, p. 7–42, 2004.</p>
      <p id="paragraph-15">HUDSON, R. A. <bold id="bold-8">Language networks: The new word grammar</bold>. Oxford: Oxford University Press, 2007.</p>
      <p id="paragraph-17">JAMES, G. <italic id="italic-9858e2ca5a07bacea5c8c72a5a31579f">et al.</italic> <bold id="bold-9">An introduction to statistical learning with applications in R</bold>. 8th printing. New York: Springer, 2017.</p>
      <p id="paragraph-19">LI, S. <bold id="bold-10"><italic id="italic-f6c12dbd4452fa1cbcc45c66c555db8b">K</italic>-groups: A generalization of <italic id="italic-3">k</italic>-means by energy distance</bold>. Bowling Green State University, 2015. Available at: &lt;https://etd.ohiolink.edu/apexprod/rws_olink/r/1501/10?p10_etd_subid=102410&gt;.</p>
      <p id="paragraph-21">MATTHEWS, P. H. <bold id="bold-11">The positions of adjectives in English</bold>. Oxford: Oxford University Press, 2014.</p>
      <p id="paragraph-23">MESSICK, S. Standards of Validity and the Validity of Standards in Performance Assessment. <bold id="bold-12">Educational Measurement: Issues and Practice</bold>, v. 14, n. 4, p. 5–8, 1995.</p>
      <p id="paragraph-25">NAU, N. <bold id="bold-13">Wortarten und Pronomina: Studien zur lettischen Grammatik</bold>. Wydział Neofilologii UAM w Poznaniu, 2016.</p>
      <p id="paragraph-27">POSTAL, P. M. On so-called “pronouns” in English. In: KAYNE, R.; LEU, T.; ZANUTTINI, R. (Org.). <bold id="bold-14">An annotated syntax reader: Lasting insights and questions</bold>. Blackwell Publishing Ltd (originally published in 1966), 2014. p. 12–25.</p>
      <p id="paragraph-29">R CORE TEAM. <bold id="bold-15">R: A language and environment for statistical computing</bold>. Vienna: R Foundation for Statistical Computing, 2019. Available at: &lt;https://www.r-project.org/&gt;.</p>
      <p id="paragraph-31">RAYSON, P. <italic>et al.</italic> The UCREL semantic analysis system. <bold id="bold-16">Proceedings of the beyond named entity recognition semantic labelling for NLP tasks workshop</bold>. p. 7–12, 2004. Available at: &lt;http://eprints.lancs.ac.uk/1783/&gt;.</p>
      <p id="paragraph-33">REYNOLDS, B. Full matrix of English determinative and pronoun features. <bold id="bold-17">Lingbuzz</bold>, 2021. Available at: &lt;https://ling.auf.net/lingbuzz/005747&gt;.</p>
      <p id="paragraph-35">RIZZO, M. L.; SZÉKELY, G. J. DISCO analysis: A nonparametric extension of analysis of variance. <bold id="bold-18">Annals of Applied Statistics</bold>, v. 4, n. 2, p. 1034–1055, 2010.</p>
      <p id="paragraph-37">RIZZO, M. L.; SZÉKELY, G. J. Energy distance. <bold id="bold-19">Wiley Interdisciplinary Reviews: Computational Statistics</bold>, v. 8, n. 1, p. 27–38, 2016.</p>
      <p id="paragraph-39">RIZZO, M. L.; SZÉKELY, G. J. <bold id="bold-20">Package ‘energy’: E-statistics: Multivariate inference via the energy of data</bold>, 2019. Available at: &lt;https://github.com/mariarizzo/energy&gt;.</p>
      <p id="paragraph-41">SAG, I. A.; WASOW, T.; BENDER, E. M. <bold id="bold-21">Syntactic theory: A formal introduction</bold>. 2. ed. Stanford, CA: Center for the Study of Language and Information, 2003.</p>
      <p id="paragraph-43">SOMMERSTEIN, A. H. On the so-called definite article in English. <bold id="bold-22">Linguistic Inquiry</bold>, v. 3, n. 2, p. 197–209, 1972.</p>
    </sec>
    <sec id="heading-91566a84fe2112c61ca35d8662859dfd">
      <title>Appendix: R code</title>
      <p id="paragraph-2">### load packages ###</p>
      <p id="paragraph-4">library(readr)</p>
      <p id="paragraph-b23d5130581c84b59f2432d2bbdd00a7">library(energy) #(RIZZO; SZÉKELY, 2019)</p>
      <p id="paragraph-6">library(ape)</p>
      <p id="paragraph-5b629b54b71c7a825840b41116549f7b">library(dplyr)</p>
      <p id="paragraph-8" />
      <p id="paragraph-10">### Clear environment ###</p>
      <p id="paragraph-574d7a4cee772c22712d7d89261e281b">rm(list = ls())</p>
      <p id="paragraph-12" />
      <p id="paragraph-14">### Preparing the data ###</p>
      <p id="paragraph-3aaf6914b563e05be97a47b7697ca894"># load data WordListM as tibble with all columns as characters. The file has 73 determinatives, 65 pronouns</p>
      <p id="paragraph-16"># this requires the comma-separated-values file named 73-65full.csv to be located in the default data folder for R.</p>
      <p id="paragraph-c3e357b3132f7e76f03fed2cc6bbd305">WordListM &lt;- read_csv("73-65full.csv")</p>
      <p id="paragraph-18" />
      <p id="paragraph-f87c2b5b27dcbe38dd4b10e35ae96fe9"># convert all columns to factors</p>
      <p id="paragraph-20">WordListM &lt;- WordListM %&gt;% mutate_if(is.character,as.factor)</p>
      <p id="paragraph-0e257bbe55ae44d37ad73d79acc1ac00" />
      <p id="paragraph-22"># create WordListM as data frame from tibble</p>
      <p id="paragraph-e9772ab33885a1669ad0b40fec61040f">WordListM &lt;- as.data.frame(WordListM)</p>
      <p id="paragraph-24" />
      <p id="paragraph-47fd26a6fe28b9587451044199ecb699"># convert first column to rowname</p>
      <p id="paragraph-26">rownames(WordListM) &lt;- WordListM[, 1]</p>
      <p id="paragraph-eb1a864ce1224049839fcc101c94a24c">WordListM &lt;- WordListM[, -1]</p>
      <p id="paragraph-28" />
      <p id="paragraph-232e833576a6e0d81eb5dae3f62f1473"># convert data frame to matrix</p>
      <p id="paragraph-30">WordListM &lt;- data.matrix(WordListM, rownames.force = NA)</p>
      <p id="paragraph-32" />
      <p id="paragraph-9c6dc59b22d7ca2eb863ba2717bdc07f">### k-groups ###</p>
      <p id="paragraph-34"># run k-groups once, store the fit, and display the output (the outer parentheses print the result)</p>
      <p id="paragraph-53312575e63ea072ee19cfef1a5c12d1">(kg &lt;- kgroups(WordListM[,-1], 2, iter.max = 10, nstart = 1, cluster = NULL))</p>
      <p id="paragraph-36" />
      <p id="paragraph-dfac891b29786435d4274b83a91a8b79"># show list of k-groups cluster assignments from the same stored fit, so that both outputs describe one run</p>
      <p id="paragraph-38">fitted(kg)</p>
      <p id="paragraph-ce160de58000cb278f450a582ef83941" />
      <p id="paragraph-ae23f4a07e7e46146930034b52127782">### dendrogram ###</p>
      <p id="paragraph-42"># create clusters</p>
      <p id="paragraph-7eb7b983940dfd2a92d4674bf5be177d">hc &lt;- energy.hclust(dist(WordListM))</p>
      <p id="paragraph-44" />
      <p id="paragraph-45"># plot clusters as dendrogram</p>
      <p id="paragraph-46">plot(as.phylo(hc), cex = 0.3, label.offset = 0.1)</p>
      <p id="paragraph-47" />
      <p id="paragraph-49">### DISCO ###</p>
      <p id="paragraph-50">#Run DISCO for 73 determinatives &amp; 65 pronouns</p>
      <p id="paragraph-51">eqdist.etest(WordListM, sizes=c(73, 65), R=999, method="discoF")</p>
      <p id="paragraph-52" />
      <p id="paragraph-53">#Rerun DISCO on 73 determinatives split in half to test sensitivity of discoF</p>
      <p id="paragraph-54">eqdist.etest(WordListM[c(1:73),], sizes=c(35,38), R=999, method="discoF")</p>
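      <p id="paragraph-sketch-crosstab">The cluster assignments returned by fitted() can be compared with the CGEL categories along the following lines. This is a minimal sketch on hypothetical toy labels, not the code used for the reported results; the vectors cgel and cluster are assumptions standing in for the CGEL category of each word and the k-groups assignments.</p>
      <code id="code-sketch-crosstab" language="R"><![CDATA[
```r
# Hypothetical toy labels; the real analysis would use the CGEL category of
# each of the 138 words and the cluster numbers returned by fitted().
cgel    <- c("det", "det", "det", "pro", "pro", "pro", "pro", "det")
cluster <- c(1, 1, 2, 2, 2, 2, 1, 1)

# Cross-tabulate CGEL categories against cluster numbers
# (assumes both clusters are non-empty, giving a 2 x 2 table).
tab <- table(cgel, cluster)

# Cluster numbers are arbitrary, so take the better of the two possible
# matchings between clusters and categories.
n_agree <- max(sum(diag(tab)), sum(tab) - sum(diag(tab)))

raw_agreement      <- n_agree / sum(tab)
adjusted_agreement <- 2 * (raw_agreement - 0.5)
```
]]></code>
      <p id="paragraph-sketch-crosstab-note">Taking the maximum over the two matchings avoids reporting a misleadingly low agreement when the cluster labelled 1 happens to correspond to the second category.</p>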
    </sec>
  </body>
  <back>
    <fn-group>
      <fn id="footnote-f0ecd757fd7fc925aa874771fae4684f">
        <label>1</label>
        <p id="paragraph-06f6f820d598643591d92fdf66ddfbdc">“Der Zusammenhang zwischen Pronomina und Determinativen ist in vielen Sprachen auffällig. Wenn sie nicht zu einer Klasse zusammengefasst werden sollen, muss doch ihre Beziehung geklärt werden.” (my translation)</p>
      </fn>
      <fn id="footnote-6af6b26cb08147f20056f24b2e794846">
        <label>2</label>
        <p id="paragraph-fbddc7cc42d07620cc5cadc826a9379b">Of course, there is no end to the possible criteria. We could consider all words appearing paragraph-initially in the first British printing of <italic id="italic-829c4c1dd89064ca024299447f9c957d">Moby Dick</italic>, but the idea is to be inclusive without becoming absurd.</p>
      </fn>
      <fn id="footnote-b09f491d898e64cd4c6625a7745b38a3">
        <label>3</label>
        <p id="paragraph-da09da1e75d9fea086880abe885711e9">I’d like to thank reviewer Prof. Dr. João Paulo Cyrino for this observation and for some helpful suggestions about R coding.</p>
      </fn>
      <fn id="footnote-c12b848b06f0b18e93e4fc88421b78d8">
        <label>4</label>
        <p id="paragraph-6b3034414c8944f1a045bd1a079d3686">“Denn nicht Lexeme werden in syntaktischen Funktionen gebraucht, sondern Wortformen.” (my translation)</p>
      </fn>
      <fn id="footnote-41a39e38e7832a3c232bc19ad831382f">
        <label>5</label>
        <p id="paragraph-2702f8d3faa5fe0d0dcfb763706d8d24">See the appendix for the R code.</p>
      </fn>
      <fn id="footnote-098f13844e062b070e60576f74dc8d3b">
        <label>6</label>
        <p id="paragraph-84c45c22ae5c3a10c0e9c39a8e96909e">These rectangles were added manually after the dendrogram was produced.</p>
      </fn>
      <fn id="footnote-fbcd64e496fea59562cd9ffd0fab4896">
        <label>7</label>
        <p id="paragraph-7dfb8111536906581f9597054f5babcc">With categorical data, it is not meaningful to ask what unit of distance is being used. The distance is only meaningful in relative terms. Thus 20.23 is half as far as 40.46 in <bold id="bold-de9bd5cd837d16c5a46282978bcdaa6b">this particular data set</bold>, but there is no independent standard against which 20.23 can be measured.</p>
      </fn>
      <fn id="footnote-f8ee1b63a2f304f917ca3b7b215f2276">
        <label>8</label>
        <p id="paragraph-633ddf1d66eea5ea1bb81ebc646dea24">I did run a third DISCO test on two completely randomized lists, and the result was not significant.</p>
      </fn>
    </fn-group>
  </back>
</article>