Using a coupled-oscillator model of speech rhythm to estimate rhythmic variability in two Brazilian Portuguese varieties (CE and SP)

This paper presents preliminary results of a semi-automatic methodology to extract three parameters of a dynamic model of speech rhythm. The model attempts to analyze the production of rhythm as a system of coupled oscillators which represent syllabicity and phrase stress as levels of temporal organization. The estimated parameters are the syllabic oscillator entrainment rate (alpha), the syllabic oscillator decay rate (beta), and the coupling strength between the oscillators (w0). The methodology involves finding the  combination that minimizes the distance between natural duration contours and simulated contours generated using several combinations of the parameters. The distance between natural and model-generated contours was measured in two ways by comparing: (1) plain or overt syllable to syllable duration and (2) relative change along both contours.We applied this methodology to read speech produced by five speakers of the state of Ceará (CE) and eight speakers of the state of São Paulo (SP). Mean w0 and alpha values are compatible with the view that Brazilian Portuguese is a mixed-rhythm language. Results from two bayesian hierarchical regression models do not suggest a difference between SP and CE speakers, but indicate a difference between the two methods, with the relative change method generating lower alpha values and higher w0 values, and the reverse for the plain duration method.

speech produced by five speakers of the state of Ceará (CE) and eight speakers of the state of São Paulo (SP). Mean w0 and alpha values are compatible with the view that Brazilian Portuguese is a mixed-rhythm language. Results from two Bayesian hierarchical regression models do not suggest a difference between SP and CE speakers, but indicate a difference between the two methods, with the relative change method generating lower alpha values and higher w0 values.

INTRODUCTION
The goal of this study is to test a methodology we are developing to study rhythmic variability across speakers and languages based on the concepts underlying Barbosa's dynamic model of speech rhythm (BARBOSA 2006;2007). In this initial exploration of this methodology, we use it to compare rhythmic variability in two Brazilian Portuguese regional varieties, the one spoken in the Northeastern state of Ceará and the one spoken in the Southeastern state of São Paulo.
We base our work on the version of the model presented in detail in Barbosa (2006Barbosa ( , 2007. Here we omit the mathematical details of the implementation for the sake of brevity and focus on a qualitative presentation of the key ideas underlying it. The model assumes the existence of two abstract oscillators. The first is a syllabic oscillator, which stands for the sequence of syllable-sized units (or VV units) that make up the speech chain. The second oscillator is the phrase stress oscillator and represents the sequence of beats or phrase stresses that occur along a given utterance. The phrase stress oscillator entrains the syllabic oscillator, such that its period gets progressively lengthened as it approaches the phrase stress beat location in the utterance. Points of maxima on the entrained oscillator cycle correspond to the onsets of vocalic gestures in speech.
Four model variables are relevant to our discussion on rhythmic variability: T0, the syllabic oscillator uncoupled period, which can be understood as the underlying speech rate; α, the syllabic oscillator entrainment rate, which modulates how fast the syllabic oscillator period changes in response to coupling; β, the decay or period reset rate, which modulates how fast the syllabic oscillator tries to return to the uncoupled period after the phrasal stress beat; and w0, the relative coupling strength, which modulates the degree of synchronization between the two oscillators. Barbosa (2006Barbosa ( , 2007 makes a connection between his model and the so-called rhythm typology or the classification of a particular language as syllable-timed or stresstimed. This typology has a long tradition in linguistic research -see Bertinetto (1989) for a review. At first, Pike (1945) treated the issue as a matter of a flexible tendency towards one or the other rhythmic type, but later on Abercrombie (1967) reframed the subject as a dichotomy. Even though the empirical research has been unsupportive of the dichotomy view on the matter (BERTINETTO, 1989;DAUER, 1983), the idea has persisted, augmented by the inclusion of moraic rhythm as a third type (LADEFOGED, 2001). Barbosa (2002Barbosa ( , 2004Barbosa ( , 2006Barbosa ( , 2007 suggests that the dynamical nature of his model allows the syllabletimed and stress-timed typology to be reinterpreted as a continuum rather than a discrete dichotomy. He argues, based on theoretical and empirical grounds, that by varying w0 from low (near zero) to high (near 1) values, it is possible to simulate duration patterns which go from syllable-timed to stress-timed. The author hypothesizes that w0 values "vary more extremely from language to language, than from speaker to speaker within a same linguistic community" (BARBOSA, 2007, p. 733), but points out that the claim of higher crosslinguistic variation compared to intralinguistic variation should be verified experimentally. He also suggests (Barbosa 2007, p. 734) that w0 should vary less than α and β within a language community and that α and β may vary as a function of both speakers and speaking styles.
The main goal of the present study is to take a step forward in defining a semiautomatic methodology that will make it possible to estimate α, β and w0 given a set of audio  Figure 1 shows a TextGrid file with the first few seconds of an audio sample segmented in VV units according to the procedure described earlier in the section. Segmentation was used to identify stress groups in speech samples. Stress groups were identified using the 3-step procedure described in Barbosa (2006Barbosa ( , 2007 and implemented as a Praat script 4 . First, the raw duration of each VV unit is normalized using a z-score transform; in the second step, the normalized duration contour is smoothed using a 5-point moving average; lastly, peaks in the normalized, smoothed contours are identified. Peaks on the contour are considered occurrences of phrasal stresses and two consecutive phrasal stresses define a stress group. Figure 1. Part of an audio sample from the SP variety (speaker SP1) segmented into VV units following the SAMPA-PB notation. The content shown in the sample is "Em seguida, apareceu um papagaio real" (Then a royal parrot appeared).

MODEL PARAMETER EXTRACTION
The semi-automatic methodology to estimate the parameters involves generating thousands of simulated contours with varying α, β and w0 values, which are later compared to the natural contours to arrive at the <α, β, w0> combination that best fits the natural 2 A VV unit is a syllable-sized segment delimited by two consecutive vowel onsets in running speech. A body of literature suggests VV units are better than the phonological syllable to reveal the rhythmic structure of speech (BARBOSA, P. A., 2006(BARBOSA, P. A., , 2007PETTORINO et al., 2013).
3 SAMPA-PB is a convention inspired by SAMPA to phonetically annotate Brazilian Portuguese using ASCII characters. See https://github.com/parantes/sampa-pb for more information. 4 The script can be found at https://github.com/parantes/duration_suite. It is a rewrite of the SGDetector script originally coded by Barbosa (2006). The reader is referred to Barbosa (2006) for further information on the rationale for each step of the procedure. contour. Syllable-sized durations generated by the model are expressed in time units. The normalized smoothed natural contours are expressed in z-score units. In order to make a comparison between the two types of contours possible, both must be in the same scale. To achieve this, we adopted the procedure described in Barbosa (2004, p. 42-43) to restore the normalized and smoothed duration of each VV unit in the contour to an abstract duration expressed in real time units by reversing the z-score transform. The restored abstract duration of a given VV unit (durabs) is obtained by the use of the formula in (1), where zsm stands for the normalized and smoothed duration of the VV unit, σr is the reference standard deviation value and μr is the reference mean value.
Barbosa suggests the use of the abstract VV unit "aC" as a reference in the restoration operation. The rationale is that "a" is the most frequent vowel in BP and "C" is a voiceless  Barbosa (2006).
In order to generate simulated contours using the coupled-oscillator model, it is necessary to specify, in addition to the α, β and w0 parameters, an estimate of the uncoupled syllabic oscillator period (T0), the number of stress groups, the number of VV units in each stress group and the magnitude of each phrase stress. The information on the number of stress groups and their size in VV units is provided by the output of the duration-suite script mentioned in the previous section. T0 was estimated by taking the median duration of VV units in stress groups with more than four VV units, excluding the first two and the last VV units. The first two VV units are excluded because the model stipulates that the syllabic oscillator period reset is most active at the start of the stress group. The last VV unit is excluded because it is the most affected by the influence of phrase stress. The magnitude of phrase stresses was kept constant at a value of 1. The rationale for this decision is that the formulation of the coupled-oscillator model we use here does not incorporate information on higher linguistic information that may affect boundary strength and especially the presence and duration of silent pauses at the boundary 5 .
5 Barbosa (2006Barbosa ( , 2007 presents a probabilistic approach to incorporate the effect of the syntax-prosody interaction on phrase stress magnitude, although it does not deal with silent pause insertion. When generating simulated contours, the information on the number of stress groups and their size (in VV units) was kept constant and values for the α, β and w0 parameters are systematically varied: α and β vary from 0.05 to 1.5 in steps of 0.05, and w0 varies from 0.05 to 1 in steps of 0.025. In doing so, a total of 35,100 simulated contours are generated based on the stress group structure of each speech sample, one for each <α, β, w0> combination.
In order to generate the simulated contours, an R implementation of the the coupledoscillator model was used 6 .

CONTOUR COMPARISON AND RANKING
In order to determine which <α, β, w0> combination generates the simulated contour that best matches a natural contour, the natural contour is compared to all simulated contours.
The distance between contours was measured using two methods:  Barbosa (2004). In that paper, the author outlines a procedure based on rearrangements of his model's mathematical formulas that allows estimating w0 by measuring relative change in duration contours 7 .
6 The code can be accessed at https://github.com/parantes/rhythm. 7 In the procedure described in Barbosa (2004), the value for α is fixed by the user, but the value may change for different speaking rates if the data to which the procedure is being applied systematically vary the speaking rate.   Also following Barbosa (2004), only stress groups with more than four VV units were included in the comparisons; and, within the stress groups, the first and the last VV units were excluded. The rationale for this is the same outlined in the previous section for the estimation of T0. Mean Absolute Error (MAE) was used as the measure of distance between natural and simulated contours 8 . Error was measured as the mean of the absolute difference in each VV unit between the corresponding values in the natural and simulated contours. Simulated contours that are impossible or unlikely in natural speech were excluded from the comparison, namely contours that have negative values. Contours that have a minimum that is smaller than the minimum VV duration found in the natural contour by more than 10% or a maximum that exceeds the maximum VV duration in the natural contour by more than 10% were also excluded.
In addition to the error measure, the ratio between the duration range of natural and simulated contours was used as a second index to rank a contour comparison. Range ratios close to 1 ensure that the simulated contour spans a range of VV unit durations that is closer to the one found in the natural contour.
In order to determine the parameter combination that yielded the simulated contour with the best fit to the natural one, the list of all valid simulated contours was sorted in an ascending order by the range ratio index. A list with the 50 candidates with ratios closest to 1 was generated and then sorted in descending order of error measure. The median of the values of α, β and w0 of the first 20 candidates in this list are considered the values that generate the simulated contour that best fits the natural contour.

RESULTS
Since β, out of the three parameters, is the one with the least impact on rhythmic typology, and for the sake of brevity, this analysis consisted of two response variables: α and w0; and two predictor variables: method (plain duration and relative change), and dialect (CE and SP   Considering the dialects separately, as shown in Figure 5, SP speakers had higher w0 values in both methods, and also had higher α values in plain duration. The only parameter that was higher for CE speakers was their α values under the relative change method. Figure 5. Boxplots of α and w0 values for the plain duration and relative change methods, and for CE and SP speakers.
Blue dots and blue lines are the means and standard errors, respectively.
In order to assess the tendencies revealed by the descriptive statistics and the boxplots presented as well as to evaluate possible effects of dialect and method on α and w0, two Bayesian mixed-effects regression models were fitted, one to the α data and another to the w0 data. Both models include random intercepts for "speakers", since each contributed with more than one data point. We also fitted a model with random slopes for "dialect", and another one with an interaction between dialect and method (which was not significant), but a model comparison using the Bayesian leave-one-out cross-validation 9 indicated the model with only varying intercepts with no interaction as the the one with more predictive ability, both for α and for w0 .  effects showed a probable standard deviation of 0.08 for individuals, meaning we could expect 68% of participants to vary their intercepts in + or -0.08 (1 SD), and 95% to vary their intercepts in + or -0.16 (2 SDs) 11 . Figure 5 presents the posterior probability distributions of the coefficients of the model, with the thick blue lines at the medians of the distributions (the "Estimates" from table 2), the shaded blue areas corresponding to the 50% CIs, and the tails of each distribution bounded at its 99% CI. The third distribution shows the most probable values of α for CE speakers under the plain duration method (Intercept). The second distribution, completely on the negative side, makes us confident that using the relative change method decreases α values. The first distribution, on the other hand, with 34% of its area on the negative side, indicates that changing the dialect from CE to SP, at least with our data, does not affect α 12 .
The graph with fitted values, in figure 7, reinforces the lack of effect of dialect in our data and the decrease of α values in the relative change method. Keeping the same system used so far, the dots represent the most probable value (median of the distribution) of α, the thicker portion of each line represents its 50% most credible interval, and the thinner portions extend up to the 95% CI.
11 Note, though, that random effects are also given as probability distributions and the 95% credible interval of the effect is given in the table. 12 Having 66% of the area of the posterior distribution on the positive side is not enough to recognize an effect of dialect. If we took the 95% / 5% (e.g., p = 0.05) value commonly used in frequentist statistics for decision making, 66% would be a definite "no effect".  Assuming Brazilian Portuguese as having a mixed rhythm, we set a regularizing prior distribution for w0 centered at 0.5, but allowing for values between 0 and 1. We defined the prior for the Intercept as a normal distribution with mean 0.5 and standard deviation 0.25.
The prior for the slopes were also regularizing priors, centered at zero to allow for changes in any direction (increasing or decreasing w0). It was defined as a normal distribution with mean 0 with a standard deviation of 0.35, to allow for both the increase or decrease of w0, but within the 0 to 1 range. Finally, the prior for sigma was set as a normal distribution with mean 0 and standard deviation 0.25 (truncated at 0).
The posterior distributions of the coefficients of the model for w0 are presented in table 3, with the same style previously used to report on α. The most probable value for w0 for a CE speaker in the plain duration method (Intercept) is 0.45 (ranging from 0.36 to 0.55 in the 95% most credible interval). Changing the method to relative duration increases w0, the most probable value being +0.2, but ranging from 0.11 to 0.28 in the 95% credible interval. Changing the dialect to SP seems to increase w0 values, but we cannot be certain at this point because part of the 95% CI of this effect crosses zero, showing there is probability that there is no effect of dialect. The  The graph with fitted values, in figure 9, reinforces the increase in w0 values derived by the relative change method. It also shows that the difference between CE and SP is higher for w0 than it was for α (figure 7), but still not high enough to lead to the conclusion of an effect of dialect in our data. The graph in figure 9 uses the same system used so far, where the dots represent the most probable value (medians of the distributions) of w0, the thicker portion of each line represents its 50% most credible interval, and the thinner portions extend up to the 95% CI. Figure 9. Fitted values of w0 for CE and SP dialects, and for plain duration and relative change methods.  Barbosa's assumption that w0 values should be fairly similar among varieties of the same language.
As for α, the results show that both varieties have similar values as well, and its overall variability is only slightly higher than the one seen for w0, not enough to allow us to infer that it has greater variability These results do not necessarily contradict Barbosa's assumptions, because the author does not assume that α should be more variable than w0. Our sample is too small in order to compare between-speaker variability with between-dialect variability, so a larger sample would unveil a difference in α if there is one. Given the scarcity of previous studies on rhythmic differences between the various BP varieties, we cannot yet firmly assume that the lack of dialect effect reflects the true nature of the two varieties or is a byproduct of our procedure.

FINAL REMARKS
Given the complex nature of speech rhythm and the number of degrees of freedom involved in the design of a procedure such as the one we are trying to develop, the results presented here are encouraging. Except for the labeling part, the procedure is driven by scripts, making it easier to scale up studies on rhythmic characterization and variability. In comparison with Barbosa's procedure (BARBOSA, 2004), it has the advantage of also estimating α and β. Regarding the development of the procedure, the present results indicate that further analysis of the performance of the contour comparison methods presented here are needed to gain a better understanding of the differences and make an informed decision on which of the two should be used going forward.
Notwithstanding the pending issue of the contour comparison methods, the present results suggest that the procedure generates linguistically sensible estimates for w0 and α, corroborating the basic assumption that BP is a mix-type language in terms of rhythmic organization. The results also lean towards corroborating some assumptions made by Barbosa about the cross-dialectal and between-speaker variability of w0 and α, although firmer conclusions require a larger sample.
On the linguistic side, once it becomes possible to select one single contour comparison method, we plan to apply the methodology to languages other than BP. Crucially, we plan to look at languages that are widely regarded in the literature as prime examples of stresstimed and syllable-timed rhythm, such as English and Spanish, respectively. This will allow us to test some of the most crucial hypotheses raised by Barbosa regarding the role of the coupling strength parameter (w0) as an index of rhythm typology. Further in the future, if the methodology proves to be sound, we also plan to apply it to the speech of L2 learners.

ACKNOWLEDGEMENTS
The first author would like to thank Plinio A. Barbosa for sharing the sound files for the São Paulo variety analyzed here. The second author would like to thank the Brazilian National Council for Scientific and Technological Development -CNPq (process 438823/2018-4) for partially supporting this project. The authors also thank the reviewers for their careful reading and for their comments, which greatly improved the paper.