
Theoretical Essay

Database Semantics

Roland Hausser

University of Erlangen-Nuremberg


Keywords

agent-based
interface component
on-board database and orientation system
recognition and action
speak, think, and hear mode
concept, indexical, and name
type-token matching, pointing, and baptism
raw data for surface transfer

Abstract

For long-term upscaling, the computational reconstruction of a complex natural mechanism must be input-output equivalent with the prototype, i.e. the reconstruction must take the same input and produce the same output in the same processing order as the original. Accordingly, the modeling of natural language communication in Database Semantics (DBS) uses a time-linear derivation order for the speaker’s output and the hearer’s input. The language-dependent surfaces serving as the vehicle of content transfer from speaker to hearer are raw data without meaning or any grammatical properties whatsoever, but measurable by natural science.

PART I: FOUNDATION

To start from familiar ground, let us begin with Predicate Calculus as a widely accepted representation of meaning/content, followed by the DBS alternative. The most basic difference between them is their ontology: Predicate Calculus is sign-based substitution-driven, while DBS is agent-based data-driven.

1. Brief outline of predicate calculus

In Predicate Calculus (PredC), elementary meanings are treated as mini-propositions which denote truth values relative to a set-theoretic model. Defined as functors which may differ in the number of arguments, e.g. f(x) vs. f(x,y), they are connected by the operators of Propositional Calculus (PropC) and the quantifiers of PredC:

1.1. Noun, verb, and adj flattened into mini-propositions

Figure 1.

This makes it possible to represent, for example, The dog found a bone as three mini-propositions which are coordinated with the propositional operator ∧ and have the variables x and y bound by two quantifiers:

1.2. PredC representation of The dog found a bone

Figure 2.

The different meanings of dog, find, and bone are defined by the denotation function F and the assignment function g of a formal model:

1.3. Minimal model for the PredC formula 1.2

Let M be a model <A, B, F, g>, where A is an infinite set of objects or individuals, B a finite set of basic expressions, F a denotation function from B into A*, and g an assignment function from variables into A*.

For illustration, let us define A, B, F, and g as follows:

Figure 3.

Based on these definitions, the formula 1.2 is well-formed and true in M. However, if we defined F(dog) as a3, for example, the formula would be false.
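As an illustration of how such a model-theoretic evaluation works, the following minimal sketch checks a formula of the form ∃x∃y[dog(x) ∧ find(x,y) ∧ bone(y)] against a toy model. Since the concrete definitions of A, B, F, and g in 1.3 are given in the figure, the values below are illustrative assumptions rather than a reproduction of the original model, and the assignment function g is folded into direct quantification over A.

```python
# Minimal sketch of evaluating the PredC formula of 1.2 against a model
# M = <A, B, F, g>.  The sets and denotations below are illustrative
# assumptions; only the form of the formula follows the text.

A = {"a1", "a2", "a3"}                       # individuals
F = {                                        # denotation function F
    "dog":  {"a1"},
    "bone": {"a2"},
    "find": {("a1", "a2")},
}

def true_in_M(F, A):
    """exists x exists y [dog(x) & find(x,y) & bone(y)]"""
    return any(
        x in F["dog"] and y in F["bone"] and (x, y) in F["find"]
        for x in A for y in A
    )

print(true_in_M(F, A))          # True
F_false = dict(F, dog={"a3"})   # redefining F(dog) as a3 ...
print(true_in_M(F_false, A))    # ... makes the formula false
```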

2. Content in DBS: agent-based data-driven

Figure 4.

This content is an order-free set of three proplets, i.e. non-recursive feature structures with ordered attributes, connected by semantic relations of structure coded by address and by a shared prn (proposition number) value, here 54. Proplets are the computational data structure of DBS.
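As a rough illustration of this data structure, the following sketch codes the content of The dog found a bone as a set of Python dictionaries. The attribute names fnc and arg are assumptions introduced for the example; the shared prn value and the address-based coding of the relations follow the text.

```python
# Sketch of the content of Figure 4 as a set of proplets: non-recursive
# feature structures with ordered attributes.  The attributes fnc (functor)
# and arg (arguments) are assumptions; the shared prn value 54 and the
# coding of the relations by address follow the text.

dog  = {"noun": "dog",  "fnc": "find",          "prn": 54}
find = {"verb": "find", "arg": ["dog", "bone"], "prn": 54}
bone = {"noun": "bone", "fnc": "find",          "prn": 54}

content = [dog, find, bone]   # order-free: the relations are coded by
                              # address (core value + prn), not by position
```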

The speak mode takes a content as input and produces a surface as output. The hear mode takes a surface as input and produces a content as output. As a minimal standard for natural language communication to be successful, (i) a content mapped by the speak mode into a surface and (ii) the content resulting from the hear mode’s interpretation of that surface must be the same (15.1, (ii)).

3. Additional communication conditions

For successful communication in the wider sense, the hearer must infer the speaker’s intent by reconstructing a possible inference by the speaker for a nonliteral content production, resulting in a nonliteral meaning (pragmatics).

3.1. First principle of pragmatics (POP-1, FOCL 4.3.3)

The speaker’s utterance meaning2 is the use of the sign’s literal meaning1 relative to an (agent-)internal context (of use).

A content is related to both meaning1 and meaning2 as follows:

3.2. Defining content in terms of language

A content is a set of connected proplets without sur values.

A meaning1 is related to a content as follows:

3.3. Defining language meaning1 in terms of content

A meaning1 is a content with language-dependent sur values.

In DBS, language and nonlanguage cognition use the same kind of proplets connected by the same semantic relations of structure. The proplets of language and nonlanguage cognition differ in the presence vs. absence of language-dependent sur values (2.1).

4. Semantics, syntax, content, and mechanisms

Linguistics, symbolic logic, and philosophy use the following notions:

4.1. Related notions in linguistics, logic, and philosophy

Figure 5.

These are not merely different terms for the same things, but different terms for different aspects of the same things. In DBS, the distinctions are related as follows:

4.2. 1st correlation: semantics and their syntax

Figure 6.

The distinction between (i) Semantic and (ii) Syntactic kinds is complemented by a second, orthogonal pair of triple distinctions, namely (iii) three Content kinds and (iv) three associated computational Mechanisms:

4.3. 2nd correlation: contents and their mechanisms

Figure 7.

The two dichotomies provide 12 basic notions, six basic pairs, and two correlations.

Theoretically, there are 12 classes of proplets which differ in the Semantic, Syntactic, Content, and Mechanism kind. Empirically, however, there are only six. They constitute the semantic building blocks of DBS cognition in general and natural language communication in particular. The six classes of proplets form the cognitive square of DBS, which is considered universal:

4.4. Cognitive square of DBS

Figure 8.

The most general content kind is the concepts, which occur as all three semantic kinds: referent, property, and relation. The most general semantic kind is the referents, which occur as all three content kinds: name, indexical, and concept.

The cognitive square of DBS is empirically important because (i) figurative use is restricted to concepts, i.e. the bottom row, and (ii) reference is restricted to nouns, i.e. the left-most column. Thus only concept nouns may be used both figuratively and as referents, while indexical properties like here and now may not be used as either, and names only as referents.

5. Content kinds and computational mechanisms

Of the content kinds concept, indexical, and name, it is the concepts which establish the interaction between the agent’s cognition and the cognition-external raw data.

5.1. Concepts occur in all three semantic kinds

Figure 9.

The computational Mechanism of concepts is type/token matching (6.1, 6.3, 7.1, 7.3). The three Content kinds of the Semantic kind referent, i.e. names, indexicals, and concepts, all have literal use, but only the concepts allow figurative use.

The computational Mechanism of indexicals is pointing at values of the agent’s onboard orientation system (OBOS).

5.2. Indexicals occur in two of the three semantic kinds

Figure 10.

The semantic kind absent in indexicals is relation. Also, indexicals have no non-literal use. Indexicals depend on the pattern matching mechanism of concepts insofar as concepts are the range of the indexicals’ pointing mechanism. For example, if the indexical here points at the [S: veranda] feature of the agent’s OBOS, then the interpretation of the indexical depends on the concept ‘veranda’.
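The pointing mechanism may be sketched as a simple lookup into the agent's current OBOS values. The slot names other than S, and the stored values, are illustrative assumptions.

```python
# Sketch of indexical pointing into a toy on-board orientation system
# (OBOS).  The S (spatial) slot follows the text; the other slot names
# and the values are assumptions for illustration.

obos = {"S": "veranda", "T": "2021-06-01T10:15", "speaker": "agent_7"}

POINTERS = {"here": "S", "now": "T", "I": "speaker"}   # indexical -> OBOS slot

def interpret_indexical(word, obos):
    """Pointing: an indexical returns the current value of its OBOS slot."""
    return obos[POINTERS[word]]

print(interpret_indexical("here", obos))   # 'veranda'
```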

The functioning of names is a controversial topic in philosophy (Frege 1892, Russell 1905, Kripke 1972), but baptism is at the center. In DBS, the computational Mechanism of names inserts the address of a named referent into lexical name proplets as their core value (CASM 2017):

5.3. Baptism providing referent to lexical name proplet

Figure 11.

The semantic kinds absent in names are property and relation. Like indexicals, names have no nonliteral use.

5.4. Names occur in only one of the three semantic kinds

Figure 12.

Names depend on the pattern matching mechanism of concepts insofar as the named referent, e.g. [person x], is an address, which consists in part of a concept.
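A minimal sketch of this mechanism, assuming a toy proplet format and address notation, inserts the referent's address into the core slot of a lexical name proplet:

```python
# Sketch of baptism: inserting the address of a named referent into a
# lexical name proplet as its core value.  The proplet attributes, the cat
# value, and the address format ("person", "x") are illustrative assumptions.

def baptize(lexical_name_proplet, referent_address):
    """Copy the referent's address into the name proplet's core slot."""
    proplet = dict(lexical_name_proplet)
    proplet["noun"] = referent_address          # core value := address
    return proplet

lucy_lexical = {"sur": "Lucy", "noun": None, "cat": ["nm"], "prn": None}
lucy_named   = baptize(lucy_lexical, ("person", "x"))   # [person x]
print(lucy_named["noun"])    # ('person', 'x')
```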

6. Using concepts of geometry in recognition and action

Natural languages may differ substantially, as in being isolating, inflectional, agglutinating, accusative, ergative, etc. Underneath, however, there is the universal hardware level which all speaker-hearers have in common. In accordance with Philip Lieberman (1984, 2000), DBS locates Chomsky’s intangible “language ability” in the human capability to produce and recognize raw acoustic data as language-dependent surfaces in the medium (CC 11.2.1) of speech. In evolution, the medium of writing, including Braille, was added later.

The functioning of a concept in nonlanguage recognition may be shown as follows:

6.1. Recognition: raw data matching type result in token

Figure 13.

The type defines the concept of a square. Replacing its variables with constants provided by the raw data, here α with 2cm, results in a concept token.
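The following sketch, assuming a toy format for both the concept type and the raw data, replaces the variable of the square type with the constant measured in the input:

```python
# Sketch of type-token matching for the concept square (6.1): the type
# leaves the edge length as a variable alpha; matching raw data binds it
# to a constant, yielding a token.  The raw-data format is an assumption.

square_type = {"edges": 4, "angle": 90, "edge_length": "alpha"}

def match_square(raw_data, concept_type):
    """Return a concept token if the raw data instantiate the type."""
    if (raw_data["edges"] == concept_type["edges"]
            and raw_data["angle"] == concept_type["angle"]):
        token = dict(concept_type)
        token["edge_length"] = raw_data["edge_length"]   # alpha := 2 cm
        return token
    return None

print(match_square({"edges": 4, "angle": 90, "edge_length": "2cm"}, square_type))
# {'edges': 4, 'angle': 90, 'edge_length': '2cm'}
```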

More schematically, the agent’s nonlanguage recognition of a square by means of type-token matching may be shown as follows:

6.2. Using a concept type in nonlanguage recognition

Figure 14.

As the action counterpart to 6.1, consider the adaptation of a concept type into a concept token by the agent’s cognition for a purpose:

6.3. Action: type-token adaptation results in raw data

Figure 15.

Adapting the type to a purpose results in a token which is realized as raw data.

More schematically, the agent’s nonlanguage action of drawing a square by means of a type-token adaptation may be shown as follows:

6.4. Using a concept token in nonlanguage action

Figure 16.

In DBS, the definition of concept types, corresponding concept tokens, and raw data relies on the natural sciences, here geometry. The raw input data are provided by the agent’s interface component. They are recognized as an instance of the two-dimensional shape square because there are four lines of equal length and the angle of their intersections is 90°.

The concept square may be extended to other two-dimensional geometric objects:

6.5. Similarity and difference between concept shape types

Figure 17.

For retrieving the correct type, i.e. the one best matching the raw data at hand, the examples expand the concept analysis by embedding the geometric shape types into non-recursive feature structures with ordered attributes which specify the sensory modality, the semantic field, and whatever else helps to retrieve the most suitable type.

On the one hand, the lines and angles of two-dimensional geometry have counterparts in neurology, such as the line, edge, and angle detectors in the visual cortex of the cat (Hubel and Wiesel 1962), and the iconic or sensory memories from which the internal image representations are built (Sperling 1960; Neisser 1967). On the other hand, robotic vision (Wiriyathammabhum et al. 2016) applies the natural science of optics in ways which differ markedly from the natural prototype (Pylyshyn 2009).

This is analogous to the difference between the natural flight of (i) birds, bats, and butterflies (flapping wings), and the artificial flight of (ii) airplanes (fixed wings) and (iii) helicopters (rotors), all of which satisfy the laws of aerodynamics (CLaTR Sect. 1.1). The list goes on with differences in earthbound locomotion (legs vs. wheels) and power supply (digestion vs. electricity).

Input-output equivalence with the natural prototype and alternative uses of the natural sciences are not in conflict. As illustrated in 8.3, input-output equivalence affects macro-processing while alternative uses of the natural sciences affect micro-processing (CLaTR Sect. 1.1).

7. Using concepts of color in recognition and action

Another homogeneous class of concepts are the colors. Their basic principles of recognition and production resemble those of the geometric shapes as follows:

7.1. Type matching raw data in color recognition

Figure 18.

The analyzed output token results from replacing the wavelength and frequency intervals of the type with the measurements of the raw input data.

More schematically, the agent’s nonlanguage recognition of colors by means of type-token matching may be shown as follows:

7.2. Direct use of a concept in nonlanguage recognition

Figure 19.

In the corresponding nonlanguage action, the type is adapted to a purpose, as in the cuttlefish Metasepia pfefferi turning on the color blue:

7.3. Type-token adaptation in color production (CC 11.3.4)

Figure 20.

Cognition adapts the type to a purpose by replacing the wavelength interval of 450–495 nm and frequency interval of 670–610 THz of the type with the specific values of 470 nm and 637 THz, resulting in a token. In the cuttlefish, these values are realized by natural actuators for color control (chromatophores) as raw data.
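The adaptation step may be sketched as follows, assuming a toy representation of the color type; only the replacement of intervals by specific values follows the text.

```python
# Sketch of type-token adaptation in color production (7.3): the blue type
# specifies wavelength/frequency intervals; adaptation to a purpose picks
# specific values inside the intervals, yielding a token for the actuators.

blue_type = {"color": "blue", "wavelength_nm": (450, 495), "frequency_THz": (610, 670)}

def adapt(color_type, wavelength_nm, frequency_THz):
    """Replace the intervals of the type with specific values (a token)."""
    lo_w, hi_w = color_type["wavelength_nm"]
    lo_f, hi_f = color_type["frequency_THz"]
    assert lo_w <= wavelength_nm <= hi_w and lo_f <= frequency_THz <= hi_f
    return dict(color_type, wavelength_nm=wavelength_nm, frequency_THz=frequency_THz)

blue_token = adapt(blue_type, 470, 637)   # values realized by chromatophores
print(blue_token)
```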

More schematically, the agent’s nonlanguage action of producing a color by means of type-token adaptation may be shown as follows:

7.4. Direct use of a concept in nonlanguage action

Figure 21.

What has been shown for the color concept blue may be extended to the whole class:

7.5. Similarity and difference between color concept types

Figure 22.

The three types differ in their wavelength and frequency intervals, and in their place holder and sample values; they share the sensory modality, semantic field, and content kind values. For convenience, the place holders are in English.

In summary, the use of concepts in recognition vs. action differs in the origin of the specialization from a concept type to a concept token. In recognition, the specialization of the type into a token happens in the pattern matching with raw data. In action, the type is specialized into a token by the agent’s cognition as an adaptation to a purpose, and the token is then used to produce suitable raw data.

8. Combining concepts into non-language content

The interaction of nonlanguage and language recognition and action is required for autonomous reference. For instance, requesting a robot to pick a blue square from a collection of colored two-dimensional geometric objects requires the artificial agent to understand the request, to recognize the object in question as the modifier|modified blue | square (reference), and to perform the action (hand-eye coordination). The following example shows the (a) concept type blue, the (b) associated nonlanguage proplet, and (c) the language proplet for German:

8.1. Embedding the concept blue into adj proplets

Figure 23.

Concept (a) is common to the proplets (b) and (c), serving as their respective core value. The difference between the non-language proplet (b) and language proplet (c) is the absence vs. presence of a sur feature. The concept of all three is a type because the wavelength and frequency values are intervals instead of constants.

In evolution, nonlanguage cognition precedes language cognition and supplies the latter with essential constructs such as the media, sensory modalities, content kinds, and mechanisms for the combination of concept tokens into complex nonlanguage content. Accordingly, DBS language cognition uses the same semantic kinds (referent, property, relation), the same content kinds (concept, indexical, name), the same computational mechanisms (type-token matching, pointing, baptism), and the same syntactic kinds (4.4) as DBS non-language cognition.

Using the same elements and the same mode of composition does not mean, however, that nonlanguage and language cognition must be identical. For example, it is not a foregone conclusion that the nonlanguage composition corresponding to English the+blue+square should code the contribution of the definite article as a separate nonlanguage proplet. It is just as possible that nonlanguage cognition treats definiteness in the sense of known or familiar as a property of square instead of a separate determiner, as in the following hypothetical operation:

8.2. Nonlanguage modifier | modified operation

Figure 24.

Here the value def originates in the noun rather than the determiner.

The corresponding operation with the place holders replaced by the explicit nonlanguage concepts blue and square at the content level is shown as follows:

8.3. Nonlanguage concatenation with explicit concepts

Figure 25.

By binding the explicitly defined concepts of the input content to the variables α and β serving as core values in the input pattern, the output pattern produces a result with explicitly defined values at the content level.

9. Hear mode operations with place holders

The equivalent English language composition of the blue square differs from the nonlanguage composition 8.3 in that the [sem: def sg] feature of the square proplet originates in a separate language proplet, the, such that three rather than two proplets require concatenation in the hear mode and production in the speak mode.

9.1. Cross-copying the and blue with DET×AND

Figure 26.

The core value of the determiner the matching the first input pattern is the substitution variable n_1. Cross-copying establishes the modifier|modified relation by writing blue into the mdr slot of the determiner the, and n_1 into the mdd slot of blue, in the output. The operation number h52 refers to the DBS hear mode grammar in TExer 6.3.1.

The next proplet provided by automatic word form recognition is square. The proplet matches the second input pattern of the hear mode operation DET∪CN.

9.2. Absorbing square into the with DET∪CN

Figure 27.

The connective ∪ indicates absorption by global substitution: the determiner proplet absorbs the common noun proplet by replacing all instances of the variable n_1 with the value square, after which the common noun proplet is discarded.

The time-linear hear mode concatenation of three lexical input proplets required two operations and resulted in two output proplets:

9.3. Result of the hear mode operations 9.1 and 9.2

Figure 28.

This equals the output of nonlanguage 8.2, except for the sur attributes.
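To make the two operation kinds concrete, here is a rough sketch of cross-copying (9.1) and absorption (9.2) over proplets represented as Python dictionaries. The attribute inventory and the operation interfaces are simplified assumptions; only the mdr/mdd cross-copying and the global substitution of n_1 follow the text.

```python
# Sketch of the two hear-mode operation kinds of Sect. 9 on proplets
# represented as dicts.  The attribute inventory is a simplified
# assumption; the mdr/mdd cross-copying and the global substitution of
# the variable n_1 follow the text.

the    = {"sur": "the",    "noun": "n_1",   "sem": ["def", "sg"], "mdr": []}
blue   = {"sur": "blue",   "adj":  "blue",  "mdd": None}
square = {"sur": "square", "noun": "square"}

def cross_copy(det, adj):
    """9.1: write blue into the mdr slot of the, and n_1 into the mdd slot of blue."""
    det["mdr"].append(adj["adj"])
    adj["mdd"] = det["noun"]              # still the substitution variable n_1

def absorb(det, adj, cn):
    """9.2: replace every instance of n_1 with the common noun's core value."""
    for proplet in (det, adj):
        for attribute, value in proplet.items():
            if value == "n_1":
                proplet[attribute] = cn["noun"]
    # the common noun proplet is discarded after absorption

cross_copy(the, blue)
absorb(the, blue, square)
print(the)    # {'sur': 'the', 'noun': 'square', 'sem': ['def', 'sg'], 'mdr': ['blue']}
print(blue)   # {'sur': 'blue', 'adj': 'blue', 'mdd': 'square'}
```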

10. Speak mode operations with place holders

The input to the speak mode is a content, defined as a set of non-language proplets connected with the semantic relations of structure, defined by address. Consider the nonlanguage content of Lucy found a big blue square:

10.1. Non-language content

Figure 29.

The two-place relation find combines the referents [person x] and square into the proposition [person x] find square. The property (modifier) big combines with square by functor-argument and with blue by coordination. The concepts person, find, big, blue, and square are shown by place holder values.

The equivalent graphical representation is shown in four views:

10.2. Semantic relations underlying speak mode

Figure 30.

View (i) is called the semantic relations graph (SRG). It is based on the proplets of the content 10.1 and uses the core values lucy, find, square, big, and blue as nodes. View (ii) is called the signature and uses the core attributes N for noun, V for verb, and A for adj as nodes. View (iii) is called the numbered arcs graph (NAG) and supplements the SRG with numbered arcs, which are used in view (iv), the surface realization. The surface realization consists of three parallel lines which show the navigation as it activates content in the think mode and optionally realizes the language-dependent surfaces in a speak mode which rides piggyback on the think mode navigation.

Assuming the content 10.1 as input to the agent’s speak mode, the intermediate speak mode navigation step in arc 4 proceeds from the noun proplet square to the adnominal modifier big with N↓A:

10.3. Navigating with N↓A from square to big (arc 4)

Figure 31.

The traversal sequence VN – N↓A complies with the Continuity Condition of DBS (NLC 3.6.5), according to which a think-speak mode operation AcB (‘c’ for connective) may only be followed by an operation BcC which accepts the output of AcB as input. The Continuity Condition is the think-speak mode counterpart to the time-linearity in the hear mode. In fact, the Continuity Condition of the speak mode is the source of the hear mode’s time-linear derivation order.

11. Universal: ambiguity and paraphrase are mutually exclusive

The respective uni-directional derivations of the speak and the hear mode go naturally with a universal asymmetry between ambiguity and paraphrase:

11.1. Definition of ambiguity and paraphrase

ambiguity (hear mode: several contents for the same surface)

paraphrase (speak mode: several surfaces for the same content).

It follows that ambiguity and paraphrase in natural language are mutually exclusive. Understandably, this has been overlooked by the sign-based substitution-driven approaches in linguistics, which analyze language without separate derivations for the speak and the hear mode, such as Phrase Structure Grammar including ST, EST, REST, GB, GPSG, HPSG, etc., and Categorial Grammar in its various guises.

12. Lexical vs. Structural ambiguity

A standard distinction in philology is between lexical vs. structural ambiguity. A well-known lexical ambiguity in English is Flying airplanes can be dangerous. One reading interprets flying as a present participle and is paraphrased as to fly airplanes. The other reading interprets flying as an adnominal modifier and is paraphrased as airplanes which fly. In DBS, the two readings are distinguished as follows:

12.1. Lexical ambiguity of flying airplanes

Figure 32.

In comparison, the structural ambiguity ‘Fido ate the bone on the table’ has an adverbial and an adnominal reading. On the adverbial reading, it is the eating which is on the table (TExer 3.2.2). Accordingly, the modifier|modified column is attached to the predicate:

12.2. Structural ambiguity: adverbial (Reading 1)

Figure 33.

On the adnominal reading, in contrast, it is the bone which is on the table (TExer 3.2.1). Accordingly, the modifier|modified column is attached to the grammatical object:

12.3. Structural ambiguity: adnominal (Reading 2)

Figure 34.

Both readings show the phrasal adnominal modifier iteration on_the_table under_the_tree in_the_garden (TExer Sect. 5.1).

13. Paraphrase

Speak mode paraphrase is the linguistic counterpart to hear mode ambiguity. Paraphrases of a content use the same semantic relations graph, but differ in their traversal order, as in the following example of the active-passive alternation (CLaTR Sect. 9.2):

13.1. Alternation between active and passive in English

Figure 35.

The active traverses the arcs of the NAG in the order 1, 2, 3, 4 (unmarked), while the passive traverses them in the order 3, 4, 1, 2 (marked).

The content common to the speak mode derivation of the active and the passive requires a transitive verb and is defined explicitly as the following set of proplets:

13.2. Content common to active-passive variants

Figure 36.

The difference between ‘John read a book’ (surface realization a) and ‘A book was read by John’ (surface realization b) is provided by the lexicalization rules embedded into the sur slot of the goal pattern of the navigation operations (17.6, 17.7). Another example of a speak mode paraphrase in English is an alternation involving the indirect and the direct object, as in ‘The man gave the child an apple’ vs. ‘The man gave an apple to the child’:

13.3. Surface alternation in 3-place verb proposition

Figure 37.

Variant a is based on the traversal order 1, 2, 3, 4, 5, 6, while variant b is based on the order 1, 2, 5, 6, 3, 4. The prepositional phrase ‘to_the_child’ is produced by the language-dependent lexicalization rule in the goal pattern of transition 3 of variant b.
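To illustrate how one content yields several surfaces, the following sketch, assuming hypothetical per-arc surface fragments in place of the actual lexicalization rules of 17.6 and 17.7, realizes the active and the passive of 13.1 simply by traversing the numbered arcs in different orders.

```python
# Sketch of speak-mode paraphrase by traversal order (13.1): the same NAG
# is traversed as 1, 2, 3, 4 for the active and 3, 4, 1, 2 for the passive.
# The surface fragments per arc are illustrative assumptions standing in
# for the language-dependent lexicalization rules.

ACTIVE_SURFACES  = {1: "John", 2: "read", 3: "a_book", 4: "."}
PASSIVE_SURFACES = {3: "A_book", 4: "was_read", 1: "by_John", 2: "."}

def realize(arc_order, surfaces):
    """Traverse the numbered arcs in the given order, emitting surfaces."""
    return " ".join(surfaces[arc] for arc in arc_order)

print(realize([1, 2, 3, 4], ACTIVE_SURFACES))    # John read a_book .
print(realize([3, 4, 1, 2], PASSIVE_SURFACES))   # A_book was_read by_John .
```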

While the examples of paraphrase in 13.1 and 13.3 are within the English language, there are similar kinds of paraphrase also between languages:

13.4. Word order difference between English and German

Figure 38.

The SRG, the signature, and the NAG are the same for English and German because they characterize semantic relations in the same content, independently of the language-dependent surface realization. The common content is defined as follows:

13.5. Content of ‘Peter has read the book.’

Figure 39.

The word order, however, differs between the two languages. In English it is Peter has read the book. and in German Peter hat das Buch gelesen. (Peter has the book read.), as shown by the alternative surface realizations in 13.4.

In communication, the word order differences between two languages originate in the language-dependent lexicalization rules, which may segment surfaces differently in the traversal of arcs. For example, the complex verb form ‘has_read’ in English is realized in navigation step 2 and the period in step 4. In German, in contrast, ‘hat’ is realized in step 2 and ‘gelesen_.’ in step 4 (Satzklammer).

14. Computational complexity and grammatical disambiguation

According to Transformational Grammar, the computational complexity degree of natural language is undecidable (formal proof by Peters & Ritchie 1973). According to DBS, natural language is linear (formal proof in TCS’92). How is this possible?

The different computational complexity degrees do not apply to natural language per se, but to different theoretical reconstructions of natural language: one as some spurious “language acquisition device,” the other as a means of communication between cognitive agents speaking the same natural language.

The argument for the linear complexity degree of DBS is as follows: the grammars of the DBS hear, think, and think-speak mode are all C-LAGs. In a C-LAG, the total number of elementary computational steps in any single operation application is below a grammar-dependent worst case limit (finite constant C). Thus, an individual DBS operation application alone cannot cause higher complexity. The only possibility for increasing complexity is a systematically repeating (recursive or iterative) ambiguity, i.e. the systematic generation of parallel readings.

In the speak mode, there are no parallel readings because there are no ambiguities, only paraphrases (Sect. 13). Alternative paraphrases result from the choice between continuations in the traversal of a given input content; the choice is decided by the agent’s rhetorical purpose. Human speakers cannot produce paraphrases simultaneously, and human hearers could not process them simultaneously.

In the hear mode, a reading of length n requires exactly n-1 operation applications. The only way to increase complexity in DBS is a systematic multiplication of readings with the length of the hear mode input. Such a recursive ambiguity is illustrated in formal language analysis (FoCL 11.5.8, SubsetSum), but cannot be found in natural language. The reason is grammatical disambiguation, as in the following example of what is known in linguistics as an ‘unbounded dependency’:

14.1. DBS graph of an unbounded dependency

Figure 40.

There is no grammatical limit on the number of possible object sentences intervening between the initial ‘Whom’ and the final clause, here ‘that Mary loves’. For example, ‘that Bill believes that Suzy said that Bob believes that Lucy said ...’ may theoretically be continued indefinitely with additional object clauses. However, with each new object clause there arises an obligatory choice between termination vs. continuation. For example, does John may be continued with love (no object clause, derivation terminates) or with say that (object clause, derivation continues). Because terminated ambiguities are discarded, i.e. do not add up in the result, the output is unambiguous:

14.2. Ambiguity structure of an unbounded suspension

Figure 41.

The time-linear cycle is terminated by the first verb which takes the initial Whom as its object. Because each continuation with another object clause n+1 removes the terminated variant n from the set of readings (grammatical disambiguation), the construction is complexity-theoretically harmless. As long as there are no natural language examples of recursively or iteratively repeating [+global] ambiguities, the Linear Complexity Hypothesis for natural language stands without counterexample.
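As a rough illustration of this bookkeeping, the following sketch simulates the choice points of 14.2 under the simplifying assumption that the word say signals a continuation; at every step the incompatible fork is discarded, so the number of surviving readings stays constant regardless of input length.

```python
# Sketch of grammatical disambiguation in the unbounded dependency of 14.1:
# each verb opens a fork between terminating (it takes the initial 'Whom'
# as object) and continuing with another object clause, but the next input
# resolves the fork at once, so the reading set never grows.

def hear(verbs):
    readings = [[]]                                          # one running reading
    for verb in verbs:
        forks = ([r + [(verb, "terminate")] for r in readings] +
                 [r + [(verb, "continue")] for r in readings])
        keep = "continue" if verb == "say" else "terminate"  # resolved by the input
        readings = [r for r in forks if r[-1][1] == keep]    # discard the other fork
    return len(readings)

print(hear(["say", "say", "say", "love"]))   # 1 -- no multiplication of readings
```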

15. Language communication

In primate evolution, nonlanguage cognition precedes language cognition. This raises the question of how nonlanguage cognition upscales to a natural language communication which is founded on cognition-external raw data. These data are without meaning or any grammatical properties whatsoever, but measurable by natural science:

15.1. Combining nonlanguage into language cognition

Figure 42.

The speak and the hear mode of (ii) language cognition reuse and combine the mechanisms of (i) nonlanguage recognition and action.

More specifically, nonlanguage and language cognition are alike in that they apply type-token matching to raw data input. They differ in that nonlanguage cognition applies type-token matching to nonlanguage content, while language cognition applies it to language surfaces. In the medium of speech, a surface token differs from its type by specifying volume, pitch, speed, timbre, etc., and in the medium of writing by specifying font, size, color, etc., i.e., what Aristotle would call the accidental (non-necessary) properties.

As an example of a language proplet with (i) a language-independent concept type and (ii) a language-dependent surface token consider the German word form blaues:

15.2. Proplet with surface token and concept type

Figure 43.

Type-token adaptation in speak mode surface production may be illustrated as follows:

15.3. Speak mode: content to surface type to raw data

Figure 44.

The input, i.e. the content token blue of nonlanguage cognition, retrieves the corresponding language-dependent surface, here the type of German b l a u e s, based on a list which provides allomorphs using the input proplet’s core, cat, and sem values. This output serves as input to a realization operation which adapts the surface type into a token, realized as raw data. Type-token instantiation in hear mode surface recognition may be illustrated as follows:

15.4. Hear mode: raw data to surface type to content

Figure 45.

The input consists of raw data which are provided by the agent’s vision sensors and matched by the letters’ shape types provided by the agent’s memory. The output replaces the shape types, here b l a u e s, with the matching raw data resulting in shape tokens; shown as b% l% a% u% e% s%, they record the accidental properties. The function crucial for the understanding of the hearer, however, is using the place holder, here blue, for the lexical look-up of the correct nonlanguage concept in 7.5.
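The two directions of surface handling in 15.3 and 15.4 may be sketched as follows. The allomorph list, the raw-data format, and the accidental properties are illustrative assumptions; only the type/token distinction and the place-holder look-up follow the text.

```python
# Sketch of the surface handling of 15.3 and 15.4.  The allomorph list,
# the raw-data format, and the accidental properties are illustrative
# assumptions.

ALLOMORPHS = {("blue", "adn"): "blaues"}     # German surface types by core/cat
CONCEPTS   = {"blue": {"wavelength_nm": (450, 495), "frequency_THz": (610, 670)}}

def speak(proplet, font="serif", size=12):
    """15.3: content token -> surface type -> raw-data token."""
    surface_type = ALLOMORPHS[(proplet["adj"], proplet["cat"])]
    return {"surface": surface_type, "font": font, "size": size}  # accidental props

def hear(raw_token):
    """15.4: raw data -> surface type -> place holder -> nonlanguage concept."""
    surface_type = raw_token["surface"]           # shape types matched to raw data
    place_holder = [k for (k, _), v in ALLOMORPHS.items() if v == surface_type][0]
    return place_holder, CONCEPTS[place_holder]   # lexical look-up as in 7.5

raw = speak({"adj": "blue", "cat": "adn"})
print(hear(raw))   # ('blue', {'wavelength_nm': (450, 495), 'frequency_THz': (610, 670)})
```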

The purpose of producing surface tokens in the speak mode and of recognizing surface tokens in the hear mode is content transfer. For transfer to be successful, speaker and hearer must use the same surface-content pairs. This requires speaker and hearer to have learned (stored in memory) the (a) content, the (b) surface, and the (c) convention connecting (a) and (b) in language acquisition. Adhering to grammatical well-formedness is a functional precondition (Lewis 1969) for the successful transfer of content from speaker to hearer.

As a computationally effective reconstruction of natural language communication, 15.1 requires the agent-based data-driven ontology of DBS. The explanation of the evolutionary transition from nonlanguage to language cognition is in the spirit of Charles Darwin and out of reach for a sign-based substitution-driven ontology.

PART II: SOME TECHNICAL DETAILS

In DBS, the output of the speak mode and the input to the hear mode is an agent-external sequence of word form surfaces in the form of raw data. Their strictly time-linear derivation order, imposed by input-output equivalence with the natural prototype, is called left-associative in computer science compiler construction. Left-associative composition adds one item after another at the end of a sequence and has the bracketing structure ((((((a b)c)d)e)f)... An unambiguous hear mode derivation consisting of n word form surfaces (length n) requires exactly n-1 left-associative compositions.

Each operation application of the DBS think, think-speak, and hear mode is of constant complexity (C-LAGs). Therefore the only way to increase computational complexity in DBS is recursive or iterative ambiguity. Because this kind of ambiguity does not occur in natural language (Sect. 14, grammatical disambiguation), the computational complexity degree of natural language in DBS is linear.

16. The four operation components of DBS

As a model of natural language communication, DBS consists of four operation components, all of which use the same left-associative (time-linear) derivation order:

16.1. Components using left-associative operations

Hear mode syntax:

a sentence start combines with a next word into a new sentence start.

Hear mode morphology:

a word form start combines with a next allomorph into a new word form start.

Speak mode navigation:

a start content combines with a next proplet into a new start content.

Speak mode surface realization:

a start surface combines with a next surface into a new start surface.

Left-associative operations are incremental and strictly time-linear.
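All four components follow the same compositional regime, which may be sketched with a single left-associative loop; the combine function below is a placeholder assumption standing in for the respective hear-, think-, or speak-mode operations.

```python
# Sketch of the shared left-associative regime of 16.1: each component
# combines a growing start with the next item, one step at a time.
# The combine function is a placeholder assumption.

def left_associative(items, combine):
    """((((a b) c) d) ...): exactly len(items) - 1 compositions."""
    start = items[0]
    for nxt in items[1:]:
        start = combine(start, nxt)       # time-linear: add at the end
    return start

surface = left_associative(["John", "washed", "the", "new", "car"],
                           lambda start, nxt: start + " " + nxt)
print(surface)    # John washed the new car
```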

17. The semantic relations of structure

In concord with the classical tradition in philology, DBS distinguishes four kinds of semantic relations of structure between proplets:

(1) subject/predicate

(2) object\predicate

(3) modifier|modified

(4) conjunct−conjunct.

The first three constitute the traditional notions of functor-argument, while the last is coordination. The first two are obligatory, while the latter two are optional.

DBS shows the four classical semantic relations of structure in (a) a graphical format and (b) a linear notation, and in the (i) hear and the (ii) speak mode.

17.1. (A, B) Static presentation of the 4 semantic relations

Figure 46.

Convention: in the linear notation of functor-arguments, the lower node in the binary graph precedes.

The direction of traversing a semantic relation of structure (activating a content in the think or the think-speak mode) is indicated by numbered arcs in the graph and by extending the slashes into arrows in linear notation, e.g. VN and NV:

17.2. (I, II) Dynamic presentation of the 4 semantic relations

Figure 47.

Convention: in the linear notation of functor-arguments, the start node precedes the goal node.

The following examples compare (i) a content in the DBS proplet format (set of proplets concatenated by semantic relations of structure, coded by address) with (ii) the equivalent DBS graph analysis:

17.3. Format I: content of John washed the new car.

Figure 48.

The corresponding DBS graph analysis presents the content in four views, which use the graphical connectives /, \, |, and −, and may be listed as follows:

17.4. The four views of a DBS graph analysis

(i) the semantic relations graph (SRG)

uses the core values of a content

(ii) the signature

uses the core attributes of a content

(iii) the numbered arcs graph (NAG)

adds numbered arcs to the SRG.

(iv) the surface realization

shows three parallel lines with each segment indicating

(a) the number of the arc traversed

(b) the language-dependent surface produced

(c) the operation name of the transition.

17.5. Format II: semantic relations graph for 17.3

Figure 49.

The numbered arcs of the NAG are traversed by think-speak operations which take a single start proplet as input and produce a single goal proplet as output:

17.6. Downward traversal with a think-speak operation

Figure 50.

This operation application shows the downward traversal from the predicate to the subject via arc 1 in the NAG of 17.5. The operation name, here VN, is followed by the operation number, here (s1), referring to the DBS speak mode grammar in TExer 6.5.1. In DBS, the content kind of name provides automatic word form production with the feature [sur: lexnoun()], which uses and overwrites the name marker in the goal proplet (CASM).

The corresponding upward traversal (arc 2 in 17.5) is provided by NV:

17.7. Upward traversal with a think-speak operation

Figure 51.

Automatic word form production is based on the function lexverb(), which uses the core value wash and the sem values ind past to realize wash-ed.
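The text does not spell out the internals of lexverb(); the following sketch shows one possible shape of such a lexicalization rule, with a hypothetical suffix table standing in for the English verb paradigms.

```python
# Minimal sketch of a lexverb()-style lexicalization rule (17.7): the core
# value wash and the sem values ind past are realized as wash-ed.  The
# suffix table is a hypothetical stand-in for the actual English paradigms.

SUFFIXES = {("ind", "past"): "ed"}

def lexverb(core, sem):
    """Realize a verb surface from its core value and sem values."""
    return core + SUFFIXES.get(tuple(sem), "")

print(lexverb("wash", ["ind", "past"]))   # washed
```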

18. Hear mode: automatic word form recognition

The raw data of the language surfaces transferring content from speaker to hearer occur in the following media (CC Sect. 11.2):

18.1. Media of natural language communication

(i) spoken,

(ii) written (including Braille), and

(iii) signed

The speak mode requires the input of content to produce language-dependent surfaces. The hear mode requires the input of language-dependent surfaces to produce content. The crucial problem of automatic word form recognition in the medium of speech is the segmentation of the input stream into word forms, e.g. the continuum ‘theolddoglooked...’ into ‘the+old+dog+looked+...’, and of the word forms into allomorphs, e.g. ‘looked’ into ‘look-ed’, for such semantic distinctions as number, tense, and mood.

In the medium of speech, voice independent segmentation and allomorph lookup are the tasks of automatic speech recognition (ASR), which is largely based on statistics. In the medium of writing, used by DBS, the solution is based on a trie structure, string search, and a lexicon of allomorphs (FoCL Chap. 14).
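The actual method is described in FoCL Chap. 14; the following toy sketch only illustrates the idea of combining a trie over an allomorph lexicon with longest-match string search. The lexicon entries and their lexical analyses are illustrative assumptions.

```python
# Toy sketch of trie-based allomorph segmentation.  The allomorph lexicon
# and its lexical analyses are illustrative assumptions.

ALLOMORPHS = {"look": {"cat": "v"}, "ed": {"cat": "suffix", "sem": "past"},
              "dog": {"cat": "n"}, "s": {"cat": "suffix", "sem": "pl"}}

def build_trie(lexicon):
    trie = {}
    for allomorph in lexicon:
        node = trie
        for ch in allomorph:
            node = node.setdefault(ch, {})
        node["$"] = allomorph              # end-of-allomorph marker
    return trie

def segment(surface, trie, lexicon):
    """Greedy longest-match segmentation of a word form into allomorphs."""
    analysis, i = [], 0
    while i < len(surface):
        node, match, j = trie, None, i
        while j < len(surface) and surface[j] in node:
            node = node[surface[j]]
            j += 1
            if "$" in node:
                match = node["$"]
        if match is None:
            return None                    # surface not in the allomorph lexicon
        analysis.append((match, lexicon[match]))
        i += len(match)
    return analysis

print(segment("looked", build_trie(ALLOMORPHS), ALLOMORPHS))
# [('look', {'cat': 'v'}), ('ed', {'cat': 'suffix', 'sem': 'past'})]
```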

Automatic word form recognition and production is the first obligatory step for any computational linguistic analysis of a natural language. Even when restricted to written language (which facilitates the task), it can keep a medium-sized team of researchers busy for decades, and then requires continuous maintenance. It involves not only recognition and production of surfaces, but also their lexical analysis. The programming is a major investment, but of solid long-term theoretical and practical use by many parties.

19. Hear mode: mapping surfaces into content

For development and maintenance, DBS word form lookup may be run in isolation. The input consists of huge lists of word forms, automatically produced from categorized base forms and the associated paradigms. The output is a list of unconnected word form proplets:

19.1. Isolated lookup of lexical proplets

Figure 52.

The method of choice is the allomorph approach (FoCL Chap. 14).

In contradistinction to isolated lookup, word form lookup in the hear mode is incrementally intertwined with syntactic-semantic composition:

19.2. Left-associative (incremental) lexical lookup

Sentence Start + Next Word ⇒ Next Sentence Start

John washed+the ⇒ John washed the

John washed the+new ⇒ John washed the new

John washed the new+car ⇒ John washed the new car etc.

Instead of the generic connective ‘+’, syntax-semantics in the DBS hear mode uses three differentiated connectives:

× for cross-copying

∪ for absorption

∼ for suspension

The hear mode derivation taking the sequence of lexical proplets in 19.1 as input results in the following content:

19.3. Content of John washed the new car.

Figure 53.

This content differs from the lexical lookup 19.1 in that the proplets are connected by semantic relations of structure, coded by address.

20. Storage and retrieval of content in the DBS database

An essential component of DBS is the agent’s on-board memory, defined as a content-addressable database. Consider the schematic comparison of 20.1 and 20.2:

20.1. Conventional database interaction

Figure 54.

Interaction takes place between different agents using the same external database and the same artificial language, e.g. SQL: programmer P controls the storage and the user U controls the retrieval operations.

20.2. Speaker and hearer interacting in communication

Figure 55.

Interaction takes place between different agents using different on-board databases and the same natural language. The transfer of content from speaker to hearer is completely automatic and based on agent-external raw data. In the speak mode, automatic word form production takes cognitive content as input and maps it into language-dependent surfaces as raw data output. In the hear mode, automatic word form recognition takes raw data as input and maps it into cognitive content as output (turn-taking). The content 19.3 is stored alike in both agents’ DBS databases as follows:

20.3. Database schema of content-addressable a-memory

Figure 56.

Horizontally, the DBS database schema consists of token lines, i.e. lists of proplets with the same core value, stored in the temporal order of arrival. Vertically, the token lines are in the alphabetical order induced by their proplets’ core value. The schema is content-addressable because it does not use a separate index (unlike an RDBMS).

The column of owners provides access to the token line of proplets to be stored or retrieved. In recognition, proplets are stored at the now front in the token line of their core value. In action, content is activated by navigating along the semantic relations between proplets, using the address for retrieval of the goal proplet. Because the semantic relations between proplets are coded by address, proplets are order-free: the semantic relations of structure between them hold regardless of where they are located.

Within the token lines, the field of member proplets is the permanent memory, which may never be changed. The only way to correct is by adding new content, as in a diary. The now front is the arena of current content processing. It is cleared at regular intervals by moving it (together with the owners) into fresh memory territory, leaving its content behind as permanent member proplets (loom-like clearance).
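A rough sketch of such a word bank, assuming a simplified proplet format and ignoring the owner column and the alphabetical ordering, illustrates token lines, the now front, and loom-like clearance:

```python
# Sketch of the content-addressable A-memory of 20.3: token lines keyed by
# core value, each split into permanent member proplets and a now front.
# The proplet format and the clearance interval are assumptions.

from collections import defaultdict

class WordBank:
    def __init__(self):
        self.token_lines = defaultdict(lambda: {"members": [], "now_front": []})

    def store(self, proplet, core):
        """Recognition: append at the now front of the core value's token line."""
        self.token_lines[core]["now_front"].append(proplet)

    def retrieve(self, core, prn):
        """Action: use the address (core value, prn) to retrieve a proplet."""
        line = self.token_lines[core]
        for proplet in line["members"] + line["now_front"]:
            if proplet["prn"] == prn:
                return proplet

    def clear_now_front(self):
        """Loom-like clearance: now-front content becomes permanent members."""
        for line in self.token_lines.values():
            line["members"].extend(line["now_front"])
            line["now_front"] = []

wb = WordBank()
wb.store({"noun": "square", "prn": 54}, core="square")
wb.clear_now_front()
print(wb.retrieve("square", 54))   # {'noun': 'square', 'prn': 54}
```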

PART III: DATA COVERAGE

The most detailed and extensive DBS data coverage for English so far is provided in TExer (318 pp.). Here we conclude with a brief overview at a high level of abstraction.

The expressions of natural language occur at three levels of grammatical complexity, traditionally called (i) elementary, (ii) phrasal, and (iii) clausal. Phrasal contents are built from elementary contents, and clausal contents from elementary and phrasal contents. A text is analyzed as a conjunction of sentential clausal contents.

The following SRG shows an extrasentential coordination. It is a three-sentence text, analyzed as a conjunction of three propositions, each with elementary arguments, and one with two phrasal arguments:

20.4. Clausal, elementary, and phrasal relations

Figure 57.

For explicit derivations see NLC, Chap. 13 (hear mode) and Chap. 14 (speak mode).

The next example compares two intrasentential constructions consisting of the same component contents, but connected in one by extrapropositional coordination (parataxis) and the other by subordination (hypotaxis):

20.5. Clausal coordination vs. Modification

Figure 58.

As contents, the constructions are semantically similar, but syntactically different. The variety of extrapropositional constructions results from the four different kinds of semantic relations of structure in natural language and is reflected graphically by the semantically interpreted /, \, |, and − lines (edges).

Finally consider the following variety of extrapropositional structures:

20.6. Relating two transitive verbs extrapropositionally

Figure 59.

Which kind of relation may connect the predicates of two component propositions depends on the verb class (Levin 2009).

21. Conclusion

In contradistinction to PSG, the DBS analysis of natural and formal languages does not use any non-terminal nodes. Instead, the nodes in a grammatical structure analysis are proplets, defined as nonrecursive feature structures with ordered attributes which take values of grammar and of content.

Instead of connecting nonterminal nodes with the notions of dominance and precedence (which are more appropriate for the social domain of pomp and circumstance than natural language semantics), DBS uses the classical semantic relations of subject/ predicate, object\predicate, modifier|modified, and conjunct−conjunct.

In contradistinction to substitution-driven systems like PSG and CG, DBS is input-output equivalent with the natural prototype. As a result, the computational complexity of DBS is of linear degree, in contrast to context-free Phrase Structure Grammar and Categorial Grammar, which are polynomial, and to Transformational Grammar, which is undecidable.

References

Aho, A.V. and J.D. Ullman (1977). Principles of Compiler Design, Reading, Mass.: Addison-Wesley.

Aristotle, Analytica Priora, in Aristoteles Latinus III.1–4, L. Minio-Paluello (ed.), Bruges–Paris: Desclée de Brouwer, 1962.

CASM = Hausser, R. (2017) “A computational treatment of generalized reference,” Complex Adaptive Systems Modeling, Vol. 5.1:1–26. Also at lagrammar.net and http://link.springer.com/article/10.1186/s40294-016-0042-7.

CC = Hausser, R. (2019) Computational Cognition: Integrated DBS Software Design for Data-Driven Cognitive Processing, pp. i–xii, 1–237, lagrammar.net .

CLaTR = Hausser R. (2011) Computational Linguistics and Talking Robots; Processing Content in DBS, pp. 286. Springer (preprint 2nd ed. at lagrammar.net).

FoCL = Hausser, R. (1999) Foundations of Computational Linguistics, Human-Computer Communication in Natural Language, 3rd ed. 2013, Springer.

Frege, G. (1892) “Über Sinn und Bedeutung,” Zeitschrift für Philosophie und philosophische Kritik, Vol. 100:25-50.

Gazdar, G. (1981) “Unbounded Dependencies and Coordinate Structure,” Linguistic Inquiry Vol. 12.2:155–184.

Harman, G. (1963)“Generative Grammar without Transformational Rules: a Defense of Phrase Structure,” Language Vol. 39:597–616.

Haspelmath, M., and S.M. Michaelis (2017) “Analytic and synthetic: Typological change in varieties of European languages,” in Isabelle Buchstaller & Beat Siebenhaar (eds.), Language variation – European perspectives VI: Selected papers from the 8th International Conference on Language Variation in Europe (ICLaVE 8), Leipzig 2015. Amsterdam: Benjamins.

Hubel, D.H., and T.N. Wiesel (1962) “Receptive Fields, Binocular Interaction, and Functional Architecture in the Cat’s Visual Cortex,” Journal of Physiology, Vol. 160:106–154.

Kripke, S. (1972) “Naming and Necessity,” in D. Davidson and G. Harman (eds.), Semantics of Natural Language, Dordrecht: Reidel, 253–355.

Levin, B. (2009) “Where Do Verb Classes Come From?” http://web.stanford.edu/~bclevin/ghent09vclass.pdf.

Lewis, D. (1969) Convention: a Philosophical Study, Hoboken, N.J.: Wiley-Blackwell.

Lieberman, P. (1984) The Biology and Evolution of Language, Cambridge, Mass.: Harvard University Press.

Lieberman, P. (2000) Human Language and our Reptilian Brain: The Subcortical Basis of Speech, Syntax, and Thought, Cambridge, Mass.: Harvard University Press.

Neisser, U. (1967) Cognitive Psychology, New York: Appleton-Century-Crofts.

NLC = Hausser R. (2006) A Computational Model of Natural Language Communication: Interpretation, Inference, and Production in DBS, pp. 360. Springer (preprint 2nd ed. at lagrammar.net).

Neumann, G. (1998) “Ambiguity vs. Paraphrasing” https://www.dfki.de/ ∼neumann/publications/diss/node47.html

Paul, H. (1889) Prinzipien der Sprachgeschichte, 2nd Ed., Halle a.S.: Max Niemeyer.

Peirce, C.S. (1903) ”Lectures on Pragmatism”, CP 5.171.

Peirce, C.S. (1931–1935) CP (Collected Papers), C. Hartshorne and P. Weiss (eds.), Cambridge, MA: Harvard Univ. Press.

Peters, S., and R. Ritchie (1973) “On the Generative Power of Transformational Grammar,” Information and Control Vol. 18:483–501.

Pylyshyn, Z. (2009) “Perception, Representation and the World: The FINST that binds,” in D. Dedrick & I.M. Trick (eds.) Computation, Cognition and Pylyshyn, Cambridge MA: MIT press.

Ross, J. R. (1986) Infinite Syntax!, Norwood, NJ: ABLEX.

Russell, B. (1905) “On denoting,” Mind, Vol. 14:479–493.

Saussure, F. de [1916](1972). Cours de linguistique générale, Édition critique préparée par Tullio de Mauro, Paris: Éditions Payot.

Sperling, G. (1960) “The Information Available in Brief Visual Presentations,” Psychological Monographs, Vol. 74, No. 11, Whole No. 498.

TCS = Hausser R. (1992) “Complexity in Left-Associative Grammar,” Theoretical Computer Science, Vol. 106.2:283-308.

TExer = Hausser, R. (2020) Twentyfour Exercises in Linguistic Analysis, DBS software design for the Hear and the Speak mode of a Talking Robot (lagrammar.net).

Wiriyathammabhum, P., D. Summers-Stay, C. Fermüller, and Y. Aloimonos (2016) “Computer Vision and Natural Language Processing: Recent Approaches in Multimedia and Robotics,” ACM Computing Surveys, Vol. 49.4:1–44.

