Fuzzy Representation Systems in Linguistic Semantics
An Empirical Approach to the Reconstruction of Lexical Meanings from East- and West-German Newspapers¹

Burghard B. Rieger
Arbeitsgruppe f. mathem.-empirische Systemforschung (MESY) German Department
Technical University of Aachen, Germany

Word semantics is gaining increasing importance within linguistics. Now that both formal and operational means have been devised to analyse and represent word connotation and/or denotation adequately, this paper discusses some of the empirical problems connected with the essentially varying and vague meanings of natural languages, how these can be analysed statistically from discourse data, and how they can be represented formally as a fuzzy system of vocabulary mappings.

Some examples computed from East- and West-German newspaper texts will be given to illustrate the feasibility of the approach.


1. INTRODUCTION

In talking about fuzzy representation systems in linguistic semantics I will confine myself to discussing the question of how lexical meanings may possibly be reconstructed empirically, i.e. analysed and represented. To tackle this problem of word semantics is to be concerned with at least two of its central aspects:


    a) the specification of the data base to be analysed automatically, and
    b) the sort of algorithmic procedures to be employed in view of both word-meaning analysis and representation.
Those of you who happen to remember my paper [8] presented here at EMCSR/3 will probably expect to find some application of fuzzy set theory to problem-area b), and you are of course perfectly right in doing so.

But before I turn to some of the procedures we have developed in the meantime, and to the results tested so far in the MESY group at the Technical University of Aachen, I will have to make some points on the frame conditions which the basic language material has to satisfy in order to make our procedures work. And this, of course, concerns problem-area a).

As these issues have been discussed at some length elsewhere [5], [6], [7], [9], I shall only be referring to them here. However, as problems of word-semantics should be discussed where and when they come up, I would like to give an account of the philosophy (so to speak) behind my approach. I will therefore have to spend some time on aspects of formal and descriptive theory construction and on the empirical complications to be expected in view of a semiotic domain like word-semantics.

Let me start with an introductory quotation which strikingly characterizes the situation: "Semantics, the study of meaning, has a long and eminently respectable history as an activity for philosophers, logicians, grammarians, philologists and linguists, but unfortunately the obviousness of meaning of words and discourse is matched by its eel-like slipperiness when the philosopher or linguist tries to catch it." This quotation from Sparck-Jones/Kay [14] will hopefully stimulate your expectations (if necessary), or (if appropriate) will prepare you to be left empty-handed at the end.


2. AIMS OF WORD SEMANTICS

According to Moskovich [3] it is a truism by now that there is no linguistic theory of semantics that could explain why automatic retrieval procedures do in fact work, and that there are quite a number of designers of indexing and retrieval systems who can do very well without any specific linguistic analysis of their material. And yet, when we look up linguistic theories of sentence- or even text-semantics on the one hand, and procedures of intellectual or statistical indexing systems on the other, and see what both of them can offer in respect of word-meaning, we will in either case be confronted with special word-lists.

The purpose of these lists, which may be relationally structured or just sequential, is to specify more or less comprehensively the conditions under which a term listed in a dictionary or thesaurus may be related to or even identified with certain meanings, represented by meaning-components, semantic-markers or semantic-descriptors. Thus, the dictionary in generative grammars may be considered a sequential word-list that specifies syntactic, semantic and perhaps pragmatic restrictions for each of its entries. These have to be observed for the proper insertion of elements, or groups of elements, into sentential or textual structures in order to generate or parse grammatically correct and meaningful surfaces. And a thesaurus in indexing systems can be regarded as a structured word-list that specifies the lexicological or conceptual relations of each of its entries. These will serve in turn as meaning descriptors which are assigned to elements or groups of elements in sentences or texts, to constitute relevant meaning descriptions.

On the basis of such listings, which provide different kinds of semantic information under each word-heading, sentence-semantics as well as indexing systems make use of word-meaning instead of analysing it. Apart from tentative departures within generative semantics or statistical indexing, no operational procedures have yet been devised for the semantic analysis and description of natural language terms which, when applied to language data, would yield a lexical structure as their result.

Now, this is what word-semantics should and could do, and where exactly the problems begin.


3. STATUS OF WORD SEMANTICS

If we agree that linguistics is, or at least ought to be an empirical discipline, then the paradigm of empirical sciences should be followed, although it needs modification in view of the scope of natural language semantics.

To adopt the paradigm of empirical sciences for linguistic research is tantamount to accepting at least two postulates:


    a) not to rely on ready-made theories or models taken from another domain, because these may be grossly inadequate with respect to the phenomena, and
    b) not to rely on the introspective exploration of one's own knowledge and competence as the allegedly inexhaustible data-source, although it may produce valuable initial ideas.

Instead, the investigation of linguistic problems in general, and that of word-semantics in particular, should start with hypotheses, formulated and reformulated for continuous estimation and/or testing against observable data, then proceed to incorporate the findings tentatively in some preliminary theoretical set-up which finally may perhaps be formalized to become part of an encompassing abstract theory. Our objective being natural language meaning, this operational approach would have to be what I would like to call semiotic. This term is meant to refer to certain new procedures which have in common that they do not insist on making imprecise phenomena precise [5]. According to Gaines [1] their descriptive and/or formal framework is designed to fit the phenomena, not to straiten the phenomena to fit a model or theory.

As Labov [2] and others have argued, prevailing linguistic theory, and linguistic semantics in particular, is dominated by what has been called the "categorial view". According to it, linguistic entities should either be discrete, invariant, qualitatively distinct, and composed of atomic primes, or else be of no use in linguistic theory at all. This view has led to the exclusion of very obvious object-level features of language usage, which only recently have begun to be recovered by linguistics proper, in some cases reluctantly but nevertheless continuously. Most prominent among these features is that of word-meaning itself, which, although recognized, is not yet an integral part of linguistic sentence- or text-semantics. Features of language variation on the morpho-phonetic level and those of vagueness on the lexico-semantic level are other well-known instances. They too are gaining increasing importance now that language usage regularities are investigated empirically.

These aspects of the object-level semiotic phenomena, however, are to be complemented by aspects of their formal notation. Hence, even theories of language performance, designed to account for phenomena like the vagueness or variation of word-meanings, have to meet basic conditions of theory construction. Consequently these entities should again be well-defined on the meta-theoretic levels of representation, where the dominance and validity of the "categorial view" has to be maintained for formal, simulative, or descriptive reconstruction even of semiotic phenomena.

My admittedly rough-and-ready distinction of object- and meta-theory, corresponding to different notational levels, requires some mediation. This can be provided, as I see it, formally by means of fuzzy set theoretical notations, and operationally by means of empirical procedures assigned to them. Applied to natural language data, they will interrelate observable but essentially fuzzy language phenomena on the one hand, and formal but finally categorial notations of their linguistic descriptions on the other.

Thus, findings and/or hypotheses on either side may become testable against each other, allowing for mutual modifications in the course of gradual improvement and increasing adequacy of the model and what it represents.


4. STRUCTURE OF MEANING

Up to this point we have been reflecting upon only one part of the problem or, if you would like to keep the picture, we have seen only one side of the slippery eel, namely, how semiotic phenomena (which are permanently experienced and observed in language use) should be accounted for by different notational levels of formal representation. What makes the study of natural language meaning an even more intricate problem is the other part of the picture, which concerns the particular nature of what has to be represented, namely, a representational structure in its own right. It is this representational aspect of language which traditional theories of semantics have particularly focussed on.

Figure 1

According to the most influential of them, natural language meaning can be characterized by its denotative and its connotative aspects (Fig. 1). Denotation is understood to constitute referential meaning as a system MD of relations between words or sentences of a language L and the objects or processes they refer to in W. Connotation is defined to constitute structural meaning as a system MC by which words or sentences of a language L are conceptually related to one another. Referential semantic theory is truth-functionally and formally elaborated but as such not prepared to account satisfactorily for the vagueness of natural language meaning, whereas structural semantics has considered vagueness fundamental to language but, being based mainly upon intuitive introspection, has not achieved the theoretical or methodological consistency of formal theories. Although the two approaches differ in what they consider natural language meaning to be, they nonetheless converge on the central notion of it as a relation between a representation (i.e. the body of natural language discourse) and that which it represents (i.e. referential or structural meaning constituted by this body).


5. ZADEH'S APPROACH

It is this thoroughly relational structure of meaning that obviously allowed the concept of fuzzy sets and relations to be employed to incorporate vagueness into referential theories of semantics.

The most recent, and indeed the most comprehensive, formal approach (at least that I know of) to tackle the problem of natural language meaning is that of L.A. Zadeh [15]. Under the acronym PRUF for "Possibilistic, Relational, Universal, Fuzzy" he has devised a meaning representation language for natural languages which is possibilistic instead of truth-functional, and whose dictionary provides linguistically labeled fuzzy subsets of the universe, instead of sets of semantic markers under word-headings.

The basic idea upon which this approach hinges is that a referential meaning may be explicated as a fuzzy correspondence between language terms and a universe of discourse. This correspondence, L, is formally defined to be a fuzzy binary relation from a set of language terms, T, to a universe of discourse, U. As a fuzzy relation, L is characterized by a membership-function

which associates with each ordered pair (x,z) its grade of membership FL(x,z), being a numeric value between 0 and 1, in L, so that

The fuzzy relation L now induces a bilateral correspondence according to which

Although formally satisfactory - as outlined and illustrated by PRUF - the basic assumption of the approach concerning the referential nature of natural meaning proves to be crucial for its empirical applicability: in order to determine the membership-grades of a fuzzy set, or fuzzy relation respectively, one has to have access to relevant empirical data defined to constitute the sets, and some operational means to calculate the numerical values from these data.
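The bilateral correspondence induced by the fuzzy relation can be made concrete with a small sketch. The following Python fragment assumes that the relation is simply stored as a table of membership grades; the terms, objects, grades, and the function names `meaning` and `description` are hypothetical illustrations and are not taken from PRUF. Fixing a term yields a fuzzy subset of U, and fixing an object yields a fuzzy subset of T.

```python
# Hypothetical toy relation: grades of membership for pairs (term, object).
relation = {
    ("red", "tomato"): 0.9,
    ("red", "orange"): 0.4,
    ("round", "tomato"): 0.7,
    ("round", "orange"): 1.0,
}


def meaning(term, relation):
    """M(x): the fuzzy subset of U obtained by fixing the term x."""
    return {z: grade for (x, z), grade in relation.items() if x == term}


def description(obj, relation):
    """D(z): the fuzzy subset of T obtained by fixing the object z."""
    return {x: grade for (x, z), grade in relation.items() if z == obj}


m_red = meaning("red", relation)
d_orange = description("orange", relation)
```

Both readings are two views of one and the same table of grades, which is why the empirical question reduces to how these grades can be obtained.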

As the domain of the fuzzy relation FL contains not only the set of terms of a language, T, but also the set of objects and/or processes these terms are believed to denote in the universe, U, both these sets should be accessible in order to let an empirical procedure be devised that could be assigned to FL. What Zadeh [15] offers in that respect remains empirically rather vague. He assumes that "each of the symbols or names in T may be defined ostensively or by exemplification. That is by pointing or otherwise focussing on a real or abstract object in U and indicating the degree - on the scale from 0 to 1 - to which it is compatible with the symbol in question" (p. 418).

This cannot be considered a solution which may be called both semiotic and operational in the sense given above. Taken to be executable, Zadeh's suggestion necessarily involves questioning test subjects about what they think or believe a term denotes. Thus, the procedure would again have to rely on the individual introspection of a multitude of competent speakers, instead of making these speakers employ the term's denotational and/or connotational function in the course of communicative verbal interaction. However, experimental psychology has taught us to expect considerable differences between what people think they would do under certain presupposed conditions, and what in fact they will do when these conditions are real. And there is every reason to assume that this difference is found in cases of language performance, too.

So it would appear more appropriate to make natural language usage the basis for identifying those regularities which real speakers/hearers follow and/or establish in discourse, and as a consequence of which natural language meaning (whatever that may be) can obviously not only be intended and understood, but may also be analysed and represented. As this seems to be the only certainty about meaning anyway, namely that it can only be constituted by means of natural language texts, these texts should also be able to provide the necessary data, with the advantage of being empirically accessible. Once the texts are assembled in a corpus, the usage regularities which the lexical items produce may thus be analysed statistically, with the numerical values obtained serving to define fuzzy vocabulary mappings [10].


6. EMPIRICAL RECONSTRUCTION

To follow this line of argument is to ask for a connotational supplement to the denotational approach Zadeh has put forward so far. This goes along with a necessary re-interpretation of what the sets T and U (1) in the referential meaning relation possibly stand for.

From a structural point-of-view, T is not just a set of terms of a language any more, but a system of lexical units the usage regularities of which induce a relational structure. This structure does not just allow for a set of objects and/or processes in U to be denoted, but it constitutes them as a system of concept-points, which is dependent on, but not identical with the one induced by the usage regularities of terms as employed and identified in natural language discourse [9].

Thus, being a non-symmetric, fuzzy, binary relation, FL can empirically be reconstructed only on the basis of natural language discourse data. So far, statistical procedures have been used for the reconstruction by a consecutive mapping in three stages from T to U, providing the membership-grades for FL.

On the first stage co-occurrences of terms are not just counted but the intensities of co-occurring terms in the texts of the database are calculated. This is done by a modified correlation-coefficient α that measures mutual (positive) affinity or (negative) repugnancy of pairs of terms x, x′ ∈ T by real numbers from the interval [-1, +1]. α can therefore be considered a fuzzy relation in the cross-product of the set of terms T used in the texts analysed

By conditioning this fuzzy relation α on the x_i ∈ T, we get a non-fuzzy mapping

This mapping assigns to each x ∈ T one and only one so-called corpus-point y defined by the n-tuple of membership-grades α(x_i, x) in the corpus space C

Each corpus-point y′ ∈ C may thus be considered a formal notation of the usage regularities, measured by grades of intensity, any one term x′ shows against all the other terms x_i ∈ T.
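As a minimal sketch of this first stage, assume that the correlation values are already available as a square table over the vocabulary. The numbers and the names `alpha` and `corpus_point` below are invented for illustration only. Conditioning on one term then amounts to reading off its values against all terms of the vocabulary:

```python
# Hypothetical vocabulary and hypothetical correlation values in [-1, +1].
terms = ["x1", "x2", "x3"]
alpha = [
    [1.0, 0.5, -0.2],
    [0.5, 1.0, 0.1],
    [-0.2, 0.1, 1.0],
]


def corpus_point(j, alpha):
    """Return the n-tuple of values that term j shows against all terms."""
    return tuple(row[j] for row in alpha)


# One corpus-point per term: a position in the n-dimensional corpus space.
corpus_points = {term: corpus_point(j, alpha) for j, term in enumerate(terms)}
```

Each term thus receives exactly one point, so the mapping itself is crisp even though the values it collects are graded.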

On the second stage the differences of usage are calculated. This is done by a distance measure δ_1, which yields real, non-negative, numerical values from an interval standardized to [0,1] to denote the distances between any two corpus-points y, y′ ∈ C. δ_1 can also be considered a fuzzy, binary relation in the set of all corpus-points y_i defined to constitute the corpus space

By conditioning this fuzzy relation δ_1 on the y_i (or - following (7) - the x_i respectively) we get a non-fuzzy mapping

This mapping assigns to each y ∈ C (or x ∈ T respectively) one and only one so-called meaning- or concept-point z defined by the n-tuple of distance-values in the semantic space U,

Each concept-point z′ ∈ U may thus be considered a formal notation of all the differences of all usage regularities, as a function of which the meaning of a term x′ ∈ T can be characterized.
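The second stage can be sketched in the same spirit. The fragment below assumes a Euclidean distance between corpus-points and, purely as an assumption of this sketch, divides by 2√n (the largest distance possible when all n coordinates lie in [-1, +1]) in order to standardize the values to [0, 1]; how the standardization is actually carried out is not specified here. The names `delta1` and `concept_point` and the sample points are hypothetical.

```python
import math


def delta1(y, y_prime):
    """Euclidean distance between two corpus-points, scaled to [0, 1].

    The scaling by 2 * sqrt(n) is an assumption of this sketch.
    """
    n = len(y)
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_prime)))
    return dist / (2.0 * math.sqrt(n))


def concept_point(name, corpus_points):
    """Return the tuple of distances from every corpus-point to that of `name`."""
    target = corpus_points[name]
    return tuple(delta1(y, target) for y in corpus_points.values())


# Hypothetical corpus-points (tuples of correlation values).
corpus_points = {
    "x1": (1.0, 0.5, -0.2),
    "x2": (0.5, 1.0, 0.1),
    "x3": (-0.2, 0.1, 1.0),
}
concept_points = {name: concept_point(name, corpus_points) for name in corpus_points}
```

Two terms used in much the same way against the rest of the vocabulary end up with a small distance, whatever their direct correlation with each other may be.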

Therefore it can be identified - according to (7) - with (4), i.e. the linguistic description, D(z′), of a concept-point z′ which is a fuzzy subset in T

On the third stage of the consecutive mapping, topological environments of concept-points will be calculated - in analogy to (8) - by a distance measure δ_2 which specifies the distances between any two z, z′ ∈ U. Thus again, δ_2 may also be interpreted as a fuzzy, binary relation in the set of all concept-points z_i defined to constitute the semantic space U

The conditioning of δ_2 on the z_i results in a non-fuzzy mapping

which assigns to each z ∈ U (and - following (10) - x ∈ T respectively) one and only one n-tuple of distances that - scaled according to decreasing values - will constitute the environment E(z)

Any such environment E(z′) can be considered a formal means to describe the position of a concept point z′ by its adjacent neighbours in the semantic space which is constituted by functions of differences of language usage regularities. E(z′) can therefore be identified - following (10) and (14) - with (3) the conceptual meaning, M(x′), of a term x′ which is a fuzzy subset in U
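A sketch of this third stage, again under the assumption of a Euclidean measure: the environment of a concept-point is obtained by measuring its distance to every other concept-point and ordering the result. The function names and sample values are hypothetical, and listing the nearest neighbours first is only one reading of the ordering described above.

```python
import math


def delta2(z, z_prime):
    """Euclidean distance between two concept-points (not standardized)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(z, z_prime)))


def environment(name, concept_points):
    """Return (neighbour, distance) pairs for `name`, nearest first."""
    centre = concept_points[name]
    pairs = [
        (other, delta2(centre, z))
        for other, z in concept_points.items()
        if other != name
    ]
    return sorted(pairs, key=lambda pair: pair[1])


# Hypothetical concept-points (tuples of distance values).
concept_points = {
    "a": (0.0, 0.3, 0.9),
    "b": (0.3, 0.0, 0.8),
    "c": (0.9, 0.8, 0.0),
}
env_a = environment("a", concept_points)
```

Read as a list of descriptors with attached values, such an environment has the same shape as the entries a table of conceptual meanings would contain.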

We are now in the position to assign to the fuzzy relation

and the two-sided correspondence (3) and (4) induced by it, the following operations.

The two distance measures δ_1 (8) and δ_2 (12), operating on numerical data obtained from the correlational analysis (5) of lexical items employed in a corpus of natural language texts, will determine the membership-grades to be associated with (16), namely for the correspondence (4) induced by FL according to (9) inserting

and for its inversion the correspondence (3) according to (13) inserting

This concludes the empirical reconstruction, leaving open only the coefficients alluded to above.

Given the lemmatized vocabulary V as a proper subset of T of lexical units

employed in a corpus K of natural language texts as specified above

where

is the sum S of all text-lengths s_t measured by the number of lexical units (tokens) in the corpus, and

is the total frequency H of a lexical unit x (type) computed over all texts in the corpus, then the modified correlation-coefficient α to be inserted into (5) reads

The distances δ_1 (8) and δ_2 (12) have been calculated according to the Euclidean measure which reads

As these distance measures are to be considered the metric of the corpus space C and the semantic space U respectively, it should be noted here that so far the assumption that the metric is Euclidean is nothing but a first (although operational) guess. Experiments with different and more sophisticated distance measures are currently being undertaken, which might eventually prove more adequate in modelling the structures of word-semantic systems.
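To give the correlation coefficient described above a concrete shape, the following is a sketch under stated assumptions rather than the formula itself. It assumes that the modification consists in measuring, for each text t, the deviation of a term's observed frequency from the frequency expected from its total frequency H and the text's share s_t / S of the corpus, and that these deviations are then combined as in an ordinary product-moment correlation. The function name `alpha`, the argument layout, and the toy counts are hypothetical.

```python
import math


def alpha(h_i, h_j, lengths):
    """Correlation-like coefficient in [-1, +1] for two terms.

    h_i, h_j: per-text frequencies of the two terms.
    lengths:  per-text lengths s_t in tokens.
    Assumption of this sketch: the expected frequency of a term
    in text t is H * s_t / S.
    """
    S = sum(lengths)
    H_i, H_j = sum(h_i), sum(h_j)
    dev_i = [h - H_i * s / S for h, s in zip(h_i, lengths)]
    dev_j = [h - H_j * s / S for h, s in zip(h_j, lengths)]
    num = sum(a * b for a, b in zip(dev_i, dev_j))
    den = math.sqrt(sum(a * a for a in dev_i) * sum(b * b for b in dev_j))
    # A term that never deviates from expectation carries no information here.
    return num / den if den else 0.0


lengths = [10, 10, 10]  # three hypothetical texts of 10 tokens each
freq = {"a": [2, 0, 1], "b": [4, 0, 2], "c": [0, 3, 0]}

affinity = alpha(freq["a"], freq["b"], lengths)    # terms sharing texts
repugnancy = alpha(freq["a"], freq["c"], lengths)  # terms avoiding each other
```

Under these assumptions, terms that cluster in the same texts come out near +1 and terms that avoid each other come out negative, which matches the verbal description of affinity and repugnancy.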


7. EXAMPLES

Table 1 and Table 2

Linguistic Description D(z) and Conceptual Meaning M(x) of the lemmatized lexical entry ELEKTRO/NISCH (electro/nic) as employed in texts of German from the newspapers DIE WELT (West) and NEUES DEUTSCHLAND (East) calculated according to (17) and (18); the values listed behind the descriptors, however, have not been standardized to the unit interval.

To show the feasibility of the empirical approach, and not to leave you completely empty-handed at the end, the following examples of linguistic description D(z) and of conceptual meaning M(x) may serve as an illustration. They are taken from the data of a pilot-study on semantic differences in lexical structure [11] that has been done within a major project on East-West-German language comparison.

So far, two samples from corpora consisting of texts from the East-German newspaper "Neues Deutschland" and the West-German newspaper "Die Welt" have been analysed according to the procedures outlined. Although the samples analysed are rather small - approximately 3000 running words (tokens) of roughly 300 lemmatized words (types) - the results look quite promising to the native speaker of German. In mapping the connotational differences which some morphologically identical German lexical entries have developed almost simultaneously after twenty years of usage in a divided country's rather strictly separated populations, the pilot-study's results seem to indicate that, linguistically, an additional analysis of comparable text-corpora of earlier and/or later years could provide the diachronic complement to the so far synchronic investigation into the lexical structures concerned. This would allow for the empirical reconstruction not only of their instantaneous word-meanings, but also of the time-dependent procedural changes that Nowakowska [4] aims at. Being induced by varying language usages, these changes can operationally be analysed as regularities followed and/or established by language users to differing degrees, which hence may formally be represented as functions that constitute dynamic systems to model semiotic structures.

In Tables 1 and 2 above, the linguistic description D(z) of a concept point z is given as well as the conceptual meaning M(x) of a vocabulary term x, from both of the newspaper corpora. Further details may be found in Rieger [11].


ACKNOWLEDGEMENT

This paper, parts of which were presented in German under the title "Probleme der automatischen Wortsemantik" at the joint Annual Meeting of the Association for Literary and Linguistic Computing (ALLC) and the LDV-Fittings e.V. at the University of Bonn, Germany, in December 1979, is an abbreviated version of Rieger [12]. It draws on a pilot-study [11] which took up some model construction resulting from a project in Empirical Semantics supported by the Northrhine-Westphalia Ministry of Science and Research (II B 6 - FA 7519), applied to the language data provided by the German Research Foundation's project on East-West-German language comparison (DFG He 983/1/2).

I would like to thank Dr. H.M. Dannhauer for providing the necessary programming to process these language data at the Technical University of Aachen Computing Center.


REFERENCES

1. GAINES, B.R.: "System Identification, Approximation, and Complexity", Intern. Journ. General Systems 3 (1977) 145-174
2. LABOV, W.: "The boundaries of words and their meaning" in: Bailey/Shuy (Eds.): New Ways of Analyzing Variation in English, Washington 1973, 340-373
3. MOSKOVICH, W.: "Perspective Paper: Quantitative Linguistics" in: Walker/Karlgren/Kay (Eds.): Natural Language in Information Science, Stockholm 1977, 57-74
4. NOWAKOWSKA, Maria: "Semiotic Systems, knowledge representation, and memory" in: Rieger, B. (Ed.): Empirical Semantics. A Collection of New Approaches in the Field, Bochum 1980 (forthcoming)
5. RIEGER, B.: "Bedeutungskonstitution. Einige Bemerkungen zur semiotischen Problematik eines linguistischen Problems", Zeitschrift für Literaturwissenschaft und Linguistik, LiLi 27/28 (1977) 55-68
6. RIEGER, B.: "Vagheit als Problem der Linguistischen Semantik" in: Sprengel/Bald/Viethen (Eds.): Semantik und Pragmatik, Tübingen 1977, 91-101
7. RIEGER, B.: "Unscharfe Semantik natürlicher Sprache. Zum Problem der Repräsentation und Analyse vager Bedeutungen", Nova Acta Leopoldina 1978 (in print)
8. RIEGER, B.: "Fuzzy Structural Semantics: on a generative model of vague natural language meaning" in: Trappl/Hanika/Pichler (Eds.): Progress in Cybernetics and Systems Research, Vol. V, New York/London/Sydney 1979, 495-503
9. RIEGER, B.: "Revolution, Counterrevolution or a New Empirical Approach to Frame Reconstruction instead?" in: Petöfi, J.S. (Ed.): Text vs. Sentence. Basic Questions of Textlinguistics, Vol. II, Hamburg 1979, 555-571
10. RIEGER, B.: "Repräsentativität: von der Unangemessenheit eines Begriffs zur Kennzeichnung eines Problems linguistischer Korpusbildung" in: Bergenholtz/Schaeder (Eds.): Textcorpora. Materialien für eine empirische Textwissenschaft, Kronberg/Ts 1979, 52-70
11. RIEGER, B.: "Ein statistisches Verfahren zur lexikalisch-semantischen Beschreibung des in Texten verwendeten Vokabulars im Rahmen eines Strukturmodells unscharfer (fuzzy) Wortbedeutungen" in: Hellmann, M.W. (Ed.): Ost-West-Wortschatzvergleich. Sprache der Gegenwart, Schriften des Instituts für deutsche Sprache 48, Düsseldorf 1980 (forthcoming)
12. RIEGER, B.: "Feasible Fuzzy Semantics. On some problems of how to handle word meaning empirically" in: Eikmeyer/Rieser (Eds.): New Approaches in Word Semantics, Berlin/New York 1980 (forthcoming)
13. RIEGER, B. (Ed.): Empirical Semantics. A Collection of New Approaches in the Field, Bochum 1980 (forthcoming)
14. SPARCK-JONES, Karen/Kay, M.: Linguistics and Information Science, New York 1973, p. 120
15. ZADEH, L.A.: "PRUF - a meaning representation language for natural languages", Intern. Journ. Man-Machine Studies 10 (1978) 395-460

Footnotes:

¹ Published in: Trappl, R./Findler, N.V./Horn, W. (eds.): Progress in Cybernetics and Systems Research, Vol. XI, Washington/New York/London (McGraw-Hill Intern.) 1982, pp. 249-256.