NooJ: A Linguistic Development Environment

 

ELLIADD, Université de Franche-Comté

Max Silberztein, Professor, Université de Franche-Comté

I wish to express many thanks to my colleagues and students and all NooJ users who have contributed to help enhance INTEX, and now NooJ, with their patience, criticisms, creative ideas and ambitious expectations.

1991-1997: M. Silberztein developed INTEX (v1-v3) at the LADL laboratory of the Université Paris 7 (Prof. Maurice Gross) for the NextStep Operating System, using the Objective-C programming language.

1997-2002: M. Silberztein developed INTEX (v4) from the ground up for the Windows OS at the Université de Franche-Comté, in C++. M. Silberztein entrusted INTEX sources and resources to Maurice Gross (Univ. Paris 7) before leaving for the USA.

Unitex was developed in 2002 at the Université Marne-La-Vallée by Sébastien Paumier, under the supervision of Eric Laporte, with funding from the ANR French National project OUTILEX, without the consent, or even the knowledge, of the INTEX author (in the USA for 3 years) or of his employer, the Université de Franche-Comté. Neither Sébastien Paumier, nor Eric Laporte, nor any member of the Université de Marne-La-Vallée, nor any employee of the companies of the OUTILEX consortium, had ever participated in the development of INTEX.

Short extract and translation of the Unitex report published since 2002
at the INTEX page of the Université de Franche-Comté Web site 
(never contradicted)

1. Linguistic Resources

The way INTEX represents grammars graphically is an original invention, presented in:

Max Silberztein, 1993. Dictionnaires électroniques et analyse automatique de textes : le système INTEX. Masson Eds: Paris. 

Max Silberztein has replaced finite-state automata states and transitions with nodes; only one initial node and one terminal node are allowed; outputs of finite-state transducers are displayed in bold under nodes, clickable auxiliary nodes refer to embedded graphs; unconnected nodes are treated as comments, etc. These conventions are exactly the same in Unitex.

A number of linguistic resources were taken from the INTEX package and commented on in the Unitex documentation without their author's consent or any citation, e.g.:

Fig.1: INTEX Unitex

Notes (such as “Règle no1” [Rule#1]) refer to the INTEX manual... but are nowhere mentioned in the Unitex manual.

The red colour used for comments in INTEX became blue in Unitex, which is inconsistent with the Unitex parameters (see Fig. 7). C++ and Java use inverted byte orders to represent RGB colours: this mistake proves that Sébastien Paumier used the INTEX C++ sources and converted them into Java to construct Unitex.
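
The byte-order argument can be illustrated with a minimal sketch. It assumes that the colour was stored in the Win32 COLORREF layout (0x00BBGGRR) and then read as a Java-style packed RGB integer (0x00RRGGBB); both layouts are documented conventions, but the assumption that INTEX stored its colours as COLORREF values is made here only for illustration. The function names are hypothetical.

```python
def unpack_colorref(value):
    """Read a Win32 COLORREF (0x00BBGGRR) as an (r, g, b) tuple."""
    return (value & 0xFF, (value >> 8) & 0xFF, (value >> 16) & 0xFF)


def unpack_java_rgb(value):
    """Read a Java-style packed RGB integer (0x00RRGGBB) as an (r, g, b) tuple."""
    return ((value >> 16) & 0xFF, (value >> 8) & 0xFF, value & 0xFF)


# Pure red written in the COLORREF layout (assumed storage format).
RED_AS_COLORREF = 0x000000FF

print(unpack_colorref(RED_AS_COLORREF))  # (255, 0, 0): red
print(unpack_java_rgb(RED_AS_COLORREF))  # (0, 0, 255): blue
```

Under these assumptions, the same stored integer is red for one reader and blue for the other, which is the kind of swap described above.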

Fig.2: INTEX Unitex

The figure presented in the Unitex manual is a cropped screenshot of INTEX. This proves that Sébastien Paumier had INTEX installed on his PC when he wrote the Unitex manual.

2. Interface and Functionalities

The two programs have the exact same functionalities, which were unique and could not be found in any other software at the time, and they share an almost identical user interface, e.g.:

Fig.3: INTEX Unitex

Note that in Windows applications, the Edit menu is usually located in the second position from the left (right after File/Text). There is no reason why Unitex would follow INTEX's idiosyncratic presentation, unless of course it is just a quick copy.

Fig.4: INTEX Unitex

The Unitex presentation of results is a partial copy of INTEX's. As Unitex does not display which lexical resources were used to process the text, the list of unrecognized words is useless. Sébastien Paumier copied INTEX's interface without fully understanding what INTEX was displaying, and why.

Fig.5: INTEX Unitex

 

Fig.6: INTEX Unitex

The menu item "FSGraph" corresponds to INTEX's FSGraph graphical editor. Sébastien Paumier kept this menu name in Unitex, even though Unitex's graphical editor is supposed to be named "Unigraph", as seen in Fig. 12. This mistake proves that the Unitex interface is just a quick copy of INTEX's.

Fig.7: INTEX Unitex

The presentation options are exactly the same but are not consistent in Unitex: for instance, "Comment nodes" are supposed to be displayed in red, but they are in fact displayed in blue, as seen in Fig. 1.

Fig. 8: INTEX Unitex

3. Methodology

The method used to perform lexical and morphological analyses of texts, as well as the representation of intermediate analyses, are identical in the two programs:

Fig.9: INTEX Unitex

The differences in presentation are superficial: "Remove Xxx lexical items" is functionally the same as "Clean Text FST", and INTEX's check boxes (e.g. dlf and dlc) were replaced with a label (where dlf corresponds to DLF and dlc to DLC).

Fig.10: INTEX (transducer of the text) Unitex (transducer of the text)

The difference in background color is meaningful in INTEX (the text is read-only whereas the graph can be edited). In Unitex, the text's background is white, even though it is read-only. Sébastien Paumier did not know the common convention.

The tools used to process inflectional morphology, as well as their resources are the same, e.g.:

Fig.11: INTEX (inflection of Noun “cheval”) Unitex (inflection of Noun “cheval”)

Here too, the Unitex manual contains a cropped screenshot of INTEX, which proves that Sébastien Paumier had INTEX open on his PC while he was writing the Unitex manual. Symbols and code formats used in morphological and syntactic grammars are identical; dictionary maintenance tools are identical (sometimes renamed, e.g. “recondic” became “checkDic”); Unitex uses the exact same algorithms as INTEX to sort dictionaries (even though Unitex does not need to, since it uses Unicode); dictionaries are managed the same way (e.g. 3 levels of priority) and applied the same way to texts.

4. Programs, Algorithms and File Formats

The two software packages use the same programs, used the same way and built on the same algorithms, with transparent modifications, e.g. replacing “char” with “wchar”, converting C++ methods into Java, etc.

Several Unitex functionalities can only be explained by the fact that Unitex is a quick copy of INTEX. For instance, INTEX needs to allow users to describe lexicographic orders for each language. As Unitex is based on Unicode, which includes support for lexicographic orders, it should not need this functionality. Why then does Unitex use the exact same functionalities and method as INTEX to describe lexicographic order, rather than simply drop them in favour of Unicode?
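
What "describing a lexicographic order" means can be sketched as follows. This is not the INTEX or Unitex mechanism or file format: the alphabet string and the `alphabet_key` helper are hypothetical, and the ordering chosen (each accented letter placed right after its base letter) is only one possible convention. For comparison, the sketch also shows the order obtained by sorting on raw Unicode code points; language-aware ordering on the Unicode side is the job of the Unicode Collation Algorithm, which is not implemented here.

```python
import unicodedata

# Hypothetical user-supplied alphabet: the position of a letter is its rank.
ALPHABET = unicodedata.normalize("NFC", "aàâbcçdeéèêëfghiîïjklmnoôpqrstuùûüvwxyz")
RANK = {ch: i for i, ch in enumerate(ALPHABET)}


def alphabet_key(word):
    """Sort key: rank of each letter in ALPHABET; unknown characters go last."""
    word = unicodedata.normalize("NFC", word.lower())
    return [RANK.get(ch, len(ALPHABET) + ord(ch)) for ch in word]


words = [unicodedata.normalize("NFC", w)
         for w in ["été", "zèbre", "cheval", "école", "eau"]]

# Raw code-point order: "é" (U+00E9) sorts after "z".
print(sorted(words))
# Order driven by the alphabet description above.
print(sorted(words, key=alphabet_key))
```

With the code-point sort, "école" and "été" land after "zèbre"; with the described alphabet they are placed next to "eau".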

File formats are either identical, or almost identical, e.g.:

Fig.12a: INTEX Unitex
Fig.12b: INTEX Unitex

The header “FSGraph” (name of INTEX's graph editor) that was included in the linguistic resources constructed by Max Silberztein and his colleagues with INTEX was replaced with the header “Unigraph” and distributed in various Unitex packages, in order to pretend that they were developed with Unitex.

The order of the parameters used to describe INTEX graphs (DBOX, DFRAME, DDATE, etc.) is not relevant. The fact that they are not in alphabetical order (e.g. DFRAME before DDATE) and yet appear in the same order in INTEX and Unitex is such an unlikely coincidence that it can only be explained if INTEX sources were copied into Unitex.

Unitex’s GUI still kept the menu item “FSGraph” (as seen in Fig.6).

5. Conclusion

See the statement from Claude Condé, Dean, and Jean-Marie Viprey, member of the Scientific Committee of the Université de Franche-Comté:

In the autumn of 2002, the scientific community saw a new linguistic recognition software package named Unitex appear on the Web site of the Institut Gaspard-Monge of the Université de Marne-la-Vallée, signed by Mr. Sébastien Paumier under the responsibility of Prof. Eric Laporte.
Every researcher or user familiar with the INTEX software, created by Max Silberztein first at the LADL (CNRS-Univ. Paris 7) and then, since 1997, at the Université de Franche-Comté, has observed that the methodology, architecture, programs, user interface and documentation of Unitex are almost identical to those of INTEX. Moreover, numerous files (dictionaries, graphs, etc.) were included in the Unitex package, despite the prohibition stated on the INTEX Web site: “None of the programs and linguistic resources included in the INTEX package should be copied, redistributed, incorporated into other software, or published without their author’s consent and proper citation.”
The appearance of Unitex raises many questions. What interest can a CNRS-affiliated laboratory have in publishing an outdated duplicate of INTEX, when the members concerned, starting with Mr. Paumier himself, belonged to the INTEX community, as shown by their regular participation in the INTEX Workshops [Journées INTEX], up to May 2002 in Marseille? Let us recall that the Université de Marne-la-Vallée used INTEX intensively for both its teaching and its research needs.
If, as has been said, the aim was to overcome certain limitations of INTEX with regard to portability, why was this problem not raised openly within the INTEX community, or even within the NLP community as a whole? Where are the traces of the discussion, and of the failure of that discussion, that would justify the unfriendly and violent break that the sudden emergence of Unitex represents? For several months, the very name of INTEX, or of its author, appeared neither in the documentation nor on any of the Web pages related to Unitex.
Our point of view is explicit: INTEX was designed by Prof. Max Silberztein at the Université Paris 7, and has been developed since 1997 at the Université de Franche-Comté. We are ready to discuss, in any public setting whatsoever, the reasons that led Messrs. Laporte and Paumier to consider that they had the right to build a copy of INTEX and to present it as an original work. We are still, together with the administration of our University, seeking scientific arbitration.
In the meantime, we wish to make this point clear to researchers and users: paradoxically, although Unitex does appear to be largely a copy of INTEX, the futures of these two software packages are antagonistic. The way the Marne-la-Vallée team has treated the scientific community suggests a very peculiar conception of scientific activity and of the mission of teacher-researchers.

— Pr. Claude Condé
Head of the SLHS Department (Arts and the Humanities) of the Université de Franche-Comté.
— Pr. Jean-Marie Viprey
Scientific Committee of the Université de Franche-Comté.

(1) Here is a link that defines software counterfeit in France: 
http://www.app.asso.fr/centre-information/base-de-connaissances/code-logiciels/la-protection-du-logiciel/agir-contre-les-atteintes

(2) Here is an extract of the Wikipedia entry about authors' rights in France [Droit d'auteur]:
"...the moral right, which recognizes in particular the author's authorship of the work and respect for its integrity. In some countries, including France, it is perpetual, inalienable and imprescriptible"
[...le droit moral, qui reconnait notamment à l'auteur la paternité de l’œuvre et le respect de son intégrité. Dans certains pays, dont la France, il est perpétuel, inaliénable et imprescriptible]

**********

Unitex's interface, methodology, file formats, functionalities and programs as well as its inconsistencies and useless functionalities prove that Unitex was written quickly by someone who had full access to INTEX sources, and copied them without always understanding what he was doing.

If you are still interested in using Unitex and supporting its promoters, you may have fun playing "Spot the difference" by looking at the INTEX manual:

Silberztein, M. (2000). INTEX Manual. Available at http://intex.univ-fcomte.fr (230 pages).

If you would rather pass, check out the NooJ linguistic platform, which implements an original and innovative methodology backed by powerful computational algorithms, described in:

Silberztein, M. (2015). La formalisation des langues : l’approche de NooJ. ISTE Ed.: Londres. (429 pages).

Silberztein, M. (2016). The Formalisation of Natural Languages: the NooJ approach. Wiley Eds.: Hoboken NJ, USA (346 pages).

NooJ is free, open-source software (GPL), endorsed and distributed by the European Community (Metashare program). It is used by over 100 honest researchers in the world to describe over 30 languages, and contains over 20 linguistic, computational and statistical functionalities that make it objectively more technically and scientifically valuable than Unitex:

- NooJ represents texts in UTF-8, which is more efficient than Java's UTF-16
- NooJ manages equivalent characters, absent diacritics (e.g. in Arabic), ligatures (e.g. "oe" = "œ") and Unicode combined characters
- NooJ corpus processor handles over 150 file formats, including XML, PDF, WORD, etc.
- NooJ does not need to produce DELAF- nor DELACF-type dictionaries; this is crucial for heavy-morphology languages such as Hungarian
- NooJ dictionaries handle spelling variants and synonyms in a unified way
- NooJ dictionaries handle simple words, contracted words (e.g. "can't") and multiword units (e.g. "as a matter of fact") in a unified way (no need to separate simple words from multiwords)
- NooJ dictionaries can represent multilingual information accessible from any grammar (that allows for easily written MT systems)
- NooJ handles lexicon-grammars by pairing dictionaries and grammars without compiling gigantic finite-state automata
- NooJ dictionaries are represented by machines that are neutral vis-à-vis Parsing/Generation applications and can even be mixed (crucial for paraphrase generation).
- NooJ represents all text analyses in a Text Annotation Structure which allows for efficient cascades and linguistically-natural grammars
- TAS may represent lexical, morphological, syntactic or semantic (unsolved) ambiguities; all NooJ parsers can manage ambiguous TAS
- in the TAS, annotations might be discontiguous, e.g. 1 single annotation for "turn off" in "They turned the light off."
- NooJ unifies Inflectional and Derivational grammars
- NooJ inflectional and derivational grammars for simple words are reused to describe multiword units (with or without agreement constraints)
- NooJ morphological engine uses programmable operators that can be adapted to or added for any language
- NooJ can process Agglutination morphology; this is crucial for Germanic and Semitic languages
- NooJ grammars are embedded sets of graphs, which allows teams to share resources with no risk of conflicts
- NooJ processes Context-Sensitive Grammars like LFG; this is crucial for Slavic languages
- NooJ processes Unrestricted Grammars like HPSG; this is crucial for languages with unordered syntax such as Hungarian
- All NooJ grammars can be described either by rules entered in a text editor, or graphically using a graphical editor
- NooJ contains a set of unique maintenance tools for its linguistic resources, e.g. contracts
- NooJ syntactic parser transparently processes agglutinated and contracted words as well as multiwords and discontiguous expressions
- NooJ parser can produce and display graphically derivation trees, e.g. Main(NP(ProperName "John")) V(SingleVerb("ate"), NP(Det(two) Noun(apples)))
- NooJ parser can produce and display graphically constituent trees, e.g. Sentence (NP(John), V(ate), NP (two apples))
- NooJ parser can produce and display graphically dependency trees, e.g. eat (John, apple (two))
- NooJ includes several disambiguation tools: local grammars, negative grammars, global grammars and contains a powerful annotation management system
- NooJ's Transformational engine can produce paraphrases, perform automatic generation, Machine Translation, etc.
- NooJ contains various statistical tools used in the Digital Humanities (vocabulary frequency, normal distribution, standard score, tf-idf, etc.)
- etc.
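
Two of the character-handling points in the list above (UTF-8 storage, and the equivalence between ligatures or combined characters and their plain spellings) can be illustrated with a short sketch. This is not NooJ code: the `LIGATURES` table and the `fold` helper are hypothetical names introduced here, and only Python's standard `unicodedata` module is used.

```python
import unicodedata

# Hypothetical equivalence table: "œ" and "æ" have no Unicode decomposition,
# so they must be mapped explicitly.
LIGATURES = {"\u0153": "oe", "\u00e6": "ae"}


def fold(text):
    """Compose combining sequences (NFC), then expand the listed ligatures."""
    text = unicodedata.normalize("NFC", text)
    return "".join(LIGATURES.get(ch, ch) for ch in text)


# Ligature versus plain spelling: "cœur" and "coeur".
print(fold("c\u0153ur") == fold("coeur"))  # True

# Combined character ("e" + combining acute accent) versus precomposed "é".
print(fold("e\u0301t\u00e9") == fold("\u00e9t\u00e9"))  # True

# Storage size of the same ASCII text in UTF-8 and in UTF-16 (no BOM).
sample = "as a matter of fact"
print(len(sample.encode("utf-8")), len(sample.encode("utf-16-le")))  # 19 38
```

The size comparison holds for ASCII text such as the sample; for other scripts the ratio between the two encodings differs.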