
SuperThes

In late 2000, a Memorandum of Understanding (MoU) was concluded between EKOLab, Umweltbundesamt Berlin, Umweltbundesamt Vienna and the Technisches Bureau Hermann Stallbaumer (TBHS). The MoU was recently renewed for the period 2004-2006.
The MoU establishes a co-operation to develop a new software tool for building and maintaining multilingual, poly-hierarchical thesauri, based on experience with the existing thesaurus software tools used within the UDK (Umweltdatenkatalog) co-operation.

General information about SuperThes

SuperThes is based firstly on the experience gained working with the software package THESmain/THESshow and secondly on the specifications of the partners. Major design goals for the new software are:

  • Compatibility with data maintained in the existing software

  • Enhanced user interface, in order to ease system operability

  • Integration of "state of the art" operating system support

  • Support to create user-defined Thesaurus structures

  • Support to create user-defined export formats

  • Support for interfacing with Microsoft standard software

  • Provision of maximum flexibility for creating reports.

Description of the Thesaurus Maintenance Software

SuperThes is used for visualisation and maintenance of thesaurus data according to DIN 1462 / ISO 2788 and DIN / ISO 5964. The application has been developed in Delphi, a Pascal dialect by Borland/Inprise, which offers true object-oriented programming support as well as the stability common to Pascal programs. SuperThes will run by default on all modern 32-bit Microsoft operating systems as a client/server application. All persistent data will be kept in the relational database system Interbase. The user interface of SuperThes is in English. Language codes follow ISO 639-1.
Thesauri created with this application may therefore comprise more than 100 languages. All languages, character sets and glyphs installed on the system may be used, so languages such as Greek, Russian, Chinese or Hebrew pose no problem. Input method editors, which are common for Asian languages, are supported, and different fonts are available.

Figure 1: Versatile configurable data windows in Unicode

Thesaurus Specification

The application provides the user interface and functionality to create and maintain monolingual and multilingual thesauri. The structural functionality, as well as the terminology used, conforms to ISO standards 2788 and 5964. The user-definable set of hierarchical rules is a superset of these two standards. Other rules, such as the minimum requirements for field entry, are also user definable. Adherence to these rules is enforced automatically by the application itself. The most important rules are the enforcement of reciprocity, the check for arithmetic loops, the check for duplicate entries and the enforcement of the uniqueness of concepts.
Besides the main thesaurus structure, a SuperThes thesaurus may contain other user-definable tables, which may be used to describe the main thesaurus in more detail or to hold attached data. Examples are the themes table in EARTh or a table containing geographic names.
Thesaurus data are kept in standard database files, so thesauri can be exchanged, created and deleted on a simple file basis. SuperThes allows an unlimited number of thesauri to be kept on the same computer system, including different thesaurus systems with different data definitions. More than one instance of the program may be used simultaneously on the same computer, showing different data contents, e.g. EARTh together with the UDK-Thesaurus. All thesaurus data may contain "translations" (linguistic equivalents) for up to 30 languages using all character sets and glyphs available on the installation system. Input method editors are supported.
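To make two of these rules more concrete, the following Python sketch shows one possible way to enforce reciprocity and to reject loops in a poly-hierarchical structure. It is only an illustration: SuperThes itself is written in Delphi, the class and method names are hypothetical, and the sketch assumes that the loop check refers to circular broader/narrower relations.

```python
from collections import defaultdict


class Thesaurus:
    """Hypothetical in-memory thesaurus holding broader/narrower relations."""

    def __init__(self):
        self.broader = defaultdict(set)   # term -> its broader terms
        self.narrower = defaultdict(set)  # term -> its narrower terms

    def _reaches(self, start, target):
        """Return True if `target` is found by following broader links from `start`."""
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == target:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(self.broader.get(node, ()))
        return False

    def add_broader(self, term, broader_term):
        """Add 'term BT broader_term' and the reciprocal 'broader_term NT term'."""
        # Loop check (assumed meaning): the new link must not close a circle.
        if term == broader_term or self._reaches(broader_term, term):
            raise ValueError(f"{term!r} -> {broader_term!r} would create a loop")
        # Sets make a repeated entry harmless and allow several broader terms.
        self.broader[term].add(broader_term)
        # Reciprocity: the inverse relation is always stored together with it.
        self.narrower[broader_term].add(term)
```

Because both directions are written in the same call, the two relation tables cannot drift apart, which is the point of enforcing reciprocity in the application rather than leaving it to the editor.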

Figure 2: Multilingual Support, ISO 639-1 compliant

Data Types

In SuperThes, a thesaurus may be defined according to the required data model. Only three fields are required for any model: the first is the neutral record identifier, the second is a field for the status notation controlling the relations, and the third is the history field, where information on every modification is stored. All further fields are definable. Besides the standard data types, advanced data types such as images or formatted text (text forms) are available. SuperThes will contain predefined templates for the UDK-Thesaurus and the EARTh Thesaurus.
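The idea of three fixed fields plus freely definable ones can be pictured with a small Python sketch. All names here (`Record`, `record_id`, `set_field`, the shape of a history entry) are assumptions made for illustration and do not describe the actual SuperThes schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Record:
    """Hypothetical thesaurus record: three required fields plus user-defined ones."""

    record_id: int                                # neutral record identifier
    status: str                                   # status notation controlling the relations
    history: list = field(default_factory=list)   # one entry per modification
    fields: dict = field(default_factory=dict)    # user-defined fields

    def set_field(self, name, value, user):
        """Store a user-defined value and log the change in the history field."""
        self.fields[name] = value
        timestamp = datetime.now(timezone.utc).isoformat()
        self.history.append((timestamp, user, name))
```

Routing every change through one method is a simple way to guarantee that the history field is never bypassed.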

Figure 3: Build your own data structure. User definable tables and data fields 

Figure 4: Various editors for text, forms, images etc.

User Interface

The program is designed to create and edit thesauri using a graphical user interface according to rules defined for Microsoft Windows. The program functions are hierarchically ordered. Each window provides context sensitive help. State of the art features, like drag and drop and right click context menus, will be used where useful. All functions are available via mouse control, as well as via keyboard control. Operation of similar functions will be similar in all windows. The following key functions are implemented:

  • Management of thesauri, including creation, deletion and copying

  • Definition of thesauri according to the required data model, including auxiliary tables

  • Definition of languages

  • Visualisation of all data contents of a Thesaurus, either in tabular or in hierarchical view

  • Editing all Thesaurus data

  • Generation of reports

  • Data exchange with external applications. 

Figure 5: Server management utilities included

Data Exchange

Data exchange may be performed in several ways:

  • SGML Export and Import
    For use in the European Environment Agency, a data exchange format utilising SGML has been designed. This format is already used in a number of related applications. SuperThes will also be able to write and read this data exchange format. The benefits of this method are the generally available document type definition and the ability to read and write every detail of a thesaurus. The drawback is the amount of work that must be put into a custom reader or writer.

  • Microsoft compatible Export and Import
    Import and Export of "flat" (non-hierarchical) data directly to and from Microsoft applications like MS-Excel or MS-Access. This method was already used with great success in porting GEMET into other languages. It may also be used to import any flat file (list of terms or similar) into the system.

  • Interface to attached applications
    For applications designed to be used in conjunction with SuperThes, a specialized interface will be available. This feature may be used for THESshow, the standalone visualizer, as well as for the Web visualizer, "THESweb", which is in the design phase.
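As an illustration of the "flat" exchange described above, the following Python sketch writes and reads a simple list of terms as CSV, a format that spreadsheet and desktop database tools can open. The column names are hypothetical, and the sketch does not reproduce the actual SuperThes export format.

```python
import csv
import io

# Hypothetical column layout for a flat, non-hierarchical term list.
FIELDS = ["term_id", "language", "term"]


def export_flat(rows):
    """Serialise a list of dicts to CSV text with a header row."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def import_flat(text):
    """Parse CSV text back into a list of dicts (all values are strings)."""
    return list(csv.DictReader(io.StringIO(text, newline="")))
```

A flat list of this kind carries no hierarchy, which is why the text above distinguishes it from the SGML format that can represent every detail of a thesaurus.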



SuperThesVIZ

SuperThesVIZ is a web-based tool which allows access to SuperThes databases via the Internet. The main goal of the development is to offer the same convenience as the user interface of the MS Windows-based application. SuperThesVIZ is platform independent and based on Java servlet technology.

Requirements for the SuperThes visualisation module

Requirements that differ from those of a standard web application:

  1. Not the developer, but the user determines the thesaurus structure

  2. The visualisation module must therefore:

  • adapt to different database structures

  • be easy to configure

Design objectives:

  • System independence

  • Compliance with standards

  • Use of proven technologies

  • Easy to configure

  • Combination of several technologies

  • Must be suitable for use by non programmers

Applied technologies

  • XHTML 1.0

  • Java Server Pages

  • Java Servlets

  • Firebird 1.5 (database)

  • Runs on standard servlet containers, such as Tomcat 5.5

Features

  • Simple way of deploying SuperThes databases on the Web.

  • Runs on Windows computers as well as on Unix.

  • Mixed environments possible (e.g. web server Apache/Tomcat on Unix, database on Windows 2003 Server)

  • Supports user authentication via Database Server

  • Supports HTTPS for secure operation

  • Feature set is determined automatically from attached database

  • Thesaurus display is similar to THESshow, so users familiar with it will already know the user interface

Configuration of thesaurus presentation

  • Setup is done by filling out a configuration file

  • Database structure, available tables, fields, languages are extracted by the software automatically

  • User-related information (contact, terms of use, presentation of the user organisation) is kept in simple HTML files.

  • HTML files are included in the JSPs at runtime.

  • Layout of web pages is controlled from a central CSS file, so adapting the appearance is an easy task.
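How SuperThesVIZ reads the database structure is not described here. As a rough picture of what automatic extraction of tables and fields can look like, the sketch below introspects a database from Python. It uses SQLite purely as a stand-in because it ships with Python; Firebird exposes its metadata through different system tables, and the function name is hypothetical.

```python
import sqlite3


def describe_database(conn):
    """Return {table name: [column names]} for every table in the database."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    structure = {}
    for table in tables:
        # PRAGMA table_info yields one row per column; index 1 is the column name.
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        structure[table] = [column[1] for column in columns]
    return structure
```

A presentation layer built on such a description can adapt to whatever tables and fields the thesaurus owner has defined, instead of relying on a structure fixed by the developer.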

Technical Requirements

Clientside:

XHTML 1.0 compliant browser (IE, Netscape, Firefox, Opera)

Web-Server:

J2EE container (compliant with the Servlet 2.4 and JSP 2.0 specs)
JDBC driver (Firebird XA-compliant JDBC driver, version 1.5)
optional access to an SMTP mail service (for the form mailer)

Database-Server:

Firebird Database engine (Version 1.5)

 

 



Concord

Concord has been conceived and developed in the framework of Èulogos’ language engineering environment.

From a terminological and linguistic point of view, Concord is language-independent and can be applied to any language written in Latin characters. Special characters are fully supported. Two interface languages are available: Italian and English. An easy-to-use interface has been developed from the terminologist’s point of view. The linguistic engine of Concord manages atoms, stop words and complex terms according to a self-learning logic, which allows the system to apply each learned element and structure to new terms. For each term, Concord proposes a pre-assigned representation of its elements. The user reviews the proposal and can modify any of its elements. All terms and elements are browseable as concordances. The concordance method of Concord has been derived from the IntraText Digital Library.

The working flow

A Concord project can be divided into three processing steps:

  1. Analysis of terms

  2. Analysis of atoms

  3. Generation of the index.

The first two steps require direct intervention by the user; the last one is performed by the software. The Concord interface is divided into tabs corresponding to these steps. In the first phase (fig. 1) the user reviews each term, splitting it into its parts, each of which is one of the following:

  1. atom

  2. stop word

  3. impossible word

 

Figure 1 - The first tab – term analysis

A preliminary subdivision is proposed by the software, based on the results of earlier analyses. For this subdivision, the user can choose:

  • whether to consider or ignore the part of a term enclosed in brackets

  • whether matching is case sensitive or case insensitive.

A full-text search tool makes it possible to locate any term in the term list by searching for the whole term or a part of it.
When an atom (e.g. “acid”) that is part of another one (e.g. “acidic particle”) has already been validated, the system proposes a subdivision including a “+”. The result looks like “acid+ic|particle”. If it is decided that “acidic” is an atom by itself, it is sufficient to delete the “+” manually. The same applies to any other automatic subdivision proposed by the software.
Once an atom is validated, it will be used to identify other similar atoms in other terms.
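The following Python sketch gives a rough idea of how such a proposal could be produced from the atoms validated so far. It is not Concord’s actual algorithm: the function name is hypothetical, and the sketch assumes that words are separated by “|” and that a validated atom is only looked for at the beginning of a word.

```python
def propose_subdivision(term, validated_atoms, case_sensitive=False):
    """Propose a split of `term` using '|' between words and '+' after a known atom."""

    def normalise(text):
        return text if case_sensitive else text.lower()

    known = {normalise(atom) for atom in validated_atoms if atom}
    proposal = []
    for word in term.split():
        key = normalise(word)
        if key in known:
            # The whole word is already a validated atom: keep it unchanged.
            proposal.append(word)
            continue
        # Otherwise look for the longest validated atom the word starts with.
        match = max((atom for atom in known if key.startswith(atom)),
                    key=len, default=None)
        if match is None:
            proposal.append(word)
        else:
            proposal.append(word[:len(match)] + "+" + word[len(match):])
    return "|".join(proposal)
```

With “acid” as the only validated atom, `propose_subdivision("acidic particle", {"acid"})` returns `"acid+ic|particle"`. Once “acidic” has been validated as an atom of its own, the same call leaves the word unsplit, which mirrors the manual removal of the “+” described above.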

The second phase: analysis of atoms

The second phase allows the user to review the terms starting from the atoms in an atom-to-term interface. Reviewing atoms includes the definition of sub-atoms, i.e. prefixes and suffixes.

 

Figure 2 - The second tab – atom analysis

In the left window, the complete list of the atoms validated during the first phase is presented. The right window shows all the elements that will be used for the validation of sub-atoms. The lower part of the window contains the bar for navigating and editing the table of atoms. This navigation bar has 10 buttons, which make it possible to navigate through the table, insert or delete records, edit, confirm or undo an operation, and update the table.
Once an atom is selected, it is displayed in the window followed by a partial index made up of the concordances of the selected atom and the terms containing it.

Figure 3 - The partial index

It is now possible, using the Concord symbolism, to make changes to the atom structure and divide it into sub-atoms. A sub-atom could be a prefix or a suffix.
To identify a prefix, the left angle bracket “<” is used so that all the characters preceding the symbol are interpreted as belonging to a prefix sub-atom. Following this methodology the term “biome” becomes “bio<me”, with “bio” as the prefix. To identify a suffix, the right angle bracket “>” is used so that all the characters following the symbol are interpreted as belonging to a suffix sub-atom. Following this methodology the term “absorption” becomes “ab>sorption”: the system considers “sorption” as a suffix, and “ab” is not considered.
There are more complex cases, like in the term “chlorobenzene” where two sub-atoms appear of which one is the prefix (“chloro”) and the other is a suffix (“benzene”). In this case a combination of the two symbols is used (“chloro<>benzene”). In this context the terms “prefix” and “suffix” are used referring to a simple string logic.

 

Figure 4 - The result of the subdivision of “chlorobenzene” into sub-atoms

When a term contains not only a prefix and a suffix but also a final part that has to be ignored, as in “bioacidic”, the symbol “.” is used (“bio<>acid.ic”). The symbols “<” and “>” can also be chained, as in “agrifoodstuff”: the resulting string is “agri<>food<stuff”, where “agri” is a prefix, “food” is a suffix of “agri” and a prefix of “stuff”, while “stuff” is not considered.
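The symbolism can be summarised in a short Python sketch that reads a marked-up atom and reports the role of each part. This is an interpretation of the rules given in the text, not Concord’s own code, and the function name and role labels are hypothetical. A part is treated as a prefix when a “<” follows it, as a suffix when a “>” precedes it, as ignored when a “.” precedes it, and as not considered otherwise.

```python
import re

# "<>" is listed first so that it is recognised as one combined marker.
_MARKERS = re.compile(r"(<>|<|>|\.)")


def parse_marked_atom(marked):
    """Return a list of (part, roles) pairs for a string such as 'agri<>food<stuff'."""
    pieces = _MARKERS.split(marked)
    parts = pieces[0::2]                    # text between the markers
    markers = [""] + pieces[1::2] + [""]    # marker to the left/right of each part
    parsed = []
    for index, part in enumerate(parts):
        left, right = markers[index], markers[index + 1]
        if left == ".":
            roles = ("ignored",)
        else:
            roles = tuple(
                role
                for role, present in (("prefix", "<" in right),
                                      ("suffix", ">" in left))
                if present
            )
        parsed.append((part, roles))
    return parsed
```

An empty tuple of roles corresponds to a part that is not considered, such as “stuff” in “agri<>food<stuff” or “ab” in “ab>sorption”.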

The third phase: the index

Once phase 2 is completed, the next step is to generate the final index and export the results in text format. It is sufficient to click on “File, export” and choose the name of the file to be generated.

Results, their use and future development

Concord can easily be applied to thesauri, dictionaries and any other lexical or terminological content, since it works with standard databases and allows parametric configuration. Concord is also expected to be applied to the microthesauri on which EKOLab is currently working, such as the GIS and Remote Sensing micro-thesaurus and the SnowTerm project. Another point under development is the integration, at the level of tables, between Concord and other software such as SuperThes. Once implemented, this will make it possible to work on the same database, storing data and creating internal links.
