Multilingual Cataloguing of Product Information of Specific Domains:
Case Mkbeem System
Aarno Lehtola, Jarno Tenni and Tuula Käpylä
VTT Information Technology
POB 1201, FIN-02044 VTT
Finland
Aarno.Lehtola@vtt.fi
Abstract
This paper describes the Cataloguing Tool of the Mkbeem multilingual eCommerce mediation system. The Cataloguing Tool is used by product suppliers to provide the necessary product information in the form required by the mediation system. The relevant information includes product articles that are maintained in a pivot language and automatically translated into the other supported languages. From the supplier's viewpoint, the Cataloguing Tool implements the "write-once-publish-many" paradigm. Other functionalities of the tool include automatic extraction of product properties from text articles according to an ontological product model, and automatic classification of products based on their properties and ontology models. The paper describes the Cataloguing Tool and discusses in more detail the use of ontologies in the automatic interpretation of the semantics of product articles and user queries.
1 Introduction
In the year 2000, native English speakers became outnumbered by native users of other languages in the Internet population. The trend has continued since then, and by September 2002 the others already comprised over 63 % of the population [GlobalReach 2002]. The linguistic diversity of this population is huge. Within Europe alone one can easily count over 60 languages, among which even the smaller ones, like Icelandic with its over 100000 speakers, comprise potential customer groups for international eShops. There is a remarkable need for cost-effective IT solutions that enable multilinguality in consumer Internet trading. This is the market in which the Mkbeem mediation system has been positioned.
The Mkbeem mediation system (Multilingual Knowledge-Based European Electronic Marketplace, outcome of the EC project IST-1999-10589) adapts the language and the trading conditions of an Internet sales point to its international customers [Leger & al. 2001, Mkbeem 2000-2002]. The system supports three use cases (shown by the big arrows in Figure 1), each with its own tool. The multilingual Cataloguing Tool lets content and service providers describe their products by means of text articles in a write-once-publish-many manner, so that the information is maintained monolingually. A second tool lets the providers define the contract conditions of their goods within the context of a consumer sales related legislation model, which enables adaptation of the contracts depending on how the transactions cross national borders. Finally, for consumers there is a system that implements cross-lingual IR from the product databases by combining NL queries, graphical navigation in localised product hierarchies and search forms; the user chooses the modality. The language and contract adaptation is based on ontological models covering product models, legislation and related generic knowledge like time, materials and colours. The ontologies serve to narrow the scope of NL processing and to mediate between different languages [Leger & al. 2000, Gomez-Perez & al. 2001].
Figure 1: The operating context of the Mkbeem mediation system.
Next, this paper describes the multilingual Cataloguing Tool. Its central functionality, meaning extraction from product articles and NL queries, is presented in more detail. The paper ends with a short summary of experiences from the use of the Cataloguing Tool.
2 Multilingual Cataloguing Tool
The Multilingual Cataloguing Tool is used to publish monolingual product information in multiple languages in a write-once-publish-many manner. Cataloguing of a new product goes through multiple steps, taking into account linguistic, product model and culture-specific issues. These steps include:
1. Text checking
2. Property extraction
3. Categorisation
4. Machine translation
5. NL query processing
Text checking functionality is used to verify that the description of the product conforms to the language model, in order to guarantee the quality of the automatic processing. This means that the terms used must be found in the lexicon and the sentence syntax must comply with the language model. In the case of an unknown or erroneous word or sentence, the checking tool suggests possible corrections. If needed, it also allows a qualified user to edit the language model (terminology, language rules and ontology definitions).
Property extraction functionality finds product properties from the textual product descriptions according to the provided ontological product model, which describes the parts of product types and their properties in terms of concepts and their attributes. The extracted properties present information about a product in a language-independent way, and they are stored in a relational database for further use in inference, product classification and end-user information request processing. Inference means that new information is concluded based on ontologies. E.g., for a given piece of clothing, qualitative facts are inferred based on a material ontology and the known material composition of the piece. The origin of the materials (e.g. whether they are natural) can also be inferred. Based on a colour ontology, colour similarities and harmonies can be inferred.
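The kind of inference described above can be illustrated with a minimal Python sketch. The toy material ontology, the property names and the helper `infer_material_facts` are hypothetical stand-ins introduced here for illustration only; they are not taken from the Mkbeem ontologies or its inference engine.

```python
# Hypothetical toy material ontology: material -> origin class.
# (Illustrative only; not the Mkbeem material ontology.)
MATERIAL_ORIGIN = {
    "cotton": "natural",
    "wool": "natural",
    "polyester": "synthetic",
    "elastane": "synthetic",
}


def infer_material_facts(composition):
    """Derive qualitative facts from a known material composition.

    `composition` maps material names to percentages, e.g. {"cotton": 100}.
    Materials missing from the ontology raise a KeyError.
    """
    used = {m: pct for m, pct in composition.items() if pct > 0}
    origins = {MATERIAL_ORIGIN[m] for m in used}
    return {
        "main_material": max(used, key=used.get),
        "all_natural": origins == {"natural"},
        "blend": len(used) > 1,
    }


facts = infer_material_facts({"cotton": 95, "elastane": 5})
```

In this sketch the derived facts are language-independent key/value pairs, which is the property that lets them be stored once and reused for classification and query matching.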
Categorisation functionality provides culture- and market-specific categorisation of products into the product hierarchies of an eCommerce site. For instance, for a wind-proof jacket the system suggests that the product should be found both in the outdoor clothing and in the jacket categories of the eCommerce site. The categorisation is based on the extracted properties and simple rules. Culture- and market-specific issues may arise: for instance, the concept of winter clothing differs between Finland and Greece.
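A rule-based categorisation of this kind might look as follows. This is only a sketch under stated assumptions: the rule format, the property names (`type`, `wind_proof`, `insulation`) and the insulation thresholds per market are invented for illustration and do not come from the Mkbeem cataloguing rules.

```python
# Hypothetical market-specific rules: (category, predicate over properties).
# The insulation thresholds are invented to show how "winter clothing"
# could be defined differently for two markets.
RULES = {
    "FI": [
        ("jackets", lambda p: p.get("type") == "jacket"),
        ("outdoor clothing", lambda p: p.get("wind_proof") is True),
        ("winter clothing", lambda p: p.get("insulation", 0) >= 3),
    ],
    "GR": [
        ("jackets", lambda p: p.get("type") == "jacket"),
        ("outdoor clothing", lambda p: p.get("wind_proof") is True),
        ("winter clothing", lambda p: p.get("insulation", 0) >= 1),
    ],
}


def categorise(properties, market):
    """Return every category whose rule accepts the extracted properties."""
    return [cat for cat, applies in RULES[market] if applies(properties)]


jacket = {"type": "jacket", "wind_proof": True, "insulation": 1}
finnish_categories = categorise(jacket, "FI")
greek_categories = categorise(jacket, "GR")
```

With these assumed rules the same lightly insulated wind-proof jacket lands in the jacket and outdoor categories in both markets, but counts as winter clothing only in the Greek hierarchy.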
Machine translation in the Cataloguing Tool relies on the Webtran machine translation system. A checked and accepted product description text is passed to Webtran to be translated into multiple target languages.
Natural language query processing is used for testing that newly added products can be found easily and from the correct places in the catalogue. The queries are analysed and the extracted properties are matched against the saved properties of the products in the database.
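The matching step can be sketched in a few lines of Python. The flat property dictionaries, the product identifiers and the functions `matches` and `search` are assumptions made for this illustration; the actual system stores the properties in a relational database and matches ontological formulas.

```python
def matches(query_props, product_props):
    """A product matches when it carries every property asked for."""
    return all(product_props.get(k) == v for k, v in query_props.items())


def search(query_props, catalogue):
    """Return the ids of all catalogue products matching the query."""
    return [pid for pid, props in catalogue.items()
            if matches(query_props, props)]


# Hypothetical saved properties of three products.
catalogue = {
    "skirt-1": {"type": "skirt", "colour": "black", "slit": True, "pockets": True},
    "skirt-2": {"type": "skirt", "colour": "red", "slit": True, "pockets": True},
    "jacket-1": {"type": "jacket", "colour": "black", "pockets": True},
}

# Properties as they might be extracted from a query for a black skirt
# with a slit and pockets.
hits = search({"type": "skirt", "colour": "black", "slit": True, "pockets": True},
              catalogue)
```

Because the query and the product articles are reduced to the same language-independent properties, the query language does not have to match the language in which the product was catalogued.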
After the processing steps, the product information is stored in the product database. This includes the translated product articles, the extracted properties, the results of inferences based on the properties, and the market-specific categories to which the product belongs.
The Cataloguing Tool supports access profiles for four classes of users. The proof-readers have rights to correct existing product information, the cataloguers can add new products to and remove old ones from the catalogue, the language modellers can add new terms to the lexica, and the content provider manager has all the rights, including editing the ontologies and the language models. The user interface is adapted to the profile of the current user so that only relevant and accessible parts are shown on the screen.
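The four profiles can be read as a simple mapping from profile to rights. The sketch below is an assumption-laden illustration: the `Right` names and the helper functions are hypothetical, and the sketch deliberately grants each profile only the rights named in the text, since the paper does not say whether rights accumulate across profiles.

```python
from enum import Enum, auto


class Right(Enum):
    CORRECT_PRODUCT_INFO = auto()
    ADD_REMOVE_PRODUCTS = auto()
    ADD_LEXICON_TERMS = auto()
    EDIT_ONTOLOGIES = auto()
    EDIT_LANGUAGE_MODELS = auto()


# Only the rights explicitly named for each profile are listed here.
PROFILES = {
    "proof-reader": {Right.CORRECT_PRODUCT_INFO},
    "cataloguer": {Right.ADD_REMOVE_PRODUCTS},
    "language modeller": {Right.ADD_LEXICON_TERMS},
    "content provider manager": set(Right),  # all rights
}


def can(profile, right):
    """Check whether a profile holds a given right."""
    return right in PROFILES[profile]


def visible_actions(profile):
    """Rights the user interface would expose for this profile."""
    return sorted(PROFILES[profile], key=lambda r: r.value)
```

A user interface adapted to the profile would then render only the entries returned by `visible_actions` for the current user.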
From the language processing point of view, cataloguing involves two central processes: automatic sublanguage translation, and meaning extraction from product articles and NL information requests. The text checking and the machine translation are directly based on the Webtran MT software of VTT [Lehtola & al. 1998, Lehtola & al. 1999a & 1999b, Tenni 1999]. Section 3 concentrates on describing the meaning extraction process, which associates the input texts with ontological domain models. For each input text it finds a language-independent semantic representation in terms of the description logic language CARIN [Levy & Rousset 1998], based on the provided domain ontologies and the associated language models. The analysed text inputs include product description articles and NL queries. Meaning extraction is central when product properties are recognised from textual product articles, when products are classified, and when NL queries are processed. The Webtran system has also been modified to assist in the meaning extraction.
The Cataloguing Tool is implemented using Enterprise Java Beans. The overall system consists of two parts: the core server and the end-user interface. The core server provides natural language processing services and an inference engine for using the ontologies. The end-user interface is a Java Applet running in a WWW browser. The tool interacts with the core server using internet protocols. The architecture needs just one installation of the core server, after which the Cataloguing Tool can be used from multiple places with practically no extra installations, e.g., through the company intranet at different subsidiaries or through an extranet by subproviders. This also makes maintenance of the system relatively easy.
3 Meaning Extraction Process
The Webtran MT system has its own formalism for describing domain-specific sublanguages. This Augmented Lexical Entries (ALE) formalism provides multidirectional rules denoting equal, non-directed natural language excerpts on the desired linguistic abstraction levels. An entry can describe linguistic information in one, two or more languages. In an entry, each language is represented in its own section. Entries can also be understood as partial dependency parse trees. For a detailed technical description of the ALE formalism, see [Lehtola et al. 1999]. Below is an example of an ALE:
[cloth.material.composition
  [fi ^(A){clothProd} tag_percentage(X) (B){textileMaterial ptv}]
  [fr ^(A){clothProd} en tag_percentage(X) (B){textileMaterial}]
  [en ^(A){clothProd} of tag_percentage(X) (B){textileMaterial}]
  [se ^(A){clothProd} av tag_percentage(X) (B){textileMaterial}]
]
Figure 2: An excerpt from an ontology with ALE rules associated.
The previous ALE concerns a phrase consisting of a product name and the material that the product is made of, together with the material percentage. For example, 'housut 100% puuvillaa' (fi) translates into 'pantalon en 100% coton' (fr), 'trousers of 100% cotton' (en) and 'byxa av 100% bomull' (se). The ALE also marks the product name as the head of the implied partial dependency tree.
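The multidirectional nature of such an entry can be mimicked with a small Python sketch. It is not the ALE formalism or the Webtran engine: the regular-expression matching, the mini-lexicon and the function names are assumptions made for illustration, and Finnish morphology is sidestepped by storing the material directly in its partitive form.

```python
import re

# Connector word of each language section in the rule (none in Finnish).
CONNECTOR = {"fi": "", "fr": "en", "en": "of", "se": "av"}

# Hypothetical mini-lexicon: concept -> surface form per language.
# The Finnish material is stored in its partitive form ("ptv" in the ALE).
PRODUCTS = {"trousers": {"fi": "housut", "fr": "pantalon",
                         "en": "trousers", "se": "byxa"}}
MATERIALS = {"cotton": {"fi": "puuvillaa", "fr": "coton",
                        "en": "cotton", "se": "bomull"}}


def _concept(lexicon, word, lang):
    """Look up the concept whose surface form in `lang` is `word`."""
    for concept, forms in lexicon.items():
        if forms[lang] == word:
            return concept
    raise KeyError(f"{word!r} not in lexicon for {lang!r}")


def analyse(phrase, lang):
    """Match '<product> [connector] <N>% <material>' in the given language."""
    connector = CONNECTOR[lang]
    middle = rf"\s+{re.escape(connector)}" if connector else ""
    match = re.fullmatch(rf"(\w+){middle}\s+(\d+)%\s+(\w+)", phrase)
    if match is None:
        raise ValueError(f"phrase does not match the rule: {phrase!r}")
    product, percentage, material = match.groups()
    return (_concept(PRODUCTS, product, lang), int(percentage),
            _concept(MATERIALS, material, lang))


def generate(product, percentage, material, lang):
    """Produce the surface phrase of the rule in the target language."""
    parts = [PRODUCTS[product][lang], CONNECTOR[lang],
             f"{percentage}%", MATERIALS[material][lang]]
    return " ".join(p for p in parts if p)


def translate(phrase, source, target):
    """Analyse in the source language and generate in the target language."""
    return generate(*analyse(phrase, source), target)
```

Because analysis and generation share the same per-language sections, the same data serves every translation direction, e.g. `translate("housut 100% puuvillaa", "fi", "fr")` and `translate("trousers of 100% cotton", "en", "fi")`.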
ALEs are also used to describe the linguistic constituents for concept matching in meaning extraction. Concept matching ALEs can be included in the concept property and relation descriptions of domain ontologies, e.g., the product models of Mkbeem. The idea is schematically illustrated in Figure 2. When the corresponding language construct is recognised, the system automatically associates it with the ontology concept. The given ALEs control how concepts and relations can be recognised from constructs of human language, as well as how a human language paraphrase can be generated from an ontological expression. This reverse function is not currently used.
The meaning extraction process includes five phases:
1. Lexical analysis
2. Dependence analysis
3. Concept matching and verification
4. Refining semantics in particular themes
5. Syntactic translation into CARIN
Figure 3 illustrates the process and names the intermediate results. Phases 1-3 make up the syntactico-semantic analysis.
The lexical analysis includes tokenising the input and attaching morpho-lexical information to each token. After this phase we know, for each token (word, number, abbreviation etc.), what can be said about it based on the information available in the lexicon, without considering any context information.
The dependence analysis involves finding the syntactic relationships between the constituents of the sentence. It produces a set of syntax trees.
The concept matching and verification involves finding the conceptual bindings to the domain ontology that are embedded in the user input. After this phase we have a set of syntax trees in which the relationships of the subparts to the domain ontology concepts are explicitly marked. In fact, we have a dependence syntax tree that is extended with the relevant parts of the domain ontology through these relationships. We call this representation a semantic graph.
After the set of semantic graphs has been derived, ontological inference of the CARIN formulas follows in phases 4 and 5.
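Phases 1-3 can be sketched as a tiny Python pipeline. Everything in it is a simplifying assumption made for illustration: the toy lexicon, the concept identifiers, the naive "adjective attaches to the next noun" dependency rule and the triple-based graph are not the Webtran data structures.

```python
# Hypothetical lexicon: surface form -> (part of speech, ontology concept).
LEXICON = {
    "black": ("adj", "colour:black"),
    "red": ("adj", "colour:red"),
    "skirt": ("noun", "product:skirt"),
    "jacket": ("noun", "product:jacket"),
}


def lexical_analysis(text):
    """Phase 1: tokenise and attach context-free lexicon information."""
    tokens = []
    for word in text.lower().split():
        pos, concept = LEXICON.get(word, ("unknown", None))
        tokens.append({"form": word, "pos": pos, "concept": concept})
    return tokens


def dependence_analysis(tokens):
    """Phase 2 (greatly simplified): attach each adjective to the next noun.

    Returns (head index, dependent index) pairs.
    """
    edges, pending = [], []
    for i, tok in enumerate(tokens):
        if tok["pos"] == "adj":
            pending.append(i)
        elif tok["pos"] == "noun":
            edges.extend((i, dep) for dep in pending)
            pending = []
    return edges


def concept_matching(tokens, edges):
    """Phase 3: bind the dependency edges to ontology concepts."""
    return [(tokens[head]["concept"], "has_property", tokens[dep]["concept"])
            for head, dep in edges]


tokens = lexical_analysis("black skirt")
semantic_graph = concept_matching(tokens, dependence_analysis(tokens))
```

The resulting triples play the role of the semantic graph in this sketch: syntactic structure whose nodes are tied to ontology concepts, ready for theme-specific refinement and translation into a formal representation.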
The refining of semantics in particular themes is based on additional generic ontologies, like ontologies of colours, materials, distances etc. The refining takes a semantic graph and deduces and makes explicit in it additional knowledge concerning the particular themes. The deduction process with colours, materials and expressions of time is described in more detail in [Lehtola & al. 2003].
The analysis results are translated into the CARIN language, in a format that is standard for all further processing in the Mkbeem system. The translation involves, e.g., removing the linguistic information that has been retained thus far in the intermediate data structures.
Figure 3: The meaning extraction process in the Cataloguing Tool.
Figure 4: An example of the data flow when extracting meaning of a NL query.
                                                   Finnish   French   English
Domain specific lexicon (morphological entries)     4500      1700     1500
Domain specific lexicon (number of entries)         2800      1300     1400
ALE rules: total number                                        965
ALE rules: of which bound to ontologies                        150
Cataloguing rules                                               96
Ontology concepts/attribute values                          307/1050
Table 1: The sizes of the Mkbeem specific linguistic knowledge bases.
Figure 4 contains an example of how the meaning extraction process proceeds with the NL query "musta hame, jossa halkio ja taskut" ("a black skirt with slit and pockets"). The corresponding lexical semantic graph is shown. The graph includes both the linguistic analysis results and the references from the constituents to the recognised concepts in a domain ontology, which describes properties of clothing products.
Table 1 summarises the sizes of the linguistic knowledge bases that were implemented for the cataloguing of descriptions of clothing products from Finnish to French and to English in the field trial tests of the Cataloguing Tool.
4 Test User Experiences
The testing of the Cataloguing Tool was carried out by the mail-order company Ellos Postimyynti Oy, whose sales articles are women's clothes. The first tests were carried out in the middle of the project in September 2001 in order to guide the development work of the following second phase. The test results presented here concern the second phase trials. The field tests were done during September 2002. The idea of the testing was first to test the concept of the Cataloguing Tool, i.e. the maintenance of the multilingual catalogue using a single tool. Secondly, the tests concerned the usability of the Cataloguing Tool in a real working environment. The test group consisted of 3 cataloguing professionals. The testers were interviewed twice. Before the tests they were asked about their background, experiences and expectations. After the tests they were given an interview that focused systematically on every function of the Cataloguing Tool. The testing period was one month, during which they used the system from their own machines.
The test results were very positive. The cataloguing process as a whole was seen as an easy and efficient way of producing and classifying product information. The tool itself also received good remarks: it was considered a useful tool for the production of multilingual product information, and each of the main features (see Section 2) was considered good. Besides the very important possibility of semi-automatic translation into target languages, the test users named functionalities like property extraction and inference with colours and materials as important in giving customers new possibilities to find complementary information about goods, deduced from additional sources. One important advantage of an integrated cataloguing environment is that it helps in producing consistent and uniform information, as the whole cataloguing process is based on joint language and product models that conform to the company knowledge of the domain. Moreover, the test users anticipated that the use of the Cataloguing Tool can make the working process faster and reduce the amount of manual, repeated routine procedures. The knowledge base maintenance tools were also considered to suit their task well.
The MT component Webtran of the Cataloguing Tool has been in production use at Ellos since the year 2000. The EUROMAP case study by CSC Inc. [Loimaranta 2000] reports that savings of over 30% in translation time were reached after a relatively short use of the MT tool.
5 Conclusions
This paper described the Cataloguing Tool of the Mkbeem multilingual eCommerce mediation system. The Cataloguing Tool is the software with which product suppliers author multilingual product information. The paper briefly described the main functionalities of the Cataloguing Tool (text checking, property extraction, product categorisation, machine translation, natural language query processing). The central linguistic function, called meaning extraction, was presented in more detail. Meaning extraction associates linguistic expressions with the domain ontology concepts and relations and provides a language-neutral semantic representation for input texts.
The ALE formalism for describing domain-specific sublanguages was briefly explained. It was also outlined how ALE rules can be associated with the concepts and relations of an ontology model in order to use them for analysing the meanings of NL inputs. The results are expressed as specific ontological formulas in the CARIN language. The ALE formalism includes the elements required to describe the rules for language checking, machine translation, and matching NL inputs to the concepts and relations of an ontology model.
Finally, the user-test settings and results were summarised. Experienced catalogue maintenance professionals carried out the tests, and the overall impression was positive. Both the concept of a "cataloguing tool" and the overall software were seen as very useful. The cataloguing process as a whole was seen as an easy and efficient way of producing and classifying product information. The tool itself was considered useful for the production of multilingual product information, and each of the main features was considered important. The results give us good reason to continue this work, to bring this technology into everyday use and to adapt it to new domains of goods and new languages. Of the companies involved in the Mkbeem project, Ellos as well as SNCF and France Telecom are planning to utilise the technology in their business operations.
Acknowledgements
The authors would like to thank all their colleagues in the Mkbeem consortium (France Telecom, SNCF, Fidal, SchlumbergerSema, UPM, NTUA, CNRS and VTT) for the excellent co-operation and the European Commission for supporting the reported work.
References
GlobalReach (2002): Statistics on the website of GlobalReach Inc, September 2002, URL http://www.glreach.com/
Gomez-Perez, Asun; Corcho Garcia, Oscar; Fernandez Lopez, M.; Lehtola, Aarno; Taveter, Kuldar; Sorva, Juha; Käpylä, Tuula; Toumani, Farouk; Soualmia, L.; Barboux, Cecile; Castro, E.; Sallantin, Jean; Arbant, Geraldine; Bonnaric, Annabelle (2001): Requirement, Choice of a Knowledge Representation and Tools. Public Report of MKBEEM project (EC IST-1999-10589), Version 2.0, available e.g. from www.mkbeem.com, 2001, 93 p.
Jaaranen, Kristiina; Lehtola, Aarno; Tenni, Jarno; Bounsaythip, Catherine (2000): Webtran tools for in-company language support. In: Language Technologies for Dynamic Business in the Age of the Media. Köln, 23 - 25 Nov. 2000. Vereinigung für Sprache und Wirtschaft. Köln (2000), pp. 145 - 155.
Lehtola, A., Bounsaythip, C., and Tenni, J. (1998): Controlled Language Technology in Multilingual User Interfaces. In: Proceedings of the 4th ERCIM Workshop on User Interfaces for All (UI4ALL'98), Stockholm, 1998, pp. 73-78.
Lehtola, A., Heinecke, J., Bounsaythip, C. (2003): Intelligent Human Language Query Processing in Mkbeem. In: Proceedings of HCI International/UAHCI 2003, Crete, June 22-27, in print.
Lehtola, A., Tenni, J., Bounsaythip, C., and Jaaranen, K. (1999a): Controlled Languages as the Basis for Multilingual Catalogues on the WWW. In: Jean-Yves Roger, Brian Stanford-Smith and Paul T. Kidd (Eds.): Business and Work in the Information Society: New Technologies and Applications. IOS Press, Amsterdam, pp. 207-213.
Lehtola, A., Tenni, J., Bounsaythip, C., and Jaaranen, K. (1999b): WEBTRAN: A Controlled Language Machine Translation System for Building Multilingual Services on Internet. In: Proceedings of Machine Translation Summit VII '99 (MT Summit 99), September 13-17, 1999, Singapore, pp. 487 - 495.
Levy, A.; Rousset, M.C. (1998): CARIN: A Representation Language Combining Horn rules and Description Logics. Artificial Intelligence Journal, vol 104, September 1998.
Leger, Alain; Michel, Geraldine; Barrett, Peter; Gitton, Sylvain; Gomez-Perez, Asuncion; Lehtola, Aarno; Mokkila, Kristiina; Rodrigez, Santiago; Sallantin, Jean; Varvarigou, Theodora; Vinesse, Jerome (2000): Ontology domain modeling support for multi-lingual services in E-Commerce: MKBEEM. 14th European Conference on Artificial Intelligence ECAI'00, Workshop on Applications of Ontologies and Problem-Solving Methods. Berlin, DE, 20 - 25 August 2000. Berlin (2000), 4 p. URL http://delicias.dia.fi.upm.es/WORKSHOP/ECAI00/19.pdf
Leger, Alain; Lehtola, Aarno; Villagra, Victor (2001): MKBEEM – Developing Multilingual Knowledge-Based Marketplace. ERCIM News, July 2001, pp. 50-52. Reprinted in Research News of VTT Information Technology, December 2001, pp. 1 - 3.
Loimaranta, Outi (2000): EUROMAP HLT Case Study: Webtran – a controlled language machine translation system for building multilingual services on Internet. December 2000, http://www.hltcentral.org/usr_docs/case_studies/euromap/FIN_webtrans.doc
Mkbeem (2000-2002): The Mkbeem project. URL http://www.mkbeem.com/