Intelligence Integration
in Distributed Knowledge
Management
Dariusz Król
Wroclaw University of Technology, Poland
Copyright © 2009 by IGI Global. All rights reserved. No part of this publication may be reproduced, stored or distributed in any form or by
any means, electronic or mechanical, including photocopying, without written permission from the publisher.
Product or company names used in this set are for identification purposes only. Inclusion of the names of the products or companies does
not indicate a claim of ownership by IGI Global of the trademark or registered trademark.
Intelligence integration in distributed knowledge management / Dariusz Krol and Ngoc Thanh Nguyen, editors.
p. cm.
Summary: "This book covers a broad range of intelligence integration approaches in distributed knowledge systems, from Web-based
systems through multi-agent and grid systems, ontology management to fuzzy approaches"--Provided by publisher.
1. Expert systems (Computer science) 2. Intelligent agents (Computer software) 3. Electronic data processing--Distributed processing. I.
Krol, Dariusz. II. Nguyên, Ngoc Thanh.
QA76.76.E95I53475 2009
006.3--dc22
2008016377
British Cataloguing in Publication Data
A Cataloguing in Publication record for this book is available from the British Library.
All work contributed to this book set is original material. The views expressed in this book are those of the authors, but not necessarily of
the publisher.
If a library purchased a print copy of this publication, please go to http://www.igi-global.com/agreement for information on activating
the library's complimentary electronic access to this publication.
Table of Contents
Section I
Advanced Methods for Integration
Chapter I
Logical Inference Based on Incomplete and/or Fuzzy Ontologies
Juliusz L. Kulikowski, Polish Academy of Sciences, Poland
Chapter II
Using Logic Programming and XML Technologies for Data Extraction from Web Pages
Amelia Bădică, University of Craiova, Romania
Costin Bădică, University of Craiova, Romania
Elvira Popescu, University of Craiova, Romania
Chapter III
A Formal Analysis of Virtual Enterprise Creation and Operation
Andreas Jacobsson, Blekinge Institute of Technology, Sweden
Paul Davidsson, Blekinge Institute of Technology, Sweden
Chapter IV
Application of Uncertain Variables to Knowledge-Based Resource Distribution
Donat Orski, Wroclaw University of Technology, Poland
Chapter V
A Methodology of Design for Virtual Environments
Clive Fencott, University of Teesside, UK
Chapter VI
An Ontological Representation of Competencies as Codified Knowledge
Salvador Sanchez-Alonso, University of Alcalá, Spain
Dirk Frosch-Wilke, University of Applied Sciences, Germany
Section II
Integration Aspects for Agent Systems
Chapter VII
Aspects of Openness in Multi-Agent Systems: Coordinating the Autonomy in Agent Societies
Marcos De Oliveira, University of Otago, New Zealand
Martin Purvis, University of Otago, New Zealand
Chapter VIII
How Can We Trust Agents in Multi-Agent Environments? Techniques and Challenges
Kostas Kolomvatsos, National and Kapodistrian University of Athens, Greece
Stathes Hadjiefthymiades, National and Kapodistrian University of Athens, Greece
Chapter IX
The Concept of Autonomy in Distributed Computation and Multi-Agent Systems
Mariusz Nowostawski, University of Otago, New Zealand
Chapter X
An Agent-Based Library Management System Using RFID Technology
Maryam Purvis, University of Otago, New Zealand
Toktam Ebadi, University of Otago, New Zealand
Bastin Tony Roy Savarimuthu, University of Otago, New Zealand
Chapter XI
Mechanisms to Restrict Exploitation and Improve Societal Performance in Multi-Agent Systems
Sharmila Savarimuthu, University of Otago, New Zealand
Martin Purvis, University of Otago, New Zealand
Maryam Purvis, University of Otago, New Zealand
Mariusz Nowostawski, University of Otago, New Zealand
Chapter XII
Norm Emergence in Multi-Agent Societies
Bastin Tony Roy Savarimuthu, University of Otago, New Zealand
Maryam Purvis, University of Otago, New Zealand
Stephen Cranefield, University of Otago, New Zealand
Chapter XIII
Multi-Agent Systems Engineering: An Overview and Case Study
Scott A. DeLoach, Kansas State University, USA
Madhukar Kamar, Software Engineer, USA
Section III
Fuzzy-Based and Other Methods for Integration
Chapter XIV
Modeling, Analysing, and Control of Agents Behaviour
František Čapkovič, Institute of Informatics, Slovak Academy of Sciences, Slovak Republic
Chapter XV
Using Fuzzy Segmentation for Colour Image Enhancement of Computed Tomography Perfusion Images
Martin Tabakov, Wrocław University of Technology, Poland
Chapter XVI
Fuzzy Mediation in Shared Control and Online Learning
Giovanni Vincenti, Research and Development at Gruppo Vincenti, Italy
Goran Trajkovski, Algoco eLearning Consulting, USA
Chapter XVII
Utilizing Past Web for Knowledge Discovery
Adam Jatowt, Kyoto University, Japan
Yukiko Kawai, Kyoto Sangyo University, Japan
Katsumi Tanaka, Kyoto University, Japan
Chapter XVIII
Example-Based Framework for Propagation of Tasks in Distributed Environments
Dariusz Król, Wrocław University of Technology, Poland
Chapter XIX
Survey on the Application of Economic and Market Theory for Grid Computing
Xia Xie, Huazhong University of Science and Technology, China
Jin Huang, Huazhong University of Science and Technology, China
Song Wu, Huazhong University of Science and Technology, China
Hai Jin, Huazhong University of Science and Technology, China
Melvin Koh, Asia Pacific Science & Technology Center, Sun Microsystems, Singapore
Jie Song, Asia Pacific Science & Technology Center, Sun Microsystems, Singapore
Simon See, Asia Pacific Science & Technology Center, Sun Microsystems, Singapore
Section I
Advanced Methods for Integration
Chapter I
Logical Inference Based on Incomplete and/or Fuzzy Ontologies
Juliusz L. Kulikowski, Polish Academy of Sciences, Poland
In this chapter, a concept of using incomplete or fuzzy ontologies in decision making is presented. A
definition of ontology and of ontological models is given, as well as their formal representation by
taxonomic trees, bi-partite graphs, multigraphs, relations, super-relations and hyper-relations. The
definitions of the corresponding mathematical notions are also given. Then, the concept of ontologies
representing incomplete or uncertain domain knowledge is presented. This concept is illustrated by an
example of decision making in medicine. The aim of this chapter is to give an outlook on the possibility of
extending ontological models so that they can serve as an effective and universal form of domain knowledge
representation in computer systems supporting decision making in various application areas.
Chapter II
Using Logic Programming and XML Technologies for Data Extraction from Web Pages
Amelia Bădică, University of Craiova, Romania
Costin Bădică, University of Craiova, Romania
Elvira Popescu, University of Craiova, Romania
The Web is designed as a major information provider for the human consumer. However, information
published on the Web is difficult to understand and reuse by a machine. In this chapter, we show how
well established intelligent techniques based on logic programming and inductive learning combined
with more recent XML technologies might help to improve the efficiency of the task of data extraction
from Web pages. Our work can be seen as a necessary step toward solving the more general problem of Web data
management and integration.
Chapter III
A Formal Analysis of Virtual Enterprise Creation and Operation
Andreas Jacobsson, Blekinge Institute of Technology, Sweden
Paul Davidsson, Blekinge Institute of Technology, Sweden
This chapter introduces a formal model of virtual enterprises, as well as an analysis of their creation and
operation. It is argued that virtual enterprises offer a promising approach to promote both innovation and
collaboration between companies. A framework of integrated ICT tools, called Plug and Play Business,
which supports innovators in turning their ideas into businesses by dynamically forming virtual enterprises,
is also formally specified. Furthermore, issues regarding the implementation of this framework are
discussed and some useful technologies are identified.
Chapter IV
Application of Uncertain Variables to Knowledge-Based Resource Distribution
Donat Orski, Wroclaw University of Technology, Poland
The chapter concerns a class of systems composed of operations performed with the use of resources
allocated to them. In such operation systems, each operation is characterized by its execution time
depending on the amount of a resource allocated to the operation. The decision problem consists in
distributing a limited amount of a resource among operations in an optimal way, that is, in finding
an optimal resource allocation. Classical mathematical models of operation systems are widely used
in computer supported projects or production management, allowing optimal decision making in
deterministic, well-investigated environments. In the knowledge-based approach considered in this
chapter, the execution time of each operation is described in a nondeterministic way, by an inequality
containing an unknown parameter, and all the unknown parameters are assumed to be values of uncertain
variables characterized by experts. Mathematical models comprising such two-level uncertainty are
useful in designing knowledge-based decision support systems for uncertain environments. The purpose
of this chapter is to present a review of problems and algorithms developed in recent years, and to show
new results, possible extensions and challenges, thus providing a description of the state of the art in the
field of resource distribution based on uncertain variables.
Chapter V
A Methodology of Design for Virtual Environments
Clive Fencott, University of Teesside, UK
This chapter undertakes a methodological study of virtual environments (VEs), a specific subset of
interactive systems. It takes as a central theme the tension between the engineering and aesthetic
notions of VE design. First, method is defined in terms of an underlying model, language, process
model, and heuristics. The underlying model is characterized as an integration of Interaction Machines
and Semiotics, with the intention of making the design tension work to the designer’s benefit rather than
trying to eliminate it. The language is then developed as a juxtaposition of UML and the integration of
a range of semiotics-based theories. This leads to a discussion of a process model and the activities that
comprise it. The intention throughout is not to build a particular VE design method, but to investigate
the methodological concerns and constraints such a method should address.
Chapter VI
An Ontological Representation of Competencies as Codified Knowledge
Salvador Sanchez-Alonso, University of Alcalá, Spain
Dirk Frosch-Wilke, University of Applied Sciences, Germany
In current organizations, the models of knowledge creation include specific processes and elements that
drive the production of knowledge aimed at satisfying organizational objectives. The knowledge life cycle
(KLC) model of the Knowledge Management Consortium International (KMCI) provides a comprehensive
framework for situating competencies as part of the organizational context. Recent work on the use of
ontologies for the explicit description of competency-related terms and relations can be used as the basis
for a study on the ontological representation of competencies as codified knowledge, situating those
definitions in the KMCI lifecycle model. In this chapter, we discuss the similarities between the life cycle
of knowledge management (KM) and the processes in which competencies are identified and assessed.
The concept of competency, as well as the standard definitions for this term that coexist nowadays, will
then be connected to existing KLC models in order to provide a more comprehensive framework for
competency management in a wider KM context. This chapter also depicts the framework’s integration
into the KLC of the KMCI in the form of ontological definitions.
Section II
Integration Aspects for Agent Systems
Chapter VII
Aspects of Openness in Multi-Agent Systems: Coordinating the Autonomy in Agent Societies
Marcos De Oliveira, University of Otago, New Zealand
Martin Purvis, University of Otago, New Zealand
In the distributed multi-agent systems discussed in this chapter, heterogeneous autonomous agents
interoperate in order to achieve their goals. In such environments, agents can be embedded in diverse
contexts and interact with agents of various types and behaviours. Mechanisms are needed for coordinating
these multi-agent interactions, and so far they have included tools for the support of conversation protocols
and tools for the establishment and management of agent groups and electronic institutions. In this
chapter, we explore the necessity of dealing with openness in multi-agent systems and its relation to
the agents’ autonomy. We stress the importance of building coordination mechanisms capable of managing
complex agent societies composed of autonomous agents and introduce our institutional environment
approach, which includes the use of commitments and normative spaces. It is based on a metaphor in
which agents may join an open system at any time, but they must obey regulations in order to maintain
a suitable reputation, which reflects their degree of cooperation with other agents in the group and makes
them more desirable partners for others. Coloured Petri Nets are used to formalize a workflow in the
institutional environment defining a normative space that guides the agents during interactions in the
conversation space.
Chapter VIII
How Can We Trust Agents in Multi-Agent Environments? Techniques and Challenges
Kostas Kolomvatsos, National and Kapodistrian University of Athens, Greece
Stathes Hadjiefthymiades, National and Kapodistrian University of Athens, Greece
The field of multi-agent systems (MAS) has been an active research area for many years due to the importance
that agents have for many disciplines of computer science. MAS are open and dynamic systems
where a number of autonomous software components, called agents, communicate and cooperate in
order to achieve their goals. In such systems, trust plays an important role. There must be a way for
an agent to make sure that it can trust another entity, which is a potential partner. Without trust, agents
cannot cooperate effectively, and without cooperation they cannot fulfill their goals. Trust is often
based on reputation, which indicates whether we may trust someone. This important research area is
investigated in this chapter. We discuss the main issues concerning reputation and trust in MAS. We
present research efforts and give formalizations useful for understanding the two concepts.
Chapter IX
The Concept of Autonomy in Distributed Computation and Multi-Agent Systems
Mariusz Nowostawski, University of Otago, New Zealand
The concept of autonomy is one of the central concepts in distributed computational systems, and in
multi-agent systems in particular. With diverse implications in philosophy, social sciences and the theory
of computation, autonomy is a rather complicated and somewhat vague notion. Most researchers do
not discuss the details of this concept, but rather assume a general, common-sense understanding of
autonomy in the context of computational multi-agent systems. In this chapter, we will review the existing
definitions and formalisms related to the notion of autonomy. We re-introduce two concepts: relative
autonomy and absolute autonomy. We argue that even though the concept of absolute autonomy does
not make sense in computational settings, it is useful if treated as an assumed property of computational
units. For example, the concept of autonomous agents facilitates more flexible and robust architectures.
We adopt and discuss a new formalism based on results from the study of massively parallel multi-agent
systems in the context of Evolvable Virtual Machines. We also present an architecture for building such
systems, based on our multi-agent system KEA, where we use an extended notion of dynamic and
flexible linking. We augment our work with theoretical results from chemical abstract machine algebra
for concurrent and asynchronous information processing systems. We argue that for open distributed
systems, entities must be connected by multiple computational dependencies and a system as a whole
must be subjected to influence from external sources. However, the exact linkages are not directly
known to the computational entities themselves. This provides a useful notion and the necessary means
to establish autonomy in such open distributed systems.
Chapter X
An Agent-Based Library Management System Using RFID Technology
Maryam Purvis, University of Otago, New Zealand
Toktam Ebadi, University of Otago, New Zealand
Bastin Tony Roy Savarimuthu, University of Otago, New Zealand
The objective of this research is to describe a mechanism to provide an improved library management
system using RFID and agent technologies. One of the major issues in large libraries is tracking misplaced
items. By moving from conventional technologies such as barcode-based systems to RFID-based systems
and using software agents that continuously monitor and track the items in the library, we believe an
effective library system can be designed. Due to constant monitoring, the up-to-date location information
of the library items can be easily obtained.
Chapter XI
Mechanisms to Restrict Exploitation and Improve Societal Performance in Multi-Agent Systems
Sharmila Savarimuthu, University of Otago, New Zealand
Martin Purvis, University of Otago, New Zealand
Maryam Purvis, University of Otago, New Zealand
Mariusz Nowostawski, University of Otago, New Zealand
Societies are made of different kinds of agents, some cooperative and some uncooperative. Uncooperative
agents tend to reduce the overall performance of the society, due to exploitation practices. In the real
world, it is not possible to eliminate all the uncooperative agents; thus the objective of this research is to
design and implement mechanisms that will improve the overall benefit of the society without excluding
uncooperative agents. The mechanisms that we have designed include referrals and resource restrictions.
A referral scheme is used to identify and distinguish noncooperators and cooperators. Resource restriction
mechanisms are used to restrict noncooperators from selfish resource utilization. Experimental results
are presented describing how these mechanisms operate.
Chapter XII
Norm Emergence in Multi-Agent Societies
Bastin Tony Roy Savarimuthu, University of Otago, New Zealand
Maryam Purvis, University of Otago, New Zealand
Stephen Cranefield, University of Otago, New Zealand
Norms are shared expectations of behaviours that exist in human societies. Norms help societies by
increasing the predictability of individual behaviours and by improving cooperation and collaboration
among members. Norms have been of interest to multi-agent system researchers, as software agents
intend to follow certain norms. However, owing to their autonomy, agents sometimes violate norms, and such
violations need monitoring. In order to build robust MAS that are norm compliant and systems that evolve and
adapt norms dynamically, the study of norms is crucial. Our objective in this chapter is to propose a
mechanism for norm emergence in artificial agent societies and provide experimental results. We also
study the role of autonomy and visibility threshold of an agent in the context of norm emergence.
Chapter XIII
Multi-Agent Systems Engineering: An Overview and Case Study
Scott A. DeLoach, Kansas State University, USA
Madhukar Kamar, Software Engineer, USA
This chapter provides an overview of the Multi-agent Systems Engineering (MaSE) methodology for
analyzing and designing multi-agent systems. MaSE consists of two main phases that result in the
creation of a set of complementary models that get successively closer to implementation. MaSE has
been used to design systems ranging from a heterogeneous database integration system to a biologically
based, computer virus-immune system to cooperative robotics systems. The authors also provide a case
study of an actual system developed using MaSE in an effort to help demonstrate the practical aspects
of developing systems using MaSE.
Section III
Fuzzy-Based and Other Methods for Integration
Chapter XIV
Modeling, Analysing, and Control of Agents Behaviour
František Čapkovič, Institute of Informatics, Slovak Academy of Sciences, Slovak Republic
An alternative approach to modeling and analysis of agents’ behaviour is presented in this chapter. The
agents and agent systems are understood here to be discrete-event systems (DES). The approach is
based on place/transition Petri nets (P/T PN), which yield both a suitable graphical or mathematical
description of DES and applicable means for testing the DES properties as well as for the synthesis
of the agents’ behaviour. The reachability graph (RG) of the P/T PN-based model of the agent system and
the space of feasible states are found. The RG adjacency matrix helps to form an auxiliary hypermodel
in the space of the feasible states. State trajectories representing the actual interaction processes among
agents are computed by means of the mutual intersection of both the straight-lined reachability tree
(developed from a given initial state toward a prescribed terminal one) and the backtracking reachability
tree (developed from the desired terminal state toward the initial one; however, oriented toward the
terminal state). Control interferences are obtained on the basis of the most suitable trajectory chosen
from the set of feasible ones.
Chapter XV
Using Fuzzy Segmentation for Colour Image Enhancement of Computed Tomography Perfusion Images
Martin Tabakov, Wrocław University of Technology, Poland
This chapter presents a methodology for enhancing computed tomography perfusion
images by means of a partition generated with an appropriately defined fuzzy relation. The proposed image
processing is used to improve the radiological analysis of brain perfusion. Colour image segmentation
is the process of dividing the pixels of an image into several homogeneously coloured and topologically
connected groups, called regions. As the concept of homogeneity in a colour space is imprecise, a measure
of dependency between the elements of such a space is introduced. The proposed measure is based on
a pixel metric defined in the HSV colour space. Using this measure, a fuzzy similarity relation is defined,
which is then used to introduce a clustering method that generates a partition, and hence a segmentation.
The achieved segmentation results are used to enhance the considered computed tomography perfusion
images with the purpose of improving the corresponding radiological recognition.
Chapter XVI
Fuzzy Mediation in Shared Control and Online Learning
Giovanni Vincenti, Research and Development at Gruppo Vincenti, Italy
Goran Trajkovski, Algoco eLearning Consulting, USA
This chapter presents an innovative approach to the field of information fusion. Fuzzy mediation
differentiates itself from other algorithms, as this approach is dynamic in nature. The experiments reported
in this work analyze the interaction of two distinct controllers as they try to maneuver an artificial agent
through a path. Fuzzy mediation functions as a fusion engine to integrate the two inputs to produce a
single output. Results show that fuzzy mediation is a valid method to mediate between two distinct
controllers. The work reported in this chapter lays the foundation for the creation of an effective tool
that uses positive feedback systems instead of negative ones to train human and nonhuman agents in
the performance of control tasks.
Chapter XVII
Utilizing Past Web for Knowledge Discovery
Adam Jatowt, Kyoto University, Japan
Yukiko Kawai, Kyoto Sangyo University, Japan
Katsumi Tanaka, Kyoto University, Japan
The Web is a useful data source for knowledge extraction, as it provides diverse content on virtually
any possible topic. Hence, much recent research has aimed at improving Web mining.
However, relatively little research has directly taken into account the temporal aspects of
the Web. In this chapter, we analyze data stored in Web archives, which preserve content of the Web,
and investigate the methodology required for successful knowledge discovery from this data. We call
the collection of such Web archives the past Web, a temporal structure composed of the past copies of Web
pages. First, we discuss the character of the data and explain some concepts related to utilizing the past
Web, such as data collection, analysis and processing. Next, we introduce examples of two applications,
temporal summarization and a browser for the past Web.
Chapter XVIII
Example-Based Framework for Propagation of Tasks in Distributed Environments
Dariusz Król, Wrocław University of Technology, Poland
In this chapter, we propose a generic framework in C# to distribute and compute tasks defined by
users. Unlike the more popular models such as middleware technologies, our multinode framework is
a task-oriented desktop grid. In contrast with earlier proposals, our work provides a simple architecture to
define, distribute and compute applications. The results confirm and quantify the usefulness of such ad hoc
grids. Although significant additional experiments are needed to fully characterize the framework,
the simplicity with which it works in tandem with the user is the most important advantage of our current
proposal. The last section points out conclusions and future trends in distributed environments.
Chapter XIX
Survey on the Application of Economic and Market Theory for Grid Computing
Xia Xie, Huazhong University of Science and Technology, China
Jin Huang, Huazhong University of Science and Technology, China
Song Wu, Huazhong University of Science and Technology, China
Hai Jin, Huazhong University of Science and Technology, China
Melvin Koh, Asia Pacific Science & Technology Center, Sun Microsystems, Singapore
Jie Song, Asia Pacific Science & Technology Center, Sun Microsystems, Singapore
Simon See, Asia Pacific Science & Technology Center, Sun Microsystems, Singapore
In this chapter, we present a survey of some of the commercial players in the Grid industry, existing
research in the area of market-based Grid technology, and some of the concepts of the dynamic pricing
model that we have investigated. In recent years, it has been observed that commercial companies
are slowly shifting from owning their own IT assets in the form of computers, software and so forth,
to purchasing services from utility providers. Technological advances, especially in the area of Grid
computing, have been the main catalyst for this trend. The utility model may not be the most effective
model, and the price still needs to be determined at the point of usage. In general, market-based approaches
are more efficient in resource allocation, as they depend on price adjustment to accommodate fluctuations in
supply and demand. Therefore, determining the price is vital to the overall success of the market.
Preface
Rapid advances and wide availability have caused knowledge management to permeate the lives of
people from all walks of life. The development of the distributed knowledge technologies has extended
the reach of computer intelligence to almost everyone.
In our book, intelligence integration can be understood in two aspects. The first concerns
methods for integrating human intelligence that are useful for management and social sciences. The second
aspect relates to integration methods for intelligent computer systems such as agent systems, Web-based
systems, ad hoc systems and so forth. This edited book focuses on the second
aspect. It covers a broad range of intelligence integration approaches in distributed knowledge systems,
from Web-based systems through multi-agent and grid systems, and ontology management to fuzzy
approaches. It presents cutting edge research in knowledge management in the first decade of the 21st
century. The new directions include integration of computational intelligence, distributed computing
and data mining.
In order to achieve better knowledge integration in distributed environments that bring together
modern approaches from artificial intelligence, computer communication, and information
systems, several issues need to be addressed. These issues can be summarized by new computing ideas
for, among other things:
The research reported in this book is focused first and foremost on the above topics. The approach
followed to explain these topics is intentionally broad and exploratory.
This volume is focused on topics worthy of interest due to their significant advances. From the sub-
missions, the editors have selected 19 of the most interesting chapters for publication. These chapters
have been divided into three parts: Advanced Methods for Integration, Integration Aspects for Agent
Systems, and Fuzzy-based and other Methods for Integration.
The first section, Advanced Methods for Integration, consists of six chapters.
It starts with the chapter by J.L. Kulikowski, which gives an outlook on the possibility of extending
ontological models so that they serve as an effective and universal form of domain knowledge
representation in computer systems supporting decision making in various application areas. A definition
of ontology and of ontological models is given, as well as their formal representation by taxonomic
trees, bi-partite graphs, multigraphs, relations, super-relations and hyper-relations. The definitions
of the corresponding mathematical
notions are also given. Then, the concept of ontologies representing incomplete or uncertain domain
knowledge is presented. This concept is illustrated by an example of decision making in medicine.
The second chapter is by A. Bădică et al., and discusses data extraction from Web pages. The Web
is designed as a major information provider for the human consumer. However, information published
on the Web is difficult to understand and reuse by a machine. In this chapter, the authors show how
well established intelligent techniques based on logic programming and inductive learning combined
with more recent XML technologies might help to improve the efficiency of the task of data extraction
from Web pages. Their work can be seen as a necessary step of the more general problem of Web data
management and integration.
In the third chapter, A. Jacobsson and P. Davidsson introduce a formal model of virtual enterprises as
well as an analysis of their creation and operation. It is argued that virtual enterprises offer a promising
approach to promote both innovations and collaboration between companies. A framework of integrated
ICT-tools, called Plug and Play Business, which support innovators in turning their ideas into businesses
by dynamically forming virtual enterprises, is also formally specified. Furthermore, issues regarding the
implementation of this framework are discussed and some useful technologies are identified.
The fourth chapter, by D. Orski, concerns a class of systems composed of operations performed with
the use of resources allocated to them. In such operation systems, each operation is characterized by its
execution time depending on the amount of a resource allocated to the operation. The decision problem
consists in distributing a limited amount of a resource among operations in an optimal way, that is, in
finding an optimal resource allocation. In the knowledge-based approach considered in this chapter, the
execution time of each operation is described in a nondeterministic way, by an inequality containing an
unknown parameter, and all the unknown parameters are assumed to be values of uncertain variables
characterized by experts.
In the fifth chapter, C. Fencott undertakes a methodological study of virtual environments, a specific
subset of interactive systems. The underlying model is characterized as an integration of interaction ma-
chines and semiotics with the intention to make the design tension work to the designer’s benefit rather
than trying to eliminate it. The language is then developed as a juxtaposition of UML and the integration
of a range of semiotics-based theories. This leads to a discussion of a process model and the activities
that comprise it. The intention throughout is not to build a particular design method, but to investigate
the methodological concerns and constraints such a method should address.
In the last chapter of the first section, S. Sanchez-Alonso and D. Frosch-Wilke discuss the similarities
between the life cycle of knowledge management and the processes in which competencies are identified
and assessed. This chapter also presents the framework’s integration into the knowledge life cycle of
the Knowledge Management Consortium International in the form of ontological definitions. It includes
a brief discussion on some current definitions of the term competency and details the most interesting
efforts in the standardization of competency definitions. At the end, it provides a preliminary mapping
of competency-related concepts to terms in upper ontologies.
The second section of this book refers to Integration Aspects for Agent Systems and consists of seven
chapters.
The first chapter, by M. Oliveira and M. Purvis, is about some interesting aspects of coordinating and
integrating the autonomy in agent societies. In such environments, agents can be embedded in diverse
contexts and interact with agents of various types and behaviors. In this chapter, Oliveira and Purvis
explore the necessity of dealing with openness in multi-agent systems and its relation with the agent’s
autonomy. They stress the importance of building coordination mechanisms capable of managing
complex agent societies composed of autonomous agents and introduce their institutional environment
approach, which includes the use of commitments and normative spaces. It is based on a metaphor in
which agents may join an open system at any time, but they must obey regulations in order to maintain
a suitable reputation, which reflects their degree of cooperation with other agents in the group and
makes them more desirable partners for others. Colored Petri Nets are used to formalize a workflow in the
institutional environment defining a normative space that guides the agents during interactions in the
conversation space.
In the following chapter, K. Kolomvatsos and S. Hadjiefthymiades present techniques and
challenges for trusting agents in multi-agent environments. In such systems, there must be a way for
an agent to make sure that it can trust another entity, which is a potential partner. Without trust, agents
cannot cooperate effectively and without cooperation they cannot fulfill their goals. Many times, trust
is based on reputation. They discuss main issues concerning reputation and trust in MAS. They present
research efforts and give formalizations useful for understanding the two concepts.
The third chapter, by M. Nowostawski, presents some novel concepts of autonomy management in
distributed computation and multi-agent systems. He re-introduces two concepts: relative autonomy
and absolute autonomy. He argues that even though the concept of absolute autonomy does not make
sense in computational settings, it is useful if treated as an assumed property of computational units.
For example, the concept of autonomous agents facilitates more flexible and robust architectures. He
adopts and discusses a new formalism based on results from the study of massively parallel multi-agent
systems in the context of evolvable virtual machines. He also presents an architecture for building such
systems based on his multi-agent system KEA, where he uses an extended notion of dynamic and
flexible linking. This provides a useful notion and the necessary means to establish autonomy in open
distributed systems.
In the fourth chapter, M. Purvis et al. give an analysis of an agent-based library management system
using RFID technology. One of the major issues in large libraries is to track misplaced items. By mov-
ing from conventional technologies such as barcode-based systems to RFID-based systems and using
software agents that continuously monitor and track the items in the library, they believe an effective
library system can be designed. Due to constant monitoring, the up-to-date location information of the
library items can be easily obtained.
The authors of the fifth chapter, S. Savarimuthu et al., present several original mechanisms to restrict
exploitation and improve societal performance in multi-agent environments. Societies are made of dif-
ferent kinds of agents, some cooperative and some uncooperative. Uncooperative agents tend to reduce
the overall performance of the society, due to exploitation practices. In the real world, it is not possible
to eliminate all the uncooperative agents; thus, the objective of this research is to design and implement
mechanisms that will improve the overall benefit of the society without excluding uncooperative agents.
The mechanisms that they have designed include referrals and resource restrictions. A referral scheme
is used to identify and distinguish noncooperators and cooperators. Resource restriction mechanisms
are used to restrict noncooperators from selfish resource utilization. Experimental results are presented
describing how these mechanisms operate.
The sixth chapter is by B. Tony et al., and gives proof that norms can be shared expectations of
behaviours that exist in human societies and can help societies by increasing the predictability of
individual behaviours and by improving cooperation and collaboration among members. Norms have been
of interest to multi-agent system researchers as software agents intend to follow certain norms. But,
owing to their autonomy, agents sometimes violate norms, which needs monitoring. In order to build
robust MAS that are norm compliant and systems that evolve and adapt norms dynamically, the study
of norms is crucial. Their objective is to propose a mechanism for norm emergence in artificial agent
societies and provide experimental results. They also study the role of autonomy and visibility threshold
of an agent in the context of norm emergence.
In the last chapter in this section, S. DeLoach and M. Kumar present an overview of the multi-agent
systems engineering methodology for analyzing and designing multi-agent systems. This methodology
has been used to design systems ranging from a heterogeneous database integration system to a biologi-
cally based, computer virus-immune system to cooperative robotics systems. The authors also provide
a case study of an actual system developed using their methodology in an effort to help demonstrate the
practical aspects of developing such systems.
The last section consists of six chapters which are related to Fuzzy-based and other Methods for
Integration.
The first chapter, by F. Čapkovič, presents an approach based on Petri nets for modeling and analysing
agent behaviour. The agents and agent systems are understood here as Discrete-Event Systems (DES).
The approach is based on the place/transition Petri Nets (PN) that yield both the suitable graphical or
mathematical description of DES and the applicable means for testing the DES properties, as well as
for the synthesis of the agent’s behaviour. The reachability graph of the PN-based model of the agent
system and the space of feasible states are found. Control interferences are obtained on the basis of the
most suitable trajectory chosen from the set of feasible ones.
The second chapter, by M. Tabakow, includes a novel method of using fuzzy segmentation for color
image enhancement to computed tomography perfusion images. The proposed image processing is used
to improve the radiological analysis of the brain perfusion. Color image segmentation is a process of
dividing the pixels of an image into several homogeneously colored and topologically connected groups,
called regions. As the concept of homogeneity in a color space is imprecise, a measure of dependency
between the elements of such a space is introduced. The proposed measure is based on a pixel metric
defined in the HSV color space. By this measure a fuzzy similarity relation is defined, which next is
used to introduce a clustering method that generates a partition and so a segmentation. The achieved
segmentation results are used to enhance the considered computed tomography perfusion images in
order to improve the corresponding radiological recognition.
G. Vincenti’s and G. Trajkovski’s chapter presents a fuzzy mediation method for shared control and
online learning. Fuzzy mediation differentiates itself from other algorithms, as this approach is dynamic
in nature. The experiments reported in this work analyze the interaction of two distinct controllers as
they try to maneuver an artificial agent through a path. Fuzzy mediation functions as a fusion engine to
integrate the two inputs to produce a single output. Results show that fuzzy mediation is a valid method
to mediate between two distinct controllers. The work lays the foundation for the creation of an effective
tool that uses positive feedback systems instead of negative ones to train human and nonhuman agents
in the performance of control tasks.
In the fourth chapter, A. Jatowt et al. present a method for analysing data stored in Web archives
which preserve content of the Web, and investigating the methodology required for successful knowl-
edge discovery from this data. The Web is a useful data source for knowledge extraction, as it provides
diverse content virtually on any possible topic. They call the collection of such Web archives past Web,
a temporal structure composed of the past copies of Web pages. First, they discuss the character of the
data and explain some concepts related to utilizing the past Web, such as data collection, analysis and
processing. Next, they introduce examples of two applications, temporal summarization and a browser
for the past Web.
The next chapter is by D. Król and proposes a generic framework in C# to distribute and compute
tasks defined by users. Unlike the more popular models, such as middleware technologies, his
multi-node framework is a task-oriented desktop grid. In contrast with earlier proposals, this work provides
a simple architecture to define, distribute and compute applications. The results confirm and quantify the
usefulness of such ad-hoc grids. Although significant additional experiments are needed to fully char-
acterize the framework, the simplicity of how they work in tandem with the user is the most important
advantage of his current proposal.
And, last but not least, the chapter by X. Xie et al. includes an interesting survey on the application of
economic and market theory for grid computing. In recent years, it has been observed that commercial
companies are slowly shifting from owning their own IT assets in the form of computers, software and
so forth, to purchasing services from utility providers. Technological advances, especially in the area
of grid computing, have been the main catalyst for this trend. The utility model may not be the most
effective model and the price still needs to be determined at the point of usage. In general, market-based
approaches are more efficient in resource allocations, as they depend on price adjustment to accommodate
fluctuations in supply and demand. Therefore, determining the price is vital to the overall success
of the market.
The material of each chapter of this volume is self-contained. The editors hope that the book with
many papers provided by leading experts from all over the world can be useful for graduate and PhD
students in computer science; participants of courses in Knowledge Management, Collective Intelli-
gence, and Multi-agent Systems; and researchers and all readers working on knowledge management
and intelligent systems.
The editors would like to thank the authors who present very interesting research results in their
chapters. We are indebted to them for their reliability and hard work done in due time. We are looking
forward to the same fruitful collaboration during the next edition, which is planned for the near future.
We cordially thank the reviewers for their detailed and useful reviews. Special thanks are also given to the
IGI Global Team members for their friendly help and excellent editorial support in preparing the final
version of this volume.
Chapter I
Logical Inference Based on
Incomplete and/or Fuzzy
Ontologies
Juliusz L. Kulikowski
Polish Academy of Sciences, Poland
Abstract
In this chapter, a concept of using incomplete or fuzzy ontologies in decision making is presented. A defini-
tion of ontology and of ontological models is given, as well as their formal representation by taxonomic
trees, bi-partite graphs, multigraphs, relations, super-relations and hyper-relations. The definitions of
the corresponding mathematical notions are also given. Then, the concept of ontologies representing
incomplete or uncertain domain knowledge is presented. This concept is illustrated by an example of
decision making in medicine. The aim of this chapter is to give an outlook on the possibility of onto-
logical models extension in order to use them as an effective and universal form of domain knowledge
representation in computer systems supporting decision making in various application areas.
Copyright © 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
ontology. The aim of this chapter is to contribute to this concept in the particular cases when the
ontologies used in computer-based decision support systems have not been described finely enough.
The chapter is organized as follows. First, a concept of ontological models and their application to
decision making is presented. Here, the models based on taxonomic trees, graphs, multigraphs,
relations and hyper-relations are briefly described. Nondeterministic ontological models, including
fuzzy models and models based on a concept of semi-ordering of syndromes of relations, are described
next. Short conclusions are collected in the last section of the chapter. Our aim in this chapter is
to present the intuitive aspects of the proposed approach to decision making, rather than to reveal
its strong theoretical background.
ONTOLOGIES AND ONTOLOGICAL MODELS
Taxonomies
In the simplest cases, the idea of ontology can be reduced to a taxonomy of concepts assigned to
objects, phenomena or processes appearing in an examined part of the abstract or real world and
analyzed from some fixed points of view. For instance, in sociological investigations a concept of
People living in the town can be specified by a structure called a rooted tree, as shown in Figure 1.
Figure 1. Example of a taxonomic tree based on the attribute “Status”
[Tree: root People living in the town (I); first level: Inhabitants, Temporarily staying; second level: Commuters, Visitors]
Figure 2. A taxonomic tree based on the attribute “Gender”
[Tree: root People living in the town (II); first level: Men, Women]
However, the same concept may be presented in several other ways (Figures 2 and 3) and so forth.
The roots of the trees have been assigned above to the basic concept People living in the town, while
the subordinate nodes correspond to some subordinate concepts. It is also assumed that on each level
of any tree the subordinate concepts totally cover the corresponding higher-level concept. Rooted
trees interpreted in this way are called taxonomic trees. It is worth remarking that even in this
simple case the part of the real world under examination is represented by an ontology consisting not
of a single taxonomic tree but of several semantically linked ones. In general, formal structures
constituting ontologies (in the above-defined, narrow sense) will be called ontological models. This
given ontology
thus consists of three ontological models having the form of taxonomic trees, linked semantically
because their roots have been assigned to the same top-level concept.
And still, the class of problems whose solution might be supported by this ontology is rather poor.
It might contain, for example, designing a database of inhabitants of the town, planning some social
activities or investments in the town, or it might be used in any deliberations concerning the
population of the town. However, more advanced applications of this ontology are limited by its
evident deficiencies:
1. The taxonomic trees contain no information about the statistical structure of the world as a
composition of designates (real entities) represented by a given tree;
2. No relationships between the concepts belonging to different taxonomic trees have been described
by the ontology; and
3. Taxonomic trees do not define concepts, but only characterize hierarchical relationships between
higher level and lower level concepts.
Graphs
Ontologies reduced to taxonomic trees only are thus rather ineffective in real world description and
as tools supporting decision making. Let us also remark that trees in their graphical form are
suitable to be analyzed by a human only when the number of nodes is low; for computer-aided analysis
they should be represented in the form of digital data structures.
However, trees are a sort of graph, the latter being formally described by a triple (Tutte, 1984):
$G = [C, \Lambda, \varphi]$ (1)
where $C$ denotes a set of nodes, $\Lambda$ stands for a set of edges and $\varphi$ is a function
(incidence function) assigning edges to some ordered pairs of nodes so that any edge can be assigned
to at most one pair of nodes. An edge $l_{ij}$ assigned to the pair $[c_i, c_j]$ of nodes is called
outgoing from $c_i$ and incoming to $c_j$.
There are several possibilities of defining a tree as a sort of graph. The simplest one is based on
the statement that a connected graph is a tree if the number of its nodes is 1 larger than the number
of edges linking them. A tree is called a rooted tree if: 1) it contains exactly one node, called a
root, to which no incoming edge is assigned and 2) to each other node exactly one incoming edge is
assigned. The nodes of a rooted tree to which no outgoing edges have been assigned are called leaves
of the tree.
The taxonomies of an ontology are represented by rooted trees whose roots have been assigned to the
top-level concepts, while other nodes correspond to the subordinate concepts. Any concept in a
taxonomic tree is characterized by its level-number, that is, the number of edges connecting the
corresponding node with the root. For example, in the given taxonomy of People living in the town (I)
the concept Inhabitants is a first-level one, while Visitors is a second-level one. The top-level
concepts are 0-level ones.
On the basis of graph algebra operations (Kulikowski, 1986) simple ontological models represented by
graphs can be used to create more sophisticated ontological models. For instance, several taxonomic
trees corresponding to the same top-level concept can be represented in the form of a unified
taxonomic tree. This can be illustrated in the case of two taxonomic trees. For this purpose, a
Cartesian product of the trees (in general, of the graphs) can be used. Let
$G^{(1)} = [C^{(1)}, \Lambda^{(1)}, \varphi^{(1)}]$, $G^{(2)} = [C^{(2)}, \Lambda^{(2)}, \varphi^{(2)}]$
be two graphs. Their Cartesian product $G = G^{(1)} \times G^{(2)}$ is defined as a graph such that:
1. The set of its nodes $C = C^{(1)} \times C^{(2)}$, which means that each node of $G$ is an ordered
pair of some nodes of $G^{(1)}$ and $G^{(2)}$;
2. The set of its edges $\Lambda = \Lambda^{(1)} \times \Lambda^{(2)}$; and
3. Its incidence function $\varphi$ assigns an edge $l_{prqs} = [l^{(1)}_{pr}, l^{(2)}_{qs}]$ to the
pair of nodes $c_{pq} = [c^{(1)}_p, c^{(2)}_q]$, $c_{rs} = [c^{(1)}_r, c^{(2)}_s]$, if and only if
$l^{(1)}_{pr}$ is assigned by $\varphi^{(1)}$ to the pair of nodes $[c^{(1)}_p, c^{(1)}_r]$ and
$l^{(2)}_{qs}$ is assigned by $\varphi^{(2)}$ to the pair of nodes $[c^{(2)}_q, c^{(2)}_s]$.
For example, a Cartesian product of the first two taxonomic trees based on the attributes “Status”
and “Gender” takes the form in Figure 4.
For the sake of formal accuracy, it has been assumed that the graphs $G^{(1)}$ and $G^{(2)}$ admit the
existence of edges of the form $l^{(1)}_{ii}$, $l^{(2)}_{jj}$ (loops) linking each node with itself.
Using graphs (instead of trees only) in ontologies provides some additional possibilities to describe
relationships between concepts. For example, let us take once more into consideration the first two
taxonomic trees truncated to their upper two levels. Let $A = \{a_1, a_2, \ldots, a_K\}$ be a set of
persons living in the given town. They can be classified according to the above-given ontology, that
is, assigned to the leaves of the taxonomic trees. However, we would like to represent the persons and
the attributes assigned to them, Gender = {M – Man, W – Woman} and Status = {I – Inhabitant, ST –
Temporarily staying}, in the form of a more concise structure. For this purpose, a graph $G'$ will be
constructed whose set of nodes $C'$ consists of three disjoint subsets,
$C' = A \cup \{M, W\} \cup \{I, ST\}$, and whose incidence function $\varphi'$ admits edges connecting
only persons with their attributes so that each person is connected with exactly two attributes: the
first belonging to the subset $\{M, W\}$ and the second belonging to $\{I, ST\}$, as illustrated in
Figure 5.
This bipartite graph represents a distribution of the attributes Gender and Status in a subset $A$ of
persons living in the town. However, it is not a tree, as can be proved by counting and comparing the
numbers of its nodes and edges.
Using graphs as ontological models makes it possible to use typical algebraic operations on graphs to
construct more sophisticated models as compositions of some simpler ones. This can be illustrated by
the following example.
Let $G'$ be the bipartite graph illustrated in Figure 5 and $G''$ be a bipartite graph representing the
distribution of the attribute Age = {Y – Young, MA – Middle aged, O – Old} in the defined set $A$ of
elements (persons), as shown in Figure 6.
The sum of graphs $G' \cup G''$ can be defined as a graph $G^* = [C^*, \Lambda^*, \varphi^*]$ such that
$C^* = C' \cup C''$, $\Lambda^* = \Lambda' \cup \Lambda''$, and $\varphi^*$ is an incidence function
such that an edge is assigned to a pair of nodes if it is assigned by at least one of the incidence
functions, $\varphi'$ or $\varphi''$ (if two different edges have been assigned to the given pair of
nodes by the two incidence functions, then the question of which edge should finally be assigned to
it can be resolved arbitrarily).
Applying the definition of a sum of graphs to the graphs shown in Figure 5 and Figure 6, one
obtains a graph $G^*$ illustrating the distribution of three attributes, shown in Figure 7.
[Figure 5: attribute nodes M, W, I, ST; person nodes a1, a2, ..., a7, aK]
[Figure 6: attribute nodes Y, MA, O; person nodes a1, a2, ..., a7, aK]
[Figure 7: attribute nodes M, W, I, ST, Y, MA, O; person nodes a1, a2, ..., a7, aK]
In a similar way, ontological models describing the distribution of higher numbers of attributes over
fixed sets of elements can be constructed using the operations of the algebra of graphs. Such models
may have the form of bipartite graphs, the subset of nodes representing the attributes being
subdivided into mutually disjoint lower-level subsets of values of the given attributes. For effective
calculations, the graphs should be stored in computer memory in the form of the corresponding
connection matrices or incidence matrices (Kulikowski, 1986).
However, the graphs presented in Figures 5, 6 and 7 are rather untypical as ontological models,
because the idea of ontological models consists in presenting knowledge in aggregated form rather than
by individual listing of instances. The algebra of graphs provides us with more universal and flexible
tools for the construction of ontological models than taxonomic trees. Alas, in certain cases this
tool is not quite suitable to a presentation
of knowledge about the real world, as can be shown by the following example.
Let us assume that a problem consists in describing the impact of papers published in a scientific
journal on the distribution of scientific results in the world. For this purpose, a corresponding
ontological model should be constructed. Let us try to construct it in the form of a bipartite graph:
$G = [V \cup T, \Lambda' \cup \Lambda'', \varphi]$, (2)
where $V$ is a subset of nodes assigned to affiliations of authors, $T$ is a subset of nodes assigned
to the topics covering the profile of the journal, $\Lambda'$ is a subset of oriented edges (arcs)
connecting nodes belonging to $V$ with those belonging to $T$, $\Lambda''$ is a complementary subset of
oriented edges connecting nodes belonging to $T$ with those belonging to $V$ and $\varphi$ is an
incidence function such that:
1. An edge (arc) $l'_{ip}$ connects a node $v_i$, $v_i \in V$, with a node $t_p$, $t_p \in T$, if and
only if at least one paper has been published in the journal such that the affiliation of (at least
one of) its authors was $v_i$ and the topic of the paper can be classified as belonging to $t_p$; and
2. An edge (arc) $l''_{qj}$ connects a node $t_q$, $t_q \in T$, with a node $v_j$, $v_j \in V$, if and
only if at least one paper published in the journal, whose topic can be classified as belonging to
$t_q$, has been cited somewhere by an author whose affiliation was $v_j$.
A hypothetical part of such a graph is shown in Figure 8.
[Figure 8: affiliation nodes v1, v2, ..., v10, vN; topic nodes t1, t2, ..., t5, tM]
Let us take into consideration a partial graph consisting of the nodes $v_2$, $v_5$, $v_8$ and $t_2$
shown in Figure 9.
Several interpretations of this partial graph are possible:
1. An author from $v_2$ has published in the journal a paper on $t_2$;
2. An author from $v_5$ has published in the journal a paper on $t_2$;
3. An author from $v_8$ has published in the journal a paper on $t_2$;
4. Authors from $v_2$ and $v_5$ have jointly published in the journal a paper on $t_2$;
5. Authors from $v_2$ and $v_8$ have jointly published in the journal a paper on $t_2$;
6. Authors from $v_5$ and $v_8$ have jointly published in the journal a paper on $t_2$;
7. Authors from $v_2$, $v_5$ and $v_8$ have jointly published in the journal a paper on $t_2$; or
8. An author from $v_5$ has cited at least one of the above-mentioned papers.
However, in the last case, it is not clear: was it a self-citation (four possibilities) or a citation
of papers written by other authors (three possibilities)? Therefore, the ontological model presented
in Figure 8 does not reflect all types of scientific information distribution caused by papers
published in the given journal.
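The ambiguity described above can be made concrete with a short sketch. It is only an illustration under stated assumptions: the function name publication_arcs and the two publication scenarios are hypothetical and do not come from the chapter; only the rule for drawing an arc from an affiliation to a topic follows the definition of the bipartite model (2).

```python
from itertools import combinations

def publication_arcs(papers):
    """Return the arc set of the bipartite model: (affiliation, topic) is an arc
    iff at least one paper on that topic has an author with that affiliation."""
    return {(v, topic) for affiliations, topic in papers for v in affiliations}

# Hypothetical scenario A: three separate papers on t2, one per affiliation.
separate = [({"v2"}, "t2"), ({"v5"}, "t2"), ({"v8"}, "t2")]
# Hypothetical scenario B: a single paper on t2 written jointly.
joint = [({"v2", "v5", "v8"}, "t2")]

arcs_separate = publication_arcs(separate)
arcs_joint = publication_arcs(joint)

# Every non-empty group of affiliations is a possible author list of one paper.
affiliations = ["v2", "v5", "v8"]
single_paper_patterns = [
    set(group)
    for size in range(1, len(affiliations) + 1)
    for group in combinations(affiliations, size)
]

print(arcs_separate == arcs_joint)   # True: the arcs cannot tell A from B
print(len(single_paper_patterns))    # 7, matching interpretations 1 to 7
```

Both scenarios collapse to the same three arcs, which is why a model that records only pairs of nodes cannot say who wrote a paper together with whom.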
Figure 9. A partial graph of the graph shown in Figure 8
[Graph: affiliation nodes v2, v5, v8; topic node t2]
Multigraphs
In general, many relationships existing in the real world cannot be adequately presented by
ontological models given in the form of graphs. Some larger possibilities are offered by multigraphs,
that is, graphs whose incidence function admits more than one edge to any given pair of nodes. This
can be illustrated by the following example.
Let us take into consideration a problem of young population flow and migration between the schools
in a certain region. For analysis of the problem, an ontology consisting of several ontological
models should be created, such as:
a. Taxonomic models of a regional population of pupils and students (sex, social background, etc.);
b. Taxonomic model of regional schools of any types and levels; and
c. Ontological model describing the flow of young population between the schools.
Our attention here will be focused on the last ontological model. For this purpose, it will be
defined as a weighted multigraph:
$M = [\Sigma, F, R^+, \varphi]$ (3)
where:
• $\Sigma$ is a set of nodes assigned to the regional schools (extended by adding the category
“Other” for the schools outside the region);
• $F$ is a set of oriented edges (arcs) assigned to the flows of pupils and students between the
schools within the region, as well as coming from outside or going away from the region; the edges
should also indicate a subclassification of flows based on the taxonomies following from the type a)
models;
• $R^+$ is a non-negative real half-axis used as a scale of flow intensity; and
• $\varphi$ is a multigraph (vector) incidence function assigning to any pair of nodes $[S_i, S_j]$ an
arc $f^{(r)}_{ij} \in F$ and a value $u^{(r)}_{ij} \in R^+$ if and only if between the corresponding
pair of schools a flow of intensity $u^{(r)}_{ij}$ of the $r$-th category of pupils (students) takes
place.
A part of a multigraph of this type is illustrated in Figure 10. For the sake of simplicity, multiple
arcs have been replaced by single ones and the denotations of arcs have been reduced to the weights of
arcs (flow intensities in persons/year) presented in a concise symbolic form (in fact, they are
numerical vectors whose components correspond to different sorts of pupils, for example, to Boys and
Girls).
This model makes it possible to show, for example, which universities in the region are directly
supplied with former pupils by given secondary schools or what the social background is of pupils or
students entering the given schools. However, on the basis of this model it is not possible to give
a reply to a question such as: which elementary schools educate the highest percentage of pupils who
graduate from the universities? The inadequacy of the above-described ontological model to answer
these kinds of questions consists in the fact that graphs as well as multigraphs describe
relationships between pairs of objects only, while our question concerns relationships among (in the
simplest case) triples of elements:
[elementary school, secondary school, university]. The information about the detailed structure of flows entering a given node is lost as a result of aggregation of flow components, and it is not possible to reconstruct it by any analysis of the components of the outgoing flows.

Figure 10. A simplified partial multigraph representing the flow (migration) of pupils (students) between regional schools

Relations

If Q1, Q2,…, Qn are some nonempty sets taken in the given linear order and C = Q1 × Q2 ×…× Qn is their Cartesian product, then any subset:

r ⊆ C   (4)

is called a relation described on the (linearly ordered) family of sets [Q1, Q2,…, Qn]. According to the definition, r is a set of n-tuples of the form [a, b,…, h] such that a ∈ Q1, b ∈ Q2,…, and h ∈ Qn, sometimes called syndromes of the relation.

For a fixed linearly ordered family of sets and the corresponding Cartesian product C it is possible to take into consideration a family Φ of all possible subsets of C, including C itself and an empty subset ∅. Φ is thus a family of all possible relations that can be defined on the given family of sets. On the other hand, it is possible to apply to it the general set-algebraic rules (Rasiowa & Sikorski, 1968), which in this case become an algebra of relations described on the family of sets [Q1, Q2,…, Qn]. Moreover, this algebra can also be extended to all relations described on any subsets of this family, assuming that the original linear order has been preserved (Kulikowski, 1972). The extended algebra of relations, being in fact a sort of Boolean algebra, becomes a flexible tool not only for the description of relations between any finite number of arguments, but also for the creation of more sophisticated relations as algebraic compositions of some simpler ones, as well as for the creation of higher-order relations (superrelations) defined as relations between relations (Kulikowski, 1992).

Multi-argument relations cannot be easily presented in graphical form. However, there are several methods of describing a new relation:

• By listing the syndromes of the relation;
• By presenting it as an algebraic composition of some other, known relations; or
• By presenting a testing function making it possible to decide whether the relation is satisfied by any given syndrome.
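As a minimal sketch, the three ways of describing a relation listed above can be mimicked in Python with sets of tuples. The small sets Q1, Q2 and Q3, their element names and the `girls_only` testing function are hypothetical illustrations, not part of the chapter.

```python
from itertools import product

# A linearly ordered family of three small, nonempty sets (illustrative names).
Q1 = {"B", "G"}    # e.g., sex
Q2 = {"Sa", "Sb"}  # e.g., elementary schools
Q3 = {"Sp", "Sq"}  # e.g., secondary schools

# The Cartesian product C; every relation on [Q1, Q2, Q3] is a subset of C.
C = set(product(Q1, Q2, Q3))

# Method 1: list the syndromes (tuples) of the relation explicitly.
r_listed = {("B", "Sa", "Sp"), ("G", "Sa", "Sq"), ("G", "Sb", "Sq")}
assert r_listed <= C  # a relation must be a subset of C


# Method 3: give a testing function deciding whether a syndrome belongs.
def girls_only(syndrome: tuple) -> bool:
    """Hypothetical test: the first component must be 'G'."""
    return syndrome[0] == "G"


r_tested = {s for s in C if girls_only(s)}

# Method 2: compose known relations algebraically (set algebra on subsets of C).
r_composed = r_listed & r_tested  # intersection
complement = C - r_listed         # complement with respect to C

print(sorted(r_composed))
```

Representing relations as sets makes union, intersection and complement directly available, which mirrors the observation that relations on a fixed family of sets obey the general set-algebraic rules.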
The first method can be illustrated by the following example. The problem of young population flow and migration between the schools will be considered again. We would like to create an ontological model making possible the investigation of the contribution of elementary schools in the region to the educational productivity of universities, taking into account the sex of the graduate students. For this purpose five sets will be taken into consideration:

• Q1 = {B, G} describing sex (Boys, Girls);
• Q2 = {Sa, Sb,…, Sh} describing elementary schools in the given region;
• Q3 = {S’p, S’q,…, S’t} describing secondary schools;
• Q4 = {S*i, S*j,…, S*k} describing universities; and
• Q5 ≡ R+, a non-negative real half-axis representing flow intensities.

On the basis of the Cartesian product C = Q1 × Q2 × Q3 × Q4 × Q5 a relation r can be defined, given in the form of a list of syndromes of the form

v = [x, Sα, S’β, S*γ, w],   (5)

where x ∈ Q1, Sα ∈ Q2, S’β ∈ Q3, S*γ ∈ Q4, w ∈ Q5. Each syndrome represents a component of the flow together with the parameters characterizing it. The relation can easily be represented in a computer; however, it cannot be so easily plotted on a plane. Answering the former question, namely what is the contribution of a given elementary school, say Sα, to supplying a given university, say S*γ, with, say, girl students (G), is then reduced to selecting from r a subrelation r’ ⊆ r consisting of all syndromes of the form

v’ = [G, Sα, F, S*γ, w],   (6)

where F denotes an undefined data value (here denoting any secondary school). The final answer can be reached by summing over F all values w of the syndromes of r’.

As mentioned before, the algebra of relations makes it possible to combine ontological models in order to get more suitable forms of reality description. For example, if r^(κ) and r^(λ) are two relations of similar structure described on the same family of sets [Q1, Q2,…, Qn], then a sum of relations

r = r^(κ) ∪ r^(λ)   (7)

is a relation consisting of all syndromes satisfying r^(κ) or r^(λ). In the above-described example, if r^(κ) and r^(λ) describe the flow of pupils (students) in two consecutive school years, then r describes it in the two school years taken together.

Another situation arises if the relations r^(κ) and r^(λ) are described on different families of sets, say, respectively, on [Q^(κ)_1, Q^(κ)_2,…, Q^(κ)_n] and [Q^(λ)_1, Q^(λ)_2,…, Q^(λ)_m]. In this case, assuming that both families are conformably ordered, the algebraic operations can be defined according to the extended relations algebra rules (Kulikowski, 1992). In particular, a sum of relations can be defined as a relation described on the sum of families of sets [Q^(κ)_1, Q^(κ)_2,…, Q^(κ)_n] ∪ [Q^(λ)_1, Q^(λ)_2,…, Q^(λ)_m] consisting of syndromes such that each syndrome either 1) in its part belonging to [Q^(κ)_1, Q^(κ)_2,…, Q^(κ)_n] satisfies r^(κ), or 2) in its part belonging to [Q^(λ)_1, Q^(λ)_2,…, Q^(λ)_m] satisfies r^(λ).

As an example, let a problem of air-passenger flow intensity in selected airports be considered. For this purpose, a set A = {a1, a2,…, aK} of international airports will be considered. It will be taken in three versions: as departure airports A’, as transit airports A* and as destination airports A”. In addition, a set V of flow intensity values, V ≡ R+, where R+ is a non-negative half-axis, will be taken into account. Then two Cartesian products will be constructed: C = A’ × A” × R+ and C* = A’ × A* × A” × R+. Let us also select a subset D ⊂ A of particular interest, say, of international airports in a certain country. On the basis of C
two relations can be described: 1) r’, describing direct flights starting from any airport of D, D ⊂ A’, and terminating in any airport of A”, and 2) r”, describing direct flights starting from any airport of A’ and terminating in any airport of D, D ⊂ A”. In addition, on the basis of C* a relation r*, describing transit flights from any airport of A’ through any airport of D to any airport of A”, will be described. The syndromes of the above-mentioned relations thus indicate the names of starting, transit or terminating airports between which the flights took place within a certain time period, as well as the intensity of the corresponding flow of passengers. Let us assume that the total flow of passengers through the airports of D is of interest. Then, an extended algebraic sum of relations, r = r’ ∪ r* ∪ r”, should be taken into account and the corresponding arithmetic sum of intensities should be calculated. The syndromes of r are quadruples of the general form [starting airport, transit airport, terminating airport, intensity of the flow of passengers] such that exactly one starting, transit or terminating airport belongs to D, and the other airports within the set A are unrestricted.

In a similar way, extended intersection of relations can be used in the creation of ontological models. For example, let us take into consideration a family F = [Q1, Q2, Q3, Q4, Q5, Q6] of sets, where Q1 denotes a set of names of teachers, Q2 a set of subjects, Q3 a set of school classes, Q4 a set of classrooms, Q5 a set of weekdays and Q6 a set of school hours. On the Cartesian product C’ = Q1 × Q2 × Q3 a relation r’ between teachers, subjects and school classes may be defined. On the Cartesian product C” = Q2 × Q3 × Q5 a relation r” between subjects, school classes and weekdays can be defined in a similar way. Finally, a relation r’’’ between classrooms, weekdays and school hours can be established on the Cartesian product C’’’ = Q4 × Q5 × Q6. The relations r’, r” and r’’’ can be established independently of each other by taking into account some constraints imposed on the corresponding syndromes. Then, a problem arises of the construction of a relation r containing all syndromes consisting of teachers, subjects, school classes, classrooms, weekdays and school hours satisfying the constraints. This relation can be defined as an extended algebraic intersection of relations:

r = r’ ∩ r” ∩ r’’’   (8)

whose syndromes, by definition, projected on C’ satisfy the relation r’, projected on C” satisfy r”, and projected on C’’’ satisfy r’’’. Then, finally, on the basis of the relation r, an optimized timetable can be constructed.

It might seem that the extended algebra of relations is a tool sufficient to construct a large class of ontological models. The following examples show that it is not quite so.

Hyper-Graphs

Let C be a set of school handbooks offered at a book market. A problem of recommending collections of handbooks for teaching given subjects during a multiyear education process will be considered. For this purpose, an ontology describing the regional educational subsystem should be constructed. However, our attention will be focused on ontological models describing the admissible collections of handbooks satisfying some educational requirements. In other words, it is necessary to select, according to some educational criteria, a family of subsets of C, assuming that the subsets are not necessarily mutually disjoint. The first possibility is to construct a hyper-graph (Berge, 1973) whose nodes are assigned to the elements of C and in which any subset of nodes assigned to the handbooks satisfying the educational criteria constitutes a hyper-edge of the hyper-graph. Such hyper-graphs can be represented by a diagram, shown in Figure 11. In this diagram vertical lines represent nodes, while dots lying on horizontal lines represent hyper-edges. A serious shortcoming of hyper-
• A family of sets F = {A, B, D};
• A family of its subfamilies KF = {∅, {A}, {B}, {D}, {A, B}, {A, D}, {B, D}, {A, B, D}};
• Selected subfamilies of sets H6 = {A, D}, H7 = {B, D};
• Families of permutations of the subfamilies H6 and H7:
  U6 = {[A, D], [D, A]}, U7 = {[B, D], [D, B]};
• Cartesian products based on U6 and U7:
  C6,1 = A × D, C6,2 = D × A, C7,1 = B × D, C7,2 = D × B;
• Selected relations described on the above-given Cartesian products:
  r’ ⊆ C6,1, r” ⊆ C6,2, r’’’ ⊆ C7,2;
• h-relations:
  H1 = A ∪ D ∪ r’ ∪ r”, H2 = r’ ∪ r” ∪ r’’’, and so forth.

The syndromes of H1 are linearly ordered strings consisting of one or two elements, while all syndromes of H2 are strings consisting of two elements.

On the basis of any given family F of sets, a universe UF of all possible h-relations created on the basis of F can be considered. The elements of UF (i.e., h-relations), being defined as some sets, are subject to the set algebra rules, which in this case can be interpreted as an algebra of h-relations. This makes it possible to create more sophisticated h-relations as algebraic compositions of some simpler ones. Hyper-relations, as well as the algebra of hyper-relations, are thus a flexible tool for the creation of ontological models, more powerful than graphs or relations.

accuracy. “Suitable” means here the covering of the area of interest without making it too large, neither based on overly aggregated concepts nor going too deeply into the details. However, ontologies, being a form of presentation of our knowledge about the world, may also be based on ambiguous concepts and nondeterministic relations. Decision making based on uncertain information is one of the basic problems in artificial intelligence investigations (Bubnicki, 2002; Grzegorzewski, Hryniewicz, & Gil, 2002; Rutkowski, Tadeusiewicz, Zadeh, & Zurada, 2006). Only certain aspects of this problem, strongly connected with using nondeterministic ontological models in decision making, will be considered here.

Fuzzy Ontological Models

Let us go back to the taxonomic trees shown in Figure 2 and Figure 3. In the first case, the concepts Men and Women are strongly defined and the respective ontological model is no doubt deterministic. On the other hand, the concepts Young, Middle-aged and Old used in the second ontological model can be interpreted:

a. Deterministically, as:
   Young ≡ [aged not more than 18 years],
   Middle-aged ≡ [aged more than 18 and not more than 60 years],
   Old ≡ [aged more than 60 years]; or
b. Nondeterministically, say, using a fuzzy sets approach (Zadeh, 1975a, 1975b, 1975c) and the membership functions shown in Figure 13.
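The two interpretations of Young, Middle-aged and Old can be contrasted in a minimal sketch. The deterministic limits (18 and 60 years) are the ones given above. The fuzzy membership functions are an assumption: piecewise-linear ramps with hypothetical breakpoints at 15 to 25 and 55 to 65 years are used, since the actual shapes of Figure 13 are not reproduced here.

```python
def deterministic_category(age: float) -> str:
    """Crisp interpretation (a): limits at 18 and 60 years."""
    if age <= 18:
        return "Young"
    if age <= 60:
        return "Middle-aged"
    return "Old"


def ramp_down(x: float, a: float, b: float) -> float:
    """Equal to 1 up to a, falling linearly to 0 at b."""
    if x <= a:
        return 1.0
    if x >= b:
        return 0.0
    return (b - x) / (b - a)


def ramp_up(x: float, a: float, b: float) -> float:
    """Equal to 0 up to a, rising linearly to 1 at b."""
    return 1.0 - ramp_down(x, a, b)


def membership(age: float) -> dict[str, float]:
    """Fuzzy interpretation (b) with assumed breakpoints, not those of Figure 13."""
    young = ramp_down(age, 15.0, 25.0)
    old = ramp_up(age, 55.0, 65.0)
    # The two ramps never overlap, so the remainder is non-negative.
    middle = 1.0 - young - old
    return {"Young": young, "Middle-aged": middle, "Old": old}


print(deterministic_category(20))  # a 20-year-old is crisply "Middle-aged"
print(membership(20))              # but partly "Young" in the fuzzy model
```

With these assumed breakpoints a 20-year-old belongs to Young and to Middle-aged with equal degrees, whereas the deterministic model assigns the same person to Middle-aged only.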
[Figure 13 residue: horizontal axis t [years], ticks 0, 20, 40, 60, 80]

between the “Middle-aged” and “Old” concepts possibly due to a defuzzification, that is, to an operation consisting in fixing strong limits between the concepts. However, a difference between the deterministic and fuzzy ontology becomes evident if a practical decision based on fuzzy ontology is to be made. For example, if the building of a network of sport fields for young people in the town is considered, then the fuzzy concept of “young” is better suited to characterizing the expected users of the fields than the deterministic one. The problem is that even if a border between Young and Middle-aged were fixed at the age of 18 years, not all people younger than 18 years would like to attend the sport fields and, on the other hand, some people older than 18 years would like to attend them. This example shows that a fuzzy or nondeterministic ontological model is not necessarily worse than a deterministic one if it better describes the state of our knowledge about the area of interest and more exact knowledge is not available.

One should distinguish between decision making based on incomplete and on fuzzy ontology. Let us consider a taxonomic subtree of liver diseases (Coté, 1975).

[Figure residue: root node of the taxonomic subtree, Jaundice (D 63440)]

Let it be known that a) a certain drug D is effective and recommended in icterus hepatogenes and rather ineffective in the case of icterus hepatocellularis therapy, and b) in a given population p% of patients diagnosed as affected with liver jaundice are in fact suffering from icterus hepatogenes and (100-p)% are suffering from icterus hepatocellularis. Then, if a patient has been roughly diagnosed as affected with liver jaundice without indication of the type of jaundice and he is recommended to take the drug D, the decision is based on an incomplete model, canceled to its higher level ontological model. The expected effectiveness of the therapy in this case is at most p%. On the other hand, if diagnostic methods used to discriminate between the hepatogenes and hepatocellularis icterus work with q% specificity (i.e., the percentage of patients diagnosed as affected by a given disease really suffering from it), the given patient has been diagnosed as affected by icterus hepatogenes and, consequently, he has been recommended to take D, then the expected effectiveness of this therapy will be at most q%. The ontological model on which this decision is based is complete; however, if it is interpreted
such that its syndromes projected on Q1 × Q2 × Q3 × Q4 satisfy r’ and projected on Q1 × Q’2 × Q’3 × M satisfy r” will be constructed. The relation r is fuzzy due to the credibility component µ ∈ M of its syndromes. However, it is not quite suitable to the requirements of making decisions about recommendation of a drug for the given patient. This is because it is not quite sure: 1) whether the real patient’s disease is identical to the result of diagnosis (the element of Q’2) and, as a consequence, whether there is a full consistency in the syndromes between the elements of Q2 and Q’2, and 2) for similar reasons, whether there is a full consistency between the elements of Q3 and Q’3. Therefore, the relation r should be extended by adjoining to it two components: a) a measure ν of logical consistency between the syndrome’s components of Q2 and Q’2, and b) a measure ρ of logical consistency between the syndrome’s components of Q3 and Q’3. The way of defining the logical consistency is not essential here. We would only like to show that the extended relation r* contains three parameters, µ, ν and ρ, causing its fuzziness. Finally, it becomes necessary to establish a method of relative assessment of the syndromes according to the values of the weight vectors w = [µ, ν, ρ]. For this purpose, an additional ontological model can be created: a linear 3-dimensional semi-ordered real vector space K^(3). One possibility of doing this exists in defining K^(3) as a Kantorovich space (Kantorovich, Vulich, & Pinsker, 1950). As a component of ontology, K^(3) represents the preferences established by the decision-makers (medical doctors, in the above-described example) for choosing the best decision(s) from those indicated by the fuzzy relational ontological model r*. From a formal point of view, the principle of semi-ordering of vectors in a K-space consists in defining in a given linear vector space a non-negative cone K+, its mirror-reflection being denoted by K−, as illustrated in Figure 15.

Figure 15. Illustration of a 3-dimensional Kantorovich space
[Figure 15 labels: axes x1, x2, x3; cones K+ and K−]

If v^(1) and v^(2) are two vectors belonging to the K-space and their difference satisfies the condition:

v^(1) − v^(2) ∈ K+   (10)

then it is said that v^(1) is preceded by v^(2) (v^(1) is preferred with respect to v^(2)), which can be briefly denoted as v^(1) ≻ v^(2). If neither v^(1) ≻ v^(2) nor v^(1) ≺ v^(2), then it is said that v^(1) and v^(2) are mutually incomparable. In the last case, some additional criteria should be used in order to select the best decision from a subset of mutually incomparable ones.

CONCLUSION

It was shown in the chapter that ontologies used as a support in computer-aided decision making usually consist of several ontological models being a form of presentation of knowledge about a given area of interest. Ontological models can be constructed on the basis of various formal models: taxonomic trees, graphs, multigraphs, relations, hyper-relations, and so forth. However, deterministic models do not always adequately describe the state of our knowledge about the area of interest. That is why in certain cases canceled or otherwise incomplete ontological models as well as nondeterministic models should be used. The nondeterministic ontological models, in the
simplest case, may be presented as fuzzy models, that is, models based on fuzzy concepts in the Zadeh sense. In a more general case, nondeterministic models can be presented in the form of nondeterministic relations, that is, relations whose syndromes have been semi-ordered. In particular, the concept of a semi-ordered linear vector space can be applied to the construction of nondeterministic ontological models.

REFERENCES

Berge, C. (1973). Graphs and hypergraphs. Amsterdam: North-Holland.

Bubnicki, Z. (2002). Uncertain logics, variables and systems. Springer-Verlag. LNICS No. 276.

Chute, C.G. (2005). Medical concept representation. In H. Chen et al. (Eds.), Medical informatics: Knowledge management and data mining in biomedicine (pp. 163-182). Springer-Verlag.

Coté, R.A. (Ed.). (1975). SNOMED: Systematized nomenclature of medicine. Diseases. ACP.

Grzegorzewski, P., Hryniewicz, O., & Gil, M. A. (Eds.). (2002). Soft methods in probability, statistics and data analysis. Heidelberg, Germany: Physica Verlag.

Kantorovich, L.B., Vulich, B.Z., & Pinsker, A.G. (1950). Functional analysis in semi-ordered spaces (in Russian). Moscow: GITTL.

Kulikowski, J.L. (1972). An algebraic approach to the recognition of patterns. CISM Lecture Notes No. 85. Wien: Springer-Verlag.

Kulikowski, J.L. (1986). Outline of the theory of graphs (in Polish). Warsaw, Poland: PWN.

Kulikowski, J.L. (1992). Relational approach to structural analysis of images. Machine Graphics and Vision, 1(1/2), 299-309.

Kulikowski, J.L. (2006). Description of irregular composite objects by hyper-relations. In K. Wojciechowski et al. (Eds.), Computer vision and graphics (pp. 141-146). Springer-Verlag.

Pisanelli, D.M. (Ed.). (2004). Ontologies in medicine. Amsterdam: IOS Press.

Rasiowa, H., & Sikorski, R. (1968). The mathematics of metamathematics. Warsaw: PWN.

Rutkowski, L., Tadeusiewicz, R., Zadeh, L.A., & Zurada, J. (Eds.). (2006). Artificial intelligence and soft computing–ICAISC 2006. Berlin: Springer-Verlag.

Tutte, W.T. (1984). Graph theory. Menlo Park, CA: Addison-Wesley.

Zadeh, L.A. (1975a). The concept of a linguistic variable and its application to approximate reasoning. Part I. Information Science, 8, 199-249.

Zadeh, L.A. (1975b). The concept of a linguistic variable and its application to approximate reasoning. Part II. Information Science, 8, 301-357.

Zadeh, L.A. (1975c). The concept of a linguistic variable and its application to approximate reasoning. Part III. Information Science, 9, 43-80.
Chapter II
Using Logic Programming
and XML Technologies for Data
Extraction from Web Pages
Amelia Bădică
University of Craiova, Romania
Costin Bădică
University of Craiova, Romania
Elvira Popescu
University of Craiova, Romania
ABSTRACT
The Web is designed as a major information provider for the human consumer. However, information
published on the Web is difficult to understand and reuse by a machine. In this chapter, we show how
well-established intelligent techniques based on logic programming and inductive learning combined
with more recent XML technologies might help to improve the efficiency of the task of data extraction
from Web pages. Our work can be seen as a necessary step of the more general problem of Web data
management and integration.
Copyright © 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
searching and filtering, and also more complex tasks like analysis, decision making, reasoning and integration.

For example, in the e-tourism domain one can note an increasing number of travel agencies offering online services through online transaction brokers (Laudon & Traver, 2004). They provide useful information to human users about hotels, flights, trains or restaurants, in order to help them plan their business or holiday trips. Travel information, like most of the information published on the Web, is heterogeneous and distributed, and there is a need to gather, search, integrate and filter it efficiently (Staab et al., 2002) and ultimately to enable its reuse for multiple purposes. For example, personal assistant agents can integrate travel and weather information to assist and advise humans in planning their weekends and holidays. Another interesting use of data harvested from the Web that has been recently proposed (Gottlob, 2005) is to feed business intelligence tasks, in areas like competitive analysis and intelligence.

Two emergent technologies that have been put forward to enable automated processing of information published on the Web are semantic markup (W3C Semantic Web Activity, 2007) and Web services (Web Services Activity, 2007). However, most of the current practices in Web publishing are still based on the combination of traditional HTML, the lingua franca for Web publishing (W3C HTML, 2007), with server-side dynamic content generation from databases. Moreover, many Web pages use HTML elements that were originally intended for structuring content (e.g., the elements related to tables) for layout and presentation effects instead, even though this practice is discouraged in theory. Therefore, techniques developed in areas like information extraction, machine learning and wrapper induction are still expected to play a significant role in tackling the problem of Web data extraction.

Data extraction is related to the more general problem of information extraction that is traditionally associated with artificial intelligence and natural language processing. Information extraction was originally concerned with locating specific pieces of information in text documents written in natural language (Lenhert & Sundheim, 1991) and then using them to populate a database or structured document. The field then expanded to cover extraction tasks from Web documents represented in HTML and attracted other communities including databases, electronic documents, digital libraries and Web technologies. Usually, the content of these data sources can be characterized as neither natural language nor structured, and therefore the term semi-structured data is used. For these cases, we consider that the term data extraction is more appropriate than information extraction and, consequently, we shall use it in the rest of this chapter.

A wrapper is a program that is used for performing the data extraction task. On the one hand, manual creation of Web wrappers is a tedious, error-prone and difficult task because of Web heterogeneity in both structure and content. On the other hand, construction of Web wrappers is a necessary step to allow more complex tasks like decision making and integration. Therefore, a lot of techniques for (semi-)automatic wrapper construction have been proposed. One application area that can be described as a success story for machine learning technologies is wrapper induction for Web data extraction. For a recent overview of state-of-the-art approaches in the field see Chang, Kayed, Girgis, and Shaalan (2006).

In this chapter, we propose a novel class of wrappers, L-wrappers (i.e., logic wrappers), that fruitfully combine the logic programming paradigm with efficient XML processing technologies (W3C Extensible Markup Language (XML), 2007). Our wrappers have certain advantages over existing proposals: i) they have a declarative semantics, and therefore their specification is decoupled from their implementation; ii) they can be generated
using techniques and algorithms inspired by inductive logic programming (ILP hereafter); iii) they are implemented using XSLT, the “native” language for processing XML documents (W3C Extensible Stylesheet Language Family (XSL), 2007); and iv) they also have a visual notation, making them easier to read and understand than their equivalent XSLT coding.

The chapter is structured as follows. We start with a brief review of logic programming, XML technologies and related approaches to Web data extraction. Then, we discuss flat relational and hierarchical approaches to Web pages conceptualization for data extraction. We follow with a concise definition of L-wrappers covering both their textual and visual representations. Both flat and hierarchical cases are considered. Next, we discuss efficient algorithms for semi-automatic construction of L-wrappers. Then, we present an approach for implementing L-wrappers using the XSLT transformation language. The last two sections of this chapter contain some pointers to future work, as well as a list of concluding remarks.

BACKGROUND

The goal of this section is to briefly review the main ingredients of our approach to Web data extraction, that is, XML technologies and logic programming. Finally, as the application of logic programming and XML to information extraction is not entirely new, we briefly provide a literature overview of related proposals.

XML Technologies for Data Extraction

The Web is now a huge information repository that is characterized by i) high diversity, that is, the Web information covers almost any application area, ii) disparity, that is, the Web information comes in many formats ranging from plain and structured text to multimedia documents and iii) rapid growth, that is, old information is continuously being updated in form and content and new information is constantly being produced.

The HTML markup language is the lingua franca for publishing information on the Web, so our core data sources are in fact HTML documents. HTML was initially devised for modeling the structure and content of Web documents, rather than their presentation layout. However, with the advent of graphic Web browsers, software providers like Microsoft or Netscape added many features to HTML that were mainly addressing the visual representation and interactivity of Web documents, rather than their structure and content. The effect of this process was that initially HTML was developed (and consequently used) in a rather unsystematic way. However, starting with HTML 4.01, the W3C consortium enforced a rigorous standardization process of HTML that ultimately resulted in a complete redefinition of HTML as an XML application, known as XHTML.

In our work we make the assumption that Web documents already are, or can be converted through a preprocessing stage to, well-formed XML before being actually processed for extraction of interesting data. While data extraction from HTML can clearly benefit from existing approaches for information extraction from unstructured texts, we state that preprocessing and conversion of HTML to a structured (i.e., tree-like or well-formed XML) form has certain obvious advantages: i) an extracted item can depend on its structural context in a document, while this information is lost in the event the tree document is flattened as a string; ii) data extraction from XML documents can benefit from the plethora of XML query and transformation languages and tools.

A Web document is composed of a structural part and a content part. The structural part consists of the set of document nodes or elements. The document elements are nested into a tree-like structure. The content part of a document
consists of the actual text in the text elements and the attribute-value pairs attached to the other document elements.

We model semistructured Web documents as labeled ordered trees. The node labels of a labeled ordered tree correspond to HTML tags. In particular, a text element will be considered to have a special tag text. Let Σ be the set of all node labels of a labeled ordered tree. For our purposes, it is convenient to abstract labeled ordered trees as sets of nodes on which certain relations and functions are defined. Note that in this chapter we are using some basic graph terminology as introduced in Cormen, Leiserson, and Rivest (1990).

Figure 1 shows a labeled ordered tree with 25 nodes and tags in the set Σ = {a, b, c}.

Intuitively, a wrapper takes a labeled ordered tree and returns a subset of extracted nodes. An extracted node can be viewed as representing the whole subtree rooted at that node. The structural context of an extracted node is a complex condition that specifies i) the tree delimiters of the extracted information, according to the parent-child and next-sibling relationships (e.g., is there a parent node?, is there a left sibling?) and ii) certain conditions on node labels and their position (e.g., is the tag label td?, is it the first child?). These conditions are nicely captured as conjunctive queries represented using logic programming (see next section).

Logic Programming for Representation and Querying of Web Documents

The rapid growth of the Web gave a boost to research on techniques to cope with the information flood. At the core of the various applications that include tasks like data retrieval, data extraction, and text categorization there are suitable representations of Web documents to allow their efficient structured querying and processing. In this subsection we show how logic programming can be used to achieve this desideratum.

Logic programming (Sterling & Shapiro, 1994) was originally developed within the artificial intelligence community to help with the implementation of natural language processing tools. However, its attractive features including declarative semantics, compact syntax, built-in reasoning capabilities, and so forth, together with efficient compilation techniques, made logic programming a suitable paradigm for the development of high-level general-purpose programming languages; see, for example, the Prolog language. Moreover, during the last decade applications of logic programming spread also to the areas of the Web and the Semantic Web (Alferes, Damasio, & Pereira, 2003).

A logic program is a set of logic statements that are classified as facts, rules and queries. Facts and rules are used to describe the problem
Definition 1 (Labeled ordered tree) A labeled ordered tree is a tuple t = 〈T, E, r, l, c, n〉 such
that:
i. (T, E, r) is a rooted tree with root r ∈ T. Here, T is the set of tree nodes and E is the set
of tree edges.
ii. l : T → Σ is a node labeling function.
iii. c ⊆ T × T is the parent-child relation between tree nodes, that is, c = {(v, u) | node u is
the parent of node v}.
iv. n ⊆ T × T is the next-sibling linear ordering relation defined on the set of children of a
node. For each node v ∈ T, its k children are ordered from left to right, that is, (vi, vi+1)
∈ n for all 1 ≤ i < k.
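Definition 1 can be made concrete with a small sketch. The Python fragment below is an illustration only, not part of the chapter's implementation: it flattens a nested tree into the labeling function and the two relations of the definition. The function name tree_to_relations, the nested (label, children) input format, and the sample one-row table are assumptions made for the example. Pairs are stored as (parent, child) and (left, right), which is the argument order in which child and next are used by the Prolog queries of this chapter.

```python
from itertools import count


def tree_to_relations(tree):
    """Flatten a nested (label, children) tree into tag/child/next facts.

    Node identifiers are integers assigned in document (preorder) order,
    starting with 0 for the root.
    """
    ids = count()
    tag, child, nxt = {}, set(), set()

    def visit(node):
        label, children = node
        node_id = next(ids)
        tag[node_id] = label                  # l : T -> Sigma
        previous = None
        for sub in children:
            sub_id = visit(sub)
            child.add((node_id, sub_id))      # (parent, child)
            if previous is not None:
                nxt.add((previous, sub_id))   # (left sibling, right sibling)
            previous = sub_id
        return node_id

    root = visit(tree)
    return root, tag, child, nxt


# Hypothetical document: <table><tr><td>text</td><td>text</td></tr></table>
doc = ("table", [("tr", [("td", [("text", [])]),
                         ("td", [("text", [])])])])
root, tag, child, nxt = tree_to_relations(doc)
print(sorted(child))  # [(0, 1), (1, 2), (1, 4), (2, 3), (4, 5)]
print(sorted(nxt))    # [(2, 4)]
```

Storing the tree as plain sets of pairs keeps it close to a set of logic facts, so that a condition such as "is there a left sibling?" becomes a simple membership test.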
[Figure 1: a labeled ordered tree with 25 nodes (n0 to n24) and tags in the set Σ = {a, b, c}; the drawing is not recoverable from the extracted text]
domain, while queries are used to pose specific problem instances and to retrieve the corresponding solutions as query answers. Intuitively, the computation associated to a logic program can be described as the reasoning process for determining suitable bindings for the query variables such that the resulting instance of the query is entailed by the facts and rules that comprise the logic program.

Consider, for example, the logic programming representation of a Web document tree. We assign a unique identifier (an integer value) to each node of the tree. Let N be the set of all node identifiers.

The structural component of a Web document can be represented as a set of facts that use the following relations (Box 1).

In order to represent the content component of a Web document, we introduce two sets: the set S of content elements that denote strings attached to text nodes and values assigned to HTML attributes, and the set A of HTML attributes. With these notations, the content part of a Web document tree can be represented using two relations (Box 2).

Consider the Hewlett Packard’s Web site of electronic products and the task of data extraction from a product information sheet for Hewlett Packard printers. The printer information is displayed
Box 1.
Box 2.
in a two-column table as a set of feature-value pairs (see Figure 2a). Our task is to extract the names or the values of the printer features. This information is stored in the leaf elements of the page. Figure 2b displays the tree representation of a fragment of this document and Figure 2c displays the logic programming representation of this fragment as a set of facts.

Considering the example from Figure 2 and assuming that we want to extract all the text nodes of this Web document that have a grand-grand-parent of type table that has a parent that has a right sibling, we can use the following query. Note that for expressing logic programs we are using the standard Prolog notation (Sterling & Shapiro, 1994):

    ?- tag(A,text),child(B,A),child(C,B),
       child(D,C),tag(D,table),child(E,D),next(E,F).

The query can be more conveniently packed as a rule as follows:

    extract(A) :-
        tag(A,text),child(B,A),child(C,B),
        child(D,C),tag(D,table),child(E,D),next(E,F).

The rule representation has at least three obvious advantages: i) modularity: the knowledge embodied in the query is encapsulated inside the body of the predicate extract; ii) reusability: the query can be more easily reused rather than having to fully copy the conjunction of conditions; and iii) information hiding: the variables occurring in the right-hand side of the rule are hidden from the user, that is, running the initial version of the query would produce a tuple of variable bindings as solution (A, B, C, D, E, and F), while running the rule version would produce the single variable binding A as solution.

Related Works

With the rapid expansion of the Internet and the Web, the field of information extraction from HTML attracted a lot of researchers during the last decade. Clearly, it is impossible to mention all of their work here. However, at least we can try to classify these works along several axes and select some representatives for discussion.

First, we have focused our research on information extraction from HTML using logic representations of tree (rather than string) wrappers that are generated automatically using techniques inspired by ILP. Second, both theoretical and experimental works are considered.

Freitag (1998) is one of the first papers describing a “relational learning program” called SRV. It uses an ILP algorithm for learning first order information extraction rules from a text document represented as a sequence of lexical tokens. Rule bodies check various token features like length, position in the text fragment, if they are numeric or capitalized, and so forth. SRV has been adapted to learn information extraction rules from HTML. For this purpose, new token features have been added to check the HTML context in which a token occurs. The most important similarity between SRV and our approach is the use of relational learning and an ILP algorithm. The difference is that our approach has been explicitly devised to cope with tree structured documents, rather than string documents.
Chidlovskii (2003) describes a generalization of the notion of string delimiters developed for information extraction from string documents (Kushmerick, 2000) to subtree delimiters for information extraction from tree documents. The paper describes a special purpose learner that constructs a structure called candidate index based on tree data structures, which is very different from our approach. Note however, that the tree leaf delimiters described in that paper are very similar to our information extraction rules. Moreover, the representation of reverse paths using the symbols ↑, ← and → can be easily simulated by our rules using the relations child and next in our approach.

Xiao, Wissmann, Brown, and Jablonski (2001) propose a technique for generating XSLT-patterns from positive examples via a GUI tool and using an ILP-like algorithm. The result is an NE-agent (i.e., name extraction agent) that is capable of extracting individual items. A TE-agent (i.e., term extraction agent) then uses the items extracted by NE-agents and global constraints to fill in template slots (tuple elements according to our terminology). The differences in our work are that XSLT wrappers are learned indirectly via L-wrappers, and our wrappers are capable of extracting tuples in a straightforward way, and therefore TE-agents are not needed. Additionally, our approach covers the hierarchical case, which is not addressed in Xiao et al. (2001).

Lixto (Baumgartner, Flesca, & Gottlob, 2001) is a visual wrapper generator that uses an internal logic programming-based extraction language called Elog. In Elog, a document is abstracted as a tree (similar to our work), rather than a string. Elog is very versatile by allowing the refinement of the extracted information with the help of regular expressions and the integration between wrapping and crawling via links in Web pages. The differences between Elog and L-wrappers are at least twofold: i) L-wrappers are only devised for the extraction task and they use a classic logic programming approach, for example, an L-wrapper can be executed without any modification by a standard Prolog engine. Elog was devised for both crawling and extraction and has a customized logic programming-like semantics, that is more difficult to understand; and ii) L-wrappers are efficiently implemented by translation to XSLT, a standard language for transforming XML documents, while for Elog the implementation approach is different (a custom interpreter has been devised from scratch).

Thomas (2000) introduces a special wrapper language for Web pages called token-templates. Token-templates are constructed from tokens and token-patterns. A Web document is represented as a list of tokens. A token is a feature structure with exactly one type feature. Feature values may be either constants or variables. Token-patterns use operators from the language of regular expressions. The operators are applied to tokens to extract relevant information. The only similarity between our approach and this approach is the use of logic programming to represent wrappers.

Laender, Ribeiro-Neto, and Silva (2002) describe the DEByE (i.e., Data Extraction By Example) environment for Web data management. DEByE contains a tool that is capable of extracting information from Web pages based on a set of examples provided by the user via a GUI. The novelty of DEByE is the possibility to structure the extracted data based on the user perception of the structure present in the Web pages. This structure is described at the example collection stage by means of a GUI metaphor called nested tables. DEByE also addresses other issues needed in Web data management, like automatic examples generation and wrapper management. Our L-wrappers are also capable of handling hierarchical information. However, in our approach, the hierarchical structure of information is lost by flattening during extraction (see the printer example where tuples representing features of the same class share the feature class attribute).

Sakamaoto (2002) introduces tree wrappers for tuples extraction. A tree wrapper is a sequence
of tree extraction paths. There is an extraction path for each extracted attribute. A tree extraction path is a sequence of triples that contain a tag, a position and a set of tag attributes. A triple matches a node based on the node tag, its position among its siblings with a similar tag and its attributes. Extracted items are assembled into tuples by analyzing their relative document order. The algorithm for learning a tree extraction path is based on the composition operation of two tree extraction paths. Note also that L-wrappers use a different and richer representation of node proximity and therefore, we have reason to believe that they could be more accurate (this claim needs, of course, further support with experimental evidence). Finally, note that L-wrappers are fully declarative, while tree wrappers combine declarative extraction paths with a procedural algorithm for grouping extracted nodes into tuples.

A new wrapper induction algorithm inspired by ILP is introduced in Anton (2005). The algorithm exploits traversal graphs of document trees that are mapped to XPath expressions for data extraction. However, that paper does not define a declarative semantics of the resulting wrappers. Moreover, the wrappers discussed in Anton (2005) aim to extract only single items, and there is no discussion of how to extend the work to tuples extraction.

Stalker (Muslea, Minton, & Knoblock, 2001) uses a hierarchical schema of the extracted data called embedded catalog formalism that is similar to our approach. However, the main difference is that Stalker abstracts the document as a string rather than a tree and therefore their approach is not able to benefit from existing XML processing technologies. Extraction rules of Stalker are based on a special type of finite automata called landmark automata, rather than logic programming, as our L-wrappers.

Concerning theoretical work, Gottlob and Koch (2004) is one of the first papers that analyzes seriously the expressivity required by tree languages for Web information extraction and its practical implications. Combined complexity and expressivity results of conjunctive queries over trees, that also apply to information extraction, are reported in Gottlob, Koch, and Schultz (2004).

CONCEPTUALIZING WEB PAGES FOR DATA EXTRACTION

Many Web pages are dynamically generated by filling in HTML templates with data obtained from relational databases. We have noticed that most often such Web documents can be successfully abstracted as providing relational data as sets of tuples or records. Examples include search engines’ answer pages, product catalogues, news sites, product information sheets, travel resources, multimedia repositories, Web directories, and so forth.

Sometimes, however, Web pages contain hierarchically structured presentations of data for usability and readability reasons. Moreover, it is generally appreciated that hierarchies are very helpful for focusing human attention and management of complexity. Therefore, as most Web pages are developed by knowledgeable specialists in human-computer interaction design, we expect to find this approach in many designs of Web interfaces to data-intensive applications.

Flat Relational Conceptualization

We adopt a standard relational model by associating to a Web data source a set of distinct attributes. Let A be the set of all attribute names and let D ⊂ A be the set of relational attributes associated to a given Web data source. An extracted tuple can be defined as a function tuple : D → N, such that for each attribute a ∈ D, tuple(a) represents the document node extracted from the Web source as an instance of attribute a. Note that in practice, instead of an extracted node, a user is rather interested in getting the HTML content of the node.
Let us consider, for example, the problem of extracting printer information from Hewlett Packard’s Web site. The printer information is represented in multisection, two-column HTML tables (as shown in Figure 2a). Each row contains a pair consisting of a feature name and a feature value. Consecutive rows represent related features that are grouped into feature classes. For example, there is a row with the feature name ‘Processor speed’ and the feature value ‘300 Mhz’. This row has the feature class ‘Speed/monthly volume’. So, actually, this table contains triples consisting of a feature class, a feature name, and a feature value. The set of relational attributes is D = {feature-class, feature-name, feature-value}. The document fragment shown in Figure 2a contains three tuples:

(feature-class: ‘Speed/monthly volume’, feature-name: ‘Print speed, black (pages per minute)’, feature-value: ‘Up to 50 ppm’)

(feature-class: ‘Speed/monthly volume’, feature-name: ‘First page out, black’, feature-value: ‘8 secs’)

(feature-class: ‘Speed/monthly volume’, feature-name: ‘Processor speed’, feature-value: ‘300 Mhz’)

Note that in this example some tuples may have identical feature classes. More generally, for some documents, distinct tuples might have identical attribute instances. Clearly, this happens when the document has a hierarchical structure. For such cases, a hierarchical conceptualization of the Web data source is more appropriate (see the next section).

Let us now show how logic programming can be employed to conveniently define wrappers for data extraction from Web pages that have been conceptualized as flat relational data sources. Anticipating (see next section on logic wrappers), we shall call such programs logic wrappers or L-wrappers.

An L-wrapper for extracting relational data operates on a target Web document represented as a labeled ordered tree and returns a set of relational tuples of nodes of this document.

An L-wrapper for the printers example shown in Figure 2b is given in Box 3 (FN = feature name, FV = feature value).
Box 3.
extract(FN,FV) :-
    tag(FN,text),tag(FV,text),child(C,FN),child(D,FV),child(E,C),child(H,G),child(I,F),
    child(J,I),next(J,K),child(F,E),child(G,D),first(J),child(K,L),child(L,H).
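To see how a rule of this shape selects nodes, the following Python sketch evaluates a conjunction of conditions against a set of facts by naive backtracking. It is an illustration under stated assumptions, not the Prolog engine or the XSLT translation used in the chapter: the helper solve, the 13-node tree, and its node numbers are all hypothetical, and the second condition of Box 3 is read as tag(FV,text). Only the tag facts of the two text nodes are listed, since the rule tests no other tags.

```python
def solve(goals, facts, binding=None):
    """Yield every variable binding that satisfies all goals.

    goals: list of (predicate, args); an argument that is a string starting
           with an uppercase letter is a variable, anything else a constant.
    facts: dict mapping a predicate name to a set of argument tuples.
    """
    binding = binding or {}
    if not goals:
        yield dict(binding)
        return
    (pred, args), rest = goals[0], goals[1:]
    for fact in facts.get(pred, ()):
        if len(fact) != len(args):
            continue
        new = dict(binding)
        for arg, value in zip(args, fact):
            is_var = isinstance(arg, str) and arg[:1].isupper()
            if is_var:
                if new.setdefault(arg, value) != value:
                    break          # variable already bound to another node
            elif arg != value:
                break              # constant does not match the fact
        else:
            yield from solve(rest, facts, new)


# Hypothetical tree: root 0 has children 1 and 2; below each of them hangs
# a chain of single children that ends in a text node (7 and 12).
left = [1, 3, 4, 5, 6, 7]        # J, I, F, E, C, FN
right = [2, 8, 9, 10, 11, 12]    # K, L, H, G, D, FV
child = {(0, 1), (0, 2)} | set(zip(left, left[1:])) | set(zip(right, right[1:]))
facts = {
    "tag": {(7, "text"), (12, "text")},
    "child": child,                # (parent, child)
    "next": {(1, 2)},              # (left sibling, right sibling)
    "first": {(1,)},
}

# Body of the rule in Box 3, goal by goal.
body = [
    ("tag", ("FN", "text")), ("tag", ("FV", "text")),
    ("child", ("C", "FN")), ("child", ("D", "FV")), ("child", ("E", "C")),
    ("child", ("H", "G")), ("child", ("I", "F")), ("child", ("J", "I")),
    ("next", ("J", "K")), ("child", ("F", "E")), ("child", ("G", "D")),
    ("first", ("J",)), ("child", ("K", "L")), ("child", ("L", "H")),
]
extracted = {(b["FN"], b["FV"]) for b in solve(body, facts)}
print(extracted)  # {(7, 12)}
```

Only the bindings of FN and FV are kept in the result, which corresponds to the information hiding provided by the head of the rule; the auxiliary variables C to L are discarded.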
This rule extracts all the pairs of text nodes such that the grand-grand-grand-grandparent of the first node (J) is the first child of its parent node and also the left sibling of the grand-grand-grand-grandparent of the second node (K).

Hierarchical Conceptualization

In this subsection, we propose an approach for utilizing L-wrappers to extract hierarchical data. The advantage would be that extracted data will be suitably annotated to preserve its hierarchical structure, as found in the Web page. Further processing of this data would benefit from this additional metadata to allow for more complex tasks, rather than simple searching and filtering by populating a relational database. For example, one can imagine the application of this technique to the task of ontology extraction, as ontologies are assumed to be natively equipped with the facility of capturing taxonomically structured knowledge.

Let us consider a very simple HTML document that contains hierarchical data about fruits (see Figure 3a). A fruit has a name and a sequence of features. Additionally, a feature has a name and a value. This is captured by the schema shown in Figure 3b. Note that this representation allows features to be fruit-dependent; for example, while an apple has an average diameter, a lemon has both an average width and an average height.

[Figure 3b: schema tree. fruits contains a sequence (*) of fruit; a fruit has a fruit-name and features; features contains a sequence (*) of feature; a feature has a feature-name and a feature-value]

Abstracting the hierarchical structure of data, we can assume that the document shown in Figure 3a contains triples consisting of a fruit name, a feature-name and a feature-value. However, this approach has at least two drawbacks: i) redundancy, because distinct tuples might contain identical attribute instances, and ii) the intrinsic hierarchical structure of the data is lost, while it might convey useful information.

Following the hierarchical structure of this data, the design of an L-wrapper of arity 3 for this example can be done in two stages: i) derive a wrapper W1 for binary tuples (fruit-name, list-of-features); and ii) derive a wrapper W2 for binary tuples (feature-name, feature-value). Note that wrapper W1 is assumed to work on documents containing a list of tuples of the first type (i.e., the original target document), while wrapper W2 is assumed to work on document fragments containing the list of features of a given fruit (i.e., a single table from the original target document). For example, wrappers W1 and W2 can be defined as logic programs as shown in Box 4.

Note that for the combination of W1 and W2 into a single L-wrapper of arity 3, we need to extend the definition of an L-wrapper by adding a new argument to relation extract for representing the root node of the document fragment to which the wrapper is applied, that is, instead of extract(N1, …, Nk) we shall now have extract(R, N1, …, Nk), where R is the new argument. Moreover, it is required that for all 1 ≤ i ≤ k, Ni is a descendant of R in the document tree. The resulting solution is shown in Box 5.

The final wrapper (assuming the index of the document root node is 0) is shown in Box 6.
Box 4.
extr_fruits(FrN,FrFs) :-
    tag(FrN,text),child(A,FrN),child(B,A),next(B,FrFs),child(C,FrFs),tag(C,p).
extr_features(FN,FV) :-
    tag(FN,text),tag(FV,text),child(A,FN),child(B,FV),next(A,B),
    child(C,B),tag(C,tr).
Box 5.
% ancestor(Ancestor,Node).
ancestor(N,N).
ancestor(A,N) :-
    child(A,B),ancestor(B,N).
extr_fruits(R,FrN,FrFs) :-
    ancestor(R,FrN),ancestor(R,FrFs),extr_fruits(FrN,FrFs).
extr_features(R,FN,FV) :-
    ancestor(R,FN),ancestor(R,FV),extr_features(FN,FV).
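The effect of the extra root argument can be sketched in Python as follows. This is a hedged illustration rather than the chapter's implementation: the two-table document, its node numbers, and the optional root parameter are assumptions, and the body of extr_features is translated by hand from Box 4. The function ancestor mirrors the reflexive definition of ancestor/2 above.

```python
# Hypothetical document: a body (0) with two one-row tables (1 and 7);
# each row holds a feature-name cell and a feature-value cell.
tag = {0: "body",
       1: "table", 2: "tr", 3: "td", 4: "text", 5: "td", 6: "text",
       7: "table", 8: "tr", 9: "td", 10: "text", 11: "td", 12: "text"}
child = {(0, 1), (1, 2), (2, 3), (3, 4), (2, 5), (5, 6),
         (0, 7), (7, 8), (8, 9), (9, 10), (8, 11), (11, 12)}  # (parent, child)
nxt = {(1, 7), (3, 5), (9, 11)}                               # (left, right)
parent = {c: p for (p, c) in child}


def ancestor(a, n):
    """Reflexive ancestor test: a is n itself or lies above n in the tree."""
    return a == n or any(ancestor(b, n) for (p, b) in child if p == a)


def extr_features(root=None):
    """Pairs (FN, FV) matching extr_features; restricted to root if given."""
    pairs = set()
    for (a, fn) in child:
        for (b, fv) in child:
            matches = (tag[fn] == "text" and tag[fv] == "text"
                       and (a, b) in nxt
                       and tag.get(parent.get(b)) == "tr")
            inside = root is None or (ancestor(root, fn) and ancestor(root, fv))
            if matches and inside:
                pairs.add((fn, fv))
    return pairs


print(sorted(extr_features()))   # [(4, 6), (10, 12)]
print(sorted(extr_features(1)))  # [(4, 6)]
print(sorted(extr_features(7)))  # [(10, 12)]
```

Without a root the wrapper returns the feature pairs of every table, whereas passing the node of one table returns only the pairs below it. This is what allows a wrapper for features to be applied separately to the fragment of each fruit.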
Box 6.
extract(FrN,FN,FV) :-
    extr_fruits(0,FrN,FrFs),extr_features(FrFs,FN,FV).
% Program
extract_all(Res) :-
    extr_fruits_all(0,Res).
extr_fruits_all(Doc,fruits(Res)) :-
    findall(
        fruit(name(FrN),FrFs),
        (extr_fruits(Doc,NFrN,NFrFs),
         content(NFrN,FrN),
         extr_features_all(NFrFs,FrFs)),
        Res).
extr_features_all(Doc,features(Res)) :-
    findall(
        feature(name(FN),value(FV)),
        (extr_features(Doc,NFN,NFV),
         content(NFN,FN),content(NFV,FV)),
        Res).

% Query and result
?- extract_all(Res).
Res = fruits(
    [fruit(name(`Red apple`),
        features(
            [feature(name(`weight`),value(`120`)),
             feature(name(`color`),value(`red`)),
             feature(name(`diameter`),value(`8`))
            ])),
     fruit(name(`Lemon`),
        features(
            [feature(name(`weight`),value(`70`)),
             feature(name(`color`),value(`yellow`)),
             feature(name(`height`),value(`7`)),
             feature(name(`width`),value(`4`))
            ]))])
While simple, this solution has the drawback that, even if it was devised with the idea of hierarchy in mind, it is easy to observe that the hierarchical nature of the extracted data is lost.

Assuming a Prolog execution engine of L-wrappers, we can solve the drawback using the findall predicate. findall(X, G, Xs) returns the list Xs of all terms X such that goal G is true (it is assumed that X occurs in G). The solution and the result are shown in Figure 4. Note that we assume that i) the root node of the document has index 0, and ii) predicate content(TextNode, Content) is used to determine the content of a text node.
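The nesting of findall calls can be mimicked in Python with nested list comprehensions, each of which collects all solutions of an inner wrapper call. The sketch below is only an analogy under stated assumptions: the two wrappers are stubbed with lookup tables instead of being evaluated on a document tree, the node identifiers are hypothetical, and the strings are those of the result in Figure 4.

```python
# Hypothetical node identifiers mapped to the text they contain.
content = {
    10: "Red apple", 21: "weight", 22: "120", 23: "color", 24: "red",
    25: "diameter", 26: "8",
    30: "Lemon", 41: "weight", 42: "70", 43: "color", 44: "yellow",
    45: "height", 46: "7", 47: "width", 48: "4",
}
# Stub results of the two wrappers (assumed, not computed from a tree).
FRUIT_PAIRS = {0: [(10, 20), (30, 40)]}      # root -> (name node, features node)
FEATURE_PAIRS = {20: [(21, 22), (23, 24), (25, 26)],
                 40: [(41, 42), (43, 44), (45, 46), (47, 48)]}


def extr_fruits(root):
    return FRUIT_PAIRS.get(root, [])


def extr_features(root):
    return FEATURE_PAIRS.get(root, [])


def extr_features_all(root):
    """Collect every (name, value) pair found below root."""
    return {"features": [{"name": content[fn], "value": content[fv]}
                         for fn, fv in extr_features(root)]}


def extr_fruits_all(doc):
    """Collect every fruit together with the features of its own fragment."""
    return {"fruits": [{"name": content[name], **extr_features_all(features)}
                       for name, features in extr_fruits(doc)]}


result = extr_fruits_all(0)
print(result["fruits"][1]["name"])           # Lemon
print(len(result["fruits"][1]["features"]))  # 4
```

Because the inner collection is evaluated once per fruit, each fruit keeps its own list of features and the hierarchy of the source document is preserved in the result.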
Definition 3 (Pattern graph) Let W be a set denoting all vertices. A pattern graph G is a quadruple 〈A, V, L, λa〉 such that V ⊆ W, A ⊆ V × V, L ⊆ V and λa : A → {‘c’, ‘n’}. The set Γ of pattern graphs is defined inductively as follows:
i. If v ∈ W then 〈∅, {v}, {v}, ∅〉 ∈ Γ.
ii. If G = 〈A, V, L, λa〉 ∈ Γ, v ∈ L, and w, ui ∈ W \ V, 1 ≤ i ≤ n, then
a) G1 = 〈A ∪ {(w, v)}, V ∪ {w}, (L \ {v}) ∪ {w}, λa ∪ {((w, v), ‘n’)}〉 ∈ Γ;
b) G2 = 〈A ∪ {(u1, v), …, (un, v)}, V ∪ {u1, …, un}, (L \ {v}) ∪ {u1, …, un}, λa ∪ {((u1, v), ‘c’), …, ((un, v), ‘c’)}〉 ∈ Γ;
c) G3 = 〈A ∪ {(w, v), (u1, v), …, (un, v)}, V ∪ {w, u1, …, un}, (L \ {v}) ∪ {w, u1, …, un}, λa ∪ {((w, v), ‘n’), ((u1, v), ‘c’), …, ((un, v), ‘c’)}〉 ∈ Γ.
LOGIC WRAPPERS AS DIRECTED GRAPHS

In this section, we take a graph-based perspective in defining L-wrappers as sets of patterns. Within this framework, a pattern is a directed graph with labeled arcs and vertices that corresponds to a rule in the logic representation. Arc labels denote conditions that specify the tree delimiters of the extracted data, according to the parent-child and next-sibling relationships (e.g., is there a parent node?, is there a left sibling?, etc.). Vertex labels specify conditions on nodes (e.g., is the tag label td?, is it the first child?, etc.). A subset of graph vertices is used for selecting the items for extraction.

Intuitively, an arc labeled ‘n’ denotes the “next-sibling” relation, while an arc labeled ‘c’ denotes the “parent-child” relation. Concerning vertex labels, label ‘f’ denotes the “first child” condition, label ‘l’ denotes the “last child” condition and label σ ∈ Σ denotes the “equality with tag σ” condition.

Patterns are matched against parts of a target document modeled as a labeled ordered tree. A successful matching asks for the labels of pattern vertices and arcs to be consistent with the corresponding relations and functions over tree nodes. The result of applying a pattern to a labeled ordered tree is a set of tuples of extracted nodes.

Patterns can be concisely defined in two steps: i) define the pattern graph together with arc labels that model parent-child and next-sibling relations, and ii) extend this definition with vertex labels that model conditions on vertices, extraction vertices and assignment of extraction vertices to attributes.

Intuitively, if 〈A, V, L, λa〉 is a pattern graph, then V denotes its set of vertices, A denotes its set of arcs, L ⊆ V are its leaves (vertices with in-degree 0) and λa distinguishes between parent-child (labeled with ‘c’) and next-sibling (labeled with ‘n’) arcs. Note also that a pattern graph is tree shaped with arcs pointing up.

Note that according to Definition 4, we assume that extraction vertices are among the leaves of the pattern graph, that is, an extraction pattern does not state any condition about the descendants of an extracted node. This is not restrictive in the context of patterns for information extraction from Web documents.
Definition 4 (L-wrapper pattern) Let A be the set of attribute names. An L-wrapper pattern is a tuple p = 〈V, A, U, D, µ, λa, λc〉 such that 〈A, V, L, λa〉 is a pattern graph, U = {u1, u2, …, uk} ⊆ L is the set of pattern extraction vertices, D ⊆ A is the set of attribute names, µ : D → U is a one-to-one function that assigns a pattern extraction vertex to each attribute name, and λc : V → C is the labeling function for vertices. C = {∅, {‘f’}, {‘l’}, {σ}, {‘f’, ‘l’}, {‘f’, σ}, {‘l’, σ}, {‘f’, ‘l’, σ}} is the set of conditions, where σ is a label in the set Σ of tag symbols.
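The inductive construction of Definition 3 and the constraints of Definition 4 can be traced with a short Python sketch. It is illustrative only: the dictionary layout, the helper names single, extend and in_degree_zero, and the vertex names are assumptions. The example graph is built so that, if an arc (w, v) labeled ‘n’ is read as next(w, v) and an arc (u, v) labeled ‘c’ as child(v, u), it resembles the body of extr_features from Box 4; this reading is an interpretation, not something the definitions state.

```python
def single(v):
    """Rule i: a graph with one vertex, which is also its only leaf."""
    return {"A": set(), "V": {v}, "L": {v}, "label": {}}


def extend(g, v, sibling=None, children=()):
    """Rule ii: grow leaf v with a next-sibling arc (w, v), child arcs (u_i, v), or both."""
    new = ([sibling] if sibling is not None else []) + list(children)
    if v not in g["L"]:
        raise ValueError("only a leaf can be extended")
    if not new or len(set(new)) != len(new) or set(new) & g["V"]:
        raise ValueError("new vertices must be distinct and not yet in V")
    label = dict(g["label"])
    if sibling is not None:
        label[(sibling, v)] = "n"              # case a) or c)
    for u in children:
        label[(u, v)] = "c"                    # case b) or c)
    return {"A": set(label),                   # every arc carries a label
            "V": g["V"] | set(new),
            "L": (g["L"] - {v}) | set(new),
            "label": label}


def in_degree_zero(g):
    """Vertices that are not the target of any arc."""
    return g["V"] - {target for (_, target) in g["A"]}


g = single("C")                                # row vertex
g = extend(g, "C", children=["B"])             # case b)
g = extend(g, "B", sibling="A", children=["FV"])   # case c)
g = extend(g, "A", children=["FN"])            # case b)

# Definition 4: extraction vertices are leaves and mu is one-to-one.
mu = {"feature-name": "FN", "feature-value": "FV"}
assert set(mu.values()) <= g["L"]
assert len(set(mu.values())) == len(mu)

print(sorted(g["L"]))                          # ['FN', 'FV']
print(in_degree_zero(g) == g["L"])             # True
```

The last line checks, for this example, the remark that the leaves of a pattern graph are exactly its vertices with in-degree 0.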
Right after lunch the General ordered several of the soldiers to
help the Old Soldier take the raft out for a trial trip.
With the help of the big sweep on the stern of the raft and the use
of several long poles, the little men slowly pushed the craft out into
the stream.
“Jumping beans!” exclaimed the Old Soldier as the raft slid easily
out into the water. “That is what I call a good—” but the Old Soldier
never finished the sentence, for at that very moment a big frog
poked his head out of the water and hopped up onto the raft.
“Oooooo, a-a-a s-s-submarine!” gasped the Dunce. “Jump for your
lives,” and he leaped head first into the deep water.
Most of the Teenie Weenies gave one look at the ugly frog,
followed the Dunce into the water and swam as fast as they could
for shore. The Old Soldier and Gogo were the only men to hold their
ground, and if it had not been for these brave little fellows, the frog
would have captured the transport without a battle. The Old Soldier
drew his sword and attacked the frog, while Gogo struck the big
fellow over the head with the boat pole. The frog, who had been
attracted by the red coats of the soldiers, had hopped onto the raft
in search of a meal, but he quickly slid back into the water at the
first blow of the boat pole.
The soldiers who had jumped into the water were much ashamed
of their behavior and they all quickly returned to the raft and
finished the trial trip they had started.
“My brave lad,” said the General, grasping Gogo by the hand when
the raft returned from its trip, “in behalf of the Teenie Weenie army I
want to thank you for your great bravery and I hereby promote you
to first sergeant in the Teenie Weenie army.”
“Oh, dat’s all right, General,” said Gogo, much confused at the
honor thrust upon him. “There’s no fool frog what’s done gonna
scare me when I’s mad, and I was certainly mad at that fool frog.”
The raft proved to be thoroughly seaworthy, so the General gave
orders for the men to be ready to board her just as soon as the wind
and current were favorable for the trip to the wild men’s island.
Chapter Thirteen
THE ATTACK
It was a long trip to the wild men’s island and the General wished
to make the journey under cover of darkness. “I want to land on the
island before daybreak so we can surprise the wild men,” the
General told his officers, who were gathered for a council of war.
“The Sailor tells me,” continued the General, “that the wind and
current are just right to sail the raft over to-night. I will take over
the infantry on the first trip and then the raft can return for the
artillery and the baggage and the rear guard, which the Old Soldier
will command.”
The Cook had a great pot of rice cooked and he had stewed five
lima beans. This great amount of food was portioned out, and three
days’ rations were given to each soldier.
A number of picks and shovels, with a lot of bags and a chest of
bullets, was loaded onto the raft.
Promptly at eleven o’clock the General, followed by several men,
marched onto the raft and some of the soldiers with long poles
quickly pushed out into the dark stream.
The Teenie Weenies pull the captured Wild Men out of the bottle. (Chapter Sixteen.)
The Sailor and the Cowboy handled the big sweep at the back of
the raft, while the Policeman and the Scotchman pushed wherever
they could with the long poles.
No lights were allowed on the raft and the men were ordered to
talk only in whispers, for the General wished to land on the island
unknown to the wild men.
“J-j-j-jimminie C-c-christmas!” stuttered the trembling Dunce, “I-I-
I’m not a-a bit s-s-scared. Are you, G-g-gogo?”
“Not v-v-very m-m-much,” answered the colored lad, trying to
keep his knees from knocking together. “I done hope we-all—”
But at that minute, the raft struck something with such a bang it
nearly upset most of the little soldiers. In fact, the Scotchman would
have tumbled into the water if the Cowboy hadn’t caught him.
The raft had struck the limb of an old tree that lay in the water
and to the alarm of the General it stuck fast.
“This is terrible. Perfectly terrible,” groaned the General, glancing
towards the eastern sky. “It will soon be daylight and the wild men
will see us if we are delayed here.”
The men worked with might and main to free the raft, but it was
stuck tight to the snag and before they managed to get it free it was
broad daylight.
“The wild men have very likely seen us by this time,” said the
General, peering towards the island. “So instead of our surprising
them, they probably will surprise us, but we have got to land.
Examine your rifles and see that they are in condition to use, for we
are likely to have a fight.”
“Look there!” cried the Sailor, pointing towards the shore, for the
raft was now only a short distance from the island. “There’s
something behind that stick.”
“Maybe it’s a wild man,” suggested the Dunce, turning a trifle pale.
“Don’t you think we had better go back, General?”
“We intend to go on,” said the General, glancing scornfully at the
frightened Dunce, “but if you want to you can jump into the water
and swim back.”
“I-I-I think I’ll stay here,” said the Dunce as he thought of the
many frogs and turtles that might snap him up if he tried to swim
back.
As the raft drew near the shore, several arrows whistled over the
soldiers’ heads and instantly a number of wild men sprang up from
behind a stick that lay on the shore and began shooting at the raft.
“Make ready, men!” shouted the General, drawing his sword.
“Shoot over the wild men’s heads when I give the word to fire. We
don’t want to hurt any of them if we can avoid it.”
“O-o-o-oh, I-I-I’m shot!” screamed the Dunce, as an arrow
knocked his hat from his head, but the rest of the little soldiers
never heard the foolish fellow, for they stood ready, awaiting the
General’s order to fire.
Chapter Fourteen
“Deliver this note at once to the Old Soldier,” said the General, as
he handed the following letter to the army aviator:
Commander of the rear guard of the Teenie Weenie Army, Camp Bitem,
on the Swamp Road:
My dear Captain:
We have had a battle with the enemy and our brave men have put
them to rout.
Our transport met with an accident and it was broad daylight before
we landed on the island.
The wild men attacked us as we neared the shore and sent a shower
of arrows at us.
I ordered my men to return the fire, and at the first crack of their rifles
the wild men were greatly scared and ran off into the tall grass; I believe
that it is the first time the wild men have ever heard a rifle shot.
We have taken possession of a high bank where I have ordered the
men to begin work on a trench.
The raft is now on its way to your camp, and I want you to rush over
the cannon and baggage as soon as possible, for I fully expect the wild
men to attack us before long.
I am sending this note by our brave aviator so you can have things
ready to load on the raft when it arrives.
Respectfully yours,
THE GENERAL,
Commander in Chief of the Teenie Weenie Army.
P. S.—I forgot to say that none of my men was hurt in the battle
except the Dunce, who was badly scared by being shot through the hat.
“Yes, sir,” saluted the Turk, and springing onto the back of the
airplane he quickly flew out of sight over the water.
A COUNCIL OF WAR
For several days the army spent their time building trenches and
making a comfortable camp, while the army scouts learned all they
could about the wild men and the lie of the land.
The Red Cross tent had been set up and the tiny cots looked very
pretty, with their clean white sheets. Fortunately, there had been
little use for them, as the army had been unusually healthy, the only
exception being the Chinaman, who had been badly bitten by a
pollywog, or tadpole, while he was taking a swim in the river.
There had been very little excitement in camp for some time. Not
a single wild man had been seen since the morning the army had
landed on the raft and the soldiers had nothing much to do while off
guard duty but to kill mosquitoes, which were thick about the camp.
Early one morning the Turk was called to the General’s tent, where
he remained for some time.
“Somethin’ doin’, I’ll bet,” thought the Dunce, who was on guard
duty at the time in front of headquarters.
Something really was doing, for the Turk was ordered to fly out at
once and make a careful map of Sabo Island. The Turk hurried to his
tent, where he supplied himself with paper and pencils and a pair of
tiny field glasses. The army airplane was dining on a fat worm when
the Turk arrived, so he sat down and waited until the bird had eaten
his breakfast.
“We’ve got to go out and make a map of the wild men’s island,”
said the Turk.
“All right,” answered the bird, “I’m ready,” and hopping onto the
ground he squatted down while the Turk climbed up on his fat back.
The Turk headed the bird to a big tree which grew on the river
bank near the island and in a few minutes the airplane settled easily
on the topmost branch. The great blue river lay far beneath the Turk
and with the help of his field glasses he was able to make a good
map of the island and the surrounding country.
When he returned to the Teenie Weenie camp the General
immediately called a council of war and the little aviator was asked
to explain the map in detail.
“Well,” began the Turk, “the circle marked Camp Bitem is the place
where we camped and built the raft and the dotted line is the course
we took to our present camp. The wild men have a sort of camp or
fort, I couldn’t just exactly make out what it was, but anyhow they
are gathered in some force on the only cleared ground between their
village and our camp.”
MAP OF WILD MEN’S ISLAND.
“We couldn’t march through the grass and trees and cut the wild
men’s camp off from the village, could we?” asked the Old Soldier.
“No, sir, I don’t think so,” answered the Turk, “for I do not believe
anyone could possibly get through the grass and trees.”
“Well, that’s too bad,” muttered the General. “I wanted to get
those wild men out of that place with as little trouble as possible,
but it looks as though we would have to take their fort by storm.”
All the Teenie Weenie officers gathered in the General’s tent
listened solemnly to their commander’s words, for they knew it
would be mighty serious if they were forced to charge the wild men’s
fort.
Chapter Sixteen
“I done got ’em bottled up! I done got ’em bottled up!” shouted
Gogo, the little colored Teenie Weenie, as he ran panting up to the
General’s tent.
“What’s bottled up? What’s all the excitement about?” asked the
General, popping his head through the opening of his tent.
“Why I-I-I done ketched one of the wild men and turned him ovah
to the guard and I done got three mo’ corked up in a bottle.”
“Great Guns! This is exciting. Tell me about it,” cried the General.
“Well, you see it’s dis way,” said Gogo, sitting down on a pebble
and mopping his head with his tiny handkerchief. “I done took a
walk out beyond the picket lines yonder. I knew I had no business
wanderin’ out dere, but I jus’ kept on and pretty soon I run across a
big bottle a-layin’ on its side.
“I was kind of ’spicious about dat bottle, fo’ I done see through de
glass where some dry grass had done been fixed up fo’ a bed,
mighty like some one been sleepin’ dere.
“‘Gogo,’ I says to myself, ‘some one been sleepin’ heah in dis
bottle and it ain’t none of de Teenie Weenies, fo’ none of dem has
been out heah dis far.’ Den I made up my mind that it mus’ be some
of dem scalawag wild men and I reckon dey mus’ stayed in dis bottle
when dey was on guard duty watchin’ our army.
“‘But why did dey-all stay in dis heah bottle?’ I says to myself. ‘It’s
not cold nights.’ But jus’ den a big mosquito cam’ a-buzzin’ and
a-buzzin’ round and den I knew dat the wild men been a-stayin’ in dat
bottle fo’ to keep de mosquitoes from bitin’ ’em.
“I says to myself, ‘Some of dese wild men will be comin’ ’round heah
pretty soon and maybe I can done cotch ’em and extinguish myself.’”
“Distinguish yourself,” corrected the General.
“Yes, sah,” continued the little colored fellow. “Well, I done crawl
under a leaf and waited. I done wait fo’ a long time, but pretty soon
I done see fo’ of de wild men come sneakin’ along and pretty soon
dey done make right fo’ de bottle. Three of ’em done crawl in de
bottle and one of ’em done squat down outside by de openin’ of de
bottle kinda like he was guardin’.
“‘By de great corn pone,’ I says, ‘if a couple of de Teenie Weenies
was heah we could done cotch dese scalawags.’
“Pretty soon I thought to myself, ‘Why don’t you ketch ’em
yourself?’ So I done sneaked out up behind de wild man what was
guardin’ de mouth of de bottle and done cracked him on de head
with de butt of my gun. I didn’t hit very hard—just hard enough to
stun him a little—and den I grabbed a cork dat was layin’ near by
and stuffed it into de bottle and braced it with a stick of wood so the
scalawags couldn’t get out. I then picked up de wild man I had
knocked down and brought him into camp and dat’s all.”
“A very brave deed, sergeant,” said the General. “And I will
immediately send out a squad of men to bring your prisoners into
camp.”
The Old Soldier was ordered to take a squad of men and go after
the prisoners, while the Doctor was sent to dress the bump on the
head of the wild man that Gogo had knocked down. After a great
deal of work the soldiers managed to pull the three wild men out of
the bottle and when they were brought into camp they were
securely tied to a strong blade of grass.
Chapter Seventeen
“Why are you making such a fine camp here, General?” asked the
Doctor, as he noticed that the Teenie Weenies continued to improve
the camp. “Won’t we have to move on pretty soon if the wild men
do not attack?”
“We’ll stay right here for some time,” answered the General, taking
off his tiny sword and laying it on the table which stood in front of
his tent. “We are within striking distance of the wild men’s village, so
the aviator tells me, and we’ll use this camp for our base of
operations.”
“General,” said the Cook, saluting the commander of the Teenie
Weenie army, “I beg your pardon, but there is something I must tell
you.”
“What is it, sir?” said the General, returning the Cook’s salute.
“Why sir, there’s a thimble missing from among my cooking things.
I put two beans to soak in it last night and when I went to look at
them a little while ago the beans were lying on the ground and the
thimble was gone.”
“That’s most strange,” said the General; “I’ll have the Cowboy look
into the matter and see if he can find out what has happened to the
thimble.”
“Thank you, sir,” said the Cook, “I’m a little short of cooking pans
and kettles and I’d like to have it back.”
The Cowboy was ordered to look for the lost thimble, but before
he had fairly started the search, the thimble turned up in a most
peculiar way. Down the main street of the camp towards the
General’s tent marched the Dunce with the lost thimble over his
head and followed by a laughing crowd of soldiers.
“What’s the idea of this?” asked the General as the Dunce stopped
before him.
“Safety first,” answered the Dunce.
“What do you mean by safety first?” asked the General, trying
hard to keep from laughing at the ridiculous sight.
“W-w-well, you s-s-see,” began the Dunce, “I thought this thimble
would make a fine suit of armor, and protect me from the wild men’s
arrows. I took it out back of camp, got some tools and cut a couple
of holes for my arms to go through and another hole to see through
—”
“Yes, and spoiled a perfectly good thimble,” put in the General.
“Jinks!” exclaimed the Dunce, “I never thought of that.”
“Of course you didn’t,” answered the General sternly. “You have a
habit of doing your thinking afterwards, and that is a mighty bad
habit.”
“Quite right! Quite right!” cried a field mouse, who had been
hanging around the camp for a few days. “Quite right, I says.
There’s always a time to think. One ought to do a heap of thinking
before one acts, I says.”
“Yes, you’re right,” put in the General, glaring at the mouse, who
was very talkative. “One ought to think a great deal and then he
ought to say only about one half of what he thinks.”
“Words of wisdom! Words of wisdom!” cried the mouse, never
dreaming the General’s rebuke was aimed at him, and he strolled
down the camp street quite pleased with himself.
“Now, Dunce,” said the General, “I’m going to try to see if I can
help you do a little thinking.”
“Y-y-yes, s-s-s-sir,” answered the Dunce.
“I’m going to make you wear that thimble for the rest of the day
and that ought to help you to remember that you have spoiled a
perfectly good cooking pot, just because you didn’t happen to think.”
All day long the poor Dunce was forced to walk up and down in
front of the General’s tent, wearing the heavy thimble. It was a
warm day and the thimble grew quite hot in the sunshine, so his
punishment was pretty hard, but there is no doubt it did him a great
deal of good.