Abstract: The efficient processing and association of different multimodal information is an important research field with a great variety of applications, such as human-computer interaction, knowledge discovery, document understanding, etc. A good approach to this important issue is the development of a common platform for converting different modalities (such as images, text, etc.) into the same medium and associating them for efficient processing and understanding. Thus, this paper presents the development of a novel methodology based on Local-Global (LG) graphs capable of automatically converting image context into natural language text sentences and then into speech, serving as an interactive model for locating missing objects in home environments. Simple illustrative examples are provided as a proof of the concept proposed here.
Keywords: Converting Images to NL-Text, Image Analysis and Representation, Graphs, Recognizing Objects.
I. INTRODUCTION
Where are my pills? Did you see them? This is one of several questions asked frequently every day in a home environment, where we ask for help from others who may have information regarding a missing object of ours. The usual answer to such a question is a short spoken NL sentence, like "over there". This interaction creates the following challenge: can we develop an interactive system that automatically extracts and recognizes objects from images and describes their locations and associations in NL sentences by using the appropriate set of sensors, computing devices, and software techniques? The answer is yes for some categories of images of low complexity with no illumination or shadow tricks. Thus, a variety of techniques from different fields have to be synergistically employed in order for such tasks to be accomplished. In particular, the main research fields involved in this effort are image processing and understanding, conversion of NL sentences into speech, and computing models, like graphs [1-32].
[Figure 3: An example of a mismatch. (a) the original data image; (b) and (c) two graphs with one PCRP change; (b) and (c) have a similar graph and also the same region relationships.]

[Fig. 2: (a) the connectivity among three neighboring regions, their centroids, and the LG graph; (b) the LG graph of a synthetic image consisting of seven segmented regions.]
Perpendicular Rule:
N_i a_ij^pe N_j -> NL_s = {the straight line segment N_i is perpendicular to the straight line segment N_j}

Symmetric Rule:
N_i a_ij^s N_j -> NL_s = {the straight line segment N_i is symmetric with the straight line segment N_j}

Synthesis Rule:
N_i a_ij^x N_j AND N_i a_ij^y N_j -> NL_s(x) AND NL_s(y)
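The rules above can be sketched in code as a small table of sentence templates, one per edge attribute, with the synthesis rule joining several relations on the same node pair with AND. This is a minimal illustration, not the paper's implementation; the names RULES, edge_to_sentence, and synthesis are assumptions.

```python
# Hypothetical sketch: mapping an attributed LG-graph edge N_i a_ij N_j
# to the NL sentence given by the corresponding rule.
RULES = {
    "pe": "the straight line segment {i} is perpendicular to the straight line segment {j}",
    "s":  "the straight line segment {i} is symmetric with the straight line segment {j}",
}

def edge_to_sentence(i, j, attr):
    """Apply one rule: one attributed edge -> one NL sentence."""
    return RULES[attr].format(i=i, j=j)

def synthesis(i, j, attrs):
    """Synthesis rule: several relations on the same pair combine with AND."""
    return " AND ".join(edge_to_sentence(i, j, a) for a in attrs)

print(edge_to_sentence("N1", "N2", "pe"))
print(synthesis("N1", "N2", ["pe", "s"]))
```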
An Illustrative Example-1:
Here, for simplicity, we skip the attributes of the nodes and all the possible relationships among the nodes; thus, from Figure 1 the attributed graph is:
G1 = Ln1(c=140°) Ln2(c=174°) Ln3(c=111°) Ln4(c=22°) Ln5(c=152°) Ln6(c=160°) Ln7(c=173°) Ln8(c=108°) Ln9(c=30°); Ln1 Ln2(p) Ln8 Ln3(p) Ln7 Ln9(p) Ln4
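The attributed graph G1 above can be held in a simple data structure: each node Ln_i carries its curvature attribute c (in degrees), and edges carry relation labels such as "p" (perpendicular). This is a sketch, not the authors' data structure, and since the (p) pairings in the printed graph string are ambiguous after extraction, the edge list below is a hypothetical reading used only for illustration.

```python
# Hypothetical sketch of the attributed graph G1 from Example-1.
# Node attributes: curvature c in degrees, taken from the text.
nodes = {f"Ln{i}": c for i, c in zip(range(1, 10),
         [140, 174, 111, 22, 152, 160, 173, 108, 30])}

# Hypothetical (p) = perpendicularity edges; the exact pairing in the
# printed string is ambiguous, so these pairs are illustrative only.
edges = [("Ln1", "Ln2", "p"), ("Ln8", "Ln3", "p"), ("Ln7", "Ln9", "p")]

def relations_of(node):
    """All relation triples that involve the given node."""
    return [(a, b, r) for (a, b, r) in edges if node in (a, b)]

print(nodes["Ln1"])         # 140
print(relations_of("Ln1"))  # [('Ln1', 'Ln2', 'p')]
```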
Table-1
More specifically, from the input image all the objects are extracted and represented in their own graph form (G). Each graph form G is compared to the graph models existing in the graph DB. The outcome from the DB is the recognition of each object, its features, and the relationships among the objects. These relationships are expressed as an NL text description.
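The matching step just described can be sketched as follows. Real attributed-graph matching is more involved; the crude Jaccard-style similarity over relation triples below is only a stand-in, and the names similarity, recognize, and graph_db are assumptions.

```python
# Hypothetical sketch: compare an extracted graph G against model graphs
# in a small DB; the best-scoring model gives the recognized object label.

def similarity(g, model):
    """Crude graph similarity: Jaccard overlap of relation triples
    (a stand-in for real attributed-graph matching)."""
    g_rels, m_rels = set(g["edges"]), set(model["edges"])
    union = g_rels | m_rels
    return len(g_rels & m_rels) / len(union) if union else 0.0

def recognize(g, graph_db):
    """Return the best-matching model label and its score."""
    return max(((label, similarity(g, m)) for label, m in graph_db.items()),
               key=lambda t: t[1])

graph_db = {
    "wrench": {"edges": [("n1", "n2", "p"), ("n2", "n3", "s")]},
    "car":    {"edges": [("n1", "n2", "s")]},
}
g = {"edges": [("n1", "n2", "p"), ("n2", "n3", "s")]}
print(recognize(g, graph_db))  # ('wrench', 1.0)
```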
Extracted NL Outcome:
Detection: There is a silver wrench-tool;
Location of the objects: The wrench-tool is in the upper center part of the image;
Associations of the objects: The wrench-tool is above the car; The wrench-tool is to the left of the helicopter; The wrench-tool is to the right of the airplane;
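Association sentences like those above can be produced from the centroids of the recognized objects. The sketch below is a hypothetical illustration (the function name and the axis convention are assumptions, not the paper's method): it compares centroid offsets in image coordinates, where x grows rightwards and y grows downwards.

```python
# Hypothetical sketch: derive "above/below/left of/right of" sentences
# from object centroids in image coordinates.

def associate(name_a, ca, name_b, cb):
    """Describe the position of object A relative to object B."""
    dx, dy = ca[0] - cb[0], ca[1] - cb[1]
    if abs(dy) >= abs(dx):                       # vertical offset dominates
        rel = "above" if dy < 0 else "below"     # y grows downwards
    else:
        rel = "to the left of" if dx < 0 else "to the right of"
    return f"The {name_a} is {rel} the {name_b}"

print(associate("wrench-tool", (200, 40), "car", (210, 180)))
# The wrench-tool is above the car
```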
A Real Example:
The following example presents a real case using a surveillance system based on a Nortech security camera and an HP Pavilion portable computer, Figure 8. The camera scans the room by capturing a sequence of images (10 frames per second), and the computer software inspects each image in order to discover and extract the requested object(s); in this particular case the objects are the medical pills, which have been located in the tenth frame of the sequence. Here we present the LG graph with a few of the visual connections with the surrounding objects (for a clear visual representation). Figure 8 contains only a view of the camera used in this experiment; a view of one of the two rooms (upper right frame in the figure); a view of the second room (lower left frame of the figure) with the objects; and a magnified view of the detected and recognized objects (pills).
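The frame-scanning loop of this surveillance setup can be sketched as below. The function names are assumptions: recognize() stands in for the whole extract-and-match pipeline (here a trivial lambda), and the toy frame list reproduces the situation in the example, where the pills appear in the tenth frame.

```python
# Hypothetical sketch: inspect a sequence of frames (~10 per second)
# one by one until the requested object is found.

def find_object(frames, target, recognize):
    """Return (frame_index, label) for the first frame containing target."""
    for i, frame in enumerate(frames, start=1):
        for obj in recognize(frame):   # recognize(): frame -> object labels
            if obj == target:
                return i, obj
    return None

# Toy stand-in: the "pills" appear in the tenth frame, as in the example.
frames = [["table", "chair"]] * 9 + [["table", "pills"]]
print(find_object(frames, "pills", lambda f: f))  # (10, 'pills')
```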
Extracted NL Outcome:
Detection: There are three boxes with pills;
Location of the objects: The boxes with the pills are in the upper center of the image;
Associations of the objects: The boxes with the pills are on the table; The boxes with the pills are below the lamp; The boxes with the pills are to the left of the chair; The boxes with the pills are to the right of the couch;
Table-1 below shows the outcome from the simple one-image example.
Discussion:
This is a realistic case, where we need to find something missing in a home environment. The methodology inspected a sequence of images (one by one), and it took on average 1 sec per frame to decide whether the missing object was in it or not. In addition, the methodology (using a small DB with 20 items) decided that there was a couch in the frame with the pills; in reality, however, there were two chairs placed together, forming a couch. The recognition of the pills was based on the fact that we had indicated that the boxes contain the pills, and there were no other boxes in the DB. We are working to improve the methodology so that it interactively separates and recognizes the correct boxes from a set of boxes by using additional information such as color, shape, size, and label information (NL text).
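The planned improvement of separating the correct boxes by additional attributes can be sketched as a simple candidate-scoring step. This is a hypothetical illustration only; the attribute names, the score function, and the toy data are assumptions, not the paper's implementation.

```python
# Hypothetical sketch: rank candidate boxes by how many of the requested
# attributes (color, shape, size, label text) each one matches.

def score(candidate, query):
    """Fraction of requested attributes the candidate matches."""
    keys = [k for k in query if k in candidate]
    if not keys:
        return 0.0
    return sum(candidate[k] == query[k] for k in keys) / len(keys)

boxes = [
    {"color": "white", "shape": "rect", "label": "aspirin"},
    {"color": "blue",  "shape": "rect", "label": "vitamins"},
]
query = {"color": "white", "label": "aspirin"}
best = max(boxes, key=lambda b: score(b, query))
print(best["label"])  # aspirin
```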
III. CONCLUSION
This paper presented the basic concept and synergy of an efficient methodology capable of automatically converting images into equivalent natural language text sentences, contributing to research efforts on transforming different modalities into the same model. The model used here for representing the two modalities (images, NL text sentences) is the LG graph. Note that this conversion provides no interpretation of the context of an image, but a description of it. LG graphs could also provide efficient interrelations of images and text.
In addition, the methodology has potential for commercial and scientific applications such as multimedia information retrieval, knowledge discovery, etc. It can be used as a software tool in a variety of applications, such as document processing and understanding, digital libraries, knowledge extraction, automatic annotation of images, etc. It can also serve as a testbed for the integration and synchronization of other single modalities, such as speech and NL translation, by representing structural knowledge in the same LG model, and it is an efficient scheme for multimodal sources.
Acknowledgement
This work is partially supported by an AIIS grant.
BIBLIOGRAPHY AND REFERENCES
[01] F. Wahl, K. Wong and R. Casey, "Block separation and text extraction in mixed text image documents," CVGIP, vol. 20, 1989.
[02] N. Bourbakis, "A document processing methodology: separating text from images," IFAC IJEAAI, vol. 14, pp. 35-42, 2001; also in IEEE Symp. I&S, Nov. 1996, MD.
[03] N. Bourbakis, "Associating activities in images using SPN graphs," IEEE Conf. TAI-06, Nov. 13-15, 2006, WDC.
[04] N. Bourbakis and P. Kakumanu, "Recognizing Facial Expressions using LG graphs," IEEE Conf. TAI-06, Nov. 13-15, 2006, WDC; also SUNY-B-TR-1997.