
B.Tech 8th Semester Project 2012

Team Members: Akash (20084087), Anshul Goyal (20084027), Shrish Chandra Mishra (20084050), Amit Singh (20084057)

When we download images from the World Wide Web, the images arrive in arbitrary order.
We often want these images arranged in some semantic order. Image processing is not well suited to this purpose because of the huge overhead involved.

Images on the Web are not found alone; they are embedded in an HTML page along with text related to the image.

The text surrounding an image can be used to determine the context of that image. By performing semantic analysis on the surrounding text, we can obtain a relevant ordering of the images.

Can we guess what this image represents?

When the surrounding text is considered, the information we get is: Valley of Flowers National Park is an Indian national park, nestled in the West Himalayas. It is located in Uttarakhand state.
(Source: Wikipedia.org)

Flow Diagram

[Figure: flow diagram. Recoverable labels: Search term; Cached? (yes / no); Module 1: Download Images and Parse Surrounding Text; Module 2: Concept Extraction from Corpora, Arranging Images Using Semantic Analysis; Read Images into Memory; Output to User Interface]

The search term is entered through a GUI provided by the wxPython library.

The search term is checked against the cache of already downloaded images.

Module 1 downloads the images and their HTML documents using the search term. Images are downloaded using producer-consumer threads.

Thumbnails are created from the downloaded images and loaded into an array.

Module 2 extracts the concepts from the corpora and arranges the images using semantic analysis.

The output is shown to the user in a GUI window.

The user can see an image as well as its corresponding corpora by clicking its thumbnail.

Module 1

[Figure: Module 1 block diagram. Inputs: Search Term, Search Constraints. Steps: Create Web Search URL; Web search and get resulting image URLs; Download Images and their corresponding HTML document; Extract surrounding text for every image; Adding image information into database. Output: Images and surrounding text]

Search Constraints
1. Number of images to be downloaded
2. Image size

A web search URL is created from the search term and the search constraints, and the image URLs are retrieved from the resulting page.

The images and their HTML pages are downloaded from the retrieved URLs using Python's mechanize library and threads.
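As a minimal sketch of the URL-building step, the snippet below combines a search term with the two search constraints using only the standard library. The endpoint and the parameter names (`q`, `num`, `size`) are hypothetical placeholders, not the search service or the mechanize calls the project actually used.

```python
from urllib.parse import urlencode

# Hypothetical endpoint and parameter names, for illustration only.
SEARCH_ENDPOINT = "https://images.example.com/search"


def build_search_url(search_term, num_images=20, image_size="medium"):
    """Combine the search term and the two search constraints into one URL."""
    params = {"q": search_term, "num": num_images, "size": image_size}
    # urlencode escapes spaces and special characters in the search term.
    return SEARCH_ENDPOINT + "?" + urlencode(params)


url = build_search_url("valley of flowers", num_images=10, image_size="large")
```

The image URLs would then be read from the page returned for this URL.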

Threading
Threading is implemented with producer-consumer threads. The producer thread creates a new thread for every new image and puts it in a queue.

Each of these threads downloads an image and its corresponding HTML file. The consumer thread takes a thread from the queue, checks the validity of the image, and then creates an entry for it in the database. Entries are created serially to maintain consistency.
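The producer-consumer arrangement described above might be sketched as follows. This is an illustration under assumptions: `fetch` is a hypothetical stand-in that performs no network access, the validity check is reduced to "the image is non-empty", and the database is a plain list.

```python
import queue
import threading


def fetch(url):
    """Hypothetical stand-in for downloading an image and its HTML page."""
    image = b"fake-bytes" if url.endswith(".jpg") else b""
    return {"url": url, "image": image}


class DownloadThread(threading.Thread):
    """One worker thread per image, as created by the producer."""

    def __init__(self, url):
        super().__init__()
        self.url = url
        self.result = None

    def run(self):
        self.result = fetch(self.url)


def producer(urls, work_queue):
    for url in urls:
        worker = DownloadThread(url)
        worker.start()
        work_queue.put(worker)
    work_queue.put(None)  # sentinel: no more work


def consumer(work_queue, database):
    while True:
        worker = work_queue.get()
        if worker is None:
            break
        worker.join()  # wait until this download has finished
        if worker.result["image"]:  # simplified validity check
            database.append(worker.result["url"])  # entries added serially


def download_all(urls):
    work_queue = queue.Queue()
    database = []
    threads = [
        threading.Thread(target=producer, args=(urls, work_queue)),
        threading.Thread(target=consumer, args=(work_queue, database)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return database
```

Because only the single consumer thread writes to the database, entries are created one at a time even though the downloads run concurrently.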

Surrounding text is extracted from each image's HTML page using Python's BeautifulSoup library and is saved as .corpora files alongside the corresponding images. For every new image downloaded, a new entry is created in the database with information such as the image URL, the website URL, the search term, and the image's file name on the hard disk.
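To illustrate what "surrounding text" can mean in practice, here is a rough sketch that uses the standard library's `html.parser` as a stand-in for BeautifulSoup. The rule it applies (the image's `alt` text plus a fixed window of text chunks before and after the image) and the `window` parameter are assumptions made for illustration, not the project's actual extraction rule.

```python
from html.parser import HTMLParser


class SurroundingTextParser(HTMLParser):
    """Collects text chunks in document order and notes where the image occurs."""

    def __init__(self, image_src):
        super().__init__()
        self.image_src = image_src
        self.chunks = []         # visible text chunks, in document order
        self.image_index = None  # number of chunks seen before the image
        self.alt = ""
        self._skip_depth = 0     # inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "img" and self.image_index is None:
            attributes = dict(attrs)
            if attributes.get("src") == self.image_src:
                self.image_index = len(self.chunks)
                self.alt = attributes.get("alt") or ""

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and not self._skip_depth:
            self.chunks.append(text)


def surrounding_text(html, image_src, window=2):
    """Return the alt text plus `window` text chunks on each side of the image."""
    parser = SurroundingTextParser(image_src)
    parser.feed(html)
    parser.close()
    if parser.image_index is None:
        return ""
    i = parser.image_index
    nearby = parser.chunks[max(0, i - window): i + window]
    return " ".join(part for part in [parser.alt] + nearby if part)
```

The returned string is what would be written to the image's .corpora file.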

Module 2

[Figure: Module 2 block diagram. Input: Corpora. Steps: Removing Stop Words; Apply Stemmer Algorithm; Concept Extraction Using Standard Ontology; Arranging Images; Updating the database. Final label: Output]

Module 2 takes the corpora produced by Module 1 as input.

Irrelevant stop words are removed from the corpora using a predefined dictionary of stop words.

Ex.
Input: Stopwords are common words that carry less important meaning than keywords. Usually search engines remove these words from keyword phrase.
Output: Stopwords common words important meaning keywords search engines remove stopwords keyword phrase
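Stop word removal against a predefined dictionary can be sketched in a few lines. The `STOP_WORDS` set below is a tiny hypothetical sample, not the dictionary used in the project, and the tokenizer (lowercase alphanumeric runs) is an assumption.

```python
import re

# Tiny illustrative sample; a real stop word dictionary is much larger.
STOP_WORDS = {
    "a", "an", "the", "on", "in", "are", "is", "of", "and",
    "to", "that", "than", "these", "from",
}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into words, and drop the stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in stop_words]


kept = remove_stop_words("The images on the Web are embedded in an HTML page")
```

The remaining words are what the stemming step receives.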

A stemmer algorithm is used to obtain the morphological root of the words in the corpora. The stemmer is applied using Python's whoosh library.

Ex. (after applying the stemmer algorithm)
Chatter -> Chat
Running -> Run
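To make the idea of stemming concrete without depending on whoosh, the sketch below uses a deliberately naive suffix-stripping rule. It is a hypothetical toy, not the stemmer that whoosh provides, and its results will differ from a real stemmer on many words.

```python
def naive_stem(word):
    """Toy stemmer: strip one common suffix and undo consonant doubling."""
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # "runn" -> "run"; vowels, "l" and "s" are left doubled.
            if stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    return word


stems = [naive_stem(w) for w in ["Running", "flowers", "downloaded"]]
```

With this rule, "Running" reduces to "run" and "flowers" to "flower", so different surface forms of a word map to the same key before concept extraction.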

Concepts are extracted from the corpora by comparing them against a given standard ontology in OWL format.

The images are arranged on the basis of the concepts extracted for each image.

For each image, the extracted concepts and the keywords from the surrounding text are entered into the database.
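Concept extraction by comparison against an ontology might look like the sketch below. The `ONTOLOGY` dictionary is a hypothetical stand-in for the OWL file: it maps each concept name to a set of terms, whereas the real project would read these from the ontology itself.

```python
# Hypothetical stand-in for the OWL ontology: concept name -> related terms.
ONTOLOGY = {
    "NationalPark": {"park", "reserve", "sanctuary"},
    "Mountain": {"himalaya", "mountain", "peak"},
    "Flower": {"flower", "blossom"},
}


def extract_concepts(keywords, ontology=ONTOLOGY):
    """Return every concept that shares at least one term with the keywords."""
    keyword_set = set(keywords)
    return {
        concept
        for concept, terms in ontology.items()
        if terms & keyword_set
    }


concepts = extract_concepts(["valley", "flower", "park"])
```

The keywords passed in are assumed to be the stemmed, stop-word-free words from an image's corpora.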

A common RDF file is created for storing the information. The general form of an image entry in the database is shown below.

<rdf:Description rdf:about="URI of IMAGE">
    <image:image_name> </image:image_name>
    <image:URL_webpage> </image:URL_webpage>
    <image:URL_image> </image:URL_image>
    <image:Search_term> </image:Search_term>
    <image:keyword> </image:keyword>
    <image:concept_name> </image:concept_name>
</rdf:Description>
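An entry of this shape can be generated with the standard library's `xml.etree.ElementTree`, as in the sketch below. The namespace URI bound to the `image:` prefix is a hypothetical placeholder, the search term element is assumed to be named `image:Search_term`, and repeating `image:keyword` and `image:concept_name` once per value is also an assumption.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
IMAGE_NS = "http://example.org/image#"  # hypothetical namespace for "image:"

ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("image", IMAGE_NS)


def image_entry(uri, image_name, url_webpage, url_image,
                search_term, keywords, concepts):
    """Build one rdf:Description element for a downloaded image."""
    entry = ET.Element("{%s}Description" % RDF_NS,
                       {"{%s}about" % RDF_NS: uri})
    fields = [
        ("image_name", image_name),
        ("URL_webpage", url_webpage),
        ("URL_image", url_image),
        ("Search_term", search_term),
    ]
    fields += [("keyword", keyword) for keyword in keywords]
    fields += [("concept_name", concept) for concept in concepts]
    for tag, text in fields:
        ET.SubElement(entry, "{%s}%s" % (IMAGE_NS, tag)).text = text
    return entry


entry = image_entry(
    "http://example.org/images/valley.jpg", "valley.jpg",
    "http://example.org/page.html", "http://example.org/valley.jpg",
    "valley of flowers", ["valley", "flower"], ["Flower"],
)
xml_text = ET.tostring(entry, encoding="unicode")
```

Serializing the element yields an `rdf:Description` block that can be appended to the common RDF file.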

Result without applying the semantic arrangement

Result after applying the algorithm

Keyword Suggestion for the user

Keywords are suggested by comparing the frequencies of the words in the surrounding text. The words with the highest frequency are shown to the user.
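Frequency-based suggestion can be sketched with `collections.Counter`. The `top_n` cutoff is a hypothetical parameter, and the input is assumed to be the words left after stop word removal and stemming.

```python
from collections import Counter


def suggest_keywords(words, top_n=3):
    """Return the top_n most frequent words, most frequent first."""
    return [word for word, _count in Counter(words).most_common(top_n)]


words = ["valley", "flower", "valley", "park", "flower", "valley"]
suggestions = suggest_keywords(words, top_n=2)
```

Here "valley" occurs three times and "flower" twice, so those two words are suggested ahead of "park".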

Comparison of the images

Two images are compared by calculating the intersection and the union of the concepts extracted from the surrounding text of both images.

Formula used:

$$\frac{|\text{Concepts from image1} \cap \text{Concepts from image2}|}{|\text{Concepts from image1} \cup \text{Concepts from image2}|}$$
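Assuming the comparison is the ratio of the size of the intersection to the size of the union of the two concept sets (the Jaccard index), it can be computed as in this sketch. Returning 0.0 when both sets are empty is a choice made here, not something stated in the slides.

```python
def concept_similarity(concepts_a, concepts_b):
    """Size of the intersection divided by size of the union of two concept sets."""
    a, b = set(concepts_a), set(concepts_b)
    union = a | b
    if not union:
        return 0.0  # assumption: two images with no concepts are unrelated
    return len(a & b) / len(union)


score = concept_similarity({"Flower", "Mountain"}, {"Flower", "NationalPark"})
```

In this example the sets share one concept out of three distinct concepts, so the score is one third. Identical sets score 1.0 and disjoint sets score 0.0.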

A user can query the database on three attributes of an image:
1. Surrounding text keyword
2. Search term
3. Concept
The images with the matching attributes are shown to the user.
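A query over these three attributes could be sketched as a simple filter. The record layout below (dictionaries with `image_name`, `search_term`, `keywords`, and `concepts`) is a hypothetical in-memory stand-in for the RDF database.

```python
def query_images(records, keyword=None, search_term=None, concept=None):
    """Return names of images matching every attribute that was supplied."""
    matches = []
    for record in records:
        if keyword is not None and keyword not in record["keywords"]:
            continue
        if search_term is not None and record["search_term"] != search_term:
            continue
        if concept is not None and concept not in record["concepts"]:
            continue
        matches.append(record["image_name"])
    return matches


records = [
    {"image_name": "a.jpg", "search_term": "valley of flowers",
     "keywords": {"valley", "flower"}, "concepts": {"Flower"}},
    {"image_name": "b.jpg", "search_term": "valley of flowers",
     "keywords": {"park"}, "concepts": {"NationalPark"}},
    {"image_name": "c.jpg", "search_term": "himalaya",
     "keywords": {"peak"}, "concepts": {"Mountain"}},
]
flower_images = query_images(records, concept="Flower")
```

Attributes left as `None` are ignored, so the same function serves queries on a single attribute or on a combination.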

Applying the arrangement of images to other domains.
Removing duplicate images by trying to find relationships between their surrounding texts.
Optimizing the data structures used.
