B.Tech 8 Semester Project 2012: Anshul Goyal (20084027) Shrish Chandra Mishra (20084050) Amit Singh (20084057)

B.Tech 8th Semester Project 2012
Team Members: Akash (20084087), Anshul Goyal (20084027), Shrish Chandra Mishra (20084050), Amit Singh (20084057)
When we download images from the World Wide Web, they arrive in no particular order.
We often want these images in some semantic order. Image processing is impractical for this purpose because of the heavy computational overhead involved.
Images on the Web rarely appear alone; they are embedded in an HTML page along with text related to the image.
The text surrounding an image can be used to determine the context of that image. By performing semantic analysis on the surrounding text, we can obtain a relevant ordering of the images.
Can we guess what this image represents?
When the surrounding text is considered, we learn that Valley of Flowers National Park is an Indian national park nestled in the West Himalayas. It is located in the state of Uttarakhand.
(Source: Wikipedia.org)
Flow Diagram
[Flow diagram. Nodes: Search term; Cached? (yes / no); Module 1: Download Images and Parse Surrounding Text; Read Images into Memory; Module 2: Concept Extraction from Corpora, Arranging Images Using Semantic Analysis; Output to User Interface]
The search term is entered through a GUI built with the wxPython library.
The search term is checked against the cache of previously downloaded images.
Module 1 downloads the images and their HTML documents for the search term. Images are downloaded using producer-consumer threads.
Thumbnails are created from the downloaded images and loaded into an array.
Module 2 extracts concepts from the corpora and arranges the images using semantic analysis.
The output is shown to the user in a GUI window.
The user can view an image, along with its corresponding corpora, by clicking its thumbnail.
Module 1
[Flow diagram: Search Term and Search Constraints -> Create Web Search URL -> Web search and get resulting image URLs -> Download images and their corresponding HTML documents -> Extract surrounding text for every image -> Add image information into database -> Images and surrounding text]
Search Constraints
1. Number of images to be downloaded
2. Image size
The web search URL is created from the search term and the search constraints, and the image URLs are retrieved from the resulting page.
The images and their HTML pages are then downloaded from those URLs using Python's mechanize library and threads.
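A minimal sketch of building a search URL from the search term and the two search constraints. The slides do not name the search service, so the base URL and the parameter names (q, num, size) are hypothetical placeholders, not a real API.

```python
from urllib.parse import urlencode


def build_search_url(search_term, num_images=20, image_size="medium",
                     base_url="https://images.example.com/search"):
    """Combine the search term and search constraints into one URL.

    base_url and the parameter names are hypothetical; a real image
    search service would define its own.
    """
    params = {
        "q": search_term,     # the user's search term
        "num": num_images,    # constraint 1: number of images to download
        "size": image_size,   # constraint 2: image size
    }
    # urlencode escapes spaces and special characters in the values.
    return base_url + "?" + urlencode(params)
```

Calling build_search_url("valley of flowers", 5, "large") yields a URL whose query string carries the escaped term and both constraints.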
Threading
Threading is implemented as producer-consumer threads. The producer thread creates a new thread for every new image and puts it in a queue.
Each of these threads downloads an image and its corresponding HTML file. The consumer thread takes each thread from the queue, checks the validity of the image, and then creates an entry for it in the database. Entries are created serially to maintain consistency.
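The producer-consumer arrangement described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the project's code: fake_download stands in for the real network fetch, a non-empty byte string stands in for the validity check, and a plain list stands in for the database.

```python
import queue
import threading


def fake_download(url):
    # Hypothetical stand-in for the real download; returns empty bytes
    # for anything that does not look like a JPEG URL.
    return b"image-bytes" if url.endswith(".jpg") else b""


def fetch(url, results):
    # Each worker thread stores its result under its own key.
    results[url] = fake_download(url)


def producer(urls, work_queue, results):
    # One new thread per image; the started thread is put on the queue.
    for url in urls:
        worker = threading.Thread(target=fetch, args=(url, results))
        worker.start()
        work_queue.put((url, worker))
    work_queue.put(None)  # sentinel: no more work


def consumer(work_queue, results, database):
    # Takes threads off the queue one at a time, so database entries
    # are created serially.
    while True:
        item = work_queue.get()
        if item is None:
            break
        url, worker = item
        worker.join()          # wait for this download to finish
        data = results[url]
        if data:               # stand-in validity check
            database.append({"image_url": url, "size": len(data)})


def download_all(urls):
    work_queue = queue.Queue()
    results, database = {}, []
    p = threading.Thread(target=producer, args=(urls, work_queue, results))
    c = threading.Thread(target=consumer, args=(work_queue, results, database))
    p.start()
    c.start()
    p.join()
    c.join()
    return database
```

Because only the consumer appends to the database, no lock is needed around the entries, and they appear in the order the producer queued them.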
Surrounding text is extracted from each image's HTML page using Python's BeautifulSoup library and saved as .corpora files alongside the corresponding images.
For every new image downloaded, a new entry is created in the database with information such as the image URL, website URL, search term, and the image's file name on disk.
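A minimal sketch of pulling the text around an image out of its HTML page. The project uses BeautifulSoup; this sketch swaps in the standard library's html.parser so it runs without extra packages. It also assumes that "surrounding text" means a fixed window of words before and after the img tag, which the slides do not specify. The class name and the window size are hypothetical.

```python
from html.parser import HTMLParser


class SurroundingTextParser(HTMLParser):
    """Collects page words and remembers where the target image sits."""

    def __init__(self, image_src, window=10):
        super().__init__()
        self.image_src = image_src
        self.window = window      # words kept on each side of the image
        self.words = []
        self.image_pos = None     # index into self.words at the <img> tag
        self._skip_depth = 0      # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "img" and dict(attrs).get("src") == self.image_src:
            self.image_pos = len(self.words)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.words.extend(data.split())

    def surrounding_text(self):
        if self.image_pos is None:
            return ""
        start = max(0, self.image_pos - self.window)
        end = self.image_pos + self.window
        return " ".join(self.words[start:end])


def extract_surrounding_text(html, image_src, window=10):
    parser = SurroundingTextParser(image_src, window)
    parser.feed(html)
    return parser.surrounding_text()
```

The returned string is what would be written to the image's .corpora file.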
Module 2
[Flow diagram: Corpora -> Removing Stop Words -> Apply Stemmer Algorithm -> Concept Extraction Using Standard Ontology -> Arranging Images -> Updating the database -> Output]
Module 2 takes the corpora produced by Module 1 as input.
Irrelevant stop words are removed from the corpora using a predefined dictionary of stop words.
Ex. Input: Stopwords are common words that carry less important meaning than keywords. Usually search engines remove these words from Keyword phrase
Output: Stopwords common words important meaning keywords search engines remove stopwords keyword phrase
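Stop word removal against a predefined dictionary can be sketched as follows. The stop word set here is a tiny hypothetical sample; the project's actual dictionary is not given in the slides.

```python
import re

# Hypothetical sample; a real dictionary of stop words would be much larger.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into words, and drop stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [word for word in words if word not in stop_words]
```

For the sentence "The images on the Web are embedded in an HTML page", this keeps images, web, embedded, html, and page.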
A stemmer algorithm is used to obtain the morphological root of each word in the corpora. It is applied using Python's whoosh library.
Ex. (input -> stemmer algorithm -> output)
Chatter -> Chat
Running -> Run
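To illustrate what stemming does, here is a toy suffix-stripping function that reproduces the two example pairs above. It is not the algorithm in the whoosh library and is far cruder than a real stemmer; the suffix list and the minimum stem length are assumptions made for this sketch.

```python
# Hypothetical suffix list, checked in order.
SUFFIXES = ("ing", "er", "s")


def toy_stem(word):
    """Strip one known suffix, then undo a doubled final consonant."""
    word = word.lower()
    for suffix in SUFFIXES:
        stem = word[: len(word) - len(suffix)]
        # Only strip when at least three letters remain.
        if word.endswith(suffix) and len(stem) >= 3:
            # "runn" -> "run", "chatt" -> "chat"
            if stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word
```

A production system would rely on a tested stemmer rather than hand-written rules like these, which mishandle many English words.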
Concepts are extracted from the corpora by comparing them against a given standard ontology in OWL format.
The images are arranged on the basis of the concepts extracted for each image.
The extracted concepts and the keywords from the surrounding text are entered into the database for each image.
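A rough sketch of concept extraction, assuming the ontology has already been loaded into a mapping from concept name to the terms that signal it. The slides do not describe how the OWL file is parsed or matched, so the ONTOLOGY dictionary below is an invented stand-in, not the project's ontology.

```python
# Hypothetical stand-in for concepts loaded from an OWL ontology.
ONTOLOGY = {
    "NationalPark": {"park", "valley", "reserve"},
    "Mountain": {"himalaya", "mountain", "peak"},
    "Flower": {"flower", "bloom"},
}


def extract_concepts(stems, ontology=ONTOLOGY):
    """Return every concept with at least one term among the stems."""
    stems = set(stems)
    return {concept for concept, terms in ontology.items() if stems & terms}
```

With the stemmed words valley, flower, and india, this returns the concepts NationalPark and Flower; india matches nothing in the sample ontology and is ignored.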
A common RDF file is created for storing this information. A general database entry for an image is shown below.
<rdf:Description rdf:about="URI of IMAGE">
  <image:image_name> </image:image_name>
  <image:URL_webpage> </image:URL_webpage>
  <image:URL_image> </image:URL_image>
  <image:Search_term> </image:Search_term>
  <image:keyword> </image:keyword>
  <image:concept_name> </image:concept_name>
</rdf:Description>
Result without applying the semantic arrangement
Result after applying the algorithm
Keyword Suggestion for the User
Keywords are suggested by comparing the frequencies of the words in the surrounding text. The words with the highest frequency are shown to the user.
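Frequency-based keyword suggestion can be sketched with collections.Counter. The sketch assumes the words have already had stop words removed; the function name and the default of three suggestions are hypothetical.

```python
from collections import Counter


def suggest_keywords(words, top_n=3):
    """Return the top_n most frequent words, most frequent first."""
    counts = Counter(words)
    return [word for word, _ in counts.most_common(top_n)]
```

Given the words park, valley, park, flower, valley, park and top_n=2, the suggestions are park and then valley.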
Comparison of the Images
Two images are compared by calculating the intersection and the union of the concepts extracted from the surrounding text of both images.
Formula used:
\[
\frac{\left|\,\text{Concepts from image1} \cap \text{Concepts from image2}\,\right|}{\left|\,\text{Concepts from image1} \cup \text{Concepts from image2}\,\right|}
\]
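Assuming the comparison is the ratio of the number of shared concepts to the number of concepts in the union (commonly called the Jaccard index), it can be sketched as below. Returning 0.0 when neither image has any concepts is a choice made for this sketch, not something stated in the slides.

```python
def concept_similarity(concepts_image1, concepts_image2):
    """Size of the intersection divided by the size of the union."""
    a, b = set(concepts_image1), set(concepts_image2)
    union = a | b
    if not union:
        return 0.0  # assumption: no concepts at all means no similarity
    return len(a & b) / len(union)
```

Two images with concepts {Park, Flower} and {Park, Mountain} share one concept out of three in total, so their similarity is one third.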
A user can query the database on three attributes of an image:
1. Surrounding text keyword
2. Search term
3. Concept
The images with matching attributes are shown to the user.
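A minimal sketch of querying on these three attributes, with the database modelled as a list of dictionaries. The record layout (image_name, search_term, keywords, concepts) is an assumption made for illustration; the project stores its entries in an RDF file.

```python
def query_images(records, keyword=None, search_term=None, concept=None):
    """Return names of images matching every attribute that was given."""
    matches = []
    for record in records:
        if keyword is not None and keyword not in record["keywords"]:
            continue
        if search_term is not None and record["search_term"] != search_term:
            continue
        if concept is not None and concept not in record["concepts"]:
            continue
        matches.append(record["image_name"])
    return matches
```

Attributes left as None are ignored, so the same function serves a query on any one attribute or on a combination of them.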
Applying the arrangement of images to other domains.
Removing duplicate images by trying to find relationships between their surrounding texts.
Optimizing the data structures used.
