
ISCL Winter Semester 2007: IR Midterm Exam

17 December 2007. Non-electronic documents and calculators are permitted.

Name:                                Semester:

Exercise 1: Definitions
Define the following terms: tokenization, permuterm index, champion list.
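As a pointer for the second term: a permuterm index stores every rotation of each vocabulary term followed by an end marker $, so that a wildcard query such as hel*o can be rotated to o$hel* and answered with a single prefix lookup. A minimal sketch, with permuterm_keys as a hypothetical helper name:

    def permuterm_keys(term):
        """Generate all rotations of term + '$', the keys stored in a permuterm index."""
        augmented = term + "$"
        return [augmented[i:] + augmented[:i] for i in range(len(augmented))]

    # A query like "hel*o" is rotated to "o$hel*" and answered by a prefix
    # lookup for "o$hel" over these keys.
    print(permuterm_keys("hello"))
    # ['hello$', 'ello$h', 'llo$he', 'lo$hel', 'o$hell', '$hello']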

Exercise 2: Characteristics of a collection and its index


Consider a collection made of 500 000 documents, each containing on average 800 words. The number of different words (i.e., not counting duplicates) is estimated at 700 000. For all questions, give your computation.

What is the size (in megabytes or gigabytes) of the collection when stored uncompressed on disk?
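A back-of-the-envelope sketch, assuming roughly 6 bytes per word (about 5 characters plus a separator, at 1 byte per character; substitute the constant used in class):

    docs = 500_000
    words_per_doc = 800
    bytes_per_word = 6          # assumption: ~5 chars + 1 separator, 1 byte each

    collection_bytes = docs * words_per_doc * bytes_per_word
    print(collection_bytes / 10**9, "GB")   # 2.4 GB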

Assuming the best dictionary reduction rate achievable with linguistic preprocessing (stop words, stemming), what is the size (number of terms) of the dictionary?

Consider an index where the average length of a non-positional postings list is 200. What is the estimated total number of postings in this index?

How many bytes would you allocate for encoding (without compression) a dictionary term? A non-positional posting?

What are the sizes (in megabytes or gigabytes) of the resulting dictionary and postings lists?

If you compress your dictionary using the dictionary-as-a-string method, what is the new size of the dictionary?
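A sketch tying the remaining estimates together. The reduction rate, per-term and per-posting byte budgets, and average term length below are assumptions in the spirit of the usual textbook conventions; substitute the values used in class. The postings total multiplies the (reduced) dictionary size by the average list length of 200:

    vocabulary = 700_000
    reduction_rate = 1 / 3          # assumed best reduction from stop words + stemming
    dictionary_terms = int(vocabulary * (1 - reduction_rate))

    avg_postings_per_term = 200
    total_postings = dictionary_terms * avg_postings_per_term

    bytes_per_term = 20             # assumed fixed-width term slot
    bytes_per_posting = 4           # assumed 4-byte document ID
    dictionary_bytes = dictionary_terms * bytes_per_term
    postings_bytes = total_postings * bytes_per_posting

    # Dictionary-as-a-string: terms are concatenated into one long string and
    # each entry keeps a 3-byte pointer into it, so short terms no longer
    # waste a fixed-width slot.
    avg_term_length = 8             # assumed average term length in bytes
    string_dictionary_bytes = dictionary_terms * (avg_term_length + 3)

    print(dictionary_terms, total_postings, dictionary_bytes, postings_bytes,
          string_dictionary_bytes)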

Exercise 3: Querying an index


What kinds of queries can be applied to the collection? For each of them, what index is needed?

Exercise 4: Linguistic preprocessing


Are the following statements true or false (justify your answer)?

a) Stemming increases retrieval precision.

b) Stemming only slightly reduces the size of the dictionary.

c) Stop lists contain all of the most frequent terms.

Exercise 5: Porter stemming


What would be the result of applying the Porter stemmer to the following words?

busses, rely, realised

What is the Porter measure of the following words (give your computation)?

crepuscular
rigorous
placement
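A sketch of how the Porter measure m can be computed: every word is mapped to the form [C](VC)^m[V], where C and V are maximal runs of consonants and vowels (with y counting as a vowel when preceded by a consonant), and m counts the VC pairs:

    def is_consonant(word, i):
        c = word[i]
        if c in "aeiou":
            return False
        if c == "y":                     # y is a vowel when preceded by a consonant
            return i == 0 or not is_consonant(word, i - 1)
        return True

    def porter_measure(word):
        classes = ["c" if is_consonant(word, i) else "v" for i in range(len(word))]
        # collapse runs of identical classes into single C / V symbols
        collapsed = [classes[0]]
        for cls in classes[1:]:
            if cls != collapsed[-1]:
                collapsed.append(cls)
        # m is the number of VC pairs in [C](VC)^m[V]
        return sum(1 for a, b in zip(collapsed, collapsed[1:])
                   if a == "v" and b == "c")

    # Examples from Porter's paper: tree -> 0, trouble -> 1, oaten -> 2
    print(porter_measure("tree"), porter_measure("trouble"), porter_measure("oaten"))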

Exercise 6: Index architecture


Propose a MapReduce architecture for creating language-specific indexes from a heterogeneous collection. You may illustrate this architecture with a figure.
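One possible shape for the answer, as a minimal single-process sketch (detect_language and tokenize are hypothetical helpers): the map phase keys each (term, docID) pair by detected language, the shuffle groups by (language, term), and the reduce phase emits the postings lists of one index per language:

    from collections import defaultdict

    def mapper(doc_id, text, detect_language, tokenize):
        # Emit ((language, term), doc_id) pairs; language detection routes
        # each document's postings to the right index.
        lang = detect_language(text)
        for term in tokenize(text, lang):
            yield (lang, term), doc_id

    def reducer(key, doc_ids):
        # One call per (language, term) group after the shuffle phase.
        lang, term = key
        return lang, term, sorted(set(doc_ids))

    def build_indexes(docs, detect_language, tokenize):
        # Single-process simulation of the shuffle between map and reduce.
        grouped = defaultdict(list)
        for doc_id, text in docs.items():
            for key, doc in mapper(doc_id, text, detect_language, tokenize):
                grouped[key].append(doc)
        indexes = defaultdict(dict)      # language -> {term: postings list}
        for key, doc_ids in grouped.items():
            lang, term, postings = reducer(key, doc_ids)
            indexes[lang][term] = postings
        return indexes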

Exercise 7: Index compression


What is the largest gap that can be encoded in 2 bytes using variable-byte encoding?

What postings list can be decoded from the variable-byte code 10001001 00000001 10000010 11111111?
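Under the usual convention (7 payload bits per byte, high bit set only on the last byte of each gap), two bytes carry at most 14 payload bits, and a decoder is a few lines. A sketch:

    def vb_decode(bytestream):
        # Gaps are accumulated 7 bits at a time; a set high bit ends one gap.
        gaps, n = [], 0
        for b in bytestream:
            if b < 128:                  # continuation byte
                n = 128 * n + b
            else:                        # final byte of this gap
                gaps.append(128 * n + (b - 128))
                n = 0
        return gaps

    gaps = vb_decode([0b10001001, 0b00000001, 0b10000010, 0b11111111])
    print(gaps)                          # [9, 130, 127]
    # turn gaps back into document IDs by cumulative sums
    doc_ids = [sum(gaps[:i + 1]) for i in range(len(gaps))]
    print(doc_ids)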

What would be the encoding of the same postings list using a γ-code?
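And a sketch of γ-encoding a gap G ≥ 1: the offset (the binary representation of G with its leading 1 removed) is preceded by the unary code of the offset's length:

    def gamma_encode(gap):
        # gap must be >= 1: gamma codes cannot represent 0
        offset = bin(gap)[3:]            # binary representation minus the leading 1
        length = "1" * len(offset) + "0" # unary code for len(offset)
        return length + offset

    print([gamma_encode(g) for g in (9, 130, 127)])
    # ['1110001', '111111100000010', '1111110111111']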

Exercise 8: Vector Space Model


Consider a collection made of the documents d1, d2, d3 with the following characteristics:

Term       tf_d1   tf_d2   tf_d3   df
actor      12      35      55      123
movie      15      24      48      240
trailer    52      13      12      85

Compute the vector representations of d1, d2, and d3 using tf-idf_{t,d} weighting and Euclidean normalization.

Compute the cosine similarities between these documents.

Give the ranking retrieved by the system for the query "movie trailer".
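A sketch of the whole pipeline. The total number of documents N is not restated in this exercise; N = 500 000 from Exercise 2 is one plausible reading, and the value below is that assumption:

    import math

    N = 500_000                          # assumed total document count (Exercise 2)
    terms = ["actor", "movie", "trailer"]
    df = {"actor": 123, "movie": 240, "trailer": 85}
    tf = {
        "d1": {"actor": 12, "movie": 15, "trailer": 52},
        "d2": {"actor": 35, "movie": 24, "trailer": 13},
        "d3": {"actor": 55, "movie": 48, "trailer": 12},
    }

    def tf_idf(weights):
        return [weights[t] * math.log10(N / df[t]) for t in terms]

    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    def cosine(u, v):                    # on unit vectors, cosine = dot product
        return sum(a * b for a, b in zip(u, v))

    vectors = {d: normalize(tf_idf(tf[d])) for d in tf}

    for a, b in [("d1", "d2"), ("d1", "d3"), ("d2", "d3")]:
        print(a, b, round(cosine(vectors[a], vectors[b]), 3))

    # query "movie trailer": tf = 1 for each query term, 0 for "actor"
    q = normalize(tf_idf({"actor": 0, "movie": 1, "trailer": 1}))
    print(sorted(tf, key=lambda d: cosine(vectors[d], q), reverse=True))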

Exercise 9: Term weighting


Compute the vector representations of the documents introduced in the previous exercise using the ltn weighting scheme.
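A sketch of the ltn scheme in SMART notation (l: logarithmic term frequency 1 + log10(tf); t: idf log10(N/df); n: no normalization), under the same assumed N as above:

    import math

    N = 500_000                          # same assumption as in the previous sketch
    terms = ["actor", "movie", "trailer"]
    df = {"actor": 123, "movie": 240, "trailer": 85}
    tf = {
        "d1": {"actor": 12, "movie": 15, "trailer": 52},
        "d2": {"actor": 35, "movie": 24, "trailer": 13},
        "d3": {"actor": 55, "movie": 48, "trailer": 12},
    }

    def ltn(weights):
        # l: logarithmic tf, t: idf, n: no length normalization
        return [(1 + math.log10(weights[t])) * math.log10(N / df[t])
                if weights[t] > 0 else 0.0
                for t in terms]

    for d in tf:
        print(d, [round(x, 3) for x in ltn(tf[d])])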

Exercise 10: Index architecture (extra credit)


Consider a hashtable as a structure mapping keys to values using a hash function h such that h(key) = value. What problem may arise from such a structure when inserting new key-value pairs?

What workaround would you propose for this insertion? Give an algorithm for inserting a key-value pair.
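One standard workaround is separate chaining: each bucket holds a small list of key-value pairs, so two keys that hash to the same bucket coexist instead of overwriting each other. A minimal sketch:

    class ChainedHashTable:
        def __init__(self, n_buckets=1024):
            self.buckets = [[] for _ in range(n_buckets)]

        def insert(self, key, value):
            bucket = self.buckets[hash(key) % len(self.buckets)]
            for i, (k, _) in enumerate(bucket):
                if k == key:             # key already present: update in place
                    bucket[i] = (key, value)
                    return
            bucket.append((key, value))  # collision or empty slot: chain the pair

        def get(self, key):
            bucket = self.buckets[hash(key) % len(self.buckets)]
            for k, v in bucket:
                if k == key:
                    return v
            raise KeyError(key)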
