

Chapter 1

INTRODUCTION

BACKGROUND OF THE STUDY

The University of Trento is an Italian university located near Rovereto, and the researcher Natallia Kokash stated in her study that, among the variety of topics that relate to computation, the most important are heuristic algorithm validation, complexity estimation, and optimization. A wide part of theoretical computer science deals with these tasks. The complexity of tasks in general is examined by studying the most relevant computational resources, such as execution time and space (Kokash, 2002).

Kokash added that classifying the problems that are solvable with a given limited amount of time and space into well-defined classes is a very intricate task, but it can save a great deal of the time and money spent on heuristic algorithm design. A vast collection of papers has been dedicated to algorithm development. With regard to the quality issue, the goal of a heuristic algorithm is to find as good a solution as possible for all instances of the problem. There are general heuristic strategies that have been successfully applied to manifold problems.



Kokash also said that computers are nowadays used to solve incredibly complex problems, but in order to cope with a problem, the researcher must first develop an algorithm. Sometimes the human brain is not able to accomplish this task. Moreover, exact algorithms might need centuries to cope with formidable challenges. In such cases, heuristic algorithms that find approximate solutions but have acceptable time and space complexity play an indispensable role.

Her study surveys heuristics, their areas of application, and the basic ideas underlying them. The researcher also described in more detail some modern heuristic techniques, namely Evolutionary Algorithms and Support Vector Machines.

Grossman and Ophir (2004) stated in their study entitled Information Retrieval: Algorithms and Heuristics that since the beginning of civilization, human beings have focused on written communication. From cave drawings to scroll writings, from printing presses to electronic libraries, communication has been of primary concern to man's existence. Today, with the emergence of digital libraries and electronic information exchange, there is a clear need for improved techniques to organize large quantities of information. Applied and theoretical research and development in the areas of information authorship, processing, storage, and retrieval is of interest to all sectors of the community.

In this study, they surveyed recent research efforts focused on the electronic searching and retrieval of documents. Their focus is strictly on the retrieval of information in response to users' queries; that is, they discussed algorithms and approaches for information retrieval. A static, or relatively static, document collection is indexed prior to any user query. A query is issued, and a set of documents deemed relevant to the query is ranked based on their computed similarity to the query and presented to the user.

Numerous techniques exist to determine how these documents are ranked, and that is the key focus of their study (effectiveness). Other techniques also exist to rank documents quickly, and these are discussed as well (efficiency).

Televisions, smartphones, and computers are just some of the products of modern technology created by hardworking, competent inventors. Technology is advancing rapidly, and with the help of intelligent researchers and inventors we can shape a future of efficient and effective performance. Technology is a field that deals with engineering and the applied sciences, and it is now everywhere, spread throughout society.

Yet the benefits of technology have sometimes been misused instead of being put to more appropriate ends. The technology we have today is a huge help to us, especially in calculation, message exchange, entertainment, and many other areas.

Saint Joseph Institute of Technology, commonly known as SJIT, is one of the premier institutions in the region that embraces changes in technology. The institute is ISO 9001:2008 certified and is nationally accredited by the Philippine Association of Colleges and Universities Commission on Accreditation (PACU-COA). The College of Business and Information Technology, being one of the colleges of SJIT, embraces the thrust of the institute as well. CBIT offers seven (7) courses, namely Bachelor of Science in Computer Science (BSCS), Bachelor of Science in Information Technology (BSIT), Bachelor of Science in Hotel and Restaurant Management (BSHRM), Bachelor of Science in Tourism (BST), Bachelor of Science in Business Administration (BSBA), Associate in Computer Technology (ACT), and Associate in Hotel and Restaurant Management (AHRM). The College of Business and Information Technology (CBIT) was established in 2013 as a merger of the College of Information Technology (CIT) and the College of Business Management (CBM). It already makes use of computerized systems in the conduct of its transactions (e.g., enrollment system and grading system). However, like many colleges and institutions in the region, CBIT still has many issues to tackle. One such issue that the researchers of this study would like to address is the lack of a mechanism for evaluating certain aspects of the enrollment process, namely the evaluation of subjects and subject scheduling.

Lastly, the main problem that the researchers attempt to solve with this study is to determine the most reliable and usable heuristic algorithm. A heuristic algorithm is a technique designed to solve a problem more quickly when classic methods are too slow, or to find an approximate solution when classic methods fail to find any exact solution. This is achieved by trading optimality, completeness, accuracy, or precision for speed.

STATEMENT OF THE PROBLEM

The researchers of this study would like to address the issues and concerns mentioned above by answering the following specific problems:

1. Among the different kinds of heuristic algorithms, which performs best in searching and retrieving files?

2. Among the heuristic algorithms that will be used, which performs best in terms of speed and accuracy?

3. Among the heuristic algorithms that will be used, which has the most optimal performance?

4. Among the heuristic algorithms that will be used, which is the most reliable?

OBJECTIVES OF THE STUDY

The researchers of this study wish to address the abovementioned issues through the following objectives:

1. To differentiate the different kinds of heuristic algorithms.

2. To evaluate the speed and accuracy of each heuristic algorithm.

3. To determine the optimality of each algorithm.

4. To compare the reliability of the algorithms' overall results.

Significance of the Study

The findings generated in this study will serve as a contribution to the colleges of Saint Joseph Institute of Technology (SJIT), particularly in the development of an e-book file retrieval operation using heuristic algorithms.

Faculty. The system serves as a tool for educators who would like to impart the knowledge they have acquired through their academic experience. Faculty members can engage students in learning by searching for a particular e-book, which gives them sources for the topics they want to present during discussion.

Students. The system serves as a tool for learners who want to search for a particular e-book to use as a source for projects, reports, and other educational purposes. The system also aims to develop their critical thinking by helping them acquire knowledge from e-book sources.

Colleges/Departments. The system divides the e-books by college. It helps keep each college's e-books organized and makes it clear which e-books belong to which department.



Library. The system provides part of the library's book-searching features for locating desired books. What distinguishes our system is that it can easily locate and retrieve a particular e-book for educational purposes.

Researchers. The study also contributes to committed and determined researchers who want to study algorithms related to our work, add to their knowledge, and innovate further.

Scope and Limitation

This study focuses on the development of heuristic algorithms that are able to search for a specific e-book or PDF file in a static file directory on a personal computer or wherever the system is installed.

File retrieval, according to thefreedictionary.com, means the process of accessing information from memory or other storage devices. It is one of the common elements of file searching, and this feature is already integrated into many platforms, such as the Microsoft and Apple operating systems.

The system provides a panel for each algorithm, and each panel has a dashboard that displays a graphical presentation of the evaluation in terms of speed, accuracy, and optimality.

The study is limited to the development of three to five algorithms by the researchers; out of the algorithms developed, the researchers aim to identify the best heuristic algorithm for file searching in terms of speed, optimality, and accuracy. Each college of Saint Joseph Institute of Technology (SJIT) will have a folder for the storage of e-books pertaining to the department's specific courses.

For the system features, the researchers will develop a platform for the algorithms that visualizes which algorithm excels the most in terms of retrieval time or speed, accuracy, and optimality. It will be developed in Visual Studio 2010 using the Visual Basic .NET programming language.

The following are the system requirements for the development of the study:

Windows 8 operating system, 64/32-bit

Minimum screen resolution of 1366x768

Microsoft .NET Framework 3.5 or a higher version

Furthermore, anything not stated above is automatically outside the scope of our study.



Chapter 2

Review of Related Literature and Studies

This chapter presents the related literature and studies gathered after a thorough and in-depth search by the researchers. It also presents the synthesis of the theoretical and conceptual frameworks needed to fully understand the research to be done and, lastly, the definition of terms for better comprehension of the study.

This study considered some articles taken from the internet as reliable sources.

Related Literature

Natallia Kokash (2002) states that, among the variety of topics that relate to computation, the most important are heuristic algorithm validation, complexity estimation, and optimization. The complexity of tasks in general is examined by studying the most relevant computational resources, such as execution time and space. A vast collection of papers has been dedicated to algorithm development. With regard to the quality issue, the goal of a heuristic algorithm is to find as good a solution as possible for all instances of the problem.

There are general heuristic strategies that have been successfully applied to manifold problems.

Optimization: The process of finding the most desirable or satisfactory solution.

Execution time: The time during which a program is running.

Moreover, exact algorithms might need centuries to cope with formidable challenges. In such cases, heuristic algorithms that find approximate solutions but have acceptable time and space complexity play an indispensable role. Her study surveys heuristics, their areas of application, and the basic ideas underlying them.

Grossman and Ophir (2004) studied how a query is ranked against a document collection using either a single retrieval strategy or a combination of strategies, and how an assortment of utilities is integrated into the query processing scheme to improve these rankings. Methods for building and compressing text indexes, querying and retrieving documents in multiple languages, and using parallel or distributed processing to expedite the search are likewise described.

Retrieval Strategies: These assign a measure of similarity between a query and a document. They are based on the common notion that the more often terms are found in both the document and the query, the more relevant the document is deemed to be to the query (a short illustrative sketch follows these definitions). Some of these strategies employ countermeasures to alleviate problems that occur due to the ambiguities inherent in language, namely the reality that the same concept can often be described with many different terms.

Retrieval Utilities: Many different utilities improve the results of a retrieval strategy. Most utilities add or remove terms from the initial query in an attempt to refine it. Others simply refine the focus of the query by using subdocuments or passages instead of whole documents. The key point is that each of these utilities (although rarely presented as such) is a plug-and-play utility that operates with any arbitrary retrieval strategy.

Efficiency: Techniques used to speed up query processing while preserving its effectiveness in terms of precision and recall.
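To make the term-overlap notion above concrete, the following is a minimal illustrative sketch in Visual Basic .NET (the language used for the system in this study); it is not taken from Grossman and Ophir, and the scoring rule (summing whole-word occurrences of each query term in the document) is our own simplified assumption.

```vb
Imports System
Imports System.Text.RegularExpressions

Module SimilarityDemo
    ' Illustrative similarity score: for each query term, count how many
    ' whole-word occurrences appear in the document text and sum the counts.
    ' This mirrors the idea that more shared terms imply more relevance.
    Function Similarity(ByVal query As String, ByVal document As String) As Integer
        Dim score As Integer = 0
        For Each term As String In query.ToLower().Split(" "c)
            If term.Length = 0 Then Continue For
            score += Regex.Matches(document.ToLower(), "\b" & Regex.Escape(term) & "\b").Count
        Next
        Return score
    End Function

    Sub Main()
        Dim doc As String = "Heuristic algorithms trade accuracy for speed in search."
        Console.WriteLine(Similarity("heuristic search", doc)) ' prints 2
    End Sub
End Module
```

A real retrieval strategy would normalize such counts (for example, by document length or term rarity), but the raw overlap count is enough to show how the ranking signal is computed.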

In this study, they surveyed recent research efforts focused on the electronic searching and retrieval of documents. Their focus is strictly on the retrieval of information in response to users' queries; that is, they discussed algorithms and approaches for information retrieval.

2.1 Heuristic Algorithm

As defined by Kenny (2014), a heuristic is a function that ranks alternatives in search algorithms at each branching step, based on available information, to decide which branch to follow. The term heuristic is used for algorithms which find solutions among all possible ones but do not guarantee that the best will be found; therefore, they may be considered approximate rather than exact algorithms. Occasionally these algorithms can be accurate, that is, they actually find the best solution, but the algorithm is still called a heuristic until that solution is proven to be the best.



According to Nathal (2014), a heuristic algorithm is one that is designed to solve a problem in a faster and more efficient fashion than traditional methods by sacrificing optimality, accuracy, precision, or completeness for speed. Heuristic algorithms are often used to solve NP-complete problems, a class of decision problems for which there is no known way to find a solution quickly and accurately, although candidate solutions can be verified efficiently when given.

Heuristics can produce a solution on their own or can provide a good baseline that is then supplemented with optimization algorithms. Heuristic algorithms are most often employed when approximate solutions are sufficient and exact solutions are computationally expensive.



CHAPTER 3

METHODOLOGY

This chapter presents the e-book file retrieval method, the algorithm flowchart designs, and the research instruments.

Algorithm Design

Three (3) algorithms for e-book file retrieval were designed and conceptualized by the researchers, and tests were conducted on the system to measure each algorithm's attributes, namely accuracy, speed or algorithmic efficiency, and optimality.

Accuracy: The proximity of the measurement results to the true value.

Speed or algorithmic efficiency: The properties of an algorithm that relate to the amount of computational resources used by the algorithm (a timing sketch follows these definitions). An algorithm must be analyzed to determine its resource usage. Algorithmic efficiency can be thought of as analogous to engineering productivity for a repeating or continuous process.

Optimality of the algorithm: The quality of the algorithm's results relative to the resources it consumes.
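As a concrete illustration of how the speed attribute is measured, the following is a minimal Visual Basic .NET sketch using the standard Stopwatch class; the SearchPdfFiles routine and the D:\E-Books folder are hypothetical placeholders standing in for any of the three search algorithms described later.

```vb
Imports System
Imports System.Diagnostics
Imports System.IO

Module TimingDemo
    ' Hypothetical stand-in for one of the three search algorithms:
    ' here it simply enumerates every PDF under the given root folder.
    Function SearchPdfFiles(ByVal root As String) As String()
        Return Directory.GetFiles(root, "*.pdf", SearchOption.AllDirectories)
    End Function

    Sub Main()
        Dim watch As Stopwatch = Stopwatch.StartNew()
        Dim found As String() = SearchPdfFiles("D:\E-Books")
        watch.Stop()

        ' Execution time is reported in seconds and milliseconds,
        ' matching the units used later in the file data analysis.
        Console.WriteLine("PDF files found: " & found.Length)
        Console.WriteLine("Elapsed seconds: " & watch.Elapsed.TotalSeconds)
        Console.WriteLine("Elapsed milliseconds: " & watch.ElapsedMilliseconds)
    End Sub
End Module
```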

An algorithm is a self-contained sequence of actions to be performed. Algorithms

perform calculation, data processing, and/or automated reasoning tasks. It is

an effective method that can be expressed within a finite amount of space and

time and in a well-defined formal language for calculating a function. Starting from

an initial state and initial input (perhaps empty), the instructions describe

a computation that, when executed, proceeds through a finite number of well-

defined successive states, eventually producing "output" and terminating at a final

ending state. The transition from one state to the next is not

necessarily deterministic; some algorithms, known as randomized algorithms,

incorporate random input.

The concept of algorithm has existed for centuries; however, a partial formalization of what would become the modern algorithm began with attempts to solve the Entscheidungsproblem (the "decision problem") posed by David Hilbert in 1928. Subsequent formalizations were framed as attempts to define "effective calculability" or "effective method"; those formalizations included the Gödel-Herbrand-Kleene recursive functions of 1930, 1934 and 1935, Alonzo Church's lambda calculus of 1936, Emil Post's "Formulation 1" of 1936, and Alan Turing's Turing machines of 1936-7 and 1939.

Another algorithmic concept is backtracking. Backtracking is a general algorithm for finding all (or some) solutions to certain computational problems, notably constraint satisfaction problems, that incrementally builds candidate solutions and abandons each partial candidate c ("backtracks") as soon as it determines that c cannot possibly be completed to a valid solution.
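As an illustration of the backtracking idea only, and not part of the file retrieval algorithms themselves, the following is a minimal Visual Basic .NET sketch that searches for a subset of numbers adding up to a target value; the branch is abandoned as soon as a partial candidate can no longer lead to a valid solution.

```vb
Imports System
Imports System.Collections.Generic

Module BacktrackingDemo
    ' Try to extend the partial candidate "chosen" using items(index) onward.
    ' The branch is abandoned ("backtracks") as soon as the running sum
    ' exceeds the target, since it can no longer become a valid solution.
    Function FindSubset(ByVal items As Integer(), ByVal index As Integer, ByVal sum As Integer, ByVal target As Integer, ByVal chosen As List(Of Integer)) As Boolean
        If sum = target Then Return True
        If sum > target OrElse index >= items.Length Then Return False

        ' Branch 1: include items(index) in the candidate.
        chosen.Add(items(index))
        If FindSubset(items, index + 1, sum + items(index), target, chosen) Then Return True
        chosen.RemoveAt(chosen.Count - 1) ' undo the choice (backtrack)

        ' Branch 2: skip items(index).
        Return FindSubset(items, index + 1, sum, target, chosen)
    End Function

    Sub Main()
        Dim chosen As New List(Of Integer)
        If FindSubset(New Integer() {3, 9, 8, 4}, 0, 0, 12, chosen) Then
            For Each n As Integer In chosen
                Console.Write(n & " ") ' prints 3 9
            Next
            Console.WriteLine()
        End If
    End Sub
End Module
```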



Algorithm 1. This method gets all the drives in the computer, or all the available drives that the algorithm can search, and then gets the top-level directories of each drive. It searches through the sub-folders of the uppermost directory and continues until the last folder under the uppermost level of the drive has been visited, and then does the same for the next drive.

While the algorithm is searching through the sub-folders of the drives, it also searches for PDF files. When the algorithm finds a PDF file, it automatically converts it to a string and determines whether the user's keyword occurs in that string.
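The following is a minimal Visual Basic .NET sketch, under our own simplifying assumptions, of the depth-first traversal described above. The ExtractPdfText function is a hypothetical placeholder for the PDF-to-string conversion step, which in practice would rely on a PDF library.

```vb
Imports System
Imports System.IO

Module Algorithm1Sketch
    ' Hypothetical placeholder: real code would call a PDF library here
    ' to convert the document's pages into plain text.
    Function ExtractPdfText(ByVal pdfPath As String) As String
        Return ""
    End Function

    ' Depth-first traversal: check the PDFs in the current folder,
    ' then descend into each sub-folder before moving on.
    Sub SearchFolder(ByVal folder As String, ByVal keyword As String)
        Try
            For Each pdfPath As String In Directory.GetFiles(folder, "*.pdf")
                Dim text As String = ExtractPdfText(pdfPath)
                If text.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0 Then
                    Console.WriteLine("Keyword found in: " & pdfPath)
                End If
            Next
            For Each subFolder As String In Directory.GetDirectories(folder)
                SearchFolder(subFolder, keyword)
            Next
        Catch ex As UnauthorizedAccessException
            ' Skip folders the process is not allowed to read.
        End Try
    End Sub

    Sub Main()
        ' Start from the top-level directory of every available drive.
        For Each drive As DriveInfo In DriveInfo.GetDrives()
            If drive.IsReady Then SearchFolder(drive.RootDirectory.FullName, "heuristic")
        Next
    End Sub
End Module
```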

Algorithm Flowchart

Figure 1. Flowchart of Algorithm 1.



Figure 1 shows that Algorithm 1 traverses folders and their levels of sub-folders, which represent the algorithm's flow. At the parent level of the drive (D://), Algorithm 1 goes into the first folder (F1) and checks whether a PDF is available inside this parent folder. If the algorithm finds that a PDF is available inside the parent folder, it then moves to the sub-folders at level 1, F2 and F7, which are the children of the parent folder F1.

Meanwhile, at the level 1 folder (F2), Algorithm 1 likewise goes in to check whether a PDF is available inside, and if it finds a PDF in the level 1 folder F2, it automatically proceeds to the next level (level 2). The process of getting the PDF files in each folder (F3 and F4) stays the same until the last level of the tree (level 4) is reached. When the algorithm reaches the last level (level 4, F4) and finds that no more PDFs are available inside, it automatically goes into its sub-folder (F5) to check whether that folder contains a PDF.

When the algorithm reaches the level 4 folder (F5) and finds that no more PDFs are available inside, it goes back to the third level (F3). Since that folder has already been checked for PDFs, the algorithm moves on to the next sub-folder (F6). The process of checking for an available PDF in the level 3 folder F6 is the same as for F5, F2, and F7, and continues until the algorithm returns to the level 1 folder (F1).



Algorithm 2

The method of this algorithm is to collect all the selected directories inside the drives and then search for all the PDF files in those directories, continuing until the algorithm has completely finished gathering the PDF files (a minimal sketch is given below). A directory is a file system cataloging structure which contains references to other computer files, and possibly to other directories. On many computers, directories are known as folders, or drawers, by analogy with a workbench or the traditional office file cabinet.
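The following is a minimal Visual Basic .NET sketch, under our own simplifying assumptions, of this collect-then-search approach: every directory under the selected roots is gathered into a list first, and only then is each collected directory scanned for PDF files. The folder names are hypothetical examples.

```vb
Imports System
Imports System.Collections.Generic
Imports System.IO

Module Algorithm2Sketch
    Sub Main()
        ' Step 1: collect every directory under the selected roots
        ' (assumed to be accessible example folders).
        Dim selectedRoots As String() = New String() {"D:\CBIT", "D:\Library"}
        Dim allDirectories As New List(Of String)
        For Each root As String In selectedRoots
            allDirectories.Add(root)
            allDirectories.AddRange(Directory.GetDirectories(root, "*", SearchOption.AllDirectories))
        Next

        ' Step 2: search each collected directory for PDF files.
        Dim pdfFiles As New List(Of String)
        For Each dirPath As String In allDirectories
            pdfFiles.AddRange(Directory.GetFiles(dirPath, "*.pdf"))
        Next

        Console.WriteLine("PDF files found: " & pdfFiles.Count)
    End Sub
End Module
```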

Algorithm Flowchart

Figure 2. Flowchart of Algorithm 2.



Figure 2 shows the flow of Algorithm 2. First, the algorithm goes into the parent folder (D://) to determine what is inside the drive. After the parent folder, Algorithm 2 proceeds to the next stage, which is to collect all the directories in the drive and then to search for all PDF files in the collected directories.

Algorithm 3

The method of this algorithm is to collect all the directories inside the drives and search for all the PDF files in the collected directories. Algorithm 3 has two kinds of process flowchart, one for the first run and one for the second run (a minimal sketch of both runs is given below). A flowchart is a graphical representation of a computer program in terms of its sequence of functions.
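The following is a minimal Visual Basic .NET sketch, under our own simplifying assumptions, of the two runs shown in Figures 3 and 5: the first run walks the drive and stores the paths of the PDF files it finds in a plain text file that serves as an index, and the second run reads that text file and searches only the listed files. The index file name and root folder are hypothetical, and for brevity the keyword is matched against file names here, whereas the actual system matches it against the extracted PDF text.

```vb
Imports System
Imports System.IO

Module Algorithm3Sketch
    ' Assumed location of the text-file index built during the first run.
    Private Const IndexPath As String = "D:\pdf_index.txt"

    ' First run: collect every PDF path under the root and store it
    ' in a text file that acts as an index for later searches.
    Sub FirstRun(ByVal root As String)
        Dim pdfPaths As String() = Directory.GetFiles(root, "*.pdf", SearchOption.AllDirectories)
        File.WriteAllLines(IndexPath, pdfPaths)
    End Sub

    ' Second run: read the index instead of walking the folders again,
    ' so the search goes directly to the previously collected files.
    Sub SecondRun(ByVal keyword As String)
        For Each pdfPath As String In File.ReadAllLines(IndexPath)
            If Path.GetFileName(pdfPath).IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0 Then
                Console.WriteLine("Match: " & pdfPath)
            End If
        Next
    End Sub

    Sub Main()
        FirstRun("D:\E-Books")
        SecondRun("heuristic")
    End Sub
End Module
```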

Algorithm Flowchart

Figure 3. Flowchart of Algorithm 3, First Run.



Figure 3 (first run) shows that Algorithm 3 likewise traverses folders and their levels of sub-folders, which represent the algorithm's flow. At the parent level of the drive (D://), this algorithm follows the same process as Algorithm 1: it goes into the folder (F1) to check whether a PDF is available inside. Unlike Algorithm 1, however, when Algorithm 3 finds that a PDF is available inside the folder (F1), it does not proceed to the next level of folders right away; instead, the PDFs collected by the algorithm are recorded for that level in a text file. This process is repeated until the algorithm reaches the last level of the tree (level 4, F4).

Algorithm Flowchart

Figure 5. Flowchart of Algorithm 3, Second Run.



Figure 5 shows that the algorithm first enters the drives and collects the PDF files inside them, and the collected PDF file entries are stored in a notepad (text) file. When the algorithm has finished storing the files in the notepad, it searches all the collected files in the selected directories.

Research Instrument

Internet:
It is the global system of interconnected computer networks that use

the Internet protocol suite (TCP/IP) to link devices worldwide. It is a network of

networks that consists of private, public, academic, business, and government

networks of local to global scope, linked by a broad array of electronic, wireless,

and optical networking technologies.

The Internet carries an extensive range of information resources and

services, such as the inter-linked hypertext documents and applications of

the World Wide Web (WWW), electronic mail, telephony, and peer-to-

peer networks for file sharing.

The origins of the Internet date back to research commissioned by

the United States federal government in the 1960s to build robust, fault-tolerant

communication via computer networks. The primary precursor network,

the ARPANET, initially served as a backbone for interconnection of regional

academic and military networks in the 1980s. The funding of the National Science

Foundation Network as a new backbone in the 1980s, as well as private funding for

other commercial extensions, led to worldwide participation in the development of

new networking technologies, and the merger of many networks.

The linking of commercial networks and enterprises by the early 1990s

marks the beginning of the transition to the modern Internet, and generated a

sustained exponential growth as generations of institutional, personal,

and mobile computers were connected to the network. Although the Internet was

widely used by academia since the 1980s, the commercialization incorporated its

services and technologies into virtually every aspect of modern life.

Internet use grew rapidly in the West from the mid-1990s and from the late 1990s in the developing world. In the 20 years since 1995, Internet use has grown roughly a hundredfold, measured over one-year periods, to reach over one third of the world population.

Most traditional communications media, including telephony, radio, television, paper mail, and newspapers, are being reshaped or redefined by the Internet, giving birth to new services such as email, Internet telephony, Internet television, online music, digital newspapers, and video streaming websites. Newspaper, book, and other print publishing are adapting to website technology or being reshaped into blogging, web feeds, and online news aggregators.

The entertainment industry was initially the fastest growing segment on the

Internet. The Internet has enabled and accelerated new forms of personal

interactions through instant messaging, Internet forums, and social

networking. Online shopping has grown exponentially both for major retailers

and small businesses and entrepreneurs, as it enables firms to extend their "bricks

and mortar" presence to serve a larger market or even sell goods and services

entirely online. Business-to-business and financial services on the Internet

affect supply chains across entire industries (https://en.wikipedia.org/wiki/Internet).

System Design

This section describes how the system performs the requirements outlined in the functional requirements. Depending on the system, this can include instructions on testing specific requirements, configuration settings, or a review of functions or code.

System design is the process of defining the architecture, components,

modules, interfaces, and data for a system to satisfy specified requirements.

Systems design could be seen as the application of systems theory to product

development. There is some overlap with the disciplines of systems

analysis, systems architecture and systems engineering.

File Data Analysis

Table 1 shows the file data analysis: for each word found on a certain page of a selected PDF, it gives the average percentage of the word's occurrence on that page. The execution time is categorized in seconds (s) and milliseconds (ms). The data were generated by the system from the PDF files that were found, by processing the data gathered inside each PDF file.

Table 1. File Data Analysis.

Final Calculation

Table 2 shows the final calculation of the total average occurrence of the word in the PDFs detected by the algorithm, as well as the page count of each PDF and, lastly, the total execution time of the process of searching for the PDF files located on the drives.

Table 2. Final Calculation.



Speed Chart

Figure 6 shows the file retrieval speed on x and y axes: the x-axis stands for the elapsed time, and the y-axis is the total number of words found on a given page of the PDF. In addition, each algorithm has a corresponding color so that it is easy to determine from the chart which algorithm is faster. This speed chart represents the results of the three (3) algorithms.

Figure 6. Speed Chart of the three (3) algorithms.

Accuracy Chart

Figure 7 shows the file data accuracy in terms of percentage, used to determine whether a PDF file is the kind of reliable file that the user needs. In addition, the chart presents a color code for each algorithm so that the viewer can understand and differentiate the data gathered from each algorithm's results.

Figure 7. Accuracy Chart of the three (3) algorithms.

String Searching

In all the algorithms, string searching is very important for determining word occurrences in the PDF.

Procedure for determining the words found (a minimal sketch follows this list):

Open the PDF.

Convert each page to text.

Filter out unnecessary strings such as double spaces and new lines.
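The following is a minimal Visual Basic .NET sketch of the filtering and counting steps, assuming the page text has already been extracted from the PDF by some external library (the extraction step itself is outside this sketch).

```vb
Imports System
Imports System.Text.RegularExpressions

Module StringSearchSketch
    ' Collapse new lines and runs of whitespace into single spaces,
    ' then count whole-word occurrences of the keyword.
    Function CountOccurrences(ByVal pageText As String, ByVal keyword As String) As Integer
        Dim cleaned As String = Regex.Replace(pageText, "\s+", " ").Trim()
        Return Regex.Matches(cleaned, "\b" & Regex.Escape(keyword) & "\b", RegexOptions.IgnoreCase).Count
    End Function

    Sub Main()
        Dim sample As String = "Heuristic  search" & vbCrLf & "uses heuristic rules."
        Console.WriteLine(CountOccurrences(sample, "heuristic")) ' prints 2
    End Sub
End Module
```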

Chapter 4

Results and Discussion

This chapter presents the results and discussion of the data gathered, arranged according to the order of presentation in the statement of the problem raised in Chapter 1.

Problem 1: Among the different kinds of heuristic algorithms, which performs best in searching and retrieving files?

1.1 File Searching and Retrieving Files

1.2 Best Result

1.1 File Searching and Retrieving Files

File searching algorithms, sometimes called string matching algorithms, are an important class of string algorithms that try to find a place where one or several strings (also called patterns) occur within a larger string or text.

File retrieval is defined as the matching of some stated user query against a set of free-text records. These records can be any type of mainly unstructured text, such as newspaper articles, real estate records, or paragraphs in a manual. User queries can range from multi-sentence full descriptions of an information need to a few words.

It is also a branch of information retrieval in which the information is stored primarily in the form of text.

1.2 Best Result of the Three Algorithms

In the first test of the algorithms, the one that performed best in terms of file retrieval and file searching was Algorithm 3, because its search path uses an index that locates the PDF files.

Problem 2: Among the heuristic algorithms that will be used, which algorithm performs best in terms of speed and accuracy?

2.1 Best Result

In terms of speed, the algorithm that performed best was Algorithm 3, because Algorithm 3 has a direct path of stored directories that helps it find the PDF files easily.

Meanwhile, in terms of accuracy, the algorithm that performed best was also Algorithm 3, because the basis of its accuracy is the words found in its folders.

Problem 3: Among the heuristic algorithms that will be used, which has the most optimal performance?

3.1 Algorithm Optimality

In terms of optimality, the algorithm with the most optimal performance was Algorithm 3, because the resources consumed by Algorithm 3 were just enough for the PDF files it searched.

Problem 4: Among the heuristic algorithms that will be used, which is the most reliable?

As a whole, Algorithm 3 is the most reliable algorithm because it has its own path index, which lets it go directly to the searched PDF files in the directories.
