
Distributed Computing

Assignment 1& 2

Name: M. Ramya                Reg No: 17mis1148

Course: Distributed Computing        Course Code: SWE4003

Faculty: Prof. Maheshwari N          Semester: WIN2020

Choose a tool relevant to distributed systems in any component or domain, such as networking or analytics. The tool should be open source; download and install it, documenting the installation procedure step by step.

OpenRefine:

OpenRefine (formerly Google Refine) is a powerful tool for working with messy data: cleaning it, transforming it from one format into another, and extending it with web services and external data. It keeps your data private on your own computer: your data never leaves your machine unless you choose to share or collaborate.

Big Idea of OpenRefine:

What - A messy, unstructured, inconsistent dataset can be explored using OpenRefine. In general, it is very difficult to explore data that is full of redundancies and inconsistencies. OpenRefine provides several functions with which one can filter the data, edit out the inconsistencies, and view the result; in short, it is a tool to clean data.

Why - Spreadsheets can also be used to refine a dataset, but they are not the best tool for it, as OpenRefine cleans data in a more systematic, controlled manner. When working with historical data, we come across issues such as blank fields, duplicate records, and inconsistent formats; OpenRefine helps resolve such issues.

When - Data analysis now plays an important role in business: data analysts improve decision making, cut costs, and identify new business opportunities. Data analysis is the process of inspecting, cleaning, transforming, and modelling data with the goal of discovering useful information, suggesting conclusions, and supporting decision making. So, to ensure the accuracy of our analysis, we have to clean our data first.
What can OpenRefine do:

• Cleaning messy data
• Transformation of data: converting values to other formats, normalizing and denormalizing (a small example follows this list)
• Parsing data from web sites: OpenRefine has a URL-fetch feature and the jsoup HTML parser and DOM engine
• Adding data to a dataset by fetching it from web services
• Data normalization, column reorganization, clustering, and tracking of operations
• Exporting data
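
As a small illustrative sketch of the kind of cleaning transformation OpenRefine supports, the following GREL expressions can be typed into the Edit cells > Transform dialog; the example values are hypothetical:

    value.trim()                  removes leading and trailing whitespace from a cell
    value.toTitlecase()           normalizes capitalization, e.g. "jOHN smith" becomes "John Smith"
    value.replace("  ", " ")      collapses accidental double spaces into single spaces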

Why is OpenRefine a better tool?

OpenRefine Strengths and Weaknesses:

Strengths:

1. OpenRefine is a desktop application: it opens in the browser but runs as a local web server, so the data stays on your machine and is never uploaded to Google's servers.

2. It has facets, which are used to filter the data into subsets; these subsets can be customised and organised into meaningful groups.

3. It has a browser-based interface and can work through fairly large datasets efficiently.

4. OpenRefine has a strong data-extension feature: users can fetch related metadata from external services and correlate it with their own data.

Weakness:

1. The UI of OpenRefine is not very user friendly. Although the features and functions are strong, the interface makes OpenRefine look plain. In addition, its built-in visualization does not scale well: OpenRefine gives the user a view of the data, but the display is not large enough to reveal complex distributions.

2. Google no longer supports the tool (it is now maintained by the community as OpenRefine), which has made a few of its Google-specific features redundant.

How to install and run Refine

OpenRefine is a desktop application in that you download it, install it, and run it on
your own computer.

Requirements

1. Java JRE/JDK installed
2. A supported OS: Windows, Linux, or macOS

Release Version

• OpenRefine requires a working Java JRE; without one you will not be able to start OpenRefine.
• Download OpenRefine from the official site, https://openrefine.org.
• Install it as detailed below for your operating system:
  o Windows
  o macOS
  o Linux
• As long as OpenRefine is running, you can point your browser at http://127.0.0.1:3333/ to use it, and you can even use it in several browser tabs and windows.

Windows

Install: Once you have downloaded the .zip file, uncompress it into a folder.

Run: Run the .exe file in that folder. You should see the Command window in which OpenRefine runs; by default, it has a black background with text in a monospace font.

Shut down: When you need to shut down OpenRefine, switch to that Command
window, and press Ctrl-C. Wait until there's a message that says the shutdown is
complete. That window might close automatically, or you can close it yourself. If
you get asked, "Terminate all batch processes? Y/N", just press Y.

Upgrading: If you upgrade to a new version of OpenRefine, you may need to update your workspace. Remove the workspace.json file located in OpenRefine's data storage location; it will be regenerated when OpenRefine starts.

Windows Installation Procedure:

1. Open Firefox or Chrome and go to https://openrefine.org.

2. Download the Windows kit.

3. Once you have downloaded the .zip file, uncompress it into a folder wherever you want (such as C:\Open-Refine).

4. Run the .exe file in that folder. You should see the Command window in which OpenRefine runs; by default, it has a black background with text in a monospace font.

5. OpenRefine will then open in the browser at the local web server address (http://127.0.0.1:3333/).

Conclusion: By following this procedure, OpenRefine can be downloaded and installed successfully.

Implementation
Example 1: Fetching and Parsing HTML transforms an ebook into a structured data set by parsing the HTML and using string and array functions.

This example downloads a single web page and parses it into a structured table using Refine's built-in functions. The raw data for this example is an HTML copy of Shakespeare's Sonnets from Project Gutenberg.

Start “Sonnets” Project

Start OpenRefine and select Create Project. The import method is Clipboard, which allows entering data via copy and paste. Under "Get Data From", click Clipboard, and paste this URL into the text box: https://programminghistorian.org/assets/fetch-and-parse-data-with-openrefine/pg1105.html

After clicking Next, Refine should automatically identify the content as a line-based text file, and the default parsing options should be correct. Add the project name "Sonnets" at the top right and click Create project. This will result in a project with one column and one row.
Fetch HTML

Refine's built-in function for retrieving a list of URLs is invoked by creating a new column. Click on the menu arrow of Column 1 > Edit column > Add column by fetching urls.

Name the new column "fetch". The Throttle delay option sets a pause between requests to avoid being blocked by a server; the default is conservative. After clicking "OK", Refine will start requesting the URLs listed in the base column. In this case, there is one URL in Column 1, resulting in one cell in the fetch column that contains the full HTML source of the Sonnets web page.

Parse HTML

Much of the web page is not sonnet text and must be removed to create a clean data set. First, it is necessary to identify a pattern that isolates the desired content; items will often be nested in a unique container. To make examining the HTML easier, click on the URL in Column 1 to open the link in a new tab, then right-click on the page and choose "View Page Source". Each poem is contained inside a single <p> element, so if all paragraphs are selected, the sonnets can be extracted from that group.
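
As a rough, simplified sketch of the pattern (not the exact Project Gutenberg markup), each sonnet sits in its own paragraph, something like:

    <p>
      I<br />
      From fairest creatures we desire increase,<br />
      That thereby beauty's rose might never die,<br />
      ...
    </p>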

On the fetch column, click on the menu arrow > Edit column > Add column based on this column. Give the new column the name "parse", then click in the Expression text box.

Data in Refine can be transformed using the General Refine Expression Language (GREL). The Expression box accepts GREL functions that are applied to each cell in the existing column to create values for the new one. Starting with value, add the functions parseHtml() and select("p") in the Expression box using dot notation, resulting in: value.parseHtml().select("p"). Do not click OK at this point; simply look at the Preview to see the result of the expression.

Notice that the output on the right no longer starts with the HTML root elements (<!DOCTYPE html etc.) seen on the left. Refine represents an array as a comma-separated list enclosed in square brackets, for example [ "one", "two", "three" ]. It is necessary to use toString() or join() to convert the array into a string. The join() function concatenates an array with the specified separator; for example, the expression [ "one", "two", "three" ].join(";") results in the string "one;two;three". The slice(37, -2) function trims the array down to the sonnet paragraphs by dropping the first 37 items of front matter and the last two items at the end of the page. Thus, the final expression to create the parse column is: value.parseHtml().select("p").slice(37,-2).join("|"). Click OK to create the new column using the expression.
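
As a rough sketch of how the expression builds up step by step (the intermediate previews are only indicative, and the slice offsets are specific to this copy of the page):

    value                                                      the raw HTML source fetched into the cell
    value.parseHtml()                                          the parsed HTML document
    value.parseHtml().select("p")                              an array of every <p> element on the page
    value.parseHtml().select("p").slice(37, -2)                only the 154 sonnet paragraphs
    value.parseHtml().select("p").slice(37, -2).join("|")      one long string with the sonnets separated by "|"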
Split Cells

The parse column now contains all the sonnets separated by "|", but the project still contains only one row. Individual rows for each sonnet can be created by splitting the cell. Click the menu arrow on the parse column > Edit cells > Split multi-valued cells. Enter the separator | that was used for the join in the last step.

After this operation, the top of the project table should read 154 rows. Below the number is an option toggle "Show as: rows / records". Clicking on records groups the rows based on the original table; in this case it will read 1 record. The 154 rows make sense because the ebook contained 154 sonnets.
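
Conceptually, the operation splits the joined string on the separator and gives each piece its own row; a shortened, illustrative sketch:

    before (1 row):    <p>I<br />From fairest creatures ...</p>|<p>II<br />When forty winters ...</p>| ...
    after (154 rows):  <p>I<br />From fairest creatures ...</p>
                       <p>II<br />When forty winters ...</p>
                       ...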

Each cell in the parse column now contains one sonnet surrounded by a <p> tag. Click on the parse column and select Edit cells > Transform. This brings up a dialog box similar to the one for creating a new column, except that Transform overwrites the cells of the current column rather than creating a new one.

In the expression box, type value.parseHtml(). The preview will show a complete HTML tree starting with the <html> element. Select the p tag, add an index number, and use the function innerHtml() to extract the sonnet text: value.parseHtml().select("p")[0].innerHtml(). Click OK to transform all 154 cells in the column.
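
For a single cell, the effect of this transform is roughly the following (content abbreviated for illustration):

    before:  <p>I<br />From fairest creatures we desire increase,<br /> ... </p>
    after:   I<br />From fairest creatures we desire increase,<br /> ...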
Unescape

Each cell contains dozens of &nbsp; HTML entities ("no-break space"), which appear because browsers collapse extra white space in the source. Such entities are common when harvesting web pages and can be quickly replaced with the unescape() function. On the parse column, select Edit cells > Transform and type the following in the expression box: value.unescape('html')
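
A minimal illustrative example (the literal string here is made up, not taken from the data):

    "From&nbsp;fairest&nbsp;creatures".unescape('html')

This replaces each &nbsp; entity with a real space character (strictly, a non-breaking space), so the cell reads as plain text again.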

Extract Information with Array Functions

GREL array functions provide a powerful way to manipulate text data and can be used to finish processing the sonnets. Any string value can be turned into an array using the split() function by providing the character sequence that separates the items. In the sonnets, each line ends with <br />, providing a convenient separator for splitting: the expression value.split("<br />") creates an array of the lines of each sonnet. Index numbers and slices can then be used to populate new columns; a single line can be extracted and trimmed to create clean columns representing the sonnet number and first line. Create two new columns from the parse column using these names and expressions (sample results are shown after the list):

• "number", value.split("<br />")[0].trim()

• "first", value.split("<br />")[1].trim()
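
For the first sonnet, these two expressions should yield values roughly like the following (illustrative):

    value.split("<br />")[0].trim()    →  "I"
    value.split("<br />")[1].trim()    →  "From fairest creatures we desire increase,"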
The next column to create is the full sonnet text, which contains multiple lines. However, trim() only cleans the beginning and end of a cell, which would leave the internal whitespace untouched. From the parse column, create a new column named "text" and click in the Expression box. A forEach() statement takes an array, a variable name, and an expression that is applied to the variable. Following the form forEach(array, variable, expression), construct the loop using these parameters:

• array: value.split("<br />"), which creates an array from the lines of the sonnet in each cell.
• variable: line, the name by which each item in the array is represented.
• expression: line.trim(), which is evaluated separately for each item.

Thus, the final expression to extract and clean the full sonnet text is: forEach(value.split("<br />"), line, line.trim()).slice(1).join("\n"). Here slice(1) drops the first item of the resulting array (the sonnet number) and join("\n") reassembles the remaining lines with line breaks; a short walk-through follows.
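
As a rough walk-through of how the expression evaluates for one sonnet (the outputs are illustrative):

    value.split("<br />")                                an array of the sonnet's lines, e.g. ["I", "From fairest creatures we desire increase,", ...]
    forEach(value.split("<br />"), line, line.trim())    the same array with stray whitespace trimmed from each line
    ...slice(1)                                          drops the first item of the array, the sonnet number
    ...join("\n")                                        joins the remaining 14 lines into a single multi-line cell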
Cleanup and Export

We have used a number of operations to create new columns with clean data, so at this point the unnecessary columns can be removed. Click on the All column > Edit columns > Re-order / remove columns.

Drag the unwanted column names to the right side of the dialog box (in this case Column 1, fetch, and parse) and drag the remaining columns into the desired order on the left. Click OK to remove and reorder the columns in the data set.
Conclusion: We have now cleaned up the data, extracting the necessary information from the book of sonnets into separate columns for the sonnet number, first line, and full text, instead of having to search through the whole book. Processing a book of poems into structured data enables new ways of reading the text, allowing us to sort, manipulate, and connect it with other information.
