
Types of digital data

Learning objectives

• Introduction to various formats of digital data


• Data storage mechanism
• Data access methods
• Management of data
• Process of extracting desired information from data
• Challenges posed by various formats of data
Introduction
• Data growth has seen exponential acceleration since the advent of
the computer & internet.
• Digital data can be classified into 3 forms
▫ Unstructured
▫ Semi-structured
▫ Structured
• According to Merrill Lynch, 80-90% of business data is either
unstructured or semi-structured.
• Gartner also estimates that unstructured data constitutes 80% of the
whole enterprise data.
[Figure: pie chart of enterprise data: unstructured data 80%, semi-structured data 10%, structured data 10%]
Unstructured data
• Unstructured data – data which does not conform to a data model.
▫ No identifiable structure within this kind of data is available
▫ Data cannot be stored in rows and columns in a relational database
▫ It is not in a form which can be used easily by a computer program
▫ Advantage - no additional effort on its classification is necessary
▫ Limitation - no controlled navigation within unstructured content is
possible
• Storing data in an unstructured form without any defined data
schema is a common way of filing information
▫ About 80-90% data of an organization is in this format
▫ Example:- memos, chat rooms, power-point presentations, images, videos, letters,
white papers, body of an email etc.
• A common technology to search in unstructured text documents is
full-text search.
▫ A well-known full-text search engine library is Apache Lucene. Other examples are
the full-text indexes of MySQL and PostgreSQL
▫ Advantage of full-text search - it is completely decoupled from the data
▫ This makes it very flexible - it can be used on every kind of textual data
▫ Limitation - it cannot be used to search for pictures or videos
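A minimal sketch of the idea behind full-text search is an inverted index that maps each word to the documents containing it. It is not how Lucene, MySQL or PostgreSQL implement it. The documents and the names `tokenize`, `build_index` and `search` are assumptions made for illustration.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical documents, invented for illustration.
documents = {
    "memo-1": "Quarterly review of inventory and suppliers",
    "email-7": "Please review the attached white paper",
    "letter-3": "Inventory audit scheduled for Monday",
}
index = build_index(documents)
hits = search(index, "inventory review")
```

The index only needs text as input, which is why the same approach works for memos, emails or letters alike, and why it offers nothing for pictures or videos.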
Fully Structured data
• Structured data – data follows a
predefined schema i.e., data conforms
to some specification
▫ can be used easily by a computer program
▫ Example:- Data stored in databases in rows and columns
• Well-defined schema of fully structured data enables efficient data
processing, improved storage and navigation of content
• Designing a database schema is an elaborate process. It has to be
defined before the content is created. It defines the type and structure
of data and its relations. Figure above illustrates an ER-diagram and
its concrete tables within a RDBMS
▫ Limitation - difficult to subsequently extend a previously defined database
schema that already contains content.
▫ Advantage - existing tools & web frameworks support the development of
database focused applications.
▫ For instance, Hibernate and Oracle TopLink are Object/Relational (O/R)
Mapping frameworks, which map classes and objects to relational database
tables and rows.
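The following hand-written sketch shows what an O/R mapping does in principle: a class corresponds to a table and an object to a row. It uses Python's built-in `sqlite3` module rather than Hibernate or TopLink. The `Employee` class, its fields and the sample values are assumptions for illustration.

```python
import sqlite3
from dataclasses import dataclass, astuple

@dataclass
class Employee:
    """A plain class whose fields mirror the columns of the employee table."""
    employee_id: int
    name: str
    department: str

def create_schema(conn):
    # The schema is defined before any content is stored.
    conn.execute(
        "CREATE TABLE employee ("
        "employee_id INTEGER PRIMARY KEY, name TEXT, department TEXT)"
    )

def save(conn, employee):
    # Object -> row
    conn.execute("INSERT INTO employee VALUES (?, ?, ?)", astuple(employee))

def load(conn, employee_id):
    # Row -> object
    row = conn.execute(
        "SELECT employee_id, name, department FROM employee WHERE employee_id = ?",
        (employee_id,),
    ).fetchone()
    return Employee(*row) if row else None

conn = sqlite3.connect(":memory:")
create_schema(conn)
save(conn, Employee(1, "Asha", "Nursing"))  # illustrative values
loaded = load(conn, 1)
```

A real O/R framework generates the `save` and `load` logic from the class definition instead of requiring it to be written by hand.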
Semi-structured data
• In some applications, data is collected in an ad-hoc manner before it is
known how it will be stored and managed
• Semi-structured data does not conform to a data model but has some
structure
▫ Not all the information collected will have identical structure
▫ The schema information is mixed in with the data values, since each data object can
have different attributes that are not known in advance. Hence, this type of data is
sometimes referred to as self-describing data.
▫ Metadata for this data is available but is not sufficient
▫ Example: - emails, XML, markup languages like HTML etc.
▫ Example:- XML - language for data representation and exchange on the web. In XML
data can be directly encoded and a Document Type Definition (DTD)/ XML Schema
(XMLS) defines the structure of the XML document
▫ Advantage - ability to accommodate variations in structure
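To illustrate "self-describing" data, the sketch below uses JSON as a convenient stand-in (the text's own example is XML). Each record carries its own attribute names, so the set of attributes has to be discovered from the data. The records and the helper `discover_attributes` are hypothetical.

```python
import json

# Hypothetical records: each object carries its own attribute names,
# so the "schema" travels with the data values.
raw = """
[
  {"name": "Asha", "email": "asha@example.com"},
  {"name": "Ravi", "phone": "555-0100", "city": "Chennai"},
  {"name": "Meena"}
]
"""
records = json.loads(raw)

def discover_attributes(objects):
    """Collect every attribute name seen across the objects, in first-seen order."""
    seen = []
    for obj in objects:
        for key in obj:
            if key not in seen:
                seen.append(key)
    return seen

attributes = discover_attributes(records)

# Missing attributes are tolerated rather than rejected.
phones = [obj.get("phone") for obj in records]
```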
Case Study: GoodLife HealthCare Group

• GoodLife HealthCare Group is one of India’s leading healthcare groups. The
group began its operations in the year 2000 in a small town off the south-east
coast of India, with just one tiny hospital building with 25 beds. Today, the
group owns 20 healthcare centers across all the major cities of India. The
group has witnessed some major successes and attributes it to its focus on
assembly line operations and standardizations. The group believes in making a
“Dent in Global Healthcare”. A few of its major milestones are as listed below
in chronological order:
▫ Year 2000 – the birth of the GoodLife HealthCare Group. Functioning initially from a tiny
hospital building with 25 beds
▫ Year 2002 – built a low cost hospital with 200 beds in India
▫ Year 2004 – gained foothold in other cities of India
▫ Year 2005 – the total number of healthcare centers owned by the group touched the 20 mark
▫ The next 5 years saw the group’s dominance in the form of its setting up a GoodLife HealthCare
Research Institute to conduct research in molecular biology and genetic disorders
▫ Year 2010 witnessed the group bag the award for the “Best HealthCare Organization of the
Decade”
• GoodLife HealthCare offers the following facilities:
▫ Emergency care 24 x 7
▫ Support groups
▫ Support and help through call centers
Contd…
• The healthcare group also specializes in orthopedic surgeries. The group has
always leveraged IT to offer the best possible and affordable services to their
patients. It has never hesitated in spending money on procuring the best
possible machines, medicines, & facilities to provide the finest comfort to the
patients. The doctors, surgeons, nurses and paramedical staff are always on
the lookout for ground-breaking research in the field of medicine, therapy and
treatment. Year 2005 saw the group establish its own Research Institute to do
pioneering work in the field of medicine.

Organizational Structure
• GoodLife HealthCare Group has a Board of Directors at its helm.
They are 4 in all – each being an exceptional leader from the healthcare
industry. The group lives by the norm that disciplined training is the key to
success. The senior doctors and paramedical staff are very hands-on. They
walk the talk at all times. Strategic decisions are taken by the company’s board
after cautious consultation, thorough planning and review. The strategic
decisions are then conveyed to all the stakeholders such as the shareholders,
vendor partners, employees, external consultants, etc. The group has an acute
focus on the quality of service being offered.
Contd…
Quality Management
• The healthcare group spends a great deal of time and effort in ensuring
that its people are the best. They believe that the nursing and paramedical
staff constitute the backbone of the organization. They have a clearly laid
out process for recruitment, training and on-the-job monitoring. The
organization has its own curriculum and a grueling training program that
all the new hires have to religiously undertake. Senior doctors act as
mentors of the junior doctors. It is their responsibility to groom their
junior partners. Every employee is visibly aware of the organization’s
philosophy and the way of life prevalent in the organization.

Marketing
• GoodLife HealthCare Group had long realized the power of marketing.
They have utilized practically all channels to best sell their services. They
regularly advertise in the newspapers and magazines and more so when
they introduce a new therapy or treatment. They have huge hoardings
speaking about their facilities. They advertise on television with campaigns
that ensure that viewers cannot help but sit through them. They have had the
top saleable sportsperson to endorse their products. Lastly, the group
understands that “word of mouth” is very powerful and the best
advertisers are the patients themselves who have been treated at one of the
group’s facilities.
Contd…
Alliance Management
• Over the years GoodLife HealthCare Group has established a good
alliance with specialist doctors, consultants, surgeons and
physiotherapists etc. The group has also built a strong network of
responsible and highly dependable supplier partners. There is a very
transparent system of communication to all its supplier partners.
All its vendor partners are aware of the values
and the organization’s philosophy that defines the healthcare group. The
group believes in a win-win strategy for all. It is with the help and
support from these vendor partners that the group is able to stock
just the required amount of inventory and is able to procure the
emergency inventory supplies at a very short notice. The group
focuses only on its core business processes and out-sources the
remaining processes to remain in control and steer ahead of
competition.
Contd…
Future Outlook
• GoodLife HealthCare Group is looking at expansion in other countries of
the world too. They are also looking at growing in the existing markets.
They have 27,000 employees today and are looking at growing to 60,000 in
the next 5 years. The group already has a dedicated wing for the treatment of
bone deformities. It aspires to set up a chemist store within its premises to
make it convenient for its patients. They would like to set up an artificial
limb center for the production of artificial limbs and rehabilitation of
patients of orthopedic surgeries. The GoodLife HealthCare Group realizes its
social obligation too and is looking forward to setting up a free hospital
with 250 beds in a couple of years’ time.

Information Technology
• Web presence
▫ GoodLife HealthCare Group has excellent web presence: leveraging website, social
networking and mobile devices banner ads.
▫ Leverages internet technology for surveys, targeted mailers.
▫ Self-help portal for online registration for treatment of ailments.
• Front office management
▫ Patient relationship management
▫ Alliance management
▫ Registration and discharge of patients
▫ Billing
▫ Help desk
Contd…
Human Capital Management & Training Management
• Employee satisfaction surveys
• Employee retention program management
• Employee training and development program management

Data Centered Hosted Application


• Finance & accounting, corporate performance
• Inventory management
• Suppliers and purchase
• Marketing and campaign management
• Funnel and channel analysis application

Personal Productivity
• Email, web access, PDA connect
• Suggestions
• Quick surveys
• Feedback from communities of patients & also the communities of
specialist doctors , consultants and physiotherapists
Research questions
• Where is each type of its data present?
• How is it stored?
• How is the desired information extracted from it?
• How important is the information provided by it?
• How can this information augment public health and
healthcare services?
Terminology
• Entity – thing or object in real world that is distinguishable from all
other objects
▫ Example:- each person in an enterprise is an entity.
• Entity set – set of entities of the same type that share the same
properties/ attributes
▫ Example: - customers of a given bank can be defined as entity set
customer
• Attribute – an entity is represented by a set of attributes. Attributes
are descriptive properties possessed by each member of an entity set
▫ Possible attributes of the customer entity set are customer name, loan
amount etc.
• Database – includes a collection of entity sets each of which
contains any number of entities of the same type.
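The terms above can be sketched in code, under the assumption that an entity is modeled as an object, an entity set as a collection of such objects, and a database as a collection of entity sets. The `Customer` class and the sample values are invented for illustration.

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class Customer:
    """One entity; its fields are the attributes of the entity set."""
    customer_name: str
    loan_amount: float

# An entity set: entities of the same type sharing the same attributes.
# The names and amounts are invented for illustration.
customer_set = {
    Customer("Anita", 250000.0),
    Customer("Kiran", 90000.0),
}

# A database: a collection of entity sets, keyed here by entity-set name.
database = {"customer": customer_set}

attribute_names = [f.name for f in fields(Customer)]
total_loans = sum(c.loan_amount for c in database["customer"])
```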
Getting into “GOODLIFE” database
• GoodLife witnesses enormous amounts of data being exchanged in
the following forms:
▫ Doctors’/ nurses’ notes in an electronic report
▫ Emails sharing information about consultations/ investigations
▫ Narrative portions of electronic medical records
▫ Investigative reports
▫ Chat rooms

• GoodLife maintains a database which stores data only in a structured
format.
• However, the organization also has unstructured and semi-structured
data in abundance.
1. Structured data
• The patient index card is in a structured form. All the fields in the
patient index card are also structured fields
• GoodLife nurses make electronic records for every patient who visits
the hospital. These records are stored in a relational database
• Example: - Nurse Nandu records the body temperature and blood
pressure of a patient Prem, and enters them in the hospital database.
Dr. Dev, who is treating Prem searches the database to know his body
temperature. Dr. Dev is able to locate the desired information easily
because the hospital data is structured and is stored in a relational
database.

GoodLife Healthcare Patient Index Card


Patient ID <> Date <>
Nurse Name <>
Patient Name <> Patient Age <>
Body Temperature <> Blood Pressure <>
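As a rough sketch of why Dr. Dev's lookup is easy, the patient index card can be modeled as one relational table whose columns mirror the card's fields. The table name, column names and every value below are assumptions. Only the field list and the names Nandu and Prem come from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Columns mirror the fields of the patient index card.
conn.execute("""
    CREATE TABLE patient_index_card (
        patient_id       INTEGER,
        visit_date       TEXT,
        nurse_name       TEXT,
        patient_name     TEXT,
        patient_age      INTEGER,
        body_temperature REAL,
        blood_pressure   TEXT
    )
""")

# Nurse Nandu enters a reading for Prem (all values are made up).
conn.execute(
    "INSERT INTO patient_index_card VALUES (?, ?, ?, ?, ?, ?, ?)",
    (101, "2010-01-15", "Nandu", "Prem", 35, 99.1, "120/80"),
)

def body_temperature(conn, patient_name):
    """Return the recorded body temperatures for a patient, newest first."""
    rows = conn.execute(
        "SELECT body_temperature FROM patient_index_card "
        "WHERE patient_name = ? ORDER BY visit_date DESC",
        (patient_name,),
    ).fetchall()
    return [row[0] for row in rows]

# Dr. Dev looks up Prem's temperature with a precise query.
temps = body_temperature(conn, "Prem")
```

Because every field has a fixed name and type, the query can ask for exactly one attribute of exactly one patient.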
Contd…
• Structured data is organized in semantic chunks (entities) with
similar entities grouped together to form relations or classes
• Entities in the same group have the same descriptions i.e.,
attributes
• Descriptions for all entities in a group (schema)
▫ Have the same defined format
▫ Have a predefined length
▫ And follow the same order
• Characteristics of structured data (figure):
▫ Conforms to a data model
▫ Data is stored in the form of rows & columns
▫ Similar entities are grouped
▫ Attributes in the group are the same
▫ Data resides in fixed fields within a record/ file
▫ Definition, format and meaning of data are explicitly known
Contd…
• Data coming from databases such as Access, OLTP systems, SQL,
Excel etc. are in structured format
• Working with structured data is easy when it comes to storage,
scalability, security and update and delete operations
▫ Storage – both standard & user-defined data types can be used
▫ Scalability – not generally an issue with increase in data
▫ Security – ensuring security is easy
▫ Update and delete operations – easy due to structured format
• Hassle-free retrieval
▫ Retrieving information – well defined structure helps in easy retrieval of
data
▫ Indexing & searching – enables streamlined search
▫ Mining data – can be easily mined and knowledge can be extracted from it
▫ BI operations – works extremely well with structured data
The problem of IR

• Goal: to find documents relevant to an information need from a
large document set

• Options in searching for basic queries:
▫ Sequential/on-line text searching or string matching: Finding the
occurrences of a pattern in a text when the text is not preprocessed
▪ when text is small (in MB)
▪ when index overhead can’t be afforded
▪ Slow and difficult to improve
▫ Indexed searching: Build data structure over the text (indices) to
speed up the search
▪ when text is large or huge
▪ the text is semi-static (not often updated)
▪ Fast & flexible to further improvement
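The two options can be contrasted with a small sketch. `sequential_search` scans the raw text on every query, while `build_word_index` preprocesses the text once so that later lookups become dictionary accesses. Both function names and the sample text are assumptions, and the index shown handles whole words separated by single spaces only.

```python
from collections import defaultdict

def sequential_search(text, pattern):
    """Scan the raw text and return every offset where pattern occurs."""
    offsets = []
    start = text.find(pattern)
    while start != -1:
        offsets.append(start)
        start = text.find(pattern, start + 1)
    return offsets

def build_word_index(text):
    """Preprocess once: map each word to the offsets where it starts."""
    index = defaultdict(list)
    position = 0
    for word in text.split(" "):
        if word:
            index[word].append(position)
        position += len(word) + 1  # +1 for the separating space
    return index

text = "to be or not to be"
index = build_word_index(text)

scan_hits = sequential_search(text, "be")   # re-reads the text each time
index_hits = index.get("be", [])            # single dictionary lookup
```

The index has to be rebuilt or patched whenever the text changes, which is why indexed searching suits semi-static text.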

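[Figure: example of an IR system: an information need is expressed as a query; the IR retrieval system searches a document collection and returns an answer list (e.g., Google searching the Web)]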
2. Unstructured data
Dr. Sami, Dr. Raj & Dr. Rahul work at the medical facility of GoodLife. Over the past
few days, Dr. Sami & Dr. Raj had been exchanging long emails about a particular case of
gastro-intestinal problem. Dr. Raj has chanced upon a particular combination of drugs that has
successfully cured the disorder in his patients. He has written an email about this combination
of drugs to Dr. Sami.
Dr. Rahul has a patient with quite a similar case of gastro-intestinal disorder whose
cure Dr. Raj has chanced upon. Dr. Rahul already tried regular drugs but with no positive
results so far. He quickly searches the organization’s database for the process, but with no luck.
The information is tucked away in the email conversation between Dr. Sami & Dr. Raj.
Dr. Rahul would have accessed the process had the storage & analysis of unstructured data
been undertaken by GoodLife.
Dr. Raj’s email to Dr. Sami has not been successfully updated into the medical system
database as it fell in the unstructured format.
Characteristics of unstructured data (figure):
▫ Does not conform to any data model
▫ Cannot be stored in the form of rows & columns in a database
▫ Has no easily identifiable structure
▫ Does not follow any rules/ semantics
▫ Not in any particular format/ sequence
▫ Not easily usable by a program
Contd…
• Unstructured data cannot be stored in the form of rows & columns and
hence it is difficult to determine the meaning of the data
• It does not follow any rules/ semantics. It can be of any type & hence is
unpredictable
• Unstructured data can be classified into 2 broad categories:
▫ Bitmap objects – image, video or audio files etc.
▫ Textual objects – Word documents, emails, Excel spreadsheets etc.
• Web pages are said to be unstructured data even though they are
defined by HTML, which has a rich structure
▫ HTML is solely used for rendering & presentations
▫ Web pages usually carry links & references to external unstructured content
such as images, XML files etc.
How to manage unstructured data
• A few generic tasks to be performed to enable storage & search of
unstructured data are:
▫ Indexing – on the basis of some value in the data, index is defined
▪ Index is an identifier and it represents the large record in the data set
▪ In the absence of index, whole data set will be scanned for retrieving the desired
data
▫ Tags/ Metadata – using metadata, data in a document, etc. can be tagged.
▪ It enables search & retrieval
▫ Classification/ Taxonomy – taxonomy is classifying data on the basis of the
relationships that exist between data
▪ Data can be arranged in groups & placed in hierarchies based on the taxonomy
prevalent in an organization
▫ CAS (Content Addressable Storage) – stores data based on their metadata.
▪ It assigns a unique name to every object stored in it
▪ The object is retrieved based on its content & not its location
▪ Used extensively to store emails etc.
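A toy sketch of content addressing follows, assuming the unique name is derived by hashing the object's bytes (a common way to realize the idea). The metadata-based organization mentioned above is not modeled here. The class `ContentAddressableStore` and the sample content are hypothetical.

```python
import hashlib

class ContentAddressableStore:
    """Toy in-memory store: an object's name is the SHA-256 of its bytes."""

    def __init__(self):
        self._objects = {}

    def put(self, content: bytes) -> str:
        # The unique name is derived from the content itself.
        address = hashlib.sha256(content).hexdigest()
        self._objects[address] = content
        return address

    def get(self, address: str) -> bytes:
        return self._objects[address]

store = ContentAddressableStore()
email_body = b"Please find the consultation notes attached."  # made-up content
address = store.put(email_body)

# Storing identical content again yields the same address (no duplicate copy).
same_address = store.put(email_body)
```

Retrieval needs only the address, not a file path or location, which matches the "content & not its location" point above.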
How to store unstructured data
• Challenges faced while storing unstructured data are:
▫ Storage space – lot of space is required to store unstructured data. Difficult
to store images, videos, audios etc.
▫ Scalability – as the data grows, scalability becomes an issue & the cost of
storing such data increases
▫ Retrieve information – difficult to retrieve & recover data
▫ Security – difficult due to varied sources of data, e.g., emails, web pages etc.
▫ Update & delete – very difficult as retrieval is difficult due to no clear
structure
▫ Indexing & searching – indexing unstructured data is difficult & error-prone
as the structure is not clear & attributes are not pre-defined.
▪ As a result, search results are not very accurate
▪ Indexing becomes more difficult as the volume of the data grows
[Figure: challenges for storing unstructured data: storage space, scalability, retrieve information, security, update & delete, indexing & searching]

Possible solutions to storage challenges of unstructured data

• Changing format – unstructured data may be converted to formats
which are easily managed, stored and searched
▫ Example: - IBM is working on providing a solution which will convert
audio, video etc. to text
• Developing new hardware – new hardware needs to be developed to support
unstructured data. It may either complement existing storage devices
or may be a stand-alone for unstructured data
• Storing in RDBMS/ BLOBs – unstructured data may be stored in
relational databases which support BLOBs (Binary Large Objects).
▫ As video/ image files cannot be stored neatly in a relational column, their
metadata such as date & time of creation, owner/ author of the data etc. can
be stored, which does not pose any problem for storage
• Storing in XML format – unstructured data may be stored in XML
format which tries to give some structure to it by using tags & elements
• CAS (Content Addressable Storage) – It organizes files based on their
metadata & assigns a unique name to every object stored in it.
▫ The object is retrieved based on its content & not its location
▫ Used extensively to store emails etc.
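The RDBMS/ BLOB option above can be sketched with Python's built-in `sqlite3` module: metadata lives in ordinary columns and the raw bytes in a BLOB column. The table layout, file name, owner and timestamp are all assumptions, and the byte string is a stand-in for a real image.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Metadata goes into ordinary columns; the raw bytes go into a BLOB column.
conn.execute("""
    CREATE TABLE media (
        id         INTEGER PRIMARY KEY,
        file_name  TEXT,
        owner      TEXT,
        created_at TEXT,
        content    BLOB
    )
""")

# Stand-in bytes; a real application would read them from an image or video file.
image_bytes = bytes(range(256))

conn.execute(
    "INSERT INTO media (file_name, owner, created_at, content) VALUES (?, ?, ?, ?)",
    ("xray-001.png", "radiology", "2010-01-15T10:30:00", image_bytes),
)

# The metadata is searchable with plain SQL; the BLOB itself is opaque.
row = conn.execute(
    "SELECT file_name, length(content), content FROM media WHERE owner = ?",
    ("radiology",),
).fetchone()
file_name, size, content = row
```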
How to extract information from stored unstructured data

• Challenges faced while extracting unstructured data are:


▫ Interpretation – unstructured data is not easily interpreted by conventional
search algorithms
▫ Classification/ Taxonomy – different naming conventions followed across
the organization make it difficult to classify data
▫ Indexing – designing algorithms to understand the meaning of the
documents & then tagging or indexing them accordingly is difficult
▫ Deriving meaning – computer programs cannot automatically derive
meaning/ structure from unstructured data
▫ File formats – increasing number of file formats makes it difficult to
interpret data
▫ Tags – as the data grows, it is not possible to put tags manually

[Figure: challenges for extracting unstructured data: interpretation, tags, indexing, deriving meaning, file formats, classification/ taxonomy]
Possible solutions to challenges faced in extracting information
from unstructured data

• Tags – unstructured data can be stored in a virtual repository & be
automatically tagged.
▫ Documentum provides this type of solution
• Text mining – text mining tools help in grouping as well as classifying
unstructured data & assist in analyzing by considering grammar,
context, synonyms etc.
• Application platforms – application platforms like XOLAP help extract
information from email & XML-based documents
• Classification/ Taxonomy – taxonomies within the organization can be
managed automatically to organize data in hierarchical structures
• Naming conventions/ standards – following naming standards or
conventions across an organization can greatly improve storage,
retrieval, search & index
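As a deliberately naive sketch of automatic tagging against an organizational taxonomy, the code below assigns a tag whenever one of its keywords appears in the text. It does not represent how Documentum or real text mining tools work, since those also consider grammar, context and synonyms. The `TAXONOMY` table and the sample email are hypothetical.

```python
import re

# Hypothetical taxonomy: tag -> keywords that suggest it.
TAXONOMY = {
    "gastroenterology": {"gastro", "intestinal", "stomach"},
    "medication": {"drug", "drugs", "dosage"},
    "billing": {"invoice", "payment"},
}

def auto_tag(text):
    """Return the sorted tags whose keywords occur in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(tag for tag, keywords in TAXONOMY.items() if words & keywords)

email = "This combination of drugs cured the gastro-intestinal disorder."
tags = auto_tag(email)
```

Once tagged this way, an email like the one in the Dr. Raj scenario could be found by searching on its tags rather than on its exact wording.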
UIMA: possible solution for Unstructured data
• UIMA (Unstructured Information Management Architecture)
▫ An open-source platform from IBM that integrates different kinds of analysis
engines
▫ provides a complete solution for knowledge discovery from unstructured
data
▫ The analysis engines enable integration & analysis of unstructured
information & bridge the gap between structured & unstructured data
▫ Stores information in a structured format

[Figure: UIMA flow: unstructured data such as chat, images, email etc. is acquired from various sources; analysis: subjected to semantic analysis, yielding structured information; delivery: structured information access, query & presentation to users]
3. Semi-structured data
Dr. Vishnu of “GoodLife HealthCare” organization usually gets a
blood test done for migraine patients visiting her. It is her observation that
patients with migraine have high platelet count. She makes a note of this in
the diagnosis & conclusion section in the blood test report of patients. One
day, another doctor, Dr. Mamatha, searches the database when he is unable
to find the cause of migraine in one of his patients, but with no luck! The
answer he is looking for is nestled in the vast hoards of data.
Dr. Vishnu’s blood test reports on patients were not successfully
updated into the medical system database as they were in the semi-structured
format.

Blood test report – Example for semi-structured data

GoodLife HealthCare
Blood Test Report
Date <>
Department <> Attending Doctor <>
Patient Name <> Patient Age <>
Hemoglobin Content <>
RBC Count <>
WBC Count <>
Platelet Count <>
Diagnosis <notes>
Conclusion <notes>
Contd…
• It is important to understand, manage, and analyze semi-structured
data coming from heterogeneous sources
• Semi-structured data does not conform to any data model
• Data cannot be stored in rows & columns as in a database
• Semi-structured data, however, has tags & markers which help group
the data & describe how the data is stored, giving some metadata, but
they are not sufficient for management & automation of data
• In semi-structured data, similar entities are grouped & organized in a hierarchy
• The attributes/ properties within a group may/ may not be the same
[Figure: characteristics of semi-structured data: does not conform to a data model but contains tags & elements (metadata); data cannot be stored in the form of rows & columns; similar entities are grouped; attributes in the group may not be the same; the tags & elements describe data that is stored; no sufficient metadata]
• Example:- 2 addresses may not contain same number of attributes
▫ Address 1: <house number><street name><area name><city>
▫ Address 2: <house number><street name><city>
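The two address layouts above can be written as XML and read with Python's standard `xml.etree.ElementTree` module. This is only a sketch: the tag names are adapted from the placeholders in the text (underscores replace spaces) and the values are invented.

```python
import xml.etree.ElementTree as ET

# Two address records following the layouts in the text; the values are invented.
xml_text = """
<addresses>
  <address>
    <house_number>12</house_number>
    <street_name>Lake Road</street_name>
    <area_name>Adyar</area_name>
    <city>Chennai</city>
  </address>
  <address>
    <house_number>7</house_number>
    <street_name>Hill Street</street_name>
    <city>Mysore</city>
  </address>
</addresses>
"""

root = ET.fromstring(xml_text)

# Each record describes itself: tag names act as attribute names.
addresses = [
    {child.tag: child.text for child in addr}
    for addr in root.findall("address")
]
attribute_counts = [len(a) for a in addresses]
```

Both records belong to the same group, yet a fixed four-column table would leave the second one with a missing value.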
Contd…

• The blood test report prepared by Dr. Vishnu is semi-structured
• It has structured fields like Date, Department, Patient Name etc.,
and unstructured fields like Diagnosis, Conclusion etc.
• Web pages are another example of semi-structured data
▫ These pages have content embedded within HTML & often
have some degree of metadata within tags
▫ This implies certain details of the data being presented.

<HTML>
<HEAD>
...
<TABLE>
<TR>
<TH><I> header 1 </I></TH>
<TH><I> header 2 </I></TH>
<TH><I> header 3 </I></TH>
</TR>
<TR>
<TD> text 1 </TD>
<TD><A HREF=http://www.stuff/> text 2 </A></TD>
<TD> text 3 </TD>
</TR>
...
</TABLE>
...
</BODY>
</HTML>
Contd…
• Sources of semi-structured data
▫ Email, XML, TCP/ IP packets, Zipped files, Binary executables, Mark-up
languages, Integration of data from heterogeneous sources
• Characteristics of semi-structured data
▫ It is organized into semantic entities
▫ Similar entities are grouped together
▫ Entities in the same group may not have same attributes
▫ Order of attributes is not necessarily important
▫ Not always all attributes are required
▫ Size of the same attributes in a group may differ
▫ Type of the same attributes in a group may differ
• Integration of data from heterogeneous sources (e.g., RDBMS, OODBMS,
Structured file, Legacy system) leads to the data being semi-structured

• The problems arising out of semi-structured data are evident in
Dr. Mamatha’s failure to deliver good healthcare to the patient. The
reason behind this failure is the inadequate semi-structured data
management undertaken by GoodLife HealthCare
How to manage semi-structured data

• A few generic ways to manage & store semi-structured data are:
▫ Schemas – can be used to describe the structure of the data
▪ Schemas define the constraints on the structure, content of the document etc.
▪ Limitation – as requirements are ever changing in a business environment,
changes required in data will also lead to changes in schema
▫ Graph-based data models – used to describe data.
▪ The relationships & hierarchies are represented in the form of a tree-like
structure where the vertices contain the entity & the leaves contain the data
Contd…

• XML – used to store & exchange semi-structured data
▫ Allows user to define tags to store data in hierarchical or nested forms
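A short sketch of user-defined, nested tags follows, using the blood test report as the subject and Python's standard `xml.etree.ElementTree` module. The tag names are adapted from the report's fields and every value is invented.

```python
import xml.etree.ElementTree as ET

# Build a report with user-defined, nested tags. Tag names follow the
# blood test report fields; the values are invented.
report = ET.Element("blood_test_report")
patient = ET.SubElement(report, "patient")
ET.SubElement(patient, "name").text = "Patient A"
ET.SubElement(patient, "age").text = "34"

counts = ET.SubElement(report, "counts")
ET.SubElement(counts, "platelet_count").text = "450000"

# Free-text section: structured container, unstructured content.
ET.SubElement(report, "diagnosis").text = "High platelet count observed with migraine."

xml_string = ET.tostring(report, encoding="unicode")

# Reading it back: navigate the hierarchy by tag path.
parsed = ET.fromstring(xml_string)
platelets = int(parsed.findtext("counts/platelet_count"))
has_conclusion = parsed.find("conclusion") is not None
```

The numeric field can be queried directly, while the diagnosis note remains free text that still needs text search or mining to be useful.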
How to store semi-structured data
• Challenges faced while storing semi-structured data are:
▫ Data usually has irregular & partial structure. Data from a few sources may
have partial structure while some may have none at all
▫ Structure of data from some sources is implicit, which makes it very difficult
to interpret relationships between data
▫ Schema & data are tightly coupled. Some queries may update both schema
and data, with the schema being updated very frequently
[Figure: challenges for storing semi-structured data: storage cost, RDBMS, irregular & partial structure, implicit structure, evolving schemas, distinction between schema & data]
Possible solutions for storing semi-structured data
• XML
▫ Allows to define tags & attributes to store data
▫ Data can be stored in hierarchical/ nested structure
• RDBMS – data can be stored in a relational database by mapping the
data to a relational schema which is then mapped to the table
• Special purpose DBMS – databases which are specially designed to store
semi-structured data
• Object Exchange Model (OEM) – data can be stored & exchanged in the
form of graph
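One possible way to map semi-structured records onto a relational schema is a generic table holding one row per (entity, attribute, value) triple. This is only one of several mappings and is shown as an assumption, not as the approach the text prescribes. The table name and records are invented.

```python
import sqlite3

# Hypothetical records with differing attributes.
records = [
    {"house_number": "12", "street_name": "Lake Road",
     "area_name": "Adyar", "city": "Chennai"},
    {"house_number": "7", "street_name": "Hill Street", "city": "Mysore"},
]

conn = sqlite3.connect(":memory:")
# One generic table: a row per (entity, attribute, value) triple.
conn.execute(
    "CREATE TABLE address_attr (entity_id INTEGER, attribute TEXT, value TEXT)"
)

for entity_id, record in enumerate(records, start=1):
    conn.executemany(
        "INSERT INTO address_attr VALUES (?, ?, ?)",
        [(entity_id, attribute, value) for attribute, value in record.items()],
    )

def entities_with(conn, attribute):
    """Return ids of the entities that carry the given attribute."""
    rows = conn.execute(
        "SELECT DISTINCT entity_id FROM address_attr "
        "WHERE attribute = ? ORDER BY entity_id",
        (attribute,),
    ).fetchall()
    return [row[0] for row in rows]

with_area = entities_with(conn, "area_name")
row_count = conn.execute("SELECT COUNT(*) FROM address_attr").fetchone()[0]
```

The table never needs altering when a new attribute appears, at the cost of more rows and weaker type checking.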
Contd…
• Modeling semi-structured data (the OEM way)
▫ Object Exchange Model (OEM) is a model for storing & exchanging
semi-structured data
▫ Structures data in the form of graphs. Objects are the entities, labels are
the attributes & leaf contains the data
▫ It models the hierarchies, nested structures etc.
▫ Indexing & searching a graph-based data model is easier & quicker as it
is easy to traverse to the data
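A simplified sketch of the graph idea follows: objects are nodes, labels are edge names, and leaves hold the data. Nested dictionaries are used here, so the graph is restricted to a tree and object identifiers are omitted. The structure, labels and values are assumptions for illustration.

```python
# Each object is either a complex object (dict of label -> child) or an
# atomic leaf (plain value). Labels play the role of attribute names.
oem = {
    "patient": {
        "name": "Patient A",
        "report": {
            "platelet_count": 450000,
            "diagnosis": "High platelet count observed with migraine.",
        },
    }
}

def lookup(node, path):
    """Follow a sequence of labels from the root; return None if a label is missing."""
    for label in path:
        if not isinstance(node, dict) or label not in node:
            return None
        node = node[label]
    return node

def leaves(node, prefix=()):
    """Yield (label_path, value) for every atomic leaf in the graph."""
    if isinstance(node, dict):
        for label, child in node.items():
            yield from leaves(child, prefix + (label,))
    else:
        yield prefix, node

platelets = lookup(oem, ["patient", "report", "platelet_count"])
leaf_paths = [path for path, _ in leaves(oem)]
```

Traversal follows labels from the root, so reaching a value costs one step per level of nesting regardless of how many other objects exist.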
Challenges in extracting information from semi-structured data

• Flat files – semi-structured data is usually stored in flat files which
are difficult to index & search
• Heterogeneous data – data comes from varied sources which is
difficult to tag & search
• Incomplete/ irregular structure – extracting structure when there is
none & interpreting the relations existing in the structure which is
present is a difficult task
[Figure: challenges for extracting semi-structured data: flat files, heterogeneous data, incomplete/ irregular structure]
Possible solutions to challenges faced in extracting information
from semi-structured data

• Indexing – indexing data in a graph-based model enables quick search


• OEM – this data modeling technique allows data to be stored in a
graph-based data model which is easier to index & search
• XML – allows data to be arranged in a hierarchical or tree-like
structure which enables indexing & searching
• Mining tools – various mining tools are available which search data
based on graphs, schemas, structures etc.
Assignment - I
1. Compare & contrast structured, semi-structured &
unstructured data.
2. A newly opened restaurant wants to collect feedback from its
customers on the ambience of the restaurant, the quality &
quantity of food served, the hospitality of the restaurant staff
etc. Design an appropriate feedback form for the restaurant
& comment on the type of data that will be collected therein.
3. Under which category of data does a census survey form fall?
Explain in detail.
