
Statistics Netherlands

How good are our metadata?


Jean-Pierre Kent
Statistics Netherlands
Misconceptions about Metadata
Bo Sundgren (Statistics Sweden):
Metadata collection is dull, expensive and time-consuming.
Patty Adelaar (Dutch Social Planning Office):

Metainfo is like cod liver oil: it is good for statisticians,
but you have to ram it down their throats!
A vision for the future?
Take care of the Meta,
and the Meta will take care
of the Data
Nature of this research
Self assessment
What are we doing?
What are we doing wrong?
What should we be doing?

Self evaluation
What can we learn from others?
What can we teach others?
Without good metadata no good data
What can go wrong?
The metadata don't match the data
They are wrong
They are incomplete
They are noisy

Metadata are inconsistent

Metadata differ from statistic to statistic
The metadata don't match the data
Assignment 1: design a statistic
Concept
Assignment 2: document it
Statistician 1
Statistician 2
Process

Data
Documentation
Three ambition levels for good meta
First level: reliable metadata

Second level: consistent metadata

Third level: standard metadata
Ambition level 1: reliable metadata
The metadata need to be truthful

They must tell
the correct story
the whole story
the relevant story
How to avoid divergence?
Before computer era:
All data on form with preprinted metadata
Questionnaire
Card file
Published tables
Punched cards were the first case of data
without metadata
Computers are the cause of the problem!
Keeping data and meta synchronised
Concept
Statistician
Machine-readable metadata
Program 1
Process

Data
Documentation
Program 2
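
The diagram above can be read as follows: one machine-readable metadata definition feeds both the program that runs the process and the program that writes the documentation, so the two cannot drift apart. A minimal sketch of that idea, assuming a hypothetical metadata layout (the field names `name`, `type` and `label` are illustrative, not taken from any Statistics Netherlands system):

```python
# Hypothetical machine-readable metadata for two variables.
METADATA = [
    {"name": "age", "type": int, "label": "Age in years"},
    {"name": "region", "type": str, "label": "Region of residence"},
]


def parse_record(raw, metadata):
    """'Program 1': convert raw string fields using the types in the metadata."""
    return {m["name"]: m["type"](raw[m["name"]]) for m in metadata}


def render_documentation(metadata):
    """'Program 2': generate documentation from the same metadata."""
    return "\n".join(
        f"{m['name']} ({m['type'].__name__}): {m['label']}" for m in metadata
    )


record = parse_record({"age": "42", "region": "Utrecht"}, METADATA)
doc = render_documentation(METADATA)
```

Changing a variable in `METADATA` changes both the processing and the documentation in one step, which is the point of the slide.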
Without good metadata no good data
What can go wrong?

The metadata don't match the data
Implementation metadata are different
from design metadata
Three ambition levels for good meta
First level: reliable metadata

Second level: consistent metadata

Third level: standard metadata
Design vs. implementation
Metadata are specified by design
Implementation takes place in various software
separately
Input software (Blaise, EDI)
Database software (Access, SQL Server)
Processing software (VB)
Publication software
Is this semantically neutral?
How to avoid multiple implementation
All programs speak the same language
Has become possible recently, thanks to
component architecture

One program can translate its metadata for
other programs
Blaise, using its metadata script language
Cameleon
Active metadata
Active metadata = Embedded metadata +
metadata-driven processes

A process is fully conditioned by definitions
contained in the metadata

Constraints: data are forced to obey the
constraints / rejected if this is not possible
Dependent variables: computations are carried
out
Processing rules are executed
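
The three bullet points above can be illustrated with a small sketch in which the process itself contains no knowledge of the data: constraints and derived variables live entirely in the metadata. The structure of `METADATA` and the variable names are assumptions made for illustration. This sketch only rejects records that violate a constraint; it does not attempt to force them to obey it.

```python
# Hypothetical active metadata: constraints and dependent variables.
METADATA = {
    "constraints": {
        "age": lambda value: 0 <= value <= 120,
    },
    "derived": {
        "is_adult": lambda record: record["age"] >= 18,
    },
}


def process(record, metadata):
    """Return the record with derived variables added, or None if rejected."""
    # Constraints: a record that violates any constraint is rejected.
    for name, check in metadata["constraints"].items():
        if not check(record[name]):
            return None
    # Dependent variables: computations defined in the metadata are carried out.
    result = dict(record)
    for name, compute in metadata["derived"].items():
        result[name] = compute(result)
    return result


accepted = process({"age": 30}, METADATA)
rejected = process({"age": 150}, METADATA)
```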
Migrating to a component architecture
Different programs, incompatible storage
formats
Software A
Format A
Software B
Format B
Migrating to a component architecture
Software and storage are made independent
Interface A
Software A
Format A
Interface B
Software B
Format B
Migrating to a component architecture
Both interfaces are combined into a
common interface
Software A
Format A
Software B
Format B
Interface A Interface B Interface A + B (or standard)
Migrating to a component architecture
Working on the storage format no longer
affects the software
Format A
Standard Interface
Software A Software B
Format B
Standard format
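
The end state of this migration can be sketched as software written against one standard interface, with the storage format hidden behind it. The names `DataStore`, `CsvStore` and `JsonStore` are hypothetical, and CSV and JSON merely stand in for "Format A" and "Format B":

```python
import csv
import io
import json
from typing import Protocol


class DataStore(Protocol):
    """The standard interface: the only thing the software depends on."""

    def read_rows(self) -> list[dict]: ...


class CsvStore:
    """Stands in for 'Format A'."""

    def __init__(self, text: str) -> None:
        self.text = text

    def read_rows(self) -> list[dict]:
        return [dict(row) for row in csv.DictReader(io.StringIO(self.text))]


class JsonStore:
    """Stands in for 'Format B'."""

    def __init__(self, text: str) -> None:
        self.text = text

    def read_rows(self) -> list[dict]:
        return json.loads(self.text)


def count_rows(store: DataStore) -> int:
    """Software written once against the interface, unaware of the format."""
    return len(store.read_rows())


csv_text = "id,region\n1,Utrecht\n2,Zeeland\n"
json_text = '[{"id": "1", "region": "Utrecht"}, {"id": "2", "region": "Zeeland"}]'
```

Replacing either store by a new standard format only requires a new class with a `read_rows` method; `count_rows` is untouched.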
Towards a unique metadata system
Data & Metadata
Standard data/metadata interface
Design
software
Input
software
Throughput
software
Publication
software
Presentation
software
Without good metadata no good data
What can go wrong?

Incomplete metadata
Diverging data and metadata
Implementation metadata differ from design
metadata
The metadata of different data sets are
different
tools
Three ambition levels for good meta
First level: reliable metadata

Second level: consistent metadata

Third level: standard metadata
Diverse definitions
Who is unemployed?

Someone with an unemployment allowance?

Someone registered at the employment office?

Someone who says they are looking for work?
Two possibilities
It is one concept:
Choose one definition, or merge them into a new
one

We have three distinct concepts:
Use all three definitions, but use different names
and make different variables
Three guarantees for reliable metadata
Embedded metadata
Metadata-driven processes
Active metadata
Embedded metadata (1)
Data and metadata are firmly tied together
A data set consists of both data and metadata

Tools are metadata-aware
They know how to access the metadata
They know how to understand the metadata
Embedded metadata (2)
Tools access data only through the
metadata
The metadata tell them how to access the data

They enforce the relationship between data
and metadata
They apply all the rules defined in the metadata
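
One way to picture embedded metadata is a data set object that stores raw values and metadata together and offers no route to the data except through the metadata. The sketch below is an assumption-laden illustration: the class `DataSet`, the `position` and `codes` fields, and the sample values are all hypothetical.

```python
class DataSet:
    """Bundles rows with the metadata that describe them (hypothetical layout)."""

    def __init__(self, metadata, rows):
        self.metadata = metadata
        self._rows = rows  # plain tuples, meaningless without the metadata

    def value(self, row_index, variable):
        # The metadata tell the tool where the variable is stored.
        spec = self.metadata[variable]
        raw = self._rows[row_index][spec["position"]]
        # The tool applies the rule defined in the metadata (here: a code list).
        if raw not in spec["codes"]:
            raise ValueError(f"{raw!r} is not a valid code for {variable}")
        return spec["codes"][raw]


ds = DataSet(
    {"sex": {"position": 1, "codes": {"1": "male", "2": "female"}}},
    [("r001", "2"), ("r002", "9")],
)
```

Here `ds.value(0, "sex")` returns the decoded category, while `ds.value(1, "sex")` raises `ValueError` because the stored code is not in the code list.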
Embedded metadata
Data
Metadata
Data set 1
Data
Metadata
Data set 2
Survey design
STANDARDS
Process
META
DATA
META
DATA
META
DATA
Rules vs. Tools
Rules
Imposed metadata models
Central metadata repositories
Extra work, statisticians don't understand why
Tools
Automatic support for standards
Automatic collection or computation of required
information


Rules vs. Tools
NO RULES WITHOUT TOOLS
Metadata-driven processes
Processes are implemented in a generic
way
You don't write a program to aggregate a
specific data set.
But
The IT department writes one program, once,
that can aggregate any data set
The statisticians provide metadata containing
all the information required in order to
aggregate the data
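
A generic aggregation program of this kind might look like the following sketch. The function is written once; everything specific to a data set arrives as metadata. The metadata keys `classify_by` and `sum`, and the sample rows, are assumptions made for illustration.

```python
from collections import defaultdict


def aggregate(rows, metadata):
    """Sum the measure variables within each combination of classification values."""
    totals = defaultdict(lambda: defaultdict(float))
    for row in rows:
        key = tuple(row[name] for name in metadata["classify_by"])
        for measure in metadata["sum"]:
            totals[key][measure] += row[measure]
    return {key: dict(values) for key, values in totals.items()}


# Supplied by the statistician: the data and the metadata describing them.
rows = [
    {"region": "North", "sector": "A", "turnover": 10.0},
    {"region": "North", "sector": "B", "turnover": 5.0},
    {"region": "South", "sector": "A", "turnover": 7.0},
]
meta = {"classify_by": ["region"], "sum": ["turnover"]}
result = aggregate(rows, meta)
```

Aggregating by sector instead requires no new program, only `{"classify_by": ["sector"], "sum": ["turnover"]}`.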
From multiple databases
to a 4-database model
Input DB Output DB
Intermediate
DB 1
Statistic 3
Statistic 2
Statistic 1
Intermediate
DB 2
Stove Pipe Model
Disadvantages Stove Pipes
No Co-ordination
Definitions
Aggregation & Estimation techniques
Figures
No Integration
No Relations between figures
No Relations between databases
Inconsistency
First step
Collect, Compare, Confront
4 DataBase Model
Vertical integration

Input DB Output DB
Intermediate
DB 1
Statistic 3
Statistic 2
Statistic 1
Intermediate
DB 2
Micro
Base
Stat
Base
Stat
Line
Base
Line
Stovepipe Model 4-DataBase Model
Input DB
Output DB
Intermediate
DB 1
Intermediate
DB 2
Micro
Base
Stat
Base
Stat
Line
Base
Line
Statistic 3
Statistic 2
Statistic 1
Input
Data
Concepts
Interm.
Data 1
Concepts
Interm.
Data 2
Concepts
Output
Data
Concepts
Independent Concepts?
Independent (Data) Models?
No!
4-DataBase Model
Horizontal
Integration
Input Data
Concepts
Interm. 1
Concepts
Interm. 2
Concepts
Output Data
Concepts
Intersection
of
Concepts
Base
Line
Micro
Base
Stat
Base
Stat
Line






Object Type
Object
Population
Variable
Property
Value
Input Data
Concepts:
Intersection
Concepts:
Register
Questionnaire
Question
Routing
Interviewer
Instruction
Object Type
Object
Population
Variable
Property
Value
Base
Line
Interm. 1
Data
Concepts:
Outlier
Imputation
Imputation Rule
Micro Edit
Macro Edit
Weight

Intersection
Concepts:
Object Type
Object
Population
Variable
Property
Value

Micro
Base
Interm. 2
Data
Concepts:
Aggregation
Estimation
Weight
Intersection
Concepts:
Object Types
Objects
Populations
Variables
Properties
Values

Stat
Base
Output
Concepts
Cube
α, β, γ, τ Variable
Variable Grouping
Rows
Columns
Layout

Intersection
Concepts:
Object Type
Object
Population
Variable
Property
Value
Stat
Line
