The document discusses issues with metadata in statistics organizations. It notes that metadata collection is often seen as dull but important for ensuring good data quality. The document outlines three levels of metadata ambition: reliable, consistent, and standardized metadata. It emphasizes that without good metadata linking data and documentation, statistical errors can occur. The key is developing tools and processes that keep metadata synchronized with data as it passes between different stages and systems.
Jean-Pierre Kent
Statistics Netherlands

Misconceptions about Metadata
- Bo Sundgren (Statistics Sweden): Metadata collection is dull, expensive and time-consuming.
- Patty Adelaar (Dutch Social Planning Office): Metainfo is like cod liver oil: it is good for statisticians, but you have to ram it down their throats!

A vision for the future?
- Take care of the Meta, and the Meta will take care of the Data

Nature of this research
- Self assessment
  - What are we doing?
  - What are we doing wrong?
  - What should we be doing?
- Self evaluation
  - What can we learn from others?
  - What can we teach others?

Without good metadata no good data
What can go wrong?
- The metadata don't match the data
  - They are wrong
  - They are incomplete
  - They are noisy
- Metadata are inconsistent
- Metadata differ from statistic to statistic

The metadata don't match the data
[Diagram labels: Assignment 1: design a statistic; Assignment 2: document it; Concept; Statistician 1; Statistician 2; Process; Data; Documentation]

Three ambition levels for good meta
- First level: reliable metadata
- Second level: consistent metadata
- Third level: standard metadata

Ambition level 1: reliable metadata
- The metadata need to be truthful
- They must tell
  - the correct story
  - the whole story
  - the relevant story

How to avoid divergence?
- Before computer era: all data on form with preprinted metadata
  - Questionnaire
  - Card file
  - Published tables
- Punched cards were the first case of data without metadata
- Computers are the cause of the problem!

Keeping data and meta synchronised
[Diagram labels: Concept; Statistician; Machine-readable metadata; Program 1; Program 2; Process; Data; Documentation]

Without good metadata no good data
What can go wrong?
- The metadata don't match the data
- Implementation metadata are different from design metadata

Three ambition levels for good meta
- First level: reliable metadata
- Second level: consistent metadata
- Third level: standard metadata

Design vs. implementation
- Metadata are specified by design
- Implementation takes place in various software separately
  - Input software (Blaise, EDI)
  - Database software (Access, SQL Server)
  - Processing software (VB)
  - Publication software
- Is this semantically neutral?

How to avoid multiple implementation
- All programs speak the same language
  - Has become possible recently, thanks to component architecture
- One program can translate its metadata for other programs
  - Blaise, using its metadata script language Cameleon

Active metadata
- Active metadata = Embedded metadata + metadata-driven processes
- A process is fully conditioned by definitions contained in the metadata
  - Constraints: data are forced to obey the constraints, or rejected if this is not possible
  - Dependent variables: computations are carried out
  - Processing rules are executed

Migrating to a component architecture
- Different programs, incompatible storage formats
[Diagram labels: Software A; Format A; Software B; Format B]

Migrating to a component architecture
- Software and storage are made independent
[Diagram labels: Software A; Interface A; Format A; Software B; Interface B; Format B]

Migrating to a component architecture
- Both interfaces are combined into a common interface
[Diagram labels: Software A; Software B; Interface A; Interface B; Interface A + B (or standard); Format A; Format B]

Migrating to a component architecture
- Working on the storage format no longer affects the software
[Diagram labels: Software A; Software B; Standard Interface; Format A; Format B; Standard format]

Towards a unique metadata system
[Diagram labels: Data & Metadata; Standard data/metadata interface; Design software; Input software; Throughput software; Publication software; Presentation software]

Without good metadata no good data
What can go wrong?
- Incomplete metadata
- Diverging data and metadata
- Implementation metadata differ from design metadata
- The metadata of different data sets are different tools

Three ambition levels for good meta
- First level: reliable metadata
- Second level: consistent metadata
- Third level: standard metadata

Diverse definitions
Who is unemployed?
- Someone with an unemployment allowance?
- Someone registered at the employment office?
- Someone who declares to be looking for work?

Two possibilities
- It is one concept: choose one definition, or merge them into a new one
- We have three distinct concepts: use all three definitions, but use different names and make different variables

Three guarantees for reliable metadata
- Embedded metadata
- Metadata-driven processes
- Active metadata

Embedded metadata (1)
- Data and metadata are firmly tied together
  - A data set consists of both data and metadata
- Tools are metadata-aware
  - They know how to access the metadata
  - They know how to understand the metadata

Embedded metadata (2)
- Tools access data only through the metadata
  - The metadata tell them how to access the data
- They enforce the relationship between data and metadata
  - They apply all the rules defined in the metadata

Embedded metadata
[Diagram labels: Survey design; STANDARDS; Process; Data set 1 (Data, Metadata); Data set 2 (Data, Metadata); METADATA (shown three times)]

Rules vs. Tools
- Rules
  - Imposed metadata models
  - Central metadata repositories
  - Extra work, statisticians don't understand why
- Tools
  - Automatic support for standards
  - Automatic collection or computation of required information
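
The embedded and active metadata ideas above (a data set carries its own metadata, and tools reach the data only through it) can be illustrated with a minimal Python sketch. Everything named here is hypothetical and not taken from the presentation or from any Statistics Netherlands tool: the classes `VariableMeta` and `DataSet`, the variables `age`, `hours_worked` and `full_time`, and the 35-hour threshold are assumptions chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional

Record = Dict[str, Any]


@dataclass
class VariableMeta:
    """Hypothetical metadata for one variable."""
    name: str
    type_: type
    # Constraint: values that fail it are rejected.
    constraint: Optional[Callable[[Any], bool]] = None
    # Dependent variable: computed from variables defined earlier.
    derive: Optional[Callable[[Record], Any]] = None


@dataclass
class DataSet:
    """A data set consists of both data and metadata."""
    metadata: List[VariableMeta]
    records: List[Record] = field(default_factory=list)

    def add(self, raw: Record) -> None:
        """The only way in: every rule in the metadata is applied."""
        record: Record = {}
        for var in self.metadata:
            if var.derive is not None:
                value = var.derive(record)        # computation is carried out
            else:
                value = var.type_(raw[var.name])  # type taken from metadata
            if var.constraint is not None and not var.constraint(value):
                raise ValueError(f"{var.name}={value!r} violates its constraint")
            record[var.name] = value
        self.records.append(record)


# Illustrative metadata; names and thresholds are assumptions.
survey = DataSet(metadata=[
    VariableMeta("age", int, constraint=lambda v: 0 <= v <= 120),
    VariableMeta("hours_worked", float, constraint=lambda v: v >= 0),
    VariableMeta("full_time", bool, derive=lambda r: r["hours_worked"] >= 35),
])
survey.add({"age": "42", "hours_worked": "38.5"})
```

In this sketch a record with `age` 130 would be rejected with a `ValueError` before it reaches `survey.records`, which is one possible reading of "data are forced to obey the constraints, or rejected".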
Rules vs. Tools
NO RULES WITHOUT TOOLS

Metadata-driven processes
- Processes are implemented in a generic way
- You don't write a program to aggregate a specific data set. Instead:
  - The IT department writes, once, a program that can aggregate any data set
  - The statisticians provide metadata containing all the information required in order to aggregate the data

From multiple databases to a 4-database model

Stove Pipe Model
[Diagram labels: Input DB; Intermediate DB 1; Intermediate DB 2; Output DB; Statistic 1; Statistic 2; Statistic 3]

Disadvantages Stove Pipes
- No co-ordination
  - Definitions
  - Aggregation & estimation techniques
  - Figures
- No integration
  - No relations between figures
  - No relations between databases
- Inconsistency
- First step: Collect, Compare, Confront
- 4 DataBase Model: vertical integration
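
The metadata-driven process described above (one generic aggregation program, with the statisticians supplying the metadata) might look roughly like the following Python sketch. The function `aggregate`, the metadata keys `group_by` and `measures`, and the sample variables `region`, `sector` and `turnover` are all hypothetical; the presentation does not specify a metadata layout.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]

# Aggregation functions that metadata may refer to by name.
AGGREGATORS: Dict[str, Callable[[List[Any]], Any]] = {
    "sum": sum,
    "count": len,
    "min": min,
    "max": max,
}


def aggregate(records: List[Record], metadata: Dict[str, Any]) -> List[Record]:
    """Generic aggregation: nothing here is specific to one data set.

    The metadata (hypothetical layout) names the classification
    variables in "group_by" and maps each measured variable to an
    aggregation function name in "measures".
    """
    keys = metadata["group_by"]
    groups: Dict[tuple, List[Record]] = defaultdict(list)
    for record in records:
        groups[tuple(record[k] for k in keys)].append(record)

    table = []
    for key, members in groups.items():
        row = dict(zip(keys, key))
        for variable, how in metadata["measures"].items():
            values = [m[variable] for m in members]
            row[f"{how}_{variable}"] = AGGREGATORS[how](values)
        table.append(row)
    return table


# Written by the statistician: metadata only, no program code.
turnover_meta = {"group_by": ["region"], "measures": {"turnover": "sum"}}

records = [
    {"region": "North", "sector": "A", "turnover": 100.0},
    {"region": "North", "sector": "A", "turnover": 50.0},
    {"region": "South", "sector": "B", "turnover": 70.0},
]
table = aggregate(records, turnover_meta)
```

Aggregating the same records by `sector` instead would require only a different metadata dictionary, not a new program, which is the point the slide makes.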
[Diagram labels: Stovepipe Model: Input DB; Intermediate DB 1; Intermediate DB 2; Output DB; Statistic 1; Statistic 2; Statistic 3. 4-DataBase Model: Base Line; Micro Base; Stat Base; Stat Line]

4-DataBase Model: Horizontal Integration
[Diagram labels: Input DB; Intermediate DB 1; Intermediate DB 2; Output DB; Statistic 1; Statistic 2; Statistic 3; Base Line; Micro Base; Stat Base; Stat Line; Input Data Concepts; Interm. Data 1 Concepts; Interm. Data 2 Concepts; Output Data Concepts]
- Independent Concepts?
- Independent (Data) Models?
- No!

Intersection of Concepts
[Diagram labels: Input Data Concepts; Interm. 1 Concepts; Interm. 2 Concepts; Output Data Concepts; Base Line; Micro Base; Stat Base; Stat Line]
- Object Type, Object, Population, Variable, Property, Value

Base Line
- Input Data Concepts: Register, Questionnaire, Question, Routing, Interviewer Instruction
- Intersection Concepts: Object Type, Object, Population, Variable, Property, Value

- Interm. 1 Data Concepts: Outlier, Imputation, Imputation Rule, Micro Edit, Macro Edit, Weight
- Intersection Concepts: Object Type, Object, Population, Variable, Property, Value
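
The relation between stage-specific concepts and the shared intersection concepts can be sketched with plain Python sets. This is only an illustration: the stage keys `input` and `intermediate_1` are hypothetical names, only the two stages listed above are included, and splitting the flattened slide labels into individual concept names (for example treating "Object Type" and "Object" as separate concepts) is an assumption.

```python
# Concept vocabularies per stage, as read from the slides above.
# The split into individual concept names is an assumption.
stage_concepts = {
    "input": {
        "Register", "Questionnaire", "Question", "Routing",
        "Interviewer Instruction",
        "Object Type", "Object", "Population", "Variable", "Property", "Value",
    },
    "intermediate_1": {
        "Outlier", "Imputation", "Imputation Rule",
        "Micro Edit", "Macro Edit", "Weight",
        "Object Type", "Object", "Population", "Variable", "Property", "Value",
    },
}

# Concepts every stage has in common: the "intersection concepts".
shared = set.intersection(*stage_concepts.values())

# What remains is specific to each stage.
specific = {stage: concepts - shared for stage, concepts in stage_concepts.items()}

for stage, concepts in specific.items():
    print(stage, "adds:", sorted(concepts))
print("shared:", sorted(shared))
```

A common data/metadata interface would only need to understand the `shared` set; each stage's tools add their own vocabulary on top of it.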