
Fusion: Integrating heterogeneous EHR data from diverse sources

A. Specific Aims
The Institute of Medicine (IOM) reported in 2000 that as many as 98,000 people die annually as a result of medical errors [16]. Several years later, a more comprehensive study found that more than 195,000 patients die each year from preventable in-hospital medical errors. This is equivalent to more than one 747 jumbo jet full of patients crashing every day. The Centers for Disease Control and Prevention's annual list of leading causes of death does not include medical errors; if it did, preventable medical errors would rank third, behind heart disease and cancer [36]. Additionally, hospital-acquired infections account for more than 90,000 deaths each year and are largely preventable [7][10]. Few individuals now doubt that preventable medical injuries are a serious problem. However, there is still uncertainty among medical professionals as to the best means to combat this epidemic [17]. Many high-hazard industries begin combating preventable mistakes by focusing on the flaws within the system, not by blaming and shaming the people in those systems [17]. For this reason, interest has increased in technologies that support safer care by reducing the complexity and problem areas within the health care system. Studies report that Electronic Health Record (EHR) systems could obviate many of the preventable deaths while saving 81 billion dollars annually [14]. While EHR systems are a step in the right direction, computerized assistance can help even further. One key assistive technology has been the computerized physician order entry (CPOE) system. In one experiment, implementation of a CPOE application leveraging an existing EHR system decreased serious medication errors by 55% [4]. CPOE systems aid in solving the problem, but they are not a complete solution: these systems are only as good as the data they have access to. The current data landscape makes access to all the timely, relevant information incredibly challenging. At least 35 different laboratory information systems are used in the U.S. [2]. Adding to the complexity, between 200 and 350 vendors offer EHR products in the U.S. that interact with any number of the lab systems [31][22][9]. These are just the publicly sold systems; many more hospitals have developed their own in-house, custom EHR systems. Thus it is no surprise today that a patient's data can easily be lost during a migration from one hospital to the next or when a hospital switches from one vendor to the next. Without access to all the relevant data, these systems will never be able to prevent the crashing jumbo jets from killing the patients within. Not only does the right data need to be present at the right time, but the computerized systems really need information, not data, to act upon. Artifacts from different EHR systems often distort the information and hamper more advanced systems like CPOEs from working to their full potential. We seek to address these shortcomings by transforming and integrating all of a patient's EHR data into the requisite information, so that a Decision Support or CPOE system has all the data it needs when it is needed. While some have proposed standardization as a means of solving the integration problem, such approaches have suffered from many hurdles. A standard has to be created (there are many competing standards today). Even if one were settled on (for instance Health Level Seven's HL7), it would only be effective once all vendors supported the standard.
This would involve significant work and money, and currently few EHR companies comply with the accreditations available [40]. Finally, even if all these problems were overcome, the data would be syntactically integrated but not semantically integrated. That is, a concept such as "female" could be represented by a 2 at one hospital and an F at another; a CPOE will not know that they are the same without being told. The other possibility is to implement a semi-autonomous algorithm to not only integrate the data from disparate sources, but to integrate the meaning, or semantics, of the data across organizations into a common information and knowledge representation. Some algorithms have been created to do exactly this; the best known is BioSTORM, an integrated knowledge representation system aimed primarily at the bio-surveillance domain. Its chief problem is the overly simplistic and brittle assumptions made in its architecture [21]. Regardless, an integration and knowledge creation algorithm will have to undergo rigorous testing before medical organizations will trust it with their data and put their faith in the results. Unfortunately, there has been a complete lack of validation and systematic verification of medical data integration algorithms. A gold standard to compare against is sorely needed for the field to progress and gain credibility.
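To make the syntactic-versus-semantic gap above concrete, the following is a minimal, hypothetical sketch in Python, with made-up vendor codes and concept URIs, of mapping two vendors' local encodings of "female" onto a single shared concept, which is the kind of alignment the proposed Fusion Algorithm would perform automatically:

```python
# Hypothetical illustration: two vendors encode the same concept differently.
# Semantic integration maps both local codes onto one shared concept URI.

SHARED_FEMALE = "http://example.org/concept/AdministrativeSex#Female"  # assumed URI

# Per-vendor code tables (illustrative only)
VENDOR_CODE_MAPS = {
    "hospital_a": {"1": "http://example.org/concept/AdministrativeSex#Male",
                   "2": SHARED_FEMALE},
    "hospital_b": {"M": "http://example.org/concept/AdministrativeSex#Male",
                   "F": SHARED_FEMALE},
}

def to_shared_concept(source: str, local_code: str) -> str:
    """Translate a vendor-local code into the shared concept URI."""
    return VENDOR_CODE_MAPS[source][local_code]

# Both records now resolve to the same concept, so a CPOE can compare them.
assert to_shared_concept("hospital_a", "2") == to_shared_concept("hospital_b", "F")
```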

Specific Aims:

1. To produce an algorithm that semantically integrates heterogeneous data from diverse sources. This algorithm will not only integrate the data but use the extracted knowledge to further augment the subsystems that provide it with data and/or information. This will be a simultaneously top-down and bottom-up process, enabling preexisting knowledge and recently encountered experiences to be transformed into something better than the sum of its parts. Further, we contend that existing, common representations, from relational databases to textual repositories, occlude much of the information present in the data from outside extraction and use.

2. To generate a gold standard to serve as an evaluation criterion against which integration algorithms can be judged, with medical doctors serving as domain experts. Providing an objective fitness function is a very successful strategy that has spurred areas from natural language processing (NLP) to decision support to flourish. For the class of heterogeneous integration algorithms, particularly within the medical community, no such gold standard exists. The inability to objectively compare implementations of disparate medical data integration is a real rate-limiting factor, and one commonly pointed out in the literature.

B. Background and Significance


Merriam-Webster defines a mistake as: to blunder in the choice of, to misunderstand the meaning or intention of, and to identify wrongly. When a physician prescribes the wrong medicine for a patient, a blunder has been made that is often catastrophic. When a surgeon misunderstands the orders, the wrong body part is operated on. When a patient is misdiagnosed, the consequences can be fatal. So fatal, in fact, that it is estimated that more than 195,000 patients die each year from preventable in-hospital medical errors. This is equivalent to more than one 747 jumbo jet full of patients crashing every day. Indeed, the safety record of the health system in the United States lags far behind the similarly complex aviation industry: a person would have to fly nonstop for 438 years before expecting to be involved in a deadly airplane crash [19]. These numbers don't encompass the whole problem, because in-hospital patients are a small fraction of the total patient population. Retail pharmacies fill countless prescriptions each year, nursing homes watch over the elderly and infirm, and doctors' offices mete out care in an outpatient setting. The 195,000 patients dying each year are in-hospital patients, the tip of the iceberg. No one knows how deep the iceberg extends beneath the United States' troubled waters.

B.1 Medical Errors

The American Hospital Association (AHA) lists the following as common types of errors: incomplete patient information, unavailable drug information, and miscommunication of drug orders [19]. Patient information is incomplete and not accessible in its entirety at the point of care. A patient's allergies or the full list of medications being taken might hide in the pharmacy's database. Previous diagnoses for past illnesses or additional current ailments might reside in another doctor's electronic health record (EHR). Pending lab results might be tucked away in a third party's laboratory information management system (LIMS). At least 35 different LIMS are used in the United States, and 20,580 labs have been certified by the U.S. Department of Health and Human Services. It is common for practices to receive results for a patient from multiple laboratories. What makes matters worse is that between 200 and 300 vendors offer ambulatory EHR products in the U.S. [28], on top of the number of custom, in-house EHR systems that exist. Another problem area is the current disconnect between warnings that are released by the FDA or pharmaceutical companies and the patients taking the drug. Doctors arguably are supposed to be that link, but they can be caught unaware for many reasons. Time is a scarce resource, and doctors and their staff are often overworked. The method of dissemination can introduce difficulties: paper notices can get lost, emails can be put in a spam folder, or websites can go down. Even if time and dissemination don't fail, a doctor must remember all the drugs for all of the patients under their care, which is simply not possible without using an electronic system more specialized than the EHR, the computerized physician order entry (CPOE) system. The CPOE allows medical staff to electronically order drugs and automatically check for interactions, complications, and warnings. However, a CPOE system is only as good as the information it has to work on. Finally, the orders for drugs themselves can be mistakenly communicated. "Name confusion is the most heinous example," says Peter Honig, M.D., an FDA expert on drug risk-assessment [19].
Take the sound-alike names of Lamictal, an antiepileptic drug, and Lamisil, an antifungal drug. The volume of dispensing errors for these two drugs was so vast that campaigns had to be waged to warn physicians and pharmacists of the potential confusion. It still hasn't stopped epileptic patients from dying when they are erroneously given the antifungal drug Lamisil instead of Lamictal. The FDA has clamped down on sound-alike names of drugs, but the problem is far from eradicated. A CPOE system can only go so far in assisting to increase the specificity of the drug ordered. Sloppy handwriting and wrong dosing might disappear, but not until the CPOE system can understand the semantics of a patient's condition and the drugs that should be used to treat it will there be any true progress in eliminating these types of mistakes. After all, a CPOE system would not prevent a doctor from treating an epileptic seizure with antifungal medicine, because a patient might have both conditions. If a CPOE system had access to all the information relevant to a patient, unlike the current state, it could realize the patient is in the hospital for a seizure episode based on the chief complaint entered during the patient's admittance, and further know that antifungal drugs are not ordered for use by nurses in hospitals but rather as an outpatient treatment.
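The kind of check described above can be illustrated with a minimal, hypothetical sketch; the drug classes, the chief-complaint label, and the rule itself are illustrative assumptions, not an actual CPOE rule base:

```python
# Hypothetical sketch of a semantics-aware order check: flag an order whose
# drug class does not plausibly relate to the patient's chief complaint.

# Illustrative knowledge, not a real formulary or rule base.
DRUG_CLASS = {"Lamictal": "antiepileptic", "Lamisil": "antifungal"}
COMPLAINT_EXPECTED_CLASSES = {"seizure episode": {"antiepileptic"}}

def check_order(chief_complaint: str, drug: str) -> str:
    """Return a warning if the ordered drug's class is unrelated to the complaint."""
    drug_class = DRUG_CLASS.get(drug, "unknown")
    expected = COMPLAINT_EXPECTED_CLASSES.get(chief_complaint, set())
    if expected and drug_class not in expected:
        return (f"WARNING: {drug} ({drug_class}) is not typically used for "
                f"'{chief_complaint}'. Confirm this is not a sound-alike error.")
    return "OK"

print(check_order("seizure episode", "Lamisil"))   # flags the likely mix-up
print(check_order("seizure episode", "Lamictal"))  # OK
```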

B.2 Perspectives of Error

While many errors are being committed every day, who is to blame? The state of malpractice suits might suggest that the doctors and staff are to blame. In reality, however, medicine is practiced within a very complex system. This complexity, coupled with the limitations of humans, makes avoiding mistakes an extremely demanding task. In its landmark report, To Err is Human, the Institute of Medicine (IOM) suggested moving away from the traditional culture of picking out an individual's failings and then blaming and shaming them [17][19]. The IOM asserts that systems failures are at the heart of the problem rather than bad physicians. This is a crucial scientific foundation for the improvement of safety found in all successful high-hazard industries. Errors are seen as consequences rather than causes, having their origins in upstream systemic factors. The central idea is that of system defense within the complex environment of health care. In aviation maintenance, an activity whose complexity is close to that found in medical practice, 90% of mistakes were judged to be blameless [23]. Over the past decade, research into error management has identified two essential components for managing unsafe acts: limiting their incidence, and creating systems that are better able to tolerate occurrences of errors and contain their damaging effects. As a result, interest in technological solutions to manage the complexity of the health care system has surged [19].

Figure 1: Process map within healthcare and between healthcare and secondary care [3]. Pediatric cardiothoracic surgical team members at two large urban medical centers delineated the steps of care from the patient's perspective, starting with the referral for surgery until the child's first post-discharge follow-up visit. Notice that different groups of individuals with different backgrounds and levels of experience are involved, and that this process crosses multiple organizational boundaries.

B3. Complexity

Healthcare is more complex than any other industry in terms of relationships. There are over 50 different types of medical specialties, which in turn have further subspecialties, interacting with each other and with an equally large array of allied health professionals. The more complex a system is, the more chances it has to fail. Figure 1 demonstrates the complexity of one routine activity at a pediatric cardiovascular surgical care unit in a hospital. Medical care today depends on the interactions between many different individuals and parts of the healthcare system. As complex as health care is today, it is just as fragmented. There is an overemphasis on focusing and acting on the parts without looking at their relationships to the whole: the patients and their health [25]. However, complex systems are more than the sum of their parts [34][18]. Unfortunately, specialization, with its commensurate focus on only a narrow portion of a patient's medical history, has exploded without a comparable growth in the ability to integrate the information surrounding a patient. As a result, our ability to turn data into information and information into knowledge has diminished [25]. Thus, while the US continues to make advances technologically, its healthcare system remains fragmented, ranking 37th in performance compared to other systems in the world [35]. Current information technology systems focus narrowly on the care of individual diseases, rather than higher-level integration of care for prevention, mental health, multimorbid conditions, and acute concerns [12][26][27]. Knowledge generation is narrowly partitioned in disease-specific institutes and initiatives without sufficient balancing research that transcends these boundaries.

B4. Data, the Rate Limiter

Healthcare suffers from data explosion [1,2]. The US Government Census [6,7] indicated that in 2002 there were 6,541 hospitals and in 2005 there were 7,569. There were 33.7 million hospital inpatient discharges in 2002; on any given day, there are 539,000 hospital inpatients nationwide, excluding newborns. Each patient generates a multitude of structured lab reports and unstructured summary notes by ER staff, nurses, and doctors. There are one or more EHR systems for a patient at each of the different hospitals the patient might have visited during the course of his or her life. Figure 2 illustrates the diverse ecology of today's clinical data center. It is common for departments to generate the same type of data but have it provided by different vendors or custom systems.

Figure 2: Different types of data in a Hospital's Clinical Data Center [1]. Analog data from ECG waveforms and sound, binary data from images and videos, structured data from the EHR systems of the various departments, and free text from notes and reviews are all present for each patient seen at each hospital the patient has been to. Different departments have different types, formats, and intent behind their data. These often form silos that must be breached to allow for true integration of the patient's medical history.

B5. Standards aren't the solution

Standards have been around for more than 100 years. The early standards were for single and simple things like the wattage of a light bulb or the spacing of threads on a screw. The complexity of the systems and industries requiring standards has grown significantly since then. An EHR, for example, is so complex it cannot be described in a single standard, but will require hundreds of new and existing standards [32][33]. Furthermore, there are many competing standards organizations that need to be involved, such as: the International Standards Organization (ISO), the Clinical Data Interchange Standards Consortium (CDISC), the International Healthcare Terminology Standards Development Organization (IHTSDO), the World Health Organization (WHO), the National Council for Prescription Drug Programs (NCPDP), Health Level Seven (HL7), the Health Information Technology Standards Panel (HITSP), the European Committee for Standardization (CEN), and ASTM International. Each of these organizations has its own working groups on different aspects of the EHR, such as CEN's: Data Structure, Messaging and Communications, Health Concept Representation, Security, Health Cards, Pharmacy and Medication, Devices, and Business Requirements for EHRs. Indeed, Charles Jaffe, M.D., the CEO of HL7, is quoted as saying, "There is not a lack of standards; there are, in fact, too many standards and too many organizations writing them" [15]. Sam Karp, Vice President of Programs at the California Healthcare Foundation, testified to the Institute of Medicine as follows: "Our experience with the current national data standards effort is that it is too slow, too cumbersome, too political and too heavily influenced by large IT vendors and other large institutions... The current approach to standards development is too complex and too general to effectively support widespread implementation."

Assuming that at some point in the next five years a national EHR standard is reached, the task of compliance must then be surmounted. This is a rather daunting task because of the sheer size of the healthcare industry in the United States. There are at least 50 different medical specialties and subspecialties at more than 7,569 hospitals, served by over 300 private EHR vendors and/or custom-built EHR systems [28][5]. These hospitals are served by 20,580 laboratories, each using a version of a LIMS from over 35 different vendors. Then there are the varied pharmacists using their own systems. Vendors must be convinced to implement the new standards with little benefit; the additional cost is not compensated for directly. Adoption is a significant hurdle in and of itself [32][33]. When all of this has been accomplished, data interoperability will finally be realized. However, the integration of this data will only be superficial. While the structure of the data might be dictated by the standards, the content of the data, the information within the data, remains isolated. One vendor might represent the concept of human female by the numeral 2, another by the letter F, and a third by a reference to another location (a demographics table, for instance). Even higher-level concepts, like the history of all medications taken by a patient, can be represented in an uncountable number of ways while still complying with the letter of the standard. Worse, a standard cannot account for all the data a vendor might have, so there are often locations within the standard for optional data. All of these factors make true interoperability extremely difficult, even when all parties comply with all the diverse standards that exist.

B6. Grid-based approach

The Novo Grid uses agents as distributed software entities to exchange information between people and applications. Agents are task-specific software programs that can be installed in any location. The idea behind the concept is to create small, modular programs interacting through a workflow activity. This both encourages efficient composition and reduces design time. Once installed, agents could theoretically interact with their local environment to extract and insert information and exchange the information with other agents to automate workflow tasks. However, the agents are still specific instantiations of domain knowledge directed at a specific data source at a specific location. Even though the agents are small and lightweight pieces of code, this is still no guarantee of their efficiency and usefulness. The most important thing from a data integration standpoint is what data they were created to interact with. For example, an agent installed on a server in a hospital's data center can listen for HL7 result messages from the hospital's interface engine. Without a mechanism for formalized representation of knowledge about what the data is, what the formats for the data are, what the domain's constraints are, and what the environment's specifics are, it is difficult to make this process efficient. Its programming will be overly specific to a narrowly defined input and transformed output, and will not allow for the degree of generalizability necessary to overcome the exponential combinatoric problem that exists within the health care system: namely, any party can want data from any other party at any time.
Finally, for inter-agent communication, a blackboard or post-office metaphor is implemented. The Novo Grid provides a messaging post office that allows agents to send information from one to another. However, there still must be a common mechanism for communication, which for the Novo Grid is a vendor-specific standard. That is, the agents communicate through message passing that they have been pre-programmed to understand. There is, however, an alternative: knowledge about the system could be explicitly represented so that it is comprehensible to both humans and machines.

B7. The Semantic Web


The semantic web is an explicit, graph-based representation of knowledge that is both human and machine comprehensible and computable. This framework is designed to enable data to be shared, federated, and enhanced regardless of system boundaries. The semantic web was designed around the Open World Assumption: that truth is independent of whether it is known by any single entity. That is, the graph-based knowledge formalized into an ontology can be extended by another entity. The language used to represent knowledge is called the Resource Description Framework (RDF). Its basic representation of a statement of knowledge is the triple: a subject, a predicate, and an object. Subjects and predicates are resources that refer to concepts which, via the Open World Assumption, can be extended or changed at any time. Objects can be either resources or literal values (text, numbers, dates, etc.). Resources are unique and globally identifiable through the use of a Uniform Resource Identifier (URI), of which the familiar Uniform Resource Locator (URL) from the internet is a subset.

B8. BioSTORM (Biological Spatio-Temporal Outbreak Reasoning Module)

One of the most well-known attempts at semantic integration of disparate data is BioSTORM, a system for automated surveillance of diverse data sources. The system is agent-based, like the Novo Grid, with a three-layered architecture consisting of a Knowledge Layer, an Agent Platform, and a Data Source Layer. The Knowledge Layer is composed of a library of problem-solving methods and a collection of ontologies defining the system components. The problem-solving methods are implemented either by BioSTORM or by domain experts according to an Application Programming Interface (API). The ontologies in this layer describe system variables, the problem domain, and deployment configurations. The Agent Platform is the middle layer, and it primarily deploys agents based upon the specifications in the Knowledge Layer. The agents act according to the knowledge describing the tasks and problem domain. The Data Source Layer describes the environment through an ontology. This provides a binding between the tasks the agents are to undertake and the actual physical resource [21]. However, BioSTORM has several shortcomings. It is focused on the bio-surveillance area within public health. Its data broker component is concerned with mapping from the raw data to the needs of a particular problem-solving method from the Knowledge Layer. This does not scale, because a particular method might require any given subset of data within the cloud of all patient EHR data. There are simply too many vendors supplying EHR systems to a vast array of different users who have any number of different data sources and types. Then couple this with all the possible uses the patient, the doctors, the researchers, the government, etc. might have for the data. Furthermore, this methodology requires a priori knowledge of the data sources: the implementer of an agent using the data broker must know what they are looking for and where it is located. It does not allow for exploration of and change in the data present.
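To make the triple representation described in B7 concrete, the following is a minimal sketch using the Python rdflib library; the patient identifier, namespace URI, and property names are illustrative assumptions, not part of any existing ontology:

```python
from rdflib import Graph, Literal, Namespace, RDF, URIRef

# Illustrative namespace; a real deployment would reuse published ontologies.
EX = Namespace("http://example.org/fusion/")

g = Graph()
patient = URIRef(EX["patient/12345"])          # subject: a globally unique URI

# Each statement is a (subject, predicate, object) triple.
g.add((patient, RDF.type, EX.Patient))          # object is another resource
g.add((patient, EX.hasAdministrativeSex, EX.Female))
g.add((patient, EX.hasChiefComplaint, Literal("seizure episode")))  # object is a literal

# Because of the Open World Assumption, another party can later add more
# statements about the same URI without invalidating these.
print(g.serialize(format="turtle"))
```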

C. Preliminary Studies

C.1 Modeling Data

The semantic web will be used to enable extensions and enhancements to the data. An ontology is a formal specification of a conceptualization. It describes kinds of entities and how they are related, versus a data schema, which speaks more to message formats (syntax) than to knowledge representation. Ontologies also provide generic support for a machine to reason on by making data machine understandable (historically it has been only human comprehensible). An ontology defines individuals, classes, and properties within a domain, formally organizing knowledge and terms. The Web Ontology Language (OWL) is the mechanism for doing just this. Statements are asserted about a particular subject in the form of a triple composed of a subject, a predicate, and an object. From these triples, or statements, logical consequences and facts entailed by the semantics can be derived. A Uniform Resource Identifier (URI), of which the Uniform Resource Locator (URL) is a sub-type, is used to specify unique ids for a subject, its properties, and in some cases the object of the statement. It is the use of URIs that makes it easy to extend ontologies, by uniquely identifying a concept and taking into consideration what others have asserted about the same URI. It is precisely these features that we leverage in the proposed Fusion System.

C.2 Integrated Department Information Service (IDIS)

This mechanism of integration was used to augment the extraction of information from different data sources. The Intra-Departmental Integrated Search (IDIS) project integrated hierarchical (database) and textual (web page) search results, with each mutually reinforcing the other. The goal was to process a domain (www.cs.uiuc.edu) and enable better search results by leveraging all the data within that domain, from its relational databases (RDBs) to the textual data. We placed the RDB data within an Oracle database, which comprised things like faculty information and student rosters. Knowledge about the database was extracted from metadata (i.e., the schema) and from insight provided by the textual processing. The textual processing began as an automated crawl of the www.cs.uiuc.edu domain. Once all the HTML pages were collected, an open source text search engine was employed to extract phrases. The results were integrated via a web-based GUI. IDIS integrated across three different dimensions. First, it not only combined results from both a relational and a text database, it was able to leverage the results from one to get higher quality information from the other. After the user's keyword query was transformed by the SQL Query Composer into an appropriate query to the relational database, those results were utilized to provide the context upon which textual results were extracted from the large corpus of departmental web pages. The textual results then drove how the structured results from the database were presented inside the Presentation Layer.
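The two-stage flow just described, keyword query to SQL, then SQL results used as context for text retrieval, can be sketched as follows; the table name, column names, and scoring scheme are hypothetical stand-ins for the actual IDIS components:

```python
import sqlite3

# Hypothetical miniature of the IDIS flow: structured results provide context
# that re-ranks results pulled from a text corpus.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE faculty (name TEXT, research_area TEXT)")
conn.executemany("INSERT INTO faculty VALUES (?, ?)",
                 [("A. Smith", "databases"), ("B. Jones", "machine learning")])

web_pages = {  # stand-in for the crawled departmental pages
    "cs101.html": "Introduction to databases taught by A. Smith",
    "seminar.html": "Machine learning seminar hosted by B. Jones",
}

def idis_search(keyword: str) -> list:
    # Stage 1: keyword query composed into SQL against the structured data.
    rows = conn.execute(
        "SELECT name FROM faculty WHERE research_area LIKE ?", (f"%{keyword}%",)
    ).fetchall()
    context_names = {name for (name,) in rows}
    # Stage 2: structured results provide context for ranking textual results.
    scored = []
    for page, text in web_pages.items():
        score = (keyword in text.lower()) + sum(n in text for n in context_names)
        if score:
            scored.append((score, page))
    return [page for _, page in sorted(scored, reverse=True)]

print(idis_search("databases"))  # pages mentioning the topic or related faculty
```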

Figure 3: The IDIS System received queries from the internet, which were processed into SQL queries. The results of the SQL queries were then used to extract contextualized and relevant results from the text database, which contains all of the department's internet pages. Some of the data is injected into the relational database, while the majority remains in the text database. Most of the data in the relational database came from the department's structured data. The presentation layer not only displays the results but assists in future querying by providing context to both the user and the system.

This synchronicity is one of the key components we are going to utilize in the currently proposed research. There is information that is uniquely embodied in the textual data that simply cannot be conveyed in the structured, relational database repository. For example, RDBMSs traditionally do a poor job of efficiently storing the often sparse structure of text, which makes it extremely difficult to interact with. Likewise, the relational database has information explicitly represented in its structure (particularly its schema) that cannot be represented in the typical webpage. However, both of these hurdles are addressed in XML for humans and made machine understandable in the semantic web's RDF/OWL ontologies. Another very relevant area of integration was across time. IDIS used past context to enhance current search. The history of a user's searches and interactions with the site provided a trajectory through which current needs were channeled. For instance, while prioritizing a user's query results, past activity within the area of courses offered by the college would cause professors related to those past searches to be ranked higher and displayed more prominently. Finally, search was integrated with navigation. Indeed, how the user interacted with the results provided useful information for current and future searching as well. A user might drill down on a professor's research interests to then find classes related to those interests. These two cases are particularly interesting, because both can be conceived of as trajectories or graphs of a user's behavior. In both cases, the user's current actions extend the graph representing all past interactions with the information contained across the textual and relational databases. This is one of the reasons why using the semantic web, a graph-based representation of data, should be so powerful in our proposed project. Many of these synergies are made easier and/or augmented by using a representation that more closely approximates the phenomenon we are trying to elicit. From the user's perspective, the IDIS system completely abstracted the representation of the data away from the user's interaction. The users of the system did not need to know the structure of the databases nor even how to interact with them (i.e., the SQL query language). Rather, the user interacted with the system and requested additional information by navigating through the relationships between the information. In some cases that information was present in the user's head (via keyword queries), but in other cases it was present in the representation of the information presented to the user by the system.

C.3 Computerized Medical Systems

While at Computerized Medical Systems, I was responsible for creating a new system for integrating medical and health data from within the company and across its partners. This project differed from IDIS in that it dealt with a wide assortment of medical data, from digital images to demographic information to medical records. Each application had its own particular means of representing the data. Much of the company's future hinged on being able to access data uniformly from anywhere, because of the exponential growth in complexity as new applications were introduced that then needed to interact with existing applications. When every application needs to potentially interact with every other, typical methods using standards and message passing simply do not scale. This burden was threatening to halt future application development.
Worse, the hurdle for a collaborator to work with our products was even higher, because at least developers within the company had a familiarity with the internal standards. However, giving away proprietary formats represented a problem in both intellectual property and time. Even if there was no concern of inadvertently giving away a secret, collaborators would be burdened with spending the time to understand long specifications about each product's data requirements.

C.4 XML-to-Ontology (X2O)

My work at the School of Health Information Sciences (SHIS) has given me over 18 months of experience with all the technologies needed to complete the Fusion Algorithm. I will be presenting the work I have done while at SHIS at the upcoming SemTech 2009 conference in a talk titled, "What is in XML: Automated Ontology Learning and Information Transformation from Continuously Varying Clinical Data."

C.4.1 Raw Data

At SHIS, I have been researching and improving upon existing algorithms to integrate the Electronic Medical Records (EMRs) from 7 local hospitals. Each hospital's data is different from the others' and represents information from its emergency department and patient physician visits. The data are received in unstructured XML (see Figure 4). The schema for the XML is unknown and the structure can change without notice. Data are received as a stream from each hospital, with a packet representing a snapshot of what occurred in the previous ten minutes. Over 4 million messages have been received during the past 4 years. It is important to note that since there exists a trivial transformation of a database into XML, X2O can operate on both XML and relational databases. Relational databases are actually easier than XML, because a database necessitates structure, and a great amount of latent information resides within its required schema. Comma-delimited lists, another common file format, can also be trivially converted into an XML document. Outside of binary file formats, starting with XML has proved to be the most comprehensive starting point for most of the common, structured data formats. Further, X2O does not need an XML Schema Definition (XSD) for the XML it processes. Again, the mining of relevant information from such supplied structure, should it be available, is potentially very exciting; however, it isn't necessary, because we perform well without it. There is no real need to "standardize" the input data, because the ontologies describing the data present in X2O serve as a distributed mechanism for translating the data from machine incoherence into machine comprehension. Increased human readability and understandability is also a by-product of the X2O process.

Figure 4: Sample of the XML data received every 10 minutes. The contents of the XML are subject to change without notification.

C.4.2 The Mapping

As discussed in the introduction, the mappings are done through ontologies. The inherent open-world assumption of the Semantic Web means that adding further semantics, statements about concepts, is a trivial matter. The technology has such expectations built into it at every point, so when there is an update or replacement to a medical ontology (UMLS, for instance) there is only a growth in understanding by the machines that process the concepts, and no additional work on a human's part to update anything. There are four levels of ontologies in X2O. The first level defines at its most atomic what it means to be data, such as concepts and descriptions of those concepts. It is called the Core Data Model, and it is the foundation upon which the next level resides. For each type of data, from XML to databases, there exists a model that expresses all the nuances of that particular medium, the Data Specific Model. The third layer is the Domain Specific Model. It creates a mapping from within the domain, leveraging what the system explicitly knows from the previous levels to enable semi-autonomous mapping. A domain expert describes the elements and attributes from the perspective of the XML ontology.
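The kind of lifting this mapping enables, turning XML elements and attributes into ontology statements, can be sketched minimally as follows; the XML snippet, namespace, and class/property names are hypothetical, and the real X2O pipeline layers the four ontology models described above on top of such raw statements:

```python
import xml.etree.ElementTree as ET
from rdflib import Graph, Literal, Namespace, RDF, URIRef

EX = Namespace("http://example.org/x2o/")  # illustrative namespace

# Hypothetical hospital snippet; the real feeds change structure without notice.
xml_message = """
<visit id="v-001">
  <patient sex="F" age="76"/>
  <chiefComplaint>seizure episode</chiefComplaint>
</visit>
"""

g = Graph()
root = ET.fromstring(xml_message)
visit = URIRef(EX["visit/" + root.attrib["id"]])
g.add((visit, RDF.type, EX.Visit))  # typed instance (ABox) referencing a class (TBox term)

for element in root:
    # Each child element becomes a property of the visit (ABox statements).
    predicate = EX[element.tag]
    if element.text and element.text.strip():
        g.add((visit, predicate, Literal(element.text.strip())))
    for attr_name, attr_value in element.attrib.items():
        g.add((visit, EX[f"{element.tag}_{attr_name}"], Literal(attr_value)))

print(g.serialize(format="turtle"))
```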

The important distinction from other efforts, however, is that an exhaustive mapping of every possible value is not necessary. Rather, patterns are described within the data based upon location within a snippet relative to other elements and attributes. This flexibility captures all of the mappings traditionally found, but does so more elegantly and extensibly. Because of this feature, if a new value appears within the XML, it is usually still captured. The fourth and final ontology level could be expressed in the Semantic Web languages, but for efficiency it was expressed in the programming code. This is called the Action Model, and it provides the intentionality upon which the XML model, in this case, is operationalized. This is beyond just the programming necessary to control a computer's manipulation of the data. Within the code lies a model that expresses the statements and meaning behind each element and attribute encountered within the hospital data streamed through our system. It answers what can be said about a fundamental class located within the data and whether it will be described within the TBox versus the ABox.

C.4.3 Preprocessing

As the raw data is received, it isn't automatically processed into its final form. Rather, there exists an intermediary format that is quite novel. For each XML element, including all attendant attributes, a node is created. At the end of this pass, a hierarchical tree structure has emerged that has indexed and re-grouped the entire message. Because an XML element might not actually be meaningful but rather some attribute within the element is, this preprocessing essentially allows for a transformation of the message into the conceptual representation that X2O will explicitly construct.

C.4.4 Construction of TBox and ABox

The final stage of processing sees the data bifurcated into a TBox and an ABox. One particularly interesting property of X2O is that regardless of whether the XML possesses any metadata, the system does a beautiful job of elegantly creating a TBox for it. The TBox is compact, but as demonstrated in prior experiments, domain experts agree it accurately describes the concepts present. The ABox generated is a repository of the instances already present within the XML, and as such is a more meaningful form of that data than when it resided within the XML. This is because the ABox has further meaning applied to it through the generated TBox and the other models, like the Core Data Model, and any potential outside knowledge sources imported.

C.5 BLUEText

BLUEText addresses the problem of extracting information from the clinical text found within EHRs, which includes triage notes, physician and nurse notes, and chief complaints. Unfortunately, this text is often written in shorthand: lacking proper grammatical structure, prone to using abbreviations, and possessing misspellings. This hinders the performance of traditional natural language processing technologies that rely heavily on the syntax of language and the grammatical structure of the text. Figure 5 illustrates the architecture of BLUEText. It has superior results on clinical text, which is often not only unstructured but unconstrained. That is, it often is grammatically incorrect, misspellings are rampant, and abbreviations of concepts frequently occur. The information is often compressed, and so the deviation from proper structure contains more information than if the sentence had been properly written.
One such example is: "Pt a 76 yrs old Af/Am fem. w/ bp 50/120, Hx of HA x 3 years." The sentence could be properly written as: "The patient is 76 years old and an African-American female. She has an initial diastolic blood pressure of 50 mmHg and a systolic blood pressure of 120 mmHg. She has had three heart attacks that we are aware of."
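A minimal sketch of the kind of token-to-concept normalization such shorthand requires is shown below; the abbreviation table and concept labels are illustrative assumptions and stand in for BLUEText's ontology-driven semantic interpretation:

```python
import re

# Illustrative abbreviation/concept table; BLUEText instead maps tokens to
# domain-ontology concepts (e.g., UMLS) rather than to flat strings.
ABBREVIATIONS = {
    "pt": "patient", "yrs": "years", "fem.": "female", "fem": "female",
    "af/am": "African-American", "w/": "with", "bp": "blood pressure",
    "hx": "history", "ha": "heart attack", "x": "times",
}

def normalize(note: str) -> list:
    """Tokenize a shorthand clinical note and expand known abbreviations."""
    tokens = re.findall(r"[A-Za-z/.]+|\d+/\d+|\d+", note.lower())
    return [ABBREVIATIONS.get(tok, tok) for tok in tokens]

note = "Pt a 76 yrs old Af/Am fem. w/ bp 50/120, Hx of HA x 3 years"
print(normalize(note))
# ['patient', 'a', '76', 'years', 'old', 'African-American', 'female',
#  'with', 'blood pressure', '50/120', 'history', 'of', 'heart attack',
#  'times', '3', 'years']
```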

Figure 5: Schematic depiction of the BLUEText processes and ontologies. Existing knowledge sources such as UMLS are leveraged by the Semantic Knowledge of the system. Additionally, the Syntactic and Domain Knowledge are all used when generating the Conceptual Graph. After a preliminary text-preparation phase, the Minimal Syntactic Parser first parses the text and then analyzes the syntax. The results of the syntactic analysis form a parse graph that is comprised of tokens of text mapped to the Syntactic Knowledge. The Semantic Interpreter performs automated ontology mapping, using the parse graph to map from the tokens derived from the original text to the concepts from the domain knowledge. A separate algorithm also indexes these tokens semantically. The interpreter uses this knowledge and the parse graph to construct a conceptual graph that represents the content of the input text. The Conceptual Graph is an intermediary and generic output that builds on the parse graph and maps it to the appropriate concepts from the domain knowledge. All nodes in the conceptual graph have both a positional and a semantic index that will be used by the Output Constructors to extract different formats of output defined by user- and/or task-based queries. The Semantic Interpreter uses this information to improve the conceptual graph through a normalization and disambiguation process.
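As a rough illustration of the parse-graph-to-conceptual-graph idea described in the caption, the sketch below links tokens to assumed domain concepts in a small directed graph; the concept names and edge labels are hypothetical, not BLUEText's actual knowledge structures:

```python
import networkx as nx

# Hypothetical token -> concept mapping (a real system would resolve to UMLS).
token_concepts = {
    "fem.": "Concept:Female",
    "bp": "Concept:BloodPressure",
    "HA": "Concept:HeartAttack",
}

G = nx.DiGraph()
for position, (token, concept) in enumerate(token_concepts.items()):
    # Token nodes carry a positional index; concept nodes carry a semantic label.
    G.add_node(token, kind="token", position=position)
    G.add_node(concept, kind="concept")
    G.add_edge(token, concept, relation="denotes")

# Downstream "output constructors" could now query by position or by concept.
concepts = [n for n, d in G.nodes(data=True) if d["kind"] == "concept"]
print(concepts)
```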

D. Research Design and Methods


The overall research plan is summarized in the flowchart in Figure 6. Tasks on the flowchart are numbered as x.y, where x corresponds to the Specific Aim associated with the task, while y represents the specific task itself. The same numbering scheme is used in the text of this section, for easy reference to the figure.

Figure 6: Flowchart of activities in the project. Numbers in boxes correspond to the task numbers used in the text of Section D.

D.0 Choose 100 Records: The same corpus of data from our prior work (C.4.1) will be leveraged for this experiment as well. Because of our need to validate our results and create a gold standard, it was important to settle on a number of patient records that provides generalizability while not overwhelming the human medical experts. The records will come equally from the Cerner systems and the Allscripts data. We will attempt to keep an even distribution across age, race, gender, and primary diagnosis (ICD-9 code).

D.1 To produce an algorithm that semantically integrates heterogeneous data sources.

D.1.1 XML-to-Ontology (X2O): This system will be used with little modification. The system architecture currently fits into the conceived architecture of the Fusion Algorithm. The Fusion Algorithm takes as input the resultant TBox and ABox generated after X2O completes its processing of the raw XML messages. i. Results: Ontologies that can serve as input for the Fusion Algorithm. Also, this portion of the system controls what data the BLUEText (BLU) sub-system processes. In that context, the corpus of text resides solely in the free-text portions present within an XML message. X2O will determine when a section of XML is suitable to be processed by BLU based on heuristics such as the size of the value and prior information contained within the ontologies.

ii. Problems: One problem is that X2O might fail to provide BLU all the text that BLU could process, thereby missing information that could have been extracted. Work, and especially validation, will need to be done to ensure that X2O accurately filters the data BLU processes. iii. Alternate Approaches: There are alternatives to X2O; BioSTORM is one such example. In the worst case, BioSTORM could be used, but it is very unlikely that contingency would be needed. Further, we have already discussed in section B8 the shortcomings of BioSTORM, and why X2O is a superior choice. iv. Transition: Since most of the work for this component is already complete, any modifications that need to be done can occur concurrently with modifications of BLU and creation of Fusion. That is, only minor adjustments should be necessary, and they therefore should not be an obstruction to timely progress. v. Expertise: X2O was designed by the PI and other investigators on the grant.

D.1.2 BLUEText (BLU): Currently BLUEText processes text into ontologies, but not in the format required by the Fusion Algorithm (TBox and ABox). The refactoring to accomplish this will be significant but well understood and straightforward. There are no expected difficulties in this rework. Fortunately, as with the XML-to-Ontology algorithm, we were the creators of BLU, so we are confident in our assessment of the scope of the modifications required. i. Results: Some rather significant work remains until BLU will be in a state to contribute to Fusion. ii. Problems: The major concern is that our estimate of the scope of the changes necessary for BLU to comply with the input expected by the Fusion Algorithm is not accurate. This seems unlikely based upon our extensive experience with the system; in the worst case, the output of BLU could be treated as XML by X2O, and X2O could then transform the output of BLU into the output expected by Fusion. iii. Alternate Approaches: There are many NLP solutions, some of them specific to the medical domain, that could be used. However, most medical NLP solutions are proprietary and therefore expensive. The most well-known open solution is the NLM's SPECIALIST NLP Suite of Tools, developed by the Lexical Systems Group of the Lister Hill National Center for Biomedical Communications to investigate the contributions that natural language processing techniques can make to the task of mediating between the language of users and the language of online biomedical information resources. iv. Transition: As explained in D.1.1. v. Expertise: We designed the algorithm.

Figure 7: Overview of the system architecture of the Fusion System. Structured data is handled by the XML-to-Ontology sub-system through XML messages. The NLP algorithm, BLUEText, will process unconstrained text within the XML messages sent to the XML-to-Ontology algorithm. The output of both sub-systems is integrated within the Fusion Component. Finally, the results of the integration are used to enable both of the upstream algorithms to process further, while simultaneously enabling downstream applications like EHR, CPOE, and Decision Support systems to function.

D.1.3 Outside Ontologies: BLU makes use of various outside ontologies to provide much of the knowledge base upon which it operates.

More importantly, the Fusion Algorithm will rely upon Outside Ontologies to increase its knowledge of the corpus of medical data and thereby further augment its accuracy. Furthermore, it is very important that Fusion is able to take advantage of outside augmentations to its knowledge source in much the same way it is able to process the input from X2O and BLU. i. Results: This is a significant demonstration of the viability of the extensibility promised by the Semantic Web's design and the conceived goals of the Fusion Project. As we've illustrated earlier in sections B7 and C4, semantic federation of knowledge is itself an unsolved but vitally important area of research. It is the only way we can hope to manage the onslaught of exponentially increasing data generated all across society, but particularly within the medical domain as it relates to patient health and safety. ii. Problems: Access to useful Outside Ontologies is perhaps the biggest problem. However, the NLM publishes taxonomies (UMLS, SNOMED CT, etc.), which we've previously been able to successfully translate into a true ontology. Additionally, there are many useful ontologies about location, demographics, measurements, drugs, etc. that we would like to leverage. We've had some past success incorporating a measurement ontology developed by NASA into BLU. We will have to analyze how far these transformed taxonomies and outside ontologies are from the desired input of Fusion. iii. Alternate Approaches: We can abandon the use of Outside Ontologies without jeopardizing the core aim of integrating heterogeneous data from multiple sources. iv. Transition: As explained in D.1.1.

D.1.4 Fusion Algorithm: At the core of the proposed architecture resides the Fusion Algorithm. Its purpose is to take in the pre-processed data from the several algorithms previously designed (D.1.1 and D.1.2) to translate specific formats of data into an ontological representation. Fusion then takes these ontological representations and integrates them into a unified knowledge base of (all) patients' information. An EHR, CPOE, Decision Support, or other system can then build on top of this unified view of the patient's information. i. Results: Providing a single, unified, and well-defined knowledge base will enable existing solutions, and research into the next generation of solutions, to increase their accuracy. Additionally, production times, maintenance burdens, and other issues typically encountered when coping with a diverse ecology of data islands should disappear. ii. Problems: The core problem that Fusion needs to solve is referred to as semantic federation or ontology integration. It is our belief that the diverse ontologies provided by trusted outside sources and those created by sub-systems (X2O and BLU in this case) can be integrated by analyzing the graph-based representation of the data. Graph analysis is important in a whole host of areas; one of the most well-known examples is internet search (i.e., Google's PageRank). It is our intent to use the PageRank and K-Nearest Neighbors algorithms to analyze the graph structure that comprises the ontologies that Fusion receives, and to determine a measure of similarity between the different ontologies based on both the information, or semantics, within the ontologies and the structure of the graph representing the connections between the atomic units of information inside the ontology. It is unlikely that an entirely automated algorithm will ever exist, nor that it would be advantageous.
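A minimal sketch of the graph-analysis idea just described is given below, using networkx PageRank over two small ontology-like graphs and comparing the resulting node-importance profiles; the node names, the cosine comparison, and the toy graphs are illustrative assumptions rather than the final Fusion similarity measure:

```python
import math
import networkx as nx

def ontology_graph(edges):
    """Build a directed graph from (subject, object) pairs, i.e. triple skeletons."""
    g = nx.DiGraph()
    g.add_edges_from(edges)
    return g

# Two toy ontologies that share some concepts (hypothetical node names).
g_a = ontology_graph([("Patient", "Demographics"), ("Demographics", "Sex"),
                      ("Patient", "Medication")])
g_b = ontology_graph([("Patient", "Sex"), ("Patient", "Medication"),
                      ("Medication", "Dose")])

# PageRank gives each concept a structural importance score within its graph.
pr_a, pr_b = nx.pagerank(g_a), nx.pagerank(g_b)

# Compare the two importance profiles over their shared concepts (cosine).
shared = set(pr_a) & set(pr_b)
dot = sum(pr_a[n] * pr_b[n] for n in shared)
norm = math.sqrt(sum(pr_a[n] ** 2 for n in shared)) * \
       math.sqrt(sum(pr_b[n] ** 2 for n in shared))
similarity = dot / norm if norm else 0.0
print(f"shared concepts: {sorted(shared)}, structural similarity: {similarity:.3f}")
```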
Because of the catastrophic risks involved (patient death or serious injury), medical experts will always be required to provide their consent to the results. However, medical experts are rare and their time is extremely valuable. To this end, the aim of Fusion is to vastly augment the productivity of the doctors and medical staff involved. This is also another reason for the necessity of a gold standard. We will be augmenting what a human could do if they had sufficient time, so we want to ensure that we are at least as good as what a doctor or nurse might arrive at. Without a human medical expert performing the same task as Fusion, it will be impossible to verify the success or accuracy of Fusion. iii. Alternate Approaches: Fortunately, the space of graph algorithms is rife with research. While the field of ontology integration is still very young, there are methods in the literature on graph algorithms (particularly in the area of webpage analysis and searching) that could be applied to this task. One interesting method has recently been applied to the area of web search that uses level statistics from a field called random matrix theory, originally designed to analyze quantum systems [6]. The new method gauges the importance of words in

a document based on where they appear (pattern of locality), rather than simply on how often they occur. This is precisely what is needed when integrating ontologies as well. iv. Transition: This aim partially relies upon the completion of the D.2 aim; however, this is why we have medical experts (physicians) assisting with early verification testing of the results of the Fusion System. v. Expertise: Several medical experts will be collaborating to help verify the results, in addition to the gold standard generated as a result of D.2. We have demonstrated, we hope, our expertise in the areas of data integration, ontologies, and data modeling. Summary: The successful completion of this aim will result in a novel mechanism for addressing the problems of data explosion, in both quantity and representation. This is a real rate limiter that is stifling technologies like EHR and CPOE systems from realizing their full potential to eliminate the vast majority of preventable medical errors, which would currently be the number 3 killer of Americans if they appeared on the CDC's annual list of leading causes of death.

D.2 To generate a gold standard: an evaluation criterion for judging not only our own performance, but that of the many other current and future attempted solutions.

D.2.1 Choose Experts: The initial choice of 3 experts to integrate 100 records was made partially in accordance with choices made in related areas of Health Informatics, for instance studies on the inter-rater agreement of MeSH tagging by PubMed editors. i. Problems: Recruiting medical experts is a notoriously difficult task; further, getting experts interested in spending the amount of time necessary to integrate 100 patient records will be a challenge. Fortunately, we are located in Houston's Texas Medical Center, where there is a large pool of medical experts. ii. Alternate Approaches: Unfortunately, there are no good alternatives to using medical experts as our gold standard. It is a commonly held belief that medical experts such as doctors, nurses, and health informaticians are the benchmark a computerized implementation should be held to; clinical decision support is a particularly salient example where this occurs. There are, however, alternatives to creating a gold standard to evaluate our algorithm. Some of these statistics we would use anyway during the analysis for Aim 1. For instance, one of our central hypotheses is that Fusion will augment the information provided by the component pieces (X2O and BLU). Therefore, it is posited that more knowledge would be discovered on a second pass over the initial patient data by incorporating the results from Fusion. Analyzing how much new knowledge (if any) is discovered could be one metric. iii. Transition: Much of the transition from choosing and recruiting the experts to actually having them perform D.2.2 will consist of educating them on the task and what is required of them. iv. Expertise: Medical experts will be used to help select an appropriate cohort. Proposed members of the team have experience recruiting experts from past experiments.

D.2.2 Integrate and Extract: Fundamentally, the experts will be performing the same task as Fusion but in their own way. They are free to integrate and extract concepts as they see fit. Their time spent at the computer integrating the EHR data will be recorded by video camera and also tracked within the program they will be using, TopBraid Composer. We have already written a plugin to the application that allows us to do this.
i. Results: Indeed, much like a psychologist gains insight into human behavior through observation, it is possible that we can glean important and useful algorithmic information from analyzing how the experts integrate. This should be a significant contribution beyond simply providing the gold standard. That is, little research has been done on how a human medical expert might integrate disparate medical data. We hope to gain both a gold standard and insight into the human algorithms being employed by medical experts to integrate the data. ii. Problems: Actually performing the task will rely upon our training and the experts' experience. If the experts lack sufficient medical experience to comprehend the underlying information in the data and transform the data into comprehensible knowledge, they could skew the results in favor of our algorithm by bringing human performance artificially down. Likewise, if we fail to properly educate the experts on their task, a similar result would be

observed. Partially, we hope that D.2.3 will help us discover problems with individuals, but it cannot uncover cross-cutting bias or problems. iii. Alternate Approaches: There is really no alternative to using human medical experts, save for comparing the performance of Fusion against existing systems. However, most data integration systems are proprietary, and their makers are uninterested in being declared inferior to another (rival) system; therefore, getting access to another system would be difficult. BioSTORM is perhaps one of the few examples of a fairly successful and open-sourced system that we could use. This is another reason why having an independent gold standard backed by human experts can help put pressure on existing solutions. Purchasers and users of such solutions could assess them prior to committing, which currently is not possible in any rigorous way. iv. Transition: After this phase, a number of transition points exist. If D.2.3 yields the expected amount of inter-rater agreement and our own experts concur, there will be no need to repeat this phase. However, if there is some sort of systematic bias that has skewed the results, it will be necessary to repeat either D.2.2 or both D.2.1 and D.2.2. This will depend upon our analysis and our assessment of where the bias originates. v. Expertise: As explained in D.2.1.

D.2.3 Assess Inter-rater Reliability: At this point, our statistician will analyze the agreement among the experts' attempts to integrate the patients' data. Additionally, our medical domain experts will assess how reasonable the results are. i. Results: A measure of confidence, both objective and subjective, in the proposed gold standard. ii. Problems: As explained in D.2.2. iii. Alternate Approaches: Two approaches are being used, each with its own strengths and weaknesses. iv. Transition: As explained in D.2.2. v. Expertise: Our statistician has over 20 years of experience, while our medical experts each have over 10 years.

D.2.4 Compare Experts to the Fusion Algorithm: This final portion will entail utilizing two measures of similarity. The first is our data modelers' assessment of both sets of results; they have extensive experience modeling data and analyzing ontologies. Second, the similarity methods developed for D.1.4 will be re-employed here. Not only will this provide a means of validation (because our similarity methods can be compared against human data modelers), but it will assist in systematically and objectively analyzing the results. i. Problems: Discrepancies between the methods will help us to analyze where the failures are. ii. Alternate Approaches: We will be using two different methods; it should not be necessary to use alternative approaches.

Summary: The successful completion of this aim will not only provide a gold standard for any medical data integration algorithm to use, but will validate Fusion, the novel algorithm for integrating heterogeneous data from disparate sources. Even in the unlikely event that we do not succeed in all of our aims, a great deal will be learned from this experiment. We will be collecting and analyzing how human experts integrate data, which could yield an interesting direction for future integration research. Further, simply having a gold standard against which to compare existing and future algorithms is a worthy aim in itself. One such benefit is holding accountable the companies who sell their medical integration software to the claims they make.
Additionally, enabling true competition by enabling consumers to objectively compare competing and new solutions should exert its own evolutionary pressures. Even in failure, important lessons can be gleaned about how to approach future attempts at heterogeneous data integration. Most existing research does not engage the very diverse and often unharmonious ecology of realistic data. It would be surprising if all existing successful research in this area could withstand the amount of rigorous testing we will have employed. If Fusion fails, how it fails might be as interesting as how well it succeeds.
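For the inter-rater reliability assessment described in D.2.3, a minimal sketch of the kind of agreement statistic our statistician might compute is shown below, using Cohen's kappa from scikit-learn on made-up per-record labels; the actual analysis and annotation scheme will be determined by the statistician:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical per-record judgments from two experts: did each expert link a
# given source record to the same integrated patient concept (1) or not (0)?
expert_a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
expert_b = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

# Cohen's kappa corrects raw agreement for agreement expected by chance.
kappa = cohen_kappa_score(expert_a, expert_b)
print(f"Cohen's kappa between expert A and expert B: {kappa:.2f}")
```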

D.3 Research Work Plan and Timeline

The project timeline spans Year 1 (Q1 through Q4) and Year 2 (Q1 through Q4) and covers the following activities: IRB Submission & Review; D.0 Choose Representative Records; D.1.1 XML-to-Ontology; D.1.2 BLUEText; D.1.3 Outside Ontologies; D.1.4 Fusion Algorithm; D.1.5 Recurrent Data to D.1.1 & D.1.2; D.2.1 Choose Experts; D.2.2 Integrate and Extract; D.2.3 Assess Inter-rater Reliability; D.2.4 Compare Experts to Algorithm.

Human Subjects Protection: IRB approval will be required for the secondary data use and will be obtained from the University of Texas at Houston's Medical Center IRB.

Y. Collaborators
Principal Investigator: Matthew Vagnoni, MS, brings expertise in data integration, ontologies, and underlying technologies such as programming and system architecture. Investigators: Parsa Mirhaji, MD, brings expertise in NLP (BLUEText), ontologies, data modeling, and the medical domain. James Turley, RN, PhD, brings expertise in statistics, system evaluation, and the medical domain. Jorge Herskovic, MD, PhD, brings expertise in information retrieval, clinical decision support, and the medical domain. Min Zu, MD, MS, brings experience in both the medical domain and as a data modeler.
