
Generating Entity Relationship Diagram from Requirement Specification based on NLP


P. G. T. H. Kashmira
Department of Computational Mathematics
University of Moratuwa
Sri Lanka
thkashmira@gmail.com

Sagara Sumathipala
Department of Computational Mathematics
University of Moratuwa
Sri Lanka
sagaras@uom.lk

Abstract—An entity relationship data model is a high-level conceptual model that describes information as entities, attributes, relationships, and constraints. Entity relationship diagrams are used to design the database of a software system. The design involves a sequence of tasks including extracting the requirements, identifying the entities, their attributes, the relationships between the entities, and the constraints, and finally drawing the diagram. As such, entity relationship diagram design has become a tedious task for novice designers. This research addresses the above issue and proposes a Natural Language Processing based tool which accepts a requirement specification written in English and generates an entity relationship diagram.

Keywords—Entity relationship data model, Natural Language Processing, Requirement Specification.

978-1-5386-4417-1/18/$31.00 ©2018 IEEE

I. INTRODUCTION

Natural language descriptions often need to be analyzed, transformed, and restructured into a form of design notation during the development of software applications [1]. Recent research has focused on automating the extraction of information from natural language text using Natural Language Processing (NLP) [2].

Unified Modeling Language (UML) provides several types of diagrams that are used to simplify and conceptualize an application during development. The Entity Relationship (ER) diagram, a closely related design notation, has played a central role in designing a system's database. However, obtaining entity-relationship models from a system's specifications requires expert knowledge and may be a lengthy, time-consuming process. Errors that occur during these phases can be difficult to fix later on. There are some graphical CASE tools, such as Rational Rose, Microsoft Visio, and Smart Draw, which help in documenting UML diagrams. However, they do not contribute to the initial, difficult stage of the analysis process, that of identifying the entities, attributes, relationships, keys, cardinalities, and extended features such as generalization and specialization. Therefore it is desirable to have a tool or system which students and novice analysts could use to automatically generate a quality ER diagram based on natural language requirement specifications.

The primary goal of this research is to propose an approach that identifies the entities, their attributes, and the relationships between the entities in order to automate the ER diagram design process based on requirement specifications written in English, eliminating user involvement in the tasks which need expert and domain knowledge. Even nontechnical people in small enterprises would then be able to use this tool to generate ER diagrams automatically.

The remainder of this paper is organized as follows. Section II presents related work, Section III describes the proposed model, Section IV presents the implementation, Section V presents the evaluation, and Section VI discusses the research.

II. RELATED WORKS

User requirement analysis is an Information Extraction application of NLP [2]. It needs to be able to distinguish between relevant and non-relevant information quickly and to incorporate newly obtained data into the application. For example, consider the sentence: “There are lecturers who teach courses.” This sentence implicitly conveys a large amount of information. Because of language as well as world knowledge, the reader knows that there are parties in an action, that a relationship exists, that the lecturer and course are not specific, that the course is taught to students, and so on. In the information extraction process, this complex information needs to be captured when reading the sentence. The process may comprise a linguistic analysis of the natural language (NL) for syntax and semantics, a set of transformation rules for extraction, rules and heuristics for handling world knowledge, and classification rules for modeling the behaviors [3].

In the sense of data modeling, Information Extraction is the identification of specific elements (e.g., entities, attributes, relationships) within the user’s requirements entered in textual form. Therefore, the role of NLP would be in identifying and extracting nouns and other parts of speech (POS) needed in the determination of entities, relationships, attributes, and cardinalities [3]. There are several steps in NLP, such as morphological analysis, tokenization, POS tagging, chunking, and parsing, which can be used to process the user requirements entered in textual natural language.

Few researchers have worked on UML diagram generation using natural language processing to bridge the gap between the requirement analysis and design phases of software applications. Most of them have used linguistic-based, rule-based, and pattern-based approaches to get the expected outcomes and have mostly considered the class diagram or the use case diagram. The following paragraphs provide a short review of the previous work related to this area.

Authors in [4]-[9] have used NLP and linguistic theories for translating English sentence structures into a UML diagram's components (specifying nouns as objects/entities, verbs as methods or relationships, and adjectives as attributes). However, these attempts are neither complete nor fully accurate. Although entities can be identified by nouns in a requirement specification, nouns refer not only to entities but also to attributes and other concepts. Further, entities can also be identified from verb phrases and hidden requirements. Therefore this technique can only support manual or semi-automatic concept extraction. Also, the proposed rules are built based on the syntax of NLs. Some authors [10],[11],[17] suggested applying vocabulary constraints and controlled sentence structures to specifications to obtain good results from linguistic processing. But this has some limitations and causes burdens when writing the requirements.

Heuristics based on linguistic rules [12] are reported to be utilized in many systems, such as [13],[14],[15]. Most of the approaches have used NLP-based, rule-based systems to analyze the syntactic structure of the input text and infer data model entities, attributes, and relationships. Among them, the use of heuristics to aid the construction of ER models from natural language has been scarce. Those approaches used a parser and then fed its output to syntactic heuristics to identify suitable data modeling elements. Only the research in [15] explores the value of semantic analysis for understanding the results of parsing, lexical information, context, and common sense reasoning in order to have more expressive power.

Authors in [16] used pattern-based NLP techniques to extract the relevant concepts from requirement specifications. The ABCD tool used regular expressions to represent patterns and was able to extract relationships, multiplicity, and generalization for class diagrams, but failed to deal with redundant and incomplete information.

All the systems discussed here need user involvement during processing to improve their accuracy, for example to resolve ambiguities between entity and attribute identifications, duplicate and synonymous entities, and so on. Further, these tools have been developed to support designers, analysts, and students, not non-technical users. These tools are not freely available, and most of them are at the laboratory level. Therefore it is desirable to develop a tool that could be used even by non-technical people without any cost to generate an ERD based on a requirement specification written in English.

III. PROPOSED APPROACH

Fig. 1 depicts the model of the proposed approach, which generates the ER diagram by addressing the issues of incomplete information and redundancies in requirement specifications with minimum user intervention. The proposed model contains three major modules, namely the Pre-processing Module, the Machine Learning Module, and the ERD Modeling Module.

The Pre-processing Module analyzes individual words and separates non-word tokens from words. According to the entity naming standard, an entity's name must be in singular form. The module applies this standard by analyzing individual nouns to make sure they are appropriate as entity names, removing plural suffixes and converting plural entity names into singular. It then performs sentence tokenization and punctuation removal.

The Machine Learning Module identifies entities, attributes, and relationships from the given text. Using supervised learning, it learns the above-mentioned ER concepts as follows. It identifies entities and sub-entities in order to design specialization/generalization concepts. It identifies the attributes of entities. It identifies relationships between entities or sub-entities and attributes, between entities, between entities and sub-entities, and between attributes. It identifies relationship types between entities (association, generalization). It identifies cardinalities in a relationship in the ER diagram (one-to-one, one-to-many, many-to-many). Most requirement specifications consist of participating entities and their interactions or work processes. An ER diagram designer decides on the related attributes of the entities by using world knowledge. To address that issue, the proposed model uses ontology and web mining to filter the relevant attributes for the extracted entities.

The ERD Modeling Module will be a Visual Studio modeling diagram project that draws the ER diagram using the extracted entities, attributes, relationships, and cardinalities.

The input of the proposed model is a requirement specification written in English, and the output will be a drawing of an ER diagram. The input will be taken from a web application, and users of the tool interact only with the web application. The user will be able to upload a text file that contains a requirement specification or simply copy and paste the text into the text area of the application. The process begins by reading a text containing a requirement specification in English. Then the results of the pre-processing module are fed into the machine learning module to identify suitable data modeling concepts. Finally, the ERD modeling module produces a drawing of an ER diagram.

[Figure: block diagram with the components Requirement Specification, Pre-processing Module (NLP Techniques), Machine Learning Module, Entity / Relationships / Attributes / Cardinality, Web Mining, Ontology, ERD Modeling Module, and ER Diagram]

Fig. 1. Proposed Model
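As an illustration of the Pre-processing Module described in Section III, the following is a minimal sketch of sentence tokenization, lowercasing, punctuation removal, and singularization. It is not the authors' implementation: the paper reports using NLTK, whereas this sketch uses only the Python standard library so that it runs without extra downloads. The function names (split_sentences, singularize, preprocess) are hypothetical, the suffix rules are a rough heuristic, and the sketch applies singularization to every token longer than three characters, whereas the paper applies it to nouns only (which would require a POS tagger).

```python
import re
import string

# Hypothetical helpers for illustration; the paper does not publish its code.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Naive sentence tokenizer: split on whitespace that follows '.', '!' or '?'."""
    return [s for s in _SENTENCE_END.split(text.strip()) if s]


def singularize(word):
    """Strip common English plural suffixes (a crude heuristic, not a lemmatizer)."""
    if len(word) > 4 and word.endswith("ies"):
        return word[:-3] + "y"      # companies -> company
    if len(word) > 3 and word.endswith(("ches", "shes", "sses", "xes")):
        return word[:-2]            # classes -> class, boxes -> box
    if len(word) > 3 and word.endswith("s") and not word.endswith(("ss", "us", "is")):
        return word[:-1]            # lecturers -> lecturer
    return word


def preprocess(text):
    """Return one list of lowercase, punctuation-free, singularized tokens per sentence."""
    strip_punct = str.maketrans("", "", string.punctuation)
    result = []
    for sentence in split_sentences(text):
        tokens = sentence.lower().translate(strip_punct).split()
        result.append([singularize(token) for token in tokens])
    return result


if __name__ == "__main__":
    spec = "There are lecturers who teach courses. Each course has a title."
    for tokens in preprocess(spec):
        print(tokens)
```

For the example sentence from Section II, this yields the token lists ['there', 'are', 'lecturer', 'who', 'teach', 'course'] and ['each', 'course', 'has', 'a', 'title']. The length threshold keeps short words such as "has" intact, but verbs such as "does" would still be truncated, which is why restricting the rule to tokens tagged as nouns is the safer design.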
IV. IMPLEMENTATION

Existing Named Entity Recognition (NER) taggers label words in a text which are the names of things, such as person and company names, but they do not correctly tag words as entities of an ER diagram. Currently, we have implemented the machine learning module to identify some features of the ER diagram, such as entities and attributes. The phases of the implementation are as follows.

First, the pre-processing module was implemented using NLTK to preprocess the text in the scenarios: removing punctuation, converting to lowercase, and tokenizing into sentences. Then the machine learning module was implemented using a supervised learning mechanism. Fifty preprocessed requirement specifications were taken as the training corpus. The words of every specification in the corpus were annotated with the entity, sub-entity, attribute, and irrelevant categories. For annotation, the “tagtog” API was used. It automatically annotates similar words appearing in the text once a single occurrence has been annotated. A sample annotated requirement specification is shown in Fig. 2 (green denotes an entity, orange denotes a sub-entity, and pink denotes an attribute). “Weka” was used as the machine learning tool to develop the machine learning module that identifies the above-mentioned ER diagram features. We considered four classifiers, namely Random Forest, Naive Bayes, Decision Table, and SMO, when training the “Weka” model.

Fig. 2. A sample of an Annotated Scenario

V. EVALUATION

Recall and precision were calculated to evaluate the accuracy of the algorithms. The formulas for precision (1) and recall (2) are as follows, where:

a - Number of relevant items of the ER diagram identified.
b - Number of relevant items of the ER diagram not identified.
c - Number of irrelevant items of the ER diagram identified.

Precision = a / (a + c)    (1)
Recall = a / (a + b)    (2)

Table 1 shows the accuracy levels of the different algorithms for the above-mentioned ER feature identification model. Ten-fold cross-validation was used for all the algorithms when training the “Weka” model.

TABLE 1
ACCURACY LEVELS OF THE ALGORITHMS

Algorithm        Class        Precision   Recall
Random Forest    Entity       0.954       0.906
                 Attribute    0.989       0.743
                 Sub Entity   0.84        0.691
                 Irrelevant   0.822       1.000
Naive Bayes      Entity       0.94        0.923
                 Attribute    0.985       0.743
                 Sub Entity   1.000       0.164
                 Irrelevant   0.822       1.000
Decision Table   Entity       0.958       0.822
                 Attribute    0.98        0.618
                 Sub Entity   0.272       0.704
                 Irrelevant   0.846       0.935
SMO              Entity       0.956       0.904
                 Attribute    0.975       0.748
                 Sub Entity   0.849       0.664
                 Irrelevant   0.854       0.999

VI. DISCUSSION AND FUTURE WORKS

When comparing the accuracy of the above algorithms, all of them achieved a considerable level of accuracy for the identification of the entity, attribute, and irrelevant classes. In the literature, it was found that earlier research did not use intelligence in this type of work. Therefore this attempt is an approach that applies some intelligence to identify ER concepts using NLP. As this is the initial stage of the research, there is more future work, such as increasing the accuracy of identifying entities, sub-entities, and attributes using different algorithms, extending the model to identify relationships, relationship types, and cardinalities, and implementing the ER diagram drawing module to obtain the drawing of an ER diagram.

REFERENCES

[1] K. Li, R. G. Dewar, and R. J. Pooley, “Object-Oriented Analysis
Using Natural Language Processing,” p. 7.
[2] F. Hogenboom, F. Frasincar, and U. Kaymak, “An Overview of
Approaches to Extract Information from Natural Language
Corpora,” p. 2.
[3] E. Buchholz, H. Cyriaks, A. Dusterhoft, H. Mehlan, and B.
Thalheim, “Applying Natural Language Dialogue Tool for
Designing Databases,” p. 16.
[4] S. P. Overmyer, L. Benoit, and R. Owen, “Conceptual modeling
through linguistic analysis using LIDA,” in Proceedings of the 23rd
International Conference on Software Engineering. ICSE 2001,
Toronto, Ont., Canada, 2001, pp. 401–410.
[5] M. Harmain and R. Gaizauskas, “CM-Builder: A natural language-
based CASE tool for object-oriented analysis,” Automated Software
Engineering, vol. 10, no. 2, pp. 157–181.
[6] N. Omar, P. Hanna, and P. M. Kevitt, “Heuristics-based entity-
relationship modeling through natural language processing,” p. 12.
[7] I. S. Bajwa and I. Hyder, “UCD-generator – A LESSA application
for use case design”, IEEE- International Conference on
Information and Emerging Technologies- IEE-ICIET, Karachi,
Pakistan.
[8] Priyanka More and Rashmi Phalnikar, “Generating UML diagrams
from natural language specifications.”, International Journal of
Applied Information Systems (IJAIS), vol. 1, no. 8, April 2012
[9] S. D. Joshi and D. Deshpande, “Textual Requirement Analysis for
UML Diagram Extraction by using NLP,” International Journal of
Computer Applications, vol. 50, no. 8, pp. 42–46, Jul. 2012.
[10] S. Du, “On the use of natural language processing for automated
conceptual data modeling,” p. 201.
[11] R. Giganto, “Generating class models through controlled
requirements,” in NZCSRSC-08, New Zealand Computer
Science Research Student Conference, 2008.
[12] Lilac, “Natural Language Processing for Conceptual Modeling,”
International Journal of Digital Content Technology and its
Applications, vol. 3, no. 3, 2009.
[13] M. Z Alksasbeh, B. A. Y Alqaralleh, T. A Alramadin, K. A
Alemerien, “An Automated Use Case Diagrams Generator from
Natural Language Requirements.” [Online]. Available:
https://docplayer.net/63109625-An-automated-use-case-diagrams-
generator-from-natural-language-requirements.html.
[14] S. A. Jogdand, “NLP for Automated Conceptual Data Modeling,”
IOSR Journal of Computer Engineering (IOSR-JCE), vol. 18, no.
1, pp. 01-08, 2016.
[15] M. M. H. Eman S. Btoush, “Generating ER diagrams from
requirement specifications based on natural language processing.”,
International Journal of Database Theory and Application, vol. 8,
no. 2, 2015, pp. 61-70.
[16] W. Ben Abdessalem Karaa, Z. Ben Azzouz, A. Singh, N. Dey, A.
S. Ashour, and H. Ben Ghazala, “Automatic builder of class
diagram (ABCD): an application of UML generation from
functional requirements: Automatic Builder of Class Diagram
(ABCD),” Software: Practice and Experience, vol. 46, no. 11, pp.
1443–1458, Nov. 2016.
[17] B. Kouninef and B. Al-Johar, “Extracting Entities and
Relationships from Arabic Text for Information System,” vol. 2,
no. 11, p. 5, 2011.
[18] Hala Elsayed and Tarek Elghazaly, “A rule based entities recognition
system for modern standard Arabic,” IJCSI International Journal of
Computer Science Issues, vol. 12, issue 1, no. 2, January 2015.
