Automatic Nutrition Extraction From Text Recipes
Debasish Bose
Affiliation not available
May 16, 2014
Abstract
The science of nutrition deals with all the effects on people of any component found in food. This starts with the physiological and biochemical processes involved in nourishment: how substances in food provide energy or are converted into body tissues, and the diseases that result from insufficiency or excess of essential nutrients (malnutrition). The role of food components in the development of chronic degenerative diseases such as coronary heart disease, cancers and dental caries is a major target of research activity nowadays. There is a growing interaction between nutritional science and molecular biology (especially nutrigenomics) which may help to explain the action of food components at the cellular level and the diversity of human biochemical responses. In our daily lives, however, we cook recipes made of ingredients rather than dealing with raw food components. Beyond a dietitian's advice and guidelines, it is difficult to continuously measure our daily nutritional intake without manually entering the weight and amount of each constituent ingredient. Apart from this manual process, effective nutritional intake also depends on the cooking process and on the retention factors of the individual ingredients. To alleviate these difficulties we propose an algorithm and an accompanying web-based tool to automatically extract nutritional information from any text-based recipe.

1 Introduction

Recipes show a tremendous amount of diversity in cooking styles and ingredients, some of which are highly community-, culture- or even country-specific. This diversity makes it challenging to design a system which can infer nutritional information without much manual intervention and with substantial accuracy. Although it is possible to manually enter each ingredient from an enormous database, this is often time-consuming and impractical in our day-to-day lives. To automatically deduce nutritional information from textual recipes, we have segmented the core procedure into the following steps (Figure 1):

1. Information Extraction (IE) from text recipes, using a rule-based or NLP (Natural Language Processing) parser
2. Conversion to structured data: amount, unit, ingredient name and any modifiers (e.g. "lightly beaten")
3. Mapping of each ingredient to an existing food ontology (the USDA Food Database is used for demonstration purposes; this can be extended to other food databases such as NUTTAB)
4. Deduction of weights from various lexical clues and ingredient densities
5. Deduction of the final nutritional information

Figure 1: Core Procedure
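As a rough sketch of how these steps fit together, the pipeline can be pictured as the composition below. Every function name is a hypothetical placeholder for the corresponding component described in Section 2; this is an architectural outline, not the tool's actual code.

# Hypothetical end-to-end pipeline for one recipe's ingredient list.
def nutrition_profile(recipe_text)
  recipe_text.each_line.map { |line|
    parsed = parse_ingredient(line)       # steps 1-2: extraction and structuring
    food   = map_to_ontology(parsed)      # step 3: USDA food node
    grams  = deduce_weight(parsed, food)  # step 4: lexical clues and densities
    nutrients_for(food, grams)            # step 5: per-ingredient nutrients
  }.reduce { |total, n| accumulate(total, n) }  # sum into a recipe-level profile
end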

2 Procedure

2.1 Information Extraction

Instead of adopting a full-ML approach, we have tried to capture linguistic rules by representing them as regular expressions. The general idea behind this technique is to specify regular expressions that capture certain types of information. For example, the expression (watched|seen) NP, where NP denotes a noun phrase, might capture the names of movies (represented by the noun phrase) in a set of documents. By specifying a set of rules like this, it is possible to extract a significant amount of information. Such sets of regular expressions are often implemented using finite-state transducers, which consist of a series of finite-state automata. We have used Citrus to declaratively define a parsing expression grammar (PEG) in the Ruby language. For example:
grammar Ingredient
  rule ingredient_line
    qualifier quantity (space | '')
    unit additional_unit unit_suffix
    pre_modifiers base_ingredient post_modifiers
  end
end
The top-level rule is ingredient_line, which is in turn a composition of a series of other rules such as quantity, unit and base_ingredient, capturing the quantity, unit and ingredient name from each line of a typical recipe ingredient list. For example, unit is composed of the following:
rule unit
  ('(' number space raw_unit_with_plurality ')') |
  raw_unit_with_plurality
end
We have combined the grammar with NLP tools such as Part-Of-Speech (POS) taggers, noun-phrase (NP) chunkers and stemmers/lemmatizers to make the extraction procedure more robust. Although we could have used a Machine Learning (ML) based parser, given the limited vocabulary of the recipe domain it seemed like overkill.
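As a minimal sketch of how such a grammar is exercised from Ruby, the snippet below defines a deliberately reduced, hypothetical grammar (its rule bodies are illustrative, not the production grammar above) and reads the captures back from a parsed line:

require 'citrus'

# Reduced, illustrative grammar: it only recognises lines of the form
# "<quantity> <unit> <ingredient>", e.g. "2 cups flour".
Citrus.eval(<<-'GRAMMAR')
  grammar SimpleIngredient
    rule ingredient_line
      quantity space unit space base_ingredient
    end
    rule quantity
      [0-9]+ ('/' [0-9]+)?
    end
    rule unit
      'cups' | 'cup' | 'tsp' | 'tbsp'
    end
    rule base_ingredient
      [a-zA-Z ]+
    end
    rule space
      [ ]+
    end
  end
GRAMMAR

match = SimpleIngredient.parse('2 cups flour')
puts match.captures[:quantity].first         # => 2
puts match.captures[:unit].first             # => cups
puts match.captures[:base_ingredient].first  # => flour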

2.2 Ontology Mapping

An ontology represents knowledge as a hierarchy of concepts within a domain, using a shared vocabulary to denote the types, properties and interrelationships of those concepts. In an approximate sense we can treat an existing food database as a food ontology (e.g. foods have classifications, relationships, etc.). Using the United States Department of Agriculture (USDA) National Nutrient Database for Standard Reference as a standard food ontology, the primary aim of this part of the algorithm is to map an ingredient input such as "1 teaspoon vanilla" to a specific node, "Vanilla extract" (http://ndb.nal.usda.gov/ndb/search/list?qlookup=02050), in the database.
Instead of a traditional approach, we have used the text-analysis capabilities of the open-source search engine ElasticSearch for the ontology mapping. The entire USDA food database is indexed in ElasticSearch, and each food item is processed using the following analyzer:
'food_analyzer_snowed' => {
  'char_filter' => [ 'hyphenremover' ],
  'tokenizer'   => 'standard',
  'filter'      => [ 'standard', 'lowercase', 'stop', 'food_stop',
                     'food_synonym', 'keyword_repeat', 'snowball', 'unique_stem' ],
  'type'        => 'custom'
}

Figure 2: ElasticSearch text analysis. Token filters are referred to simply as filters subsequently.
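To inspect what this analyzer produces for a given food description, ElasticSearch's _analyze API can be queried directly. A minimal sketch follows; the index name usda_foods and the host are assumptions:

require 'net/http'
require 'json'
require 'uri'

# Run the custom analyzer over a sample description and print the resulting
# token stream (after all char filters and token filters have been applied).
uri = URI('http://localhost:9200/usda_foods/_analyze')
uri.query = URI.encode_www_form(analyzer: 'food_analyzer_snowed',
                                text:     'Sun-dried sultanas')

tokens = JSON.parse(Net::HTTP.get(uri))['tokens'].map { |t| t['token'] }
puts tokens.inspect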
Of the various filters configured, food_synonym is the most important for the mapping process. This filter uses a partially auto-generated file (many food items in the USDA database have common names) filled with frequently occurring words in the recipe domain and their equivalent or synonymous words in the indexed USDA data. For example (using the Solr synonym format):
aubergine => eggplant
silverbeet => chard
gherkin => pickle
sultanas => raisins golden
Using the above file (food_synonym.txt), the food_synonym filter is built as:

'food_synonym' => {
  'type'          => 'synonym',
  'synonyms_path' => Rails.root.join('config', 'food_synonym.txt').to_s
}
We use a Fuzzy Like This Query (FLTQ) over the raw ingredient text obtained in the extraction process, with a specific configuration of the max_query_terms and fuzziness parameters. This query fuzzifies all terms provided as strings and then picks the best and most differentiating terms. In effect it mixes the behaviour of FuzzyQuery and MoreLikeThis, but with special consideration of fuzzy scoring factors. Instead of using a single analyzed index field, we also store an additional index field (description.simple) which does not process the tokenized stream with a snowball filter (stemmer). This increases the precision of our query and of the overall mapping process.
indexes :description, :type => 'multi_field', :fields => {
  'simple' => { :type     => 'string',
                :analyzer => 'food_analyzer_simple' },
  'snowed' => { :type     => 'string',
                :analyzer => 'food_analyzer_snowed' }
}
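The query body itself can be sketched as a plain hash. The like_text and the parameter values below are illustrative, not the exact configuration used:

# Hypothetical FLT query body for one extracted ingredient ("vanilla"),
# searching both the stemmed and the unstemmed description fields.
flt_query = {
  query: {
    fuzzy_like_this: {
      fields:          ['description.snowed', 'description.simple'],
      like_text:       'vanilla',
      max_query_terms: 12,   # keep only the best differentiating terms
      fuzziness:       0.6   # how aggressively each term is fuzzified
    }
  }
}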
To optimize the overall mapping process we use ElasticSearch's Multi-Search API to map all ingredients of a given recipe to their respective food item nodes in the ontology.
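A minimal sketch of batching the per-ingredient queries into a single _msearch request is given below; the index name, host and the query_for helper are assumptions, and the query body is the fuzzy_like_this hash shown earlier:

require 'net/http'
require 'json'
require 'uri'

# Hypothetical helper wrapping an extracted ingredient string in an FLT query.
def query_for(ingredient)
  { query: { fuzzy_like_this: { fields:    ['description.snowed', 'description.simple'],
                                like_text: ingredient } } }
end

ingredients = ['vanilla', 'chickpeas', 'salt']

# _msearch expects newline-delimited JSON: a header line (here just the index
# to search) followed by a query line, for each individual search.
body = ingredients.flat_map do |ing|
  [{ index: 'usda_foods' }.to_json, query_for(ing).to_json]
end.join("\n") + "\n"

response = Net::HTTP.post(URI('http://localhost:9200/_msearch'), body,
                          'Content-Type' => 'application/x-ndjson')
puts JSON.parse(response.body)['responses'].size  # one response per ingredient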

2.3 Lexical Clues

After the parsing and mapping phases, one critical step is to determine the overall weight of each ingredient. This is perhaps the most complex step in the whole process, and the subsequent nutritional calculation depends heavily upon it. It is complicated by ingredient listings like:

Pinch of salt to taste
Two 15-ounce cans chickpeas (4 cups), rinsed and drained

A pinch is a very common kitchen unit and has to be handled appropriately for the weight calculation. Similarly, the second ingredient (chickpeas) has a weight hint given in its description (4 cups). Identifying these lexical clues and incorporating them into the weight deduction is achieved in this step. This critical step is often overlooked in the classical Information Extraction literature and is discussed by [1].
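A minimal sketch of how such clues might be handled is shown below; the gram value for a pinch and the regular expression are illustrative assumptions, not the tool's actual tables:

# Very rough, illustrative gram weights for informal kitchen units.
INFORMAL_UNIT_GRAMS = {
  'pinch' => 0.36,  # assumed value, roughly 1/16 teaspoon of salt
  'dash'  => 0.6
}

# Parenthesised weight hints such as "(4 cups)" embedded in a description.
WEIGHT_HINT = /\((\d+(?:\.\d+)?)\s*(cups?|ounces?|grams?|g)\)/i

def weight_hint(description)
  if (m = description.match(WEIGHT_HINT))
    { amount: m[1].to_f, unit: m[2].downcase }
  end
end

puts INFORMAL_UNIT_GRAMS['pinch']                              # => 0.36
p weight_hint('Two 15-ounce cans chickpeas (4 cups), rinsed')  # => {:amount=>4.0, :unit=>"cups"}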

2.4 Nutritional Information

The Recommended Dietary Intake (RDI) is consulted in order to display the accumulated values of the various macro- and micronutrients of the recipe. There are two complications:

1. The RDI values of some nutrients (e.g. cholesterol, dietary fiber) depend on the total calorie intake
2. RDI values are complex functions of age (life stage) and of special conditions or diseases (e.g. diabetes)
The proposed system handles all of these gracefully and generates a more personalized nutritional annotation for the given recipe. For example, the Creme Brulee Oatmeal recipe has the nutritional profile shown in Figure 3.
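A minimal sketch of a calorie-aware percent-RDI calculation follows; the reference values and the 2,000 kcal baseline are illustrative placeholders, not the actual tables used by the system:

# Illustrative baseline daily values for a 2,000 kcal reference diet;
# nutrients whose targets scale with calorie intake are flagged.
BASE_RDI = {
  dietary_fiber: { value: 25.0,  unit: 'g',  scales_with_calories: true  },
  cholesterol:   { value: 300.0, unit: 'mg', scales_with_calories: false }
}

REFERENCE_CALORIES = 2000.0

# Percent of the (possibly calorie-adjusted) daily value contributed by `amount`.
def percent_rdi(nutrient, amount, daily_calories)
  ref    = BASE_RDI.fetch(nutrient)
  target = ref[:value]
  target *= daily_calories / REFERENCE_CALORIES if ref[:scales_with_calories]
  (100.0 * amount / target).round(1)
end

puts percent_rdi(:dietary_fiber, 8.0, 2500)  # fiber target scaled up for a 2,500 kcal diet
puts percent_rdi(:cholesterol, 40.0, 2500)   # cholesterol target unchanged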

References
[1] Fadi Badra, Sylvie Despres, and Rim Djedidi. Ontology and lexicon: the
missing link. In Workshop Proceedings of the 9th International Conference
on Terminology and Artificial Intelligence, pages 16–18, 2011.

Figure 3: Recipe Nutrition Label
