
1 December 2018

NLP for Indian Languages

wingifydevfest@nirantk.com
Why should you care?

Language is emotion expressed

Very few people care about making software and tech for us!
Indians who speak in mixed languages e.g. Hinglish or native languages.
Other equally good Titles for this talk

1. Transfer Learning for Text
2. Making Deep Learning work for Small Text Datasets
Who am I
Nirant Kasliwal (@NirantK)
● Claim to 5 minutes of Internet fame ->
● Research Engineer / NLP Hacker - Maker of hindi2vec
● Work for Soroco (Bengaluru)
Outline
● Text Classification
○ How much tagged data do we really need?
○ How can we use untagged data?
● Transfer Learning for Text
○ Language Models
○ Language Models for Hindi
○ Language Models for 100+ languages
What I expect you know already

Python

Some exposure to modern (deep) machine learning

Great to know: modern (neural) NLP*

Ideas like:

● Seq2seq
● Text Vectors: GloVe, word2vec
● Transformer
What you'll learn today
NEW Idea: Transfer Learning for Text
How to do NLP with small datasets
There are too many NLP challenges in any language!

Automatic speech recognition  Relationship extraction      Machine translation
CCG supertagging              Semantic textual similarity  Multi-task learning
Chunking                      Semantic parsing             Relation prediction
Common sense                  Semantic role labeling       Natural language inference
Constituency parsing          Sentiment analysis           Part-of-speech tagging
Coreference resolution        Stance detection             Question answering
Dependency parsing            Summarization
Dialogue                      Taxonomy learning
Domain adaptation             Temporal processing
Entity linking                Text classification
Grammatical error correction  Word sense disambiguation
Information extraction        Named entity recognition
Language modeling
Lexical normalization
Lexical normalization
Selecting topics which deal more with text semantics (meaning) than
grammar (syntax)
And for today’s discussion:

● Domain Adaptation
● Text Classification
● Language Modeling
What you’ll NOT learn today
No Math.
No peeking under the hood. No code. We will do that later!
Text Classification needs a lot of data!
But exactly how much data is enough?
Let's get some estimates from English datasets:

Dataset    Type                            No. of Classes  No. of Examples in Training Split
IMDb       Sentiment - Movie Reviews       2               25k
Yelp-bi    Sentiment - Restaurant Reviews  2               560K
Yelp-full  Sentiment - Restaurant Reviews  5               650K
DBPedia    Topic                           14              560K

And what is the lowest error rate we get on these?

Dataset    No. of Classes  No. of Examples in Training Split  Test Error Rate (%)
IMDb       2               25k                                5.9
Yelp-bi    2               560K                               2.64
Yelp-full  5               650K                               30.58
DBPedia    14              560K                               0.88

Text Classification
needs a lot of data!
How? Transfer Learning!

Image from https://machinelearningmastery.com/transfer-learning-for-deep-learning/


Data++
Dataset  No. of Classes  Use Untagged Samples  Data Efficiency
IMDb     2               No                    10x
IMDb     2               Yes, 50k Untagged     50x = 100 samples needed

Comparing to identical accuracy when training from scratch


Data--;

On IMDb On TREC-6
SAME TASK TRANSFER - Different Data
MULTI-TASK TRANSFER - Different Data, Different Task
How does this change
things for you?
Simpler code & ideas
Simpler code

BEFORE: DEVELOP and REUSE
1. Select Source Task & Model e.g. Classification
2. Reuse Model e.g. for classifying car types or screenshot segmentation
3. Tune Model to Your Dataset
a. Downside: Needs tagged samples, does not learn from untagged samples
b. Upside: Can give me an initial performance boost
4. Repeat for every New Challenge which you see. BORING!

NOW: DOWNLOAD AND ADAPT to your Task
1. Select Source Model e.g. ULMFit or BERT
2. Reuse Model e.g. for text classification or any other text task
3. Tune Model
a. Can use both untagged and tagged samples
Can use the same source model across multiple tasks, and languages

TEXT EMBEDDING -> BACKBONE -> TASK SPECIFIC LAYER
DATA FLOW DIRECTION
GloVe -> Language Models -> Classifier
DATA FLOW DIRECTION
Simpler Code
We will download pre-trained language models instead of word vectors
Making the Backbone or Source Model
Pre-training for Language Models
The BERT model was trained on two tasks simultaneously: Masked Language Modeling (Masked LM) and Next Sentence Prediction.
Task 1: Masked Language Models

Predict a masked word anywhere in the input. 15% of the tokens that were fed in as input were selected for masking, but not all of them were masked in the same way.
Existing Ideas in word2vec and GloVe
Example: ‘My dog is hairy’, with ‘hairy’ chosen for masking

● 80% of the time the word is replaced by the ‘[MASK]’ token
○ Example: “My dog is [MASK]”
● 10% of the time it is replaced by a random token
○ Example: “My dog is apple”
● 10% of the time it is left intact
○ Example: “My dog is hairy”
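
A minimal sketch of the 80/10/10 masking rule, in plain Python. The function name `mask_tokens`, the toy vocabulary, and the per-token coin flip used to pick positions are assumptions made for illustration; real BERT pre-processing works on WordPiece tokens and is not reproduced here.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Return (masked_tokens, labels) using the 80/10/10 rule.

    labels[i] is the original token where the model must predict it,
    and None at every position that was not selected.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        # Assumption: select each position independently with mask_prob.
        if rng.random() >= mask_prob:
            masked.append(token)
            labels.append(None)
            continue
        labels.append(token)
        roll = rng.random()
        if roll < 0.8:
            masked.append(MASK)               # 80%: replace with [MASK]
        elif roll < 0.9:
            masked.append(rng.choice(vocab))  # 10%: replace with a random token
        else:
            masked.append(token)              # 10%: keep the original token
    return masked, labels


tokens = "my dog is hairy".split()
vocab = ["my", "dog", "is", "hairy", "apple", "store", "milk"]  # toy vocabulary
masked, labels = mask_tokens(tokens, vocab, seed=0)
print(masked)
print(labels)
```

Keeping some selected words intact or swapping them for random ones means the model cannot rely on seeing `[MASK]` at every position it has to predict.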
Task 2: Next Sentence Prediction

Input = {
sentence1: the man went to [MASK] store
sentence2: he bought a gallon [MASK] milk [SEP]
}
Label = IsNext
Input = {
sentence1: the man [MASK] to the store
sentence2: penguin [MASK] are flight ##less birds [SEP]
}
Label = NotNext
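
A rough sketch of how training pairs for Next Sentence Prediction could be built from an ordered list of sentences. The helper `make_nsp_pairs`, the 50/50 split, and the four toy sentences are assumptions for illustration, not the actual BERT data pipeline.

```python
import random


def make_nsp_pairs(sentences, seed=None):
    """Build (sentence1, sentence2, label) triples from ordered sentences.

    About half the pairs use the real next sentence ("IsNext"); the rest
    use a randomly chosen other sentence ("NotNext").
    """
    if len(sentences) < 3:
        raise ValueError("need at least 3 sentences to sample a NotNext pair")
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], "IsNext"))
        else:
            # Any sentence except the current one and its true successor.
            candidates = [j for j in range(len(sentences)) if j not in (i, i + 1)]
            j = rng.choice(candidates)
            pairs.append((sentences[i], sentences[j], "NotNext"))
    return pairs


doc = [  # toy document, made up for this example
    "the man went to the store",
    "he bought a gallon of milk",
    "penguins are flightless birds",
    "they live in the southern hemisphere",
]
for s1, s2, label in make_nsp_pairs(doc, seed=0):
    print(f"{label}: {s1} [SEP] {s2}")
```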
Pause!
Any questions at this point?
Indian Languages
e.g. Hindi, Telugu, Tamil
First Challenge: Making a good backbone
Indian Languages (e.g. Hindi, Telugu, Tamil) -> Text Embedding -> Backbone -> Task Specific Layer
DATA FLOW DIRECTION
Hindi2vec: Based on ULMFit
- Designed to work well on tiny datasets and small compute e.g. I work off free K80 GPUs via Colab
- State of the Art Classification Results on several languages: Polish, German, Chinese, Thai
Hindi2vec: Download a ready to use Backbone
Disclaimer: I made this using FastAI v0.7, and it is a little outdated!

https://github.com/NirantK/hindi2vec
Alternative: Use Google AI’s BERT

Indian Languages (e.g. Hindi, Tamil) -> Text Embedding -> BERT -> Language Specific Layer e.g. हिंदी
DATA FLOW DIRECTION
BERT: Based on OpenAI’s General Purpose
Transformer
- Designed to work well on larger datasets and large compute e.g. it needs a few GPU-days to fine-tune for a specific language
- State of the Art Results on 11 NLP Tasks
BERT-Multilingual : Works for 104 languages!
RELATED MYTH:
Not enough Indian
Language Resources!
Datasets
Ready to Use
- Wikimedia Dumps with 100+ languages
- IIT Bombay English Hindi Corpus includes the following:

Sidenote: You can Make your Own!
- Online Newspapers and Regional TV Forums
- WhatsApp groups!

Just 2 things above are about 100M+ words/tokens with at least 100k unique words
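
To check how big a corpus you have collected, you can count tokens and unique words yourself. A small sketch, assuming whitespace tokenization and an in-memory list of lines; the helper name `corpus_stats` and the two sample sentences are made up for illustration.

```python
from collections import Counter


def corpus_stats(lines):
    """Return (total_tokens, unique_words) using whitespace tokenization."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return sum(counts.values()), len(counts)


sample = [
    "यह एक वाक्य है",      # "this is a sentence"
    "यह दूसरा वाक्य है",   # "this is another sentence"
]
total, unique = corpus_stats(sample)
print(total, unique)
```

For a real dump you would pass an open file object instead of a list, since iterating over a file also yields one line at a time.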
Indic NLP Library
- Link: http://anoopkunchukuttan.github.io/indic_nlp_library/
- GPL! Do not use at work
- Languages Supported:
RELATED MYTH: Non English is hard in Python!
Works out of the box in Python3.5+!

Python 3 strings are natively Unicode now, not ASCII.
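
A quick illustration using only the standard library. The example word and sentence are chosen here for illustration.

```python
import unicodedata

word = "हिंदी"  # "Hindi" written in Devanagari

# str is a sequence of Unicode code points, so Hindi text needs no special type.
# Note: len() counts code points (vowel signs included), not visible characters.
print(len(word))
print(len(word.encode("utf-8")))  # bytes needed once encoded as UTF-8

# Every code point has an official Unicode name.
for ch in word:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# Whitespace tokenization works exactly as it does for English.
sentence = "मुझे हिंदी पसंद है"  # "I like Hindi"
tokens = sentence.split()
print(tokens)
```

When reading or writing files, pass `encoding="utf-8"` to `open()` explicitly so the result does not depend on the platform default.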


More!
This looks promising. What else can I do with this?

- Pretty crazy stuff:


E.g. Ask questions and learn inference!

Screenshot from SQuAD Explorer 1.1


Where does this fail?
1. Small Sentences e.g. chat, Tweets
2. Long tail inference e.g. stories
○ E.g. Who was on Arjuna’s chariot in
Mahabharata? Cannot infer Hanuman
3. Hinglish - but, bbbut - aap finetune kar sakte ho! (you can fine-tune!)
Takeaway
Transfer Learning for text is here

- It helps us work with really small compute and data

Key Idea: Language Models are great backbones

- BERT and ULMFit are proven, reusable LMs


What can I try from this talk?

PyTorch:
- Download the Google BERT or ULMFit Models

TensorFlow:
- Download the GoogleAI BERT Models

Train your own good-morning message or not classifier from WhatsApp chats!
Thanks for Coming!
Questions?

@NirantK
Created by @rasagy,
Typo: 1st Dec 2018 not 2019
Credits and Citations
- Slides and gifs from Writing Good Code for NLP Research by Joel
Grus at AllenAI
- ULMFit Paper and Blog by Jeremy Howard (fast.ai) and Sebastian
Ruder (@seb_ruder)
- Recommended Reading: Illustrated BERT
- BERT Dissections: Paper, Blogs: The Encoder, The Specific Mechanics, The Decoder
- Visualisations Made from Neural Nets Visualisation Cheatsheet
Appendix
Appendix: 1 Slide Summary of ULMFit Paper
Howard and Ruder suggest using pre-trained models for solving a wide range of NLP problems. With this
approach, you don’t need to train your model from scratch, but only fine-tune the original model. Their
method, called Universal Language Model Fine-Tuning (ULMFiT) outperforms state-of-the-art results,
reducing the error by 18-24%. Even more, with only 100 labeled examples, ULMFiT matches the
performance of models trained from scratch on 10K labeled examples.

However, to be successful, this fine-tuning should take into account several important considerations:
● Different layers should be fine-tuned to different extents as they capture different kinds of
information.
● Adapting the model’s parameters to task-specific features is more efficient if the learning rate is
first increased linearly and then decayed linearly.
● Fine-tuning all layers at once is likely to result in catastrophic forgetting; thus, it would be better to
gradually unfreeze the model starting from the last layer.

From TopBots: The Most Important AI Papers of 2018
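
The three fine-tuning ideas in the summary above can be sketched in a few lines of plain Python. The function names are made up for this example; the default values (`cut_frac=0.1`, `ratio=32`, `lr_max=0.01`, and dividing the learning rate by 2.6 per layer) are the ones I recall from the ULMFiT paper, so treat them as assumptions and check the paper before relying on them.

```python
import math


def slanted_triangular_lr(t, total_steps, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Learning rate at step t: short linear warm-up, then long linear decay."""
    cut = math.floor(total_steps * cut_frac)
    if cut == 0:
        raise ValueError("total_steps is too small for this cut_frac")
    if t < cut:
        p = t / cut                                      # warm-up phase
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))   # decay phase
    return lr_max * (1 + p * (ratio - 1)) / ratio


def discriminative_lrs(lr_last, n_layers, decay=2.6):
    """One learning rate per layer (first layer first); lower layers learn slower."""
    return [lr_last / decay ** (n_layers - 1 - i) for i in range(n_layers)]


def unfreezing_schedule(n_layers):
    """Indices of trainable layers per epoch: last layer first, one more each epoch."""
    return [list(range(n_layers - 1 - epoch, n_layers)) for epoch in range(n_layers)]


for step in (0, 50, 100, 500, 1000):
    print(step, slanted_triangular_lr(step, total_steps=1000))
print(discriminative_lrs(0.01, n_layers=3))
print(unfreezing_schedule(3))
```

In a real training loop these values would be fed to the optimizer's per-layer parameter groups; that wiring depends on the framework and is left out here.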


Appendix: 1 Slide Summary of BERT

Training Tasks: Masked Language Model (15% of tokens masked at random), Next Sentence Prediction

Results: SoTA on 11 NLP Tasks, mostly around Inference and QA. Indicates that the model can be fine-tuned on
both new datasets and new tasks

Model: BERT-Base is inspired by the OpenAI Transformer and has roughly the same number of parameters. BERT-Large has
340M parameters, based on Transformer Networks.

Want to understand Transformer Network architecture? Here is an Illustrated Intro to Transformers