Big Data Analytics

Associate Prof. Amaury Lendasse

IE:4172:0001
Grades

2
Exams
• Midterm and Final
• October 13th and December ?th
• 3 questions:
1. An easy one
2. A normal one
3. A difficult one

3
Semester Project
• Per group of 4 or 5 students
• Groups have been created!!!!
• REAL Big Data problems! Different for each group
• You will improve your knowledge of one of the following
“programming languages”: Hadoop, MapReduce, GPU, clusters
using C, C++, Python or Matlab
• My TA will support you!
• Report + Code + Presentation
• You WILL do a GREAT job so I will invite some people from the
Informatics Initiative to see your presentations :)

4
5
Presentations
• Dates: December 8th and December 10th
• Groups 1, 2 and 3 will present on December 8th
• 20 minutes per group
• Groups 4 and 5 will present on December 10th
• 25 minutes per group
• Everybody should participate in the presentations!
• Everybody should be able to answer questions
• Everybody should ask questions

6
Resources? Neon & Helium
• College of Engineering
1. Helium (6) 8-core, 24-GB
2. Neon (10) 16-core, 64-GB
• My own node:

8
Xeon Phi

9
10
Xeon Phi

11
Resources? Bonus!

• A small Hadoop cluster (1 name node / 4 data nodes)
• Pilot cluster from Informatics Initiative and EPSCo

12
Storage

• /Shared/bdagroup1-5
• Ready by Friday
• I will put the data in the share
• I can get the result too

13
Group 1:
Santanu Bhowmick, Mark Betman, Elliott Soemadi, Daniel Weinbeck

14
Group 1: Detection of Malicious Files 

(using Hadoop)

15
Group 1: Detection of Malicious Files 

(using Hadoop)
• Zero False Positives
• High coverage
• Predictions for millions of samples
• Small computational time

                  Predicted Clean    Predicted Malware
Actual Clean      True Negative      False Positive
Actual Malware    False Negative     True Positive

16
Group 1: Detection of Malicious Files 

(using Hadoop)
1 10 AGKGYFUYGKJNLJK
1 13 TGCGKJNLHJKGJHVJKBLHU;JH
1 25 JHJHVKJNKLHIKJB
1 26 A
1 30 HMVJHKNMKLN
1 30 IJNOEQVBVQOUONKJQIOINV
1 30 LKNCIWBVIWQN
1 30 IJBIQBVONVIOQRPVUQI
1 30 JVDQNVDKJVQHJKHVKJJQEKHHJ
1 30 LKMLUVELEKBVBEQYYUVQUIVKBQUYI
1 30 KNNO
1 30 REEXXTCUVLBHBIHLGTYFTYUVGVUUYCVLHVKTVKJ
1 30 KJLBJHKHJ
1 30 KLNJKSNJVKVBVJKJSDB

17
Group 1: Detection of Malicious Files 

(using Hadoop)

• Task: Build a classifier using a training set
• Predict the classes for the samples of a test set (class unknown to you)
• Evaluation: based on the percentage of false positives and false negatives

(so you have to submit your predictions to me)
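
A minimal sketch of how the two evaluation percentages could be computed from a list of predictions. The slides do not say which denominator is used, so this assumes each rate is taken relative to the number of files that really belong to that class; the function name and the "clean"/"malware" labels are hypothetical.

```python
def error_rates(actual, predicted, positive="malware"):
    """Return (false positive %, false negative %) for two label lists.

    Assumption: false positives are counted relative to the truly clean
    files, false negatives relative to the truly malicious files.
    """
    fp = fn = negatives = positives = 0
    for a, p in zip(actual, predicted):
        if a == positive:
            positives += 1
            if p != positive:
                fn += 1  # malware predicted as clean
        else:
            negatives += 1
            if p == positive:
                fp += 1  # clean file predicted as malware
    fp_rate = 100.0 * fp / negatives if negatives else 0.0
    fn_rate = 100.0 * fn / positives if positives else 0.0
    return fp_rate, fn_rate


# Toy example: 4 clean files and 2 malware files.
actual = ["clean", "clean", "clean", "clean", "malware", "malware"]
predicted = ["clean", "malware", "clean", "clean", "malware", "clean"]
fp_rate, fn_rate = error_rates(actual, predicted)
print(f"False positives: {fp_rate:.1f}% of clean files")
print(f"False negatives: {fn_rate:.1f}% of malware files")
```

With a "zero false positives" target, the first number is the one to drive down first, even at the cost of some coverage.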

18
Group 1: Detection of Malicious Files 

(using Hadoop)

• Reference (not a very good one):
Methodology for Behavioral-based Malware Analysis and
Detection using Random Projections and K-Nearest
Neighbors Classifiers, J. Hegedüs, Y. Miche, A. Ilin and A.
Lendasse. In 7th International Conference on Computational
Intelligence and Security (CIS2011). December, 2011.
http://research.ics.aalto.fi/eiml/Publications/Publication191.pdf

19
Group 2:
Timur Dogan, Steven Hanson, Kelcey Fabianski, Claire Van Ingen

20
Group 2:
Rain Prediction in Iowa

23
Group 3:
Trace Yuhas, Tom Werner, Reddy Pratap Gandrajula, Alex Junk

24
Group 5:
Zhiya Zuo, Meeshanthini Dogan, Joel Tosado Jimenez, Conner Kester,
Andrey Gritsenko

25
Groups 3 and 5

Detection of Skin

26
Groups 3 and 5

Detection of Skin

• Each pixel has to be classified
• The class depends on its color (RGB)
• But also depends on the color of the surrounding pixels

28
Groups 3 and 5

Detection of Skin
Skin or not?

29
Groups 3 and 5

Detection of Skin
NOT!

30
Groups 3 and 5

Detection of Skin
Yes!

31
Groups 3 and 5

Detection of Skin

• 4000 images
• Use blocks of 7x7 pixels (or more?)
• So 147 color values per block: 49 pixels x 3 channels (RGB)
• 2000 images: mask is known (training set)
• 2000 images: mask is unknown (test set)
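
A minimal sketch of how one image could be turned into samples with 147 variables each, assuming NumPy and an image stored as a height x width x 3 array. The function name is hypothetical, and skipping the border pixels (which have no full 7x7 neighborhood) is an assumption; padding the image would be another option.

```python
import numpy as np


def block_features(image, half=3):
    """Return one row per interior pixel: its flattened RGB block.

    With half=3 the block is 7x7 pixels, so each row holds 49 x 3 = 147
    values. Border pixels without a full block are skipped (assumption).
    """
    height, width, _ = image.shape
    rows = []
    for y in range(half, height - half):
        for x in range(half, width - half):
            block = image[y - half:y + half + 1, x - half:x + half + 1, :]
            rows.append(block.reshape(-1))
    return np.array(rows)


# Toy 10 x 12 RGB image with random colors.
image = np.random.default_rng(0).integers(0, 256, size=(10, 12, 3), dtype=np.uint8)
features = block_features(image)
print(features.shape)  # (24, 147): 4 x 6 interior pixels, 147 values each
```

The double loop is only for readability; on full-size images a vectorized or distributed version would be needed.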

32
Groups 3 and 5

Detection of Skin

• 4000 images
• Use blocks of 7x7 pixels (or more?)
• So 147 color values per block: 49 pixels x 3 channels (RGB)
• 2000 images: mask is known (training set)

536 million pixels

33
Groups 3 and 5

Detection of Skin

• 4000 images
• Use blocks of 7x7 pixels (or more?)
• So 147 color values per block: 49 pixels x 3 channels (RGB)
• 2000 images: mask is known (training set)

536 million samples with 147 variables

About the same for the test set
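
A quick back-of-the-envelope check of what 536 million samples with 147 variables means in memory, assuming the training set is stored as one dense matrix. The function name and the choice of element types are illustrative.

```python
def matrix_gigabytes(samples, variables, bytes_per_value):
    """Size of a dense samples x variables matrix in GB (1 GB = 1e9 bytes)."""
    return samples * variables * bytes_per_value / 1e9


SAMPLES = 536_000_000  # training pixels, from the slide
VARIABLES = 147        # 7 x 7 pixels x 3 channels

for dtype, width in [("uint8", 1), ("float32", 4), ("float64", 8)]:
    print(f"{dtype}: {matrix_gigabytes(SAMPLES, VARIABLES, width):.1f} GB")
# uint8: 78.8 GB
# float32: 315.2 GB
# float64: 630.3 GB
```

Even at one byte per value, the training matrix is larger than the 64 GB of a Neon node, so the data has to be processed in chunks or spread over several nodes.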

34
Groups 3 and 5

Detection of Skin

• Each Group
• Evaluation: based on the percentage of false positives and false negatives

(so you have to submit your predictions to me)

35
Groups 3 and 5

Detection of Skin
• Reference:
S. L. Phung, A. Bouzerdoum, and D. Chai,
"Skin segmentation using color pixel
classification: Analysis and comparison," IEEE
Transactions on Pattern Analysis and Machine
Intelligence, January 2005, vol. 27, no. 1, pp.
148-154.

36
Group 3

Detection of Skin with ELM

• Use ELM (Extreme Learning Machine) and predict the mask (test set)
• Use blocks of 7x7 pixels (or more?)
• So 147 color values per block: 49 pixels x 3 channels (RGB)
• 2000 images: mask is known (training set)

536 million samples with 147 variables

About the same for the test set
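
A minimal sketch of a basic ELM, assuming NumPy: the hidden layer gets random weights that are never trained, and only the output weights are fitted by regularized least squares. This is the plain ELM, not the pruned OP-ELM variant; the function names, the tanh activation, the ridge term and the 2-variable toy data (instead of 147 variables) are all assumptions for illustration.

```python
import numpy as np


def elm_train(X, y, n_hidden=20, ridge=1e-3, seed=0):
    """Fit a basic ELM: random hidden layer, least-squares output weights."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_hidden))  # random input weights, kept fixed
    b = rng.normal(size=n_hidden)                # random biases, kept fixed
    H = np.tanh(X @ W + b)                       # hidden-layer outputs
    # Ridge-regularized least squares for the output weights.
    beta = np.linalg.solve(H.T @ H + ridge * np.eye(n_hidden), H.T @ y)
    return W, b, beta


def elm_predict(model, X):
    """Return 1 (skin) where the ELM output exceeds 0.5, else 0."""
    W, b, beta = model
    return (np.tanh(X @ W + b) @ beta > 0.5).astype(int)


# Toy data: two well-separated clusters standing in for skin / non-skin.
rng = np.random.default_rng(1)
skin = rng.normal(0.8, 0.05, size=(100, 2))
other = rng.normal(0.2, 0.05, size=(100, 2))
X = np.vstack([skin, other])
y = np.array([1] * 100 + [0] * 100)

model = elm_train(X, y)
accuracy = (elm_predict(model, X) == y).mean()
print(f"training accuracy: {accuracy:.2f}")
```

Because training only needs the products H.T @ H and H.T @ y, both can be accumulated chunk by chunk, which is one way to cope with a training set that does not fit in memory.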

37
Group 3

Detection of Skin with ELM
• Reference:
OP-ELM: Optimally-Pruned Extreme Learning
Machine, Y. Miche, A. Sorjamaa, P. Bas, O. Simula,
C. Jutten and A. Lendasse. In IEEE Transactions on
Neural Networks, volume 21, pages 158-162.
January, 2010.

38
Group 5

Detection of Skin with KNN

• Use KNN (k-nearest neighbors) and Hadoop (and MapReduce or Spark) to predict the mask (test set)
• Use blocks of 7x7 pixels (or more?)
• So 147 color values per block: 49 pixels x 3 channels (RGB)
• 2000 images: mask is known (training set)

536 million samples with 147 variables

About the same for the test set
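
A minimal sketch of how KNN fits the MapReduce pattern, written in plain Python rather than against the Hadoop or Spark APIs. It assumes the training set is split into partitions: each map call returns the k nearest samples of its own partition, and the reduce step merges those candidates and takes a majority vote. The function names and the toy 2-variable data are hypothetical.

```python
import heapq
from collections import Counter


def map_partition(partition, query, k):
    """Map step: the k nearest (squared distance, label) pairs in one partition."""
    scored = (
        (sum((a - b) ** 2 for a, b in zip(features, query)), label)
        for features, label in partition
    )
    return heapq.nsmallest(k, scored)


def reduce_candidates(candidate_lists, k):
    """Reduce step: keep the global k nearest candidates, then majority vote."""
    merged = heapq.nsmallest(k, (c for lst in candidate_lists for c in lst))
    votes = Counter(label for _, label in merged)
    return votes.most_common(1)[0][0]


# Toy training set split into two partitions: (features, label), 1 = skin.
partitions = [
    [((0.1, 0.1), 0), ((0.2, 0.1), 0), ((0.9, 0.8), 1)],
    [((0.8, 0.9), 1), ((0.15, 0.2), 0), ((0.85, 0.95), 1)],
]


def knn_predict(query, k=3):
    """Run the map step on every partition, then reduce to one label."""
    return reduce_candidates([map_partition(p, query, k) for p in partitions], k)


print(knn_predict((0.9, 0.9)))
print(knn_predict((0.1, 0.15)))
```

The result matches a single-machine KNN because the global k nearest neighbors are always among the per-partition k nearest, so each mapper only needs to send k candidates to the reducer.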

39
Group 5

Detection of Skin with KNN
• Reference:
Speedy Greedy Feature Selection: Better Redshift
Estimation via Massive Parallelism, F. Gieseke, K.
Lars Polsterer, C. Oancea, and C. Igel, ESANN
2014
https://www.elen.ucl.ac.be/Proceedings/esann/esannpdf/es2014-171.pdf

40
Group 4:
Yusen He, Yuanyang Liu, Michael Kirchner, Mary Wilson, Abhay Shah

41
Group 4

SOM on GPU

• Use self-organizing maps (SOM) to visualize and classify the dataset
• Million Song Dataset: not the subset! The full dataset!
• Use the Tesla cards on Neon
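
A minimal CPU sketch of online SOM training, assuming NumPy: for each sample, find the best-matching unit, then pull that unit and its grid neighbors toward the sample. This is not a GPU implementation; the function name, the grid size, the linear decay schedules and the random toy data are assumptions for illustration.

```python
import numpy as np


def som_train(data, grid=(5, 5), epochs=10, lr0=0.5, sigma0=2.0, seed=0):
    """Train a small SOM and return (unit weights, unit grid coordinates)."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    weights = rng.random((rows * cols, data.shape[1]))
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    n_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in data[rng.permutation(len(data))]:
            frac = step / n_steps
            lr = lr0 * (1.0 - frac)                        # learning rate decays to 0
            sigma = sigma0 * (1.0 - frac) + 0.5 * frac     # radius shrinks to 0.5
            # Best-matching unit: the unit whose weights are closest to x.
            bmu = np.argmin(((weights - x) ** 2).sum(axis=1))
            # Gaussian neighborhood measured on the map grid, not in data space.
            grid_dist2 = ((coords - coords[bmu]) ** 2).sum(axis=1)
            h = np.exp(-grid_dist2 / (2.0 * sigma ** 2))
            weights += lr * h[:, None] * (x - weights)
            step += 1
    return weights, coords


# Toy data: 200 samples with 3 variables in [0, 1).
data = np.random.default_rng(1).random((200, 3))
weights, coords = som_train(data)
print(weights.shape)  # (25, 3): one weight vector per map unit
```

The distance computation between a sample and all units, and the weight update over all units, are the data-parallel parts that a GPU version would typically take over.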

42
Group 4

SOM on GPU
• Reference:
Sparse linear combination of SOMs for data imputation:
Application to financial database, A. Sorjamaa, F. Corona, Y.
Miche, P. Merlin, B. Maillet, E. Séverin and A. Lendasse. In
Lecture Notes in Computer Science: Advances in Self-
Organizing Maps - Proceedings of WSOM 2009, Saint
Augustine (Florida), volume 5629/2009, pages 290-297.
June 8-10, 2009.
http://research.ics.aalto.fi/eiml/Publications/Publication144.pdf

44