
Turing 
1900 Embarcadero Road #104, 
Palo Alto, California 94303, U.S.A 

Data Engineering Challenge
Overview 
The purpose of this challenge is to test your ability to obtain large amounts of data from web 
sources and perform processing in a distributed manner under the given constraints.  

In this challenge, you will download 100,000 public GitHub repositories and perform some
processing on the downloaded code. Using Amazon Web Services/Google Cloud Platform
and multiple instances to perform the processing is highly recommended. The AWS free tier
provides about 750 hours/month of a t2.micro instance, which should be sufficient for this
challenge. Let us know if you face any problems while setting things up.

Instructions 
Obtaining the data 
A list of 100,000 repositories is provided here. Each repository is around 2 MB in size, and
the primary language is Python. You need to clone all the repositories (only the master
branch) and find a fast and cost-efficient way of obtaining the data and performing the
processing described below.
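One possible approach to the cloning step is sketched below. This is a minimal sketch, not a prescribed method: the file name repos.txt and the repos/ output directory are assumptions for illustration only.

# A minimal sketch, assuming the provided list has been saved locally as
# 'repos.txt' (one clone URL per line); the file name is hypothetical.
import subprocess
from concurrent.futures import ThreadPoolExecutor

def clone(url):
    name = url.rstrip('/').split('/')[-1]
    # --depth 1 fetches only the latest commit of a single branch,
    # keeping both download size and disk usage small.
    subprocess.run(['git', 'clone', '--depth', '1', '--single-branch',
                    url, 'repos/' + name], check=False)

with open('repos.txt') as f:
    urls = [line.strip() for line in f if line.strip()]

# Cloning is I/O-bound, so a thread pool keeps the network busy.
with ThreadPoolExecutor(max_workers=16) as pool:
    list(pool.map(clone, urls))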

Processing 
For each repository, our goal is to compute certain statistics only for the Python code
present. Here is the list of items that you need to compute for each repository (sketches
for several of these items follow the list):

1. Number of lines of code [this excludes comments, whitespace, and blank lines].
2. List of external libraries/packages used.
3. The Nesting factor for the repository: the nesting factor is the average depth of the
nested for loops throughout the code. For example:

# Loop 1
for i in range(100):
    for elem in elements:
        pass  # ...do something...
        for k in elem:
            pass  # ...do something...

# Loop 2
for i in range(100):
    for elem in elements:
        pass  # ...do something...
    for k in range(100):
        pass  # ...do something...

# Loop 1 has a nesting depth of 3
# Loop 2 has a nesting depth of 2
# The average nesting depth for this code is (3 + 2) / 2 = 2.5

Note: You must report the average nesting factor for the entire repository and not
individual files.

4. Code duplication: the percentage of the code that is duplicated, per file. If the same 4
consecutive lines of code (disregarding blank lines, comments, and other non-code
items) appear in multiple places in a file, all occurrences except the first are
considered duplicates.
5. Average number of parameters per function definition in the repository. 
6. Average number of variables defined per line of code in the repository.
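
For items 2, 3, and 5, Python's built-in ast module is one way to extract imports, loop nesting depths, and parameter counts from a single file. The sketch below is an illustration, not a required method: example.py is a hypothetical input, and separating truly external packages from stdlib or repository-local modules would need an extra filtering step.

import ast

def for_depth(node):
    # Deepest chain of nested for loops anywhere below `node`.
    depth = 0
    for child in ast.iter_child_nodes(node):
        d = for_depth(child)
        if isinstance(child, (ast.For, ast.AsyncFor)):
            d += 1
        depth = max(depth, d)
    return depth

def loop_depths(tree):
    # Depth of every outermost for loop (the loop itself counts as 1).
    depths = []
    def visit(node, inside_for=False):
        for child in ast.iter_child_nodes(node):
            is_for = isinstance(child, (ast.For, ast.AsyncFor))
            if is_for and not inside_for:
                depths.append(1 + for_depth(child))
            visit(child, inside_for or is_for)
    visit(tree)
    return depths

def imports_and_params(tree):
    # Imported top-level module names and per-function parameter counts.
    libs, params = set(), []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            libs.update(alias.name.split('.')[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            libs.add(node.module.split('.')[0])
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            a = node.args
            params.append(len(a.posonlyargs) + len(a.args) + len(a.kwonlyargs))
    return libs, params

tree = ast.parse(open('example.py').read())  # hypothetical input file
depths = loop_depths(tree)
nesting_factor = sum(depths) / len(depths) if depths else 0.0

On the worked example above, loop_depths returns [3, 2], so nesting_factor comes out to 2.5 as expected; the per-repository figure would aggregate these per-file results.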
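For item 4, one simple approach (again a sketch, assuming blank lines and comments have already been stripped out of code_lines) records every window of 4 consecutive code lines and marks the lines of any window seen before as duplicated:

def duplication_pct(code_lines, window=4):
    # `code_lines`: the file's lines with blanks and comments removed.
    seen = set()
    dup = [False] * len(code_lines)
    for i in range(len(code_lines) - window + 1):
        chunk = tuple(code_lines[i:i + window])
        if chunk in seen:
            # Every occurrence after the first counts as duplicated.
            for j in range(i, i + window):
                dup[j] = True
        else:
            seen.add(chunk)
    return 100.0 * sum(dup) / len(code_lines) if code_lines else 0.0

Overlapping repeats make this slightly approximate, which is in line with the 5% tolerance described under Grading Criteria.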

Deliverables

1. An output file 'results.json' with the results of the computation in the following JSON
format. Each item in the array represents the result for a single repository (a
serialization sketch follows this list):

[
  { "repository_url": "https://github.com/tensorflow/tensorflow",
    "number of lines": 59234,
    "libraries": ["tensorflow", "numpy", ...],
    "nesting factor": 1.457845,
    "code duplication": 23.78955,
    "average parameters": 3.456367,
    "average variables": 0.03674
  }, ...
]

2. The code, accompanied by a README containing instructions to run it. Please
mention the dependencies, external packages, etc. used to execute the code.
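
A minimal serialization sketch for the output file; the results list here reuses the example record above and would in practice be produced by your pipeline.

import json

# One dict per repository, matching the format shown above.
results = [
    {"repository_url": "https://github.com/tensorflow/tensorflow",
     "number of lines": 59234,
     "libraries": ["tensorflow", "numpy"],
     "nesting factor": 1.457845,
     "code duplication": 23.78955,
     "average parameters": 3.456367,
     "average variables": 0.03674},
]

with open('results.json', 'w') as f:
    json.dump(results, f, indent=2)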

Grading Criteria 
1. Use of distributed systems: We expect you to use multiple nano/micro instances to
distribute the workload.
2. Efficiency: Since these tasks require you to rummage through a lot of text data, you need
to make sure the algorithms and methods used to calculate the statistics are efficient
in both time and memory.
3. Accuracy: How accurate are the statistics? Are all the edge cases covered? We are not
looking for exact answers and will accept anything within 5% of the correct answer.
4. Optimisations: Given that we aren't looking for 100% accuracy, can you trade off some
accuracy for a much faster method?
5. Comments: Well-documented code with docstrings, comments, and/or a README.

Additional Questions

For any additional questions, please use the contact details listed in the email that 
accompanied the challenge. 
