
RESEARCH PROPOSAL

BHAVIK NITIN MER


17304936

Contents
1 Proposed Supervisor
2 Research Area
3 Research Background
4 Research Aims
5 Research Question
6 Proposed Methodology/Implementation Approach and Literature Research Strategy
7 Ethical Issues
8 Sources of literature
9 Your own expertise and how well you are positioned to carry out the work
10 Proposed Table of contents for your dissertation


1 Proposed Supervisor
Professor Myra O’Regan

2 Research Area
My research explores the ensemble method XGBoost. The broader subject area of
this study is Data Analytics.

3 Research Background
eXtreme Gradient Boosting (XGBoost) is an improved implementation of gradient
boosting. It is a machine learning method widely used by data scientists to
achieve state-of-the-art results. Tree boosting gives excellent results on many
standard classification benchmarks. XGBoost provides parallel tree boosting
(also known as GBM) that solves many data science problems quickly and
accurately.
The most important factor behind the success of XGBoost is its scalability,
which results from several systems and algorithmic optimisations. These
include a novel tree learning algorithm for handling sparse (missing) data,
and parallel and distributed computing, which make learning faster and enable
quicker model exploration [2].
XGBoost belongs to a broader collection of tools under the umbrella of the
Distributed Machine Learning Community, who are also the creators of the
popular mxnet deep learning library [1]. Tianqi Chen and Carlos Guestrin set
out to address directions such as out-of-core computation, cache-aware
learning and sparsity-aware learning, which had not been explored in existing
work on parallel tree boosting [2].
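To make the sparsity-aware idea concrete, the following is a minimal sketch in
plain Python of how a default direction for missing values can be chosen at a
single split, following the description in [2]. It is a simplification for
illustration only, not the XGBoost library's implementation: the function
names (split_gain, best_split_with_missing), the parameters lam and gamma, and
the toy data are my own assumptions, and missing values are represented as
None.

```python
from typing import List, Optional, Tuple


def split_gain(g_l: float, h_l: float, g_r: float, h_r: float,
               lam: float = 1.0, gamma: float = 0.0) -> float:
    """Gain of a split, from summed gradients (g) and Hessians (h) per side."""
    def score(g: float, h: float) -> float:
        return g * g / (h + lam)

    return 0.5 * (score(g_l, h_l) + score(g_r, h_r)
                  - score(g_l + g_r, h_l + h_r)) - gamma


def best_split_with_missing(
    values: List[Optional[float]], grads: List[float], hess: List[float],
    lam: float = 1.0, gamma: float = 0.0,
) -> Tuple[float, Optional[float], Optional[str]]:
    """Return (gain, threshold, default_direction) for one feature.

    Only rows with a present value are scanned; the rows with a missing
    value are assigned as a block to the left or to the right child,
    whichever gives the higher gain.
    """
    present = sorted((v, g, h) for v, g, h in zip(values, grads, hess)
                     if v is not None)
    g_miss = sum(g for v, g in zip(values, grads) if v is None)
    h_miss = sum(h for v, h in zip(values, hess) if v is None)
    g_total, h_total = sum(grads), sum(hess)

    best: Tuple[float, Optional[float], Optional[str]] = (float("-inf"), None, None)
    g_left = h_left = 0.0
    for i in range(len(present) - 1):
        v, g, h = present[i]
        g_left += g
        h_left += h
        next_v = present[i + 1][0]
        if next_v == v:
            continue  # no threshold can separate equal values
        threshold = (v + next_v) / 2
        # Option 1: missing rows go to the left child.
        gain_left = split_gain(g_left + g_miss, h_left + h_miss,
                               g_total - g_left - g_miss,
                               h_total - h_left - h_miss, lam, gamma)
        # Option 2: missing rows go to the right child.
        gain_right = split_gain(g_left, h_left,
                                g_total - g_left, h_total - h_left, lam, gamma)
        for gain, direction in ((gain_left, "left"), (gain_right, "right")):
            if gain > best[0]:
                best = (gain, threshold, direction)
    return best


# Toy data (made up): squared-error loss with an initial prediction of 0,
# so the gradient is (0 - y) and the Hessian is 1 for every row.
values = [1.0, 2.0, None, 3.0, None, 4.0]
labels = [0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
grads = [0.0 - y for y in labels]
hess = [1.0] * len(labels)

gain, threshold, direction = best_split_with_missing(values, grads, hess)
print(threshold, direction)
```

In this toy example the rows with a missing value have the same labels as the
rows with large feature values, so sending missing values to the right child
gives the higher gain.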

4 Research Aims
The aims for my research are sequentially defined as follows:
1. To understand the idea behind XGBoost: understanding the term scalability
as used for XGBoost by Tianqi Chen, the advantages and shortcomings of
XGBoost, and the mathematics behind this ensemble.
2. To implement XGBoost: implementing XGBoost with different frameworks such
as R, Python and Spark, and on different datasets, such as an imbalanced
dataset and a dataset with many missing values.
3. To explain results: since ensembles are mostly treated as black boxes,
interpreting the results is all the more important. For XGBoost there is an R
package named 'xgboostExplainer' which explains individual predictions of the
model.
4. To compare gradient boosting, XGBoost and other ensembles: the major focus
will be on comparing gradient boosting with XGBoost, understanding the
shortcomings of gradient boosting and the benefits of using XGBoost. Finally,
the performance of different popular ensembles will be compared.
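As a starting point for the mathematics referred to in aim 1, the regularised
objective and the resulting split gain can be written as below. This follows
the formulation in [2] as I understand it. The notation is taken from that
paper rather than from this proposal: G and H are the summed first- and
second-order gradients of the loss over the instances in a leaf (L and R
denote the left and right child of a candidate split), T is the number of
leaves, and w is the vector of leaf weights.

```latex
% Regularised objective over an ensemble of K trees f_1, ..., f_K
\mathcal{L}(\phi) = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k=1}^{K} \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2} \lambda \lVert w \rVert^{2}

% Optimal weight of leaf j under the second-order approximation
w_j^{*} = -\frac{G_j}{H_j + \lambda}

% Gain of splitting a node into a left (L) and a right (R) child
\mathcal{L}_{\text{split}} = \frac{1}{2} \left[
    \frac{G_L^{2}}{H_L + \lambda}
  + \frac{G_R^{2}}{H_R + \lambda}
  - \frac{(G_L + G_R)^{2}}{H_L + H_R + \lambda}
\right] - \gamma
```

The penalty terms with gamma and lambda are one of the differences from plain
gradient boosting that the comparison in aim 4 would need to examine.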

5 Research Question
Considering the background and the current scenario, there are four concrete
questions to answer:
Q1. What is the idea behind the development of XGBoost, and what is the
mathematics behind it?
Q2. How can an XGBoost model be implemented on different datasets using
different frameworks?
Q3. What results does XGBoost generate, and how can we explain why XGBoost
made a particular decision?
Q4. How does XGBoost compare and contrast with gradient boosting and with
other ensembles?

6 Proposed Methodology/Implementation Approach and Literature Research Strategy
More than one dataset will be used. For each dataset, we will first procure
the data from the listed sources and then analyse it to obtain insights.
1. Descriptive analysis - Describing each dataset and explaining the results.
2. Data cleansing - After the analysis, the next step is to clean the data;
here we will study how XGBoost deals with missing data.
3. Implementing XGBoost - Implementing XGBoost on the cleaned dataset using
different frameworks.
4. Comparing results - Comparing the results generated by the different
frameworks, and comparing XGBoost with different ensembles.
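For step 4, a small helper like the one below could be used to compare models
on the same test set. This is only a sketch: the function name binary_metrics
and the two prediction lists are hypothetical, made-up values for
illustration, not results from any experiment. It reports precision, recall
and F1 alongside accuracy, since accuracy alone can be misleading on an
imbalanced dataset.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for 0/1 labels (1 = positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    # Guard against division by zero when a model never predicts a class.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(y_true)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up imbalanced test set: 8 negatives, 2 positives.
y_true = [0] * 8 + [1] * 2
pred_a = [0] * 10            # hypothetical model A: always predicts the majority class
pred_b = [0] * 7 + [1] * 3   # hypothetical model B: finds both positives, one false alarm

metrics_a = binary_metrics(y_true, pred_a)
metrics_b = binary_metrics(y_true, pred_b)
for name, m in (("model A", metrics_a), ("model B", metrics_b)):
    print(name, {k: round(v, 3) for k, v in m.items()})
```

Model A reaches a high accuracy here without identifying a single positive
case, which is why the comparison should not rely on accuracy alone.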

7 Ethical Issues
Throughout the research, the researcher must act ethically and ensure that no
malpractice takes place. All data collected from Trinity College Dublin will
be kept confidential and used only for the purposes of this research, and
none of the data will be disclosed.

8 Sources of literature
The sources of literature can be broadly classified as:
XGBoost
A Scalable Tree Boosting System
Improvement over gradient boosting

9 Your own expertise and how well you are positioned to carry out the work
This research is based on XGBoost using R, with statistical analysis in Excel.
I have excellent analytical and programming skills which, I believe, will help
me carry out the research successfully. I am familiar with RStudio and with
Microsoft Excel as a statistical analysis tool, and I have the required
fundamental statistical knowledge. In short, my strong interest and my
preparatory initiatives, combined with my technical knowledge, make me well
suited to undertake this research.

10 Proposed Table of contents for your dissertation
1. ACKNOWLEDGMENTS
2. ABSTRACT
3. TABLE OF CONTENTS
4. TABLE OF FIGURES
5. INTRODUCTION
(a) RESEARCH BACKGROUND
(b) RESEARCH OBJECTIVE
(c) RESEARCH QUESTION
(d) STRUCTURE OF THESIS
6. STATE OF THE ART
7. METHODOLOGY
8. RESULTS
9. CONCLUSION
10. DISCUSSION
(a) STRENGTHS, LIMITATIONS AND RELIABILITY OF THE RE-
SEARCH RESULTS
(b) SCOPE OF FUTURE WORK
11. BIBLIOGRAPHY

12. APPENDICES

References
[1] A gentle introduction to XGBoost for applied machine learning.
https://machinelearningmastery.com/gentle-introduction-xgboost-applied-machine-learning/.
Accessed: 2017-12-31.
[2] Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting
system. In Proceedings of the 22nd ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, KDD ’16, pages 785–794, New York, NY,
USA, 2016. ACM.