
Using big data to map and analyze collaboration networks in an organization

FIT5145: Introduction to Data Science

Assignment: Data Science Project Proposal / Case Study

Student Name: Amrik Singh

Student ID: 28086252


Project Description

Background and Importance of the topic:

The performance of an organization depends on how people within it collaborate to achieve
organizational goals. Apart from the formal organization structure that defines roles and
responsibilities (Distelzweig), organizations have informal networks that determine how
work is actually done in practice (Informal Structure, 2016). The informal network formed
during collaboration is called a collaboration network (see Figure 1) and consists of
interactions, information flows, knowledge sharing, and informal relationships among people
in an organization (Moore, 2011). As organizations have become geographically distributed,
people within a collaboration network use a shared technological platform, such as an email
server, to communicate with each other (Collaborative network).

Figure 1

In today’s fast-changing business environment and knowledge-based economy, human capital
has become the main determinant of sustainable competitive advantage and continuous
innovation (Brown, 2016). Thus, every organization needs to better understand how its
people collaborate to achieve common organizational goals. This understanding can help
organizations improve their efficiency and effectiveness by creating a better environment
for collaboration, improving the flow of information among employees, and optimizing the
use of shared resources.


The project proposal and data science involved:

Organizational Network Analysis provides a way to map and analyze collaboration patterns
in an organization (Cross). However, the traditional survey approach to organizational
network analysis is time-consuming and cost-intensive, and has many other limitations that
make it unsuitable for an organization operating in a dynamic business environment
(Hass, 2016). Employing big data analytics can not only address these issues but also
provide better actionable insights in real time. Advancements in technology have made it
possible to track human interactions and capture data that can be easily accessed and
analyzed (The Big Data Opportunity for HR and Finance, 2013). By leveraging big data to
generate and analyze collaboration network maps, an organization can not only discover
hidden patterns but also improve teamwork among its employees.

Improved collaboration results in improved employee engagement and helps employees attain
new skills. These effects lead to increased efficiency and thus improve organizational
performance (Boyer). This report proposes a data science project that uses big data to map
and analyze collaboration networks in an organization. The idea is to use the data
available on the email server and business information system of an organization to track
the flow of information and create network maps for each employee. The first step is to
use “hashing” to automatically scan emails and then use the information extracted from the
hashes or hash codes for further processing (Epner, 2014). The role of the data scientist
starts with creating hashes that uniquely identify the contents of emails, then writing
hash-code-based algorithms and using supervised machine learning tools to scan emails
automatically. The data generated from such email analytics is then categorized according
to the type and importance of the content or the hash value. Using the process of social
network analysis, an organization-wide collaboration network can be formed on the basis of
hashes of emails exchanged among employees, where each email account serves as a node.
Similarly, the email account of each employee can be analyzed to form a collaboration
network for that employee. Further, both organizational and individual networks can be
enhanced by integrating information about the formal organizational structure, the roles
and responsibilities of individual employees, key result areas, and key performance
indicators.
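
The hashing and network-building steps described above can be sketched roughly as follows. This is a minimal illustration, not the proposal's actual implementation: the sample messages, the field names, the choice of SHA-256, and the helper names content_hash and build_edges are all assumptions made for the example.

```python
import hashlib
from collections import Counter

# Hypothetical sample messages; a real project would read these from the mail server.
emails = [
    {"sender": "ana@example.com", "recipients": ["bo@example.com"],
     "body": "Q3 budget draft attached"},
    {"sender": "bo@example.com", "recipients": ["ana@example.com", "cy@example.com"],
     "body": "Re: Q3 budget draft"},
    {"sender": "ana@example.com", "recipients": ["bo@example.com"],
     "body": "Q3 budget draft attached"},  # exact duplicate of the first message
]


def content_hash(body: str) -> str:
    """Return a SHA-256 digest of the body after normalizing case and whitespace."""
    normalized = " ".join(body.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def build_edges(messages):
    """Count distinct messages exchanged between each pair of email accounts."""
    seen = set()       # (sender, recipient, hash) triples already counted
    edges = Counter()  # undirected pair of accounts -> number of distinct messages
    for msg in messages:
        digest = content_hash(msg["body"])
        for recipient in msg["recipients"]:
            key = (msg["sender"], recipient, digest)
            if key in seen:
                continue  # the hash shows this message was already counted
            seen.add(key)
            pair = tuple(sorted((msg["sender"], recipient)))
            edges[pair] += 1
    return edges


edges = build_edges(emails)
for pair, count in sorted(edges.items()):
    print(pair, count)
```

Each email account becomes a node and each counted pair becomes a weighted link, so the resulting counts can be passed directly to a network analysis tool.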

The collaboration networks formed are then analyzed on the basis of network size, strength,
range, density, and centrality, as explained in the Business Model section. Visualizing
collaboration networks can show how information flows, how employees communicate, and
how decisions are made in the organization. These insights can help in identifying key
employees or roles, the bottlenecks in the system, and the steps management can take to
improve the existing networks (Robert L. Cross, 2007). Thus, an organization can benefit
from these insights in many ways and can improve its overall performance.


Business Model

This project proposal falls under the emerging field of people analytics, which aims to
understand and predict business performance and to drive actionable insights by
integrating data about the workforce with business data (Bersin, 2016). Recent research on
global human capital trends has found that 92% of companies believe that organizational
structure is important for success and that they are not optimally organized for it (Jeff
Schwartz, 2016). It therefore becomes imperative to develop tools to better understand how
people are organized within an organization. The focus of this project proposal is to
use a data-driven approach to visualize communication and collaboration patterns within an
organization and to help the organization restructure in a way that provides a better
environment for collaboration (Bersin, 2016).

The primary data that can be used for this purpose is email, because every organization
maintains a database of the email accounts of all employees, either on its own email
system or on a third-party email server such as Gmail. This database is maintained in a
very structured format and can be accessed by the organization. However, how an
organization accesses such data depends on both statutory requirements on workplace
privacy and the internal policy of the organization regarding the use of official emails
(Workplace privacy). Other forms of data that can be used include data from the instant
messaging service of an organization, its web portal, phone call records, conferences and
meetings, the organization’s social network, etc.

The tools and techniques of Organizational Network Analysis (ONA) are used to describe
and evaluate collaboration network maps (Cross). Every employee is treated as a node in the
network (Figure 2 provides an example of a collaboration network).

Figure 2: Example of Collaboration Network


Building Blocks of a Collaboration Network:

A collaboration pattern can be described in terms of the following building blocks (Hass, 2016):

1. Network Size: Network size is a measure of the number of nodes in a network. In
case of an individual node, the network size is the number of different connections a
particular node has.

Figure 3

2. Network Strength: Network strength is a measure of the strength of relationships. An
employee who has strong relationships with colleagues will have high network
strength. Stronger ties are represented by thicker lines.

Figure 4


3. Network Range: Network range is a measure of the diversity of nodes in a network. For
example: a network of an employee who is connected with employees from different

Figure 5

4. Network Density: Network density describes how densely the nodes in a network are
interconnected. Nodes that share more interconnections among them form a dense
network, which represents a closely related group.

Figure 6


5. Network Centrality: Network centrality is specific to a node and tells us how this
node is located in the given network. A node which is located in the center of a
network has high centrality.

Figure 7
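
As a rough illustration of the five building blocks, the sketch below computes them for a tiny hypothetical network in plain Python. The report describes the measures only in words, so the specific formulas are assumptions chosen for the example: strength as the average tie weight, range as the number of distinct departments among a node's contacts, density as the share of possible ties that exist, and centrality as degree centrality. The names ties, department, node_metrics, and network_density are likewise illustrative.

```python
from collections import defaultdict

# Hypothetical weighted, undirected ties: (person, person) -> emails exchanged.
ties = {
    ("ana", "bo"): 5,
    ("ana", "cy"): 1,
    ("ana", "di"): 2,
    ("bo", "cy"): 4,
}
# Hypothetical department lookup, used only for network range.
department = {"ana": "payroll", "bo": "payroll", "cy": "production", "di": "marketing"}

# Build an adjacency map so each person's contacts can be looked up directly.
neighbours = defaultdict(dict)
for (a, b), weight in ties.items():
    neighbours[a][b] = weight
    neighbours[b][a] = weight


def node_metrics(node):
    """Return size, strength, range, and degree centrality for one node."""
    links = neighbours[node]
    size = len(links)                                    # number of distinct contacts
    strength = sum(links.values()) / size if size else 0.0  # average tie weight
    node_range = len({department[n] for n in links})     # distinct departments reached
    centrality = size / (len(neighbours) - 1)            # degree centrality
    return {"size": size, "strength": strength,
            "range": node_range, "centrality": centrality}


def network_density():
    """Return the share of possible ties that actually exist in the network."""
    n = len(neighbours)
    return 2 * len(ties) / (n * (n - 1))


ana = node_metrics("ana")
density = network_density()
print(ana, round(density, 3))
```

Other definitions, such as betweenness in place of degree centrality, would fit the same structure; the choice depends on what the organization wants the measure to capture.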

Data Characteristics and Data Processing

Data required for this project can be divided into two groups:

1. Primary data: The primary data is the database of emails that would be analyzed to
construct the collaboration network. Most organizations maintain well-structured
records of all emails and retain them for several years. Further, real-time data on
emails exchanged can be used to visualize real-time patterns of collaboration. Daily
visualization reports can provide valuable insights about the different patterns in
collaboration and identify the patterns that work well or achieve better business
results.

2. Secondary data: Data about the formal structure of the organization, the roles
and responsibilities of employees, the relationships among employees, and key result
areas and key performance indicators. This data primarily comes from the human
resource information system and is required to interpret the collaboration networks
for meaningful insights. The relationship between primary and secondary data is
explained later in the Data Analysis section.


Table 1

Data Characteristics:

Volume: The volume of data depends on the size of a company and its business activity.
A large organization with 5000 employees may accumulate a few terabytes of data in a
year. Further, as most organizations retain data for several years, the data volume can
reach hundreds of terabytes. Volume also depends on the contents of emails, which can be
simple text, rich text, or HTML with different types of attachments such as .doc, .xls,
.pdf, or .mp4 files.

Velocity: The velocity of data generated also depends on the size and business activity
of an organization. For instance, an ecommerce company with 100 employees and 40 emails
per employee per day (sent/received) will have 4000 emails to be analyzed in one day.
However, the number of emails exchanged per day can exceed 10000 for a larger company.

Variety: There are different types of emails, such as official email, personal email,
email requesting a meeting or follow-up, email seeking information, email sharing
information, and junk or spam email. Thus, emails need to be filtered, clustered, and
classified based on their relative importance and purpose.

Veracity: Here veracity is a measure of the correctness of email content and of the
receipt/delivery of a specific email to its intended recipients or target audience.
Emails sent accidentally, with incorrect details, or to an unintended recipient should
not be taken into account when creating a network map.

Data Processing:

Pre-processing will require the use of hashing algorithms, text mining, and data wrangling
to make the data suitable for further processing, such as visualization or predictive
analytics. During wrangling, the data will be converted to a structured form, checked for
anomalies and clusters, and classified based on the importance of the contents of emails.
The algorithm will be designed to check for various types of email and give a score to each
email depending on factors such as who the recipient is, why the email was sent, and whether
the sender is asking for or providing information. These criteria are explained in more
detail in the Data Analysis section.
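
One simple way to picture the scoring step is a rule-based scorer like the sketch below. It is only an assumed stand-in for the algorithm described above: the keyword lists, the numeric weights, and the function names classify_purpose and score_email are hypothetical, and a trained classifier could replace the keyword rules.

```python
# Hypothetical keyword rules; both lists are illustrative assumptions.
REQUEST_WORDS = ("please send", "could you", "can you", "?")
SHARE_WORDS = ("attached", "fyi", "please find", "here is")


def classify_purpose(body: str) -> str:
    """Label an email as a request, a share, or other, using keyword rules."""
    text = body.lower()
    if any(word in text for word in REQUEST_WORDS):
        return "request"  # the sender is asking for information
    if any(word in text for word in SHARE_WORDS):
        return "share"    # the sender is providing information
    return "other"


def score_email(body: str, recipient_count: int, is_internal: bool) -> float:
    """Score one email from its purpose and its recipients (assumed weights)."""
    purpose_score = {"request": 0.6, "share": 0.5, "other": 0.2}[classify_purpose(body)]
    # Assumption: a message to a few people signals closer collaboration than a broadcast.
    audience_score = 0.4 if recipient_count <= 3 else 0.1
    score = purpose_score + audience_score
    # Assumption: emails to external recipients count half toward the internal network.
    return round(score if is_internal else score * 0.5, 2)


print(score_email("Could you send the report?", 1, True))
print(score_email("Weekly newsletter", 50, True))
```

The scores could then serve as tie weights when the collaboration network is drawn.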

After the data is extracted from emails, it will be integrated with the data about the
formal structure of the organization in order to draw an organizational collaboration
network diagram. The data about the structure and formal relationships is retrieved from
the human resource information system of the organization. The technique of social network
analysis will then be used to draw social network graphs to study the relationships among
employees, how these relationships are formed, and what the outcomes of such relationships
or collaborations are (Social network analysis). Each employee is treated as a node in the
network having several links. Based on the criteria presented in the Data Analysis section,
supervised machine learning will be used to map the network of each individual and of the
organization as a whole. Supervised machine learning is appropriate in this case because we
already have data about the formal structure, and social network analysis will provide the
organization-wide network. Figure 8 below presents an example of the formal structure of an
organization and its collaboration network. The formal structure is rigid, whereas the
collaboration network is flexible and changes its structure according to the work at hand
(Cross).
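
The integration of email-derived ties with the formal structure might look roughly like the following sketch. The record layout, the sample names, and the function annotate_ties are assumptions for illustration; a real human resource information system would expose this data in its own format.

```python
# Hypothetical extract from an HR system: employee -> (department, direct manager).
hr_records = {
    "ana": ("payroll", "di"),
    "bo": ("payroll", "ana"),
    "cy": ("production", "di"),
    "di": ("management", None),
}
# Hypothetical ties derived from email: (person, person) -> emails exchanged.
email_ties = {("ana", "bo"): 12, ("bo", "cy"): 7, ("ana", "di"): 3}


def annotate_ties(ties, records):
    """Attach formal-structure attributes to each email tie."""
    annotated = []
    for (a, b), count in sorted(ties.items()):
        dept_a, manager_a = records[a]
        dept_b, manager_b = records[b]
        annotated.append({
            "pair": (a, b),
            "emails": count,
            "cross_department": dept_a != dept_b,
            # A tie is treated as formal when one person reports directly to the other.
            "formal": manager_a == b or manager_b == a,
        })
    return annotated


annotated = annotate_ties(email_ties, hr_records)
# Ties that exist in email but not in the reporting line form the informal network.
informal = [tie["pair"] for tie in annotated if not tie["formal"]]
print(informal)
```

Comparing the two kinds of tie in this way is what lets the informal collaboration network be read against the formal organization chart.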

Figure 8 – Example of Formal Structure and Informal Structure (Collaboration Network)

Data Resources and Tools

As mentioned earlier, the primary source of data will be emails, because every
organization has a database of all email accounts. The secondary but equally important
source of data is the organization’s human resource system or business information system,
such as SAP-HCM or SAP-ERP, which provides data about the formal structure of the
organization and the roles and responsibilities of each employee. Other sources of data can
be the instant messaging service of an organization, its web portal, phone call records,
conferences and meetings, the organization’s social network, etc.

Python is a suitable tool for writing supervised machine learning algorithms that extract
data by scanning emails. Python can further be used to integrate data from emails and the
formal structure. For visualization of networks, the most widely used open tools are
(Social network analysis software):


• NetMiner with Python scripting engine
• statnet suite of packages for the R statistical programming language
• igraph, which has packages for R and Python
• muxViz (based on R statistical programming language and GNU Octave) for the
analysis and the visualization of multilayer networks
• the NetworkX library for Python, and the SNAP package for large-scale network
analysis in C++ and Python
• UNISoN (Social Network Analysis Tool)

Data Analysis

As discussed in the Business Model section, the collaboration network can be described using
the following parameters:

1. Network Size (say N1)
2. Network Strength (say N2)
3. Network Range (say N3)
4. Network Density (say N4)
5. Network Centrality (say N5)

But these parameters alone do not give complete information about a given collaboration
network. For instance, consider two employees who have similar networks in terms of all five
parameters. One employee is from payroll and interacts with different employees from all
departments and from all levels of the hierarchy, i.e. from a junior-level executive to the
chief executive officer. The other employee is from the production department, is
responsible for delivering a high-quality product, and also communicates with different
employees from all departments and from all levels of the hierarchy. So, although these two
employees have similar networks, the employee from the production department is the more
valued resource, or a star employee, for the organization. Therefore, a relative weight is
to be assigned to each parameter, and the Network Score for an employee can be defined as a
function of the following:

Total Network Score = f (N1.W1) + f (N2.W2) + f (N3.W3) + f (N4.W4) + f (N5.W5)

where W1, W2, W3, W4, and W5 are the weights given to N1, N2, N3, N4, and N5 respectively,
each taking a value from 0 to 1. The value of a relative weight depends on the job profile,
role, and responsibilities of an employee. For instance, for two employees, one from
marketing and one from research and development (R&D), the relative weights for N1, N2,
N3, and N5 can be the same. But for the employee from R&D, the relative weight for network
density (N4) is expected to be higher, as he/she works closely with a small group of
employees. Thus, he/she has a denser network and must have a higher value for W4.
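
A minimal sketch of the weighted score follows. The report leaves the function f unspecified, so the example assumes f is the identity and that the five parameters have already been normalized to the range 0 to 1; the metric values, the two role profiles, and the name total_network_score are hypothetical.

```python
METRICS = ("size", "strength", "range", "density", "centrality")  # N1 to N5


def total_network_score(metrics, weights, f=lambda x: x):
    """Sum f(N_i * W_i) over the five parameters; f defaults to the identity."""
    for name in METRICS:
        if not 0 <= weights[name] <= 1:
            raise ValueError(f"weight for {name} must lie between 0 and 1")
    return sum(f(metrics[name] * weights[name]) for name in METRICS)


# Hypothetical, already-normalized parameter values (0 to 1) for one employee.
employee = {"size": 0.6, "strength": 0.7, "range": 0.5, "density": 0.8, "centrality": 0.4}

# Hypothetical role profiles: identical except that R&D weights density more heavily.
marketing = {"size": 0.5, "strength": 0.5, "range": 0.5, "density": 0.2, "centrality": 0.5}
research = {**marketing, "density": 0.9}

marketing_score = total_network_score(employee, marketing)
research_score = total_network_score(employee, research)
print(round(marketing_score, 2), round(research_score, 2))
```

The same network therefore scores higher under the R&D profile, which is the effect the relative weights are meant to produce.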


Analytic Levels

The above equation for “Total Network Score” forms the basic criterion for analyzing the
collaboration network of an individual employee and helps the organization perform
different levels of analytics. A high-level visualization tool (such as SAS visualization)
can be created for management to perform the following actions and make data-driven
decisions:

a) Descriptive analytics: Visualizing the collaboration network (see Figures 8 and 9)
can help the organization describe a given situation or see the root cause of a problem
in a project, e.g. to identify an employee who has become a bottleneck for the flow of
information.

b) Predictive analytics: The model will learn from historical data and predict which node
is becoming disconnected from the network and is at risk of exit.

c) Prescriptive analytics: Learning from data on previous steps taken to remove a
bottleneck in the network, the supervised machine learning algorithm can prescribe a
suitable action plan to the manager.
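
To make the predictive level concrete, the sketch below flags employees whose connectivity has dropped sharply. It deliberately uses a simple threshold heuristic rather than the supervised model the proposal envisages; the monthly contact counts, the 50% threshold, the window length, and the function name at_risk are all assumptions for illustration.

```python
def at_risk(degree_history, drop_ratio=0.5, window=2):
    """Flag people whose recent average degree fell below drop_ratio of the earlier average."""
    flagged = []
    for person, degrees in degree_history.items():
        if len(degrees) < 2 * window:
            continue  # not enough history to compare two periods
        earlier_values = degrees[:-window]
        earlier = sum(earlier_values) / len(earlier_values)
        recent = sum(degrees[-window:]) / window
        if earlier > 0 and recent < drop_ratio * earlier:
            flagged.append(person)
    return sorted(flagged)


# Hypothetical monthly counts of distinct contacts per employee.
history = {
    "ana": [10, 11, 9, 10],  # stable
    "bo": [8, 8, 3, 2],      # sharp decline in the last two months
    "cy": [5, 4],            # too little history to judge
}

flagged = at_risk(history)
print(flagged)
```

A trained model could replace the threshold rule once labelled examples of past exits are available, with the same degree histories serving as input features.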

Figure 9 – Levels of Analytics

Benefits to Stakeholders

There are two immediate stakeholders, the employer and the employees, who will directly
benefit from improved collaboration and communication. Other stakeholders, such as
customers and suppliers, will also reap the benefits of the organization’s improved
performance. Using information from collaboration analytics, the employer can make
data-driven decisions about people-related issues, such as promotion to reward and retain a
star employee who holds the network together, facilitates good communication, and achieves
strong results. The employer can achieve a high level of performance by restructuring the
organization to create a strong environment for collaboration. The employer can also
identify and remove employees who create bottlenecks or hold on to critical information
required for a given task and slow down the process. Employees, on the other hand, can
receive feedback on their teamwork and role in the network. Thus, employees can improve
their relationships with other employees and leverage their network for career growth.


Analyzing data from emails involves issues of privacy and confidentiality (Workplace
privacy). If data integrity and privacy are not maintained, the data can be used for
undesired purposes, which can have negative consequences. Another challenge involves
operational issues, such as writing the right algorithm to provide statistically
significant results. Further, the data model needs to be updated and trained continuously
because different people tend to have different collaboration networks but still producing
the same

Works Cited
Bersin, J. (2016, July 1). People Analytics Market Growth: Ten Things You Need to Know. Retrieved
from http://joshbersin.com: http://joshbersin.com/2016/07/people-analytics-market-

Boyer, S. (n.d.). The Importance of Collaboration in the Workplace. Retrieved from
www.nutcache.com: https://www.nutcache.com/blog/the-importance-of-collaboration-in-

Brown, D. (2016). Global Human Capital Trends 2016. Retrieved from www2.deloitte.com:

Collaborative network. (n.d.). Retrieved October 6, 2016, from en.wikipedia.org:


Cross, R. (n.d.). Introduction to Organizational Network Analysis. Retrieved September 2, 2016, from
www.robcross.org: http://www.robcross.org/network_ona.htm

Distelzweig, H. (n.d.). Organizational Structure. Retrieved 2016, from
www.referenceforbusiness.com: http://www.referenceforbusiness.com/management/Ob-

Epner, M. (2014, August 7). www.cnbc.com. Retrieved from



Hass, M. R. (2016). Collaboration. Retrieved from www.coursera.org:


Informal Structure. (2016, May 26). (Boundless) Retrieved Oct 15, 2016, from www.boundless.com:

Jeff Schwartz, U. B. (2016). Global Human Capital Trends 2016. Retrieved from www2.deloitte.com:

Moore, K. (2011, September 15). Collaborative network. Retrieved September 21, 2016, from
www.forbes.com: http://www.forbes.com/sites/karlmoore/2011/09/15/from-social-

Robert L. Cross, S. P. (2007, April). The role of networks in organizational change. Retrieved from
www.mckinsey.com: http://www.mckinsey.com/business-functions/organization/our-

Social network analysis. (n.d.). Retrieved from en.wikipedia.org:


Social network analysis software. (n.d.). Retrieved from en.wikipedia.org:


The Big Data Opportunity for HR and Finance. (2013). Retrieved from hbr.org:

Workplace privacy. (n.d.). Retrieved from www.fairwork.gov.au: https://www.fairwork.gov.au/how-

