Virus-like Nanoparticle
Andrew Favor
A thesis submitted in partial fulfillment of the requirements for the degree of Bachelor
Matthew Francis
Spring 2018
“In all chaos there is a cosmos, in all disorder a secret order."
-Carl Jung
Contents
1 Introduction: 1
2.1 Background: 3
2.2.1 Definitions: 4
2.2.2 Standardization: 5
3.3 Results: 17
3.4 Discussion: 19
4 Conclusion: 23
5 Acknowledgements: 23
References 24
1 Introduction:
Many questions about the physical basis of life still weigh heavy upon our minds, unanswered.
In self-replicating molecular systems we see an interplay of thermodynamic principles that are often
counter to our conception of universal trends – a preference for increased organization and a rever-
sal of entropy within a world otherwise driven towards disorder [1],[2],[3]. In recent years, a variety of
mathematical models have been developed with the goal of characterizing the interplay of various forces
that contribute to the perceived "order" of chemical systems that exhibit propagation of structural iden-
tity [4],[5]. In evolutionary time, the chemical frameworks that we find our lives supported by have dy-
namically moved through the manifold of many physical properties, and when projected upon a plane
bounded by our current observational capabilities, much of the fundamental information regarding the
guiding reasons for such changes is lost or transformed beyond practicality. Within biological systems,
proteins reside at the limit of solubility, which holds the physical consequence that at all times they exist
on the edge of "falling apart", so to speak [6]. In the evolution of such molecules, the new capabilities
that arise from mutations may often come at the cost of stability. Thus, the compounding effect of mu-
tations over time may follow a path bounded by thermodynamic penalties that prevent beneficial new
functionalities from arising. The goal of this study is to apply new analytical and predictive tools in order to establish a model for the evolution of self-assembling systems, which we hope to utilize in developing new protein-based materials.
Directing focus towards the self-assembly process of proteins, we see complex molecular struc-
tures fluctuating within a chaotic multiplex of conformational microstates [7],[8]. The foundation of pro-
tein engineering lies in introducing new mutations for the purpose of modifying protein function. How-
ever, the challenge of predicting the fitness of a given mutation is always a limiting factor when intro-
ducing changes to a given amino acid sequence [9], [10], [11]. Based on the immense number of variant
primary sequences that can result from just two co-expressed mutations, it is virtually impossible to test
the fitness of all possible combinations of mutations through physical synthesis or even computational
modeling [12], which necessitates the development of a means by which to effectively predict the viability
of potential mutant structures for the purpose of extensive protein modification [13],[14].
Virus-like nanoparticles maintain great potential as versatile drug delivery systems due to their
structural stability under physiological conditions, and their ability to encapsulate a wide range of ther-
apeutic agents and biomolecules [15], [16]. This research focuses on reengineering the icosahedral coat-
protein of the MS2 bacteriophage for such applications using a multifaceted approach. Herein we explore
the use of computational chemistry, mutability analyses, and machine-learning algorithms
as predictive tools for determining the resultant stability of imposing a variety of double-mutations on
this protein. Such analyses shed light upon the position-based requirements for the various physical
properties that are needed in order to achieve successful self-assembly to a stable nanoparticle, and the
structural parameters that can be altered in order to modify the functionality of the icosahedral MS2-
bacteriophage coat protein. In effect, we have attained a design blueprint for how the structure of the
MS2 nanoparticle may be altered in order to introduce changes in its physical properties such as size,
cargo affinity, and thermal stability. In order to achieve our goal of both introducing novel functionality and increasing stability, it is imperative to determine which mutations would allow the formation of stable capsids.
It is important to note that due to the high mutability of RNA-viruses, they serve as good sub-
jects for case studies such as this, which peer into the natural order that governs mutability itself, and
how evolution occurs [17], [18]. Much research has sought to explore both human-directed and naturally
occurring trends in virus mutability, as the mechanism of their self-propagation is fundamentally depen-
dent on both highly mutation prone processes, and the need to adapt to rapidly changing environmental
parameters [19]. In addition to the basic knowledge that can be acquired from such analyses, the applications to medical and technological innovation promised by engineering virus-inspired nanoparticles are profound, to say the least. The coat proteins associated with viral anatomy provide a stable scaffolding
for the development of nanomaterials, with a versatile range of new functionality and a remarkable toler-
ance to extensive synthetic modification [20], [21], [22]. As a result, in recent years, these materials have
come to be used as the structural basis for a broad range of technological applications, such as the development of drug delivery vehicles [16], [15], nanoreactors [23], imaging agents [24], [25], and catalysts.
The information gained herein sheds light upon the underlying mechanics of evolution through
the co-mutation of amino acid residues, and fosters an examination of the synergistic effects of their pair-
wise interactions. Each mutation can positively or negatively affect the protein’s overall fitness on its own,
but how a second mutation is impacted by the first one is an issue still shrouded in mystery. The com-
bined effect of any two mutations remains very challenging to predict, as the result is not simply an ad-
dition of the two individual effects. In the structures of multiply-mutated proteins, a whole host of novel
interactions are introduced, rendering the effective change in stability or functionality greater than the sum of the individual effects.
2.1 Background:
In order to quantify the position-based physical preferences along the backbone of MS2, a large
library of mutants was generated for the purpose of providing the fitness data for such mutants. A plasmid
library of all possible single amino acid mutants for the MS2 coat protein was produced through the use of the Golden Gate cloning technique [29]. These plasmids were then used to transform chemically competent E.
coli (DH10B cells), which were allowed to grow on LB agar plates. Following subsequent bacterial growth
in liquid media, coat-protein production was induced via arabinose addition, and the expressed proteins
were purified through sonication and precipitation in ammonium sulfate. The precipitated coat proteins
were then separated based on their ability to form stable capsids through size exclusion chromatography
on an ÄKTA FPLC system. An important characteristic of these proteins (as with many viral capsids) is
their ability to encapsulate the mRNA sequences that encode their own protein [30], [31], [32]. Thus, after
this size selection, the encapsulated RNA was collected from the capsids, and reverse transcribed to DNA,
which was then submitted for Illumina sequencing [16]. In this analysis, the wild-type proteins served as a reference.
Fitness scores for each mutant sequence were generated by taking the base-10 logarithm of the ratio of each sequence count observed in the capsid group over the non-capsid group. The full apparent fitness landscape for single amino acid mutations across MS2's backbone can be seen in Appendix [A1].
From this data, we were able to apply further mathematical analyses in order to quantitatively make sense
of the position-based physical requirements that this protein needed in order to form stable capsids.
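As a concrete sketch of this scoring scheme, the log-ratio calculation can be written in a few lines of NumPy. The counts below are made-up stand-ins for illustration, not values from the actual sequencing run:

```python
import numpy as np

# Hypothetical read counts for three mutant sequences (stand-in values only).
capsid_counts = np.array([1500.0, 30.0, 800.0])      # counts in the capsid-forming pool
non_capsid_counts = np.array([100.0, 900.0, 750.0])  # counts in the non-capsid pool

# Apparent fitness: base-10 logarithm of the capsid / non-capsid count ratio.
fitness = np.log10(capsid_counts / non_capsid_counts)
```

A mutant enriched in the capsid pool scores positive; one depleted from it scores negative.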
From a range of online literature, indices pertaining to the physical properties of each canonical amino acid were compiled (Table 1).
Table 1: Physical property indices for the 20 canonical amino acids, serving as a basis set for analysis of the position-based physical requirements for effective protein folding and capsid stability [33], [34], [35].
2.2.1 Definitions:
$\varepsilon$: one of the 10 physical property indices used in our analysis (volume, molecular weight, length, sterics, polarity, polar area, fraction water, hydrophobicity, non-polar area, flexibility)

$R_p$: a vector containing the fitness scores of each amino acid for a given position, $p$:
$$ R_p = \begin{bmatrix} f_{A,p} & f_{S,p} & f_{T,p} & \cdots & f_{G,p} & f_{P,p} \end{bmatrix}_{1 \times 20} $$

$\xi_\varepsilon$: a vector containing the physical property indices, $\varphi_{a,\varepsilon}$, corresponding to a given property, $\varepsilon$, and amino acid, $a$:
$$ \xi_\varepsilon = \begin{bmatrix} \varphi_{A,\varepsilon} & \varphi_{S,\varepsilon} & \varphi_{T,\varepsilon} & \cdots & \varphi_{G,\varepsilon} & \varphi_{P,\varepsilon} \end{bmatrix}_{1 \times 20} $$

$\mu(R_p)$: the mean value of the fitness scores for a given position

$\sigma(R_p)$: the standard deviation of the fitness scores for a given position

$\mu(\xi_\varepsilon)$: the mean value of the amino acid indices for a given physical property

$\sigma(\xi_\varepsilon)$: the standard deviation of the amino acid indices for a given physical property
2.2.2 Standardization:
We proceed to produce standardized fitness scores, $\tilde{f}_{a,p}$, by taking the difference from a given position's mean fitness score and dividing by the associated standard deviation:
$$ \tilde{f}_{a,p} = \frac{f_{a,p} - \mu(R_p)}{\sigma(R_p)} $$
$$ F = \begin{bmatrix}
\frac{f_{A,1}-\mu_1}{\sigma_1} & \frac{f_{S,1}-\mu_1}{\sigma_1} & \cdots & \frac{f_{P,1}-\mu_1}{\sigma_1} \\
\frac{f_{A,2}-\mu_2}{\sigma_2} & \frac{f_{S,2}-\mu_2}{\sigma_2} & \cdots & \frac{f_{P,2}-\mu_2}{\sigma_2} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{f_{A,128}-\mu_{128}}{\sigma_{128}} & \frac{f_{S,128}-\mu_{128}}{\sigma_{128}} & \cdots & \frac{f_{P,128}-\mu_{128}}{\sigma_{128}} \\
\frac{f_{A,129}-\mu_{129}}{\sigma_{129}} & \frac{f_{S,129}-\mu_{129}}{\sigma_{129}} & \cdots & \frac{f_{P,129}-\mu_{129}}{\sigma_{129}}
\end{bmatrix}_{129 \times 20}
= \begin{bmatrix}
\tilde{f}_{A,1} & \tilde{f}_{S,1} & \cdots & \tilde{f}_{G,1} & \tilde{f}_{P,1} \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
\tilde{f}_{A,129} & \tilde{f}_{S,129} & \cdots & \tilde{f}_{G,129} & \tilde{f}_{P,129}
\end{bmatrix}_{129 \times 20} $$
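The row-wise standardization that produces $F$ reduces to a single broadcasted NumPy expression. The fitness matrix here is a random stand-in for the real 129 × 20 landscape:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in fitness matrix: 129 backbone positions x 20 amino acids.
R = rng.normal(size=(129, 20))

# Standardize each row (position) to zero mean and unit variance, giving F.
F = (R - R.mean(axis=1, keepdims=True)) / R.std(axis=1, keepdims=True)
```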
Similarly, we produced standardized property indices, $[\varphi_{a,\varepsilon}]_{scaled,0}$, by taking the difference from a given property's mean index value, and dividing by the associated standard deviation:
$$ [\varphi_{a,\varepsilon}]_{scaled,0} = \frac{\varphi_{a,\varepsilon} - \mu(\xi_\varepsilon)}{\sigma(\xi_\varepsilon)} $$
Next, we subtracted the minimum value of $[\varphi_{a,\varepsilon}]_{scaled,0}$ for a given property, setting the minimum value to zero:
$$ [\varphi_{a,\varepsilon}]_{scaled,1} = [\varphi_{a,\varepsilon}]_{scaled,0} - \min\left\{ [\varphi_{A,\varepsilon}]_{scaled,0},\, [\varphi_{S,\varepsilon}]_{scaled,0},\, \cdots,\, [\varphi_{P,\varepsilon}]_{scaled,0} \right\} $$
Finally, we divide each value of $[\varphi_{a,\varepsilon}]_{scaled,1}$ for a given property by the maximum value of its associated set, thus setting the maximum value to 1, and producing a set of standardized indices, $\tilde{\varphi}_{a,\varepsilon}$, fit between 0 and 1:
$$ \tilde{\varphi}_{a,\varepsilon} = \frac{[\varphi_{a,\varepsilon}]_{scaled,1}}{\max\left\{ [\varphi_{A,\varepsilon}]_{scaled,1},\, [\varphi_{S,\varepsilon}]_{scaled,1},\, \cdots,\, [\varphi_{P,\varepsilon}]_{scaled,1} \right\}} $$
$$ \Phi = \begin{bmatrix}
\tilde{\varphi}_{A,volume} & \tilde{\varphi}_{S,volume} & \cdots & \tilde{\varphi}_{G,volume} & \tilde{\varphi}_{P,volume} \\
\tilde{\varphi}_{A,weight} & \tilde{\varphi}_{S,weight} & \cdots & \tilde{\varphi}_{G,weight} & \tilde{\varphi}_{P,weight} \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
\tilde{\varphi}_{A,flexibility} & \tilde{\varphi}_{S,flexibility} & \cdots & \tilde{\varphi}_{G,flexibility} & \tilde{\varphi}_{P,flexibility}
\end{bmatrix}_{10 \times 20} $$
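The three scaling steps applied to the property indices reduce to a few vectorized operations. The index table here is a random stand-in for the real values in Table 1:

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-in property indices: 10 physical properties x 20 amino acids.
xi = rng.normal(loc=5.0, scale=2.0, size=(10, 20))

# Step 0: z-score each property across the 20 amino acids.
scaled0 = (xi - xi.mean(axis=1, keepdims=True)) / xi.std(axis=1, keepdims=True)
# Step 1: shift so the minimum of each property is 0.
scaled1 = scaled0 - scaled0.min(axis=1, keepdims=True)
# Step 2: divide by the maximum, so each property spans [0, 1], giving Phi.
Phi = scaled1 / scaled1.max(axis=1, keepdims=True)
```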
We can produce an array, $\Psi \in \mathbb{R}^{129 \times 10}$, with entries representing each position's preference for each physical property:
$$ \Psi = F \cdot \Phi^{T} $$
With individual entries, $\psi_{p,\varepsilon}$, corresponding to a summation of the following product over all 20 canonical amino acids, where $\varepsilon$ corresponds to a given physical property, and $p$ corresponds to a position in the protein backbone:
$$ \psi_{p,\varepsilon} = \sum_{i=1}^{20} \tilde{f}_{a_i,p} \cdot \tilde{\varphi}_{a_i,\varepsilon} $$
Such that:
$$ \Psi = \begin{bmatrix}
\left( \sum_{i=1}^{20} \tilde{f}_{a_i,1}\, \tilde{\varphi}_{a_i,vol.} \right) & \cdots & \left( \sum_{i=1}^{20} \tilde{f}_{a_i,1}\, \tilde{\varphi}_{a_i,flex.} \right) \\
\vdots & \ddots & \vdots \\
\left( \sum_{i=1}^{20} \tilde{f}_{a_i,129}\, \tilde{\varphi}_{a_i,vol.} \right) & \cdots & \left( \sum_{i=1}^{20} \tilde{f}_{a_i,129}\, \tilde{\varphi}_{a_i,flex.} \right)
\end{bmatrix}_{129 \times 10} $$
$$ = \begin{bmatrix}
\left( \tilde{f}_{A,1}\, \tilde{\varphi}_{A,vol.} + \cdots + \tilde{f}_{P,1}\, \tilde{\varphi}_{P,vol.} \right) & \cdots & \left( \tilde{f}_{A,1}\, \tilde{\varphi}_{A,flex.} + \cdots + \tilde{f}_{P,1}\, \tilde{\varphi}_{P,flex.} \right) \\
\vdots & \ddots & \vdots \\
\left( \tilde{f}_{A,129}\, \tilde{\varphi}_{A,vol.} + \cdots + \tilde{f}_{P,129}\, \tilde{\varphi}_{P,vol.} \right) & \cdots & \left( \tilde{f}_{A,129}\, \tilde{\varphi}_{A,flex.} + \cdots + \tilde{f}_{P,129}\, \tilde{\varphi}_{P,flex.} \right)
\end{bmatrix}_{129 \times 10} $$
$$ = \begin{bmatrix}
\psi_{1,volume} & \psi_{1,weight} & \cdots & \psi_{1,n.p.\,area} & \psi_{1,flexibility} \\
\psi_{2,volume} & \psi_{2,weight} & \cdots & \psi_{2,n.p.\,area} & \psi_{2,flexibility} \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
\psi_{129,volume} & \psi_{129,weight} & \cdots & \psi_{129,n.p.\,area} & \psi_{129,flexibility}
\end{bmatrix}_{129 \times 10} $$
Thus, we have obtained an array containing information about position-wise preferences for various physical properties, wherein rows index to the positions in the protein backbone, and columns index to the physical properties.
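Given standardized arrays of the shapes defined above, the preference array reduces to a single matrix product. The inputs here are random stand-ins rather than the real data:

```python
import numpy as np

rng = np.random.default_rng(2)
F = rng.normal(size=(129, 20))   # standardized fitness scores (stand-in values)
Phi = rng.random(size=(10, 20))  # standardized property indices (stand-in values)

# Psi[p, e] sums F[p, a] * Phi[e, a] over the 20 amino acids a.
Psi = F @ Phi.T
```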
2.3 Results and implications:
Figure 1: The effect of various physical properties on the apparent fitness of the MS2 capsid protein[16].
The results of this analysis provide information about favored residue types with respect to
their sequential position in MS2’s backbone, and their spatial arrangement in the 3D protein structure
overall. From this, we have a blueprint that allows us to identify what types of mutations are well tolerated within a given region of the capsid structure, and thereby provides a guiding range of parameters for further analysis.
As it stands, the protein folding problem remains unsolved, with much progress yet to be made [41], [42], [43], [44]. The goal of defining a relationship between peptide sequence and favored 3D structure has remained at the center of many scientific fields for decades; it is a multifaceted physics problem, with many hypotheses having been formed to predict how proteins find their minimum energetic state [45], [46], [47]. The physical properties of constituent amino acids have certainly been observed to be a primary factor driving the conformational dynamics of this process, largely due to phenomena such as the hydrophobic effect [48], salt-bridge formation [49], and hydrogen bonding [50], [51]. However, an exhaustive search through conformational space remains a barrier only surpassable through complex cooperative behaviors between the protein's molecular substructures [52].
These aforementioned phenomena reflect the hidden dimension of importance that epistatic effects have, not only on the energetic stability of a protein's folded state, but also on its ability to reach that
state. However, for the purposes of protein engineering, perhaps the full dynamical trajectory associated
with a given structural alteration need not be known in order to reach our end goal. While our previous
analysis provided information regarding the physical characteristics that contributed to protein fitness
with respect to the effects of single mutations [16], such modifications are barely scratching the surface
when it comes to truly reengineering a protein. This led us to look towards double mutants for our next
analytical pursuit. While the introduction of single amino acid mutations provides information about
the tolerance of that residue with respect to the wild-type protein overall, the mystery of epistasis still remains, leaving us to consider the introduction of new physical interactions, favorable or disfavorable, that arise when multiple mutations are expressed together.
We define epistasis herein as the difference in the effect that multiple mutations have when
expressed together, from the additive sum of the individual mutations [53]. Due to the cooperative nature
of interactions between neighboring amino acids in a protein, it is already hard enough to make sense of
the effect that a single amino acid mutation has on the protein’s overall ability to fold and retain a stable
and functional state. The introduction of multiple mutations increases the complexity of this problem
exponentially, and can effectively result in novel interactions between amino acids, not observed in the
wild type.
Neural nets and related machine learning algorithms have gained much acclaim in recent years due to the efficacy of their application as tools for predictive data analysis, classification, and disease diagnostics [54], [55], [56], owing to their ability to find patterns in large data sets that may be elusive from the point of view of conventional analysis. Thus, it stands to reason that neural nets promise a means by
which to project relevant information away from a space with a temporal dimension, and replace it with
the dimensions of quantized physical descriptors. In the field of biological engineering, neural networks
have been used as predictive tools to guide the design of mutant sequences to enhance functionality
[57], [58]. In the age of high-throughput sequencing, the ability to generate, express, and analyze a large number of mutants has become a remarkably feasible task, which allows for the efficient acquisition of large bodies of data – a fundamental requisite for effective use in neural networks, due to their "data-hungry" nature.
For our examination of the efficacy of using such tools for the purposes of protein engineering
and directed evolution, a basic objective is to develop a model that can predict the effects of epistasis
of multiple co-expressed mutations with better accuracy than the predictions provided by the additive
changes from multiple single-mutant fitness scores. There have been many approaches used to quantify a representative metric for epistasis – a way to quantify the effect of interactions between multiple
mutations, which produce results that differ from the sum of their parts. Following the approach of previous work conducted in the evaluation of epistatic phenomena [58], [53], our analysis employs the following mathematical description of an epistasis metric describing the double mutation imposed on residues $i$ and $j$:
$$ \epsilon = \Delta f_{ij} - (\Delta f_i + \Delta f_j), \quad \text{where } \Delta f_x = f_x - f_{WT} $$
Quantification of the epistasis exhibited by double mutants throughout MS2's FG-loop can be seen in Appendix [A2]. The positions where either favorable or disfavorable epistasis is most pronounced give us an idea of where to direct focus in our analysis of synergistic interactions between nearby residues.
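A minimal helper implementing this metric; the argument names are ours, and the fitness values would come from the measured landscape:

```python
def epistasis(f_ij, f_i, f_j, f_wt):
    """Epistasis metric: deviation of the double-mutant fitness change from
    the sum of the two single-mutant fitness changes (all relative to wild type)."""
    delta_ij = f_ij - f_wt
    delta_i = f_i - f_wt
    delta_j = f_j - f_wt
    return delta_ij - (delta_i + delta_j)
```

A value near zero means the two mutations act additively; positive or negative values indicate favorable or disfavorable epistasis, respectively.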
We trained a neural net by feeding it two sources of data to map together: information re-
garding the quantified physical properties exhibited at each position in the peptide backbone of a given
mutant, and the associated fitness score of that mutant. In analogy to using a neural net to perform im-
age classification, we trained the neural net to look at the physical properties along MS2’s backbone in
order to perform capsid-formation classification. The 12 physical properties used in our input data were
treated mathematically as a range of discrete "colors" (more formally, channels), and the positions in the
backbone were treated as order-correlated pixel values. All development of our neural network model
was done in Python, using the TensorFlow library [62]. Additionally, the NumPy and Matplotlib libraries were used for mathematical processing of data and figure generation, respectively.
The input and output data for our functional neural network model, as well as the standardization applied to them, are described below:
Input, $m_i$ (data reflecting the physical state of a given mutant): a 129 × 12 array, with rows corresponding to positions in a mutant's peptide backbone, and columns corresponding to the physical properties of the amino acid there (the same as used in our previous discussion), standardized as follows:
• Subtract the mean for each property (to center data around zero)
• Add the minimum value for each property group (to set the minimum value to 0)
• Divide by the maximum value for each property group (to set the maximum value to 1)
For each mutant, $m_i$, in our mutation data set, we generated matrices containing information about the physical properties at each position in its backbone, based on what amino acid was present. Mutants with fitness scores less than or equal to −4 were removed, and missing data
points were excluded entirely. In the input matrices for each mutant data point, rows correspond to the sequence position (ordered from N-terminus to C-terminus), and columns correspond to the normalized physical property indices described above, but using 12 physical properties (positive charge, negative charge, volume, molecular weight, length, sterics, polarity, polar area, fraction water, hydrophobicity, non-polar area, flexibility) instead of the 10 properties used in our earlier analysis.
Output, $f_i$: fitness data reflecting a given mutant's ability to form stable capsids.
The data set used to "train" the network (develop optimal weight values for producing best fit predictions)
consisted of the physical property matrices and corresponding fitness scores of the single amino acid mutants found in previous experiments [16]. The data set that was used to validate the predictive abilities of our network consisted of the same data types, but with double mutants generated in an epistatic landscape scan of MS2's FG-loop.
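The construction of an input matrix from a peptide sequence can be sketched as follows. The property table here holds random placeholder vectors rather than the real standardized indices from Table 1, and the 129-residue sequence is an arbitrary stand-in:

```python
import numpy as np

rng = np.random.default_rng(3)
amino_acids = list("ACDEFGHIKLMNPQRSTVWY")
# Placeholder table: each amino acid maps to 12 property values in [0, 1].
prop_table = {aa: rng.random(12) for aa in amino_acids}

def encode(sequence):
    """Map a peptide sequence to a (length, 12) array of property indices."""
    return np.stack([prop_table[aa] for aa in sequence])

m = encode("ACDEF" * 25 + "GHIK")  # a stand-in 129-residue sequence
```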
3.2.1 Design of the neural network function:
Algorithm 1: Functional model for the convolutional neural net predictive function
Input: $m_i$, a second-order tensor ∈ R^{129×12}.
Output: $f_i$, the predicted fitness score associated with mutant $i$, a scalar value.
1 CNN model ($m_i$);
2 Conv1 ← $m_i$: First convolutional layer; ∈ R^{129×12}, containing 12 filters and a kernel size of 5;
3 Pool1 ← Conv1(•): First pooling layer, with a pool size of 2;
4 Conv2 ← Pool1(•): Second convolutional layer; ∈ R^{32×4}, containing 12 filters and a kernel size of 5;
5 Pool2 ← Conv2(•): Second pooling layer, with a pool size of 2;
6 Pool2Flat ← Pool2(•): Reshape Pool2 to a vector; ∈ R^{128};
7 Dense1 ← Pool2Flat(•): First fully-connected layer; ∈ R^{128}, with rectified-linear activation function;
8 Dense2 ← Dense1(•): Second fully-connected layer; ∈ R^{32}, with rectified-linear activation function;
9 Output ← Dense2(•): Final linear layer, producing a scalar fitness prediction;
10 return $f_i$
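The data flow of Algorithm 1 can be illustrated with a plain-NumPy sketch. The weights are random stand-ins rather than trained values, and "valid" convolution windows are assumed (the padding scheme is not fully specified, so the intermediate shapes here differ from the dimensions quoted in the algorithm); only the convolve-pool-flatten structure is the point:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def conv1d(x, kernels):
    """Valid 1D convolution: x is (length, channels), kernels is
    (n_filters, kernel_size, channels); returns (length - k + 1, n_filters)."""
    n_filters, k, _ = kernels.shape
    out = np.empty((x.shape[0] - k + 1, n_filters))
    for i in range(out.shape[0]):
        window = x[i:i + k]  # a k-residue fragment of the backbone
        out[i] = np.tensordot(kernels, window, axes=([1, 2], [0, 1]))
    return relu(out)

def max_pool(x, size=2):
    """Max-pool along the sequence axis in non-overlapping windows."""
    L = (x.shape[0] // size) * size
    return x[:L].reshape(-1, size, x.shape[1]).max(axis=1)

rng = np.random.default_rng(4)
m = rng.random((129, 12))                              # input: 129 positions x 12 properties
h = max_pool(conv1d(m, rng.normal(size=(12, 5, 12))))  # first convolution + pooling
h = max_pool(conv1d(h, rng.normal(size=(12, 5, 12))))  # second convolution + pooling
flat = h.reshape(-1)                                   # flattened vector for the dense layers
```

In the trained model, the analogous flattened vector has 128 entries, which feed the fully-connected layers.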
Figure 2: Schematic of neural net functional model, with input m i , and output f i .
The design of our neural network model was composed of two sequential convolutional layers
(each followed by a pooling operation), ultimately feeding into two fully-connected layers, which output
a scalar fitness score prediction. The property-columns in our input matrices are separately fed into the
function, to be processed individually, and combined later on. The first convolutional layer takes small
fragments of the input vector (of length defined as "kernel size" = 5), corresponding to 5-residue long
sequences of amino acids in MS2's backbone – a size which was chosen to represent small units of sequence that can exhibit characteristic patterns in their physical identities. Each of these is then passed through 12 "filters" (more formally, algebraic transformations), which possess weights that are iteratively
adjusted in order to only let through data from fragment sequences that contain mathematical features
that provide useful information for the overall numerical flow. After each pass through the filters, a "pooling" operation reduces the size of the data passed along by taking the maximum value of each 2-unit-long subdivision of the filter outputs, and consolidating them to be fed into the next layer. Due to the
input of the second convolutional layer having undergone an extensive mathematical transformation by
this point, it holds an abstract relation to the physical significance of the input data, but the principles of
convolutional data processing performed are the same as seen in the first layer.
The pooled data that comes out of the second convolutional layer is then fed into the first "fully
connected" layer. This means that all remaining pieces of data (which happen to be 128 distinct values
now rather than the 129 × 12 values that we started with, due to the dimension reductions performed in
the pooling steps) are combined into a single vector and are together subjected to the same mathemati-
cal processing from this point on. This "fully connected" processing consists of combining each element
in the "fully connected" vector through a linear combination, and applying a rectified-linear function,
ReLU (x) = max(0, x), to the resulting sum. This is done 128 times, to produce 128 new distinct data
points, each using the same input, but being transformed by different weights. These 128 values then un-
dergo the same processing operation of linear combination followed by rectified-linear transformation,
but only 32 times, to produce 32 new values. Finally, these 32 values are passed through a regular linear
combination (this time not transformed by a function) to produce a single value that should, in theory,
match the fitness score associated with a given mutant. Thus, the function of the architectural regions can be summarized as serving distinct mathematical roles in a data-processing procedure that maps the input matrix to a predicted fitness score. The convolutional layers serve to reduce data by filtering through significant motif patterns, to produce a vector of 128 values, $x$.
The fully connected layers apply the following function to the vector provided by the convolutional layers
in order to find a general interactive-relationship between the vector values and a fitness score:
$$ f_i(x) = \sum_{\alpha=1}^{32} a_\alpha \, \mathrm{ReLU}\!\left( \sum_{\beta=1}^{128} b_{\alpha,\beta} \, \mathrm{ReLU}\!\left( \sum_{\gamma=1}^{128} c_{\beta,\gamma} \, x_\gamma \right) \right) $$
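Under the same notation, a stand-in forward pass through the fully-connected layers (with random weights $a$, $b$, $c$ rather than trained values) looks like:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

rng = np.random.default_rng(5)
x = rng.random(128)                # flattened vector from the convolutional layers
c = rng.normal(size=(128, 128))    # first dense layer weights (stand-ins)
b = rng.normal(size=(32, 128))     # second dense layer weights (stand-ins)
a = rng.normal(size=32)            # output layer weights (stand-ins)

h1 = relu(c @ x)   # 128 ReLU units
h2 = relu(b @ h1)  # 32 ReLU units
f_i = a @ h2       # scalar fitness prediction
```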
The algorithm then iterates through the single mutant data, and readjusts the weights within the functional model in a manner that best minimizes error between the function output and the actual fitness score for a given mutant, with the ultimate goal of finding weight values that provide a "best-fit" for all of the mutant data being trained on, as well as potential new data that the developed function may encounter.
3.3 Results:
From computing the predicted fitness scores for MS2’s FG-loop epistatic landscape (all of which
can be found in Appendix [A3]), we see the neural network predictions matching the experimental fitness
trends for many of the position combinations. However, some deviations from predictive accuracy do
stand out.
Figure 3: Comparison of the actual fitness scores and predicted fitness scores for double mutations between selected pairs of positions (panels a, b, c).
For example, as seen in Figure 3, mutations of positions 71 and 72, and positions 71 and 73 exhibit sim-
ilar patterns of positive fitness. However, co-mutation of position 71 with position 74 deviates from this
pattern, with overall low tolerance to mutation. This is somewhat surprising, because even in our single-
mutant analysis, positions 72 and 73 are highly mutable, while 74 is not. Having trained our model on
the single-mutant data set, one might expect position 74’s intolerance to mutation to carry over to the
predicted values in an almost additive manner, yet our data suggest that the predictive character of our
model is more heavily influenced by position 71’s single mutant data than position 74. Overall, the pre-
dicted fitness scores for co-mutating position 71 with position 74 heavily resemble those of 72 and 73,
suggesting that our model may be subject to bias towards the assumed recurrence of mutability trends. A
possible explanation for this is that as we feed our input data through the convolutional layers, the protein
backbone is divided into increments of 5 amino acids. In the single mutant data set, positions 70 through
73 are all highly mutable, whereas position 74 is not. If these five positions influence the neural net model
through the same input kernel, it makes sense that the influence of position 74’s general intolerance to
mutation is dwarfed by the high mutability of the other positions within the kernel. Nonetheless, the model does a remarkable job of matching mutability patterns overall.
Following our computational experiments, we were able to calculate the error between the actual fitness scores of all FG mutants sampled and the values predicted (through either neural network inference or additive single-mutant calculation):
$$ E_{predicted} = \frac{1}{N} \sum_{i=1}^{N} \sqrt{\left( f_{i,actual} - f_{i,predicted} \right)^{2}} $$
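This error measure, and one plausible construction of the accompanying confidence interval (the interval method is not stated in the text, so a normal approximation on the mean per-mutant error is assumed here), can be sketched as:

```python
import numpy as np

def prediction_error(actual, predicted):
    """Average of sqrt((f_actual - f_predicted)^2) over all N mutants,
    i.e. the mean absolute prediction error."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.mean(np.sqrt((actual - predicted) ** 2))

def ci95_halfwidth(actual, predicted):
    """Half-width of an approximate 95% confidence interval on the mean
    per-mutant error (normal approximation; an assumption on our part)."""
    errs = np.abs(np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float))
    return 1.96 * errs.std(ddof=1) / np.sqrt(errs.size)
```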
The average error for the convolutional neural net predictions was found to be $E_{CNN} = 5.89 \times 10^{-1}$, with a 95% confidence interval of $\mu(E_{CNN}) = 5.89 \times 10^{-1} \pm 1.30 \times 10^{-2}$. Likewise, the average error for the additive fitness score calculations was found to be $E_{additive} = 6.81 \times 10^{-1}$, with a 95% confidence interval of $\mu(E_{additive}) = 6.81 \times 10^{-1} \pm 1.99 \times 10^{-2}$. These values indicate that the convolutional neural network predictions yielded less error than the additive fitness scores, with a statistically significant difference.
3.4 Discussion:
The development of such an algorithm holds great promise for protein engineering. To use a
predictive tool such as this, one would simply need to generate a library of mutants for each position,
and have a characteristic observable quantity to test the mutants on, which could be the ability to fold, the introduction of a novel functionality to the protein's activity, or, as we did in this study, the ability to form stable capsids.
During the process of training our neural network model, the rate of decrease in validation-set error is directly proportional to the rate of decrease in training-set error. This indicates that in "learning" how to predict single-mutant fitness based on the physical property data provided, our model is also picking up on the sequence-structure relationship needed to predict the effect of double mutants, which supports the idea that neural networks hold promise as effective tools for determining fitness for applications such as this.
Figure 4: The error in predicting the fitness scores of the training set (single mutants) and the validation set (FG-epi mutants) decreases at a proportional rate between the two sets with respect to training iterations (1 epoch = 1 round of iteration through the entire training set).
Results of particular interest include mutants with pronounced epistasis that counters the expectations produced by single mutants: that is to say, double mutants with fitness values that are significantly different from the predictions of the two constituent single-mutant changes in fitness. For instance, in co-expressing mutations at residues 71 and 76, we see that when a negatively charged amino acid is present at position 71, and residue 76 (natively negatively charged) undergoes charge inversion to a positively charged amino acid, like lysine or arginine, the result is a positive fitness value. This goes against the additive fitness expectations, owing to complementary interactions between two residues that would not yield a favorable fitness state alone.
Figure 5: Experimental fitness data, neural network predictions, and additive predictions (top to bottom)
for double mutants of positions: [a,b,c] 72 (left) and 74 (bottom), [d,e,f ] 71 (left) and 76 (bottom), [g,h,i]
Despite being far from perfect overall, our model does successfully predict the epistatic results for a significant number of double mutants. A common theme amongst tolerated double mutations is the pairing of positive and negative charges, either of which might be disfavorable on its own. The introduction of new structural features, like salt bridges, that may confer improvements in capsid stability is an example of the types of interactions that it would be desirable to introduce into our protein through directed evolution.
While neural networks are useful tools for learning about and interacting with complex systems characterized by mathematical patterns that often evade human definition, the proper development of their models requires the provision of extremely large data sets. Tuning their many weights in a manner that can best process and interpret the vast range of possible input data that may be encountered remains a challenge, even for our less demanding application of these tools.
To develop any function of this sort such that it generalizes to the novel situations that arise from the interplay between units in an amino acid chain requires an appropriately large amount of training data, such that the training procedure might "cover ground" wide enough to represent most possible results that one might encounter from moderate variation of peptide identity. Thus, despite the seemingly wide-spanning data set of our single-mutant fitness landscape, by machine-learning standards, this set remains small.
To overcome the large data set requirement for effectively training neural networks, a tech-
nique of growing popularity is "transfer learning". In this technique, a larger data set is used to initially
train the first few layers of the model to recognize general motifs exhibited both by a small function-
specific data set, such as our fitness data, and by a larger data source, like an online protein data
bank[61],[63],[64]. After this initial motif recognition is achieved, the resulting processing function is
set up to feed into a model specifically designed to develop predictive efficacy for the task at hand.
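A minimal sketch of the head-only training step in transfer learning follows, under stated assumptions: the early layers are taken as already pretrained elsewhere and held fixed, here stood in for by fixed random weights, and the small task-specific data set carries synthetic fitness labels. Only the task-specific head is fit.

```python
import numpy as np

# Sketch of transfer learning: a frozen "pretrained" feature layer
# (fixed random weights standing in for layers trained on a large
# corpus) feeds a small head fit on limited task-specific data.

rng = np.random.default_rng(0)
n_in, n_feat = 100, 16

# Frozen early layer -- in practice, weights learned on a large data
# source and then held fixed during task-specific training.
W_frozen = rng.normal(size=(n_in, n_feat))

def features(x):
    return np.tanh(x @ W_frozen)  # frozen nonlinear feature map

# Small task-specific data set (synthetic labels for illustration).
X_small = rng.normal(size=(40, n_in))
y_small = rng.normal(size=40)

# Fit only the head: ridge regression on the frozen features.
Phi = features(X_small)
lam = 1e-2
head = np.linalg.solve(Phi.T @ Phi + lam * np.eye(n_feat),
                       Phi.T @ y_small)

def predict(x):
    return features(x) @ head

print(predict(X_small[:3]))  # predictions from the fitted head
```

Because only the sixteen head weights are fit, forty labeled examples suffice; the frozen layer supplies the general-purpose features that the small data set could never support learning on its own.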
Future directions will utilize the transfer-learning technique to train the first few layers of a
neural network into a comprehensive processing system that recognizes the ways in which the sequence
identities of neighboring amino acids interact to form local motifs and secondary structural elements[65].
However, the specific arrangements and interactions of secondary structure motifs that contribute to the
unique tertiary and quaternary structures of a protein or family of proteins remain defined by a large body
of case-specific information. Making sense of these arrangements is where a data set like a fitness land-
scape for a given protein truly plays a necessary role.
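The local-motif input representation used by early neural secondary-structure predictors of the kind cited above [65] can be sketched as a sliding one-hot window over neighboring residues. The window half-width and the example sequence below are illustrative choices, not values from this work.

```python
# Sliding-window one-hot encoding of the kind used in early neural
# secondary-structure predictors: each example is a window of
# neighboring residues centered on the position of interest.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def encode_window(seq, center, half_width=3):
    """One-hot encode a (2*half_width + 1)-residue window; positions
    falling off either end of the sequence stay all-zero."""
    width = 2 * half_width + 1
    vec = [0.0] * (width * len(AMINO_ACIDS))
    for offset in range(-half_width, half_width + 1):
        pos = center + offset
        if 0 <= pos < len(seq):
            slot = offset + half_width
            vec[slot * len(AMINO_ACIDS) + AA_INDEX[seq[pos]]] = 1.0
    return vec

window = encode_window("MASNFTQFVL", center=4)
print(len(window), sum(window))  # 140 floats, 7 residues set
```

Feeding windows like this one to the first layers lets the network learn local sequence-to-motif rules that transfer across proteins, while protein-specific fitness data remains necessary for the global tertiary and quaternary arrangements.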
4 Conclusion:
Unraveling the rules governing protein self-assembly and epistasis is a profound and hum-
bling endeavor. Such an undertaking requires extensive computational tools to process the associated
data, which is too vast and intricate to analyze by hand. In the pursuit of extensive protein modification,
we encounter complex dynamical trajectories and the birth and extinction of new forms and functional-
ity, all guided by interactions in a vast system of intricate interdependencies. To harness these phenom-
ena in a marriage between human innovation and the evolved forms that uphold life itself is a major step
in scientific progress, and it requires novel tools and techniques to meet its unique needs. To understand
and utilize the intricate weavings of the fabric of our reality would, in essence, truly be for us to reach an
intimate communion with the very forces that shape it.
5 Acknowledgements:
My research experience in the Francis group has undoubtedly been the most transformative
chapter of my undergraduate education. In my time there, I learned much about experimental design
and methodology, the raw excitement of discovery, and perhaps most importantly, how to acknowledge
failures within a project, accept them, learn from them, and carry on in an appropriately adjusted di-
rection. For the guidance that I received throughout my time there, I would like to thank my friend and
mentor, Marco Lobba, for exposing me to the many facets of life as a researcher, and also for helping me
navigate through this pivotal point in my scientific career. I would also like to thank Emily Hartman for
her work in generating all the experimental data used in my research analyses, and I would like to thank
Professor Matthew Francis for fostering such a fun and engaging scientific environment.
References
[1] Eric D Schneider and James J Kay. “Order from disorder: the thermodynamics of complexity in
biology”. In: What is life? The next fifty years: Speculations on the future of biology (1995), pp. 161–
172.
[2] F Eugene Yates. Self-organizing systems: The emergence of order. Springer Science & Business Media,
2012.
[3] W Ross Ashby. “Requisite variety and its implications for the control of complex systems”. In: Facets
[4] Jeremy L England. “Statistical physics of self-replication”. In: The Journal of chemical physics 139.12
(2013), 09B623_1.
[5] David Andrieux and Pierre Gaspard. “Nonequilibrium generation of information in copolymeriza-
tion processes”. In: Proceedings of the National Academy of Sciences 105.28 (2008), pp. 9516–9521.
[6] Tsutomu Arakawa and Serge N Timasheff. “Theory of protein solubility”. In: Methods in enzymol-
[7] Gabriel J Rocklin et al. “Global analysis of protein folding using massively parallel design, synthesis,
[8] Hege Beard et al. “Applying physics-based scoring to calculate free energies of binding for single
amino acid mutations in protein-protein complexes”. In: PloS one 8.12 (2013), e82849.
[9] A Elisabeth Eriksson et al. “Response of a protein structure to cavity-creating mutations and its
relation to the hydrophobic effect”. In: Science 255.5041 (1992), pp. 178–183.
[10] Philip A Romero and Frances H Arnold. “Exploring protein fitness landscapes by directed evolu-
tion”. In: Nature Reviews Molecular Cell Biology 10.12 (2009), p. 866.
[11] John Maynard Smith. “Natural selection and the concept of a protein space”. In: Nature 225.5232
(1970), p. 563.
[12] Frances H Arnold. “Combinatorial and computational challenges for biocatalyst design”. In: Nature
[13] Thomas A Hopf et al. “Mutation effects predicted from sequence co-variation”. In: Nature biotech-
[14] Romas J Kazlauskas and Uwe T Bornscheuer. “Finding better protein engineering strategies”. In:
[15] Jeff E Glasgow et al. “Osmolyte-mediated encapsulation of proteins inside MS2 viral capsids”. In:
[16] Emily C Hartman et al. “Quantitative characterization of all single amino acid variants of a viral
capsid-based drug delivery vehicle”. In: Nature communications 9.1 (2018), p. 1385.
[17] Maximilian Hecht, Yana Bromberg, and Burkhard Rost. “News from the protein mutability land-
[18] DA Steinhauer and JJ Holland. “Rapid evolution of RNA viruses”. In: Annual Reviews in Microbiol-
[19] Andrew L Ferguson et al. “Translating HIV sequences into quantitative fitness landscapes predicts
viral vulnerabilities for rational immunogen design”. In: Immunity 38.3 (2013), pp. 606–617.
[20] David S Peabody. “Subunit fusion confers tolerance to peptide insertions in a virus coat protein”.
[21] Adel M ElSohly et al. “Synthetically modified viral capsids as versatile carriers for use in antibody-
based cell targeting”. In: Bioconjugate chemistry 26.8 (2015), pp. 1590–1596.
[22] Ernest W Kovacs et al. “Dual-surface-modified bacteriophage MS2 as an ideal scaffold for a viral
capsid-based drug delivery system”. In: Bioconjugate chemistry 18.4 (2007), pp. 1140–1147.
[23] Jeff E Glasgow et al. “Influence of electrostatics on small molecule flux through a protein nanoreac-
[24] Keunhong Jeong et al. “Targeted molecular imaging of cancer cells using MS2-based 129Xe NMR”.
[25] Tyler Meldrum et al. “A xenon-based molecular sensor assembled on an MS2 viral capsid scaffold”.
In: Journal of the American Chemical Society 132.17 (2010), pp. 5936–5937.
[26] Nicholas Stephanopoulos, Zachary M Carrico, and Matthew B Francis. “Nanoscale integration of
sensitizing chromophores and porphyrins with bacteriophage MS2”. In: Angewandte Chemie Inter-
[27] Ying-Zhong Ma et al. “Energy transfer dynamics in light-harvesting assemblies templated by the
tobacco mosaic virus coat protein”. In: The Journal of Physical Chemistry B 112.22 (2008), pp. 6887–
6892.
[28] Rebekah A Miller, Andrew D Presley, and Matthew B Francis. “Self-assembling light-harvesting sys-
tems from synthetically modified tobacco mosaic virus coat proteins”. In: Journal of the American
[29] Carola Engler et al. “Golden gate shuffling: a one-pot DNA shuffling method based on type IIs re-
[30] Gabriel L Butterfield et al. “Evolution of a designed protein assembly encapsulating its own RNA
[31] Wilf T Horn et al. “The crystal structure of a high affinity RNA stem-loop complexed with the bacte-
riophage MS2 capsid: further challenges in the modeling of ligand–RNA interactions”. In: Rna 10.11
[32] Karim M ElSawy. “The impact of viral RNA on the association free energies of capsid protein assem-
bly: bacteriophage MS2 as a case study”. In: Journal of molecular modeling 23.2 (2017), p. 47.
[33] Joan Pontius, Jean Richelle, and Shoshana J Wodak. “Deviations from standard atomic volumes
as a quality measure for protein crystal structures”. In: Journal of molecular biology 264.1 (1996),
pp. 121–136.
[34] WP Jencks and J Regenstein. Handbook of Biochemistry and Molecular Biology. Ed. GD Fasman.
1976.
[35] M Charton. “Protein folding and the genetic code: an alternative quantitative model”. In: Journal
[36] Mauno Vihinen, Esa Torkkila, and Pentti Riikonen. “Accuracy of protein flexibility predictions”. In:
[37] JM Zimmerman, Naomi Eliezer, and R Simha. “The characterization of amino acid sequences in
proteins by statistical methods”. In: Journal of theoretical biology 21.2 (1968), pp. 170–201.
[38] Maria Sandberg et al. “New chemical descriptors relevant for the design of biologically active pep-
tides. A multivariate characterization of 87 amino acids”. In: Journal of medicinal chemistry 41.14
[39] WR Krigbaum and Akira Komoriya. “Local interactions as a structure determinant for protein molecules:
[40] Jean-Luc Fauchère et al. “Amino acid side chain parameters for correlation studies in biology and
pharmacology”. In: Chemical Biology & Drug Design 32.4 (1988), pp. 269–278.
[41] Ken A Dill and Justin L MacCallum. “The protein-folding problem, 50 years on”. In: science 338.6110
[42] Michael J Behe, Eaton E Lattman, and George D Rose. “The protein-folding problem: the native fold
determines packing, but does packing determine the native fold?” In: Proceedings of the National
[43] Martin Karplus and David L Weaver. “Protein-folding dynamics”. In: Nature 260.5550 (1976), p. 404.
[44] Ulrich HE Hansmann and Yuko Okamoto. “Comparative study of multicanonical and simulated
annealing algorithms in the protein folding problem”. In: Physica A: Statistical Mechanics and its
[45] Linus Pauling, Robert B Corey, et al. “Stable configurations of polypeptide chains”. In: Proc. R. Soc.
[46] Steven S Plotkin, Jin Wang, and Peter G Wolynes. “Statistical mechanics of a correlated energy land-
scape model for protein folding funnels”. In: The Journal of chemical physics 106.7 (1997), pp. 2932–
2948.
[47] Zhenqin Li and Harold A Scheraga. “Monte Carlo-minimization approach to the multiple-minima
problem in protein folding”. In: Proceedings of the National Academy of Sciences 84.19 (1987), pp. 6611–
6615.
[48] Charles Tanford. “The hydrophobic effect and the organization of living matter”. In: Science 200.4345
[49] George I Makhatadze et al. “Contribution of surface salt bridges to protein stability: guidelines for
protein engineering”. In: Journal of molecular biology 327.5 (2003), pp. 1135–1148.
[50] Ken A Dill. “Dominant forces in protein folding”. In: Biochemistry 29.31 (1990), pp. 7133–7155.
[51] George D Rose and Richard Wolfenden. “Hydrogen bonding, hydrophobicity, packing, and protein
folding”. In: Annual review of biophysics and biomolecular structure 22.1 (1993), pp. 381–415.
[52] Ken A Dill, Klaus M Fiebig, and Hue Sun Chan. “Cooperativity in protein-folding kinetics.” In: Pro-
[53] Karen S Sarkisyan et al. “Local fitness landscape of the green fluorescent protein”. In: Nature 533.7603
(2016), p. 397.
[54] Henadzi Vaitsekhovich. “Neural Networks in Disease Diagnostics”. In: BALTIC CONFERENCE. Cite-
seer, p. 47.
[55] Abu Bakar Siddiquee et al. “A Constructive Algorithm for Feedforward Neural Networks for Medical
[56] Pedro J Ballester and John BO Mitchell. “A machine learning approach to predicting protein–ligand
binding affinity with applications to molecular docking”. In: Bioinformatics 26.9 (2010), pp. 1169–
1175.
[57] Robert J Tunney et al. “Accurate design of translational output by a neural network model of ribo-
[58] Victoria Pokusaeva et al. “Experimental assay of a fitness landscape on a macroevolutionary scale”.
[59] Jeffrey Dean et al. “Large scale distributed deep networks”. In: Advances in neural information pro-
[60] Maryam M Najafabadi et al. “Deep learning applications and challenges in big data analytics”. In:
[61] Jason Yosinski et al. “How transferable are features in deep neural networks?” In: Advances in neural
[62] Martín Abadi et al. “Tensorflow: Large-scale machine learning on heterogeneous distributed sys-
[63] Jeff Donahue et al. “Decaf: A deep convolutional activation feature for generic visual recognition”.
[64] Hoo-Chang Shin et al. “Deep convolutional neural networks for computer-aided detection: CNN
architectures, dataset characteristics and transfer learning”. In: IEEE transactions on medical imag-
[65] Ning Qian and Terrence J Sejnowski. “Predicting the secondary structure of globular proteins using
neural network models”. In: Journal of molecular biology 202.4 (1988), pp. 865–884.
A Appendix: Data and Figures
Figure 6: MS2’s single-mutant fitness landscape, the quantified effect of mutating each position in the
protein backbone to each of the 20 canonical amino acids and nonsense mutations[16].
A.2 Epistasis values for double mutants in MS2’s FG-loop:
Figure 7: Epistasis values, ε, for the epistatic landscape of MS2’s FG-loop, with epistasis defined as the
difference between the additively predicted fitness scores and the actual fitness scores for the FG-loop
double mutants.
A.3 Neural Net Predictions and Additive Predictions
Figures 8–22: Experimental fitness data, neural network predictions, and additive predictions (left to
right) for additional double-mutant position pairs.