Introduction:
The North Queensland climate is made up of two seasons: winter brings warm temperatures
and low rainfall, while summer brings higher rainfall and warmer, balmy temperatures.
Because of its low rainfall, winter is commonly known as the 'dry' season; it runs from
May to October with low humidity and plenty of sunshine. Summer is known as the 'wet'
season and experiences tropical downpours with the occasional electrical storm from
November to April. Most of Australia's produce is grown in the winter season in the
temperate regions of the west, south and east. The timing of growing seasons varies from
region to region, and planting and harvesting dates can vary from year to year depending
on the rainfall.
States typically commence seeding in mid-April if there is sufficient moisture, but
farmers usually wait until late April, when daytime temperatures start falling to more
optimal levels. Seeding is usually finished in most regions by mid-June, but some wetter
areas can seed opportunistically well into July if required.
The harvest commences early to mid-October in Queensland, with the lower rainfall
farms generally starting earlier than the higher rainfall farms. Sometimes the harvest in
Queensland can be earlier, but generally only in a poor season.
To summarise, the sowing period in Queensland runs from April to June and the harvesting
period from September to December.
This information is especially relevant, as we would expect rainfall in the months of
April through to July to be especially important for the sugar yield. Furthermore, we
would expect rainfall in the remaining months to be less important.
Data analysis:
The data analysis will be an additive process whereby variables are added according to
their relevance to the overall model. First, we will make assumptions as to which
variables are important in predicting the Sugar Cane Yield.
Initially, a model was produced that contains the hypothesized relevant factors for the
response variable Sugar * Tonn.Hect.
- Notice that the wet months of Nov 1996-Dec 1996 and Nov 1997-Dec 1997 are
not included. The cutting season falls before the wet months, so this rain
would have minimal effect on the Sugar Cane Yield.
- Age was used instead of ratoon, as ratoon and age carry the same information
(this avoids collinearity).
- District was included for simplicity's sake, as a general variable encompassing
the district group and position.
- SoilID was included only to offer more specific information on the soil type.
The name of the soil type is one and the same as the SoilID, so it was not
included.
- Harvest Month would intuitively be relevant, as its timing affects how long the
plants have had to grow.
- The wet months of Jan to Feb will be added later, as will the months of the
sowing season from April to June (see the Introduction).
Results- First Model:
first=lm(I(Tonn.Hect*Sugar)~District+fs+Area+Variety+Age+HarvestMonth,
    data=training41)
Observe that 80.27% of the variation in the data can be explained by the model
Tonn.Hect*Sugar~District+fs+Area+Variety+Age+HarvestMonth. This, along with the
adjusted R-squared of 0.789, suggests that the model fits well, but one must be wary of
overfitting, as such a good fit may not carry over to new data. The global F-test
p-value is virtually zero, suggesting that the model is significant. However, the F-test
p-value of 0.2835 for HarvestMonth suggests that HarvestMonth is not significant.
(fs is the factor version of SoilID.)
A Nested F-test was done to study the significance of
HarvestMonth:
reducedfirst=lm(I(Tonn.Hect*Sugar)~District+fs+Area+Variety+Age,
    data=training41,na.action = na.exclude)
anova(first,reducedfirst)
Analysis of Variance Table
Thus we see that the Pr(>F) value is 0.2835, which is well above the 0.05 significance
level. We therefore cannot reject the null hypothesis that the HarvestMonth coefficients
are equal to zero, and conclude that HarvestMonth is not significant.
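The nested F-test used here can be sketched on simulated data. This is a minimal illustration, not the report's analysis: the data frame `toy`, its columns and all numbers are hypothetical stand-ins for training41. It shows that `anova()` on two nested `lm` fits returns the same F statistic as the formula based on the two residual sums of squares.

```r
# Simulated stand-in for training41; all names and values are hypothetical.
set.seed(1)
n <- 200
toy <- data.frame(
  Age   = sample(1:5, n, replace = TRUE),
  Area  = runif(n, 1, 20),
  Month = factor(sample(c("Jul", "Aug", "Sep"), n, replace = TRUE))
)
# The response depends on Age and Area only, so Month is noise by construction.
toy$Yield <- 50 - 3 * toy$Age + 2 * toy$Area + rnorm(n, sd = 5)

full    <- lm(Yield ~ Age + Area + Month, data = toy)
reduced <- lm(Yield ~ Age + Area, data = toy)

# Nested F-test: H0 is that every coefficient dropped from `full` is zero.
tab <- anova(reduced, full)
print(tab)

# The same F statistic computed from the two residual sums of squares.
rss_full    <- sum(resid(full)^2)
rss_reduced <- sum(resid(reduced)^2)
q           <- df.residual(reduced) - df.residual(full)
f_manual    <- ((rss_reduced - rss_full) / q) / (rss_full / df.residual(full))
```

A large Pr(>F) in the printed table means the dropped terms cannot be shown to improve the fit, which is the reasoning applied to HarvestMonth.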
Graph of Residuals:
Notice the heteroskedastic behaviour: the spread of the residuals changes as the fitted
values move from small to large.
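One simple numeric check for this pattern is to compare the residual spread in the lower and upper halves of the fitted values. The sketch below uses simulated data in which the error standard deviation grows with the predictor; the variables `x` and `y` are hypothetical and are not taken from training41.

```r
# Simulated example: the error standard deviation grows with x.
set.seed(2)
n <- 300
x <- runif(n, 1, 10)
y <- 5 + 3 * x + rnorm(n, sd = 0.5 * x)
fit <- lm(y ~ x)

# Split the observations at the median fitted value and compare residual spread.
lower_half   <- fitted(fit) <= median(fitted(fit))
sd_low       <- sd(resid(fit)[lower_half])
sd_high      <- sd(resid(fit)[!lower_half])
spread_ratio <- sd_high / sd_low  # well above 1 when the spread grows with the fit
```

A ratio close to 1 would be consistent with constant variance. Plotting `fitted(fit)` against `resid(fit)` shows the same thing graphically as a funnel shape.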
Other models:
It was observed that when district was removed from the model the rainfall values were
finally able to play a role in the model. In doing this the independent variable
DistrictGroup was added to the model to provide some relation to where the Sugar
Cane was produced. Furthermore, it also reduces the complexity of the model by
swapping District for DistrictGroup.
rainfall1=lm(I(Tonn.Hect*Sugar)~DistrictGroup+fs+Area+Variety+Age,
    data=training41,na.action = na.exclude)
Summary:
Residual standard error: 1227 on 2854 degrees of freedom
Multiple R-squared: 0.8006, Adjusted R-squared: 0.7876
F-statistic: 61.6 on 186 and 2854 DF, p-value: < 2.2e-16
Notice how the model is still very good despite the lack of District factor.
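As a consistency check, the adjusted R-squared can be recomputed from the other figures in this output using the standard formula 1 - (1 - R^2)(n - 1)/(n - p - 1). The sketch assumes the model includes an intercept, so that n equals the residual degrees of freedom plus the numerator degrees of freedom plus one.

```r
r2     <- 0.8006          # Multiple R-squared reported above
p      <- 186             # numerator degrees of freedom of the F-statistic
df_res <- 2854            # residual degrees of freedom
n      <- df_res + p + 1  # observations used in the fit (assumes an intercept)

adj_r2 <- 1 - (1 - r2) * (n - 1) / df_res
round(adj_r2, 4)
```

The result agrees with the reported value of 0.7876 to four decimal places. The small gap between the two R-squared values reflects the penalty for the 186 estimated terms.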
Now we add the months:
We will now observe the influence of rainfall on the models, starting with the wet months
Jan.97 and Feb.97.
rainfall2=lm(I(Tonn.Hect*Sugar)~DistrictGroup+fs+Area+Variety+Age+Jan.97+Feb.97,
    data=training41,na.action = na.exclude)
anova(rainfall2,rainfall1)
Analysis of Variance Table
The high Pr(>F) suggests that we cannot reject the null hypothesis, implying that the wet
months are not relevant.
rainfall3=lm(I(Tonn.Hect*Sugar)~DistrictGroup+fs+Area+Variety+Age+
    Jul.96+Aug.96+Sep.96+Oct.96+Jul.97+Aug.97+Sep.97+Oct.97,
    data=training41,na.action = na.exclude)
rainfall1=lm(I(Tonn.Hect*Sugar)~DistrictGroup+fs+Area+Variety+Age,
    data=training41,na.action = na.exclude)
anova(rainfall3,rainfall1)
Analysis of Variance Table
  Res.Df        RSS Df Sum of Sq      F    Pr(>F)
1   2849 4263674046
2   2854 4299704285 -5 -36030239 4.8151 0.0002184 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The nested F-test suggests that rainfall in the months over the cutting season is
significant, as the p-value is very small.
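The F statistic in the table above can be reproduced by hand from the two residual sums of squares and their degrees of freedom. The helper `nested_f` below is a hypothetical name introduced for illustration; it applies the standard nested F-test formula to the figures printed in the table.

```r
# F = ((RSS_reduced - RSS_full) / q) / (RSS_full / df_full),
# where q is the number of coefficients dropped from the full model.
nested_f <- function(rss_full, df_full, rss_reduced, df_reduced) {
  q <- df_reduced - df_full
  f <- ((rss_reduced - rss_full) / q) / (rss_full / df_full)
  p <- pf(f, q, df_full, lower.tail = FALSE)
  c(F = f, p = p)
}

# Values taken from the table: row 1 is the full model, row 2 the reduced one.
res <- nested_f(4263674046, 2849, 4299704285, 2854)
round(res[["F"]], 4)
```

The same function can be applied to the later tables in this report by substituting their RSS and Res.Df values.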
rainfall4=lm(I(Tonn.Hect*Sugar)~DistrictGroup+fs+Area+Variety+Age+
    Jul.96+Aug.96+Sep.96+Oct.96+Jul.97,
    data=training41,na.action = na.exclude)
When Oct.97 was included, the model rainfall4 gave NA coefficients for Jun.97, suggesting
collinearity. Thus it was wise to remove either Jul or Oct; Oct was removed.
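The link between NA coefficients and collinearity can be illustrated on simulated data. In this sketch the predictors `x1`, `x2` and `x3` are hypothetical, and `x3` is constructed as an exact multiple of `x1`, so `lm()` cannot estimate a separate coefficient for it.

```r
set.seed(3)
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- 2 * x1                     # exact linear function of x1
y  <- 1 + x1 + x2 + rnorm(n)

fit <- lm(y ~ x1 + x2 + x3)
coef(fit)                        # the x3 entry is NA

# Names of the coefficients that lm() could not estimate.
aliased <- names(coef(fit))[is.na(coef(fit))]
```

When a term is reported as NA in this way, dropping one of the linearly dependent predictors leaves the fitted values unchanged.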
anova(rainfall4)
Analysis of Variance Table
Notice how the months of July appear to be significant; this suggests testing whether
only the months of July are needed.
anova(rainfall4,rainfall5)
Analysis of Variance Table
  Res.Df        RSS Df Sum of Sq      F  Pr(>F)
1   2849 4263674046
2   2852 4276681856 -3 -13007810 2.8973 0.03386 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
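As the p-value (0.03386) is below 0.05, we reject the null hypothesis that the
coefficients of Aug.96, Sep.96 and Oct.96 are all zero, so these months add some
explanatory power beyond the two July terms. The evidence is modest, however, and the
simpler model containing only the months of July is examined next.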
rainfall5=lm(I(Tonn.Hect*Sugar)~DistrictGroup+fs+Area+Variety+Age+Jul.96+Jul.97,
    data=training41,na.action = na.exclude)
anova(rainfall5)
Analysis of Variance Table
As seen, the model is very significant, with a global utility p-value of virtually zero
and mostly significant variables.
Now a stepwise procedure will be used to determine whether a better model can be derived.
step(rainfall5,direction = "both")
Start: AIC=43427.92
I(Tonn.Hect * Sugar) ~ DistrictGroup + fs + Area + Variety +
    Age + Jul.96 + Jul.97
Call:
lm(formula = I(Tonn.Hect * Sugar) ~ DistrictGroup + fs + Area +
Variety + Age + Jul.96 + Jul.97, data = training41, na.action =
na.exclude)
The stepwise procedure could not find a model with a lower AIC, so rainfall5 is the best
model so far.
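How `step()` behaves can be sketched on simulated data. The data frame `toy` and its columns are hypothetical; `noise` is unrelated to the response by construction. `step()` only moves to a model with a lower AIC, so the selected model never has a higher AIC than the starting one, and if no move lowers the AIC the starting model is returned, as happened with rainfall5.

```r
set.seed(4)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
toy$y <- 2 + 1.5 * toy$x1 - 1 * toy$x2 + rnorm(n)

start  <- lm(y ~ x1 + x2 + noise, data = toy)
# trace = 0 suppresses the step-by-step printout.
chosen <- step(start, direction = "both", trace = 0)
formula(chosen)

# extractAIC() returns c(edf, AIC), the criterion that step() compares.
aic_start  <- extractAIC(start)[2]
aic_chosen <- extractAIC(chosen)[2]
```

Without a `scope` argument, `step()` only considers dropping terms from the starting model, or re-adding terms it has already dropped.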
Heteroskedasticity is still present, indicating that the constant-variance assumption of
the linear model is violated.
Rainfall over off-season:
rainfall6=lm(I(Tonn.Hect*Sugar)~DistrictGroup+fs+Area+Variety+Age+
    Jul.96+Jul.97+Mar.97+Apr.97+May.97+Jun.97,
    data=training41,na.action = na.exclude)
anova(rainfall6,rainfall5)
Analysis of Variance Table
This suggests that there are months in the off-season that are significant to the model.
anova(rainfall6)
Analysis of Variance Table
The large p-values suggest that May.97 and Mar.97 are insignificant. We can test this
with a nested F-test.
Analysis of Variance Table
Model 1: I(Tonn.Hect * Sugar) ~ DistrictGroup + fs + Area + Variety +
Age + Jul.96 + Jul.97 + Mar.97 + Apr.97 + May.97
Model 2: I(Tonn.Hect * Sugar) ~ DistrictGroup + fs + Area + Variety +
Age + Jul.96 + Jul.97 + Apr.97
Res.Df RSS Df Sum of Sq F Pr(>F)
1 2849 4263674046
2 2851 4269651799 -2 -5977753 1.9972 0.1359
This suggests that we cannot reject the null hypothesis that the Mar.97 and May.97
coefficients are zero, so these months are not significant to the model.
Final models:
Hence the final models would look like this:
rainfalllog=lm(log(I(Tonn.Hect*Sugar))~DistrictGroup+SoilID+I(Area^0.5)+Area+
    Variety+Age+Jul.96+Jul.97+Apr.97,
    data=training41,na.action = na.exclude)
The only issue with this model is that the coefficient for one SoilID level, 838, appears
as NA; all other SoilID levels are estimated without problems. The model has a very good
adjusted R-squared of 0.7888 and a global utility p-value of virtually zero. These
factors suggest that the Tonn.Hect*Sugar yield can be modelled on the log scale as:
Log((Tonn.Hect*Sugar))~DistrictGroup+SoilID+I(Area^0.5)+Area+Variety+Age+Jul.96+Jul.97
+Apr.97
Also worth noting is that the QQ plots are linear in nature suggesting the residuals are
normally distributed. Furthermore, the residual vs fitted graph also shows that the
residuals are randomly and evenly distributed indicating that the residuals are
independent and have constant variance.
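The visual QQ-plot check can be complemented by a number: the correlation between the theoretical and sample quantiles, which is close to 1 when the points lie on a straight line. The sketch below uses simulated data with multiplicative noise, for which a log-scale fit is appropriate; `x` and `y` are hypothetical and not taken from training41.

```r
set.seed(5)
n <- 200
x <- runif(n, 1, 10)
y <- exp(1 + 0.2 * x + rnorm(n, sd = 0.3))  # multiplicative noise

fit_log <- lm(log(y) ~ x)

# qqnorm() with plot.it = FALSE returns the plotted coordinates without drawing.
qq     <- qqnorm(resid(fit_log), plot.it = FALSE)
qq_cor <- cor(qq$x, qq$y)
```

A value of `qq_cor` near 1 supports approximate normality of the residuals. It does not replace inspecting the plot, since curvature in the tails can be hidden by a high overall correlation.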
In other words, the sugar cane yield under the final model can be written as:
(Tonn.Hect*Sugar)~Exp(DistrictGroup+SoilID+I(Area^0.5)+Area+Variety+Age+Jul.96+Jul.97
+Apr.97)
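Back-transforming from the log scale can be sketched as follows. The data frame `toy`, the reduced formula and the two rows in `new_blocks` are hypothetical, chosen only to keep the example small; the point is that `predict()` returns values on the log scale and `exp()` converts them to the original units.

```r
set.seed(6)
n <- 100
toy <- data.frame(Age = sample(1:5, n, replace = TRUE), Area = runif(n, 1, 20))
toy$Yield <- exp(3 - 0.05 * toy$Age + 0.4 * sqrt(toy$Area) + rnorm(n, sd = 0.2))

fit_log <- lm(log(Yield) ~ Age + I(Area^0.5), data = toy)

# Predictions come out on the log scale; exp() returns them to yield units.
new_blocks <- data.frame(Age = c(1, 4), Area = c(4, 9))
log_pred   <- predict(fit_log, newdata = new_blocks)
yield_pred <- exp(log_pred)
```

Note that exponentiating the expected log estimates the median yield rather than the mean when the log-scale errors are normal, so back-transformed predictions tend to sit slightly below the average observed yield.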
Alternatively, the model can be simplified by removing SoilID and the months Jul.97 and
Apr.97. The months could be removed because their large p-values in the anova table
indicate a lack of significance. The issue with this is that it leaves a model with
almost no rainfall terms, which is not realistic as rainfall is key to plant growth.
Another model that would be appropriate is the following:
(Tonn.Hect*Sugar)~Exp(DistrictGroup+Variety+I(Area^0.5)+Area+Age+Jul.96+Jul.97+Apr.97)
Because SoilID produced an NA coefficient for level 838, it could not be used and has
been removed.
rainfalllog=lm(log(I(Tonn.Hect*Sugar))~DistrictGroup+I(Area^0.5)+Area
+Variety+Age+Jul.96+Jul.97+Apr.97,data=training41,na.action =
na.exclude)
Residual standard error: 0.4724 on 3003 degrees of freedom
Multiple R-squared: 0.7837, Adjusted R-squared: 0.7811
F-statistic: 294.1 on 37 and 3003 DF, p-value: < 2.2e-16
Again the QQ plots are moderately linear, suggesting the residuals are approximately
normally distributed. The residual vs fitted graph also shows that the residuals are
randomly and evenly distributed, indicating that the residuals are independent and have
constant variance. Although this model explains only 78.37% of the variation in the
data, it is likely the better alternative: removing SoilID reduced the explained
variation by only about 2%, and a model with fewer terms is less likely to overfit. The
global utility p-value of the model is virtually zero, so the model is significant.
The test data output for the model is as follows (these are predictions on the log scale):
1 2 3 4 5 6 7 8 9
The following is a short summary of how the Expected Log Tonn.Hect*Sugar changes with
each variable (Variety and DistrictGroup are omitted because of their large number of
levels):
Age: 0.0565101 unit loss of Expected Log Tonn.Hect*Sugar per year of age increase.
Jul.96: 0.0004569 unit increase of Expected Log Tonn.Hect*Sugar per unit of July 96
rainfall.
Jul.97: 0.0017393 unit loss of Expected Log Tonn.Hect*Sugar per unit of July 97
rainfall.
Apr.97: 0.0001665 unit increase of Expected Log Tonn.Hect*Sugar per unit of April 97
rainfall.
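I(Area^0.5) and Area: the coefficients are 2.4300759 and -0.3384773 respectively, so
increasing Area from 0 to 1 raises the Expected Log Tonn.Hect*Sugar by
2.4300759 - 0.3384773 = 2.0915986. Because of the square-root term, the change per unit
of Area is not constant.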
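Because the response is on the log scale, each coefficient b corresponds to a multiplicative change of exp(b) in Tonn.Hect*Sugar per unit change in the variable. The sketch below converts the coefficients listed above to approximate percentage changes and evaluates the combined Area effect. The signs follow the "loss" and "increase" wording above, and the negative sign on the Area coefficient is an assumption inferred from the subtraction shown.

```r
# Coefficients from the summary above (negative where a "loss" is reported).
b <- c(Age = -0.0565101, Jul.96 = 0.0004569, Jul.97 = -0.0017393, Apr.97 = 0.0001665)

# Percentage change in Tonn.Hect*Sugar per one-unit increase in each variable.
pct_change <- 100 * (exp(b) - 1)
round(pct_change, 3)

# Combined contribution of the two Area terms to the expected log yield.
b_sqrt_area <- 2.4300759
b_area      <- -0.3384773  # assumed sign, inferred from the subtraction above
area_effect <- function(area) b_sqrt_area * sqrt(area) + b_area * area
area_effect(1)
```

For Age this gives a decrease of roughly 5.5% per additional year, which is easier to communicate than a change of 0.0565 log units.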
Conclusion:
log(I(Tonn.Hect*Sugar))~DistrictGroup+I(Area^0.5)+Area+Variety+Age+Jul.96+Jul.97+Apr.97
is a very good candidate, as it contains all of the terms deemed relevant to the Sugar
Cane Yield. As mentioned in the Introduction, rainfall in the winter months was expected
to matter most, and the analysis found such months to be relevant. Area was also very
important: as implied by the model, more area means more yield. Age had a negative
relation to the Sugar Cane Yield, which is plausible as older plants would be expected
to produce less. Variety was also shown to be important, as the type of Sugar Cane would
have an effect on the Sugar content.