This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIE.2015.2420792, IEEE Transactions on Industrial Electronics.
0278-0046 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

IEEE TRANSACTIONS ON INDUSTRIAL ELECTRONICS
optimum relationship obtained from one MPP search has errors, especially when the learned MPP is not correct.

This paper proposes a novel online reinforcement learning (RL)-based intelligent MPPT method for variable-speed WECS, as shown in Fig. 1. The method consists of two processes: an online learning process and an online application process. In the online learning process, the controller of the WECS behaves like an agent that interacts with the environment to learn the MPPs from its own experience using a model-free Q-learning algorithm. The optimum rotor speed-electrical output power curve is then obtained from these MPPs and used for fast MPPT control of the WECS in the online application process. Compared to the traditional TSR and ORB methods, the proposed MPPT method requires no wind speed sensors, no prior knowledge of the WECS, and no tedious offline design process. Compared to the P&O methods, the RL enables the agent to reach the MPP faster at a wind speed that has been experienced before and to respond quickly to wind speed variations after learning. In addition, it is not necessary to record the MPPs over the whole operating range, as in [19], to form the optimum curve; only the learned MPPs are needed. Thus, the learning time is shortened compared to that in [19]. Moreover, the estimation error of the optimum curve obtained from the proposed method is smaller than that of the method using only one MPP. The proposed RL-based MPPT method is validated by simulation results on a 1.5-MW doubly-fed induction generator (DFIG)-based WECS and experimental results on a 200-W emulated permanent-magnet synchronous generator (PMSG)-based WECS, which are the two most commonly used types of variable-speed WECSs.

II. BACKGROUND OF REINFORCEMENT LEARNING

Unlike supervised learning, in which an agent learns from examples provided by an external supervisor, in RL an agent learns from its own experience by directly interacting with the environment through actions, states, and rewards (Fig. 1). The agent receives a reward each time it takes an action to transit from one state to another. The objective of the RL is to form a mapping from states to actions that maximizes the rewards [23].

Most RL theory deals with the finite Markov decision process (MDP). A finite MDP can be described by a tuple [S, A, P, R], where S is the set of states, A is the set of actions, P is the state transition probability function that gives the probability of a transition from a state s to another state s' when an action a is taken, and R is the reward function that determines the reward after the state transition.

At each time step t, the agent receives the state signal st ∈ S that represents the state of the environment and selects an action at ∈ A based on a mapping called an action selection policy, denoted p. For example, p(st, at) is the probability that the action at is chosen in the state st. One time step later, in part as a result of the action at, the state changes from st to st+1 ∈ S according to P, and the agent receives a numerical reward rt+1 based on R, which evaluates the immediate effect of the action at on this state transition.

In RL, the agent's goal is to find an action selection policy that maximizes the total discounted rewards (expected return) it receives over the future [23]. The expected return is described by a state value function or an action value function. The estimation of the value function is the core of almost all RL algorithms.

There are three basic classes of methods for solving the RL problem: model-based dynamic programming, the model-free Monte Carlo method, and model-free temporal-difference (TD) learning [23]-[25]. Q-learning is a form of model-free RL algorithm based on TD and has various industrial applications, such as optimal control and multi-agent RL systems [26], [27].

The one-step form of Q-learning is defined by

Qt+1(st, at) = Qt(st, at) + lt [rt+1 + γ max_ai Qt(st+1, ai) − Qt(st, at)]   (1)

where γ ∈ [0,1) is the discount factor that determines the current values of the rewards to be received in the future, and i is the index of the action in the action space. Qt(st, at) is the action value function (Q value function) that needs to be estimated. At each time step, the agent first observes its current state st and selects an action at to perform. At the same time, the Q value Qt(st, at) is remembered. Then, the subsequent state st+1 is observed with an immediate reward rt+1, and the maximum Q value corresponding to st+1, max_ai Qt(st+1, ai), is picked out using the method described below. The recorded Qt(st, at) is then updated according to (1). The parameter lt ∈ (0,1] is the learning rate, which determines how far the currently estimated Qt(st, at) is adjusted toward the newly estimated Q value rt+1 + γ max_ai Qt(st+1, ai). To ensure that the Q value function converges to the optimal value, all the state-action pairs should be updated. This means that the agent should try all the actions in all states, striking a trade-off between exploration and exploitation [27]. One method to achieve this is to use Boltzmann exploration as the action selection policy, in which the action ai in the state s is chosen with the following probability:

p(s, ai) = e^(Q(s, ai)/τ) / Σ_ai e^(Q(s, ai)/τ)   (2)
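As a concrete illustration, the one-step update (1) and the Boltzmann policy (2) can be sketched in a few lines of Python. This is a minimal toy sketch, not the paper's implementation: the 2-state, 3-action Q-table and the values lr = 0.3, γ = 0.75, and τ = 0.8 are illustrative assumptions (the paper's tuned values appear later in Table I).

```python
import math
import random

def boltzmann_probs(q_row, tau):
    """Selection probabilities p(s, ai) = e^(Q(s,ai)/tau) / sum_ai e^(Q(s,ai)/tau), as in (2)."""
    exps = [math.exp(q / tau) for q in q_row]
    z = sum(exps)
    return [e / z for e in exps]

def select_action(q_row, tau):
    """Sample an action index according to the Boltzmann distribution."""
    return random.choices(range(len(q_row)), weights=boltzmann_probs(q_row, tau))[0]

def q_update(Q, s, a, r, s_next, lr, gamma):
    """One-step update (1): Q(s,a) += lr * [r + gamma * max_a' Q(s',a') - Q(s,a)]."""
    td_target = r + gamma * max(Q[s_next])
    Q[s][a] += lr * (td_target - Q[s][a])

# Toy usage: 2 states x 3 actions ("increment", "decrement", "stay").
Q = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
a = select_action(Q[0], tau=0.8)
q_update(Q, s=0, a=a, r=1.0, s_next=1, lr=0.3, gamma=0.75)
```

With an all-zero Q-table, the policy is uniform over the three actions, and a single update with reward 1.0 moves the chosen entry to lr × 1.0 = 0.3, showing how the TD error drives the estimate toward the target.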
D. Discount Factor, Learning Rate, and Temperature

The discount factor γ in (1) controls the discounted values of the future rewards. In the Q-learning for the MPPT application, the agent tries to maximize the total future rewards instead of the reward received immediately.

To ensure the convergence of the Q-learning, a criterion should be considered when selecting the learning rate in (1): a larger (smaller) learning rate should be used for a state that the agent has experienced fewer (more) times. In this paper, each state skj is associated with a learning rate lkj, which is calculated as follows based on a parameter Nkj representing the number of times that the state has been experienced:

lkj = k1 / (k2 + k3 Nkj)   (9)

The larger the parameter Nkj, the more likely the action value has converged to the optimum for that state, and the smaller the learning rate lkj.

The temperature τ in (2) also decreases during the learning process, which can likewise be controlled by using Nkj. Initially, the agent is encouraged to explore in the action space. Therefore, even the actions that receive negative rewards should be selected with certain random probabilities. For instance, when the WECS operates at the point A (Fig. 2) at a given wind speed, a "decrement" action will lead to a positive reward if the wind speed does not change. However, this action will receive a negative reward if the wind speed decreases suddenly, which makes this "good" action a "bad" one. To minimize the effect caused by wind speed variations, all the actions should be selected randomly before the learned experience can be exploited, which is controlled by τ. For example, τ can be expressed by

τ = τmin + (1 − Nkj/Nmax)(τmax − τmin), if Nkj ≤ Nmax
τ = τmin, if Nkj > Nmax   (10)

where τmin and τmax are the minimum and maximum values of the temperature, respectively, and Nmax is the number of times a state has to be visited after which the learned experience of the state is considered "mature" and can be exploited.

IV. INTELLIGENT RL-BASED MPPT ALGORITHM

The proposed intelligent RL-based MPPT algorithm consists of two processes: an online learning process and an online application process, as shown in Fig. 1. In the online learning process, a map from state to action is learned. The learned action values are stored in a Q-table, based on which the relationship between the optimal ωr and Pe can be derived. In the application process, the WECS is controlled using the learned optimum relationship.

A. The Learning Process

In the learning process, the agent learns how to behave at each discrete state (operating point). As described in Section III-B, in each state three actions can be taken. The agent will take an "increment" action if the MPP locates on the right of the state, a "decrement" action if the MPP locates on the left of the state, and a "stay" action if the MPP lies in the current state. The action with the highest value is the best for that state and should be selected to move the operating point towards the MPP. Initially, when a state is detected, all three actions should be selected randomly regardless of their action values. As the temperature τ decreases, the action with a higher value will be selected with a higher probability. After a certain period of learning, for example, when Nkj reaches a set value, the action with the highest value will be selected.

For each state, if the "stay" action has the highest value, it means that the MPP for the specific wind speed locates in that state, because either an "increment" or a "decrement" action will lead to a negative reward that decreases the action value. Therefore, each time such a state is found, the corresponding MPP (ωr and Pe) is recorded. The learned MPPs are used to form the optimum curve of ωr and Pe for the ORB MPPT control of the WECS in the online application process.

The flow chart of the proposed RL-based MPPT algorithm is shown in Fig. 3 and the details are explained as follows.

1) Q-table Initialization: Online learning is a process of updating the learning experience represented by the action values stored in the Q-table. Since each state has three actions, the total number of action values stored in the Q-table is 3 × N × M. Better initial Q values make the Q-learning algorithm converge faster, thus requiring less time for online learning. In real-world applications, to shorten the online learning time, a prelearning can be performed offline by simulations first, where the initial Q values can simply be set to zero. The prelearned Q values are then used to constitute an initial Q-table for the online learning algorithm.

2) Implementation of the Q-learning Algorithm: After the Q-table is initialized, the agent starts to learn online based on the Q-learning algorithm. In each cycle of learning, ωr and Pe are acquired to determine the indices k and j of the state st (the location of st) in the Q-table, and the parameter Nkj is increased by 1 to calculate the learning rate and temperature using (9) and (10), respectively. Then an action at is selected from the action space defined by (6) using the Boltzmann exploration. The corresponding action value Qt(st, at) is selected from the Q-table and will be updated by the end of the current sampling interval. The sampling interval should be larger than the response time of the speed control loop to obtain a correct electrical power measurement [20]. At the beginning of the next sampling interval, a new state st+1 is observed, and the difference of the two successive electrical power measurements is used to calculate the reward rt+1 by (8). The maximum action value in st+1 is picked out and expressed as max Qt(st+1, a′). The value of Qt(st, at) is then updated by (1) in the Q-table, which shows how the agent learns from its own experience in one learning cycle.

The Boltzmann exploration is implemented as follows:
(1) Calculate the probability p(s, ai) for selecting each action ai (i = 1, 2, 3) by (2).
(2) Randomly generate a number ρ between 0 and 1.
(3) Select an action based on ρ: if ρ is larger than p(s, a1), the first action is selected; if ρ is between p(s, a1) and
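The visit-dependent schedules (9) and (10) can be sketched as follows. The default constants k1 = 10, k2 = 25, k3 = 0.6, τmin = 0.08, τmax = 0.8, and Nmax = 60 are taken from the experiment column of Table I and the text, but are otherwise placeholders that would be retuned per system.

```python
def learning_rate(n_visits, k1=10.0, k2=25.0, k3=0.6):
    """Learning rate (9): l_kj = k1 / (k2 + k3 * N_kj), decaying with state visits."""
    return k1 / (k2 + k3 * n_visits)

def temperature(n_visits, tau_min=0.08, tau_max=0.8, n_max=60):
    """Temperature schedule (10): linear cooling from tau_max to tau_min,
    then held at tau_min once the state's experience is "mature" (N_kj > N_max)."""
    if n_visits > n_max:
        return tau_min
    return tau_min + (1.0 - n_visits / n_max) * (tau_max - tau_min)
```

A never-visited state starts at τ = τmax (near-random exploration) and a fully matured state sits at τ = τmin (near-greedy exploitation), which matches the intent described above: explore first, exploit once Nkj reaches Nmax.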
TABLE I
PARAMETERS OF THE RL ALGORITHM USED IN SIMULATION AND EXPERIMENT

discount factor γ in (1): 0.75 | 0.75
temperature τ in (2): 0.1 | Equation (10)
τmin in (10): --- | 0.08/0.1
τmax in (10): --- | 0.8
N in (5): 25/30 | 25/30
M in (5): 46/50 | 46/50
Δωr in (6): 0.02 p.u./10 rad/s | 0.02 p.u./10 rad/s
δ1 in (8): 0.0007 MW/0.8 W | 0.0007 MW/0.8 W
δ2 in (11): --- | 0.05
learning rate in (9): 0.3 | k1 = 10, k2 = 25, k3 = 0.6
Nmax in (12): --- | 60/25

Fig. 5. Simulation result of the wind turbine power coefficient in the learning process of the RL-based MPPT control in a constant wind speed condition. (Plot: actual Cp and optimal Cp versus time, 0-300 s.)

TABLE II
SELECTED STATES AND CORRESPONDING ACTION VALUES FROM Q-TABLE IN CONSTANT WIND SPEED CONDITION

ωr (p.u.) | Pe (MW) | Increment | Decrement | Stay | Nkj
0.91 | 0.8625 | 0.7992 | 0 | 0 | 4
0.93 | 0.8625 | 0 | -0.0256 | 0.0258 | 13
0.95 | 0.8625 | -0.34 | 0.0058 | 0 | 18
0.97 | 0.8625 | -0.2325 | 0.6570 | 0 | 6
0.99 | 0.8625 | -0.2325 | 0.51 | 0 | 4

should increase to approach the MPP each time this state is visited.

2) Learning with Step-Changed Wind Speed: In this test, a wind speed profile with step changes alternately between 9.5 m/s and a higher wind speed is used. Even after the MPPs have been found (e.g., at 120 s and 180 s), the agent still explores. The agent learns this wind speed condition in about 480 s; after that, it approaches the MPP directly each time the wind speed changes to a speed that has been experienced before. Some selected action values are shown in Table III, which indicates that the two MPPs have been found for the two wind speed conditions. The learned MPPs are then used to generate the optimum ωr−Pe curve for the ORB MPPT control.

Fig. 6. Simulation results of the WECS in the learning process of the RL-based MPPT control during step changes in wind speed. (Plot: actual Cp, optimal Cp, and wind speed versus time, 0-900 s.)

TABLE III
SELECTED STATES AND CORRESPONDING ACTION VALUES FROM Q-TABLE IN STEP-CHANGED WIND SPEED CONDITION

ωr (p.u.) | Pe (MW) | Increment | Decrement | Stay | Nkj
0.97 | 1.0375 | 0.0469 | -0.3 | 0 | 7
0.99 | 1.0375 | -0.348 | 0.007 | 0.2919 | 44
1.01 | 1.3875 | -0.1223 | -0.4425 | -0.4718 | 21

The online learning algorithm is tested for 18 hours, and the simulation results for the last 6 hours are shown in Fig. 7, during which the agent tries to follow the maximum Cp from the learned experience. However, due to the sudden, large-magnitude changes in the wind speed, magnitude drops are observed in the Cp curve. The learned states that contain MPPs are shown in Table IV. In this online learning test, the value of Nmax in (12) is set to 60. As shown in Table IV, the minimum value of Nkj is 60, which satisfies the learning stopping criterion (12) for each MPP. In practice, a smaller value of Nmax, e.g., 20, can be used and, therefore, the online
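The MPP-extraction rule described above (record a state as an MPP when its "stay" action carries the highest value) can be sketched as below. The sample rows are the first three states of Table II; the dictionary-based state encoding is an illustrative assumption, not the paper's data structure.

```python
ACTIONS = ("increment", "decrement", "stay")

def learned_mpps(q_table):
    """Return (wr, Pe) pairs for states where the "stay" action has the highest Q value."""
    mpps = []
    for (wr, pe), q_values in q_table.items():
        if max(range(3), key=lambda i: q_values[i]) == 2:  # index 2 = "stay"
            mpps.append((wr, pe))
    return mpps

# Selected states from Table II: (wr [p.u.], Pe [MW]) -> [Q_inc, Q_dec, Q_stay]
q_table = {
    (0.91, 0.8625): [0.7992, 0.0, 0.0],
    (0.93, 0.8625): [0.0, -0.0256, 0.0258],
    (0.95, 0.8625): [-0.34, 0.0058, 0.0],
}
```

Applied to these rows, only the state at ωr = 0.93 p.u. qualifies, since "stay" dominates there while its neighbors favor "increment" or "decrement", consistent with the Table II discussion.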
learning process can be shorter than 18 hours for this WECS. The optimum ωr−Pe curve obtained by curve fitting on the recorded MPPs is in the form of (13) or (14), where Kopt is 1.07.

4) Application Process: The obtained optimum ωr−Pe curve is used for ORB MPPT control in the application process. Some typical simulation results are shown in Fig. 8, where the wind speed [Fig. 8(a)] varies in the range between 6 m/s and 9 m/s, which is different from that used for learning. It can be seen that fast MPPT under a randomly varying wind condition is achieved by using the ORB control expressed by (14). During the wind variation, the actual value of Cp tracks the maximum value well [Fig. 8(b)]. Compared to the theoretical wind power output calculated by (3), the output electrical power of the DFIG is smoothed [Fig. 8(c)]. The deviations between the actual and maximum Cp, and between the actual output electrical power and the theoretical maximum wind power, are caused by the fast wind speed variations and the relatively slow response of the WECS due to its large system inertia. Specifically, when the wind speed increases, the instantaneous wind power is larger than the output electrical power. As a result, the WECS accelerates and the extra wind power is stored in the shaft as kinetic energy. On the other hand, when the wind speed decreases, the instantaneous wind power is less than the output electrical power. As a result, the kinetic energy is released to balance the input wind power and the output electrical power, leading to a deceleration of the WECS.

TABLE IV
SELECTED STATES AND CORRESPONDING ACTION VALUES FROM Q-TABLE IN RANDOMLY VARYING WIND SPEED CONDITION
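The curve fitting on the recorded MPPs can be sketched as follows. Equations (13) and (14) are not reproduced in this excerpt, so the standard ORB relation Pe = Kopt·ωr³ is assumed here, with a closed-form least-squares fit for Kopt; the sample MPP list is illustrative, not the paper's data.

```python
def fit_kopt(mpps):
    """Least-squares Kopt for the assumed ORB curve Pe = Kopt * wr^3,
    fitted over recorded (wr, Pe) MPP pairs: Kopt = sum(Pe*wr^3) / sum(wr^6)."""
    num = sum(pe * wr**3 for wr, pe in mpps)
    den = sum(wr**6 for wr, _ in mpps)
    return num / den

def orb_power_reference(wr, kopt):
    """ORB MPPT in the application process: power reference from measured rotor speed."""
    return kopt * wr**3

mpps = [(0.93, 0.80), (1.00, 1.00), (1.06, 1.19)]  # illustrative (wr, Pe) pairs
kopt = fit_kopt(mpps)
```

Once Kopt is fitted, the application process reduces to evaluating `orb_power_reference` at each measured rotor speed, which is why its runtime cost matches the plain ORB method.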
Fig. 8. Simulation results of the WECS in the application process of the RL-based MPPT control. (a) Wind speed, (b) wind turbine power coefficient, and (c) maximum wind power and DFIG electrical power.

Fig. 10. Experimental system setup.
Fig. 11. It takes about 150 s for the agent to complete exploration and stay at the MPP based on the learned experience.

2) Learning with Step-Changed Wind Speed: A wind speed profile with step changes alternately between 7 m/s and 8 m/s every 30 s is used to verify that the agent is able to learn from experience; the results are shown in Fig. 12. Initially, the agent is "naive," and a large drop of Cp is observed during this learning process. This is caused by a wrong-direction exploration. However, the agent is able to change its action to the correct direction, which verifies its learning ability even from a wrong experience. After a 15-minute online learning process, the agent is able to obtain the optimal parameter Kopt based on the learned MPPs. The Kopt for the optimum ωr (rad/s)−Pe (W) curve obtained by curve fitting is 5.1×10⁻⁶.

3) The Application Process: The optimal Kopt is used in the application process for the ORB MPPT control of the WECS. During the experiment, the wind speed varies in the range between 5 m/s and 8 m/s. The results in Fig. 13 again verify that the obtained ORB control has fast MPPT performance under various wind conditions. The actual value of Cp tracks the maximum value well. Moreover, the electrical power of the PMSG tracks the theoretical maximum wind power well with no obvious deviations, owing to a small system inertia. These results also prove that the optimal Kopt learned in this experiment is accurate.

Fig. 11. Experimental result of the power coefficient of the emulated wind turbine in the learning process of the RL-based MPPT control in a constant wind speed condition.

Fig. 12. Experimental results of the WECS emulator in the learning process of the RL-based MPPT control during step changes in wind speed.

Fig. 13. Experimental results of MPPT control of the WECS in the application process of the RL-based method. (a) Wind speed, (b) wind turbine power coefficient, and (c) maximum wind power and PMSG electrical power.

4) Comparative Studies: To further demonstrate the advantage of the proposed RL-based MPPT method, it is compared with the traditional P&O MPPT method for two hours using a measured wind speed profile shown in Fig. 14(a). Fig. 14(b) shows that the electrical power generated by the PMSG using the proposed RL-based MPPT method is higher than that using the P&O method, particularly when the wind speed has abrupt changes. The ORB MPPT control obtained from the RL has a fast MPPT capability under the varying wind speed condition. On the contrary, the P&O method searches for the MPP from the beginning each time the wind speed changes, which results in a slower convergence to the MPP. Fig. 14(c) compares the energies produced by the WECS using the two different MPPT methods over the two hours. The results show that the WECS produces more energy when using the proposed RL-based MPPT method. At the end of the test, the WECS using the RL-based MPPT method produces 0.4 MJ of energy, which is 5.6% more than the 0.379 MJ produced by the WECS using the P&O MPPT method. The results also show that the longer the WECS with the RL-based MPPT control is operated, the more energy it produces relative to the WECS with the P&O MPPT control.

The proposed RL-based MPPT method is further compared with the TSR, ORB, and P&O MPPT methods in terms of design requirement, online learning ability, and computational costs (time and memory). All of the methods are implemented in MATLAB/Simulink running on the same desktop computer. The results are compared in Table V. Compared with the other
three MPPT methods, the proposed method learns the wind turbine characteristic online without the need for prior knowledge of the WECS. The online learning algorithm is universally applicable to any WECS. Another benefit of the proposed RL-based MPPT method is that when the wind turbine characteristic changes due to the aging of the system, the proposed method can relearn the wind turbine characteristic and generate a new value of Kopt to adapt to the change. In the experiment, the learning process of the RL-based MPPT method needs 40 kilobytes (kB) of memory to store a 3×30×50 matrix of action values and the other parameters that will be used in the next step, including the speed, power, state index, and Nkj in (12); the computational time of the proposed online learning algorithm in each step is only 0.045 ms. The memory requirement and computational time of the proposed method are low, making it suitable for real system applications. After the MPPs are learned, the optimal parameter Kopt is obtained and the RL-based MPPT control is switched from the online learning process to the online application process, which essentially is an ORB MPPT control. As shown in Table V, the computational costs (memory and computational time) of the proposed RL-based MPPT control during the online application are the same as those of the ORB method and lower than those of the TSR and P&O MPPT methods.

Fig. 14. Comparison of the WECS controlled by the P&O MPPT method and the proposed RL-based MPPT method. (a) Wind speed, (b) electrical power generated by the PMSG, and (c) energy produced by the WECS.

TABLE V
COMPARISON OF DIFFERENT MPPT METHODS

MPPT method | Need prior knowledge of WECS | Online learning ability | Computational time (ms) | Required memory (kB)
TSR | Yes | No | 0.042 | 0.483
ORB | Yes | No | 0.003 | 0.178
P&O | No | No | 0.015 | 0.181
RL | No | Yes | 0.003 | 0.178

VII. CONCLUSION

This paper has proposed a novel intelligent RL-based MPPT algorithm for variable-speed WECSs. A model-free Q-learning

APPENDIX

TABLE A.I
PARAMETERS OF THE DFIG-BASED WECS USED IN SIMULATION STUDY

Rated power | 1.5 MW
Rated stator voltage | 690 V
Rated dc-bus voltage | 1.2 kV
Stator resistance | 0.007 pu
Wound rotor resistance | 0.009 pu
Magnetizing inductance | 2.9 pu
Stator leakage inductance | 0.171 pu
Wound rotor leakage inductance | 0.156 pu
DFIG inertia constant | 0.62 s
Turbine inertia constant | 3.8 s

TABLE A.II
PARAMETERS OF THE EMULATED PMSG-BASED WECS

DC motor: Rated speed 3500 RPM; Rated power 200 W; Back EMF constant 8.7 V/kRPM; Stator resistance 0.39 Ω; Armature inductance 0.67 mH
PMSG: Rated speed 3000 RPM; Rated power 200 W; Number of poles 8; Back EMF constant 9.5 V/kRPM; Stator resistance 0.233 Ω; d-axis inductance 0.275 mH; q-axis inductance 0.364 mH

REFERENCES

[1] Y. Zhao, C. Wei, Z. Zhang, and W. Qiao, "A review on position/speed sensorless control for permanent magnet synchronous machine-based wind energy conversion systems," IEEE Journal of Emerging and Selected Topics in Power Electron., vol. 1, no. 4, pp. 203-216, Dec. 2013.
[2] Y. Xia, K.H. Ahmed, and B.W. Williams, "Wind turbine power coefficient analysis of a new maximum power point tracking technique," IEEE Trans. Ind. Electron., vol. 60, no. 3, pp. 1122-1132, Mar. 2013.
[3] M.A. Abdullah, A.H.M. Yatim, C.W. Tan, and R. Saidur, "A review of maximum power point tracking algorithms for wind energy systems," Renewable & Sustainable Energy Review, vol. 16, no. 5, pp. 3220-3227, Jun. 2012.
[4] J. Chen, J. Chen, and C. Gong, "On optimizing the transient load of variable-speed wind energy conversion system during the MPP tracking process," IEEE Trans. Ind. Electron., vol. 61, no. 9, pp. 4698-4706, Sep. 2014.
[5] H. Li, K.L. Shi, and P.G. McLaren, "Neural-network-based sensorless maximum wind energy capture with compensated power coefficient," IEEE Trans. Ind. Appl., vol. 41, no. 6, pp. 1548-1556, Nov./Dec. 2005.
[6] W. Qiao, W. Zhou, J.M. Aller, and R.G. Harley, "Wind speed estimation based sensorless output maximization control for a wind turbine driving a DFIG," IEEE Trans. Power Electron., vol. 23, no. 3, pp. 1156-1169, May 2008.
[7] W. Qiao, X. Yang, and X. Gong, "Wind speed and rotor position sensorless control for direct-drive PMG wind turbines," IEEE Trans. Ind. Appl., vol. 48, no. 1, pp. 3-11, Jan./Feb. 2012.
[8] A.G. Abo-Khalil and D.-C. Lee, "MPPT control of wind generation systems based on estimated wind speed using SVR," IEEE Trans. Ind. Electron., vol. 55, no. 3, pp. 1489-1490, Mar. 2008.
[9] M. Pucci and M. Cirrincione, "Neural MPPT control of wind generators with induction machines without speed sensors," IEEE Trans. Ind. Electron., vol. 58, no. 1, pp. 37-47, Jan. 2011.
[10] M.N. Soltani, T. Knudsen, M. Svenstrup, R. Wisniewski, P. Brath, R. Ortega, and K. Johnson, "Estimation of rotor effective wind speed: A comparison," IEEE Trans. Control Systems Technology, vol. 21, no. 4, pp. 1155-1167, Jul. 2013.
[11] R. Cardenas, R. Pena, S. Alepuz, and G. Asher, "Overview of control systems for the operation of DFIGs in wind energy applications," IEEE Trans. Ind. Electron., vol. 60, no. 7, pp. 2776-2798, Jul. 2013.
[12] H. Camblong, I. Martinez de Alegria, M. Rodriguez, and G. Abad, "Experimental evaluation of wind turbines maximum power point tracking controllers," Energy Conversion & Management, vol. 47, no. 18/19, pp. 2846-2858, Nov. 2006.
[13] S. Morimoto, H. Nakayama, M. Sanada, and Y. Takeda, "Sensorless output maximization control for variable-speed wind generation system using IPMSG," IEEE Trans. Ind. Appl., vol. 41, no. 1, pp. 60-67, 2005.
[14] R. Cardenas and R. Pena, "Sensorless vector control of induction machines for variable-speed wind energy applications," IEEE Trans. Energy Conversion, vol. 19, no. 1, pp. 196-205, Mar. 2004.
[15] A. Tapia, G. Tapia, J.X. Ostolaza, and J.R. Saenz, "Modeling and control of a wind turbine driven doubly fed induction generator," IEEE Trans. Energy Conversion, vol. 18, no. 2, pp. 194-204, Jun. 2003.
[16] A. Mirecki, X. Roboam, and F. Richardeau, "Architecture complexity and energy efficiency of small wind turbines," IEEE Trans. Ind. Electron., vol. 54, no. 1, pp. 660-670, Feb. 2007.
[17] Y. Zou, M. Elbuluk, and Y. Sozer, "Stability analysis on maximum power points tracking (MPPT) method in wind power system," IEEE Trans. Ind. Appl., vol. 49, no. 3, pp. 1129-1136, May/Jun. 2013.
[18] B. Shen, B. Mwinyiwiwa, Y. Zhang, and B.T. Ooi, "Sensorless maximum power point tracking of wind by DFIG using rotor position phase lock loop (PLL)," IEEE Trans. Power Electron., vol. 24, no. 4, pp. 942-951, 2009.

Chun Wei (S'12) received a B.S. degree in electrical engineering from Beijing Jiaotong University, Beijing, China, in 2009, and an M.S. degree in electrical engineering from North China Electric Power University, Beijing, China, in 2012. Currently, he is working toward the Ph.D. degree in the Department of Electrical and Computer Engineering at the University of Nebraska—Lincoln, Lincoln, NE, USA.
His research interests include wind energy conversion systems, power electronics, motor drives, and computational intelligence.

Zhe Zhang (S'10) received a B.S. degree in electrical engineering from Xi'an Jiaotong University, Xi'an, China, in 2010. Currently, he is working toward a Ph.D. degree in electrical engineering at the University of Nebraska—Lincoln, Lincoln, NE, USA.
He worked as an electrical engineering intern at Rockwell Automation in summer 2014. His current research interests include control of wind energy conversion systems, power electronics, and motor drives.

Wei Qiao (S'05–M'08–SM'12) received B.Eng. and M.Eng. degrees in electrical engineering from Zhejiang University, Hangzhou, China, in 1997 and 2002, respectively, an M.S. degree in high performance computation for engineered systems from the Singapore-MIT Alliance (SMA) in 2003, and a Ph.D. degree in electrical engineering from the Georgia Institute of Technology, Atlanta, in 2008.
Since August 2008, he has been with the University of Nebraska—Lincoln (UNL), USA, where he is currently an Associate Professor in the Department of Electrical and Computer Engineering. His research interests include renewable energy systems, smart grids, condition monitoring, power electronics, electric machines and drives, and computational intelligence. He is the author or coauthor of 3 book chapters and more than 150 papers in refereed journals and conference proceedings.
Dr. Qiao is an Editor of the IEEE Transactions on Energy Conversion, an Associate Editor of IET Power Electronics and the IEEE Journal of Emerging
Apr. 2009. and Seleccted Topics in Pow wer Electronics, annd the Correspondiing Guest Editor
[199] Q. Wang and L. Chang, “An intelligenti maximu um power extraction of a speciial section on Conddition Monitoring,, Diagnosis, Prognnosis, and Health
algorithm for innverter-based variaable speed wind tu urbine systems,” IE EEE Monitorinng for Wind Energgy Conversion Sysstems of the IEEE Transactions on
Trans. Power Electron.,
E vol. 19, no.
n 5, pp. 1242-1249, Sept. 2004. Industriall Electronics. He w
was an Associate E Editor of the IEEE Transactions on
[200] R. Datta and V.T.
V Ranganathan, “A method of traccking the peak po ower Industry AApplications fromm 2010 to 2013. Hee was the recipientt of a 2010 U.S.
points for a variiable speed wind energy
e conversion system,” IEEE Tra ans. National Science Foundatiion CAREER Aw ward and the 20100 IEEE Industry
Energy Converssion, vol. 18, no. 1, 1 pp. 163-168, Maar. 2003. Applicatioons Society Andreew W. Smith Outsstanding Young M Member Award.
[21] S.M. Raza Kazm mi, H. Goto, Hai-JJiao. Guo, and O. Ichinokura, “A no ovel
algorithm for faast and efficient speed-sensorless maximum
m power point
tracking in wind d energy conversio on systems,” IEEE Trans. Ind. Electrron., L
Liyan Qu (S’05–M M’08) received thhe B.Eng. (with
vol. 58, no. 1, pp.29-36,
p Jan. 2011 1. thhe highest distinnction) and M.Enng. degrees in
[222] E. Koutroulis an nd K. Kalaitzakis, “Design of a max ximum power track king ellectrical engineerring from Zhejiaang University,
system for wiind-energy-converrsion application,”” IEEE Trans. Ind. I H
Hangzhou, China, in 1999 and 20002, respectively,
Electron., vol. 53,
5 no. 2, pp. 486-4 494, Apr. 2006. annd the Ph.D. degrree in electrical enngineering from
[23] R. S. Sutton and d A. G. Barto, Reiinforcement Learn ning: An Introducttion. thhe University of Illinois at Urbaana–Champaign,
Cambridge, MA A: MIT Press, 1998 8. U
USA, in 2007.
[244] R. S. Sutton, “LLearning to predict by the methods off temporal differen nces,” From 2007 to 2009, she was an Application
Machine Learning, vol. 3, pp. 9-4 44, 1988. E
Engineer with Ansooft Corporation, Irrvine, CA, USA.
[25] C. Szepesvári, “Algorithms fo or reinforcement learning,” Synth hesis S ince January 20100, she has been withh the University
Lectures on Artiificial Intelligencee and Machine Learning, vol. 4, no. 1,1 pp. of Nebrasska—Lincoln (UN NL), where she is ccurrently an Assisttant Professor in
1-103, 2010. the Deparrtment of Electricaal and Computer EEngineering. Her reesearch interests
[266] C. J. C. H. Watkkins and P. Dayan, “Q-learning,” Machine Learning, vo ol. 8, include ennergy efficiency, rrenewable energy, numerical analysiis and computer
pp. 279-292, 19 992. aided desiign of electric macchinery and powerr electronic devices, dynamics and
[277] L. Bușoniu, R. Babuška
B and B. D.
D Schutter, “A com mprehensive survey of control oof electric machinnery, permanent-m magnet machines,, and magnetic
multi-agent rein nforcement learniing,” IEEE Transs. Systems, Man and materials..
Cybernetics. Part C: Applicatio ons and Reviews, vol. 38, no. 2, pp.
156-172, Mar. 2008.
2
[28] E. Even-Dar and a Y. Mansour, “Learning rates for Q-learning,” The
Journal of Machine Learning Ressearch, vol. 5, pp. 1-25, Dec. 2004.