This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIE.2015.2420792, IEEE Transactions on Industrial Electronics.
0278-0046 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

IEEE TRANSACTIONS ON INDUSTRIAL ELECTRONICS
optimum relationship obtained from one MPP search has errors, especially when the learned MPP is not correct.

This paper proposes a novel online reinforcement learning (RL)-based intelligent MPPT method for variable-speed WECS, as shown in Fig. 1. The method consists of two processes: an online learning process and an online application process. In the online learning process, the controller of the WECS behaves like an agent that interacts with the environment to learn the MPPs from its own experience using a model-free Q-learning algorithm. The optimum rotor speed-electrical output power curve is then obtained from these MPPs and used for fast MPPT control of the WECS in the online application process. Compared to the traditional TSR and ORB methods, the proposed MPPT method requires no wind speed sensors, no prior knowledge of the WECS, and no tedious offline design process. Compared to the P&O methods, the RL enables the agent to reach the MPP faster at a wind speed that has been experienced before and to respond quickly to wind speed variations after learning. In addition, it is not necessary to record the MPPs over the whole operating range, as in [19], to form the optimum curve; only the learned MPPs are needed. Thus, the learning time is shortened compared to that in [19]. Moreover, the estimation error of the optimum curve obtained from the proposed method is smaller than that of the method using only one MPP. The proposed RL-based MPPT method is validated by simulation results on a 1.5-MW doubly-fed induction generator (DFIG)-based WECS and experimental results on a 200-W emulated permanent-magnet synchronous generator (PMSG)-based WECS, which are the two most commonly used types of variable-speed WECSs.

II. BACKGROUND OF REINFORCEMENT LEARNING

Unlike supervised learning, in which an agent learns from examples provided by an external supervisor, in RL an agent learns from its own experience by directly interacting with the environment through actions, states, and rewards (Fig. 1). The agent receives a reward each time it takes an action to transit from one state to another. The objective of the RL is to form a mapping from states to actions that maximizes the rewards [23].

Most RL theory deals with the finite Markov decision process (MDP). A finite MDP can be described by a tuple [S, A, P, R], where S is the set of states, A is the set of actions, P is the state transition probability function that gives the probability of a transition from a state s to another state s' when an action a is taken, and R is the reward function that determines the reward after the state transition.

At each time step t, the agent receives the state signal st ∈ S that represents the state of the environment and selects an action at ∈ A based on a mapping called an action selection policy, denoted p. For example, p(st, at) is the probability that the action at is chosen in the state st. One time step later, in part as a result of the action at, the state changes from st to st+1 ∈ S according to P, and the agent receives a numerical reward rt+1 based on R, which evaluates the immediate effect of the action at on this state transition.

In RL, the agent's goal is to find an action selection policy that maximizes the total discounted rewards (expected return) it receives over the future [23]. The expected return is described by a state value function or an action value function. The estimation of the value function is the core of almost all RL algorithms.

There are three basic classes of methods for solving the RL problem: model-based dynamic programming, the model-free Monte Carlo method, and model-free temporal-difference (TD) learning [23]-[25]. Q-learning is a form of model-free RL algorithm based on TD and has various industrial applications, such as optimal control and multi-agent RL systems [26], [27].

The one-step form of Q-learning is defined by

Qt+1(st, at) = Qt(st, at) + lt [rt+1 + γ max_ai Qt(st+1, ai) − Qt(st, at)]   (1)

where γ ∈ [0,1) is the discount factor that determines the current values of the rewards to be received in the future, and i is the index of the action in the action space. Qt(st, at) is the action value function (Q value function) that needs to be estimated. At each time step, the agent first observes its current state st and selects an action at to perform. At the same time, the Q value Qt(st, at) is remembered. Then, the subsequent state st+1 is observed with an immediate reward rt+1, and the maximum Q value corresponding to st+1, max_ai Qt(st+1, ai), is picked out using the method described below. The recorded Qt(st, at) is then updated according to (1). The parameter lt ∈ (0,1] is the learning rate, which determines how far the currently estimated Qt(st, at) is adjusted toward the newly estimated Q value rt+1 + γ max_ai Qt(st+1, ai). To ensure that the Q value function converges to the optimal value, all the state-action pairs should be updated. This means that the agent should try all the actions in all states, striking a trade-off between exploration and exploitation [27]. One method to achieve this is to use Boltzmann exploration as the action selection policy, in which the action ai in the state s is chosen with the following probability:

p(s, ai) = e^(Q(s, ai)/τ) / Σ_ai e^(Q(s, ai)/τ)   (2)
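As a concrete illustration, the one-step update (1) and the Boltzmann policy (2) can be sketched in a few lines of Python. This is a minimal toy sketch, not the paper's implementation: the 2-state, 3-action Q-table and the values lr = 0.3, γ = 0.75, and τ = 0.8 are illustrative assumptions (the paper's tuned values appear later in Table I).

```python
import math
import random

def boltzmann_probs(q_row, tau):
    """Selection probabilities p(s, ai) = e^(Q(s,ai)/tau) / sum_ai e^(Q(s,ai)/tau), as in (2)."""
    exps = [math.exp(q / tau) for q in q_row]
    z = sum(exps)
    return [e / z for e in exps]

def select_action(q_row, tau):
    """Sample an action index according to the Boltzmann distribution."""
    return random.choices(range(len(q_row)), weights=boltzmann_probs(q_row, tau))[0]

def q_update(Q, s, a, r, s_next, lr, gamma):
    """One-step update (1): Q(s,a) += lr * [r + gamma * max_a' Q(s',a') - Q(s,a)]."""
    td_target = r + gamma * max(Q[s_next])
    Q[s][a] += lr * (td_target - Q[s][a])

# Toy usage: 2 states x 3 actions ("increment", "decrement", "stay").
Q = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
a = select_action(Q[0], tau=0.8)
q_update(Q, s=0, a=a, r=1.0, s_next=1, lr=0.3, gamma=0.75)
```

With an all-zero Q-table, the policy is uniform over the three actions, and a single update with reward 1.0 moves the chosen entry to lr × 1.0 = 0.3, showing how the TD error drives the estimate toward the target.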
D. Discount Factor, Learning Rate, and Temperature

The discount factor γ in (1) controls the discounted values of the future rewards. In the Q-learning for the MPPT application, the agent tries to maximize the total future rewards instead of the reward received immediately.

To ensure the convergence of the Q-learning, a criterion should be considered when selecting the learning rate in (1): a larger (smaller) learning rate should be used for a state that the agent has experienced fewer (more) times. In this paper, each state skj is associated with a learning rate lkj, which is calculated as follows based on a parameter Nkj representing the number of times that the state has been experienced:

lkj = k1 / (k2 + k3 Nkj)   (9)

The larger the parameter Nkj, the more likely the action value has converged to the optimum for that state, and the smaller the learning rate lkj.

The temperature τ in (2) also decreases during the learning process, which can likewise be controlled by using Nkj. Initially, the agent is encouraged to explore in the action space. Therefore, even the actions that receive negative rewards should be selected with certain random probabilities. For instance, when the WECS operates at the point A (Fig. 2) at a given wind speed, a "decrement" action will lead to a positive reward if the wind speed does not change. However, this action will receive a negative reward if the wind speed decreases suddenly, which makes this "good" action a "bad" one. To minimize the effect caused by wind speed variations, all the actions should be selected randomly before the learned experience can be exploited, which is controlled by τ. For example, τ can be expressed by

τ = τmin + (1 − Nkj/Nmax)(τmax − τmin), if Nkj ≤ Nmax
τ = τmin, if Nkj > Nmax   (10)

where τmin and τmax are the minimum and maximum values of the temperature, respectively, and Nmax is the number of times a state has to be visited after which the learned experience of the state is considered "mature" and can be exploited.

IV. INTELLIGENT RL-BASED MPPT ALGORITHM

The proposed intelligent RL-based MPPT algorithm consists of two processes: an online learning process and an online application process, as shown in Fig. 1. In the online learning process, a map from state to action is learned. The learned action values are stored in a Q-table, based on which the relationship between the optimal ωr and Pe can be derived. In the application process, the WECS is controlled using the learned optimum relationship.

A. The Learning Process

In the learning process, the agent learns how to behave at each discrete state (operating point). As described in Section III-B, in each state three actions can be taken. The agent will take an "increment" action if the MPP locates on the right of the state, a "decrement" action if the MPP locates on the left of the state, and a "stay" action if the MPP lies in the current state. The action with the highest value is the best for that state and should be selected to move the operating point towards the MPP. Initially, when a state is detected, all three actions should be selected randomly regardless of their action values. As the temperature τ decreases, the action with a higher value will be selected with a higher probability. After a certain period of learning, for example, when Nkj reaches a set value, the action with the highest value will be selected.

For each state, if the "stay" action has the highest value, it means that the MPP for the specific wind speed locates in that state, because either an "increment" or a "decrement" action will lead to a negative reward that decreases the action value. Therefore, each time such a state is found, the corresponding MPP (ωr and Pe) is recorded. The learned MPPs are used to form the optimum curve of ωr and Pe for the ORB MPPT control of the WECS in the online application process.

The flow chart of the proposed RL-based MPPT algorithm is shown in Fig. 3 and the details are explained as follows.

1) Q-table Initialization: Online learning is a process of updating the learning experience represented by the action values stored in the Q-table. Since each state has three actions, the total number of action values stored in the Q-table is 3 × N × M. Better initial Q values make the Q-learning algorithm converge faster, thus requiring less time for online learning. In real-world applications, to shorten the online learning time, a prelearning can be performed offline by simulations first, where the initial Q values can simply be set to zero. The prelearned Q values are then used to constitute an initial Q-table for the online learning algorithm.

2) Implementation of the Q-learning Algorithm: After the Q-table is initialized, the agent starts to learn online based on the Q-learning algorithm. In each cycle of learning, ωr and Pe are acquired to determine the indices k and j of the state st (the location of st) in the Q-table, and the parameter Nkj is increased by 1 to calculate the learning rate and temperature using (9) and (10), respectively. Then an action at is selected from the action space defined by (6) using the Boltzmann exploration. The corresponding action value Qt(st, at) is selected from the Q-table and will be updated by the end of the current sampling interval. The sampling interval should be larger than the response time of the speed control loop to obtain a correct electrical power measurement [20]. At the beginning of the next sampling interval, a new state st+1 is observed, and the difference of the two successive electrical power measurements is used to calculate the reward rt+1 by (8). The maximum action value in st+1 is picked out and expressed as max Qt(st+1, a′). The value of Qt(st, at) is then updated by (1) in the Q-table, which shows how the agent learns from its own experience in one learning cycle.

The Boltzmann exploration is implemented as follows:
(1) Calculate the probability p(s, ai) for selecting each action ai (i = 1, 2, 3) by (2).
(2) Randomly generate a number ρ between 0 and 1.
(3) Select an action based on ρ: if ρ is larger than p(s, a1), the first action is selected; if ρ is between p(s, a1) and
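The visit-dependent schedules (9) and (10) can be sketched as follows. The default constants k1 = 10, k2 = 25, k3 = 0.6, τmin = 0.08, τmax = 0.8, and Nmax = 60 are taken from the experiment column of Table I and the text, but are otherwise placeholders that would be retuned per system.

```python
def learning_rate(n_visits, k1=10.0, k2=25.0, k3=0.6):
    """Learning rate (9): l_kj = k1 / (k2 + k3 * N_kj), decaying with state visits."""
    return k1 / (k2 + k3 * n_visits)

def temperature(n_visits, tau_min=0.08, tau_max=0.8, n_max=60):
    """Temperature schedule (10): linear cooling from tau_max to tau_min,
    then held at tau_min once the state's experience is "mature" (N_kj > N_max)."""
    if n_visits > n_max:
        return tau_min
    return tau_min + (1.0 - n_visits / n_max) * (tau_max - tau_min)
```

A never-visited state starts at τ = τmax (near-random exploration) and a fully matured state sits at τ = τmin (near-greedy exploitation), which matches the intent described above: explore first, exploit once Nkj reaches Nmax.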
TABLE I
PARAMETERS OF THE RL ALGORITHM USED IN SIMULATION AND EXPERIMENT

discount factor γ in (1): 0.75 | 0.75
temperature τ in (2): 0.1 | Equation (10)
τmin in (10): --- | 0.08/0.1
τmax in (10): --- | 0.8
N in (5): 25/30 | 25/30
M in (5): 46/50 | 46/50
Δωr in (6): 0.02 p.u./10 rad/s | 0.02 p.u./10 rad/s
δ1 in (8): 0.0007 MW/0.8 W | 0.0007 MW/0.8 W
δ2 in (11): --- | 0.05
learning rate in (9): 0.3 | k1 = 10, k2 = 25, k3 = 0.6
Nmax in (12): --- | 60/25

Fig. 5. Simulation result of the wind turbine power coefficient in the learning process of the RL-based MPPT control in a constant wind speed condition. (Plot: actual Cp and optimal Cp versus time, 0-300 s.)

TABLE II
SELECTED STATES AND CORRESPONDING ACTION VALUES FROM Q-TABLE IN CONSTANT WIND SPEED CONDITION

ωr (p.u.) | Pe (MW) | Increment | Decrement | Stay | Nkj
0.91 | 0.8625 | 0.7992 | 0 | 0 | 4
0.93 | 0.8625 | 0 | -0.0256 | 0.0258 | 13
0.95 | 0.8625 | -0.34 | 0.0058 | 0 | 18
0.97 | 0.8625 | -0.2325 | 0.6570 | 0 | 6
0.99 | 0.8625 | -0.2325 | 0.51 | 0 | 4

should increase to approach the MPP each time this state is visited.

2) Learning with Step-Changed Wind Speed: In this test, a wind speed profile with step changes alternately between 9.5 m/s and a higher wind speed is used. Even after the MPPs have been found (e.g., at 120 s and 180 s), the agent still explores. The agent learns this wind speed condition in about 480 s; after that, it approaches the MPP directly each time the wind speed changes to a speed that has been experienced before. Some selected action values are shown in Table III, which indicates that the two MPPs have been found for the two wind speed conditions. The learned MPPs are then used to generate the optimum ωr−Pe curve for the ORB MPPT control.

Fig. 6. Simulation results of the WECS in the learning process of the RL-based MPPT control during step changes in wind speed. (Plot: actual Cp, optimal Cp, and wind speed versus time, 0-900 s.)

TABLE III
SELECTED STATES AND CORRESPONDING ACTION VALUES FROM Q-TABLE IN STEP-CHANGED WIND SPEED CONDITION

ωr (p.u.) | Pe (MW) | Increment | Decrement | Stay | Nkj
0.97 | 1.0375 | 0.0469 | -0.3 | 0 | 7
0.99 | 1.0375 | -0.348 | 0.007 | 0.2919 | 44
1.01 | 1.3875 | -0.1223 | -0.4425 | -0.4718 | 21

The online learning algorithm is tested for 18 hours, and the simulation results for the last 6 hours are shown in Fig. 7, during which the agent tries to follow the maximum Cp from the learned experience. However, due to the sudden, large-magnitude changes in the wind speed, magnitude drops are observed in the Cp curve. The learned states that contain MPPs are shown in Table IV. In this online learning test, the value of Nmax in (12) is set to 60. As shown in Table IV, the minimum value of Nkj is 60, which satisfies the learning stopping criterion (12) for each MPP. In practice, a smaller value of Nmax, e.g., 20, can be used and, therefore, the online
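The MPP-extraction rule described above (record a state as an MPP when its "stay" action carries the highest value) can be sketched as below. The sample rows are the first three states of Table II; the dictionary-based state encoding is an illustrative assumption, not the paper's data structure.

```python
ACTIONS = ("increment", "decrement", "stay")

def learned_mpps(q_table):
    """Return (wr, Pe) pairs for states where the "stay" action has the highest Q value."""
    mpps = []
    for (wr, pe), q_values in q_table.items():
        if max(range(3), key=lambda i: q_values[i]) == 2:  # index 2 = "stay"
            mpps.append((wr, pe))
    return mpps

# Selected states from Table II: (wr [p.u.], Pe [MW]) -> [Q_inc, Q_dec, Q_stay]
q_table = {
    (0.91, 0.8625): [0.7992, 0.0, 0.0],
    (0.93, 0.8625): [0.0, -0.0256, 0.0258],
    (0.95, 0.8625): [-0.34, 0.0058, 0.0],
}
```

Applied to these rows, only the state at ωr = 0.93 p.u. qualifies, since "stay" dominates there while its neighbors favor "increment" or "decrement", consistent with the Table II discussion.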
learning process can be shorter than 18 hours for this WECS. The optimum ωr−Pe curve obtained by curve fitting on the recorded MPPs is in the form of (13) or (14), where Kopt is 1.07.

4) Application Process: The obtained optimum ωr−Pe curve is used for ORB MPPT control in the application process. Some typical simulation results are shown in Fig. 8, where the wind speed [Fig. 8(a)] varies in the range between 6 m/s and 9 m/s, which is different from that used for learning. It can be seen that fast MPPT under a randomly varying wind condition is achieved by using the ORB control expressed by (14). During the wind variation, the actual value of Cp tracks the maximum value well [Fig. 8(b)]. Compared to the theoretical wind power output calculated by (3), the output electrical power of the DFIG is smoothed [Fig. 8(c)]. The deviations between the actual and maximum Cp, and between the actual output electrical power and the theoretical maximum wind power, are caused by the fast wind speed variations and the relatively slow response of the WECS due to its large system inertia. Specifically, when the wind speed increases, the instantaneous wind power is larger than the output electrical power. As a result, the WECS accelerates and the extra wind power is stored in the shaft as kinetic energy. On the other hand, when the wind speed decreases, the instantaneous wind power is less than the output electrical power. As a result, the kinetic energy is released to balance the input wind power and the output electrical power, leading to a deceleration of the WECS.

TABLE IV
SELECTED STATES AND CORRESPONDING ACTION VALUES FROM Q-TABLE IN RANDOMLY VARYING WIND SPEED CONDITION
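The curve fitting on the recorded MPPs can be sketched as follows. Equations (13) and (14) are not reproduced in this excerpt, so the standard ORB relation Pe = Kopt·ωr³ is assumed here, with a closed-form least-squares fit for Kopt; the sample MPP list is illustrative, not the paper's data.

```python
def fit_kopt(mpps):
    """Least-squares Kopt for the assumed ORB curve Pe = Kopt * wr^3,
    fitted over recorded (wr, Pe) MPP pairs: Kopt = sum(Pe*wr^3) / sum(wr^6)."""
    num = sum(pe * wr**3 for wr, pe in mpps)
    den = sum(wr**6 for wr, _ in mpps)
    return num / den

def orb_power_reference(wr, kopt):
    """ORB MPPT in the application process: power reference from measured rotor speed."""
    return kopt * wr**3

mpps = [(0.93, 0.80), (1.00, 1.00), (1.06, 1.19)]  # illustrative (wr, Pe) pairs
kopt = fit_kopt(mpps)
```

Once Kopt is fitted, the application process reduces to evaluating `orb_power_reference` at each measured rotor speed, which is why its runtime cost matches the plain ORB method.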
Fig. 8. Simulation results of the WECS in the application process of the RL-based MPPT control. (a) Wind speed, (b) wind turbine power coefficient, and (c) maximum wind power and DFIG electrical power.

Fig. 10. Experimental system setup.
Fig. 11. It takes about 150 s for the agent to complete exploration and stay at the MPP based on the learned experience.

2) Learning with Step-Changed Wind Speed: A wind speed profile with step changes alternately between 7 m/s and 8 m/s every 30 s is used to verify that the agent is able to learn from experience; the results are shown in Fig. 12. Initially, the agent is "naive," and a large drop of Cp is observed during this learning process. This is caused by a wrong-direction exploration. However, the agent is able to change its action to the correct direction, which verifies its learning ability even from a wrong experience. After a 15-minute online learning process, the agent is able to obtain the optimal parameter Kopt based on the learned MPPs. The Kopt for the optimum ωr (rad/s)−Pe (W) curve obtained by curve fitting is 5.1×10⁻⁶.

3) The Application Process: The optimal Kopt is used in the application process for the ORB MPPT control of the WECS. During the experiment, the wind speed varies in the range between 5 m/s and 8 m/s. The results in Fig. 13 again verify that the obtained ORB control has fast MPPT performance under various wind conditions. The actual value of Cp tracks the maximum value well. Moreover, the electrical power of the PMSG tracks the theoretical maximum wind power well with no obvious deviations, owing to a small system inertia. These results also prove that the optimal Kopt learned in this experiment is accurate.

Fig. 11. Experimental result of the power coefficient of the emulated wind turbine in the learning process of the RL-based MPPT control in a constant wind speed condition.

Fig. 12. Experimental results of the WECS emulator in the learning process of the RL-based MPPT control during step changes in wind speed.

Fig. 13. Experimental results of MPPT control of the WECS in the application process of the RL-based method. (a) Wind speed, (b) wind turbine power coefficient, and (c) maximum wind power and PMSG electrical power.

4) Comparative Studies: To further demonstrate the advantage of the proposed RL-based MPPT method, it is compared with the traditional P&O MPPT method for two hours using a measured wind speed profile shown in Fig. 14(a). Fig. 14(b) shows that the electrical power generated by the PMSG using the proposed RL-based MPPT method is higher than that using the P&O method, particularly when the wind speed has abrupt changes. The ORB MPPT control obtained from the RL has a fast MPPT capability under the varying wind speed condition. On the contrary, the P&O method searches for the MPP from the beginning each time the wind speed changes, which results in a slower convergence to the MPP. Fig. 14(c) compares the energies produced by the WECS using the two different MPPT methods over the two hours. The results show that the WECS produces more energy when using the proposed RL-based MPPT method. At the end of the test, the WECS using the RL-based MPPT method produces 0.4 MJ of energy, which is 5.6% more than the 0.379 MJ produced by the WECS using the P&O MPPT method. The results also show that the longer the WECS with the RL-based MPPT control is operated, the more energy it produces relative to the WECS with the P&O MPPT control.

The proposed RL-based MPPT method is further compared with the TSR, ORB, and P&O MPPT methods in terms of design requirement, online learning ability, and computational costs (time and memory). All of the methods are implemented in MATLAB/Simulink running on the same desktop computer. The results are compared in Table V. Compared with the other
three MPPT methods, the proposed method learns the wind turbine characteristic online without the need for prior knowledge of the WECS. The online learning algorithm is universally applicable to any WECS. Another benefit of the proposed RL-based MPPT method is that when the wind turbine characteristic changes due to the aging of the system, the proposed method can relearn the wind turbine characteristic and generate a new value of Kopt to adapt to the change. In the experiment, the learning process of the RL-based MPPT method needs 40 kilobytes (kB) of memory to store a 3×30×50 matrix of action values and the other parameters that will be used in the next step, including the speed, power, state index, and Nkj in (12); the computational time of the proposed online learning algorithm in each step is only 0.045 ms. The memory requirement and computational time of the proposed method are low, making it suitable for real system applications. After the MPPs are learned, the optimal parameter Kopt is obtained and the RL-based MPPT control is switched from the online learning process to the online application process, which essentially is an ORB MPPT control. As shown in Table V, the computational costs (memory and computational time) of the proposed RL-based MPPT control during the online application are the same as those of the ORB method and lower than those of the TSR and P&O MPPT methods.

Fig. 14. Comparison of the WECS controlled by the P&O MPPT method and the proposed RL-based MPPT method. (a) Wind speed, (b) electrical power generated by the PMSG, and (c) energy produced by the WECS.

TABLE V
COMPARISON OF DIFFERENT MPPT METHODS

MPPT method | Need prior knowledge of WECS | Online learning ability | Computational time (ms) | Required memory (kB)
TSR | Yes | No | 0.042 | 0.483
ORB | Yes | No | 0.003 | 0.178
P&O | No | No | 0.015 | 0.181
RL | No | Yes | 0.003 | 0.178

VII. CONCLUSION

This paper has proposed a novel intelligent RL-based MPPT algorithm for variable-speed WECSs. A model-free Q-learning

APPENDIX

TABLE A.I
PARAMETERS OF THE DFIG-BASED WECS USED IN SIMULATION STUDY

Rated power | 1.5 MW
Rated stator voltage | 690 V
Rated dc-bus voltage | 1.2 kV
Stator resistance | 0.007 pu
Wound rotor resistance | 0.009 pu
Magnetizing inductance | 2.9 pu
Stator leakage inductance | 0.171 pu
Wound rotor leakage inductance | 0.156 pu
DFIG inertia constant | 0.62 s
Turbine inertia constant | 3.8 s

TABLE A.II
PARAMETERS OF THE EMULATED PMSG-BASED WECS

DC motor: Rated speed 3500 RPM; Rated power 200 W; Back EMF constant 8.7 V/kRPM; Stator resistance 0.39 Ω; Armature inductance 0.67 mH
PMSG: Rated speed 3000 RPM; Rated power 200 W; Number of poles 8; Back EMF constant 9.5 V/kRPM; Stator resistance 0.233 Ω; d-axis inductance 0.275 mH; q-axis inductance 0.364 mH

REFERENCES

[1] Y. Zhao, C. Wei, Z. Zhang, and W. Qiao, "A review on position/speed sensorless control for permanent magnet synchronous machine-based wind energy conversion systems," IEEE Journal of Emerging and Selected Topics in Power Electron., vol. 1, no. 4, pp. 203-216, Dec. 2013.
[2] Y. Xia, K.H. Ahmed, and B.W. Williams, "Wind turbine power coefficient analysis of a new maximum power point tracking technique," IEEE Trans. Ind. Electron., vol. 60, no. 3, pp. 1122-1132, Mar. 2013.
[3] M.A. Abdullah, A.H.M. Yatim, C.W. Tan, and R. Saidur, "A review of maximum power point tracking algorithms for wind energy systems," Renewable & Sustainable Energy Review, vol. 16, no. 5, pp. 3220-3227, Jun. 2012.
[4] J. Chen, J. Chen, and C. Gong, "On optimizing the transient load of variable-speed wind energy conversion system during the MPP tracking process," IEEE Trans. Ind. Electron., vol. 61, no. 9, pp. 4698-4706, Sep. 2014.
[5] H. Li, K.L. Shi, and P.G. McLaren, "Neural-network-based sensorless maximum wind energy capture with compensated power coefficient," IEEE Trans. Ind. Appl., vol. 41, no. 6, pp. 1548-1556, Nov./Dec. 2005.
[6] W. Qiao, W. Zhou, J.M. Aller, and R.G. Harley, "Wind speed estimation based sensorless output maximization control for a wind turbine driving a DFIG," IEEE Trans. Power Electron., vol. 23, no. 3, pp. 1156-1169, May 2008.
[7] W. Qiao, X. Yang, and X. Gong, "Wind speed and rotor position sensorless control for direct-drive PMG wind turbines," IEEE Trans. Ind. Appl., vol. 48, no. 1, pp. 3-11, Jan./Feb. 2012.
[8] A.G. Abo-Khalil and D.-C. Lee, "MPPT control of wind generation systems based on estimated wind speed using SVR," IEEE Trans. Ind. Electron., vol. 55, no. 3, pp. 1489-1490, Mar. 2008.
[9] M. Pucci and M. Cirrincione, "Neural MPPT control of wind generators with induction machines without speed sensors," IEEE Trans. Ind. Electron., vol. 58, no. 1, pp. 37-47, Jan. 2011.
[10] M.N. Soltani, T. Knudsen, M. Svenstrup, R. Wisniewski, P. Brath, R. Ortega, and K. Johnson, "Estimation of rotor effective wind speed: A comparison," IEEE Trans. Control Systems Technology, vol. 21, no. 4, pp. 1155-1167, Jul. 2013.
[11] R. Cardenas, R. Pena, S. Alepuz, and G. Asher, "Overview of control systems for the operation of DFIGs in wind energy applications," IEEE Trans. Ind. Electron., vol. 60, no. 7, pp. 2776-2798, Jul. 2013.
[12] H. Camblong, I. Martinez de Alegria, M. Rodriguez, and G. Abad, "Experimental evaluation of wind turbines maximum power point tracking controllers," Energy Conversion & Management, vol. 47, no. 18/19, pp. 2846-2858, Nov. 2006.
[13] S. Morimoto, H. Nakayama, M. Sanada, and Y. Takeda, "Sensorless output maximization control for variable-speed wind generation system using IPMSG," IEEE Trans. Ind. Appl., vol. 41, no. 1, pp. 60-67, 2005.
[14] R. Cardenas and R. Pena, "Sensorless vector control of induction machines for variable-speed wind energy applications," IEEE Trans. Energy Conversion, vol. 19, no. 1, pp. 196-205, Mar. 2004.
[15] A. Tapia, G. Tapia, J.X. Ostolaza, and J.R. Saenz, "Modeling and control of a wind turbine driven doubly fed induction generator," IEEE Trans. Energy Conversion, vol. 18, no. 2, pp. 194-204, Jun. 2003.
[16] A. Mirecki, X. Roboam, and F. Richardeau, "Architecture complexity and energy efficiency of small wind turbines," IEEE Trans. Ind. Electron., vol. 54, no. 1, pp. 660-670, Feb. 2007.
[17] Y. Zou, M. Elbuluk, and Y. Sozer, "Stability analysis on maximum power points tracking (MPPT) method in wind power system," IEEE Trans. Ind. Appl., vol. 49, no. 3, pp. 1129-1136, May/Jun. 2013.
[18] B. Shen, B. Mwinyiwiwa, Y. Zhang, and B.T. Ooi, "Sensorless maximum power point tracking of wind by DFIG using rotor position phase lock loop (PLL)," IEEE Trans. Power Electron., vol. 24, no. 4, pp. 942-951, 2009.

Chun Wei (S'12) received a B.S. degree in electrical engineering from Beijing Jiaotong University, Beijing, China, in 2009, and an M.S. degree in electrical engineering from North China Electric Power University, Beijing, China, in 2012. Currently, he is working toward the Ph.D. degree in the Department of Electrical and Computer Engineering at the University of Nebraska—Lincoln, Lincoln, NE, USA.
His research interests include wind energy conversion systems, power electronics, motor drives, and computational intelligence.

Zhe Zhang (S'10) received a B.S. degree in electrical engineering from Xi'an Jiaotong University, Xi'an, China, in 2010. Currently, he is working toward a Ph.D. degree in electrical engineering at the University of Nebraska—Lincoln, Lincoln, NE, USA.
He worked as an electrical engineering intern at Rockwell Automation in summer 2014. His current research interests include control of wind energy conversion systems, power electronics, and motor drives.

Wei Qiao (S'05–M'08–SM'12) received B.Eng. and M.Eng. degrees in electrical engineering from Zhejiang University, Hangzhou, China, in 1997 and 2002, respectively, an M.S. degree in high performance computation for engineered systems from the Singapore-MIT Alliance (SMA) in 2003, and a Ph.D. degree in electrical engineering from the Georgia Institute of Technology, Atlanta, in 2008.
Since August 2008, he has been with the University of Nebraska—Lincoln (UNL), USA, where he is currently an Associate Professor in the Department of Electrical and Computer Engineering. His research interests include renewable energy systems, smart grids, condition monitoring, power electronics, electric machines and drives, and computational intelligence. He is the author or coauthor of 3 book chapters and more than 150 papers in refereed journals and conference proceedings.
Dr. Qiao is an Editor of the IEEE Transactions on Energy Conversion, an Associate Editor of IET Power Electronics and the IEEE Journal of Emerging
Apr. 2009. and Seleccted Topics in Pow wer Electronics, annd the Correspondiing Guest Editor
[199] Q. Wang and L. Chang, “An intelligenti maximu um power extraction of a speciial section on Conddition Monitoring,, Diagnosis, Prognnosis, and Health
algorithm for innverter-based variaable speed wind tu urbine systems,” IE EEE Monitorinng for Wind Energgy Conversion Sysstems of the IEEE Transactions on
Trans. Power Electron.,
E vol. 19, no.
n 5, pp. 1242-1249, Sept. 2004. Industriall Electronics. He w
was an Associate E Editor of the IEEE Transactions on
[200] R. Datta and V.T.
V Ranganathan, “A method of traccking the peak po ower Industry AApplications fromm 2010 to 2013. Hee was the recipientt of a 2010 U.S.
points for a variiable speed wind energy
e conversion system,” IEEE Tra ans. National Science Foundatiion CAREER Aw ward and the 20100 IEEE Industry
Energy Converssion, vol. 18, no. 1, 1 pp. 163-168, Maar. 2003. Applicatioons Society Andreew W. Smith Outsstanding Young M Member Award.
[21] S.M. Raza Kazm mi, H. Goto, Hai-JJiao. Guo, and O. Ichinokura, “A no ovel
algorithm for faast and efficient speed-sensorless maximum
m power point
tracking in wind d energy conversio on systems,” IEEE Trans. Ind. Electrron., L
Liyan Qu (S’05–M M’08) received thhe B.Eng. (with
vol. 58, no. 1, pp.29-36,
p Jan. 2011 1. thhe highest distinnction) and M.Enng. degrees in
[222] E. Koutroulis an nd K. Kalaitzakis, “Design of a max ximum power track king ellectrical engineerring from Zhejiaang University,
system for wiind-energy-converrsion application,”” IEEE Trans. Ind. I H
Hangzhou, China, in 1999 and 20002, respectively,
Electron., vol. 53,
5 no. 2, pp. 486-4 494, Apr. 2006. annd the Ph.D. degrree in electrical enngineering from
[23] R. S. Sutton and d A. G. Barto, Reiinforcement Learn ning: An Introducttion. thhe University of Illinois at Urbaana–Champaign,
Cambridge, MA A: MIT Press, 1998 8. U
USA, in 2007.
[244] R. S. Sutton, “LLearning to predict by the methods off temporal differen nces,” From 2007 to 2009, she was an Application
Machine Learning, vol. 3, pp. 9-4 44, 1988. E
Engineer with Ansooft Corporation, Irrvine, CA, USA.
[25] C. Szepesvári, “Algorithms fo or reinforcement learning,” Synth hesis S ince January 20100, she has been withh the University
Lectures on Artiificial Intelligencee and Machine Learning, vol. 4, no. 1,1 pp. of Nebrasska—Lincoln (UN NL), where she is ccurrently an Assisttant Professor in
1-103, 2010. the Deparrtment of Electricaal and Computer EEngineering. Her reesearch interests
[266] C. J. C. H. Watkkins and P. Dayan, “Q-learning,” Machine Learning, vo ol. 8, include ennergy efficiency, rrenewable energy, numerical analysiis and computer
pp. 279-292, 19 992. aided desiign of electric macchinery and powerr electronic devices, dynamics and
[277] L. Bușoniu, R. Babuška
B and B. D.
D Schutter, “A com mprehensive survey of control oof electric machinnery, permanent-m magnet machines,, and magnetic
multi-agent rein nforcement learniing,” IEEE Transs. Systems, Man and materials..
Cybernetics. Part C: Applicatio ons and Reviews, vol. 38, no. 2, pp.
156-172, Mar. 2008.
2
[28] E. Even-Dar and a Y. Mansour, “Learning rates for Q-learning,” The
Journal of Machine Learning Ressearch, vol. 5, pp. 1-25, Dec. 2004.