Article

Deep Reinforcement Learning-Based Differential Game Guidance Law against Maneuvering Evaders

School of Automation Science and Engineering, Xi’an Jiaotong University, Xi’an 710049, China
* Author to whom correspondence should be addressed.
Aerospace 2024, 11(7), 558; https://doi.org/10.3390/aerospace11070558
Submission received: 8 May 2024 / Revised: 28 June 2024 / Accepted: 4 July 2024 / Published: 6 July 2024
(This article belongs to the Special Issue Dynamics, Guidance and Control of Aerospace Vehicles)

Abstract

To achieve the intelligent interception of different types of maneuvering evaders, a novel intelligent differential game guidance law based on deep reinforcement learning is proposed in the continuous action domain. Unlike traditional guidance laws, the proposed guidance law avoids tedious manual tuning and reduces design effort. First, the interception problem is transformed into a pursuit–evasion game, which is solved by zero-sum differential game theory. Next, the Nash equilibrium strategy is obtained through a Markov game process. To implement the proposed intelligent differential game guidance law, an actor–critic neural network based on the deep deterministic policy gradient is constructed to calculate the saddle point of the differential game guidance problem. Then, a reward function is designed that trades off guidance accuracy, energy consumption, and interception time. Finally, compared with traditional methods, the interception accuracy of the proposed intelligent differential game guidance law is 99.2%, energy consumption is reduced by 47%, and simulation time is shortened by 1.58 s. All results reveal that the proposed intelligent differential game guidance law has better intelligent decision-making ability.

1. Introduction

With the development of intelligent information technology, advanced guidance laws play a crucial role in accurate interception missions. Furthermore, when confronting different types of maneuvering evaders, it is crucial for the pursuer to rapidly perceive the environment and generate a favorable, accurate, and effective guidance strategy. Therefore, intelligent guidance laws for intercepting maneuvering evaders have attracted the attention of many scholars.
Traditional guidance laws include proportional navigation guidance (PNG) [1], augmented proportional navigation guidance (APNG) [2], optimal guidance laws (OGLs) [3], and sliding mode control (SMC) guidance laws [4]. However, those methods are not intelligent and need many manual settings. Additionally, the evader is treated as a stationary object. As is well known, intercepting maneuvering evaders involves a game process known as the classical pursuit–evasion game. For solving the interception problem of maneuvering evaders, differential game (DG) theory shows its advantages, where the interception problem is transformed into finding the saddle point of Nash equilibrium. In [5], an intelligent guidance algorithm was proposed for effectively intercepting the maneuverable target by virtue of DG concepts. The engagement kinematics, in addition to the direct intercept condition, were developed with 2D engagement. In [6], linear quadratic differential game (LQDG) guidance laws were proposed for solving the two-pursuit versus single-evader problem. The interception strategy was derived from the Nash equilibrium strategy set of the game. Similarly, in [7], differential game guidance laws were proposed for a linear system. However, obtaining an analytic solution becomes nearly impossible as the system grows more complex.
To address the nonlinear problem, adaptive dynamic programming (ADP) techniques provide powerful tools to solve nonlinear DG problems. Value iteration and policy iteration are employed to solve differential game guidance laws (DGGLs). In [8], a data-driven value iteration (VI) algorithm was proposed to solve the adaptive continuous-time linear optimal output regulation, and the author designed an online value iteration algorithm to learn the feedback control gain. In [9], an online policy iteration algorithm was proposed to achieve infinite-horizon optimal design for nonlinear two-player zero-sum games. However, insufficient iterations within a fixed sampling time may lead to system instability. To address this problem, in [10], the author proposed a time-based neuro-dynamic programming (NDP) algorithm, where the previous history of system states and cost function approximations were considered to solve the iteration problem. In [11], three neural network approximators were designed to learn the cost function, and an online NDP algorithm was proposed to solve the two-player zero-sum game problem in a continuous-time (CT) system. However, these guidance laws are not autonomous intelligent strategies, indicating that guidance systems lack the attributes of intelligent decision-making systems.
With the development of ADP techniques, reinforcement learning (RL) is gaining more attention as an effective method for obtaining autonomous intelligent guidance strategies. In [12], reinforcement learning was applied to solve the differential game problem, and the minimax point was found using discrete iteration. In [13], the researchers investigated the use of reinforcement learning techniques, based on the Q-learning algorithm, to implement interception strategies. Moreover, in [14], based on Q-learning and a fuzzy inference system, a fuzzy logic controller was proposed for solving pursuit–evasion differential games. The proposed method could handle the case in which the maneuverability of the evader is unknown. In [15], Jun Jet Tai’s work on reinforcement learning demonstrates that the open-source PyFlyt software (PyFlyt 4.0.0) enables the configuration of arbitrary UAV types and tasks such as dogfighting. However, these reinforcement learning algorithms were implemented by discretizing the continuous action domain, potentially leading to exponential growth in computation. To address this problem, deep reinforcement learning (DRL), with its advantage in solving continuous control problems, was considered. In [16], modified multiagent reinforcement learning was proposed to achieve an underwater target hunting task under the constraints of energetic flows and acoustic propagation delay. In [17], deep deterministic policy gradient (DDPG) techniques were applied in guidance applications, and an intelligent impact time guidance law with a field-of-view constraint was proposed. The guidance gain was obtained by the DDPG framework, which maximizes the expected total reward. In [18], for solving the hypersonic pursuit–evasion game, an intelligent maneuver strategy was proposed based on the TD3 algorithm and a deep neural network; the proposed algorithm could explore potential maneuver manners. Similar work is discussed in reference [19]. Although these DRL guidance laws can solve the interception problem by maximizing the expected reward, the evader is regarded as part of the environment without intelligent decision-making. Furthermore, very few studies address the differential game guidance law (DGGL) for solving the interception problem in a continuous domain using DRL.
This paper innovatively combines DRL technology with DG theory to design the DGGL in the continuous domain. An intelligent differential game guidance law (IDGGL) algorithm is proposed. The main contributions of this paper are emphasized as follows.
1. Unlike traditional guidance laws, the proposed IDGGL avoids tedious manual tuning and reduces design effort. The guidance model is obtained directly from environmental interaction through reinforcement learning. It is an intelligent guidance strategy, which can also save simulation time.
2. The differential game interception problem is transformed into a Markov game. In general guidance algorithms based on deep reinforcement learning, the evader’s strategy is not considered. Unlike traditional DRL algorithms, our method formulates the guidance problem as a differential game, which allows for more sophisticated strategies in adversarial scenarios. The proposed IDGGL algorithm aims to determine the minimax saddle point rather than simply maximize the return. In this way, the evader is treated as an intelligent agent. This is the first application of DRL to the differential game interception problem.
3. Unlike numerous research works on guidance algorithms based on deep reinforcement learning, a comprehensive reward function is considered in this paper; on the one hand, the designed reward function aligns better with practical applications, and on the other hand, it makes the training process faster and the trained model more accurate. A reasonable reward function is designed, which emphasizes not only the terminal interception distance but also the energy consumption and the relative distance during the interception process. Additionally, an experience replay buffer, “soft” target networks, and normalization techniques are adopted to make the training process more efficient. Furthermore, for practical application, exploration noise is added to the action space.
The structure of this paper is organized as follows. In Section 2, the pursuit–evasion game problem and the core concepts and principles of RL are presented. The framework of the zero-sum differential Markov game and the proposed IDGGL are presented in Section 3. In Section 4, numerical experiments are carried out to evaluate the performance of the proposed IDGGL strategy. Section 5 presents some conclusions.

2. Problem Statement and Preliminaries

First, we present the formulation of the pursuit–evasion game. In this engagement, the pursuer tries to intercept the evader, while the evader tries to escape from the pursuer. Then, the core concepts and principles of RL are introduced.

2.1. The Formulation of the Pursuit–Evasion Game

In this subsection, the planar engagement scenario shown in Figure 1 is considered. The X–Y plane denotes the Cartesian reference frame. The variables $V_i$ and $a_i$, $i \in \{M, T\}$, represent the velocity and the lateral acceleration of the pursuer ($M$) and the evader ($T$), respectively. The variables $\gamma_M$ and $\gamma_T$ denote the flight path angles of the pursuer and the evader, respectively. $u_M$ and $u_T$ are the control inputs of the two players. We define the pursuer–evader relative distance and the line-of-sight (LOS) angle as $r_d$ and $\theta$.
In the terminal guidance stage, the interception time is very short; thus, without loss of generality, the speeds of both agents are assumed constant and the interception trajectories are nearly straight lines. The nonlinear kinematics of the pursuit–evasion game can then be formulated as follows:
V_r = \dot{r}_d = V_T \cos(\gamma_T - \theta) - V_M \cos(\gamma_M - \theta)
\sigma = \dot{\theta} = \left[ V_T \sin(\gamma_T - \theta) - V_M \sin(\gamma_M - \theta) \right] / r_d
where $V_r$ represents the closing velocity, and $\sigma$ denotes the angular rate of the LOS.
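To make Equations (1) and (2) concrete, a minimal Python sketch of the relative kinematics is given below. It is a sketch only: the function name, the NumPy dependency, and the angle conventions are illustrative assumptions, not the authors' implementation.

import numpy as np

def relative_kinematics(x_M, y_M, x_T, y_T, V_M, V_T, gamma_M, gamma_T):
    # Pursuer-evader relative distance and line-of-sight (LOS) angle
    r_d = np.hypot(x_T - x_M, y_T - y_M)
    theta = np.arctan2(y_T - y_M, x_T - x_M)
    # Closing velocity, Eq. (1)
    V_r = V_T * np.cos(gamma_T - theta) - V_M * np.cos(gamma_M - theta)
    # LOS angular rate, Eq. (2)
    sigma = (V_T * np.sin(gamma_T - theta) - V_M * np.sin(gamma_M - theta)) / r_d
    return r_d, theta, V_r, sigma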
In order to successfully intercept the evader, the concept of zero miss distance is used [5]. It can be defined as
z_{miss}(t) = \dfrac{r_d^{2}\,\sigma}{\sqrt{\dot{r}_d^{2} + r_d^{2}\sigma^{2}}}
It can be found that when the angular rate of the LOS approaches 0, the pursuer can intercept the evader. In general, there always exist external disturbances. Therefore, $x = [\theta \ \ \sigma]^{T}$ is chosen as the state of the system [20]; by differentiating Equation (2), the accurate nonlinear dynamic equations can be obtained:
\dot{x} = f(x) - g(x)\left(a_m + d_m\right) + k(x)\left(a_t + d_t\right)
where $f(x) = \left[ \sigma, \ -\dfrac{2V_r}{r_d}\sigma \right]^{T}$, $g(x) = \left[ 0, \ \dfrac{\cos(\gamma_M - \theta)}{r_d} \right]^{T}$, $k(x) = \left[ 0, \ \dfrac{\cos(\gamma_T - \theta)}{r_d} \right]^{T}$, and $d_m$ and $d_t$ are external disturbances. In practical applications, controllers are always affected by external disturbances, and accounting for them makes the training model more suitable for practical use.
Traditionally, it is difficult to obtain the pursuer control command when the system has external disturbances. In this paper, the accurate nonlinear dynamics system is not required; we adopt the DRL technique to learn the interception strategy.
The first-order dynamics of the pursuer can be expressed as follows:
\dot{x}_M = V_M \cos\gamma_M
\dot{y}_M = V_M \sin\gamma_M
\dot{\gamma}_M = a_m / V_M
\dot{a}_M = \left(u_M - a_m\right) / \tau_M
where $(x_M, y_M)$ and $a_m$ are the position and the lateral acceleration of the pursuer, respectively, and $\tau_M$ is a time constant.
The first-order dynamics of the evader can be expressed as follows:
\dot{x}_T = V_T \cos\gamma_T
\dot{y}_T = V_T \sin\gamma_T
\dot{\gamma}_T = a_t / V_T
\dot{a}_T = \left(u_T - a_t\right) / \tau_T
where $(x_T, y_T)$ and $a_t$ are the position and the lateral acceleration of the evader, respectively, and $\tau_T$ is a time constant.
In the RL simulation, Equations (5)–(12) are used to propagate the motion of the pursuer and the evader. External disturbances are considered in the action space.
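As a rough illustration of how Equations (5)–(12) can be propagated inside the RL simulator, a forward-Euler sketch is given below; the integration scheme, the 0.01 s step size (taken from the sampling time in Table 3), and the function signature are assumptions.

import numpy as np

def step_agent(x, y, gamma, a, u, V, tau, dt=0.01):
    # One forward-Euler step of the first-order dynamics, Eqs. (5)-(8) or (9)-(12)
    x_new = x + V * np.cos(gamma) * dt          # position update
    y_new = y + V * np.sin(gamma) * dt
    gamma_new = gamma + (a / V) * dt            # flight path angle driven by lateral acceleration
    a_new = a + ((u - a) / tau) * dt            # first-order autopilot lag
    return x_new, y_new, gamma_new, a_new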

2.2. RL Framework

The basic principle of RL is that an agent interacts with the environment and learns a control policy, which can be described as a Markov Decision Process (MDP). The MDP consists of five elements $(S, A, P, r, \gamma)$, where $S$ denotes a set of states, $A$ is a set of actions, $P$ represents the state transition probabilities, $r$ is the reward function, and $\gamma$ denotes the discount rate. In the MDP, at each timestep $t$, the agent interacts with the environment, receives a state $s_t \in S$, takes an action $a_t \in A$, and obtains a reward $r_t$. A trajectory $\{s_0, a_0, r_1, s_1, a_1, r_2, \ldots\}$ is thereby generated.
The goal of the agent in RL is to learn a control policy $\pi: S \rightarrow P(A)$, which maps states to a probability distribution over actions. In general, the control policy $\pi$ is learned by maximizing the return, i.e., the discounted sum of future rewards:
R = \sum_{i=t}^{T} \gamma^{\,i-t}\, r(s_i, a_i)
The state-value function V π s and action-value function Q π s , a are used in many RL algorithms for obtaining the control policy π [19]:
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[ \sum_{i=0}^{T} \gamma^{i} r(s_i) \,\middle|\, s_0 = s \right]
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[ \sum_{i=0}^{T} \gamma^{i} r(s_i, a_i) \,\middle|\, s_0 = s,\ a_0 = a \right]
where $\mathbb{E}_{\pi}$ denotes the expectation under policy $\pi$. For convenience, $\mathbb{E}_{\pi}$ is written as $\mathbb{E}$ in the following sections.
For solving the continuous control problem, DRL shows its advantages. The basic idea of DRL involves adopting neural networks (NNs) to approximate the action-value function and state-value function. Typical algorithms in this domain include deep deterministic policy gradient (DDPG) and TD3. The main idea of these algorithms is that the state-value function and action-value function are parameterized by NNs. In this paper, we consider that the pursuer is continuously controlled.
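To make the return in Equation (13) concrete, a small sketch is shown below; the discount value of 0.99 follows Table 3, and the example reward sequence is purely illustrative.

def discounted_return(rewards, gamma=0.99):
    # R = sum_{i=t}^{T} gamma^(i-t) * r_i, accumulated from the terminal step backwards (Eq. (13))
    R = 0.0
    for r in reversed(rewards):
        R = r + gamma * R
    return R

# Example with an assumed short reward sequence
print(discounted_return([1.0, 0.5, -0.2, 2.0]))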

3. Deep Reinforcement Learning Formulation of Differential Games

In this section, the pursuit–evasion game problem is considered as a zero-sum differential game. By combining Markov games and DRL, the IDGGL is proposed. Our emphasis is on two agents interacting with the environment and learning the Nash equilibrium strategy.

3.1. The Framework of the Zero-Sum Differential Markov Game

In the traditional Markov Decision Process (MDP), an agent interacts with the environment and maximizes the reward. Based on the MDP, the zero-sum differential Markov game is described by a tuple $(S, A_m, A_t, O, P, r, \gamma)$, where $S$ denotes the set of states, $A_m$ denotes the pursuer’s action space, $A_t$ represents the evader’s action space, $O$ is the set of observations, and $P$ is the state transition probability $p(s_{t+1} \mid s_t, a_m, a_t)$. The reward function $r(s_t, a_m, a_t)$ describes the reward that the pursuer acquires given the actions of both players, and $\gamma \in (0,1)$ represents the discount rate. In this paper, $O \subseteq S$; only the relative distance and the LOS are directly observable during the process.
The behavior of each player is determined by a policy $\pi(s)$, which maps each state to an action pair, $s \mapsto P(A_m, A_t)$. The evader’s actions affect the optimal policy. The control strategy of the evader can be obtained through an observer; thus, a rational pursuer knows the evader’s entire policy. Considering the impact of the evader’s strategy makes the learning model more suitable for practical applications. It is important to note that the policy may be stochastic, reflecting the stochastic nature of the actions chosen by the evader and of the environment. The goal of DRL here is to learn policies for the two agents, which interact with a stochastic environment while finding the saddle point of the return function. The discounted sum of future rewards is defined as the return function, as follows:
R_t = \sum_{i=t}^{T} \gamma^{\,i-t}\, r(s_i, a_{m,i}, a_{t,i})
where $\gamma$ is the discount factor and $r(s_i, a_{m,i}, a_{t,i})$ is the designed reward function, which is presented in Section 3.3.
The expected total return is defined as the value function as follows:
Q^{\pi}(s_t) = \mathbb{E}\left[ R_t \mid s_t \right]
In the zero-sum differential Markov game, the pursuer tries to maximize the value function, while the evader has the contrary purpose that minimizes the value function.
J = \max_{a_m \in A_m} \min_{a_t \in A_t} Q^{\pi}(s_t)
According to the Bellman equation, the recursive relationship of the differential game value function yields
Q^{\pi}(s_t) = \mathbb{E}\left[ r(s_t, a_m, a_t) + \gamma Q^{\pi}(s_{t+1}) \right]
Unlike traditional DRL, the zero-sum differential Markov game includes the evader’s strategy; the optimal policies of the two agents are obtained by solving the following optimal value function:
\pi^{*} = (A_m^{*}, A_t^{*}) = \arg \max_{a_m \in A_m} \min_{a_t \in A_t} Q(s_t)
where $\pi^{*}$ denotes the optimal control variables of the pursuer and the evader. In the Markov game, the pursuer obtains the optimal control policy when the evader chooses its optimal control. In this paper, the pursuer learns the optimal interception policy when the evader chooses the optimal escape strategy. Thus, $\pi^{*}$ represents the control variables of the two players in the game process.
In the DRL pursuit–evasion game, the pursuer is expected to learn the optimal control policy; in other words, the DRL algorithm converges to the optimal value function $Q^{*}(s_t)$. The convergence of the DRL algorithm is established as follows.
Theorem 1.
Under the framework of the zero-sum differential Markov game, the update rule of the value function is given as follows
Q_{n+1}(s_t) = Q_n(s_t) + \alpha(s_t)\left[ r + \gamma \max_{a_m \in A_m} \min_{a_t \in A_t} Q_n(s_{t+1}, a_{m+1}, a_{t+1}) - Q_n(s_t) \right]
where $\alpha(s_t)$ is the step size and satisfies $0 \le \alpha(s_t) \le 1$. Then, the DRL algorithm converges to the optimal value function $Q^{*}(s_t)$, which corresponds to the optimal control policy.
Proof. 
First, we define the maximum error function as
\Delta_n = \max \left\| Q_n(s_t) - Q^{*}(s_t) \right\|
where $Q_n(s_t)$ is the value function at the $n$-th step.
Then, Equation (19) can be rewritten as
Q_{n+1}(s_t) = \left(1 - \alpha(s_t)\right) Q_n(s_t) + \alpha(s_t)\left[ r + \gamma \max_{a_m \in A_m} \min_{a_t \in A_t} Q_n(s_{t+1}, a_{m+1}, a_{t+1}) \right] = \left(1 - \alpha(s_t)\right) Q_n(s_t) + \alpha(s_t)\, \Gamma Q_n(s_t)
where Γ represents the Bellman operator.
Similarly, Q s t can be written as
Q^{*}(s_t) = \left(1 - \alpha(s_t)\right) Q^{*}(s_t) + \alpha(s_t)\, \Gamma Q^{*}(s_t)
It is clearly noted that
\left\| \Gamma Q_n(s_t) - \Gamma Q^{*}(s_t) \right\| \le \gamma \left\| Q_n(s_t) - Q^{*}(s_t) \right\|
Subtracting Equation (21) from Equation (22) and combining Equation (23), we have
\left\| Q_{n+1}(s_t) - Q^{*}(s_t) \right\| = \left\| \alpha(s_t)\left[ \Gamma Q_n(s_t) - \Gamma Q^{*}(s_t) \right] + \left(1 - \alpha(s_t)\right)\left[ Q_n(s_t) - Q^{*}(s_t) \right] \right\| \le \alpha(s_t)\left\| \Gamma Q_n(s_t) - \Gamma Q^{*}(s_t) \right\| + \left(1 - \alpha(s_t)\right)\left\| Q_n(s_t) - Q^{*}(s_t) \right\| \le \left( \alpha(s_t)\gamma + 1 - \alpha(s_t) \right)\Delta_n
Therefore, the error $\left\| Q_{n+1}(s_t) - Q^{*}(s_t) \right\|$ is at most $\left( \alpha(s_t)\gamma + 1 - \alpha(s_t) \right)$ times the error at step $n$. Let $\Delta_0$ represent the initial maximum error between $Q_0(s_t)$ and $Q^{*}(s_t)$. Then, after $k$ iterations, the cumulative error satisfies
\left\| Q_k(s_t) - Q^{*}(s_t) \right\| \le \left( \alpha(s_t)\gamma + 1 - \alpha(s_t) \right)^{k} \Delta_0
Since $0 < \alpha(s_t) \le 1$ and $\gamma \in (0,1)$, it follows that $0 \le \alpha(s_t)\gamma + 1 - \alpha(s_t) < 1$. Therefore, when $k \rightarrow \infty$, we have $\left\| Q_k(s_t) - Q^{*}(s_t) \right\| \rightarrow 0$. Thus, the DRL algorithm converges to the optimal value function $Q^{*}(s_t)$. In other words, when the pursuer learns the optimal policy, the closed-loop system of the pursuit–evasion game is stable.
Directly optimizing the value function or action-value function can yield the optimal policy. However, this approach requires accurate model information, and implementing such methods under model uncertainties can be challenging in practical applications. Fortunately, model-free RL algorithms relax this requirement and can still find the optimal policy. One approach is to iteratively evaluate the value function, which is approximated by NNs; a well-known algorithm of this type is Q-learning. However, such iterative approaches are typically limited to discrete action spaces. Another approach involves policy gradient algorithms, which learn a deterministic policy; here, the policy is updated along the gradient of the action-value function with respect to the action. A well-known algorithm in this category is the DDPG algorithm [21], which can be applied in the continuous action domain. Thanks to this property, our proposed algorithm solves the zero-sum differential game problem based on the framework of DDPG. The implementation is explained in the next section.

3.2. The Proposed Algorithm

To solve the zero-sum differential Markov game problem, an actor–critic NN framework is adopted based on the DDPG algorithm. In the zero-sum differential Markov game, our aim is to find the saddle point, which corresponds to learning the Nash equilibrium. The action-value function $Q(s, a_m, a_t)$ describes the return when the pursuer acquires the maximum reward while the evader obtains the minimum reward. This function is approximated by an NN $Q(s, a_m, a_t \mid \theta^{Q})$ parameterized by $\theta^{Q}$. The action-value function, commonly utilized in many RL algorithms, describes the expected return after taking actions $a_m, a_t$ in state $s_t$:
Q^{\pi}(s_t, a_m, a_t) = \mathbb{E}\left[ R_t \mid s_t, a_m, a_t \right]
According to the Bellman equation, the recursive relationship of the action-value function yields
Q^{\pi}(s_t, a_m, a_t) = \mathbb{E}\left[ r(s_t, a_m, a_t) + \gamma\, \mathbb{E}\left[ Q^{\pi}(s_{t+1}, a_{m+1}, a_{t+1}) \right] \right]
The action-value function is parameterized by $\theta^{Q}$; based on the temporal difference (TD) error, the loss function can be defined as follows:
L(\theta^{Q}) = \mathbb{E}\left[ \left( Q^{\pi}(s_t, a_m, a_t \mid \theta^{Q}) - y_t \right)^{2} \right]
where
y_t = r(s_t, a_m, a_t) + \gamma\, Q^{\mu}\!\left( s_{t+1}, \mu(s_{t+1}) \mid \theta^{Q} \right)
By minimizing the loss function, the parameter θ Q is updated by using the gradient descent algorithm:
\theta_{t+1}^{Q} = \theta_{t}^{Q} - \alpha_{\theta^{Q}} \nabla_{\theta^{Q}} L(\theta^{Q})
where $\alpha_{\theta^{Q}}$ is the learning rate of the critic network.
The inner expectation can be avoided by taking a deterministic policy. Thereafter, Equation (27) becomes the following:
Q^{\mu}(s_t, a_m, a_t) = \mathbb{E}\left[ r(s_t, a_m, a_t) + \gamma\, Q^{\mu}\!\left( s_{t+1}, \mu(s_{t+1}) \right) \right]
From Equation (31), it can be seen that the inner expectation depends only on the environment, which means that $Q^{\mu}$ can be learned off-policy from stored transitions. The actor NN maps states to a specific action and is parameterized by $\mu(s \mid \theta^{\mu})$. The update rule applies the gradient of the expected return $J$ with respect to the actor parameters:
\nabla_{\theta^{\mu}} J = \mathbb{E}\left[ \nabla_{\theta^{\mu}} Q\!\left( s, a_m, a_t \mid \theta^{Q} \right) \big|_{s = s_t,\ (a_m, a_t) = \mu(s_t)} \right] = \mathbb{E}\left[ \nabla_{(a_m, a_t)} Q\!\left( s, a_m, a_t \mid \theta^{Q} \right) \big|_{s = s_t,\ (a_m, a_t) = \mu(s_t)}\, \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu}) \big|_{s = s_t} \right]
where $\nabla_{(\cdot)}$ denotes the partial derivative with respect to $(\cdot)$.
For the player, the expected return is to be maximized; therefore, the parameter $\theta^{\mu}$ is updated by gradient ascent using the policy gradient:
\theta_{t+1}^{\mu} = \theta_{t}^{\mu} + \alpha_{\theta^{\mu}} \nabla_{\theta^{\mu}} J(\theta^{\mu})
where $\alpha_{\theta^{\mu}}$ is the learning rate of the actor network.
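For reference, a simplified PyTorch sketch of the critic update (Eqs. (28)–(30)) and the actor update (Eqs. (32) and (33)) is given below. It is a sketch only: the network and optimizer objects are assumed to exist, the critic here takes the state and the pursuer's action (following the input dimensions listed in Table 2), and how the evader's action and the minimax structure enter the training loop is left to the environment.

import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, gamma=0.99):
    # One update of the critic and the actor on a sampled minibatch
    s, a_m, r, s_next = batch                    # tensors drawn from the replay buffer

    # Critic: regress Q(s, a_m) onto the TD target built from the target networks
    with torch.no_grad():
        y = r + gamma * target_critic(s_next, target_actor(s_next))   # TD target y_t
    critic_loss = F.mse_loss(critic(s, a_m), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: gradient ascent on Q with respect to the policy parameters
    actor_loss = -critic(s, actor(s)).mean()     # minimizing -Q is equivalent to maximizing Q
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
    return critic_loss.item(), actor_loss.item()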
Remark 1.
In most optimization algorithms, samples are typically assumed to be independent and identically distributed. However, this assumption no longer holds when the samples are obtained by exploring an environment sequentially. To address this problem, we adopt an experience replay buffer, as in the DDPG algorithm. The finite replay buffer stores finite-sized transitions, denoted as $B_t = (s_t, a_m, a_t, r_t, s_{t+1})$, which are sampled from the environment according to the sampling policy. The actor and critic networks are updated by sampling a minibatch from the buffer. By using the experience replay buffer, the algorithm is off-policy and benefits from learning across a set of uncorrelated transitions rather than learning online. Additionally, minibatch updates are used in the learning process.
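A minimal replay-buffer sketch consistent with Remark 1 is given below; the default capacity of 5000 and minibatch size of 128 follow Table 3, while the class itself and the uniform sampling are assumptions.

import random
from collections import deque

class ReplayBuffer:
    # Finite-sized buffer of transitions (s_t, a_m, a_t, r_t, s_{t+1})
    def __init__(self, capacity=5000):
        self.buffer = deque(maxlen=capacity)     # oldest transitions are discarded first

    def store(self, s, a_m, a_t, r, s_next):
        self.buffer.append((s, a_m, a_t, r, s_next))

    def sample(self, batch_size=128):
        # Uniformly sample a minibatch of uncorrelated transitions
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)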
Remark 2.
Directly implementing the actor–critic NN can lead to instability, particularly in environments where the update of the critic network may diverge. To address this issue, a modification similar to the one used in the DDPG algorithm is employed, which involves using “soft” target updates. We create two copy target networks for calculating target values. The weights of target networks are updated by the following iteration:
\theta^{Q'} \leftarrow \tau \theta^{Q} + (1 - \tau)\theta^{Q'}
\theta^{\mu'} \leftarrow \tau \theta^{\mu} + (1 - \tau)\theta^{\mu'}
where $\tau$ is a constant with $\tau \ll 1$. In general, a smaller $\tau$ (e.g., 0.001) ensures stable and gradual updates, reducing the risk of oscillations and divergence.
In DRL, the target network helps stabilize training by providing consistent target values for the Q-network updates, reducing oscillations and divergence in the learning process. Similarly, the soft target update stabilizes training in DDPG by gradually moving the target network parameters toward the main network parameters, reducing volatility and improving learning stability. The two target networks in Equations (36) and (37) are initialized as copies of the randomly initialized actor–critic networks.
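The soft update in Equations (36) and (37) can be written compactly as below; the sketch assumes PyTorch modules whose parameters() iterate in matching order, and the value of τ used in the experiments is the soft update coefficient of Table 3.

def soft_update(target_net, main_net, tau):
    # theta_target <- tau * theta_main + (1 - tau) * theta_target, Eqs. (36)-(37)
    for tgt, src in zip(target_net.parameters(), main_net.parameters()):
        tgt.data.copy_(tau * src.data + (1.0 - tau) * tgt.data)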
Remark 3.
A major challenge of learning in continuous action spaces is exploration. The environment initializes random states, and the agent explores the environment by perturbing its actions. To address this issue, a random sample from a noise process $\mathcal{N}$ is added to our exploration policy. The action space can be rewritten as
a_m = a_m + \mathcal{N}
a_t = a_t + \mathcal{N}
For convenience of training, we assume that the noise of all agents has the same distribution [22]. The noise satisfies the following:
\mathcal{N}_t = \mathcal{N}_{t-1} + \beta\left( \mu_v - \mathcal{N}_{t-1} \right) T_s + \mathcal{N}\!\left( 0, \sigma_v T_s \right)
where $\mu_v$ is the mean of the noise, $\sigma_v$ represents the exploration variance of the noise, $\beta$ is the mean attraction constant, and $T_s$ denotes the sampling time.
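A sketch of the noise process in Remark 3 (an Ornstein–Uhlenbeck-type, mean-reverting process) is given below; the default constants mirror Table 3 (action noise 0.1, mean attraction constant 0.05, sampling time 0.01), a zero mean is assumed, and the square-root scaling of the Gaussian increment is an assumed discretization.

import numpy as np

class ExplorationNoise:
    # Discrete-time mean-reverting exploration noise, Eq. (40)
    def __init__(self, mu_v=0.0, sigma_v=0.1, beta=0.05, Ts=0.01):
        self.mu_v, self.sigma_v, self.beta, self.Ts = mu_v, sigma_v, beta, Ts
        self.n = 0.0

    def sample(self):
        # Mean-reverting drift plus a Gaussian increment; the sqrt(Ts) scaling is an
        # assumed Euler-Maruyama discretization of the continuous-time process
        self.n += self.beta * (self.mu_v - self.n) * self.Ts \
                  + self.sigma_v * np.sqrt(self.Ts) * np.random.randn()
        return self.n

    def reset(self):
        self.n = 0.0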
The pseudocode of the proposed algorithm for solving the zero-sum differential Markov game is as follows (Algorithm 1):
Algorithm 1. Pursuer interception strategy based on DDPG
1. Randomly initialize the actor and critic networks with weights $\theta^{\mu}$ and $\theta^{Q}$          # Initialize actor–critic network
2. Initialize the target networks with weights $\theta^{Q'} \leftarrow \theta^{Q}$ and $\theta^{\mu'} \leftarrow \theta^{\mu}$          # Initialize target networks
3. Initialize the experience buffer B                         # Initialize experience buffer
4. for episode = 1 , Max Episode do                         # learning process
5.   Initialize random noise N to the action for exploration              # Initialize action noise
6.   Initialize the initial state of the pursuer and the evader              # Initialize environmental status
7.   Obtain initial observation state s 1                       # Initialize observation status
8.   for t = 1 ,   T do                               # Action chosen
9.       Select action $a_m = a_m + \mathcal{N}$ based on the current state and current policy
10.     Observe the evader selecting action $a_t = a_t + \mathcal{N}$ based on the current state and current policy
11.     Execute the action $a_m$ while the evader executes the action $a_t$; then observe $r_t$ and the new state $s_{t+1}$
12.     Store the transition $(s_t, a_m, a_t, r_t, s_{t+1})$ in the experience buffer $B$
13.     Sample a random minibatch of $N$ transitions from $B$
14.     Calculate: $y_i = r_i + \gamma Q'\!\left( s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'} \right)$              # Calculate the TD target
15.     Calculate the TD error and update the critic network using gradient descent:
                         $L = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - Q(s_i, a_{m,i}, a_{t,i} \mid \theta^{Q}) \right)^{2}$
                           $\theta_{t+1}^{Q} = \theta_{t}^{Q} - \alpha_{\theta^{Q}} \nabla_{\theta^{Q}} L(\theta^{Q})$    # Update critic network
16.     Update the actor network using the policy gradient:
                         $\nabla_{\theta^{\mu}} J = \frac{1}{N} \sum_{i=1}^{N} \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\, \nabla_{a} Q(s_t, a_m, a_t \mid \theta^{Q})$
                           $\theta_{t+1}^{\mu} = \theta_{t}^{\mu} + \alpha_{\theta^{\mu}} \nabla_{\theta^{\mu}} J$      # Update actor network
17.     Update the target networks:
                         $\theta^{Q'} \leftarrow \tau \theta^{Q} + (1 - \tau)\theta^{Q'}$
                         $\theta^{\mu'} \leftarrow \tau \theta^{\mu} + (1 - \tau)\theta^{\mu'}$        # Update target networks
18.   if the task is accomplished then
19.     Terminate the current episode
20.   end if
21.  end for
22. end for                                     # End learning

3.3. The IDGGL Design Based on the Proposed Algorithm

Based on the above section, we formulate the zero-sum differential Markov game problem in the DRL framework. The proposed IDGGL strategy can be learned by using the proposed algorithm. The basic structure of the system and algorithm flowchart are illustrated in Figure 2, which describes the specific implementation process. Next, the pursuit–evasion environment, states, action spaces, and the reward function are presented. Importantly, the reward function design is one of the key points.
A.
State Spaces
In the pursuit–evasion game scenario, the relative kinematic models of the pursuer and the evader are considered as state spaces, which can be designed as follows:
s_t = \left[ r_d, \ \theta, \ \dot{r}_d, \ \dot{\theta} \right]
The relative distance and its rate directly reflect the success or failure of the interception. An active radar seeker can provide information on the relative distance and the LOS. Traditional control techniques such as PNG assume that the LOS rate is known. Following the principle of PNG, the LOS and LOS rate are chosen as states, which carry the angle information and encourage the pursuer to approach the evader along a near-parallel LOS. Thus, all states can be calculated and fully characterize the engagement.
B.
Action Spaces
In the zero-sum differential Markov game, we consider two action spaces in the continuous domain, namely the normal accelerations of the pursuer and the evader, $a = (a_m, a_t)$:
a_m \in \left[ -n_M g, \ n_M g \right]
a_t \in \left[ -n_T g, \ n_T g \right]
where n M and n T denote the limit coefficients of the acceleration. It can be seen that the accelerations of the two agents are limited to a certain range, which is beneficial for the control command.
C.
Reward Function
A reasonable reward function not only affects the learning speed and feasibility but also reflects the interception efficiency of the pursuer. Therefore, a proper reward function design is crucial for learning the optimal interception guidance law. According to the pursuit–evasion information, the reward function is designed as follows.
(1) The terminal reward $r_e$. This term reflects whether the pursuer successfully intercepts the evader or not. This function is designed as
r_e = \begin{cases} a_1, & r_d < r_D \\ -1, & r_d > r_D \end{cases}
where $r_D$ represents the distance of successful interception. In most cases, $r_D$ is also the interception radius, which is set as $r_D = 0.1$ m, and $a_1$ is the interception reward.
(2) The relative distance reward $r_z$. This term encourages the pursuer to approach the evader: the closer the pursuer is to the evader, the higher the reward obtained. This function is designed as
r_z = -k_r \left( \dfrac{r_d}{r_0} \right)^{2}
where $r_0$ is the initial relative distance between the pursuer and the evader, and $k_r$ is a constant weight.
(3) The control effort reward $r_a$. This term takes into account the overload of the pursuer and encourages the pursuer to intercept the evader with minimum energy consumption. This function is designed as
r_a = -k_a \left( \dfrac{a_m}{a_m^{\max}} \right)^{2}
where $a_m^{\max}$ is the overload limit of the pursuer, and $k_a$ is a constant weight.
(4) The LOS reward $r_v$. This term refers to proportional navigation guidance (PNG), encouraging the pursuer to approach the evader along a near-parallel LOS. It also ensures that the evader is maintained within the pursuer’s LOS. The function is designed as
r_v = -k_{v1} \left( \dfrac{\theta}{\theta_0} \right)^{2} - k_{v2} \left( \dfrac{\dot{\theta}}{\dot{\theta}_0} \right)^{2}
where $\theta_0$ is the initial LOS, $\dot{\theta}_0$ denotes the initial LOS rate, and $k_{v1}$ and $k_{v2}$ are constant weights.
In summary, the total designed reward function is
r(s_i, a_{m,i}, a_{t,i}) = r_e + r_p = r_e + (r_z + r_a + r_v)
where $r(s_i, a_{m,i}, a_{t,i})$ is the reward function in Equation (16), $r_e$ is the terminal reward, and $r_p$ is the process reward.
The proposed reward function is a comprehensive function, which consists of interception accuracy, energy consumption, and interception distance.
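Putting Equations (42)–(46) together, a sketch of the total reward is given below; the default weights follow Table 1 and $r_D = 0.1$ m, while the negative signs on the penalty terms and the −1 branch of the terminal reward are reconstructions and should be treated as assumptions.

def reward(r_d, theta, theta_dot, a_m, r_0, theta_0, theta_dot_0, a_m_max,
           r_D=0.1, a_1=100.0, k_r=10.0, k_a=0.5, k_v1=0.1, k_v2=0.1):
    # Total reward r = r_e + (r_z + r_a + r_v), Eqs. (42)-(46)
    r_e = a_1 if r_d < r_D else -1.0                      # terminal reward, Eq. (42)
    r_z = -k_r * (r_d / r_0) ** 2                         # relative distance reward, Eq. (43)
    r_a = -k_a * (a_m / a_m_max) ** 2                     # control effort reward, Eq. (44)
    r_v = -k_v1 * (theta / theta_0) ** 2 \
          - k_v2 * (theta_dot / theta_dot_0) ** 2         # LOS reward, Eq. (45)
    return r_e + r_z + r_a + r_v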
Remark 4.
Since the DRL training process is stochastic in nature, the parameters of the reward function are not derived analytically. In this paper, based on extensive training results, the parameters of the reward function are selected as shown in Table 1.
To improve the training efficiency, normalization is adopted in the training process. The normalized states and actions are defined as follows:
\bar{r}_d = \dfrac{r_d}{r_0}, \quad \bar{\theta} = \dfrac{\theta}{\theta_0}, \quad \bar{\dot{r}}_d = \dfrac{\dot{r}_d}{\dot{r}_0}, \quad \bar{\dot{\theta}} = \dfrac{\dot{\theta}}{\dot{\theta}_0}, \quad \bar{a}_m = \dfrac{a_m}{a_m^{\max}}, \quad \bar{a}_t = \dfrac{a_t}{a_t^{\max}}
where $(\cdot)_0$ represents the initial value of the variable $(\cdot)$, and $a^{\max}$ stands for the maximum acceleration of the corresponding agent.
D.
Neural Network Structure and Hyperparameter Setting
Inspired by the DDPG algorithm, the actor–critic NN is shown in Figure 3. Both networks have a similar structure, composed of four-layer fully connected neural networks. The input of the actor network (Figure 3a) is the state, and its output is the action. The input of the critic network (Figure 3b) is the state and the action, and its output is the Q-value. Two hidden layers are used, activated by the ReLU function, with 128 and 64 neurons, respectively. The ReLU function is defined as follows:
\mathrm{ReLU}(x) = \begin{cases} x, & x \ge 0 \\ 0, & x < 0 \end{cases}
The output layer is activated by the Tanh function, which can be expressed as
\tanh(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}
From Equation (48), it can be seen that the output is limited to (−1, 1). Therefore, the problem of action saturation can be avoided.
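A PyTorch sketch of the actor and critic networks described above is shown below. It is a sketch under stated assumptions: the hidden sizes default to the 128 and 64 neurons mentioned in the text (Table 2 lists 64 and 16), and the critic output is left linear here, whereas Table 2 lists a Tanh output.

import torch
import torch.nn as nn

class Actor(nn.Module):
    # Maps the 4-dimensional state to a normalized action in (-1, 1)
    def __init__(self, state_dim=4, action_dim=1, h1=128, h2=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, h1), nn.ReLU(),
            nn.Linear(h1, h2), nn.ReLU(),
            nn.Linear(h2, action_dim), nn.Tanh())   # Tanh keeps the command in (-1, 1)

    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    # Maps the concatenated state and action (4 + 1 inputs, as in Table 2) to a scalar Q-value
    def __init__(self, state_dim=4, action_dim=1, h1=128, h2=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, h1), nn.ReLU(),
            nn.Linear(h1, h2), nn.ReLU(),
            nn.Linear(h2, 1))

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))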
The details of the input layer, hidden layers, and output layer are shown in Table 2. All hyperparameters are listed in Table 3.

4. Simulation Results and Analysis

In the training process, to ensure diversity in the random samples, we select several random positions, random flight path angles, random LOS angles, and random accelerations. The details are given in Table 4. The speeds of the pursuer and evader are $V_M$ = 120 m/s and $V_T$ = 100 m/s, respectively. The simulation takes place during the terminal guidance phase. Additionally, the acceleration is limited to a reasonable range. The simulation scenario refers to reference [22]. The experiments are run on a PC platform with an i7-9750H CPU @ 2.90 GHz and 32 GB DDR3 memory, using Python (Python 3.7) and PyTorch (PyTorch 12.1).

4.1. Training Results

First, before training the RL model, we use an experience dataset to test the correctness of the simulation environment. In this experiment, the initial states of the pursuer and the evader are random. The results are presented in Figure 4 and Table 5.
Figure 4 shows 50 interception processes in which the pursuer and the evader are untrained and the pursuer’s acceleration is chosen randomly within $a_m \in [-n_M g, n_M g]$. It can be seen that the pursuer may fail to intercept the evader. To further illustrate this result, 1000 Monte Carlo tests are presented in Table 5. It can be found that the pursuer successfully intercepts the evader with 52.2% accuracy. Figure 4 and Table 5 are based on untrained sample data. These results confirm the correctness of the RL simulator.
Then, according to the above parameter setting, the training process is learned to test the proposed IDGGL strategy. The results are shown in Figure 5.
To make the results compact, the average reward and average steps of each episode are presented together in Figure 5a. The left y-coordinate represents the average reward, and the right y-coordinate represents the average steps. It can be noted that the average reward (blue line) of the proposed IDGGL can converge within 20,000 episodes. It can be seen that the average steps (orange line) of the proposed algorithm can converge around 60 steps, which means that the pursuer can intercept the evader within 65 steps. Figure 5b presents the interception distance of the learning process. It can be observed that the average interception distance can converge to a reasonable interception region. All results illustrate the effectiveness of the proposed IDGGL.
To further demonstrate the better performance of the designed reward function (Equation (46)), we designed a comparative experiment. In this experiment, we compared the terminal reward function, the process reward function, and the designed reward function for the learning process. The results are shown in Figure 6.
The comparison results of the average reward are shown in Figure 6. It can be found that the average reward of the designed reward function converges within 20,000 episodes, while the average reward of the terminal reward function converges within 37,000 episodes. Additionally, the average reward of the designed reward function is larger than that of the terminal reward function. Both results illustrate that the designed reward function yields faster training and a better reward. The reason behind this observation is that, with the terminal reward alone, the pursuer cannot learn from the full information and may linger in certain positions. Furthermore, when only the process reward function is used, the training process may not converge, because the pursuer does not know whether the interception is successful or not.

4.2. Test Results

In the test experiment, to verify the effectiveness of the proposed IDGGL algorithm, four experiments are conducted.

4.2.1. The Effectiveness of the Proposed IDGGL and the Designed Reward Function

First, the learned model is used to prove the effectiveness of the proposed IDGGL. In this experiment, the pursuer adopts the trained model to intercept the evader. The results are shown in Figure 7.
Figure 7 shows 50 interception games between the trained pursuer and the evader, where the pursuer’s acceleration is chosen by the trained model. It can be seen that the pursuer almost always intercepts the evader successfully. This result reveals the effectiveness of the proposed IDGGL. Additionally, comparing Figure 4 and Figure 7 shows that the proposed algorithm trains a better interception model.
To further validate the effectiveness of the designed reward function, two reward functions are compared in 1000 Monte Carlo tests. A total of 1000 Monte Carlo results with random initialization conditions are presented in Figure 8 and Table 6.
Figure 8 shows 1000 Monte Carlo results with random initialization conditions for the designed reward function and the terminal reward function. It can be found that the pursuer almost always intercepts the evader successfully with both reward functions. From Table 6, it can be found that the pursuer successfully intercepts the evader with 99.2% accuracy under the designed reward function.

4.2.2. Intercepting a Game Evader

In this experiment, the game process between the pursuer and the evader is tested, where the pursuer adopts the proposed IDGGL while the evader performs the corresponding optimal escape strategy. The initial positions of the pursuer and the evader are set at (0 m, 0 m) and (500 m, 0 m), respectively. Their initial flight path angles are $\gamma_M$ = 40° and $\gamma_T$ = 120°, respectively. The initial LOS is 0°. The interception results are presented in Figure 9.
Figure 9a shows the trajectories of the pursuer and the evader. It is a game process in which neither the pursuer nor the evader is willing to give up its own interests; thus, the interception paths appear to be straight. The change in the relative distance between the two agents is presented in Figure 9b. Both results clearly indicate that the pursuer can intercept the evader successfully. Figure 9c,d show the performance curves of the closing speed and the LOS rate. Figure 9c reveals that the closing speed converges to zero, which implies that the pursuer intercepts the evader as quickly as possible. Figure 9d shows that the LOS rate is maintained at zero, which implies that the pursuer intercepts the evader in a parallel manner; this is consistent with the interception characteristics of PNG. This result indicates that the motion directions of the pursuer and the target remain unchanged, and the paths appear to be straight. The acceleration curves of the two agents are presented in Figure 9e, which reveals that the proposed guidance algorithm provides a smooth guidance command within the limited acceleration range. It can be seen from Figure 9e that the accelerations of the pursuer and the evader approach the maximum amplitude within 0.1 s. Therefore, it can be concluded that the evader tries its best to escape from the pursuer while the pursuer tries its best to intercept the evader, which reflects that the game process is an accelerated interception process.

4.2.3. Intercepting Maneuvering Evaders

To further demonstrate the general applicability of the proposed IDGGL, two different maneuvering manners of the evader are considered. In case 1, the evader executes a square-wave maneuver with a magnitude of 10 g , as shown:
a_t = 100\, \mathrm{sgn}\!\left( \sin\!\left( 2t + \pi/2 \right) \right)
In case 2, the evader performs a sin-wave maneuver with a magnitude of 10 g , as shown:
a_t = 100 \sin\!\left( 2t + \pi/2 \right)
The simulation results for this experiment are shown in Figure 10.
Figure 10a,b present the flight trajectories of the two agents and the change in the relative distance between two agents under different conditions. It can be seen that the proposed IDGGL can successfully intercept the evader with different maneuvering manners. Additionally, the relative distance between the two agents tends to zero, which also reveals that the evader is intercepted by the pursuer. Figure 10c,d show the acceleration of two agents within 10 g . It can be seen that the proposed IDGGL can provide a smooth acceleration command for the pursuer within 10 g .

4.2.4. Comparison for Different Guidance Laws

To further demonstrate the advantage of the proposed guidance law, the OGL and the DGGL are compared with the proposed IDGGL. The simulation environment is the same for the three methods, which means that the information of the target’s motion ($\dot{r}$, $\dot{\theta}$) is known in all three methods. We define the control effort as $J = \int_{0}^{t_f} u^{T}(\tau) u(\tau)\, d\tau$, which quantifies the advantages of the proposed IDGGL. The evader executes a sin-wave maneuver. The compared results are shown in Figure 11 and Table 7.
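The control effort metric can be approximated from the sampled command history as in the sketch below; the rectangle-rule integration, the scalar command assumption, and the default sampling time are assumptions for illustration.

import numpy as np

def control_effort(u_history, Ts=0.01):
    # Rectangle-rule approximation of J = integral of u(t)^T u(t) dt for a sampled scalar command
    u = np.asarray(u_history, dtype=float)
    return float(np.sum(u ** 2) * Ts)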
Figure 11a shows the trajectories of the three different guidance laws. It can be seen that all guidance laws can successfully intercept the evader. The traditional control commands (OGL, DGGL) require manual parameter settings, most of which are based on expert experience, whereas the parameters of the proposed method are learned by DRL; therefore, the parameters are optimized, and the proposed method can save simulation time. Figure 11b presents the accelerations of the pursuer under the three guidance laws. It can be concluded that our algorithm has the smallest acceleration amplitude, the OGL the second smallest, and the DGGL the largest. More importantly, the acceleration of the DGGL may exceed 10 g, which is harmful to the system. Figure 11c shows the control effort of the pursuer. Table 7 summarizes the average results obtained from 100 Monte Carlo simulations. It can be noted that the control effort of our proposed method is the lowest of the three guidance laws, which implies that our proposed guidance law saves energy consumption and is more intelligent. Moreover, the simulation times of the three guidance laws are compared, and the simulation time of our proposed method is the shortest. All in all, the comparison of the three methods reflects the advantages of the proposed method in terms of interception accuracy, interception time, and the pursuer’s energy cost. The proposed IDGGL algorithm can intercept evaders in a short time and with a low control effort.

5. Conclusions

This paper proposes an intelligent differential game guidance law based on DRL. First, the interception problem is converted into finding the Nash equilibrium strategy, and an algorithm is proposed for obtaining the optimal IDGGL strategy. Subsequently, a reasonable reward function, which includes the interception accuracy, the energy consumption, and the interception distance, is designed for the engagement. The simulation results demonstrate that the proposed IDGGL algorithm exhibits superior performance in terms of acceleration response, control effort, and computation time. Moreover, compared with traditional guidance laws, the IDGGL algorithm can confront a complex game environment and avoid tedious manual settings. Additionally, compared with other reinforcement learning guidance laws, the proposed IDGGL algorithm takes the effect of the evader into account. Importantly, by designing a more reasonable reward function, the proposed IDGGL algorithm performs well in intercepting different maneuvering evaders, with an interception accuracy of up to 99.2%. All in all, the simulation experiments demonstrate the efficiency of the proposed IDGGL. However, the limitations and challenges of the proposed approach include the escape strategies of the evader, real-time processing constraints, the impact of obstacles, and complex scenarios. By using a state observer, employing advanced hardware, and optimizing the algorithms, these limitations can be addressed. In the terminal guidance phase, when the pursuer and the evader reach a certain altitude, the proposed intelligent differential game guidance law can be used in practical applications. For implementing the proposed method in real-world systems, some modifications should be considered, including adjustments for environmental variability, integration with existing systems, and considerations of robustness and reliability. For example, when the pursuer and evader are in a 2D scenario and the environmental state is known, the proposed algorithm can fully achieve the interception of the evader. In future work, more complex scenarios should be considered, such as changes in heading angle and the interceptor’s altitude, which would enhance the robustness and applicability of our guidance law. Additionally, the impact of external disturbances, such as wind, should be considered, leading to the development of more resilient guidance laws capable of operating under various environmental conditions.

Author Contributions

Conceptualization, A.X. and Y.C.; Methodology, A.X.; Software, A.X.; Validation, A.X.; Formal Analysis, A.X.; Investigation, A.X.; Resources, A.X.; Data Curation, A.X.; Writing—Original Draft Preparation, A.X.; Writing—Review and Editing, A.X. and Y.C.; Visualization, A.X.; Supervision, Y.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China project [62203349] and the Innovation Zone Project [23-TQ01-04-ZT-01-011]. The APC was funded by the Innovation Zone Project [23-TQ01-04-ZT-01-011].

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Acknowledgments

I would like to express my gratitude to my supervisor who helped me during the writing of this paper.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Guo, Y.; Li, X.; Zhang, H.; Cai, M.; He, F. Data-Driven Method for Impact Time Control Based on Proportional Navigation Guidance. J. Guid. Control Dyn. 2020, 43, 955–966. [Google Scholar] [CrossRef]
  2. Franzini, G.; Tardioli, L.; Pollini, L.; Innocenti, M. Visibility Augmented Proportional Navigation Guidance. J. Guid. Control Dyn. 2018, 41, 987–995. [Google Scholar] [CrossRef]
  3. Chen, X.; Wang, J. Optimal control based guidance law to control both impact time and impact angle. Aerosp. Sci. Technol. 2018, 84, 454–463. [Google Scholar] [CrossRef]
  4. Harl, N.; Balakrishnan, S.N. Impact Time and Angle Guidance with Sliding Mode Control. IEEE Trans. Control Syst. Technol. 2011, 20, 1436–1449. [Google Scholar] [CrossRef]
  5. Alqudsi, Y.S.N.; El-Bayoumi, G.M. Intercept algorithm for maneuvering targets based on differential geometry and lyapunov theory. INCAS Bull. 2018, 10, 175–192. [Google Scholar] [CrossRef]
  6. Liu, Y.; Qi, N.; Tang, Z. Linear Quadratic Differential Game Strategies with Two-pursuit Versus Single-evader. Chin. J. Aeronaut. 2012, 25, 896–905. [Google Scholar] [CrossRef]
  7. Fang, F.; Cai, Y.-L. Optimal cooperative guidance with guaranteed miss distance in three-body engagement. Proc. Inst. Mech. Eng. Part G J. Aerosp. Eng. 2016, 232, 492–504. [Google Scholar] [CrossRef]
  8. Jiang, Y.; Gao, W.; Na, J.; Zhang, D.; Hämäläinen, T.T.; Stojanovic, V.; Lewis, F.L. Value iteration and adaptive optimal output regulation with assured convergence rate. Control Eng. Pract. 2022, 121, 105042. [Google Scholar] [CrossRef]
  9. Vamvoudakis, K.G.; Lewis, F. Online solution of nonlinear two-player zero-sum games using synchronous policy iteration. Int. J. Robust Nonlinear Control 2011, 22, 1460–1483. [Google Scholar] [CrossRef]
  10. Dierks, T.; Jagannathan, S. Online Optimal Control of Affine Nonlinear Discrete-Time Systems with Unknown Internal Dynamics by Using Time-Based Policy Update. IEEE Trans. Neural Netw. Learn. Syst. 2012, 23, 1118–1129. [Google Scholar] [CrossRef]
  11. Yasini, S.; Sistani, M.B.N.; Karimpour, A. Approximate dynamic programming for two-player zero-sum game related to H ∞ control of unknown nonlinear continuous-time systems. Int. J. Control Autom. Syst. 2014, 13, 99–109. [Google Scholar] [CrossRef]
  12. Harmon, M.E.; Baird, L.C.; Klopf, A.H. Reinforcement Learning Applied to a Differential Game. Adapt. Behav. 1995, 4, 3–28. [Google Scholar] [CrossRef]
  13. Lee, D.; Bang, H. Planar evasive aircrafts maneuvers using reinforcement learning. In Intelligent Autonomous Systems 12; Lee, S., Cho, H., Yoon, K.J., Eds.; Advances in Intelligent Systems and Computing; Springer: Berlin/Heidelberg, Germany, 2013; Volume 193. [Google Scholar] [CrossRef]
  14. Desouky, S.F.; Schwartz, H.M. Q(λ)-learning adaptive fuzzy logic controllers for pursuit-evasion differential games. Int. J. Adapt. Control Signal Process. 2011, 25, 910–927. [Google Scholar] [CrossRef]
  15. Tai, J.J.; Wong, J.; Innocente, M.; Horri, N.; Brusey, J.; Phang, S.K. PyFlyt—UAV Simulation Environments for Reinforcement Learning Research. arXiv 2023, arXiv:2304.01305. [Google Scholar]
  16. Wei, W.; Wang, J.; Du, J.; Fang, Z.; Ren, Y.; Chen, C.L.P. Differential game-based deep reinforcement learning in underwater target hunting task. IEEE Trans. Neural Netw. Learn. Syst. 2023, 13, 37889822. [Google Scholar] [CrossRef] [PubMed]
  17. Wang, N.; Wang, X.; Cui, N.; Li, Y.; Liu, B. Deep reinforcement learning-based impact time control guidance law with constraints on the field-of-view. Aerosp. Sci. Technol. 2022, 128, 107765. [Google Scholar] [CrossRef]
  18. Guo, Y.; Jiang, Z.; Huang, H.; Fan, H.; Weng, W. Intelligent Maneuver Strategy for a Hypersonic Pursuit-Evasion Game Based on Deep Reinforcement Learning. Aerospace 2023, 10, 783. [Google Scholar] [CrossRef]
  19. Yan, T.; Jiang, Z.; Li, T.; Gao, M.; Liu, C. Intelligent maneuver strategy for hypersonic vehicles in three-player pursuit-evasion games via deep reinforcement learning. Front. Neurosci. 2024, 18, 1362303. [Google Scholar] [CrossRef]
  20. Sun, J.; Liu, C. Finite-horizon differential games for missile–target interception system using adaptive dynamic programming with input constraints. Int. J. Syst. Sci. 2018, 49, 264–283. [Google Scholar] [CrossRef]
  21. Wang, X.; Deng, Y.; Cai, Y.; Jiang, H. Deep Recurrent Reinforcement Learning for Intercept Guidance Law under Partial Observability. Appl. Artif. Intell. 2024, 38, 2355023. [Google Scholar] [CrossRef]
  22. Tai, J.J.; Phang, S.K.; Wong, F.Y.M. COAA*—An optimized obstacle avoidance and navigational algorithm for UAVs operating in partially observable 2D environments. Unmanned Syst. 2022, 10, 159–174. [Google Scholar] [CrossRef]
Figure 1. Pursuit–evasion game geometry.
Figure 2. Basic structure of system and algorithm flowchart. (a) Basic structure of system. (b) Algorithm flowchart.
Figure 3. The structure of the actor–critic neural network. (a) Actor network, (b) critic network.
Figure 4. Training data.
Figure 5. Training results. (a) Learning curves of reward and steps. (b) Miss distance of training process.
Figure 6. The comparison results.
Figure 7. Random test results.
Figure 8. The Monte Carlo test results.
Figure 9. The results of intercepting a constant velocity evader. (a) Flight trajectories. (b) Relative range. (c) Closing speed. (d) LOS rate. (e) The acceleration of two agents.
Figure 10. Game scenarios of two cases. (a) Flight trajectory of two agents. (b) Relative distance of two cases. (c) Lateral acceleration of pursuer. (d) Lateral acceleration of evader.
Figure 11. Contrast results. (a) Game trajectories of three guidance laws. (b) Accelerations of pursuer. (c) Control efforts of pursuer.
Table 1. The parameters of the reward function.
Parameter    a_1    k_r    k_a    k_v1    k_v2
Value        100    10     0.5    0.1     0.1
Table 2. The architecture of the actor–critic network.
Layer             Actor Network (Units / Activation)     Critic Network (Units / Activation)
Input             4 (states) / None                      5 (states + actions) / None
Hidden layer 1    64 / ReLU                              64 / ReLU
Hidden layer 2    16 / ReLU                              16 / ReLU
Output            1 / Tanh                               1 / Tanh
Table 3. Hyperparameter settings.
Parameter                   Value
Maximum steps               1000
Maximum episodes            1000
Actor learning rate         0.0005
Critic learning rate        0.0005
Experience buffer           5000
Minibatch samples           128
Discount factor             0.99
Soft update coefficient     0.1
Gradient steps              −1
Action noise                0.1
Mean attraction constant    0.05
Sampling time               0.01
Policy                      Mlp-policy
Table 4. Random initial conditions of the training process.
Parameter      Range
x_M            0–100 m
y_M            0–100 m
x_T            400–500 m
y_T            0–100 m
γ_M            1°–60°
γ_T            10°–120°
LOS            −5°–5°
τ_M, τ_T       0.1–0.2 s
n_M, n_T       0–10
Table 5. The Monte Carlo results of the training process.
Successful    Fail    Monte Carlo Number
478           522     1000
Table 6. The details of the Monte Carlo results.
Reward Function                 Successful    Fail    Monte Carlo Number
The designed reward function    992           8       1000
The terminal reward function    948           52      1000
Table 7. The performance summary of the three guidance laws.
Method        Control Effort (J)    Simulation Time (s)    Monte Carlo Number
OGL           289.4                 1.647                  100
DGGL          264.2                 0.883                  100
Our method    152.4                 0.058                  100