
Online Nonstationary and Nonlinear Bandits with Recursive Weighted Gaussian Process


IEEE COMPSAC 2024
https://ieeecompsac.computer.org/2024/
Osaka, Japan, July 2–4, 2024

monochromegane

July 04, 2024

Transcript

  1. Online Nonstationary and Nonlinear Bandits with Recursive Weighted Gaussian Process

    Yusuke Miyake [1][2], Ryuji Watanabe [2], and Tsunenori Mine [1]. [1] Kyushu University, [2] Pepabo R&D Institute, GMO Pepabo, Inc. July 2–4, 2024. The 48th IEEE International Conference on Computers, Software, and Applications (COMPSAC 2024)
  2. Background
    • Selecting the optimal behavior from many candidates is crucial for practical applications.
    • The effectiveness of each candidate cannot be known in advance.
    • Continuous comparative evaluation in an actual environment is essential.
    • However, such evaluation incurs opportunity loss, both short-term and long-term.
    • Reducing this opportunity loss can be formulated as a multi-armed bandit (MAB) problem.
  3. Multi-Armed Bandit (MAB) Problem
    • The player must select one arm from multiple arms to maximize the reward.
    • The reward is given stochastically based on the chosen arm.
    • The player needs to infer the reward distribution from the trial results.
    • To find the optimal arm, the player should balance exploitation and exploration.
    • The term 'arm' is derived from the 'arm' of a slot machine.
  4. The simplest MAB policy: ϵ-Greedy
    • The ϵ-Greedy policy selects an arm uniformly at random with probability ϵ (exploration) and selects the arm with the highest average reward at that time with probability 1 − ϵ (exploitation).
    • Selection rule: choose argmax_{l=1,…,L} ŷ^(l) with probability 1 − ϵ, and each arm a ∈ A with probability ϵ/L otherwise.
    • The slide contrasts this bandit rule with A/B testing, in which every arm a ∈ A keeps being selected for comparison. A sketch of the rule follows below.
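For illustration only, a minimal NumPy sketch of the ϵ-Greedy rule above; the arm count, reward means, and function names are hypothetical and not taken from the paper.

```python
import numpy as np

def epsilon_greedy_select(avg_rewards, epsilon, rng):
    """Pick an arm: explore uniformly with probability epsilon, otherwise exploit."""
    L = len(avg_rewards)
    if rng.random() < epsilon:
        return int(rng.integers(L))          # exploration: each arm with probability epsilon/L
    return int(np.argmax(avg_rewards))       # exploitation: argmax_l  y_hat^(l)

# Toy usage: 3 hypothetical arms with running average rewards.
rng = np.random.default_rng(0)
counts, sums = np.zeros(3), np.zeros(3)
for t in range(1000):
    arm = epsilon_greedy_select(sums / np.maximum(counts, 1), epsilon=0.1, rng=rng)
    reward = rng.normal(loc=[0.1, 0.0, 1.0][arm], scale=0.1)   # hypothetical reward means
    counts[arm] += 1
    sums[arm] += reward
```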
  5. Motivation of our Approach
    • Nonlinearity: in practical applications such as e-commerce, user behavior and preferences often exhibit complex, nonlinear patterns that cannot be captured by simple linear models.
    • Nonstationarity: user preferences and environmental conditions in real-world applications change over time.
    • Online performance: response delays are detrimental to the user experience in real-world applications.
  6. Gaussian Process Regression
    • A method to infer the distribution of a function as a stochastic process from data.
    • Effective in modeling nonlinear functions and prediction uncertainties.
    • Use of a kernel function k(x_i, x_j): efficiently computes the inner product in a high-dimensional feature space defined by basis functions.
    • Application to nonlinear MAB problems: widely explored.
    • Challenge of GP regression: training requires computing K⁻¹ with K ∈ ℝ^{N×N}, so training time grows rapidly (cubically in N) as the dataset grows, and few policies have been developed to handle nonstationarity. A predictive-distribution sketch follows below.
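To make the scaling issue concrete, here is a minimal sketch of exact GP regression with an RBF kernel; the lengthscale, noise value, and function names are assumptions for illustration, not the paper's settings.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """RBF kernel k(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 * lengthscale^2))."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dists / lengthscale ** 2)

def gp_predict(X, y, X_star, noise=0.01):
    """Exact GP regression: the predictive distribution needs K^-1 with K of size N x N."""
    K = rbf_kernel(X, X) + noise * np.eye(len(X))   # K in R^{N x N}
    K_inv = np.linalg.inv(K)                        # the costly step as N grows (roughly O(N^3))
    K_s = rbf_kernel(X_star, X)
    mean = K_s @ K_inv @ y
    cov = rbf_kernel(X_star, X_star) - K_s @ K_inv @ K_s.T
    return mean, cov
```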
  7. Weighted GP-UCB [Y. Deng 2022]
    • A GP regression model-based nonstationary and nonlinear policy.
    • Key approximation: K ≃ ZZ⊤ with K ∈ ℝ^{N×N}, Z ∈ ℝ^{N×R}, and Z⊤Z ∈ ℝ^{R×R}, where R ⋘ N.
    • Nonstationarity: focuses on new training data by using two new weight matrices.
    • Online performance: using Random Fourier Features (RFF), computes the predictive distribution of GP regression in the form of linear regression in R-dimensional space, keeping the size of the inverted matrix constant and reducing computational complexity.
    • Limitation: the absence of a recursive learning algorithm means it only partially mitigates the escalating training time as the dataset grows.
  8. NysKRLS [T. Zhang 2020]
    • A nonstationary kernel Recursive Least Squares (RLS) method.
    • Nonstationarity: introduces a forgetting mechanism for old training data.
    • Online performance: using the Nyström approximation, computes the predictive distribution of GP regression in the form of linear regression in R-dimensional space, where R ⋘ N, and applies a linear RLS algorithm.
    • Limitation: the regularization effect decreases as the number of recursive computations increases, leaving an inverse of the form (Z⊤ΓZ + γ^M Λ)⁻¹; the resulting loss of estimation accuracy due to overfitting can be critical.
  9. Our Proposed Policy
    • An online nonstationary and nonlinear contextual MAB policy.
    • Nonlinearity and nonstationarity: introduces a forgetting mechanism into a nonlinear GP regression model.
    • Online performance: using RFF, computes the predictive distribution of GP regression in the form of linear regression in R-dimensional space, where R ⋘ N, and applies a linear RLS algorithm.
    • Key features: fast decision-making with recursive learning, and accurate error correction of the predictive distribution (a UCB-style arm-selection sketch follows below).
    • (Diagram: parameters P_{N,M} and Q_{N,M} are updated recursively with the forgetting factor γ from each new observation (x_N, y_N), yielding predictive distributions p(y_N ∣ x_N, X, y) and p(y_* ∣ x_*, X, y); the (M = 0) marks indicate steps where the accumulated error has just been corrected.)
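As a rough illustration of how a GP-UCB-style policy turns a predictive distribution into an arm choice (with the constant exploration scale β = 1 used later in the experiments), here is a minimal sketch; the `predict` interface and names are assumptions for illustration, not the authors' exact RW-GPB algorithm.

```python
import numpy as np

def ucb_select(contexts, predict, beta=1.0):
    """Score each arm's context by mean + beta * std and pick the best arm.

    contexts : array of shape (L, d), one context vector per arm.
    predict  : callable returning (means, variances) for a batch of contexts;
               this interface is an assumption made for illustration.
    """
    means, variances = predict(contexts)
    scores = means + beta * np.sqrt(np.maximum(variances, 0.0))
    return int(np.argmax(scores))
```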
  10. Random Fourier Features (RFF)
    • A method to approximate the kernel function k(x_i, x_j) using R′ = R/2 samples from a probability distribution p(ω): k(x_i, x_j) ≃ z(x_i)⊤z(x_j), hence K ≃ ZZ⊤ with K ∈ ℝ^{N×N}, Z ∈ ℝ^{N×R}, and Z⊤Z ∈ ℝ^{R×R}.
    • Linear method: by decomposing the kernel function into R-dimensional features of the inputs, the original problem can be solved as a linear problem in R-dimensional space (see the sketch below).
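A minimal sketch of RFF for an RBF kernel, assuming a Gaussian spectral distribution p(ω); dimensions, seed, and names are illustrative.

```python
import numpy as np

def make_rff_map(d, R_prime, lengthscale=1.0, seed=0):
    """Random Fourier Features for an RBF kernel: draw R' = R/2 frequencies
    from p(omega) = N(0, I / lengthscale^2) and build an R-dimensional feature z(x)."""
    rng = np.random.default_rng(seed)
    Omega = rng.normal(0.0, 1.0 / lengthscale, size=(d, R_prime))
    def z(X):
        proj = X @ Omega                                        # shape (N, R')
        return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(R_prime)
    return z

# k(x_i, x_j) ≃ z(x_i)^T z(x_j), so K ≃ Z Z^T with Z of shape (N, R).
z = make_rff_map(d=2, R_prime=64)
X = np.random.default_rng(1).normal(size=(5, 2))
Z = z(X)
K_approx = Z @ Z.T
```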
  11. Comparative Learning Methods
    • GP learning: every new sample requires inverting the growing kernel matrix K′ ∈ ℝ^{N×N}.
    • GP with RFF learning: every new sample requires inverting Z⊤Z + Λ, where Z⊤Z ∈ ℝ^{R×R}, so the size of the inverted matrix stays constant.
  12. Forgetting Mechanism
    • Incorporating exponential forgetting for past inputs and outputs enhances estimation accuracy in nonstationary environments.
    • The proposed method assumes that more distant past data has larger observation errors ϵ_n, and thus lower accuracy; recent data has smaller observation errors.
    • (Figure: predictive mean with 1σ confidence band for a nonstationary and nonlinear regression problem; past training data comes from f_A and recent training data from f_B. The prediction μ fits the recent training data, enabling quick adaptation to changing environments.)
  13. Comparative Learning Methods
    • GP with RFF learning: every new sample requires inverting Z⊤Z + Λ, where Z⊤Z ∈ ℝ^{R×R}.
    • Nonstationary GP with RFF learning: every new sample requires inverting Z⊤ΓZ + Λ, where Γ holds the forgetting weights and Z⊤ΓZ ∈ ℝ^{R×R}. A sketch of this batch computation follows below.
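A minimal sketch of the batch nonstationary computation above, i.e., weighted ridge regression in RFF space with exponential forgetting weights; the variable names, the weight construction, and the returned uncertainty scale are illustrative assumptions, not the paper's exact formulas.

```python
import numpy as np

def weighted_rff_regression(Z, y, z_star, gamma=0.99, lam=1.0):
    """Batch weighted ridge regression in RFF space (illustrative sketch).

    Gamma = diag(gamma^(N-1), ..., gamma^0) down-weights older rows of Z,
    reflecting the assumption that older data carries larger observation error.
    Every new sample requires re-inverting the R x R matrix Z^T Gamma Z + Lambda.
    """
    N, R = Z.shape
    Gamma = np.diag(gamma ** np.arange(N - 1, -1, -1))     # oldest sample -> smallest weight
    P = np.linalg.inv(Z.T @ Gamma @ Z + lam * np.eye(R))   # (Z^T Gamma Z + Lambda)^-1
    w = P @ Z.T @ Gamma @ y                                # weighted least-squares solution
    mean = z_star @ w
    scale = z_star @ P @ z_star   # uncertainty scale; the paper's exact variance may differ
    return mean, scale
```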
  14. Recursive Learning Mechanism
    • Recursive Least Squares (RLS) updates the regression model parameters using previous calculations and new observations.
    • Efficiency: the algorithm computes the parameters at time N + 1 from the results up to time N and the value observed at time N + 1.
    • Reduced computational cost: by avoiding the recomputation of the inverse matrix at every step, the algorithm significantly reduces computational cost (see the sketch below).
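A minimal sketch of a textbook exponentially weighted RLS update via the Sherman–Morrison identity, which avoids re-inverting the R × R matrix at each step; names and initialization are assumptions, not the authors' exact recursion.

```python
import numpy as np

def rls_update(P, w, z, y, gamma=0.99):
    """One exponentially weighted RLS step (textbook Sherman-Morrison form).

    P : current inverse matrix of size R x R
    w : current weight vector of length R
    z : new RFF feature vector, y : new scalar observation
    The update reuses the previous P and w; no full matrix inversion is needed.
    """
    Pz = P @ z
    k = Pz / (gamma + z @ Pz)              # gain vector
    w_new = w + k * (y - z @ w)            # correct the prediction error on the new sample
    P_new = (P - np.outer(k, Pz)) / gamma  # gamma also rescales the regularization term here,
                                           # which is the source of the error discussed next
    return P_new, w_new
```

Starting from P = (λI)⁻¹ and applying this update repeatedly is exactly what lets the forgetting factor leak into the regularizer, producing the (Z⊤ΓZ + γ^M Λ)⁻¹ form discussed on the following slides.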
  15. Comparative Learning Methods
    • Batch learning: every new sample recomputes (Z⊤ΓZ + Λ)⁻¹ from scratch.
    • Recursive learning: each step reuses the previous result, but the forgetting factor γ is also applied to the regularization term, yielding (Z⊤ΓZ + γΛ)⁻¹, (Z⊤ΓZ + γ²Λ)⁻¹, …, (Z⊤ΓZ + γ^M Λ)⁻¹ after M recursive updates.
  16. Estimation Error in Recursive Learning
    • Estimation error arises in recursive learning because the forgetting effect is recursively applied to the regularization term: the recursive result (Z⊤ΓZ + γ^M Λ)⁻¹ differs from the batch result (Z⊤ΓZ + Λ)⁻¹.
    • For MAB policies that run for long periods, the loss of estimation accuracy due to overfitting can be fatal.
    • Addressing this error introduces a trade-off between accuracy and online performance.
    • (Figure: estimation error of the predictive distribution parameters μ̂ and Σ̂ for M = 0, 200, 400, 600.)
  17. Error Correction Method
    • A novel recursive error correction method balances estimation accuracy and online performance.
    • Error correction steps: 1. invert the recursively maintained matrix; 2. subtract the error in the regularization term (restoring γ⁰Λ = Λ); 3. invert again, recovering (Z⊤ΓZ + Λ)⁻¹.
    • The correction has a high computational cost because it inverts an R × R matrix twice.
    • The method is therefore executed at intervals chosen from the acceptable estimation accuracy, rather than every time.
    • (Figure: there is no estimation error when M = 0, i.e., immediately after a correction.) A sketch of the correction step follows below.
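A minimal sketch of the three correction steps as read from the slide; this is an illustrative reconstruction (variable names and the exact form of the corrected term are assumptions), not verified code from the paper.

```python
import numpy as np

def correct_regularization(P, Lam, gamma, M):
    """Three-step correction as read from the slide (illustrative reconstruction):
    1. invert the recursively maintained P = (Z^T Gamma Z + gamma^M Lambda)^-1,
    2. replace the decayed regularizer gamma^M * Lambda with the intended Lambda,
    3. invert again, recovering (Z^T Gamma Z + Lambda)^-1.
    Two R x R inversions, so the correction is run every tau steps rather than every step."""
    A = np.linalg.inv(P)                        # step 1
    A_corrected = A + (1.0 - gamma ** M) * Lam  # step 2: remove gamma^M * Lambda, add Lambda
    return np.linalg.inv(A_corrected)           # step 3: M effectively resets to 0
```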
  18. Simulation Setup
    • Nonstationary and nonlinear contextual MAB simulation.
    • Each arm's reward follows a normal distribution 𝒩(μ, σ²), with the mean μ determined by the context x_t = (x_{t,d})_{1≤d≤2}.
    • The banded curve moves leftward over time (one full rotation in 4000 trials).
    • The arm a^(1) always has μ = μ₁. Parameters: μ₁ = 0.1, μ₂ = 0.0, μ₃ = 1.0, σ² = 0.01, δ = 0.8, ρ = 4000.
    • The magnitudes of the means satisfy μ₂ < μ₁ ≪ μ₃. To maximize the expected reward, the player must select the corresponding arm when the context lies within the band curve and choose arm a^(1) otherwise.
  19. Baseline Policies

    Policy                   | Nonlinear | Nonstationary | Recursive learning | Note
    RW-GPB (Proposal)        | ✓         | ✓             | ✓                  | Evaluated with multiple correction intervals τ to compare error correction effects.
    GP+UCB (Weighted, RFF)   | ✓         | ✓             |                    | State-of-the-art; reduced learning time with RFF.
    GP+UCB (Weighted)        | ✓         | ✓             |                    |
    GP+UCB (Sliding Window)  | ✓         | ✓             |                    |
    GP+UCB (Restarting)      | ✓         | ✓             |                    |
    GP+UCB                   | ✓         |               |                    |
    Decay LinUCB             |           | ✓             | ✓                  |

    A constant exploration scale β = 1 is used for all policies to clarify the effect of each policy's regression model.
  20. Simulation Results: Trade-off Analysis
    • The proposed RW-GPB policy achieves higher cumulative rewards and shorter execution time than the state-of-the-art GP+UCB (Weighted, RFF) policy.
    • Compared to GP+UCB (Weighted, RFF), RW-GPB (τ = 1) reduces the execution time by 71% with equal rewards, and RW-GPB (τ = 40) reduces the execution time by 92% with higher rewards.
    • The GP+UCB (Weighted) policy without RFF had the highest reward and the longest execution time.
    • Improving the approximation accuracy of the kernel function is therefore also essential.
  21. Simulation Results: Trade-off Analysis
    • Accumulated errors reduce the cumulative reward, while the frequency of error correction has virtually no effect on execution time.
    • Error correction should therefore be performed aggressively.
    • Interestingly, the cumulative reward is higher for τ = 4 than for the most accurate setting τ = 1.
    • This result indicates that a slight increase in exploration frequency may be helpful in nonstationary environments.
    • (Figure: cumulative rewards vs. trials per second for RW-GPB with τ ∈ {1, 4, 40, 100, 400, 800, 1600}, GP+UCB (Sliding Window), GP+UCB (Weighted), and GP+UCB (Weighted, RFF).)
  22. Simulation Results: Computation Time
    • The proposed RW-GPB policy keeps the execution time per trial constant.
    • The policy without recursive learning increases the execution time linearly.
    • In addition, the policy without RFF increases it even more steeply.
    • (Figure: cumulative execution time in seconds over 4000 trials for RW-GPB (τ = 4), GP+UCB (Sliding Window), GP+UCB (Weighted), and GP+UCB (Weighted, RFF).)
  23. Conclusion
    • RW-GPB policy: we introduced RW-GPB, a new online policy for nonstationary and nonlinear contextual MAB problems that balances accuracy and online performance.
    • RW-GPR model: our novel RW-GPR model, equipped with a practical error correction method, effectively implements the proposed policy.
    • Experimental results: RW-GPB significantly reduces computation time while maintaining cumulative reward in simulations.
  24. Future Work
    • Meta-recommender system: we aim to implement and evaluate a meta-recommender system that autonomously optimizes the selection of recommendation algorithms using the proposed policy.
    • Client-side agents: future research will explore applying this lightweight policy to client-side agents to solve complex autonomous tasks on resource-constrained devices.
    • Real-world effectiveness: we expect the proposed policy to enhance the effectiveness of autonomous systems across various real-world scenarios.