Near-continuous time Reinforcement Learning for continuous state-action spaces

Lorenzo Croissant; Marc Abeille; Bruno Bouchard

Preprint, Working Paper. Year: 2023

Lorenzo Croissant

Role: Corresponding author
PersonId : 1079547

Criteo AI Lab

CEntre de REcherches en MAthématiques de la DEcision

Université Paris Dauphine-PSL

IA coopérative : équité, vie privée, incitations

Marc Abeille

Role: Author
PersonId : 1072166

Criteo AI Lab

Bruno Bouchard

Role: Author
PersonId : 1038139
ORCID : 0000-0002-4716-1253

CEntre de REcherches en MAthématiques de la DEcision

Université Paris Dauphine-PSL

Abstract

We consider the Reinforcement Learning problem of controlling an unknown dynamical system to maximise the long-term average reward along a single trajectory. Most of the literature considers system interactions that occur in discrete time and discrete state-action spaces. Although this standpoint is suitable for games, it is often inadequate for mechanical or digital systems in which interactions occur at a high frequency, if not in continuous time, and whose state spaces are large if not inherently continuous. Perhaps the only exception is the Linear Quadratic framework for which results exist both in discrete and continuous time. However, its ability to handle continuous states comes with the drawback of a rigid dynamic and reward structure. This work aims to overcome these shortcomings by modelling interaction times with a Poisson clock of frequency $\varepsilon^{-1}$, which captures arbitrary time scales: from discrete ($\varepsilon=1$) to continuous time ($\varepsilon\downarrow0$). In addition, we consider a generic reward function and model the state dynamics according to a jump process with an arbitrary transition kernel on $\mathbb{R}^d$. We show that the celebrated optimism protocol applies when the sub-tasks (learning and planning) can be performed effectively. We tackle learning within the eluder dimension framework and propose an approximate planning method based on a diffusive limit approximation of the jump process. Overall, our algorithm enjoys a regret of order $\tilde{\mathcal{O}}(\varepsilon^{1/2} T+\sqrt{T})$. As the frequency of interactions blows up, the approximation error $\varepsilon^{1/2} T$ vanishes, showing that $\tilde{\mathcal{O}}(\sqrt{T})$ is attainable in near-continuous time.
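The interaction model described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: it only draws interaction times from a Poisson clock of frequency $\varepsilon^{-1}$ (exponential inter-arrival times with mean $\varepsilon$) and evaluates the two terms of the stated rate $\varepsilon^{1/2} T+\sqrt{T}$ with constants and logarithmic factors ignored. The function names `poisson_clock_times` and `regret_terms`, the horizon value, and the seed are assumptions made for illustration.

```python
import random


def poisson_clock_times(eps, horizon, seed=0):
    """Interaction times in [0, horizon] for a Poisson clock of rate 1/eps.

    Hypothetical helper: inter-arrival times are i.i.d. exponential
    with mean eps, so about horizon/eps interactions are expected.
    """
    rng = random.Random(seed)
    times = []
    t = rng.expovariate(1.0 / eps)  # expovariate takes the rate, i.e. 1/mean
    while t <= horizon:
        times.append(t)
        t += rng.expovariate(1.0 / eps)
    return times


def regret_terms(eps, horizon):
    """The two terms of the stated rate, without constants or log factors."""
    approximation_term = (eps ** 0.5) * horizon
    learning_term = horizon ** 0.5
    return approximation_term, learning_term


if __name__ == "__main__":
    T = 100.0  # illustrative horizon, not taken from the paper
    for eps in (1.0, 0.1, 0.01):
        n = len(poisson_clock_times(eps, T))
        approx, learn = regret_terms(eps, T)
        print(f"eps={eps}: {n} interactions, "
              f"eps^0.5*T={approx:.1f}, sqrt(T)={learn:.1f}")
```

As `eps` shrinks, the number of interactions over a fixed horizon grows roughly like `T / eps`, while the first term of the rate shrinks like `eps ** 0.5`, which is the trade-off the abstract describes.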

Keywords

Reinforcement Learning; Control; Online Learning

Domains

Artificial Intelligence [cs.AI]; Optimization and Control [math.OC]; Statistics [math.ST]

Main file

paper.pdf (558.56 KB)

Origin: Files produced by the author(s)

https://hal.science/hal-04196722

Submitted on: Tuesday, 5 September 2023, 14:36:51

Last modified on: Saturday, 27 April 2024, 03:17:01

Dates and versions

hal-04196722 , version 1 (05-09-2023)

Identifiers

HAL Id: hal-04196722, version 1
arXiv: 2309.02815

Cite

Lorenzo Croissant, Marc Abeille, Bruno Bouchard. Near-continuous time Reinforcement Learning for continuous state-action spaces. 2023. ⟨hal-04196722⟩

Collections

X GENES CNRS INRIA UNIV-DAUPHINE ENSAE INSMI CEREMADE CREST ENSAI INRIA2 TDS-MACS PSL X-CREST IP_PARIS GS-COMPUTER-SCIENCE
