Combining Online and Offline Knowledge in UCT
Sylvain Gelly 1, David Silver 2
International Conference on Machine Learning (2007)
Abstract: The UCT algorithm learns a value function online using sample-based search. The TD(λ) algorithm can learn a value function offline for the on-policy distribution. We consider three approaches for combining offline and online value functions in the UCT algorithm. First, the offline value function is used as a default policy during Monte-Carlo simulation. Second, the UCT value function is combined with a rapid online estimate of action values. Third, the offline value function is used as prior knowledge in the UCT search tree. We evaluate these algorithms in 9 × 9 Go against GnuGo 3.7.10. The first algorithm performs better than UCT with a random simulation policy, but surprisingly, worse than UCT with a weaker, handcrafted simulation policy. The second algorithm outperforms UCT altogether. The third algorithm outperforms UCT with handcrafted prior knowledge. We combine these algorithms in MoGo, the world's strongest 9 × 9 Go program. Each technique significantly improves MoGo's playing strength.
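To illustrate the second approach (blending the UCT value with a rapid online estimate), here is a minimal Python sketch. The weighting schedule β(s,a) = √(k / (3n(s) + k)) is the one given in the paper; the function names, the particular value of k, and the surrounding structure are illustrative assumptions, not MoGo's actual implementation.

```python
import math

# Equivalence parameter: the number of simulations at which the two
# estimates receive equal weight (beta = 1/2 when n_uct = k).
# The specific value below is illustrative, not from the paper.
K = 1000

def combined_value(q_uct, n_uct, q_rave, k=K):
    """Blend the slow-but-unbiased UCT estimate with a fast-but-biased
    rapid (RAVE-style) online estimate of the action value.

    With few UCT simulations, beta is near 1 and the rapid estimate
    dominates; as n_uct grows, beta -> 0 and the UCT estimate takes over.
    """
    if n_uct == 0:
        return q_rave  # no direct samples yet: trust the rapid estimate
    beta = math.sqrt(k / (3 * n_uct + k))
    return beta * q_rave + (1 - beta) * q_uct

# Example: early in search the rapid estimate dominates...
print(combined_value(q_uct=0.4, n_uct=10, q_rave=0.7))   # close to 0.7
# ...while after many simulations the UCT estimate takes over.
print(combined_value(q_uct=0.4, n_uct=10000, q_rave=0.7))  # close to 0.4
```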
- 1: TAO (INRIA Futurs) – INRIA – CNRS: UMR 8623 – Université Paris XI - Paris Sud
- 2: University of Alberta, Canada
- Domain: Computer Science / Artificial Intelligence;
Computer Science / Computer Science and Game Theory;
Computer Science / Machine Learning
- inria-00164003, version 1
- http://hal.inria.fr/inria-00164003
- oai:hal.inria.fr:inria-00164003
- Contributor: Sylvain Gelly
- Submitted on: Thursday, July 19, 2007, 13:51:04
- Last modified on: Thursday, July 19, 2007, 16:51:10