inria-00607121, version 2
Query Induction with Schema-Guided Pruning Strategies
Joachim Niehren (1, 2), Jérôme Champavère (1), Rémi Gilleron (1, 3), Aurélien Lemay (1, 2)
Journal of Machine Learning Research 14 (2013) 927−964
Abstract: Inference algorithms for tree automata that define node-selecting queries in unranked trees rely on tree pruning strategies. These impose additional assumptions on node selection that are needed to compensate for small numbers of annotated examples. Pruning-based heuristics in query learning algorithms for Web information extraction often boost the learning quality and speed up the learning process. We distinguish the class of regular queries that are stable under a given schema-guided pruning strategy, and show that this class is learnable with polynomial time and data. Our learning algorithm is obtained by adding pruning heuristics to the traditional learning algorithm for tree automata from positive and negative examples. While justified by a formal learning model, our learning algorithm for stable queries also performs very well in practice on XML information extraction tasks.
- 1 : Laboratoire d'Informatique Fondamentale de Lille (LIFL)
- CNRS : UMR8022 – Université Lille I - Sciences et technologies – Université Lille III - Sciences humaines et sociales – INRIA
- 2 : LINKS (INRIA Lille - Nord Europe)
- INRIA – CNRS : UMR8022 – Université Lille I - Sciences et technologies – Université Lille III - Sciences humaines et sociales
- 3 : MAGNET (INRIA Lille - Nord Europe)
- INRIA – CNRS : UMR8022 – Université Lille I - Sciences et technologies – Université Lille III - Sciences humaines et sociales
- Domain: Computer Science/Machine Learning
- Keywords: grammatical inference – RPNI – tree automata – XML information extraction
- Available versions: v1 (17-01-2013) v2 (02-04-2013)
- http://hal.inria.fr/inria-00607121
- oai:hal.inria.fr:inria-00607121
- Contributor: Joachim Niehren
- Submitted on: Friday 29 March 2013, 20:46:25
- Last modified on: Friday 10 May 2013, 01:36:20