SlideShare a Scribd company logo
Lucene And Solr Document
Classification
Alessandro Benedetti, Software Engineer, Sease Ltd.
Alessandro Benedetti
● Search Consultant
● R&D Software Engineer
● Master in Computer Science
● Apache Lucene/Solr Enthusiast
● Semantic, NLP, Machine Learning Technologies passionate
● Beach Volleyball Player & Snowboarder
Who I am
● Classification
● Lucene Approach
● Solr Integration
● Demo
● Extensions
● Future Work
Agenda
“Classification is the problem of identifying
to which of a set of categories
(sub-populations) a new observation
belongs, on the basis of a training set of data
containing observations (or instances)
whose category membership is known. “
Wikipedia
Classification
● E-mail spam filter
● Document categorization
● Sexually explicit content detection
● Medical diagnosis
● E-commerce
● Language identification
Real World Use Cases
● Supervised learning
● Labelled training samples
● Documents modelled as
feature vectors
● Term occurrences as features
● Model predicts unseen documents
label
Basics Of Text Classification
Apache Lucene
Apache LuceneTM
is a high-performance, full-featured text search engine library
written entirely in Java.
It is a technology suitable for nearly any application that requires full-text search,
especially cross-platform.
Apache Lucene is an open source project available for free download.
● Lucene index has complex data structures
● Lot of organizations have already indexes in place
● Pre existent data can be used to classify
● No need to train a model from a separate training set
● From training set to Inverted index
Apache Lucene For Classification
● Advanced configurable text analysis
● Term frequencies
● Term positions
● Document frequencies
● Norms
● Part of speech tags and custom payload
Apache Lucene For Classification
● Given an index with labelled documents
● Each document has a class field
● Given an unknown document in input
● Given a set of relevant fields
● Search the top K most similar documents
● Fetch the classes from the retrieved documents
● Return most occurring class(es)
● Class ranking in retrieved documents is important !
K Nearest Neighbours
● KNN uses Lucene More Like This
● Lucene query component
● Extract interesting terms* from the input document fields
● Build a Lucene query
● Run the query against the search index
● Resulting documents are “the similar documents”
* an interesting term is a term :
- occurring frequently in the seed document (high term frequency)
- but quite rare in the corpus (high inverted document frequency)
More Like This
Assumptions
● Term occurrences are probabilistic independent features
● Terms positions are irrelevant ( bag of words )
Calculate the probability score of each available class C
● Prior ( #DocsInClassC / #Docs )
● Likelihood ( P(d|c) = P(t1, t2,..., tn|c) == P(t1|c) * P(t2|c) * … * P(tn|c))
Where given term t
P(t|c) = TF(t) in documents of class c +1 /
#terms in all documents of class c + #docs of class c
Assign top scoring class
Naive Bayes Classifier
● Documents are the Lucene unit of information
● Documents are a map field -> value
● Each field may be analysed differently
(different tokenization and token filtering)
● Each field may have a different weight for the classification
(affecting differently the similarity score)
Document Classification
Solr is the popular, blazing fast, open source NoSQL search platform from
the Apache Lucene project.
Its major features include powerful full-text search, hit highlighting, faceted
search and analytics, rich document parsing, geospatial search, extensive
REST APIs as well as parallel SQL.
Apache Solr
Index Time Integration - SOLR-7739
● Ingest the document
● Assign the class
● Set the class as a field value
● Index the document
Request Handler Integration (TO DO) - SOLR-7738
Return an assigned class :
● Given a text and a field
● Given an input document
● Given an indexed document id
Solr Integration
● Pipeline of processors
● Each single document flows
through the chain
● Each processor is executed once
● Last processor triggers the
update command
Update Request Processor Chain
● Update Component
● Configurable Singleton Factory
● Single instance per request thread
● Process a single Document
● SolrCloud compatible*
* Pre processor / Post processor
Update Request Processor
● Access the Index Reader
● A Lucene Document Classifier is instantiated
● A class is assigned by the classifier
● A new field is added to the original Document, with the class
● The document goes through the next processing steps
Classification Update Request Processor
...
<initParams path="/update/**,/query,/select,/tvrh,/elevate,/spell,/browse">
<lst name="defaults">
<str name="df">text</str>
<str name="update.chain">classification</str>
</lst>
</initParams>
...
Solrconfig.xml - Update Handler
...
<updateRequestProcessorChain name="classification">
<processor class="solr.ClassificationUpdateProcessorFactory">
...
</processor>
<processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
...
Solrconfig.xml - Chain configuration
<processor class="solr.ClassificationUpdateProcessorFactory">
<str name="inputFields">title^1.5,content,author</str>
<str name="classField">cat</str>
<str name="algorithm">knn</str>
<str name="knn.k">20</str>
<str name="knn.minTf">1</str>
<str name="knn.minDf">5</str>
</processor>
N.B. classField must be stored
Solrconfig.xml - K nearest neighbour classifier config
<processor class="solr.ClassificationUpdateProcessorFactory">
<str name="inputFields">title^1.5,content,author</str>
<str name="classField">cat</str>
<str name="algorithm">bayes</str>
</processor>
N.B. classField must be Indexed (take care of analysis)
Solrconfig.xml - Naive Bayes classifier config
● Lucene >= 6.0
● Solr >= 6.1
● Classification needs a training set ->
An index with initially human assigned classes is required
Solr Classification - Important Notes
● Sci-Fi StackExchange dataset
● Roughly 18.000 questions and answers
● Roughly 6.000 tagged
● 70 % Training Set + 30% test set
Solr Classification - Demo
● Index the training set documents
(this is our ground truth)
● Index the test set
(classification will happen automatically at indexing time)
● Evaluate the test set
(a simple java app to verify that the automatically assigned classes
are consistent with what expected)
Solr Classification - Demo
● True Positive : Predicted class == actual class
● False Positive : Predicted class != actual class
● True Negative : Not predicted class != actual class
● False Negative : Not predicted class == actual class
Precision = TP / TP+FP
Recall = TP / TP+FN
Solr Classification - System Evaluation Metrics
● Index the training set documents
(this is our ground truth)
● Index the test set
(classification will happen automatically at indexing time)
● Evaluate the test set
(a simple java app to verify that the automatically assigned classes
are consistent with what expected)
Solr Classification - Demo
MaxOutputClasses 1
[System Global Accuracy]0.5095676824946846
[System Globel Recall]0.2686846038863976
TP{star-wars}59
FP{star-wars}75
FN{star-wars}7
[Precision (of predicted)]{star-wars}0.44029850746268656
[Recall for class)]{star-wars}0.8939393939393939
TP{harry-potter}147
FP{harry-potter}137
FN{harry-potter}3
[Precision (of predicted)]{harry-potter}0.5176056338028169
[Recall for class]{harry-potter}0.98
Solr Classification - Demo - Full Dataset
MaxOutputClasses 5
[System Global Accuracy]0.20481927710843373
[System Globel Recall]0.5399850523168909
TP{star-wars}66
FP{star-wars}400
FN{star-wars}0
[Precision (of predicted)]{star-wars}0.14163090128755365
[Recall for class)]{star-wars}1.0
TP{harry-potter}150
FP{harry-potter}584
FN{harry-potter}0
[Precision (of predicted)]{harry-potter}0.20435967302452315
[Recall for class]{harry-potter}1.0
Solr Classification - Demo - Full Dataset
MaxOutputClasses 1
[System Global Accuracy]0.9907407407407407
[System Globel Recall]0.6750788643533123
TP{star-wars}64
FP{star-wars}0
FN{star-wars}2
[Precision (of predicted)]{star-wars}1.0
[Recall for class)]{star-wars}0.9696969696969697
TP{harry-potter}150
FP{harry-potter}2
FN{harry-potter}0
[Precision (of predicted)]{harry-potter}0.9868421052631579
[Recall for class]{harry-potter}1.0
Solr Classification - Demo - Partial Dataset
MaxOutputClasses 5
[System Global Accuracy]0.24259259259259258
[System Globel Recall]0.8264984227129337
TP{star-wars}66
FP{star-wars}52
FN{star-wars}0
[Precision (of predicted)]{star-wars}0.559322033898305
[Recall for class)]{star-wars}1.0
TP{harry-potter}150
FP{harry-potter}48
FN{harry-potter}0
[Precision (of predicted)]{harry-potter}0.7575757575757576
[Recall for class]{harry-potter}1.0
Solr Classification - Demo - Partial Dataset
Multi classes support
● Class field may be multi valued
● Assign multiple classes
● Not only the top scoring but top N (parameter)
Split human/auto assigned classes
● classTrainingField
● classOutputField
Default : use the same field
Solr Classification - Extensions SOLR-8871
Classification Context Filtering
● Reduce the document space to consider ->
reduce the training set
● Useful when only a subset of the index may be interesting for
classification
● Consider only the human labelled documents as training data
Solr Classification - Extensions SOLR-8871
Individual Field Weighting
● When classifying, each field has a different importance
e.g.
title vs content
● Set a different boost per field
● Knn compatible
● Bayes compatible
Solr Classification - Extensions SOLR-8871
● Numeric Field Support (Knn)
(Euclidean distance based)
● Lat lon support (Knn)
(geo distance based)
● SolrCloud support
(use the entire sharded index as training set)
Solr Classification - Future Work
Questions ?
● Special thanks to Tommaso Teofili,
Apache committer who followed the developments and made possible the
contributions.
● And to the
Audience :)
Ad

More Related Content

What's hot (20)

Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & Iceberg
Flink Forward
 
Ultimate Guide to Microservice Architecture on Kubernetes
Ultimate Guide to Microservice Architecture on KubernetesUltimate Guide to Microservice Architecture on Kubernetes
Ultimate Guide to Microservice Architecture on Kubernetes
kloia
 
Odoo Online platform: architecture and challenges
Odoo Online platform: architecture and challengesOdoo Online platform: architecture and challenges
Odoo Online platform: architecture and challenges
Odoo
 
Tools for Solving Performance Issues
Tools for Solving Performance IssuesTools for Solving Performance Issues
Tools for Solving Performance Issues
Odoo
 
Event Sourcing & CQRS, Kafka, Rabbit MQ
Event Sourcing & CQRS, Kafka, Rabbit MQEvent Sourcing & CQRS, Kafka, Rabbit MQ
Event Sourcing & CQRS, Kafka, Rabbit MQ
Araf Karsh Hamid
 
Log System As Backbone – How We Built the World’s Most Advanced Vector Databa...
Log System As Backbone – How We Built the World’s Most Advanced Vector Databa...Log System As Backbone – How We Built the World’s Most Advanced Vector Databa...
Log System As Backbone – How We Built the World’s Most Advanced Vector Databa...
StreamNative
 
Real Time analytics with Druid, Apache Spark and Kafka
Real Time analytics with Druid, Apache Spark and KafkaReal Time analytics with Druid, Apache Spark and Kafka
Real Time analytics with Druid, Apache Spark and Kafka
Daria Litvinov
 
Dataflow with Apache NiFi
Dataflow with Apache NiFiDataflow with Apache NiFi
Dataflow with Apache NiFi
DataWorks Summit/Hadoop Summit
 
MLOps at OLX
MLOps at OLXMLOps at OLX
MLOps at OLX
Alexey Grigorev
 
Rainbird: Realtime Analytics at Twitter (Strata 2011)
Rainbird: Realtime Analytics at Twitter (Strata 2011)Rainbird: Realtime Analytics at Twitter (Strata 2011)
Rainbird: Realtime Analytics at Twitter (Strata 2011)
Kevin Weil
 
Low Code Integration with Apache Camel.pdf
Low Code Integration with Apache Camel.pdfLow Code Integration with Apache Camel.pdf
Low Code Integration with Apache Camel.pdf
Claus Ibsen
 
Data Streaming with Apache Kafka & MongoDB
Data Streaming with Apache Kafka & MongoDBData Streaming with Apache Kafka & MongoDB
Data Streaming with Apache Kafka & MongoDB
confluent
 
REST-API overview / concepts
REST-API overview / conceptsREST-API overview / concepts
REST-API overview / concepts
Patrick Savalle
 
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...
Sease
 
VictoriaLogs: Open Source Log Management System - Preview
VictoriaLogs: Open Source Log Management System - PreviewVictoriaLogs: Open Source Log Management System - Preview
VictoriaLogs: Open Source Log Management System - Preview
VictoriaMetrics
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
Wes McKinney
 
CQRS and Event Sourcing
CQRS and Event Sourcing CQRS and Event Sourcing
CQRS and Event Sourcing
Inho Kang
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive Mode
Flink Forward
 
Apache spark 소개 및 실습
Apache spark 소개 및 실습Apache spark 소개 및 실습
Apache spark 소개 및 실습
동현 강
 
Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Yohei Onishi
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & Iceberg
Flink Forward
 
Ultimate Guide to Microservice Architecture on Kubernetes
Ultimate Guide to Microservice Architecture on KubernetesUltimate Guide to Microservice Architecture on Kubernetes
Ultimate Guide to Microservice Architecture on Kubernetes
kloia
 
Odoo Online platform: architecture and challenges
Odoo Online platform: architecture and challengesOdoo Online platform: architecture and challenges
Odoo Online platform: architecture and challenges
Odoo
 
Tools for Solving Performance Issues
Tools for Solving Performance IssuesTools for Solving Performance Issues
Tools for Solving Performance Issues
Odoo
 
Event Sourcing & CQRS, Kafka, Rabbit MQ
Event Sourcing & CQRS, Kafka, Rabbit MQEvent Sourcing & CQRS, Kafka, Rabbit MQ
Event Sourcing & CQRS, Kafka, Rabbit MQ
Araf Karsh Hamid
 
Log System As Backbone – How We Built the World’s Most Advanced Vector Databa...
Log System As Backbone – How We Built the World’s Most Advanced Vector Databa...Log System As Backbone – How We Built the World’s Most Advanced Vector Databa...
Log System As Backbone – How We Built the World’s Most Advanced Vector Databa...
StreamNative
 
Real Time analytics with Druid, Apache Spark and Kafka
Real Time analytics with Druid, Apache Spark and KafkaReal Time analytics with Druid, Apache Spark and Kafka
Real Time analytics with Druid, Apache Spark and Kafka
Daria Litvinov
 
Rainbird: Realtime Analytics at Twitter (Strata 2011)
Rainbird: Realtime Analytics at Twitter (Strata 2011)Rainbird: Realtime Analytics at Twitter (Strata 2011)
Rainbird: Realtime Analytics at Twitter (Strata 2011)
Kevin Weil
 
Low Code Integration with Apache Camel.pdf
Low Code Integration with Apache Camel.pdfLow Code Integration with Apache Camel.pdf
Low Code Integration with Apache Camel.pdf
Claus Ibsen
 
Data Streaming with Apache Kafka & MongoDB
Data Streaming with Apache Kafka & MongoDBData Streaming with Apache Kafka & MongoDB
Data Streaming with Apache Kafka & MongoDB
confluent
 
REST-API overview / concepts
REST-API overview / conceptsREST-API overview / concepts
REST-API overview / concepts
Patrick Savalle
 
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...
Sease
 
VictoriaLogs: Open Source Log Management System - Preview
VictoriaLogs: Open Source Log Management System - PreviewVictoriaLogs: Open Source Log Management System - Preview
VictoriaLogs: Open Source Log Management System - Preview
VictoriaMetrics
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
Wes McKinney
 
CQRS and Event Sourcing
CQRS and Event Sourcing CQRS and Event Sourcing
CQRS and Event Sourcing
Inho Kang
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive Mode
Flink Forward
 
Apache spark 소개 및 실습
Apache spark 소개 및 실습Apache spark 소개 및 실습
Apache spark 소개 및 실습
동현 강
 
Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Yohei Onishi
 

Similar to Apache Lucene/Solr Document Classification (20)

PostgreSQL 9.4, 9.5 and Beyond @ COSCUP 2015 Taipei
PostgreSQL 9.4, 9.5 and Beyond @ COSCUP 2015 TaipeiPostgreSQL 9.4, 9.5 and Beyond @ COSCUP 2015 Taipei
PostgreSQL 9.4, 9.5 and Beyond @ COSCUP 2015 Taipei
Satoshi Nagayasu
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr Developers
Erik Hatcher
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr Developers
Erik Hatcher
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to Solr
Erik Hatcher
 
Hacking Lucene for Custom Search Results
Hacking Lucene for Custom Search ResultsHacking Lucene for Custom Search Results
Hacking Lucene for Custom Search Results
OpenSource Connections
 
advancedR.pdf
advancedR.pdfadvancedR.pdf
advancedR.pdf
RukhsanaTaj2
 
Advanced R cheat sheet
Advanced R cheat sheetAdvanced R cheat sheet
Advanced R cheat sheet
Dr. Volkan OBAN
 
Basic R Learning
Basic R LearningBasic R Learning
Basic R Learning
Kumar P
 
Advanced r
Advanced rAdvanced r
Advanced r
Dieudonne Nahigombeye
 
The ELK Stack - Launch and Learn presentation
The ELK Stack - Launch and Learn presentationThe ELK Stack - Launch and Learn presentation
The ELK Stack - Launch and Learn presentation
saivjadhav2003
 
Full Text Search In PostgreSQL
Full Text Search In PostgreSQLFull Text Search In PostgreSQL
Full Text Search In PostgreSQL
Karwin Software Solutions LLC
 
ELK-Stack-Essential-Concepts-TheELKStack-LunchandLearn.pdf
ELK-Stack-Essential-Concepts-TheELKStack-LunchandLearn.pdfELK-Stack-Essential-Concepts-TheELKStack-LunchandLearn.pdf
ELK-Stack-Essential-Concepts-TheELKStack-LunchandLearn.pdf
cadejaumafiq
 
Information Retrieval - Data Science Bootcamp
Information Retrieval - Data Science BootcampInformation Retrieval - Data Science Bootcamp
Information Retrieval - Data Science Bootcamp
Kais Hassan, PhD
 
The Apache Solr Smart Data Ecosystem
The Apache Solr Smart Data EcosystemThe Apache Solr Smart Data Ecosystem
The Apache Solr Smart Data Ecosystem
Trey Grainger
 
Apache solr
Apache solrApache solr
Apache solr
Dipen Rangwani
 
Data Science with Solr and Spark
Data Science with Solr and SparkData Science with Solr and Spark
Data Science with Solr and Spark
Lucidworks
 
Advanced full text searching techniques using Lucene
Advanced full text searching techniques using LuceneAdvanced full text searching techniques using Lucene
Advanced full text searching techniques using Lucene
Asad Abbas
 
Android Database
Android DatabaseAndroid Database
Android Database
Dr Karthikeyan Periasamy
 
Examiness hints and tips from the trenches
Examiness hints and tips from the trenchesExaminess hints and tips from the trenches
Examiness hints and tips from the trenches
Ismail Mayat
 
Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)
Manish kumar
 
PostgreSQL 9.4, 9.5 and Beyond @ COSCUP 2015 Taipei
PostgreSQL 9.4, 9.5 and Beyond @ COSCUP 2015 TaipeiPostgreSQL 9.4, 9.5 and Beyond @ COSCUP 2015 Taipei
PostgreSQL 9.4, 9.5 and Beyond @ COSCUP 2015 Taipei
Satoshi Nagayasu
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr Developers
Erik Hatcher
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr Developers
Erik Hatcher
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to Solr
Erik Hatcher
 
Hacking Lucene for Custom Search Results
Hacking Lucene for Custom Search ResultsHacking Lucene for Custom Search Results
Hacking Lucene for Custom Search Results
OpenSource Connections
 
Basic R Learning
Basic R LearningBasic R Learning
Basic R Learning
Kumar P
 
The ELK Stack - Launch and Learn presentation
The ELK Stack - Launch and Learn presentationThe ELK Stack - Launch and Learn presentation
The ELK Stack - Launch and Learn presentation
saivjadhav2003
 
ELK-Stack-Essential-Concepts-TheELKStack-LunchandLearn.pdf
ELK-Stack-Essential-Concepts-TheELKStack-LunchandLearn.pdfELK-Stack-Essential-Concepts-TheELKStack-LunchandLearn.pdf
ELK-Stack-Essential-Concepts-TheELKStack-LunchandLearn.pdf
cadejaumafiq
 
Information Retrieval - Data Science Bootcamp
Information Retrieval - Data Science BootcampInformation Retrieval - Data Science Bootcamp
Information Retrieval - Data Science Bootcamp
Kais Hassan, PhD
 
The Apache Solr Smart Data Ecosystem
The Apache Solr Smart Data EcosystemThe Apache Solr Smart Data Ecosystem
The Apache Solr Smart Data Ecosystem
Trey Grainger
 
Data Science with Solr and Spark
Data Science with Solr and SparkData Science with Solr and Spark
Data Science with Solr and Spark
Lucidworks
 
Advanced full text searching techniques using Lucene
Advanced full text searching techniques using LuceneAdvanced full text searching techniques using Lucene
Advanced full text searching techniques using Lucene
Asad Abbas
 
Examiness hints and tips from the trenches
Examiness hints and tips from the trenchesExaminess hints and tips from the trenches
Examiness hints and tips from the trenches
Ismail Mayat
 
Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)
Manish kumar
 
Ad

More from Sease (20)

Hybrid Search with Apache Solr Reciprocal Rank Fusion
Hybrid Search with Apache Solr Reciprocal Rank FusionHybrid Search with Apache Solr Reciprocal Rank Fusion
Hybrid Search with Apache Solr Reciprocal Rank Fusion
Sease
 
Blazing-Fast Serverless MapReduce Indexer for Apache Solr
Blazing-Fast Serverless MapReduce Indexer for Apache SolrBlazing-Fast Serverless MapReduce Indexer for Apache Solr
Blazing-Fast Serverless MapReduce Indexer for Apache Solr
Sease
 
From Natural Language to Structured Solr Queries using LLMs
From Natural Language to Structured Solr Queries using LLMsFrom Natural Language to Structured Solr Queries using LLMs
From Natural Language to Structured Solr Queries using LLMs
Sease
 
Hybrid Search With Apache Solr
Hybrid Search With Apache SolrHybrid Search With Apache Solr
Hybrid Search With Apache Solr
Sease
 
Multi Valued Vectors Lucene
Multi Valued Vectors LuceneMulti Valued Vectors Lucene
Multi Valued Vectors Lucene
Sease
 
When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...
When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...
When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...
Sease
 
How To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With KibanaHow To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With Kibana
Sease
 
Introducing Multi Valued Vectors Fields in Apache Lucene
Introducing Multi Valued Vectors Fields in Apache LuceneIntroducing Multi Valued Vectors Fields in Apache Lucene
Introducing Multi Valued Vectors Fields in Apache Lucene
Sease
 
Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...
Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...
Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...
Sease
 
How does ChatGPT work: an Information Retrieval perspective
How does ChatGPT work: an Information Retrieval perspectiveHow does ChatGPT work: an Information Retrieval perspective
How does ChatGPT work: an Information Retrieval perspective
Sease
 
How To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With KibanaHow To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With Kibana
Sease
 
Neural Search Comes to Apache Solr
Neural Search Comes to Apache SolrNeural Search Comes to Apache Solr
Neural Search Comes to Apache Solr
Sease
 
Large Scale Indexing
Large Scale IndexingLarge Scale Indexing
Large Scale Indexing
Sease
 
Dense Retrieval with Apache Solr Neural Search.pdf
Dense Retrieval with Apache Solr Neural Search.pdfDense Retrieval with Apache Solr Neural Search.pdf
Dense Retrieval with Apache Solr Neural Search.pdf
Sease
 
Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...
Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...
Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...
Sease
 
Word2Vec model to generate synonyms on the fly in Apache Lucene.pdf
Word2Vec model to generate synonyms on the fly in Apache Lucene.pdfWord2Vec model to generate synonyms on the fly in Apache Lucene.pdf
Word2Vec model to generate synonyms on the fly in Apache Lucene.pdf
Sease
 
How to cache your searches_ an open source implementation.pptx
How to cache your searches_ an open source implementation.pptxHow to cache your searches_ an open source implementation.pptx
How to cache your searches_ an open source implementation.pptx
Sease
 
Online Testing Learning to Rank with Solr Interleaving
Online Testing Learning to Rank with Solr InterleavingOnline Testing Learning to Rank with Solr Interleaving
Online Testing Learning to Rank with Solr Interleaving
Sease
 
Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...
Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...
Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...
Sease
 
Advanced Document Similarity with Apache Lucene
Advanced Document Similarity with Apache LuceneAdvanced Document Similarity with Apache Lucene
Advanced Document Similarity with Apache Lucene
Sease
 
Hybrid Search with Apache Solr Reciprocal Rank Fusion
Hybrid Search with Apache Solr Reciprocal Rank FusionHybrid Search with Apache Solr Reciprocal Rank Fusion
Hybrid Search with Apache Solr Reciprocal Rank Fusion
Sease
 
Blazing-Fast Serverless MapReduce Indexer for Apache Solr
Blazing-Fast Serverless MapReduce Indexer for Apache SolrBlazing-Fast Serverless MapReduce Indexer for Apache Solr
Blazing-Fast Serverless MapReduce Indexer for Apache Solr
Sease
 
From Natural Language to Structured Solr Queries using LLMs
From Natural Language to Structured Solr Queries using LLMsFrom Natural Language to Structured Solr Queries using LLMs
From Natural Language to Structured Solr Queries using LLMs
Sease
 
Hybrid Search With Apache Solr
Hybrid Search With Apache SolrHybrid Search With Apache Solr
Hybrid Search With Apache Solr
Sease
 
Multi Valued Vectors Lucene
Multi Valued Vectors LuceneMulti Valued Vectors Lucene
Multi Valued Vectors Lucene
Sease
 
When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...
When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...
When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...
Sease
 
How To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With KibanaHow To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With Kibana
Sease
 
Introducing Multi Valued Vectors Fields in Apache Lucene
Introducing Multi Valued Vectors Fields in Apache LuceneIntroducing Multi Valued Vectors Fields in Apache Lucene
Introducing Multi Valued Vectors Fields in Apache Lucene
Sease
 
Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...
Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...
Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...
Sease
 
How does ChatGPT work: an Information Retrieval perspective
How does ChatGPT work: an Information Retrieval perspectiveHow does ChatGPT work: an Information Retrieval perspective
How does ChatGPT work: an Information Retrieval perspective
Sease
 
How To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With KibanaHow To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With Kibana
Sease
 
Neural Search Comes to Apache Solr
Neural Search Comes to Apache SolrNeural Search Comes to Apache Solr
Neural Search Comes to Apache Solr
Sease
 
Large Scale Indexing
Large Scale IndexingLarge Scale Indexing
Large Scale Indexing
Sease
 
Dense Retrieval with Apache Solr Neural Search.pdf
Dense Retrieval with Apache Solr Neural Search.pdfDense Retrieval with Apache Solr Neural Search.pdf
Dense Retrieval with Apache Solr Neural Search.pdf
Sease
 
Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...
Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...
Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...
Sease
 
Word2Vec model to generate synonyms on the fly in Apache Lucene.pdf
Word2Vec model to generate synonyms on the fly in Apache Lucene.pdfWord2Vec model to generate synonyms on the fly in Apache Lucene.pdf
Word2Vec model to generate synonyms on the fly in Apache Lucene.pdf
Sease
 
How to cache your searches_ an open source implementation.pptx
How to cache your searches_ an open source implementation.pptxHow to cache your searches_ an open source implementation.pptx
How to cache your searches_ an open source implementation.pptx
Sease
 
Online Testing Learning to Rank with Solr Interleaving
Online Testing Learning to Rank with Solr InterleavingOnline Testing Learning to Rank with Solr Interleaving
Online Testing Learning to Rank with Solr Interleaving
Sease
 
Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...
Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...
Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...
Sease
 
Advanced Document Similarity with Apache Lucene
Advanced Document Similarity with Apache LuceneAdvanced Document Similarity with Apache Lucene
Advanced Document Similarity with Apache Lucene
Sease
 
Ad

Recently uploaded (20)

Google DeepMind’s New AI Coding Agent AlphaEvolve.pdf
Google DeepMind’s New AI Coding Agent AlphaEvolve.pdfGoogle DeepMind’s New AI Coding Agent AlphaEvolve.pdf
Google DeepMind’s New AI Coding Agent AlphaEvolve.pdf
derrickjswork
 
Risk Analysis 101: Using a Risk Analyst to Fortify Your IT Strategy
Risk Analysis 101: Using a Risk Analyst to Fortify Your IT StrategyRisk Analysis 101: Using a Risk Analyst to Fortify Your IT Strategy
Risk Analysis 101: Using a Risk Analyst to Fortify Your IT Strategy
john823664
 
React Native for Business Solutions: Building Scalable Apps for Success
React Native for Business Solutions: Building Scalable Apps for SuccessReact Native for Business Solutions: Building Scalable Apps for Success
React Native for Business Solutions: Building Scalable Apps for Success
Amelia Swank
 
Who's choice? Making decisions with and about Artificial Intelligence, Keele ...
Who's choice? Making decisions with and about Artificial Intelligence, Keele ...Who's choice? Making decisions with and about Artificial Intelligence, Keele ...
Who's choice? Making decisions with and about Artificial Intelligence, Keele ...
Alan Dix
 
AI and Gender: Decoding the Sociological Impact
AI and Gender: Decoding the Sociological ImpactAI and Gender: Decoding the Sociological Impact
AI and Gender: Decoding the Sociological Impact
SaikatBasu37
 
Dark Dynamism: drones, dark factories and deurbanization
Dark Dynamism: drones, dark factories and deurbanizationDark Dynamism: drones, dark factories and deurbanization
Dark Dynamism: drones, dark factories and deurbanization
Jakub Šimek
 
Is Your QA Team Still Working in Silos? Here's What to Do.
Is Your QA Team Still Working in Silos? Here's What to Do.Is Your QA Team Still Working in Silos? Here's What to Do.
Is Your QA Team Still Working in Silos? Here's What to Do.
marketing943205
 
Best 10 Free AI Character Chat Platforms
Best 10 Free AI Character Chat PlatformsBest 10 Free AI Character Chat Platforms
Best 10 Free AI Character Chat Platforms
Soulmaite
 
Breaking it Down: Microservices Architecture for PHP Developers
Breaking it Down: Microservices Architecture for PHP DevelopersBreaking it Down: Microservices Architecture for PHP Developers
Breaking it Down: Microservices Architecture for PHP Developers
pmeth1
 
How Top Companies Benefit from Outsourcing
How Top Companies Benefit from OutsourcingHow Top Companies Benefit from Outsourcing
How Top Companies Benefit from Outsourcing
Nascenture
 
AI needs Hybrid Cloud - TEC conference 2025.pptx
AI needs Hybrid Cloud - TEC conference 2025.pptxAI needs Hybrid Cloud - TEC conference 2025.pptx
AI needs Hybrid Cloud - TEC conference 2025.pptx
Shikha Srivastava
 
MULTI-STAKEHOLDER CONSULTATION PROGRAM On Implementation of DNF 2.0 and Way F...
MULTI-STAKEHOLDER CONSULTATION PROGRAM On Implementation of DNF 2.0 and Way F...MULTI-STAKEHOLDER CONSULTATION PROGRAM On Implementation of DNF 2.0 and Way F...
MULTI-STAKEHOLDER CONSULTATION PROGRAM On Implementation of DNF 2.0 and Way F...
ICT Frame Magazine Pvt. Ltd.
 
Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?
Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?
Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?
Christian Folini
 
In-App Guidance_ Save Enterprises Millions in Training & IT Costs.pptx
In-App Guidance_ Save Enterprises Millions in Training & IT Costs.pptxIn-App Guidance_ Save Enterprises Millions in Training & IT Costs.pptx
In-App Guidance_ Save Enterprises Millions in Training & IT Costs.pptx
aptyai
 
Middle East and Africa Cybersecurity Market Trends and Growth Analysis
Middle East and Africa Cybersecurity Market Trends and Growth Analysis Middle East and Africa Cybersecurity Market Trends and Growth Analysis
Middle East and Africa Cybersecurity Market Trends and Growth Analysis
Preeti Jha
 
Longitudinal Benchmark: A Real-World UX Case Study in Onboarding by Linda Bor...
Longitudinal Benchmark: A Real-World UX Case Study in Onboarding by Linda Bor...Longitudinal Benchmark: A Real-World UX Case Study in Onboarding by Linda Bor...
Longitudinal Benchmark: A Real-World UX Case Study in Onboarding by Linda Bor...
UXPA Boston
 
ICDCC 2025: Securing Agentic AI - Eryk Budi Pratama.pdf
ICDCC 2025: Securing Agentic AI - Eryk Budi Pratama.pdfICDCC 2025: Securing Agentic AI - Eryk Budi Pratama.pdf
ICDCC 2025: Securing Agentic AI - Eryk Budi Pratama.pdf
Eryk Budi Pratama
 
Digital Technologies for Culture, Arts and Heritage: Insights from Interdisci...
Digital Technologies for Culture, Arts and Heritage: Insights from Interdisci...Digital Technologies for Culture, Arts and Heritage: Insights from Interdisci...
Digital Technologies for Culture, Arts and Heritage: Insights from Interdisci...
Vasileios Komianos
 
machines-for-woodworking-shops-en-compressed.pdf
machines-for-woodworking-shops-en-compressed.pdfmachines-for-woodworking-shops-en-compressed.pdf
machines-for-woodworking-shops-en-compressed.pdf
AmirStern2
 
Bridging AI and Human Expertise: Designing for Trust and Adoption in Expert S...
Bridging AI and Human Expertise: Designing for Trust and Adoption in Expert S...Bridging AI and Human Expertise: Designing for Trust and Adoption in Expert S...
Bridging AI and Human Expertise: Designing for Trust and Adoption in Expert S...
UXPA Boston
 
Google DeepMind’s New AI Coding Agent AlphaEvolve.pdf
Google DeepMind’s New AI Coding Agent AlphaEvolve.pdfGoogle DeepMind’s New AI Coding Agent AlphaEvolve.pdf
Google DeepMind’s New AI Coding Agent AlphaEvolve.pdf
derrickjswork
 
Risk Analysis 101: Using a Risk Analyst to Fortify Your IT Strategy
Risk Analysis 101: Using a Risk Analyst to Fortify Your IT StrategyRisk Analysis 101: Using a Risk Analyst to Fortify Your IT Strategy
Risk Analysis 101: Using a Risk Analyst to Fortify Your IT Strategy
john823664
 
React Native for Business Solutions: Building Scalable Apps for Success
React Native for Business Solutions: Building Scalable Apps for SuccessReact Native for Business Solutions: Building Scalable Apps for Success
React Native for Business Solutions: Building Scalable Apps for Success
Amelia Swank
 
Who's choice? Making decisions with and about Artificial Intelligence, Keele ...
Who's choice? Making decisions with and about Artificial Intelligence, Keele ...Who's choice? Making decisions with and about Artificial Intelligence, Keele ...
Who's choice? Making decisions with and about Artificial Intelligence, Keele ...
Alan Dix
 
AI and Gender: Decoding the Sociological Impact
AI and Gender: Decoding the Sociological ImpactAI and Gender: Decoding the Sociological Impact
AI and Gender: Decoding the Sociological Impact
SaikatBasu37
 
Dark Dynamism: drones, dark factories and deurbanization
Dark Dynamism: drones, dark factories and deurbanizationDark Dynamism: drones, dark factories and deurbanization
Dark Dynamism: drones, dark factories and deurbanization
Jakub Šimek
 
Is Your QA Team Still Working in Silos? Here's What to Do.
Is Your QA Team Still Working in Silos? Here's What to Do.Is Your QA Team Still Working in Silos? Here's What to Do.
Is Your QA Team Still Working in Silos? Here's What to Do.
marketing943205
 
Best 10 Free AI Character Chat Platforms
Best 10 Free AI Character Chat PlatformsBest 10 Free AI Character Chat Platforms
Best 10 Free AI Character Chat Platforms
Soulmaite
 
Breaking it Down: Microservices Architecture for PHP Developers
Breaking it Down: Microservices Architecture for PHP DevelopersBreaking it Down: Microservices Architecture for PHP Developers
Breaking it Down: Microservices Architecture for PHP Developers
pmeth1
 
How Top Companies Benefit from Outsourcing
How Top Companies Benefit from OutsourcingHow Top Companies Benefit from Outsourcing
How Top Companies Benefit from Outsourcing
Nascenture
 
AI needs Hybrid Cloud - TEC conference 2025.pptx
AI needs Hybrid Cloud - TEC conference 2025.pptxAI needs Hybrid Cloud - TEC conference 2025.pptx
AI needs Hybrid Cloud - TEC conference 2025.pptx
Shikha Srivastava
 
MULTI-STAKEHOLDER CONSULTATION PROGRAM On Implementation of DNF 2.0 and Way F...
MULTI-STAKEHOLDER CONSULTATION PROGRAM On Implementation of DNF 2.0 and Way F...MULTI-STAKEHOLDER CONSULTATION PROGRAM On Implementation of DNF 2.0 and Way F...
MULTI-STAKEHOLDER CONSULTATION PROGRAM On Implementation of DNF 2.0 and Way F...
ICT Frame Magazine Pvt. Ltd.
 
Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?
Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?
Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?
Christian Folini
 
In-App Guidance_ Save Enterprises Millions in Training & IT Costs.pptx
In-App Guidance_ Save Enterprises Millions in Training & IT Costs.pptxIn-App Guidance_ Save Enterprises Millions in Training & IT Costs.pptx
In-App Guidance_ Save Enterprises Millions in Training & IT Costs.pptx
aptyai
 
Middle East and Africa Cybersecurity Market Trends and Growth Analysis
Middle East and Africa Cybersecurity Market Trends and Growth Analysis Middle East and Africa Cybersecurity Market Trends and Growth Analysis
Middle East and Africa Cybersecurity Market Trends and Growth Analysis
Preeti Jha
 
Longitudinal Benchmark: A Real-World UX Case Study in Onboarding by Linda Bor...
Longitudinal Benchmark: A Real-World UX Case Study in Onboarding by Linda Bor...Longitudinal Benchmark: A Real-World UX Case Study in Onboarding by Linda Bor...
Longitudinal Benchmark: A Real-World UX Case Study in Onboarding by Linda Bor...
UXPA Boston
 
ICDCC 2025: Securing Agentic AI - Eryk Budi Pratama.pdf
ICDCC 2025: Securing Agentic AI - Eryk Budi Pratama.pdfICDCC 2025: Securing Agentic AI - Eryk Budi Pratama.pdf
ICDCC 2025: Securing Agentic AI - Eryk Budi Pratama.pdf
Eryk Budi Pratama
 
Digital Technologies for Culture, Arts and Heritage: Insights from Interdisci...
Digital Technologies for Culture, Arts and Heritage: Insights from Interdisci...Digital Technologies for Culture, Arts and Heritage: Insights from Interdisci...
Digital Technologies for Culture, Arts and Heritage: Insights from Interdisci...
Vasileios Komianos
 
machines-for-woodworking-shops-en-compressed.pdf
machines-for-woodworking-shops-en-compressed.pdfmachines-for-woodworking-shops-en-compressed.pdf
machines-for-woodworking-shops-en-compressed.pdf
AmirStern2
 
Bridging AI and Human Expertise: Designing for Trust and Adoption in Expert S...
Bridging AI and Human Expertise: Designing for Trust and Adoption in Expert S...Bridging AI and Human Expertise: Designing for Trust and Adoption in Expert S...
Bridging AI and Human Expertise: Designing for Trust and Adoption in Expert S...
UXPA Boston
 

Apache Lucene/Solr Document Classification

  • 1. Lucene And Solr Document Classification Alessandro Benedetti, Software Engineer, Sease Ltd.
  • 2. Alessandro Benedetti ● Search Consultant ● R&D Software Engineer ● Master in Computer Science ● Apache Lucene/Solr Enthusiast ● Semantic, NLP, Machine Learning Technologies passionate ● Beach Volleyball Player & Snowboarder Who I am
  • 3. ● Classification ● Lucene Approach ● Solr Integration ● Demo ● Extensions ● Future Work Agenda
  • 4. “Classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known. “ Wikipedia Classification
  • 5. ● E-mail spam filter ● Document categorization ● Sexually explicit content detection ● Medical diagnosis ● E-commerce ● Language identification Real World Use Cases
  • 6. ● Supervised learning ● Labelled training samples ● Documents modelled as feature vectors ● Term occurrences as features ● Model predicts unseen documents label Basics Of Text Classification
  • 7. Apache Lucene Apache LuceneTM is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform. Apache Lucene is an open source project available for free download.
  • 8. ● Lucene index has complex data structures ● Lot of organizations have already indexes in place ● Pre existent data can be used to classify ● No need to train a model from a separate training set ● From training set to Inverted index Apache Lucene For Classification
  • 9. ● Advanced configurable text analysis ● Term frequencies ● Term positions ● Document frequencies ● Norms ● Part of speech tags and custom payload Apache Lucene For Classification
  • 10. ● Given an index with labelled documents ● Each document has a class field ● Given an unknown document in input ● Given a set of relevant fields ● Search the top K most similar documents ● Fetch the classes from the retrieved documents ● Return most occurring class(es) ● Class ranking in retrieved documents is important ! K Nearest Neighbours
  • 11. ● KNN uses Lucene More Like This ● Lucene query component ● Extract interesting terms* from the input document fields ● Build a Lucene query ● Run the query against the search index ● Resulting documents are “the similar documents” * an interesting term is a term : - occurring frequently in the seed document (high term frequency) - but quite rare in the corpus (high inverted document frequency) More Like This
  • 12. Assumptions ● Term occurrences are probabilistic independent features ● Terms positions are irrelevant ( bag of words ) Calculate the probability score of each available class C ● Prior ( #DocsInClassC / #Docs ) ● Likelihood ( P(d|c) = P(t1, t2,..., tn|c) == P(t1|c) * P(t2|c) * … * P(tn|c)) Where given term t P(t|c) = TF(t) in documents of class c +1 / #terms in all documents of class c + #docs of class c Assign top scoring class Naive Bayes Classifier
  • 13. ● Documents are the Lucene unit of information ● Documents are a map field -> value ● Each field may be analysed differently (different tokenization and token filtering) ● Each field may have a different weight for the classification (affecting differently the similarity score) Document Classification
  • 14. Solr is the popular, blazing fast, open source NoSQL search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search and analytics, rich document parsing, geospatial search, extensive REST APIs as well as parallel SQL. Apache Solr
  • 15. Index Time Integration - SOLR-7739 ● Ingest the document ● Assign the class ● Set the class as a field value ● Index the document Request Handler Integration (TO DO) - SOLR-7738 Return an assigned class : ● Given a text and a field ● Given an input document ● Given an indexed document id Solr Integration
  • 16. ● Pipeline of processors ● Each single document flows through the chain ● Each processor is executed once ● Last processor triggers the update command Update Request Processor Chain
  • 17. ● Update Component ● Configurable Singleton Factory ● Single instance per request thread ● Process a single Document ● SolrCloud compatible* * Pre processor / Post processor Update Request Processor
  • 18. ● Access the Index Reader ● A Lucene Document Classifier is instantiated ● A class is assigned by the classifier ● A new field is added to the original Document, with the class ● The document goes through the next processing steps Classification Update Request Processor
  • 19. ... <initParams path="/update/**,/query,/select,/tvrh,/elevate,/spell,/browse"> <lst name="defaults"> <str name="df">text</str> <str name="update.chain">classification</str> </lst> </initParams> ... Solrconfig.xml - Update Handler
  • 20. ... <updateRequestProcessorChain name="classification"> <processor class="solr.ClassificationUpdateProcessorFactory"> ... </processor> <processor class="solr.RunUpdateProcessorFactory"/> </updateRequestProcessorChain> ... Solrconfig.xml - Chain configuration
  • 21. <processor class="solr.ClassificationUpdateProcessorFactory"> <str name="inputFields">title^1.5,content,author</str> <str name="classField">cat</str> <str name="algorithm">knn</str> <str name="knn.k">20</str> <str name="knn.minTf">1</str> <str name="knn.minDf">5</str> </processor> N.B. classField must be stored Solrconfig.xml - K nearest neighbour classifier config
  • 22. <processor class="solr.ClassificationUpdateProcessorFactory"> <str name="inputFields">title^1.5,content,author</str> <str name="classField">cat</str> <str name="algorithm">bayes</str> </processor> N.B. classField must be Indexed (take care of analysis) Solrconfig.xml - Naive Bayes classifier config
  • 23. ● Lucene >= 6.0 ● Solr >= 6.1 ● Classification needs a training set -> An index with initially human assigned classes is required Solr Classification - Important Notes
  • 24. ● Sci-Fi StackExchange dataset ● Roughly 18.000 questions and answers ● Roughly 6.000 tagged ● 70 % Training Set + 30% test set Solr Classification - Demo
  • 25. ● Index the training set documents (this is our ground truth) ● Index the test set (classification will happen automatically at indexing time) ● Evaluate the test set (a simple java app to verify that the automatically assigned classes are consistent with what expected) Solr Classification - Demo
  • 26. ● True Positive : Predicted class == actual class ● False Positive : Predicted class != actual class ● True Negative : Not predicted class != actual class ● False Negative : Not predicted class == actual class Precision = TP / TP+FP Recall = TP / TP+FN Solr Classification - System Evaluation Metrics
  • 27. ● Index the training set documents (this is our ground truth) ● Index the test set (classification will happen automatically at indexing time) ● Evaluate the test set (a simple java app to verify that the automatically assigned classes are consistent with what expected) Solr Classification - Demo
  • 28. MaxOutputClasses 1 [System Global Accuracy]0.5095676824946846 [System Globel Recall]0.2686846038863976 TP{star-wars}59 FP{star-wars}75 FN{star-wars}7 [Precision (of predicted)]{star-wars}0.44029850746268656 [Recall for class)]{star-wars}0.8939393939393939 TP{harry-potter}147 FP{harry-potter}137 FN{harry-potter}3 [Precision (of predicted)]{harry-potter}0.5176056338028169 [Recall for class]{harry-potter}0.98 Solr Classification - Demo - Full Dataset
  • 29. MaxOutputClasses 5 [System Global Accuracy]0.20481927710843373 [System Globel Recall]0.5399850523168909 TP{star-wars}66 FP{star-wars}400 FN{star-wars}0 [Precision (of predicted)]{star-wars}0.14163090128755365 [Recall for class)]{star-wars}1.0 TP{harry-potter}150 FP{harry-potter}584 FN{harry-potter}0 [Precision (of predicted)]{harry-potter}0.20435967302452315 [Recall for class]{harry-potter}1.0 Solr Classification - Demo - Full Dataset
  • 30. MaxOutputClasses 1 [System Global Accuracy]0.9907407407407407 [System Globel Recall]0.6750788643533123 TP{star-wars}64 FP{star-wars}0 FN{star-wars}2 [Precision (of predicted)]{star-wars}1.0 [Recall for class)]{star-wars}0.9696969696969697 TP{harry-potter}150 FP{harry-potter}2 FN{harry-potter}0 [Precision (of predicted)]{harry-potter}0.9868421052631579 [Recall for class]{harry-potter}1.0 Solr Classification - Demo - Partial Dataset
  • 31. MaxOutputClasses 5 [System Global Accuracy]0.24259259259259258 [System Globel Recall]0.8264984227129337 TP{star-wars}66 FP{star-wars}52 FN{star-wars}0 [Precision (of predicted)]{star-wars}0.559322033898305 [Recall for class)]{star-wars}1.0 TP{harry-potter}150 FP{harry-potter}48 FN{harry-potter}0 [Precision (of predicted)]{harry-potter}0.7575757575757576 [Recall for class]{harry-potter}1.0 Solr Classification - Demo - Partial Dataset
  • 32. Multi classes support ● Class field may be multi valued ● Assign multiple classes ● Not only the top scoring but top N (parameter) Split human/auto assigned classes ● classTrainingField ● classOutputField Default : use the same field Solr Classification - Extensions SOLR-8871
  • 33. Classification Context Filtering ● Reduce the document space to consider -> reduce the training set ● Useful when only a subset of the index may be interesting for classification ● Consider only the human labelled documents as training data Solr Classification - Extensions SOLR-8871
  • 34. Individual Field Weighting ● When classifying, each field has a different importance e.g. title vs content ● Set a different boost per field ● Knn compatible ● Bayes compatible Solr Classification - Extensions SOLR-8871
  • 35. ● Numeric Field Support (Knn) (Euclidean distance based) ● Lat lon support (Knn) (geo distance based) ● SolrCloud support (use the entire sharded index as training set) Solr Classification - Future Work
  • 37. ● Special thanks to Tommaso Teofili, Apache committer who followed the developments and made possible the contributions. ● And to the Audience :)
  翻译: