Using Different Models: Detect Objects and Label Them
Object Detection, Classification, and Captioning
1. Using 'Image classification', predict which class(es) (i.e. items) an image belongs to. Output is a 'score' and a 'label' (object-detection models additionally return a 'box').
2. Using 'Visual question answering (VQA)', ask a question relevant to the image and get an answer.
3. Using 'CLIP', get a higher score for the class label that is more relevant to the image.
4. Using 'VisionEncoderDecoderModel', get reasonable image-captioning results.
5. Using 'YOLO', detect objects in an image along with a confidence score for each.
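Step 3 can be sketched with the zero-shot image-classification pipeline, which scores an image against a set of candidate class labels using CLIP. The model name and the COCO sample image URL below are illustrative assumptions, not necessarily what was used here.

```python
from transformers import pipeline

# Zero-shot classification with CLIP: the label most relevant to the
# image receives the highest score (model name is an assumption).
clip = pipeline("zero-shot-image-classification",
                model="openai/clip-vit-base-patch32")

# Sample image URL (assumption): a COCO photo of two cats on a couch.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
scores = clip(url, candidate_labels=["a photo of a cat",
                                     "a photo of a dog",
                                     "a photo of a car"])
for s in scores:
    print(s["label"], round(s["score"], 4))
```

The pipeline softmaxes the image-text similarities, so the scores across the candidate labels sum to 1.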
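Step 4 can be sketched by loading a VisionEncoderDecoderModel directly; the captioning checkpoint and the sample image URL are assumptions for illustration.

```python
import requests
from PIL import Image
from transformers import (VisionEncoderDecoderModel, ViTImageProcessor,
                          AutoTokenizer)

# A ViT encoder + GPT-2 decoder captioning checkpoint (assumed model name).
name = "nlpconnect/vit-gpt2-image-captioning"
model = VisionEncoderDecoderModel.from_pretrained(name)
processor = ViTImageProcessor.from_pretrained(name)
tokenizer = AutoTokenizer.from_pretrained(name)

# Sample image URL (assumption): a COCO photo.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Encode the image, generate caption token ids, and decode to text.
pixel_values = processor(images=image, return_tensors="pt").pixel_values
output_ids = model.generate(pixel_values, max_length=16)
caption = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(caption)
```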
[{'label': 'racer, race car, racing car', 'score': 0.4454664885997772},
 {'label': 'grille, radiator grille', 'score': 0.11773999780416489},
 {'label': 'beach wagon, station wagon, wagon, estate car, beach waggon, station waggon, waggon', 'score': 0.1036883071064949},
 {'label': 'cab, hack, taxi, taxicab', 'score': 0.06767823547124863},
 {'label': 'pickup, pickup truck', 'score': 0.04381086304783821}]
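Output like the list above can be produced with the image-classification pipeline; the sketch below assumes a ViT checkpoint and a COCO sample image URL, which may differ from what produced the car predictions shown.

```python
from transformers import pipeline

# Image classification: returns a list of {'label', 'score'} dicts
# (model name is an assumption).
classifier = pipeline("image-classification",
                      model="google/vit-base-patch16-224")

# The pipeline accepts a local path, a PIL image, or a URL (assumed URL).
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
preds = classifier(url)
for p in preds:
    print(p["label"], round(p["score"], 4))
```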
question = "how many person are there?"
Predicted answer: 1
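A question/answer pair like the one above can be reproduced with the visual-question-answering pipeline; the ViLT checkpoint, the image URL, and the question text below are assumptions for illustration.

```python
from transformers import pipeline

# VQA: pass an image plus a free-form question, get ranked answers
# (model name is an assumption).
vqa = pipeline("visual-question-answering",
               model="dandelin/vilt-b32-finetuned-vqa")

# Sample image URL (assumption): a COCO photo of two cats.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
answers = vqa(image=url, question="how many cats are there?")
print("Predicted answer:", answers[0]["answer"])
```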
Detected object: elephant with confidence level of 0.7756929993629456
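A detection line like the one above can be produced with the object-detection pipeline; the YOLOS checkpoint and sample image URL below are assumptions, so the detected labels will differ from the elephant example.

```python
from transformers import pipeline

# Object detection: returns {'label', 'score', 'box'} per detection
# (YOLOS checkpoint name is an assumption).
detector = pipeline("object-detection", model="hustvl/yolos-tiny")

# Sample image URL (assumption): a COCO photo of two cats on a couch.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
detections = detector(url)
for d in detections:
    print(f"Detected object: {d['label']} "
          f"with confidence level of {d['score']}")
```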
Reference links: