
Vision Extraction Concepts

Understanding how the vision extraction system analyzes property photos, the technology behind it, and the design decisions that affect performance and accuracy.


The system uses computer vision models to analyze property photos in three stages: description generation, room type classification, and feature tagging.

When a property photo is processed:

  1. Image is fetched from its URL (or loaded from storage)
  2. Sent to Ollama vision model (Qwen2-VL by default)
  3. Model generates a natural language description
  4. Description is stored in the database
  5. Description is converted to a searchable embedding

Example Output:

"Modern kitchen featuring granite countertops,
stainless steel appliances, and pendant lighting
over a large island with bar seating"
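To make the pipeline concrete, here is a minimal sketch of steps 1-3 using Ollama's HTTP API. The endpoint and port are Ollama's standard defaults, but the model tag (qwen2-vl), the prompt wording, and the function name are illustrative assumptions rather than the system's actual code.

```python
import base64
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint
MODEL = "qwen2-vl"  # assumed tag; use whichever vision model you have pulled

def describe_photo(image_url: str) -> str:
    # Step 1: fetch the image from its URL
    image_bytes = requests.get(image_url, timeout=30).content

    # Steps 2-3: send it to the vision model and get a natural language description
    response = requests.post(
        OLLAMA_URL,
        json={
            "model": MODEL,
            "prompt": "Describe this property photo in one or two sentences.",
            "images": [base64.b64encode(image_bytes).decode("ascii")],
            "stream": False,
        },
        timeout=300,
    )
    response.raise_for_status()
    return response.json()["response"]

# Steps 4-5 (storing the description and converting it to an embedding) are not shown here.
```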

The model identifies the room type shown in each photo:

  • Kitchen
  • Living Room
  • Bedroom
  • Bathroom
  • Exterior
  • Dining Room
  • Office
  • Garage

This classification enables filtering searches by room type and organizing property photos by category.
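One way to obtain this classification is to prompt the vision model with the fixed list of room types and validate its answer against that list. The sketch below reuses the OLLAMA_URL, MODEL, and requests setup from the previous example; the prompt and fallback behavior are assumptions, not the system's actual logic.

```python
ROOM_TYPES = [
    "Kitchen", "Living Room", "Bedroom", "Bathroom",
    "Exterior", "Dining Room", "Office", "Garage",
]

def classify_room(image_b64: str) -> str:
    # Ask the model to answer with exactly one label from the fixed list.
    prompt = (
        "Which one of the following best describes this photo: "
        + ", ".join(ROOM_TYPES)
        + "? Answer with the label only."
    )
    answer = requests.post(
        OLLAMA_URL,
        json={"model": MODEL, "prompt": prompt, "images": [image_b64], "stream": False},
        timeout=300,
    ).json()["response"]

    # Validate against the allowed set; anything unexpected falls back to "Unknown".
    for room in ROOM_TYPES:
        if room.lower() in answer.lower():
            return room
    return "Unknown"
```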

Specific features are extracted and categorized:

  • Materials: granite countertops, hardwood floors, tile backsplash
  • Views: mountain views, city skyline, waterfront
  • Styles: modern, traditional, craftsman
  • Appliances: stainless steel, gas range, double oven

These tags become searchable attributes that power queries like “homes with mountain views and hardwood floors.”
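A common pattern for this kind of categorized tagging is to ask the model for structured JSON and then keep only the expected categories. The sketch below is illustrative: the prompt, category keys, and use of Ollama's JSON output mode are assumptions about one reasonable implementation, not a description of the actual extractor.

```python
import json

FEATURE_CATEGORIES = ["materials", "views", "styles", "appliances"]

def extract_features(image_b64: str) -> dict[str, list[str]]:
    prompt = (
        "List the notable features in this property photo as JSON with the keys "
        "materials, views, styles, and appliances, each mapping to a list of short tags."
    )
    raw = requests.post(
        OLLAMA_URL,
        json={
            "model": MODEL,
            "prompt": prompt,
            "images": [image_b64],
            "format": "json",  # request structured JSON output from Ollama
            "stream": False,
        },
        timeout=300,
    ).json()["response"]

    try:
        tags = json.loads(raw)
    except json.JSONDecodeError:
        tags = {}
    # Keep only the expected categories so downstream indexing stays predictable.
    return {key: tags.get(key, []) for key in FEATURE_CATEGORIES}
```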


Vision extraction powers two types of semantic search:

Description-Based Search

Text embeddings of generated descriptions

Enables queries like: “open concept kitchen with island”

Visual Similarity Search

Image embeddings of photos themselves

Enables queries like: “find similar properties” based on photo appearance
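As a rough sketch of the description-based path: each stored description is embedded once, and a query is embedded the same way and ranked by cosine similarity. The embedding endpoint is a standard Ollama one, but the model name (nomic-embed-text), the in-memory index, and the function names are assumptions for illustration; a production setup would more likely use a vector database.

```python
import numpy as np
import requests

EMBED_URL = "http://localhost:11434/api/embeddings"  # Ollama embeddings endpoint
EMBED_MODEL = "nomic-embed-text"  # assumed 768-dim text embedding model

def embed_text(text: str) -> np.ndarray:
    data = requests.post(
        EMBED_URL, json={"model": EMBED_MODEL, "prompt": text}, timeout=60
    ).json()
    return np.asarray(data["embedding"], dtype=np.float32)

def search_descriptions(query: str, stored: dict[str, np.ndarray], top_k: int = 5):
    # Rank stored photo-description embeddings by cosine similarity to the query.
    q = embed_text(query)
    q /= np.linalg.norm(q)
    scored = [
        (photo_id, float(np.dot(q, vec / np.linalg.norm(vec))))
        for photo_id, vec in stored.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

# Example: search_descriptions("open concept kitchen with island", stored_embeddings)
```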


The system uses Ollama instead of cloud vision APIs because:

  1. Privacy — Property photos never leave your infrastructure
  2. Cost — No per-image API charges
  3. Control — Choose models and configure resources
  4. Speed — Local processing without network latency

Qwen2-VL is the default model because it:

  • Generates detailed, accurate descriptions
  • Performs well on interior/exterior property photos
  • Balances quality and processing speed
  • Runs on CPU and supports GPU acceleration

You can substitute other Ollama vision models if needed.

Vision extraction runs as batch jobs rather than real-time processing because:

  1. Resource Intensity — Processing thousands of images would overload the server during normal operations
  2. Control — Admins can schedule jobs during off-peak hours
  3. Monitoring — Progress tracking and error handling for large batches
  4. Flexibility — Choose to process all properties or specific subsets

Description Embeddings vs. Visual Embeddings

| Aspect | Description Embeddings | Visual Embeddings |
| --- | --- | --- |
| Processing Speed | Fast (text-based) | Slower (image-based) |
| Search Type | Text queries only | Image similarity + text |
| Storage Size | Smaller | Larger |
| Use Case | “kitchen with island” | “find similar homes” |

Recommendation: Start with description embeddings. Add visual embeddings if users need image similarity features.


Typical processing rates (varies by hardware):

| Hardware | Images/Minute | Properties/Hour |
| --- | --- | --- |
| CPU Only | 10-15 | 30-40 |
| GPU (RTX 3090) | 60-80 | 200-250 |

Each property typically has 10-30 photos.
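As a rough consistency check on the GPU row: at about 70 images per minute and roughly 20 photos per property, throughput works out to 70 × 60 / 20 = 210 properties per hour, which falls within the 200-250 range above.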

Resource usage during batch processing:

  • CPU: High utilization; scales with the concurrency setting
  • Memory: Model loaded into RAM/VRAM; Qwen2-VL needs ~4-8 GB depending on quantization
  • Network: Minimal (fetching images from URLs); processing is mostly local
  • Storage: Text descriptions ~200 bytes/image; embeddings ~3 KB/image (768-dim float32)
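The embedding figure follows directly from the dimensions: 768 float32 values × 4 bytes each is 3,072 bytes, or roughly 3 KB per image.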

The concurrency setting affects both throughput and stability:

  • Low concurrency (1-2) — Slower but more stable, less resource contention
  • Medium concurrency (4-8) — Balanced throughput and stability
  • High concurrency (16+) — Faster but may overwhelm server resources

Recommendation: Start with a concurrency of 4 and increase it if the server has spare capacity.
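The sketch below shows one way a concurrency setting like this could bound the number of images processed in parallel, reusing the describe_photo helper from the earlier example. The thread-pool structure and error handling are illustrative assumptions, not the actual job runner.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

CONCURRENCY = 4  # recommended starting point; raise it if the server has spare capacity

def process_batch(image_urls: list[str]) -> dict[str, str]:
    """Describe a batch of photos with at most CONCURRENCY requests in flight."""
    results: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        futures = {pool.submit(describe_photo, url): url for url in image_urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                # Record the failure and keep going; a batch job should not stop on one bad image.
                results[url] = f"ERROR: {exc}"
    return results
```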


Without vision extraction, users can only search by:

  • Address
  • Price
  • Bedrooms/bathrooms
  • Square footage

With vision extraction, users can search by:

  • “home with mountain views”
  • “modern kitchen with granite countertops”
  • “hardwood floors throughout”
  • “properties similar to this one” (visual similarity)

This transforms the search experience from structured data lookup to natural language understanding.


Vision models cannot:

  • Identify specific brands — “Viking appliances” may be tagged as “stainless steel appliances”
  • Measure dimensions — Can’t determine exact square footage from photos
  • Detect quality — Can describe “granite countertops” but not assess material grade
  • Read text reliably — May not extract address numbers or signage

Models occasionally:

  • Misclassify room types (home office tagged as bedroom)
  • Over-generalize features (all wood labeled “hardwood floors”)
  • Miss subtle details (overlook small appliances)

Mitigation: Review extraction results periodically, especially for high-value properties.


Potential improvements to the vision system:

  1. Multi-Model Ensemble — Combine multiple vision models for better accuracy
  2. Active Learning — Allow admins to correct tags and retrain
  3. Real-Time Processing — Analyze new listings immediately upon sync
  4. Custom Feature Detection — Train models to recognize region-specific features