
Vision Extraction Concepts

Understanding how the vision extraction system analyzes property photos, the technology behind it, and the design decisions that affect performance and accuracy.


The system uses computer vision models to analyze property photos in three stages: description generation, room type classification, and feature tagging.

When a property photo is processed:

  1. Image is fetched from its URL (or loaded from storage)
  2. Sent to Ollama vision model (Qwen2-VL by default)
  3. Model generates a natural language description
  4. Description is stored in the database
  5. Description is converted to a searchable embedding

Example Output:

"Modern kitchen featuring granite countertops,
stainless steel appliances, and pendant lighting
over a large island with bar seating"
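To make the pipeline concrete, here is a minimal sketch of steps 1-3 using Ollama's HTTP API. The endpoint and port are Ollama's standard defaults, but the model tag (qwen2-vl), the prompt wording, and the function name are illustrative assumptions rather than the system's actual code.

```python
import base64
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint
MODEL = "qwen2-vl"  # assumed tag; use whichever vision model you have pulled

def describe_photo(image_url: str) -> str:
    # Step 1: fetch the image from its URL
    image_bytes = requests.get(image_url, timeout=30).content

    # Steps 2-3: send it to the vision model and get a natural language description
    response = requests.post(
        OLLAMA_URL,
        json={
            "model": MODEL,
            "prompt": "Describe this property photo in one or two sentences.",
            "images": [base64.b64encode(image_bytes).decode("ascii")],
            "stream": False,
        },
        timeout=300,
    )
    response.raise_for_status()
    return response.json()["response"]

# Steps 4-5 (storing the description and converting it to an embedding) are not shown here.
```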

The model identifies the room type shown in each photo:

  • Kitchen
  • Living Room
  • Bedroom
  • Bathroom
  • Exterior
  • Dining Room
  • Office
  • Garage

This classification enables filtering searches by room type and organizing property photos by category.
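One way to obtain this classification is to prompt the vision model with the fixed list of room types and validate its answer against that list. The sketch below reuses the OLLAMA_URL, MODEL, and requests setup from the previous example; the prompt and fallback behavior are assumptions, not the system's actual logic.

```python
ROOM_TYPES = [
    "Kitchen", "Living Room", "Bedroom", "Bathroom",
    "Exterior", "Dining Room", "Office", "Garage",
]

def classify_room(image_b64: str) -> str:
    # Ask the model to answer with exactly one label from the fixed list.
    prompt = (
        "Which one of the following best describes this photo: "
        + ", ".join(ROOM_TYPES)
        + "? Answer with the label only."
    )
    answer = requests.post(
        OLLAMA_URL,
        json={"model": MODEL, "prompt": prompt, "images": [image_b64], "stream": False},
        timeout=300,
    ).json()["response"]

    # Validate against the allowed set; anything unexpected falls back to "Unknown".
    for room in ROOM_TYPES:
        if room.lower() in answer.lower():
            return room
    return "Unknown"
```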

Specific features are extracted and categorized:

  • Materials: granite countertops, hardwood floors, tile backsplash
  • Views: mountain views, city skyline, waterfront
  • Styles: modern, traditional, craftsman
  • Appliances: stainless steel, gas range, double oven

These tags become searchable attributes that power queries like “homes with mountain views and hardwood floors.”
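A common pattern for this kind of categorized tagging is to ask the model for structured JSON and then keep only the expected categories. The sketch below is illustrative: the prompt, category keys, and use of Ollama's JSON output mode are assumptions about one reasonable implementation, not a description of the actual extractor.

```python
import json

FEATURE_CATEGORIES = ["materials", "views", "styles", "appliances"]

def extract_features(image_b64: str) -> dict[str, list[str]]:
    prompt = (
        "List the notable features in this property photo as JSON with the keys "
        "materials, views, styles, and appliances, each mapping to a list of short tags."
    )
    raw = requests.post(
        OLLAMA_URL,
        json={
            "model": MODEL,
            "prompt": prompt,
            "images": [image_b64],
            "format": "json",  # request structured JSON output from Ollama
            "stream": False,
        },
        timeout=300,
    ).json()["response"]

    try:
        tags = json.loads(raw)
    except json.JSONDecodeError:
        tags = {}
    # Keep only the expected categories so downstream indexing stays predictable.
    return {key: tags.get(key, []) for key in FEATURE_CATEGORIES}
```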


Vision extraction powers two types of semantic search:

Description-Based Search

Text embeddings of generated descriptions

Enables queries like: “open concept kitchen with island”

Visual Similarity Search

Image embeddings of photos themselves

Enables queries like: “find similar properties” based on photo appearance
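As a rough sketch of the description-based path: each stored description is embedded once, and a query is embedded the same way and ranked by cosine similarity. The embedding endpoint is a standard Ollama one, but the model name (nomic-embed-text), the in-memory index, and the function names are assumptions for illustration; a production setup would more likely use a vector database.

```python
import numpy as np
import requests

EMBED_URL = "http://localhost:11434/api/embeddings"  # Ollama embeddings endpoint
EMBED_MODEL = "nomic-embed-text"  # assumed 768-dim text embedding model

def embed_text(text: str) -> np.ndarray:
    data = requests.post(
        EMBED_URL, json={"model": EMBED_MODEL, "prompt": text}, timeout=60
    ).json()
    return np.asarray(data["embedding"], dtype=np.float32)

def search_descriptions(query: str, stored: dict[str, np.ndarray], top_k: int = 5):
    # Rank stored photo-description embeddings by cosine similarity to the query.
    q = embed_text(query)
    q /= np.linalg.norm(q)
    scored = [
        (photo_id, float(np.dot(q, vec / np.linalg.norm(vec))))
        for photo_id, vec in stored.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

# Example: search_descriptions("open concept kitchen with island", stored_embeddings)
```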


The system uses Ollama instead of cloud vision APIs because:

  1. Privacy — Property photos never leave your infrastructure
  2. Cost — No per-image API charges
  3. Control — Choose models and configure resources
  4. Speed — Local processing without network latency

Qwen2-VL is the default model because it:

  • Generates detailed, accurate descriptions
  • Performs well on interior/exterior property photos
  • Balances quality and processing speed
  • Runs on CPU and supports GPU acceleration

You can substitute other Ollama vision models if needed.

Vision extraction runs as batch jobs rather than real-time processing because:

  1. Resource Intensity — Processing thousands of images would overload the server during normal operations
  2. Control — Admins can schedule jobs during off-peak hours
  3. Monitoring — Progress tracking and error handling for large batches
  4. Flexibility — Choose to process all properties or specific subsets

Description Embeddings vs. Visual Embeddings

| Aspect | Description Embeddings | Visual Embeddings |
| --- | --- | --- |
| Processing Speed | Fast (text-based) | Slower (image-based) |
| Search Type | Text queries only | Image similarity + text |
| Storage Size | Smaller | Larger |
| Use Case | “kitchen with island” | “find similar homes” |

Recommendation: Start with description embeddings. Add visual embeddings if users need image similarity features.


Typical processing rates (varies by hardware):

| Hardware | Images/Minute | Properties/Hour |
| --- | --- | --- |
| CPU Only | 10-15 | 30-40 |
| GPU (RTX 3090) | 60-80 | 200-250 |

Each property typically has 10-30 photos.
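As a rough consistency check on the GPU row: at about 70 images per minute and roughly 20 photos per property, throughput works out to 70 × 60 / 20 = 210 properties per hour, which falls within the 200-250 range above.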

Resource usage during batch processing:

  • CPU: High utilization; scales with the concurrency setting
  • Memory: Model loaded into RAM/VRAM; Qwen2-VL needs ~4-8 GB depending on quantization
  • Network: Minimal (fetching images from URLs); processing is mostly local
  • Storage: Text descriptions ~200 bytes/image; embeddings ~3 KB/image (768-dim float32)
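The embedding figure follows directly from the dimensions: 768 float32 values × 4 bytes each is 3,072 bytes, or roughly 3 KB per image.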

The concurrency setting affects both throughput and stability:

  • Low concurrency (1-2) — Slower but more stable, less resource contention
  • Medium concurrency (4-8) — Balanced throughput and stability
  • High concurrency (16+) — Faster but may overwhelm server resources

Recommendation: Start with a concurrency of 4 and increase it if the server has spare capacity.
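The sketch below shows one way a concurrency setting like this could bound the number of images processed in parallel, reusing the describe_photo helper from the earlier example. The thread-pool structure and error handling are illustrative assumptions, not the actual job runner.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

CONCURRENCY = 4  # recommended starting point; raise it if the server has spare capacity

def process_batch(image_urls: list[str]) -> dict[str, str]:
    """Describe a batch of photos with at most CONCURRENCY requests in flight."""
    results: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        futures = {pool.submit(describe_photo, url): url for url in image_urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                # Record the failure and keep going; a batch job should not stop on one bad image.
                results[url] = f"ERROR: {exc}"
    return results
```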


Without vision extraction, users can only search by:

  • Address
  • Price
  • Bedrooms/bathrooms
  • Square footage

With vision extraction, users can search by:

  • “home with mountain views”
  • “modern kitchen with granite countertops”
  • “hardwood floors throughout”
  • “properties similar to this one” (visual similarity)

This transforms the search experience from structured data lookup to natural language understanding.


Vision models cannot:

  • Identify specific brands — “Viking appliances” may be tagged as “stainless steel appliances”
  • Measure dimensions — Can’t determine exact square footage from photos
  • Detect quality — Can describe “granite countertops” but not assess material grade
  • Read text reliably — May not extract address numbers or signage

Models occasionally:

  • Misclassify room types (home office tagged as bedroom)
  • Over-generalize features (all wood labeled “hardwood floors”)
  • Miss subtle details (overlook small appliances)

Mitigation: Review extraction results periodically, especially for high-value properties.


Potential improvements to the vision system:

  1. Multi-Model Ensemble — Combine multiple vision models for better accuracy
  2. Active Learning — Allow admins to correct tags and retrain
  3. Real-Time Processing — Analyze new listings immediately upon sync
  4. Custom Feature Detection — Train models to recognize region-specific features