This page explains how the vision extraction system analyzes property photos, the technology behind it, and the design decisions that affect performance and accuracy.
The system uses computer vision models to analyze property photos in three stages: description generation, room-type classification, and feature tagging.
When a property photo is processed, the model first generates a natural-language description of what it sees.
Example Output:
"Modern kitchen featuring granite countertops,stainless steel appliances, and pendant lightingover a large island with bar seating"The model identifies the room type shown in each photo:
This classification enables filtering searches by room type and organizing property photos by category.
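The exact implementation is not shown on this page, but the per-photo flow can be sketched with the Ollama Python client. The model tag ("qwen2-vl"), the prompts, and the room-type list below are illustrative assumptions, not the system's actual values:

```python
import ollama

# Illustrative room categories; the real taxonomy is not specified here.
ROOM_TYPES = ["kitchen", "living room", "bedroom", "bathroom", "exterior", "other"]

def analyze_photo(image_path: str, model: str = "qwen2-vl") -> dict:
    """Generate a description and a room-type label for one property photo."""
    # Stage 1: natural-language description of the photo.
    description = ollama.chat(
        model=model,
        messages=[{
            "role": "user",
            "content": "Describe this real-estate photo in one sentence, "
                       "focusing on visible features and materials.",
            "images": [image_path],
        }],
    )["message"]["content"].strip()

    # Stage 2: constrained room-type classification.
    room_type = ollama.chat(
        model=model,
        messages=[{
            "role": "user",
            "content": "Which room type best matches this photo? "
                       f"Answer with exactly one of: {', '.join(ROOM_TYPES)}.",
            "images": [image_path],
        }],
    )["message"]["content"].strip().lower()

    return {"description": description, "room_type": room_type}
```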
Specific features are extracted and categorized:
- Materials: granite countertops, hardwood floors, tile backsplash
- Views: mountain views, city skyline, waterfront
- Styles: modern, traditional, craftsman
- Appliances: stainless steel, gas range, double oven
These tags become searchable attributes that power queries like “homes with mountain views and hardwood floors.”
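As a rough sketch of how categorized tags might be requested, the same Ollama chat call can ask for JSON output and parse it. The prompt, category keys, and model tag are assumptions for illustration:

```python
import json
import ollama

TAG_CATEGORIES = ["materials", "views", "styles", "appliances"]

def extract_tags(image_path: str, model: str = "qwen2-vl") -> dict:
    """Ask the vision model for categorized feature tags as JSON."""
    prompt = (
        "List the visible features of this property photo as JSON with the keys "
        f"{TAG_CATEGORIES}, each mapping to a list of short lowercase tags. "
        "Return only JSON."
    )
    raw = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": prompt, "images": [image_path]}],
        format="json",  # ask Ollama to constrain the reply to valid JSON
    )["message"]["content"]
    tags = json.loads(raw)
    # Keep only the expected categories so downstream search stays predictable.
    return {key: tags.get(key, []) for key in TAG_CATEGORIES}
```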
Vision extraction powers two types of semantic search:
- Description-Based Search: text embeddings of the generated descriptions. Enables queries like “open concept kitchen with island”.
- Visual Similarity Search: image embeddings of the photos themselves. Enables queries like “find similar properties” based on photo appearance.
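A minimal sketch of the description-based path, assuming an Ollama embedding model such as nomic-embed-text (which produces 768-dimensional vectors) and a simple in-memory cosine-similarity search; the actual embedding model and storage layer are not specified here:

```python
import numpy as np
import ollama

def embed_description(text: str, model: str = "nomic-embed-text") -> np.ndarray:
    """Embed a generated photo description (768-dim with this model)."""
    return np.array(
        ollama.embeddings(model=model, prompt=text)["embedding"],
        dtype=np.float32,
    )

def search_descriptions(query: str, corpus: dict[str, np.ndarray], top_k: int = 5):
    """Rank stored description embeddings by cosine similarity to the query."""
    q = embed_description(query)
    q /= np.linalg.norm(q)
    scored = []
    for photo_id, vec in corpus.items():
        scored.append((float(q @ (vec / np.linalg.norm(vec))), photo_id))
    return sorted(scored, reverse=True)[:top_k]
```

Visual similarity search follows the same pattern, but with embeddings computed from the images themselves (see the sketch after the comparison table below).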
The system uses Ollama instead of cloud vision APIs because:
Qwen2-VL is the default model because it:
You can substitute other Ollama vision models if needed.
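One way to make that substitution configurable is an environment-variable override; the variable names and defaults below are hypothetical, not part of the documented setup:

```python
import os

# Hypothetical configuration: read the vision model tag from the environment so a
# different Ollama vision model (for example "llava" or "minicpm-v") can be
# swapped in without code changes. Names and defaults are assumptions.
VISION_MODEL = os.environ.get("VISION_MODEL", "qwen2-vl")
OLLAMA_HOST = os.environ.get("OLLAMA_HOST", "http://localhost:11434")
```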
Vision extraction runs as batch jobs rather than real-time processing because:
| Aspect | Description Embeddings | Visual Embeddings |
|---|---|---|
| Processing Speed | Fast (text-based) | Slower (image-based) |
| Search Type | Text queries only | Image similarity + text |
| Storage Size | Smaller | Larger |
| Use Case | “kitchen with island” | “find similar homes” |
Recommendation: Start with description embeddings. Add visual embeddings if users need image similarity features.
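If visual embeddings are added later, this page does not name a specific image-embedding model; as one illustration, a CLIP model loaded through sentence-transformers can embed the photos themselves for similarity search:

```python
import numpy as np
from PIL import Image
from sentence_transformers import SentenceTransformer

# Example only: the document does not specify the image-embedding model.
clip = SentenceTransformer("clip-ViT-B-32")

def embed_photo(image_path: str) -> np.ndarray:
    """Embed the photo itself so 'find similar properties' can compare images."""
    return clip.encode(Image.open(image_path), normalize_embeddings=True)

def most_similar(query_path: str, photo_index: dict[str, np.ndarray], top_k: int = 5):
    """Return the photo ids whose embeddings are closest to the query photo."""
    q = embed_photo(query_path)
    scores = {photo_id: float(q @ vec) for photo_id, vec in photo_index.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```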
Typical processing rates (varies by hardware):
| Hardware | Images/Minute | Properties/Hour |
|---|---|---|
| CPU Only | 10-15 | 30-40 |
| GPU (RTX 3090) | 60-80 | 200-250 |
Each property typically has 10-30 photos, which is how the per-image rates translate into the properties-per-hour column (for example, ~70 images/minute at ~20 photos per property works out to roughly 210 properties/hour).
- CPU: high utilization during processing; scales with the concurrency setting
- Memory: the model is loaded into RAM/VRAM; Qwen2-VL needs ~4-8 GB depending on quantization
- Network: minimal; images are fetched from URLs, and processing is otherwise local
- Storage: text descriptions are ~200 bytes/image; embeddings are ~3 KB/image (768-dim float32)
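Using the per-image figures above, storage needs can be estimated with simple arithmetic; the property count and photos-per-property value in this sketch are hypothetical inputs:

```python
# Back-of-the-envelope storage estimate using the per-image figures above.
DESCRIPTION_BYTES = 200      # ~200 bytes of generated text per image
EMBEDDING_BYTES = 768 * 4    # 768-dim float32 is ~3 KB per image

def storage_estimate_mb(num_properties: int, photos_per_property: int = 20) -> dict:
    """Estimate description and embedding storage in megabytes."""
    images = num_properties * photos_per_property
    return {
        "descriptions_mb": images * DESCRIPTION_BYTES / 1e6,
        "embeddings_mb": images * EMBEDDING_BYTES / 1e6,
    }

# Example: 10,000 properties at ~20 photos each comes to roughly
# 40 MB of descriptions and ~614 MB of embeddings.
print(storage_estimate_mb(10_000))
```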
The concurrency setting affects both processing throughput and resource consumption on the server.
Recommendation: Start with a concurrency of 4 and increase it if the server has capacity.
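A minimal sketch of concurrency-limited batch processing, assuming a blocking per-photo analysis function and an asyncio semaphore to cap in-flight model calls; the structure is illustrative, not the system's actual job runner:

```python
import asyncio
from typing import Callable

CONCURRENCY = 4  # starting point recommended above; raise if the server has headroom

async def process_batch(photo_paths: list[str],
                        analyze: Callable[[str], dict]) -> list[dict]:
    """Run a blocking per-photo analysis function with bounded concurrency."""
    semaphore = asyncio.Semaphore(CONCURRENCY)

    async def one(path: str) -> dict:
        async with semaphore:
            # Each call runs in a worker thread so the event loop stays responsive
            # while at most CONCURRENCY model requests are in flight.
            return await asyncio.to_thread(analyze, path)

    return await asyncio.gather(*(one(p) for p in photo_paths))
```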
Without vision extraction, users can only search by structured listing data.
With vision extraction, users can also search by what the photos actually show: generated descriptions, room types, feature tags, and visual similarity.
This transforms the search experience from structured data lookup to natural language understanding.
Models occasionally produce inaccurate descriptions, room classifications, or feature tags.
Mitigation: Review extraction results periodically, especially for high-value properties.
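One possible way to operationalize that review, sketched here with hypothetical field names and thresholds, is to flag every high-value listing plus a small random sample for manual spot-checks:

```python
import random

def select_for_review(properties: list[dict],
                      sample_rate: float = 0.05,
                      high_value_threshold: float = 1_000_000) -> list[dict]:
    """Pick extractions to spot-check: all high-value listings plus a random sample."""
    # "price" and the threshold are hypothetical; adapt to the real listing schema.
    high_value = [p for p in properties if p.get("price", 0) >= high_value_threshold]
    rest = [p for p in properties if p.get("price", 0) < high_value_threshold]
    sampled = random.sample(rest, k=max(1, int(len(rest) * sample_rate))) if rest else []
    return high_value + sampled
```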
Potential improvements to the vision system: