Multimodal Reference Resolution
- Context by object type: “show photo of the hotel”
- Deictic: “Find distance from here to here”, “this one”
- Positional context: writing “photo?” on top of a hotel
- Visual context: “Photo of the [visible] hotel”
- Database queries: “show photo of the hotel in Menlo Park”
- Discourse: “No, the other one”
- User disambiguation through prompting: “Which hotel?”
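
The strategies above can be sketched as a single resolver that consults each source of context in turn and falls back to prompting the user. This is only a minimal illustration, not the system's actual implementation: the names `Entity`, `Context`, and `resolve`, their fields, the sample data ("Hotel A", "Hotel B", "Palo Alto"), and the order in which the sources are consulted are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Entity:
    name: str
    kind: str              # object type, e.g. "hotel"
    city: str
    visible: bool = False  # True if currently shown on screen


@dataclass
class Context:
    entities: List[Entity]                                # stand-in for the database
    gestures: List[Entity] = field(default_factory=list)  # objects pointed at, in order
    history: List[Entity] = field(default_factory=list)   # resolved referents, newest last


def resolve(kind: str, ctx: Context, city: Optional[str] = None,
            other: bool = False) -> Tuple[Optional[Entity], Optional[str]]:
    """Return (referent, None) on success or (None, clarification prompt)."""
    # Deictic: a pointing gesture ("here", "this one") names the referent directly.
    if ctx.gestures:
        chosen = ctx.gestures.pop(0)
        ctx.history.append(chosen)
        return chosen, None

    # Context by object type: "the hotel" restricts candidates to hotels.
    candidates = [e for e in ctx.entities if e.kind == kind]

    # Database query: "... in Menlo Park" filters on a stored attribute.
    if city is not None:
        candidates = [e for e in candidates if e.city == city]

    # Discourse: "the other one" excludes the most recent referent.
    if other and ctx.history:
        candidates = [e for e in candidates if e is not ctx.history[-1]]

    # Visual context: prefer candidates that are currently visible.
    visible = [e for e in candidates if e.visible]
    if visible:
        candidates = visible

    if len(candidates) == 1:
        ctx.history.append(candidates[0])
        return candidates[0], None

    # User disambiguation through prompting.
    return None, f"Which {kind}?"
```

With two hotels in the database and neither visible, `resolve("hotel", ctx)` returns the prompt “Which hotel?”; adding `city="Menlo Park"` selects the one stored in that city, and a following call with `other=True` selects the remaining hotel.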