Situated Language and Active Vision

From MindModelingWiki


Summary

In a general sense, modeling efforts in this research thread attempt to answer the question, "How do words relate to things in the world?" When people converse in particular situations, they frequently use expressions like "that one" or "the leftmost". As a conversation proceeds, people become increasingly dependent on a shared understanding of the context. For a request such as "Describe the one closest to me..." to have any meaning, the people involved in the conversation need to understand how specific words map to abstract notions of proximity, perspective, and space. To develop shared awareness of a situation, people typically situate their language: they tie words to objects in the context. When they hear contextually dependent, situated phrases such as "closest to me", people actively use knowledge and reasoning to identify the thing the words refer to.

When BOINC volunteers run work units from this project, the ACT-R model they download and run attends to phrases describing objects in visual contexts and then makes decisions about which objects are being referred to. The model uses visual attention, knowledge about spatial relations, and simple reasoning to identify referents.

Ultimately, the modeling efforts in this research thread explore how people use spatial language and spatial reasoning skills to communicate and share understanding of situations.

Model Details

The ACT-R model of referent disambiguation consists of two relatively independent process threads. One of the threads is dedicated to the processing of linguistic input. It uses the immediate recovery and integration of information to construct and modify a representation of the referring expression being communicated via the linguistic input. Another thread is dedicated to the visual exploration of the task context. This thread uses sets of productions implementing five active vision processes.

Active vision plays a central role in the model. Active vision is based on five composable visual operations:

Figure 1: Visualization of the interaction between the major threads in the ACT-R model of reference disambiguation. Three blue arrows highlight the reliance of the INDEX, FILTER, and RELATE visual operations on the current representation of the expression.
  1. INDEX: a process that enables visual features in the visual context to interleave with the vision system’s pre-attentive filtering mechanisms. For example, an INDEX process might bias the visual system to attend to objects of a specific color. INDEX provides task-relevant candidate SHIFT destinations. When INDEX is used to determine viable SHIFT destinations, the resulting scan strategies combine bottom-up and top-down influences.
  2. SHIFT: a primitive operation during which the current location of visual attention is moved. Shifts are frequently directed to “indexable” locations and are correlated with eye movements.
  3. MARK: an operation that constructs indexical references. To maintain an understanding of a situation that exceeds the spatial scope of the process focus, components of the visual context outside the focus must somehow be represented and related to the focus. The marking process is responsible for this type of reference construction. MARK is involved in scene integration and is therefore employed in most recognition situations.
  4. FILTER: an operation that narrows context by removing represented objects from the spatial representation.
  5. RELATE: an operation that determines which (if any) objects represented in the spatial representation meet relational constraints.
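
The five operations above compose into a visual routine. The following Python sketch is only an illustration of that composition, not the ACT-R model itself (which is written as productions). The `VisualObject` class, the function names, the `size` feature, the example scene, and the use of Euclidean distance for "closest to me" are all assumptions introduced here.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class VisualObject:
    # Hypothetical object description; the real model uses ACT-R chunks.
    name: str
    color: str
    size: str
    x: float
    y: float


def index(scene, **features):
    """INDEX: bias attention toward objects with the given features and
    return them as candidate SHIFT destinations."""
    return [o for o in scene
            if all(getattr(o, k) == v for k, v in features.items())]


def shift(destination):
    """SHIFT: move the focus of attention to the destination."""
    return destination


def mark(spatial_rep, obj):
    """MARK: add an indexical reference to obj to the spatial representation."""
    return spatial_rep if obj in spatial_rep else spatial_rep + [obj]


def filter_objects(spatial_rep, predicate):
    """FILTER: narrow the context by removing objects that fail the predicate."""
    return [o for o in spatial_rep if predicate(o)]


def relate(spatial_rep, key):
    """RELATE: return the objects (if any) that satisfy a relational
    constraint, here 'has the smallest value of key'."""
    if not spatial_rep:
        return []
    best = min(key(o) for o in spatial_rep)
    return [o for o in spatial_rep if key(o) == best]


def resolve(scene, color, size, viewer):
    """Resolve an expression like 'the large red one closest to me'."""
    spatial_rep = []
    for candidate in index(scene, color=color):   # top-down bias on color
        focus = shift(candidate)                  # attend to the candidate
        spatial_rep = mark(spatial_rep, focus)    # remember it
    spatial_rep = filter_objects(spatial_rep, lambda o: o.size == size)
    return relate(spatial_rep,
                  key=lambda o: hypot(o.x - viewer[0], o.y - viewer[1]))


scene = [
    VisualObject("a", "red", "large", 0.0, 0.0),
    VisualObject("b", "blue", "large", 1.0, 0.0),
    VisualObject("c", "red", "small", 4.0, 3.0),
    VisualObject("d", "red", "large", 2.0, 2.0),
]
referents = resolve(scene, color="red", size="large", viewer=(5.0, 3.0))
print([o.name for o in referents])  # ['d']
```

In this sketch INDEX, FILTER, and RELATE each take their constraint from the expression being processed (color, size, and proximity respectively), which mirrors the dependence on the current representation of the expression described in Figure 1.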

Many of the productions enabling these primitive visual operations employ ACT-R’s p* syntax and can therefore be thought of as context-independent rules that produce context-dependent macro operations through a dynamic re-configuration of declarative knowledge. When the Language Processing and Visual/Spatial Processing threads initially interact to successfully perform the modeled task, competition for access to the retrieval and imaginal buffers leads to a rate limitation. With time, production composition leads to rules that ossify the interleaving of the rules from the two threads. The composed productions are created in "safe" success contexts, and they lead to more robust and more efficient processing.

Initially, the model is not always willing to act on the basis of expectations; sometimes it waits until it hears confirmation of its expectations before acting to identify the referent. Changes to production utilities and the creation of new, more efficient rules (rules that side-step confirmation) transition the model away from cautious, slow behavior. Blocked composition and skill acquisition lead to the behavioral change.
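
As a rough illustration of how changing production utilities can move a model from a cautious rule to a faster one, the sketch below applies ACT-R's standard utility learning update, U(n) = U(n-1) + alpha * (R(n) - U(n-1)). The specific numbers (the cautious rule's utility of 6.0, the direct rule's effective reward of 9.0, the starting utility of 0.0) and the function names are assumptions chosen for the example, not parameters of the model described here. Selection noise is also left out, so the comparison is deterministic.

```python
def update_utility(utility, reward, alpha=0.2):
    """One step of the utility learning rule:
    U(n) = U(n-1) + alpha * (R(n) - U(n-1))."""
    return utility + alpha * (reward - utility)


def trials_until_preferred(cautious_utility, direct_reward,
                           alpha=0.2, max_trials=1000):
    """Count rewarded trials until a new 'direct' rule (starting at
    utility 0.0) exceeds the utility of an established 'cautious' rule.

    Simplification: the direct rule is assumed to be rewarded on every
    trial and the cautious rule's utility is held fixed.
    """
    direct_utility = 0.0
    for trial in range(1, max_trials + 1):
        direct_utility = update_utility(direct_utility, direct_reward, alpha)
        if direct_utility > cautious_utility:
            return trial
    return None  # the direct rule never overtook the cautious rule


trials = trials_until_preferred(cautious_utility=6.0, direct_reward=9.0)
print(trials)  # 5
```

Because the rule that side-steps confirmation finishes sooner, it is given the larger effective reward in this example, and its utility overtakes the cautious rule's after a handful of successful trials.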

Data-driven and knowledge-driven visual information seeking processes interleave through composable visual routines. The relationships between on-going perceptual processes and current goals are explicit in the model. The model:

  1. Provides an explanation of learning in active vision.
  2. Illustrates how relationships between memory and goal support active vision.
  3. Provides an explanation of how active vision relates to both perceptual and motor actions.
  4. Describes how people use situated language and active vision to relate words to objects.