Developing a neurocomputational model of eye movements during visual search
A computationally explicit model of eye movements during a search task is an important step toward understanding the visual routines and base representations underlying search behavior. Ongoing work in our lab is attempting to extend a well-established saliency map conception of search (meaning that items in a search display are processed in proportion to their similarity to the target) to include real-world objects and eye movement behavior. Such extensions to real-world search are not trivial and require an interdisciplinary effort to be successful. The computer vision community has a great deal of experience in representing real-world objects, but far less experience with the behavioral techniques needed to test these representational schemes. The cognitive psychological community has elaborate methods for describing complex behavior, but far less experience with the formal representation of real-world objects. As a result of these mutual limitations, no computationally explicit theory of eye movements during real-world search had been validated by behavioral data, and no behaviorally explicit theory of oculomotor search had been implemented as a computational model. In a collaboration with Rajesh Rao, Mary Hayhoe, and Dana Ballard, we developed a computational model of visual search to explain the pattern of oculomotor behavior reported in Zelinsky et al. (1997). By combining image processing techniques from computer vision with biological constraints identified by the computational neuroscience community, this interdisciplinary model represents arbitrarily complex visual patterns as high-dimensional vectors of feature properties (i.e., colors, orientations, spatial scales, etc.). A simple visual routine consisting of the sequential coarse-to-fine application of spatial filters then causes simulated gaze to move toward the target. 
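The representational idea just described can be sketched in a few lines of Python/NumPy. This is a minimal illustration only: a toy filter bank of blurred-intensity and gradient responses stands in for the model's actual color, orientation, and scale filters, and all function names and parameters below are assumptions for illustration, not details of the published model.

```python
import numpy as np

def box_blur(img, k):
    """Uniform blur with a (2k+1)x(2k+1) kernel (a crude stand-in for
    the Gaussian/oriented filters of the actual model)."""
    pad = np.pad(img, k, mode='edge')
    out = np.zeros(img.shape, dtype=float)
    n = 2 * k + 1
    for dy in range(n):
        for dx in range(n):
            out += pad[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / n ** 2

def feature_vectors(image, scales=(1, 2, 4)):
    """One feature vector per pixel: blurred intensity plus vertical and
    horizontal gradient responses at each spatial scale."""
    feats = []
    for k in scales:
        b = box_blur(image, k)
        gy, gx = np.gradient(b)
        feats.extend([b, gy, gx])
    return np.stack(feats, axis=-1)  # shape H x W x (3 * len(scales))

def salience_map(display, target_patch):
    """Salience at each location = cosine similarity between that
    location's feature vector and the target's mean feature vector."""
    F = feature_vectors(display)
    t = feature_vectors(target_patch).reshape(-1, F.shape[-1]).mean(axis=0)
    num = F @ t
    den = np.linalg.norm(F, axis=-1) * np.linalg.norm(t) + 1e-9
    return num / den
```

On a display containing a copy of the target, the salience map computed this way peaks at target-like locations, which is the property the search routine exploits.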
We tested this model by collecting eye movement data from human observers searching for real-world targets, then inputting these same scenes to the model and comparing the simulated sequence of saccades and fixations to the human behavioral data. The results revealed a qualitative similarity between the Zelinsky et al. (1997) pattern of results and the simulated gaze patterns generated by the model (Rao et al., 1996, 2002). More recent work conducted at SBU has modified and extended this model in several key respects. First, the base representation used by Rao et al. (2002) assumed a uniform clarity to the scene being viewed, regardless of where gaze was positioned in the image. Humans, however, have a fovea that limits high visual acuity to only the region of the image being viewed directly. To bring the model and human representational constraints into closer agreement, we gave the model a simplified simulated retina. The information available from each fixation is therefore acuity constrained, much like human vision, requiring the model to move its simulated fovea over the scene to acquire new information as it searches for a target. Second, we abandoned the visual routine used in the Rao et al. (2002) model in favor of a more dynamic method of driving gaze to the search target. As in the earlier model, this approach uses filter-based image processing techniques to represent real-world targets and search displays, then compares these target and display representations to derive a salience map indicating likely target candidates. However, rather than applying a hard-wired coarse-to-fine filtering scheme, the target of a simulated saccade is now determined by the spatial average of activity on this map, with this average changing over time as a moving threshold removes those salience map points offering the least evidence for the target. 
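The simulated retina can be illustrated with a crude foveated transform: sharp within a foveal radius, progressively blurrier with eccentricity. The two discrete blur rings, the radius values, and the function names below are assumptions chosen for a compact sketch, not the model's actual acuity falloff.

```python
import numpy as np

def box_blur(img, k):
    """Uniform blur with a (2k+1)x(2k+1) kernel."""
    pad = np.pad(img, k, mode='edge')
    out = np.zeros(img.shape, dtype=float)
    n = 2 * k + 1
    for dy in range(n):
        for dx in range(n):
            out += pad[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / n ** 2

def retinal_image(image, fixation, fovea_radius=6):
    """Crude simulated retina: the image is sharp inside the fovea and
    increasingly blurred with eccentricity (two discrete blur rings here,
    where a real acuity function would fall off continuously)."""
    h, w = image.shape
    fy, fx = fixation
    yy, xx = np.mgrid[0:h, 0:w]
    ecc = np.hypot(yy - fy, xx - fx)       # distance from fixation
    mid = box_blur(image, 2)               # parafoveal blur
    far = box_blur(image, 5)               # peripheral blur
    return np.where(ecc <= fovea_radius, image,
                    np.where(ecc <= 2 * fovea_radius, mid, far))
```

Because only the region around fixation survives at full resolution, a model using this transform must move its simulated fovea to resolve new parts of the scene, exactly the constraint described above.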
As this threshold prunes points from the salience map, a sequence of eye movements is produced that eventually aligns simulated gaze with the model's best guess as to the target's location. We are currently testing this routine by comparing the simulated oculomotor scanpaths to the scanpaths of human observers viewing the same displays and searching for the same targets. Preliminary findings reveal considerable spatio-temporal agreement between these gaze patterns, both at an aggregate level (e.g., general tradeoffs between saccade latency and accuracy) and in the behavior of individual observers (Zelinsky, 1999a, 2000a, 2000b, 2002, 2003a, 2003b).
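The moving-threshold routine can be sketched as follows: each simulated fixation lands on the salience-weighted centroid of the points still above threshold, and raising the threshold prunes the weakest candidates so gaze converges on the best-supported location. The linear threshold schedule and parameter names are assumptions for illustration.

```python
import numpy as np

def simulate_scanpath(salience, step=0.1):
    """Generate a fixation sequence from a salience map by moving a
    threshold upward and fixating the salience-weighted centroid of the
    surviving points at each step (a sketch, not the published model)."""
    s = np.asarray(salience, dtype=float)
    h, w = s.shape
    yy, xx = np.mgrid[0:h, 0:w]
    lo, hi = s.min(), s.max()
    fixations = []
    thresh = lo
    while thresh < hi:
        mask = s > thresh                      # prune weakest evidence
        if not mask.any():
            break
        weights = np.where(mask, s, 0.0)
        total = weights.sum()
        fixations.append((float((yy * weights).sum() / total),
                          float((xx * weights).sum() / total)))
        thresh += step * (hi - lo)             # raise the threshold
    return fixations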
Research Philosophy
Each time we engage in a moderately complex task, we likely enlist the help of an untold number of simpler visuo-motor operations that exist largely outside of our conscious awareness. Consider for instance the steps involved in preparing a cup of coffee. For the sake of simplicity, assume that the coffee has already been brewed and is waiting in the pot, and that all of the essential accessories, an empty cup, a spoon, a carton of
cream, and a tin of sugar, are sitting on a countertop in front of you. What is your first step toward accomplishing this goal? The very first thing that you might do is to move your eyes to the handle of the coffee pot, followed shortly thereafter by the much slower movement of your preferred hand to the same target. Because the coffee pot is hot and the handle is relatively small, this change in fixation is needed to guide your hand to a safe and useful place at which to grasp the object. After lifting the pot, your eyes may then dart over to the cup. This action is needed not only to again guide the pot to a very specific point in space directly over the cup, but also to provide feedback to the pouring operation so as to avoid a spill. After setting the pot back on the counter (an act that may or may not require another eye movement), your gaze will likely shift to the spoon. Lagging shortly behind this behavior may be simultaneous movements of your hands, with your dominant hand moving toward the sugar tin and your non-preferred hand moving to the spoon. The spoon is a relatively small and slender object that again requires assistance from foveal vision for grasping; the tin is a rather bulky and indelicate object that does not require precise visual information to inform the grasping operation. Once the spoon is in hand and the lid to the tin is lifted, gaze can then be directed to the tin in order to help scoop out the correct measure of sugar. To ensure that the spoon is kept level, a tracking operation may be used to keep your gaze on the loaded spoon as it moves slowly to the cup. After receiving the sugar, and following a few quick turns of the spoon, your coffee would finally be ready to drink (see Land et al., 1998, for a similarly framed example).
Eye Movements and Visual Cognition