TORCHBEARER: A MULTI-PIPELINE APPROACH TO LANDMARK-BASED NAVIGATION by Fredric Muller Vollmer A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science MONTANA STATE UNIVERSITY Bozeman, Montana July 2018 c©COPYRIGHT by Fredric Muller Vollmer 2018 All Rights Reserved ii DEDICATION This thesis has truly been one of the greatest challenges, if not the greatest challenge, I have faced so far. But to say it was a result solely of my own labor would be immensely far from the truth. The end result is due to the support, understanding and love of so many people in my life, without whom I would never be writing this today. To my wife, Annie, who has stood by my side through thick and thin, who has allowed me to devote so much of the time that belongs to us to this work. To her I promise to make up the time, and then some. To my parents, Jan and Dick, whose relentless help with whatever passion I might be pursuing has led to opportunities I am incredibly fortunate to have had. While I didn’t always realize it, their passion for science, ingenuity and worldly understanding was always driving me towards this point. They are truly the only role models, the only inspiration I will ever need. To all of my family: Chris and Lori, Nana and Opa; Gwen, Jim, Carey and Mark; Jo and Michael, Dorian, Gina and Carl. Thank you for being a part of my life. iii ACKNOWLEDGEMENTS First, a huge debt of gratitude is owed to my advisor and committee chair, Dr. Mike Wittie. Without his guidance on technical issue and writing, this project could not have been completed. I must also thank Dr. Laura Stanley, whose expertise in human factors and driving research shaped the goals and evaluation methodologies of this work. The Torchbearer Mobile App, without which this project could never have been put into the hands of drivers, was given a great deal of time and code by Brendan Smith. The dataset of Google Streetview images was meticulously annotated to include bounding boxes and labels by Cole Homan. iv VITA Fredric Muller Vollmer was born in Deming, Washington on August 25th, 1991, to Jan and Henry Vollmer. He attended Mount Baker Senior High School in Deming, Washington. In 2015, he received a Bachelor of Science degree in Economics with a Statistics minor from Montana State University in Bozeman, Montana. vTABLE OF CONTENTS 1. INTRODUCTION ........................................................................................1 2. BACKGROUND...........................................................................................5 Distraction and Cognitive Load In the Context of Driving...............................5 Sources of Distraction............................................................................5 Landmarks in Navigation ..............................................................................6 Landmark Saliency: What Makes A Good Landmark......................................7 Visual Saliency......................................................................................8 Semantic Saliency..................................................................................8 Structural Saliency .............................................................................. 10 Prior Art: Automated Landmark Detection.................................................. 11 Electronic Navigation Aids .......................................................................... 
12 Google Maps....................................................................................... 13 Waze .................................................................................................. 13 3. ARCHITECTURE...................................................................................... 14 Architectural Overview ............................................................................... 14 Orchestration ............................................................................................. 15 Task Implementation .................................................................................. 17 Polling for Tasks ................................................................................. 18 Task Execution ................................................................................... 19 Submitting Results .............................................................................. 19 Worker Deployment and Operations............................................................. 20 Route Manager........................................................................................... 21 POST /route ...................................................................................... 21 GET /maneuverpoint/landmark........................................................... 23 User Interface............................................................................................. 23 Street-level Imagery .................................................................................... 26 Human Input ............................................................................................. 27 Getting Meaningful Answers ................................................................ 29 Worker Qualification .................................................................... 30 Sampling ..................................................................................... 31 Majority Verification .................................................................... 32 Pipelines .................................................................................................... 32 Pipelines at a High Level ..................................................................... 33 Saliency .............................................................................................. 36 The Human Approach .................................................................. 36 The Machine Approach ................................................................ 38 vi TABLE OF CONTENTS – CONTINUED Description ......................................................................................... 41 The Human Approach .................................................................. 41 The Machine Approach ................................................................ 42 Data-driven Approach .................................................................. 42 Object Detection Approach .......................................................... 44 Finding Landmarks in Saliency Maps ................................................... 47 Quantifying Landmark Uniqueness ....................................................... 55 Word2Vec.................................................................................... 56 Pipeline Specifics ........................................................................................ 58 Machine-Machine ................................................................................ 
58 Human-Machine .................................................................................. 64 Machine-Human .................................................................................. 67 Human-Human ................................................................................... 71 4. RESULTS .................................................................................................. 74 Pipeline Comparison ................................................................................... 74 Marginal Cost ..................................................................................... 75 Execution Time................................................................................... 78 End-to-End Execution Time ......................................................... 78 Execution Time By Task ..................................................................... 80 Machine-Machine ......................................................................... 80 Machine-Human........................................................................... 80 Human-Machine........................................................................... 82 Human-Human ............................................................................ 82 Selected Landmark Overlap ................................................................. 83 Field Experiments ...................................................................................... 85 Experimental Design ........................................................................... 86 Peripheral Detection Task............................................................. 89 Gravitational Force Events ........................................................... 91 Surveys ....................................................................................... 92 Discussion........................................................................................... 98 Threats to Validity ..................................................................................... 98 5. CONCLUSION......................................................................................... 100 Future Work............................................................................................. 100 REFERENCES CITED.................................................................................. 103 vii TABLE OF CONTENTS – CONTINUED APPENDICES .............................................................................................. 109 APPENDIX A : Field Experiment Route and Landmarks ........................... 110 APPENDIX B : Human Subjects Consent Form......................................... 113 APPENDIX C : NASA-TLX Survey.......................................................... 116 APPENDIX D : Mechanical Turk Sample Qualification Exam..................... 118 viii LIST OF TABLES Table Page 4.1 Mean Intersection Over Union of Selected Landmark......................... 84 4.2 Counterbalanced Latin Squares Design ............................................. 90 4.3 Gravitational Force Event Thresholds (Naturalistic Teenage Driving Study [56])............................................................. 92 4.4 Kruskal-Wallis analysis of variance by pipeline for NASA-TLX survey.......................................................................... 95 4.5 Kruskal-Wallis analysis of variance by pipeline for landmark survey ............................................................................. 
96 A.1 Leg 1: Instructions and Landmarks By Pipeline .............................. 112 ix LIST OF FIGURES Figure Page 3.1 A high-level view of the Torchbearer system. .................................... 15 3.2 The Torchbearer mobile application for spoken nav- igation instructions.......................................................................... 24 3.3 The general structure of a Torchbearer pipeline................................. 34 3.4 The positions of street-level images relative to a maneuver point. .............................................................................. 35 3.5 Left: a maneuver point image. Right: a correspond- ing saliency map generated by SalNet............................................... 41 3.6 Determining landmark position for data-driven de- scription approach. We consider landmarks within the 50-foot inner radius to have a position of “at”, and those within the 100-foot outer radius to have a position of “”after”. For example, landmark L in this diagram would have a position of “after”.................................... 44 3.7 Left: a street-level image, with two stop signs and a building as potentially salient landmarks. Center: the corresponding saliency map, generated by SalNet. Right: the saliency map overlaid atop the street-level image....................................................................... 47 3.8 The result of applying Otsu Thresholding to the saliency map. White areas (having a value of 255) represent areas of saliency. ............................................................... 51 3.9 The saliency map after applying both Otsu Thresh- olding and morphological opening. While difficult to see at a small scale, several spots of white noise were removed.......................................................................................... 51 3.10 The results of the morphological closing step; as the particular saliency map does not have any non- salient holes within a salient region the process had no visible effect. .............................................................................. 52 xLIST OF FIGURES – CONTINUED Figure Page 3.11 Dilation Mn: the parts of the image known to be non-salient are in black (values of 0). Notice that the salient (white) regions are slightly enlarged compared to the results of the previous step. .................................... 52 3.12 Distance transformation D: the center points of the salient regions are exactly white (255), as they are the farthest from a non-salient (black) pixel. ..................................... 53 3.13 Threshold Ms, the white areas (values of 255) represent the areas of the saliency map we have high confidence are salient....................................................................... 53 3.14Mu, the result of subtracting the matrix of known background areas from the matrix of known fore- ground areas. the white areas (values of 255) represent the unknown areas between salient and non-salient (background) regions. ..................................................... 54 3.15Mlabeled, where dark blue is known non-salient back- ground, purple is unknown, and yellow, green and turquoise are each a specific known salient region. ............................. 54 3.16Mw, the result of the watershed algorithm. The grey region is non-salient background, and each of the colored regions is a distinct salient region. ........................................ 55 3.17 The final salient bounding boxes. ..................................................... 
55 3.18 The pipeline structure of the Machine-Machine pipeline..................... 58 3.19 Left: a landmark saliency map, with bounding boxes of salient regions. The intersection between the relative bearing parallel and vertical middle is within a salient region (shaded), and identifies the landmark within the saliency matrix. Right: A bird’s eye view of an intersection. Our street-level images are a rectilinear projection of a spherical image covering a 90 degree field of view. ........................................... 62 3.20 The pipeline structure of the Human-Machine pipeline. ..................... 64 xi LIST OF FIGURES – CONTINUED Figure Page 3.21 The pipeline structure of the Machine-Human pipeline. ..................... 67 3.22 The pipeline structure of the Human-Human pipeline........................ 71 4.1 Left: The Google Streetview image of the intersec- tion of Mission and Cesar Chavez in San Francisco, part of the SF test set. Right: A map view of this intersection. The grey line is a polyline representative of the selected route leading into the intersection. To find the bearing value for the Torchbearer maneuver point we calculate the angle w.r.t. due north between the two points outlined in black. ................. 75 4.2 Marginal cost by pipeline................................................................. 76 4.3 End-to-end execution time by pipeline.............................................. 78 4.4 Execution time by task (Machine-Machine pipeline) .......................... 81 4.5 Execution time by task (Machine-Human pipeline)............................ 81 4.6 Execution time by task (Human-Machine pipeline)............................ 82 4.7 Execution time by task (Human-Human pipeline) ............................. 83 4.8 The intersection (right) and union (center) of a pair of hypothetical bounding boxes (left). The black area selection represents the area of the given metric. ........................ 84 4.9 The route driven by subjects through Bozeman, Montana. Each color represents a different leg. Each leg is navigated using a different pipeline. ......................................... 87 4.10 PDT response time by pipeline ........................................................ 91 4.11 PDT miss rate by pipeline ............................................................... 91 4.12 Gravitational force events by pipeline ............................................... 93 4.13 NASA-TLX scores by sub-scale ........................................................ 94 4.14 Landmark effectiveness survey scores................................................ 97 xii LIST OF FIGURES – CONTINUED Figure Page A.1 The test route driven by subjects in Bozeman, Montana. Subject drive each leg using a different pipeline for navigation. .................................................................. 111 xiii LIST OF ALGORITHMS Algorithm Page 3.1 Creating a saliency map from human input....................................... 39 xiv ABSTRACT The task of navigation adds cognitive distraction to the already demanding task of driving. Most popular navigation aids provide verbal directions based solely on distances and street names, but the inclusion of landmark descriptions in these instructions can improve navigation performance, decrease unsafe driving behaviors and reduce cognitive load. 
Current approaches to selecting landmarks and building landmark-based instructions rely on a single source of data, thereby limiting the set of potential landmarks, or use a single factor in choosing the best landmark, failing to account for all characteristics that make a landmark suitable for navigation. We develop a multi-pipeline system that leverages both human (crowd-sourced) input and machine-based approaches to find, describe and choose the best landmark. Additionally, we develop a mobile application for the delivery of navigation instructions based on landmarks. We evaluate the cost and performance differences between these pipelines, as well as study the effect of landmark navigation prompts on cognitive load, safe driving behavior and driver satisfaction via an in situ experiment. 1INTRODUCTION In 2016, there were nearly 35,000 deaths resulting from motor vehicle crashes [18] in the United States. Yet despite the danger of driving, automobile transportation remains an integral part of people’s daily lives: in that same year, Americans drove a collective 3.17 trillion miles [18]. A large majority of automobile fatalities are the consequence of driving under the influence, adverse weather conditions, or speeding. However, in 2016, 16 percent of all vehicle crashes were the result of driver distraction [19]. Tasks, which a driver must perform in conjunction with operating a vehicle (secondary tasks), impose cognitive load, which in turn leads to the driver being distracted from vehicle operation. Distraction leads to dangerous driving behavior, such as hard braking, manifested as sharp changes in longitudinal acceleration, or sudden steering corrections, resulting in sharp lateral acceleration [23]. Some secondary tasks, such as texting or applying makeup, are best refrained from altogether. However, other secondary tasks are requisite to the primary task of driving from origin to destination. The use of electronic, turn-by-turn navigation aids, such as Google Maps, is one such task: while it has been shown to produce a significant cognitive load [43], it is a valuable tool, which allows drivers to efficiently reach a destination. Indeed, in-car navigation is a common task; 67-percent of smart phone users indicate that they use their device for this purpose [58]. Be it utilizing an alternate route to work to avoid construction, trying to find a new restaurant, or getting from the airport to a hotel in a never-before-visited city, the real-time auditory directions offered by navigation aids have done away with the need for a driver to 2take her eyes off the road to glance at a paper map or digital map display [61]. By reducing the cognitive load induced by navigation aids, drivers will be enabled to exhibit safer vehicle operation characteristics while still enjoying the benefits of turn-by-turn navigation. Instructions delivered by the most popular navigation aids generally consist of street names and numeric distances, requiring the driver to perform a visual search for small street name signs and to estimate driven distances. The addition of landmark descriptions could lessen this cognitive load, for example ”turn right at the Dairy Queen” instead of ”turn right in 600 feet”. A salient landmark, here ”Dairy Queen”, provides more obvious information than the numeric distance. Even if a person is driving in a city previously unknown to them, the distinctive appearance of a Dairy Queen can distinctly identify a turn. 
Previous research has suggested that if electronic navigation aids could include relevant landmarks in their instructions, the cognitive load of the driver could be decreased [8]. Including landmarks in navigation instructions requires several computational frameworks. First, a method for locating candidate landmarks, or physical features located near a maneuver point. Second, a means to lexically describe a landmark, in a detailed manner, which allows the driver to easily recognize it. Lastly, an approach for determining the best landmark out of a set of candidates—the landmark which is most recognizable to the driver. Current approaches to automated landmark-based navigation are limited, many being restricted to pedestrian scenarios, others relying on pre-compiled sets of landmarks and still others using only point-of-interest datasets for selection, without incorporating visual analysis of maneuver points. We present Torchbearer, a system which leverages multiple approaches, or pipelines, to locate candidate landmarks, provide lexical descriptions of the same and determine which landmark is best-suited 3to be included as part of a verbal navigation instruction delivered to a driver at a particular maneuver point. Given the coordinates of an origin and destination, Torchbearer leverages standard pathfinding algorithms to find the least-cost (fastest) route. For each point, where the end user will need to perform a driving maneuver, such as a turn or merge, Torchbearer determines the landmark best suited for helping the end user locate that point. Torchbearer then builds a verbal instruction, consisting of the street name, distance, description of maneuver to be executed, and description of the landmark, delivered to the driver via an audio-based mobile application. The system extends existing navigation technology to offer landmark-based navigation assistance. Torchbearer’s novelty comes from its hybrid, pipeline-based approach: we use four distinct pipelines to find landmarks and select the most suitable for a given maneuver point. First, a fully human-based approach, which uses crowdsourcing to find landmarks near a location, select that which is best suited for navigation, and generate a description of the landmark. Second, a human in the loop approach, which uses a state-of-the-art saliency detection algorithm to find the most obvious, easiest-to-see landmark, but leverages crowdsourcing to generate a description of that landmark. Third, a pipeline that uses a database of local businesses and points of interest, as well as a deep learning-based object detection algorithm, to find landmarks, and utilizes crowdsourcing to select the optimal one. And lastly, a fully- automated pipeline which uses the saliency-detection algorithm for finding the most visible, easiest to spot landmark and the point-of-interest data source, and object detection algorithm to describe that landmark. Torchbearer differs from existing solutions in three principal aspects. First, its pipeline-based approach uses and analyzes several landmark selection methodologies interchangeably. Second, it incorporates multiple landmark features into its selection 4process–visual, data-based and human recognition; this allows Torchbearer to consider a wider range of landmark types than previous systems. Additionally, Torchbearer relies only on publicly available data sources which have very wide geographic coverage across the United States; some existing work relies on expensive data sources such as laser range mapping. 
The Torchbearer system is designed to reduce drivers’ cognitive load, reduce erratic driving behavior, and lessen perceived workload. We evaluate the system using a standard Peripheral Detection Task (PDT) to measure cognitive load and the NASA Task Load Index survey to measure perceived workload. Additionally, we monitor extreme gravitational force occurrences, as an indicator of driving behavior associated with distraction. We also survey subjects on their perception of landmark goodness and ease of navigation. To provide insight into the costs and benefits of particular pipelines, we also provide an analysis of pipeline performance, examining cost, runtime and result similarity. Torchbearer presents a completely automated solution to selecting and describing landmarks for use in navigation instructions, using multiple pipelines of varying approaches capable of selecting a wide range of landmark types ranging from road infrastructure, to buildings, to businesses. While we fail to find significant reductions in cognitive load, erratic driving behavior or perceived cognitive load in our small- scale field study, Torchbearer can serve as a robust platform off of which to incorporate other algorithmic or human-based landmark selection ideologies. 5BACKGROUND Distraction and Cognitive Load In the Context of Driving While driving is a dangerous endeavour due to a wide array of factors, including environmental, human and vehicle equipment related circumstances, a significant contributor is driver distraction, which accounts for 16 percent of vehicle accidents [19]. Distraction, in the context of driving, is the diversion of attention away from the task of safely and efficiently operating the vehicle, onto some secondary task [49]. If we consider the driving task to consist of applying lateral (right and left steering) and longitudinal (braking and forward acceleration), then distraction is dangerous primarily because it inhibits the driver’s ability to quickly and accurately apply these actions in response to changing situations in the environment [45]. Sources of Distraction Broadly, a source of distraction is classified as in-vehicle or out-of-vehicle. Out-of-vehicle distractions include visually abnormal occurrences such as police actions, accidents, or billboards [14]. In-vehicle distractions can be further refined as technology-based or non-technologically based. Talking with a passenger, applying makeup, eating, or smoking all pose a potential non-technological distraction. Technological distractions are receiving rapidly increasing academic attention due to the rising penetration of in-vehicle information systems (IVIS) and smartphones [5]. IVIS pose a significant issue in regards to distraction, as they often require the driver to look at a screen, or interact with the system in some way, creating both a visual and cognitive distraction [7]. Cognitive distraction results in unsafe driving behavior, including steering errors (lane departures), increased variability in accelerator position, and the sharp breaking due to a shorter window in which to respond to 6a change in the environment [31]. Mobile devices, such as smartphones, lead to driver distraction via the introduction of a physical (holding and tapping/swiping) visual and cognitive load upon the driver One study estimates an increase in reaction time to a pedestrian crossing the path of travel of 204 percent when the driver attempts to text and drive. [11]. 
Navigation systems, implemented via IVIS, or a mobile device, represent a unique form of distraction in that the interaction with the system (supplying a destination, looking at a map, listening to instructions) presents one secondary task, while the execution of the system’s instructions (scanning for upcoming turns) presents another. Together these tasks can cause the driver to disengage from the environment [33]. This disengagement leads to an increase in reaction time while using a navigation system, which is more pronounced for navigation apps that have a visual interface than those which are entirely audio-based [25]. The task of entering an address using a touch screen poses a particular problem, with one study finding a increase in the standard deviation of lateral vehicle position of 60 percent. [60]. Landmarks in Navigation Mainstream navigation aids tend to heavily utilize distance-to-street-name instructions, which require the driver to conceptualize distances and perform a visual search for small road signs. [8]. Humans, on the other hand, tend to provide navigation instructions using landmarks [63]. One study found that instructions provided by a passenger, which were primarily landmark-based, resulted in fewer navigation errors, shorter trip duration, lower perceived workload and a higher quality of driving as rated by an expert, leading to the conclusion that the inclusion of landmarks in automated navigation instructions could be beneficial [9]. Lovelace [34] examines the components of good navigation instructions for both 7familiar and unfamiliar routes. They found that in general more information provided in an instruction resulted in higher perceived quality. Additionally, they found that the inclusion of landmarks, both at maneuver points and intermittently along the route, significantly increased perceived route quality. Golledge [21] asserted that landmarks can aid in the navigation task because they serve as both global reference point, allowing the driver to mentally organize the space he is traveling through, and also as a sort of marker for decisions (maneuver) points. Indeed, the substitution of landmark-based instructions for distance-based instructions has been shown to decrease navigation error count and improve driver confidence [37]. Interestingly, while the quality of landmarks did have a significant effect on these measures, both good and poor landmarks were significantly better than distances alone [37]. Completing a study in a real traffic environment, another work found that the use of landmarks (as opposed to distance) resulted in fewer glances at the navigation aid’s display and better driving performance as measured by lane departure count and improper turn signal use. [36]. Landmark Saliency: What Makes A Good Landmark Saliency is to the property of being particularly noticeable, prominent or important [54]. A landmark is a physical feature that serves as a point of reference within the environment; it is distinctive from its surroundings to such a degree that it is easily recognizable and represents an exact point in space. Because of this importance of uniqueness, the saliency of a landmark is not a function of the attributes of an individual landmark but rather how distinctive those features are relative to nearby objects. Indeed, being a good, salient landmark is a relative property [48]. Landmarks can be broadly classified as global, visible form the entire route and relevant throughout, or local, important to a specific maneuver point (turn). 
Driving 8directions do not usually include global landmarks [59]. Local landmarks are best for navigation, and are most useful to the driver near decision (maneuver) points [32]. Saliency is represented by a tripartite typology, where three distinct dimensions, visual, semantic and structural, compose the overall saliency of a landmark [59]. Visual Saliency Visually saliency is analogous with visual attractiveness. In general, visual saliency is based on behavior observed in most vertebrates, in which they alter their gaze so as to focus more attention on relevant details in a scene while ignoring unimportant areas [24]. A region within in the scene, or a specific object in the space, is salient if it receives a significant portion of attention. In the context of navigation, a landmark is visually salient if it has sharp contrast with its surroundings and is prominent (easily in view) from the driver’s location [59]. Reubel and Winter [48] show that the visual saliency of a landmark is calculated by comparing several physical properties. (Of course, the calculated value for a landmark has no meaning until compared to that of a nearby landmark–saliency is relative.) The facade area represents the total physical area that is visible to the driver. (Essentially, the bigger the landmark, the better.) The oddity of the shape also plays a role; the larger the deviation between the shape of the landmark’s silhouette and a rectangle, the more visually attractive it is. Color is the final factor, specifically how different the landmark’s color is from the surroundings. Semantic Saliency Sorrows and Hirtle [59] define what they coin a cognitive landmark, a land- mark whose meaning, history or cultural importance makes it prominent in the environment. Such a landmark has an atypical level of importance relative to its surroundings, possibly in spite of a typical level of visual attraction. The house of 9a university president, for example, likely has a high degree of semantic saliency due to its significance in the community, even if visually it may be quite similar to surrounding homes. Reubel and Winter [59] refine the notion of a cognitive landmark to obtain a more formalized definition of semantic saliency. Specifically, they include a Boolean value for whether, or not, a landmark has historical or cultural significance to the area. Additionally, they include a Boolean value for whether the landmark his discernible commercial semantics, that is, is it a business of a type people are familiar with (such as coffee houses or grocery stores.) Duckham and Winter [46] expand this definition by suggesting that the semantic saliency of a landmark is also a function of its ubiquity. The ubiquity of a landmark is important, they argue, cultural significance is less meaningful to people unfamiliar with a given area, as what is culturally significant to the area may be unknown to them. Accounting for ubiquity in the semantic saliency measure accounts for the fact that the more instances of a landmark there are, the more widespread its significance is. As an example, consider a 50-year old local burger joint situated near a McDonald’s that opened a year ago: while the cultural and historical significance is much higher for the burger joint at the local level, the ubiquity of the McDonald’s belies its much higher significance on the global level. Geosocial data streams, such as FourSquare, Facebook and Google Places also have the potential to provide semantic saliency information. 
Quesnot and Roche [47] argue that geosocial data, which encodes information about who visits a landmark, can offer valuable insight into the importance of that landmark. If a large number of people frequently visit a landmark, it is likely to be more important than one which receives few visitors. It essentially acts as a proxy for cultural significance, with the enhancement that it provides a quantitative, real-time measure. 10 Uniqueness is also an important component of semantic saliency [10]. Just as a green house is visually salient among a group of red houses, a library is semantically salient among a group of restaurants. The uniqueness of a landmark’s intended purpose within its surroundings is an important consideration [10]. Structural Saliency The final tenant of landmark saliency is structural saliency, which broadly refers to the pertinence of a landmark in the context of its location in the physical space of its surroundings [59]. At a more applied level, a landmark is structurally salient if its location (relative to the route) is easy to conceptualize cognitively and linguistically [29]. Klippel and Winter [29] developed the first formal syntax for structural saliency. They provide a hierarchy of structural saliency in terms of a landmark’s position in relation to the intersection where a turn is to occur. While the hierarchy is extremely thorough, the key takeaway is that it is best for a landmark to be located on the corner of an intersection where a turn is to occur. The location of such a landmark is easy to describe linguistically: “turn left after the McDonald’s” or “turn left before the McDonald’s”, depending on whether the landmark is on the near or far side of the intersection. If a landmark is located significantly before, or after, the entire intersection, then it becomes difficult to summarize into an instruction, and potentially even more difficult for a driver to conceptualize. Instructions such as “at the intersection after where the McDonald’s is” are more complex both linguistically and conceptually. Roser [52] offers empirical evidence, based on an ergonomic study in a virtual environment, which supports this hierarchy. 11 Prior Art: Automated Landmark Detection Multiple approaches have been implemented in attempts to automatically select landmarks for navigation, spanning a wide range of goals, working definitions of landmark saliency and data sources. Much work has also been done in the context of pedestrian navigation, to a greater extent than has been done for vehicle-based navigation. Hile et al [28] leverage a dataset of geotagged images to generate landmarks for pedestrian walking instructions. For a given path a pedestrian will walk, a database of points of interest is used to select and annotate an image. The photograph, along with the description and navigation instruction, are displayed on the user’s device. Selection criteria is based on the proximity of a landmark to the user’s path of travel, as well as how closely the angle of the photograph matches the heading the user is traveling. Beharee and Steed [6] also used geotagged images to provide navigation aid to pedestrians, but selected a series of landmark photos to show along each leg of the route. Proximity to the route was used as the selection criteria. Landmarks were not given lexical descriptions. 
A between-subjects study revealed that in areas not familiar to the subject, the addition of photographs to the navigation application allowed subjects to arrive at target destinations in less time than when with textual directions alone. In another application targeted at pedestrians, Wenig et al [62] developed a system for finding global landmarks that can be used to orient the user. For example, a user looking for a destination in Paris might be given instructions in terms of the relative location of the Eiffel Tower. Global landmarks are used based on the authors’ argument that local landmarks are difficult to select accurately. Candidate 12 landmarks for a given region are predefined; the best landmark is chosen based on level of visibility throughout the entire route to be traveled. The visibility of a landmark at a given point is determined in a binary fashion using Google Street View images and a deep neural network. The authors show that this approach leads to greater confidence and more accurate cognitive map building among subjects. Elias and Brenner [15] use visual saliency to select landmarks for driving-based navigation instructions. Using a Geographic Information System (GIS) dataset, the authors mine candidate landmarks (always buildings), where a landmark is a candidate if it has some unique or distinctive feature compared to its surroundings. Features examined include building use or purpose, land use type and building extremities, such as outbuildings or carports. The best landmark is chosen based on how visible it is to the driver as she approaches; this is determined using a three- dimensional aerial laser scanning model of the area and modeling the the area of a landmark which is within the drivers cone of sight. The system does not offer detailed landmark descriptions, and was not evaluated by a human-based experiment. Torchbearer provides meaningful landmark descriptions via human and algorithmic input, and we perform a small-scale but thorough field study with human subjects. Electronic Navigation Aids There currently exist a number of commercial, as well as academic or open- source, electronic navigation platforms. Most provide only distance-based instruc- tions, but some prototypes do exist which incorporate some form of landmark descriptions (especially among systems designed for pedestrian use.) 13 Google Maps Google Maps is a mobile application for iOS and Android devices which is capable of providing turn-by-turn driving directions between an origin and destination point. Users provide the destination via voice or keyboard, and can enter addresses, coordinates or points-of-interest. The app provides primarily distance- based instructions, complete with street names. Some instructions will use road topology to describe the maneuver point, such as ”turn left at the end of the road.” Routing is based on finding the shortest travel time, and includes traffic and construction delays in its optimization. As of mid-2018, Google Maps has, reportedly, begun to include landmarks into its spoken directions [17]. It is not yet a documented feature, and has been enabled on only a small number of devices. It remains unclear what types of landmarks it incorporates and what methods it uses for selection [17]. Waze Waze, owned by Google since 2013 [55], provides turn-by-turn navigation instructions in a similar manner to Google. 
Waze is novel because, along with a base of OpenStreetMap data, Waze considers travel time, police traps, construction delays and other data from its users, which it incorporates into its map and routing decisions. Spoken instructions consist of distances and street names.

ARCHITECTURE

Architectural Overview

The goal of the Torchbearer system is simple: given the latitude, longitude and approach bearing of a maneuver point, render a string describing the most salient landmark at that location. The problem which Torchbearer solves is expressed by

$$f(\mathit{lat}, \mathit{long}, \mathit{bearing}) \rightarrow \mathit{description} \tag{3.1}$$

where $f$ is some method of landmark selection and description. The Torchbearer system provides multiple implementations of $f$, which we call pipelines. Each pipeline consists of an ordered set of tasks, $T$. Each task $t_i \in T$ accepts some input from the previous task $t_{i-1}$ and returns some output to be input to the next task $t_{i+1}$—the obvious exceptions being the first task, which takes a tuple (lat, long, bearing) as input, and the last task, which outputs the selected landmark—the final result of the pipeline. Each task progressively solves a small part of the landmark selection problem, such that at the end of the pipeline Torchbearer has computed a lexical description of the most suitable landmark. It is natural to consider a pipeline as a composition of functions:

$$P = t_{n-1}(\cdots t_1(t_0(\mathit{lat}, \mathit{long}, \mathit{bearing}))) \rightarrow \mathit{description} \tag{3.2}$$

where $n = |T|$. As an implementation detail, it is important to note that tasks can be performed in parallel if they all take the same input from the previous task. Examples of such parallelization are shown in the descriptions of each specific pipeline.

Figure 3.1: A high-level view of the Torchbearer system.

The Torchbearer system receives input from a client mobile application in the form of an (origin, destination) tuple. After gathering a set of maneuver points from the Mapbox Routing API, the Orchestrator receives a list of (latitude, longitude, bearing) tuples corresponding to each maneuver point for which a landmark description should be computed. The Orchestrator manages the execution of each Task in the pipeline, and returns the final selected landmark to be saved in a database, where it can be queried by the mobile client. We discuss each component of the system in further detail below.

Orchestration

In order to implement the function composition approach discussed above, each pipeline requires a system to progress a maneuver point through each task in the pipeline. We call such a system the pipeline's Orchestrator. The Orchestrator is the manifestation of the pipeline, in the sense that it is solely responsible for the intake of new maneuver points to be processed and for overseeing the ordered execution of each task for that maneuver point.

The Orchestrator is a centralized service which acts as a specialized message broker. For each task $t$ in the pipeline, the Orchestrator maintains a FIFO queue $q_t$ of maneuver points for processing through that task. Queue items are tuples containing the unique identifier of the maneuver point, a token representing the specific task instance and the input to the given task, $I_t$.
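To make the composition of Equation 3.2 and the per-task FIFO queues concrete, the following is a minimal Python sketch. The task names, dictionary fields, and queue-item shape are illustrative assumptions, not the actual Torchbearer implementation.

```python
from collections import deque

# Illustrative tasks: each accepts the accumulated pipeline state and returns
# that state plus its own contribution (names are hypothetical).
def find_candidates(state):
    return {**state, "candidates": []}          # t0: locate candidate landmarks

def describe_landmarks(state):
    return {**state, "descriptions": []}        # t1: describe each candidate

def select_best(state):
    return {**state, "description": "the red brick church"}   # t2: final output

PIPELINE = [find_candidates, describe_landmarks, select_best]

def run_pipeline(lat, lng, bearing):
    """Composes the ordered tasks: t_{n-1}(... t_1(t_0(lat, long, bearing)))."""
    state = {"lat": lat, "lng": lng, "bearing": bearing}
    for task in PIPELINE:
        state = task(state)                     # each output includes its input
    return state["description"]

# The Orchestrator keeps one FIFO queue per task; each item carries the maneuver
# point identifier, a task token, and the task input I_t.
queues = {task.__name__: deque() for task in PIPELINE}
queues["find_candidates"].append({
    "maneuver_point_id": 42,
    "task_token": "abc123",
    "input": {"lat": 45.6770, "lng": -111.0429, "bearing": 270.0},
})
```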
A task worker, described in the next section, polls the Orchestrator for a task in need of completion; if such a task is available, the Orchestrator pops it from the queue and returns it to the worker. (We discuss polling in greater detail in the following section.) When the worker has completed the task, it sends the results back to the Orchestrator, which then adds a new item to the queue corresponding to the next task $t^{+}$, including the output of $t$ (which is the input to $t^{+}$). If the execution of the task results in an error, the worker sends the details of the error to the Orchestrator, which then halts further execution of the pipeline for that maneuver point.

The Orchestrator supports parallel execution of tasks by placing a maneuver point into two queues simultaneously, and pausing progression of the pipeline until both tasks complete. The input $X$ to the next task $t+1$ is then the union of the outputs of the $n$ parallel tasks:

$$X_{t+1} = Y_0 \cup Y_1 \cup \cdots \cup Y_n \tag{3.3}$$

An Orchestrator also maintains a database of pipeline state, including, for each maneuver point, the output of each task, or error details if one occurred. The execution for a specific point can be traced or monitored throughout the pipeline by querying this database.

Task Implementation

A task receives a tuple $X$ as its input and yields a tuple $Y$ as its output. A task's output must be inclusive of its inputs, that is, $X \subset Y$. Let $p^{+} = Y \setminus X$; then $p^{+}$ is the context contribution of a given task—the information which that task has added to the pipeline's overall knowledge, or state. For example, a task devoted to describing landmarks might receive a list of candidate landmarks as input, and output both that list and a list of computed descriptions for each landmark.

It is important to note that each task has a binding contract in terms of its input and output, based on where it sits in the pipeline. For example, task $t_1$ must accept input corresponding to the output of task $t_0$, and must provide output corresponding to the input to task $t_2$. This contract presents a significant constraint with regard to rearranging tasks within a pipeline: even if $t_1(t_2) = t_2(t_1)$ (that is, the order in which the pair of tasks is executed is not important), the two tasks could not change positions in the ordered set of tasks for the pipeline unless their inputs and outputs were identical.

A task is solved by a worker. A worker is an independent, isolated computational entity which is responsible for the execution of a specific task. There can be any number of (identical) workers for a task active at one time, essentially functioning as a cluster, allowing for multiple instances of the given task to be executed simultaneously. (Each instance of a task will only be run on one worker.) For example, multiple worker instances for the Landmark Description task could run at the same time, allowing for parallel execution.

Workers are stateless: in order to complete its task, a worker relies only on the input it receives from the Orchestrator, without regard to previous tasks completed by other workers. To describe a landmark, for example, the Landmark Description worker does not require any information about other landmarks within the Torchbearer system. Workers have no awareness of the context in which they do their jobs, in the sense that workers can handle tasks from multiple pipelines and do not care about the order in which they are asked to handle tasks.
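The input/output contract and the parallel-merge rule of Equation 3.3 can be sketched as follows; this is an illustrative sketch that treats task inputs and outputs as dictionaries keyed by field name, not the actual implementation.

```python
# Enforce the contract X ⊆ Y and compute the context contribution p+ = Y \ X.
def check_contract(task_input: dict, task_output: dict) -> dict:
    missing = set(task_input) - set(task_output)
    if missing:
        raise ValueError(f"task output dropped required input fields: {missing}")
    return {k: v for k, v in task_output.items() if k not in task_input}

# Merge the outputs of tasks run in parallel: X_{t+1} = Y_0 ∪ Y_1 ∪ ... ∪ Y_n.
def merge_parallel_outputs(outputs: list[dict]) -> dict:
    merged: dict = {}
    for y in outputs:
        merged.update(y)
    return merged

# Example: a description task receives candidate landmarks and adds descriptions.
x = {"maneuver_point_id": 42, "candidates": ["stop sign", "Dairy Queen"]}
y = {**x, "descriptions": ["the stop sign", "the Dairy Queen"]}
print(check_contract(x, y))   # only the task's contribution: {'descriptions': [...]}
```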
A worker performs three essential functions: first, to find new tasks needing execution, it polls the pipeline Orchestrator. Second, it carries out the computational operations needed to complete the task, utilizing inputs from the Orchestrator. Lastly, it returns outputs to the Orchestrator upon successful completion of the task, or alerts the Orchestrator of a failure.

Polling for Tasks

The first function of a worker is to poll pipeline Orchestrators for tasks in need of completion. A worker will ask only for the task or tasks it is capable of executing. It is important to note that the Orchestrator utilizes a pull mechanism for task assignment, as opposed to a push mechanism: rather than Orchestrators routing tasks to specific Workers, each Worker is responsible for finding its own work by asking Orchestrators for available jobs. While an Orchestrator serves as a manager for tasks, having state related to the precise status of all pending tasks in its pipeline, it does not serve as a manager for Workers. Indeed, no component of the Torchbearer system maintains state related to the Worker pool. Workers can be stood up or can fail without disruption.

The polling mechanism runs within its own thread in a continuous loop. Each iteration of the loop consists of the following: to initiate the polling sequence, a Worker sends an HTTP GET request to the Orchestrator. If the Orchestrator has any tasks which the worker is capable of completing, it immediately responds with a payload containing a list of tuples, where each tuple contains inputs and a unique identifier corresponding to each specific task. If no such task is currently available, the Orchestrator holds the request open for up to 60 seconds, waiting for a task to become available. As soon as a task becomes available, the payload is sent and the request ended. If no task becomes available during the 60-second window, the Orchestrator terminates the request with an empty response.

When a worker receives a response from the polling request, it first checks to see if the response contains a payload (indicating that at least one task was received). If the response contains a payload, it spawns a thread for each received task to process (complete) the given task, passing in the input and unique identifier received from the payload. If the response does not contain a payload, the current iteration of the loop completes, and the process repeats with a new polling request immediately being invoked.

Task Execution

The execution step is responsible for solving or completing the worker's task. The majority of this step's procedure depends on the task, and is discussed for each task in depth later on. It is important at this juncture to understand only that the execution step for a given task runs asynchronously in its own thread, spawned by the polling thread and provided with both the inputs to the task and the unique identifier of the task. The runnable routine of this thread consists of a program which will accept the task's inputs and yield the task's outputs—that is, it solves or completes the task.

Submitting Results

If the task completes successfully, the task execution thread sends an HTTP POST request to the Orchestrator, consisting of the task token as well as the yielded output. If any error occurred during execution, the worker sends an HTTP POST request to the Orchestrator containing the task token, the error message and any additional data about the error, such as a stack trace.
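The poll-execute-submit loop described above can be sketched as follows. This is a hedged illustration assuming a hypothetical Orchestrator HTTP interface; the endpoint paths, payload field names, and URL are placeholders, and only the overall flow (long-poll GET, one thread per received task, POST of results or error details) mirrors the text.

```python
import threading
import requests

ORCHESTRATOR_URL = "https://orchestrator.example.com"   # placeholder URL
TASK_TYPE = "landmark-description"

def execute_task(task_input):
    # Task-specific work would go here (e.g., computing landmark descriptions).
    return {"descriptions": []}

def handle_task(task):
    try:
        output = execute_task(task["input"])
        requests.post(f"{ORCHESTRATOR_URL}/results",
                      json={"token": task["token"], "output": output})
    except Exception as exc:
        requests.post(f"{ORCHESTRATOR_URL}/errors",
                      json={"token": task["token"], "error": str(exc)})

def poll_loop():
    while True:
        # The Orchestrator holds this request open for up to 60 seconds if no
        # task is immediately available (long polling).
        resp = requests.get(f"{ORCHESTRATOR_URL}/tasks",
                            params={"type": TASK_TYPE}, timeout=70)
        payload = resp.json() if resp.content else {}
        for task in payload.get("tasks", []):
            threading.Thread(target=handle_task, args=(task,)).start()

if __name__ == "__main__":
    threading.Thread(target=poll_loop).start()
```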
Worker Deployment and Operations

Torchbearer is a microservice-based system; workers have complete flexibility in implementation. Besides conforming to the input/output contract specific to the given task, Torchbearer is entirely agnostic to how a worker completes its task and where (on what machine) it does so. This flexibility provides incredible power in terms of optimizing compute resources and designing solutions which are best suited to a particular task. Workers can implement solutions in any language, run on any operating system, and run on hardware suited to their particular demands. For example, we implement a task for looking up a location in a GIS database in Scala, and, due to its lightweight computational demands, it runs on a single-core machine with 256 MB of RAM. On the other hand, we implement a deep neural network-based computer vision task in Python, and run it on a multi-core machine with a 3072-core GPU and 16 GB of RAM.

In order to facilitate this level of microservice-based independence, we implement Torchbearer workers as Docker containers and run them on Amazon's Elastic Container Service (ECS). A Docker container enables a worker to define the exact specification of its execution environment, and ECS runs this container on an appropriate hardware node. The container is a self-contained bundle consisting of the environment definition and the binary for the worker program (the code which actually handles the task).

By containerizing workers and running them on a container management service such as ECS we also gain the ability to horizontally and vertically scale compute resources at the task/worker level. We can run multiple instances of each container simultaneously, and we can adjust the number of instances in real time according to changing demands for a service—this provides us with horizontal scalability. For example, a task utilized by every pipeline will, in the steady state, require more instances than one which is used by only a single pipeline. Since the demand for Torchbearer's services is in constant flux (in general, a higher number of active users corresponds to a higher load requirement), the ability to add and remove instances of a worker as the demand for that task fluctuates is highly important to the cost-effectiveness and efficiency of the system.

Route Manager

The Route Manager (RM) service is the contact point between the Torchbearer backend and users (via the client mobile application, discussed below). While the RM service is not directly responsible for solving the landmark description problem, RM serves as the gateway for client applications wishing to use the Torchbearer system. RM exposes a public-facing Application Programming Interface (API) consisting of the following endpoints:

POST /route

A client (generally an end user's mobile phone) calls this endpoint both to determine the shortest-path route to a destination and to initiate landmark processing in the Torchbearer system. This endpoint accepts an origin tuple consisting of (latitude, longitude) (generally, the user's current location) as well as a destination tuple of the same form, and returns the shortest-path route in the form of a list of maneuver points. A maneuver point is a tuple consisting of the latitude, longitude, and bearing of the maneuver, the unique ID of the maneuver point within the Torchbearer system, and an instruction to be spoken to the user as they near that maneuver point.
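As a rough illustration, a request body and one maneuver-point tuple returned by this endpoint might look like the following Python sketch; the field names are assumptions for illustration, not the documented API schema.

```python
# Hypothetical request body for POST /route (field names assumed).
route_request = {
    "origin": {"latitude": 45.6770, "longitude": -111.0429},
    "destination": {"latitude": 45.6793, "longitude": -111.0373},
    "pipeline": "machine-machine",
}

# Hypothetical maneuver-point tuple in the response (field names assumed).
maneuver_point = {
    "id": 42,                       # unique ID within the Torchbearer system
    "latitude": 45.6781,
    "longitude": -111.0400,
    "bearing": 270.0,               # approach bearing, degrees from due north
    "instruction": "turn right onto College Street",
    "landmark_description": None,   # populated once the pipeline finishes
}
```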
If immediately available, the tuple also includes the landmark description computed by the specified pipeline; we discuss this in more detail shortly.

When RM receives this type of request, it must first determine the shortest-path route between the origin and destination points. We use Mapbox, a third-party service which offers a public street routing API. While there are no special requirements for the routing algorithm Torchbearer uses, Mapbox was chosen for its unique trait of including approach bearings for each maneuver point. Another routing service, or a custom solution, could be integrated into Torchbearer in place of Mapbox, so long as it accepts origin and destination coordinates as input and returns a list of maneuver points, each consisting of latitude, longitude, approach bearing and maneuver type (right turn, left turn, merge, etc.).

Once the route has been determined, RM queries the Torchbearer database for each maneuver point. If the maneuver point exists in the Torchbearer system, and has already been processed by the specified pipeline, the computed landmark description is returned in the response. If the maneuver point does not exist, RM inserts a record for it into the database, and initiates processing by sending an HTTP POST request to the Orchestrator of the desired pipeline. The list of maneuver points is then returned to the client.

It is important to note that while this endpoint will immediately return all maneuver points for a route, some maneuver points will be in a processing state (by the given pipeline). If a point is still processing, RM returns it without a landmark description, and the client will need to ask RM for an updated description at a later time using the GET /maneuverpoint/landmark endpoint.

GET /maneuverpoint/landmark

This endpoint accepts a maneuver point's unique identifier and a pipeline as input and returns a description of the landmark computed for that maneuver point using that pipeline, if one is available. This endpoint is used for checking whether Torchbearer has completed processing a maneuver point after the initial route has been returned. For example, if a client navigation application did not yet know the landmark description for an upcoming maneuver point, it might query this endpoint immediately prior to speaking a navigation instruction, to see if a landmark description was now available.

User Interface

Users interact with Torchbearer via a native mobile application, developed for both iOS and Android devices. The primary screen of the application is shown in Figure 3.2. The application has two principal functionalities: first, the ability to search for a destination and submit the route to the Torchbearer system; second, the delivery of spoken turn-by-turn navigation instructions containing the landmark description-based instructions created by one of Torchbearer's pipelines.

Usage of the application during a typical navigation session consists of the following flow: first, the user selects the pipeline they wish to use for computing instructions. By default, the application selects the machine-machine pipeline, which we discuss in a subsequent section. Second, the user enters a desired destination using the keyboard or speech-to-text capability of her device. The destination can be an address, business, point of interest, or general area, such as a city. Using geocoding services provided by Google, the application determines the most relevant geographical coordinates for the destination description entered by the user.
The geocoding service takes into account the provided description and the user's current location in determining the most likely destination.

Figure 3.2: The Torchbearer mobile application for spoken navigation instructions.

The app displays the address derived by the geocoding service to the user, and asks her to confirm its correctness. After confirmation from the user, the application submits an HTTP POST request to Torchbearer's Route Manager service, which returns a list of instruction tuples for the route. Each tuple consists of the coordinates of the maneuver point as well as the instruction string for the app to "speak" upon approaching the turn. While this response is returned immediately, the processing of the route (to determine landmark descriptions) is asynchronous. If a maneuver point has already been processed by the specified pipeline, its full instruction can be immediately returned in the response, but for points not yet analyzed by the pipeline, only the street name can be returned. As such, the initial route received by the application may not contain complete instructions, that is, instructions inclusive of landmark descriptions, for all maneuver points.

At this point, the application delivers a spoken instruction to the user for the first maneuver point in the route, and the user begins driving. This begins the check-speak-repeat loop: at a distance of one-half mile from the proximate maneuver point, the application checks whether it received a landmark description for the maneuver point from Route Manager as part of the initial route request. If not, it sends an HTTP GET request to Route Manager seeking an updated instruction. If the processing for the maneuver point is now complete, Route Manager responds with the updated instruction. At a distance of one-half mile, and again at one-quarter mile from the maneuver point, the application alerts the user to the upcoming maneuver via a spoken direction of the form "in one-quarter mile, [action] at the [landmark] onto [street]", where action is a predefined description of the maneuver to be performed, such as turn right or merge, landmark is the landmark description computed by Torchbearer, and street is the name of the street onto which the maneuver will take the user. In the case where no landmark description is available, either because the pipeline did not finish processing the maneuver point in time or was unable to compute a description, the "at the [landmark]" portion of the instruction is omitted (a small sketch of this templating logic follows below). At a distance of 25 feet the application will speak an instruction of the form "[action] at the [landmark] onto [street]". When the user passes through the maneuver point, executing the maneuver, the check-speak-repeat iteration for the current maneuver is complete. The relative distances at which the app speaks directions were selected in a best effort to maintain parity with the Google Maps navigation app. While we do not consider vehicle speed in timing when to deliver the "turn now" instruction, this would be a relevant area for future work.

While the check-speak-repeat routine is the same for alerting the user of arrival at their final destination as for intermediate maneuver points, the delivery varies slightly: immediately after completing the second-to-last maneuver (the last maneuver being arrival at the destination), the application speaks an instruction of the form "In [distance] your destination is the [landmark] on the [side] of the street", where side is either "left" or "right".
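To make the instruction templating concrete, here is a minimal sketch in Python (the mobile app itself is written in JavaScript with React Native; the function and parameter names below are ours, for illustration only):

def build_instruction(action, street, landmark=None, distance=None):
    """Compose a spoken instruction of the form described above.

    The "at the [landmark]" clause is omitted when no landmark description
    is available for the maneuver point yet.
    """
    parts = []
    if distance:                      # e.g. "in one-quarter mile,"
        parts.append("in " + distance + ",")
    parts.append(action)              # e.g. "turn right"
    if landmark:
        parts.append("at the " + landmark)
    parts.append("onto " + street)
    return " ".join(parts)

# "in one-quarter mile, turn right at the stop sign onto Main Street"
print(build_instruction("turn right", "Main Street",
                        landmark="stop sign", distance="one-quarter mile"))
# "merge onto I-90" (no landmark description available yet)
print(build_instruction("merge", "I-90"))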
Upon arrival at the destination, the application speaks one last instruction of the form “you have arrived at your destination. It’s the [landmark] on the [side].” The arrival event completes the navigation session, and the application returns to a point from which the user can enter a new destination and begin a new session. The mobile application is implemented using the React Native framework [16], a cross-platform, JavaScript-based library. This framework allows for a single codebase across both iOS and Android, and while it is written in JavaScript as opposed to the native Swift or Java, all visual components are rendered natively on the device. This creates a highly-responsive interface that feels like a native application as opposed to a mobile website. Street-level Imagery Much of Torchbearer’s work, whether human-based or machine-based, relies on visual computations based on the visual scene a driver would be seeing as he approaches a maneuver point. This computation requires a source of street-level imagery, photographs, taken from a vehicle on the road. These images must be of a relatively high definition (at least 640 pixels by 640 pixels), be in color and be available at all maneuver points through which Torchbearer provides navigation services (ideally, most roadways in the United States). Additionally, images must have no distortion, either by attribute of camera setup or post-production correction. That is, each image much be rectilinear. We use Google Street View due its high coverage of U.S. roads, high definition images, and public availability. The service can return an image for a particular 27 latitude, longitude and compass bearing. The service returns rectilinear-projected [2], distortion free images for a given latitude, longitude, field of view and bearing. Field of view is limited to 120 degrees, as any larger can lead to incorrect perspective near the vertical edges of the image—a side effect of rectilinear projection. Human Input Torchbearer makes decisions via two means—algorithms and humans. Leverag- ing human opinion and decision-making in a computational system presents unique challenges, which are not considerations in most computer systems. Human input provides Torchbearer with insight into the landmark description problem that may be difficult to express algorithmically: while our machine-based pipelines include well- founded heuristics for finding and describing the best landmark to use for describing a maneuver point, we hypothesize that humans may offer some unique insight into solving this problem that our heuristics do not. The subjective nature of determining the best landmark, as well as a description of what that landmark is, make human insight and opinion especially valuable. In order to gather human input at a large scale, in real time, we require a large source of human workers. For this Torchbearer leverages Mechanical Turk (MTurk), a large-scale crowdsourcing platform. Mechanical Turk manages a pool of workers, and allows requesters (such as Torchbearer) to submit Human Intelligence Tasks (HITs) to this pool. At a high level, a HIT is simply a question to be asked of a human worker, with some form of answer specification. Torchbearer presents HITs to workers via a web page hosted on its servers, allowing for rich content to be displayed to the worker. The demographics of the Mechnaical Turk worker pool are not restricted; we allow any worker to work on Torchbearer HITs so long as they pass the qualification test detailed in Section 3. 
28 Each HIT specifies a monetary reward, which Torchbearer pays to the worker upon successful completion of the HIT, as well as a maximum duration the worker is allowed to work on the HIT. Workers are compensated at a rate chosen to be above the average pay for similar work; this removes any concerns about inferior results due to sub-par wages. Workers are paid eight cents for selecting landmarks in an image (similar to the more general object tagging task, which is common on Mechanical Turk) and 10 cents for to describe a landmark (similar to the common image captioning task). Workers are paid three cents to verify the accuracy of a description. Additionally, the HIT can demand that the worker has a certain qualification—a test created by Torchbearer, which the worker must pass—in order to be allowed to work on the HIT. Lastly, the HIT specifies how many workers it should be completed by, allowing for the collection of multiple answers to the same question. When a HIT is submitted to MTurk, it becomes available for workers to complete. Workers choose the HITs they work on; they are not automatically assigned. Once a HIT has been completed by the specified number of workers, the answers are sent back to the requester by adding them to a distributed queue. While each human-based pipeline task specifies its own format for questions and answers, the general system Torchbearer uses for gathering data via MTurk is constant. When a pipeline task requests human input, Torchbearer’s MTurk management service (Turk Service) submits a HIT to MTurk with the parameters specified by the pipeline task. Some questions are simple in terms of how they can be displayed to the worker. They may consist of a text-based question with text-based answers, for example. Such questions are submitted to MTurk as part of the HIT specification, and are hosted entirely by MTurk. (The Description Verification task, which we discuss in detail later on, is an example of this type of HIT.) Other questions 29 may require displaying rich content to the worker and accepting interactive answers, such as the drawing of boxes around landmarks in an image. These questions must be served to workers as HTML pages by Turk Service, and the HIT only specifies the URL of the given page. When a worker is ready to complete the task, MTurk requests the page from Turk Service, and displays it to the worker. MTurk collects answers from workers as they complete each HIT. Once a HIT has been completed by the number of workers required, MTurk sends the list of answers back to Torchbearer by adding a message to a distributed queue shared between Torchbearer and MTurk. Turk Service continuously polls this queue for new lists of answers. When one arrives, Turk Service first determines the aggregated answer— the final answer based on a combination of the individual answers of each worker— by applying an aggregation function. (The aggregation function varies by HIT; we discuss the specific function for each human-based task in the Pipelines section.) Turk Service sends this aggregated answer to the Orchestrator of the pipeline, and pipeline execution continues. Getting Meaningful Answers The primary challenge associated with asking a question of a worker is that of trust: do we trust that the worker gave us a meaningful answer, that she took the time to give the best response, as opposed to the easiest? Additionally, do we trust that she actually understood the task? 
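Before discussing how Torchbearer addresses these concerns, the following sketch illustrates how a HIT of the kind described above might be submitted programmatically through the MTurk API using the boto3 client. The question URL, qualification type ID and several parameter values are placeholders; only the reward amount, the assignment count and the 80% qualification threshold come from the text.

import boto3

# Sketch only: submitting an externally hosted Torchbearer-style HIT.
mturk = boto3.client("mturk", region_name="us-east-1")

external_question = """
<ExternalQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2006-07-14/ExternalQuestion.xsd">
  <ExternalURL>https://turk-service.example.com/hits/saliency?maneuver_point=123</ExternalURL>
  <FrameHeight>600</FrameHeight>
</ExternalQuestion>
""".strip()

response = mturk.create_hit(
    Title="Select the best landmark in a street-level image",
    Description="Draw a box around the object you would use as a landmark.",
    Keywords="image, landmark, bounding box",
    Reward="0.08",                       # eight cents for the landmark-selection task
    MaxAssignments=5,                    # collect answers from five workers
    AssignmentDurationInSeconds=600,     # maximum time a worker may spend
    LifetimeInSeconds=86400,
    Question=external_question,
    QualificationRequirements=[{
        "QualificationTypeId": "REPLACE_WITH_TORCHBEARER_QUALIFICATION_ID",
        "Comparator": "GreaterThanOrEqualTo",
        "IntegerValues": [80],           # at least 80% on the qualification test
    }],
)
print(response["HIT"]["HITId"])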
In the simplest scenario, for a specific human input question, Torchbearer submits a single HIT to MTurk, and accepts the response from the worker who completed it as the final answer. While straightforward, this approach does not offer confidence in how meaningful the response is—it is possible the worker put minimum effort into the HIT in the name of speedy completion. To counteract this, 30 Torchbearer makes use of two separate methods for filtering out nonsensical human answers: sampling and majority verification. Additionally, we require that all MTurk workers complete a training exercise and pass a qualification test specific to the task they are working on prior to submitting any results. The training materials and examination are hosted by MTurk. Worker Qualification Leveraging MTurk’s Qualification system allows us to filter out workers who do not understand the goal of Torchbearer’s HITs. This screening allows us to both train the worker in how to complete a task as well as ensure they have the understanding and insight needed to complete the task successfully. While qualification does not prevent a worker from providing (either intentionally or unintentionally) a bad answer, it does ensure that they are capable of providing an answer of acceptable quality. The qualification system consists of two components, training and enforcement. The training component consists of a web-based guide to completing the given task, complete with good and bad examples, descriptions of the goal of the task and step- by-step instructions. When a worker first desires to work on a Torchbearer HIT, she is presented with this guide. After viewing it, she may take the qualification test, a short multiple-choice exam which asks the worker to pick the best answer to an example HIT. Even though some questions may be largely opinion-based, the answer set is clear as to which choice would be an acceptable answer. Other answers have a glaring inconsistency which the guide would have specifically pointed out as being undesirable–such as selecting a non-permanent object as the best landmark. An example test can be seen in Appendix ??. In order to be allowed to work on Torchbearer HITs, a worker must score at least 80% on the qualification test and have viewed all parts of the training guide. Until 31 these requirements have been met for a given type of HIT, MTurk will not allow the worker to submit answers. Sampling In the sampling approach, we require that multiple workers complete each HIT, providing us with multiple responses. We can then determine the final answer by applying an aggregation function to the individual responses, such as taking the mean or median or mode. We benefit in two ways from this approach: first, having a majority of meaningful responses dampens the response of a negligent worker. Consider the trivial example of a HIT asking workers to count the number of cars appearing in an image. If we asked only a single worker, we would have to take her at her word, with no means of knowing how correct or incorrect her response was. However, if we ask five workers, and three provide the correct count while two provide the incorrect count (whether by intentional neglect or honest mistake) we could still arrive at the correct answer by either taking the median or the mode. Of course, the increased cost of this approach is directly proportional to the number of workers we ask to complete our HIT. 
The second benefit comes into play if there is more than one answer to the question being asked, or if the question is largely opinion-based. Consider an example where we want to know which car in an image is the nicest, or most luxurious. Obviously, this is not an objective question–but we may still be able to gain insight by looking at the most frequent answers given by our sample of workers. If four workers suggest that car A is the nicest, one suggests that car B is the nicest, and one suggest that it is in fact car C, then we have reasonable evidence that car A is considered to be the most luxurious. The sampling approach is powerful, but works best when the answers are quantitative and can be easily aggregated. For HITs which require answers that 32 are difficult to aggregate and compare the majority verification approach is best. Majority Verification Instead of requiring multiple workers to answer each HIT, the majority verification approach, inspired by Kulkarni et al. [30], requires only one answer. However, to ensure that the given answer is correct, a sample of workers (generally three) is asked to confirm that the answer is correct. This verification is treated as a majority vote: if at least two out of three workers assert that the given answer is correct, we trust that answer. This method can be more cost effective, as asking a worker to vote on the correctness of an answer is cheaper than asking them to define the answer for a complex task. Additionally, this approach does not require an aggregation function be defined, which is convenient for answers which are difficult to quantify, such as text-based answers. One important limitation of this approach is that it will not work well with opinion-based quandaries, such as the most luxurious car in an image. The voting pool is unlikely to agree on whether an answer is correct, since they themselves have opinions which can differ from that of the worker who provided the answer. Instead, this approach is well-suited to HITs which have an obvious correct answer, such as the number of cars in an image. Pipelines The problem of determining the optimal landmark description for a maneuver point consists of two main tasks: determining salient landmarks within the drivers view of the maneuver point and creating lexical descriptions of those landmarks. We refer to these two broad tasks as the saliency and description tasks, respectively. We propose two methodologies for solving each task, giving a total of four pipelines. Our approaches are based on two principal methodologies–human-based and 33 machine-based. To this end, we have created one pipeline which is entirely machine based, another which is entirely human based, and two others which are hybrids of machine and human computation. Pipelines are referenced via a method-method notation, where method can be either human or machine. The left-hand method refers to the method used for selecting salient regions of the maneuver point; the right-hand method refers to that for deriving a description of a given region. Pipelines at a High Level While the exact manner in which a pipeline solves the landmark description problem varies from pipeline to pipeline, all pipelines share a general sequence of execution, and all take a tuple consisting of latitude, longitude and bearing as input and yield a tuple containing the best, most salient selected landmark as output. 
The first step in any pipeline is to obtain street-level images of the maneuver point at the given geographic coordinate and bearing (number 1 in Figure 3.3). For each maneuver point, Torchbearer gathers street-level images from three points relative to the maneuver point: "at", "just before" and "before" the intersection, corresponding to 25, 50 and 100 feet, respectively. (See Figure 3.4.) When directions are spoken to the end user, these positions are inverted, into "at", "just after" and "after", describing the position of the maneuver point relative to the image the selected landmark was found in. We use imagery from these three positions in order to obtain a "view" of the maneuver point that captures landmarks of different scales, from signs right at the intersection to buildings which may only be visible from farther back. The closest (25-foot) distance was selected as it is the closest distance at which a stop sign generally becomes visible in a Google Streetview image; the farthest distance was selected as the distance at which buildings on the side of the road near the intersection become visible. A brief sketch of how these capture positions can be computed from the maneuver point's coordinate and bearing is given below.

Figure 3.3: The general structure of a Torchbearer pipeline.

Figure 3.4: The positions of street-level images relative to a maneuver point.

No matter the exact approach the pipeline takes to obtaining a landmark description, it will need these images to perform its determination.

Next, the pipeline must generate a set of candidate landmarks, C. A candidate landmark is simply an object at the maneuver point that could be used as the basis for a landmark-based instruction; we know nothing yet about how salient the landmark is. Generation of the candidate landmark set is performed implicitly by either the saliency or description step, depending on the specific pipeline. Pipelines which leverage human-based description rely on the saliency step (whether human or machine-based) to generate a candidate set (step 2 in Figure 3.3). Pipelines which use machine-based landmark description generate a candidate landmark set as part of the description step (step 3 in Figure 3.3).

After the saliency of each candidate landmark has been determined (step 2 in Figure 3.3), and each candidate has received a lexical description (step 3 in Figure 3.3), the pipeline must decide which landmark is best, or most salient (step 4 in Figure 3.3). The formula varies by pipeline, and depends on which components of saliency were measured.

Saliency

The saliency step of a pipeline is responsible for quantifying the saliency of candidate landmarks. While saliency consists of three components (visual, semantic and structural), not all three components are considered individually in each pipeline. Human-based saliency is based on only a single overall score, generated by human opinion, that represents humans' ability to distinguish good landmarks. Machine-based pipelines consider both semantic and visual saliency. In all pipelines, structural saliency is enforced rather than evaluated: in accordance with the literature, we consider only candidate landmarks that are located at or very near to the maneuver point.
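Returning briefly to the imagery-gathering step, the capture positions described above can be computed by stepping back from the maneuver point along the direction opposite the approach bearing. The sketch below is an illustration only: it uses a flat-earth approximation rather than a proper geodesic offset, and the Street View Static API parameters shown (size, location, heading, fov, pitch) are the publicly documented ones rather than Torchbearer's exact request.

import math

FEET_PER_DEGREE_LAT = 364000.0  # rough approximation near mid-latitudes

def capture_positions(lat, lng, bearing_deg, distances_ft=(25, 50, 100)):
    """Points 25, 50 and 100 feet back along the opposite of the approach bearing."""
    positions = []
    back = math.radians((bearing_deg + 180.0) % 360.0)
    for d in distances_ft:
        dlat = d * math.cos(back) / FEET_PER_DEGREE_LAT
        dlng = d * math.sin(back) / (FEET_PER_DEGREE_LAT * math.cos(math.radians(lat)))
        positions.append((lat + dlat, lng + dlng))
    return positions

def streetview_url(lat, lng, heading_deg, api_key):
    """Street View Static API request for a 640x640, 90-degree, zero-pitch image."""
    return ("https://maps.googleapis.com/maps/api/streetview"
            "?size=640x640&location=%f,%f&heading=%f&fov=90&pitch=0&key=%s"
            % (lat, lng, heading_deg, api_key))

for lat, lng in capture_positions(45.6770, -111.0429, bearing_deg=90.0):
    print(streetview_url(lat, lng, heading_deg=90.0, api_key="YOUR_KEY"))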
The Human Approach Humans are accustomed to picking out landmarks from their surroundings in day-to-day life, be it for giving a friend directions or for their own internalization of a route or location. We can take advantage of this innate ability by asking a human MTurk worker to select what they believe is the most salient, most standout landmark at a given maneuver point. Unlike algorithmic saliency detection, here we do not separate the concept of saliency into its visual and semantic subparts. Rather, we hypothesize that, because human workers have an elemental understanding of what makes a landmark salient, the decisions they make regarding the best landmark at a given point implicitly incorporate these saliency concepts. 37 We gather human saliency detection input via an MTurk HIT, denoted a “Saliency HIT”. The Saliency HIT must be completed by five workers, and consists of the following task: after the worker elects to work on the HIT, he is shown a high-resolution image of the maneuver point in question from three distances (at, just before, and before the point, corresponding to 25, 50, and 100 feet, respectively). Note that all three of these images are of equal dimensions. The HIT instructs the worker to use his mouse to draw a bounding box tightly around the object that he believes is the best landmark—the one he would use if he were telling a driver to perform the given maneuver right at that point. The worker can choose an object in any of the three images, but can only select one object. We offer the worker three images from three distances so that landmarks of different scales can be captured: a stop sign, for example, is hard to detect in an image from far away, but is prevalent in an image from right near the maneuver point. Likewise, a large building may be an excellent landmark, but might only be visible from some distance away from the maneuver point. In essence, we are showing the worker the approach to, or the path leading up to, the landmark, and allowing them to see what the driver would see at three points along this path. In the final instruction spoken to the driver, we take into account the position of the best landmark—that is, if the best landmark is one which was selected from the just before image, the spoken instruction will tell the driver to turn just after the specified landmark. After the worker makes his selection, the Torchbearer-hosted webpage submits the coordinates of the drawn bounding box, along with the position corresponding to the image the box was drawn in, to MTurk. After five workers complete the task, MTurk sends the set of five bounding boxes back to Turk Service, via a distributed queue. Torchbearer must now aggregate these answers: this particular human task 38 leverages the sampling and aggregation approach to human input described in a previous section. That is, because bounding box coordinates are quantitative, we can combine them together in a manner which rewards the agreement among workers, if there is any, and culls answers which are in the severe minority and likely to be meaningless. Turk Service performs aggregation by creating a matrix called a saliency map for each of three maneuver point images; this matrix represents the number of workers who included each pixel in the bounding box they drew. Algorithm 3.1 creates this matrix. 
Input: B, a set of tuples (x1, y1, x2, y2) representing bounding boxes; m, the width of the maneuver point image; n, the height of the maneuver point image
Output: S, a matrix of dimension m by n
1: S ← 0_{m,n}
2: for b ∈ B do
3:     S[b_y1 : b_y2, b_x1 : b_x2] += 1
4: end for
5: return S
Algorithm 3.1: Creating a saliency map from human input

The result of this operation is a matrix of size equal to that of the maneuver point images shown to the worker, where each element corresponds to a pixel in the original maneuver point image and where the value of each element is an integer between 0 and n, where n is equal to the number of workers. While the saliency map does not incorporate any decision about which regions are or are not salient landmarks, it encodes the relative saliency of each pixel in the image. To make this matrix easier to work with in subsequent pipeline steps, we normalize all values to the range 0 to 255, where a value of 255 indicates maximal saliency. A subsequent task in a pipeline can use this saliency map either to find the most salient regions or to query the total saliency of a target region.

The Machine Approach

The algorithmic approach to determining salient landmarks consists of separate components for visual and semantic saliency. However, the machine-based saliency step deals only with visual saliency; the machine-based description step provides semantic saliency scores.

Visual saliency refers to the perceptive quality of a region of the driver's view which causes that region to stand out from its neighbors, that is, the degree to which a region grabs a driver's visual attention. Street-level imagery of a maneuver point serves as input; the goal is to quantify each pixel of a maneuver point image in terms of its relative visual saliency. Specifically, given an m x n input image of a maneuver point, we output an m x n saliency map, where each element in the matrix is an integer between 0 and 255 corresponding to how visually salient that pixel is. A value of 0 indicates no saliency, while a value of 255 indicates maximal saliency.

Torchbearer leverages a state-of-the-art, deep learning-based algorithm called SalNet [44] to estimate the pixel-level visual saliency across an image. Rather than seeking to identify specific neuroscience-inspired image features that indicate saliency, as many previous approaches do, SalNet takes a completely data-driven approach, using a deep convolutional neural network to learn where the human gaze tends to fixate in different images. Training data consists of a large dataset of ImageNet [13] images, each with a corresponding ground truth saliency map. This dataset was created by tracking subjects' gaze as they were shown each image and recording the time gaze was focused on each pixel. These gaze times were then normalized to between 0 and 255, inclusive.

SalNet uses a deep neural network architecture to predict the saliency map for an input image. The first three layers of this network consist of pretrained layers from a Visual Geometry Group image classification network, VGG16 [57]; the authors recognize that the low-level features learned by these layers offer valuable input to the saliency problem. VGG16 was trained on an extremely large dataset, and by using transfer learning, SalNet can benefit from this extensive training without needing to train on so many images itself.
After the pretrained VGG network, SalNet incorporates a series of convolutional and pooling layers, and finally a deconvolutional layer, which will cast the output back into a matrix of the same size as the input. Training of the neural network consists of minimizing the Euclidean distance between the saliency map output by the network and the ground truth saliency map provided by the training dataset. During training, the weights of the first three layers are fixed at the pretrained weights from the VGG16 network; only the additional, saliency-specific layers unique to SalNet are actively trained. It is important to note that SalNet is trained on a wide range of ImageNet images from across a broad range of topics; it does not incorporate any knowledge specific to the navigation domain. At the time of writing, no dataset containing ground truth saliency maps for street-level imagery of sufficient size for training a neural network was available. Training the SalNet architecture with domain-specific data would certainly be worthwhile future work. However, the general principles of visual saliency are not specific to any single domain, and the generalized training of SalNet allows it to perform well on an evaluation set of images from across the ImageNet corpora. We have hypothesized that it can adequately generalize to the navigation domain. 41 Figure 3.5: Left: a maneuver point image. Right: a corresponding saliency map generated by SalNet Description The second half of the landmark selection problem consists of deriving a lexical description of a candidate landmark, although the machine approach to description is also responsible for generating candidate landmarks as well as providing semantic saliency scores. This description should be specific enough so as to allow a driver to easily distinguish that given landmark from its surroundings. The Human Approach To gather human descriptions for a given landmark, we again leverage Mechanical Turk. However, instead of using a sampling approach as we did with saliency crowdsourcing, we use a verification approach. First, for a candidate landmark c, we annotate the street-level image of the maneuver point for which this landmark is a candidate to include a bounding box drawn around the landmark. We create an MTurk HIT is with only a single assignment; the worker is shown this image and asked to describe the object enclosed in the bounding box. The exact format of the question is: “Provide a specific description of the main object in the box. Describe PERMANENT, man-made things–NOT cars, people or things that could move. Pretend you were using that object as a landmark when giving someone directions.” The Torchbearer-hosted webpage presents the worker with a text box into which to type their answer. 42 After the worker has submitted the description, we create a verification HIT on MTurk, with three assignments. The annotated maneuver point image, along with the candidate description, is shown to each worker. The worker is asked to decide whether the description is accurate and meets the criteria of describing permanent, man-made things–not cars, people or things that could move. Three radio buttons are displayed–“Description is accurate” and “Description is inaccurate and the landmark is valid”, and “not a valid landmark”. If at least two of the three workers indicate that the description is accurate, the description is accepted, and pipeline execution can continue. 
If at least two of the three workers indicate that the landmark is invalid, pipeline execution continues with this landmark removed from the set of candidates. If the majority of workers indicate that the description is incorrect, or if there is no majority opinion, the description process repeats, with the creation of a new description HIT and subsequent verification HITs. Torchbearer will retry this process up to three times; if no description could be derived, the landmark is removed from the candidate set and pipeline execution continues.

The Machine Approach

Torchbearer leverages two approaches for finding semantically salient landmarks and quantifying their salience: a data-driven approach, which uses a geosocial data source to estimate the local significance of businesses and points of interest, and a deep learning-based object detection algorithm, which searches for known types of semantically salient features in maneuver point images.

Data-driven Approach

Torchbearer estimates the semantic saliency of a landmark from the number of people who have recently visited it, as counted by the social networking application FourSquare. Previous work has shown the efficacy of using geosocial streams as a proxy for the local importance of a landmark, the intuition being that the more people who have checked in to a given location, the more well-known, or prominent, it is [47]. FourSquare incorporates businesses, points of interest and publicly accessible places into its ecosystem; these are referred to as venues. User location is recorded transparently, without the need for the user to explicitly tap a "check in" button.

Torchbearer leverages FourSquare's venue data both to find candidate landmarks and to determine their semantic saliency. To find candidate landmarks for a given maneuver point, Torchbearer queries FourSquare for venues which are within a given radius of the maneuver point. By default, we use a small 100-foot radius, with the aim of ensuring that any returned venue will be on or near the road upon which the maneuver point is located. FourSquare returns a list of tuples consisting of the venue name, the type of venue (such as restaurant, gas station, etc.), the geographic coordinates and the number of FourSquare users who have checked in to that venue. We compute the relative bearing between the venue and the approach bearing of the user, and discard venues which are not within 45 degrees of either side of the user, as the field of view of our street-level imagery is 90 degrees. We convert each of these venues to a Landmark: the landmark's description is the name of the FourSquare venue concatenated with its category. For example, the description for a landmark corresponding to a venue with the name "Starbucks" and category "Coffee Shop" would be "Starbucks Coffee Shop". The landmark's semantic saliency score, S_s, is a function of the number of check-ins in the last six months, c, and the number of locations, l, if the venue is a chain:

S_s = c + l    (3.4)

This measure captures both the local significance and wide-area ubiquity of the landmark. Note that all saliency scores are relative, and are meant to be compared against other candidate landmarks at a maneuver point. We determine the position of the landmark relative to the maneuver point based on its proximity to the maneuver point: if within 50 feet, the position is "at"; if not, the position is "after". A sketch of this venue-to-landmark conversion is given below.
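The following sketch illustrates the conversion just described. It is an illustration only: the venue records are assumed to have already been fetched, and the field names are ours rather than FourSquare's actual response schema.

def relative_bearing(bearing_a, bearing_b):
    """Smallest absolute angle between two compass bearings, in degrees."""
    diff = abs(bearing_a - bearing_b) % 360.0
    return min(diff, 360.0 - diff)

def venues_to_landmarks(venues, approach_bearing):
    """Convert FourSquare-style venue records into candidate landmarks."""
    landmarks = []
    for v in venues:
        # Keep only venues within 45 degrees of either side of the approach,
        # since the street-level imagery covers a 90-degree field of view.
        if relative_bearing(v["bearing"], approach_bearing) > 45.0:
            continue
        landmarks.append({
            "description": v["name"] + " " + v["category"],  # e.g. "Starbucks Coffee Shop"
            "semantic_saliency": v["checkins"] + v.get("locations", 0),  # S_s = c + l
            "position": "at" if v["distance_ft"] <= 50 else "after",
        })
    return landmarks

example_venues = [{"name": "Starbucks", "category": "Coffee Shop", "bearing": 95.0,
                   "checkins": 220, "locations": 28000, "distance_ft": 40}]
print(venues_to_landmarks(example_venues, approach_bearing=90.0))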
These positions are inverted (into “at” and “before”, respectively) if the landmark is selected for inclusion in a spoken instruction to an end user. Figure 3.6 shows this determination. at after 50’ 10 0’ L Figure 3.6: Determining landmark position for data-driven description approach. We consider landmarks within the 50-foot inner radius to have a position of “at”, and those within the 100-foot outer radius to have a position of “”after”. For example, landmark L in this diagram would have a position of “after”. Object Detection Approach Some landmarks are ubiquitous and proven to be highly semantically salient, independent of the maneuver point’s geographic location. Road infrastructure, such as stop signs and traffic lights, is a prime example: these landmarks are universally recognizable among drivers, and have been shown to serve as excellent landmarks for use in navigation instructions [38]. Unfortunately, we found 45 no dataset of street signage or traffic lights with coverage beyond a specific locality. Instead, we leverage a state-of-the-art object detection algorithm, Faster-RCNN [50], to detect stop signs and traffic lights at maneuver points. Note that as a direction for future research, extending the object detection model to include other types of landmarks is both feasible and potentially beneficial. Faster-RCNN (FRCNN) is a deep, region-based, convolutional neural network which takes an image as input and yields a set of bounding boxes, class labels (a string denoting which object the region was classified as) and confidence scores for objects of interest detected within the image [50]. It is currently one of the highest performing classifiers in terms of both speed and accuracy [53], [50]. FRCNN leverages an existing image classification network, ResNet, to compute feature maps for an image, and then uses the output of an intermediate convolutional layer in that base network as input to its own FRCNN-specific layers. This inclusion of a network trained for large-scale classification is known as transfer learning, and allows an FRCNN model to take advantage of the extensive training across millions of ImageNet images encoded within ResNet. The output of this intermediate convolutional layer, although trained on ImageNet data, outputs high-level image features as opposed to specific classes probabilities. Using these high-level feature maps as input, FRCNN trains its own final (fully-connected) layers to output class probabilities specific to our data. FRCNN consists of two sub-networks: a Region Proposal Network (RPN), trained to output a set of possible bounding boxes, and the CNN network itself, which performs classification and final bounding box adjustment (based on the predicted class). To predict likely bounding boxes, the RPN considers a pre-generated set of anchor boxes. Each anchor box is a fixed set of 9 candidate bounding boxes, of different sizes and aspect ratios, anchored at every point in the image. For example, 46 if the input image is of dimensions n x n, there are 9n2 anchor boxes for the RPN to consider. For each anchor box, the RPN learns to output (through the training of three convolutional layers) a probability corresponding to the likelihood of the box containing an object of interest, as well as a tuple of four doubles indicating the amount by which to adjust each coordinate of the predefined anchor box. Boxes with a probability of objectiveness below a certain threshold are discarded, the rest are passed on to the classification sub-network. 
Given a set of possible bounding boxes generated by the RPN, the CNN first uses Region of Interest Pooling (ROI) to generate fixed-size convolutional feature maps corresponding to the region of the input feature map contained within each bounding box. ROI consists of splitting the box into k evenly sized regions and selecting the maximal value from each region, yielding a feature map of size k, where k is a small integer, often 7. This pooled feature map is then input to two successive 4096-neuron fully-connected layers–these two layers learn the actual classification function. The output of the second fully-connected layer is passed through a softmax layer of size equal to c+ 1, where c is the number of classes we are trying to predict. (The extra output is for the “background” class–a bounding box that did not contain an object.) The softmax layer gives a floating-point number for each output, subject to the following constraint: let Y be the set of outputs, then ∑ y∈Y y = 1 (3.5) This gives a probability distribution over the set of possible classes for the likelihood of an object being a particular class (or background). In addition to the softmax output corresponding to class predictions, the network outputs a tuple of bounding box adjustments corresponding to each class. (These 47 are output via a single fully-connected layer of size 4c.) These adjustments capture information about how to transform a pre-generated anchor box into the correct shape for a class; for example, it will learn that a stop sign is square. Using images from Google Streetview, we constructed a dataset of 800 street- level images and ground-truth bounding boxes. Ground truth labels were created by hand using the Visual Object Tagging Tool [12]. Each image contained traffic lights, stopsigns or both. We generated an addition 75 negative examples—images containing neither a stoplight nor a stop sign. This dataset was divided into training and test sets, with a split of 85% train and 15% test. We trained an FRCNN network for 20 epochs–that is, 20 complete passes through our training set. At the completion of training, we achieved a mean average precision on our test dataset of 0.71 for stop lights and 0.75 for stop signs. Finding Landmarks in Saliency Maps Given a saliency map, it is often important to locate candidate landmarks based on hot spots, or highly salient regions, in the map. The significance of this is different for human-based saliency detection than for machine-based saliency detection. As an example, consider the street-level image and corresponding saliency map shown in Figure 3.7. Figure 3.7: Left: a street-level image, with two stop signs and a building as potentially salient landmarks. Center: the corresponding saliency map, generated by SalNet. Right: the saliency map overlaid atop the street-level image. 48 With human-based saliency detection, the goal is to reduce the set of returned bounding boxes into a reduced set of distinct landmarks, by combining overlapping bounding boxes into a single area. For example, of the five answers it might be that three bounding boxes mostly overlap, indicating that that those workers intended to select the same landmark, while the other two answers overlap a separate landmark. Rather than treat all five bounding boxes as separate landmarks, it is beneficial to instead consider only the two distinct landmarks. First, this reduces the scale of future pipeline operations—those steps do not need to perform (redundant) calculations on as many candidate landmarks. 
This reduction saves time and compute cycles and, in the case of human-based tasks, fees paid to workers. Second, by reducing bounding boxes into aggregated areas, we can assign a saliency score to the candidate landmark based on how many answers included it in their bounding box. This can be used at the end of the pipeline as part of the decision process for choosing the best landmark. It is this score that acts as proxy for human intuition into what makes the best landmark: the more workers who select the pixels containing a landmark, the more salient the landmark. In the case of machine-based candidate landmark generation, we need to correlate the set of candidate landmarks generated by the machine description (FourSquare-based) step with an area of the visual saliency map. Only the latitude, longitude and relative bearing between street-level image and landmark are known. We need to locate potential salient regions in the saliency map, so that we can determine if the candidate landmark aligns with one of those regions. Given a saliency map, a matrix of values ranging from 0 to 255, the goal is to label each pixel as belonging to a specific salient region or being non-salient. Non-maximal suppression (NMS) is a state-of-the-art method for reducing a set of bounding boxes to only the significant ones, discarding bounding boxes which 49 enclose the same region using greedy clustering and a fixed distance threshold [40]. If our saliency map were composed of entirely rectangular regions of different saliency values (as is actually the case with human-based saliency detection) this method would be sufficient. However, the saliency map returned by our computer-vision based saliency algorithm estimates saliency at the pixel level and, as a result, makes no guarantee about the shape of salient regions. The Watershed Algorithm is an image segmentation approach, designed to single out distinct regions in the image by separating foreground elements from background elements [3]. In classic image processing, these regions might be objects one wishes to separate from one another. In our case, we wish to separate regions of relatively high saliency (foreground) from their low-saliency surroundings (background). The algorithm works by considering our saliency map as a topological surface, where the value of a pixel denotes its height–pixels with a value of 0 (no saliency) are valleys and pixels with a value of 255 (highest saliency) are peaks. For each valley, or minima, in the map, the algorithm simulates filling the topology with different-colored water–that is, it labels pixels as belonging to a given segment. As simulated the water level rises, water from different valleys will begin to converge. To prevent this, the algorithm constructs infinitely tall barriers, or segmentation lines, between the two valleys. The algorithm continues this process until even the tallest peak is submerged, leaving only the barriers above water. These barriers now encapsulate different objects, or salient regions, within the map. To make this algorithm more impervious to over-segmentation and noise–small regions of high salience within a low-salience area or vise versa–we leverage the marker-controlled watershed algorithm [51]. Here, we dictate to the algorithm which pixels we know to be independent, salient regions, which ones we know to be non- salient, background pixels and which ones we are unsure about (the border area 50 between known salient regions and non-salient background). 
Now, rather than flooding starting at the minima, the algorithm begins flooding from each foreground region we specified and from the background region; it now simply finds where the segmentation line will be placed within the unknown border area.

In order to apply the watershed algorithm, several preprocessing morphological steps must be taken to clean up the saliency map, and each pixel must be labeled according to its status as known background, known foreground, or unknown. We adapt a procedure outlined in [1]. The following steps outline this process, given a saliency map S:

1. Perform binary segmentation on S, rendering each pixel as salient (255) or non-salient (0). (This segmentation yields a "black and white" image.) We first compute a threshold t, at or above which a pixel is considered salient and below which a pixel is considered non-salient. We select t via Otsu Thresholding [42], which works by iterating through all possible threshold values in [0, 255] and selecting the one which minimizes the sum of the weighted variances within the salient and non-salient classes. That is,

threshold = \arg\min_t \left( \frac{|n|}{|n| + |s|}\,\sigma_n^t + \frac{|s|}{|n| + |s|}\,\sigma_s^t \right)    (3.6)

where t is the candidate threshold, s is the set of salient pixels, n is the set of non-salient pixels, and \sigma_n^t and \sigma_s^t are the variances within the given sets of pixels when t is used as the threshold value. Figure 3.8 shows the saliency map after Otsu Thresholding.

Figure 3.8: The result of applying Otsu Thresholding to the saliency map. White areas (having a value of 255) represent areas of saliency.

2. Remove small, insignificant salient areas (white noise) by performing morphological opening on the binary segmentation. Figure 3.9 shows the saliency map after applying morphological opening. While difficult to see at a small scale, several spots of white noise were removed.

Figure 3.9: The saliency map after applying both Otsu Thresholding and morphological opening. While difficult to see at a small scale, several spots of white noise were removed.

3. Remove small, insignificant non-salient areas (holes) by performing morphological closing on the binary segmentation. Figure 3.10 shows the results of this step; as this particular saliency map does not have any non-salient holes within a salient region, the process had no visible effect.

Figure 3.10: The results of the morphological closing step; as the particular saliency map does not have any non-salient holes within a salient region, the process had no visible effect.

4. Determine which pixels are known to be non-salient by dilating the binary segmentation, falsely enlarging the salient regions. Dilation consists of scanning a square kernel K over the binary segmentation and, at each point, replacing the binary segmentation pixel underneath the anchor point (center) of K with the maximal value overlapped by K. Denote this dilation, shown in Figure 3.11, as Mn.

Figure 3.11: Dilation Mn: the parts of the image known to be non-salient are in black (values of 0). Notice that the salient (white) regions are slightly enlarged compared to the results of the previous step.

5. Apply a distance transformation to the binary segmentation, resulting in the value of each pixel being equal to the Euclidean distance between that pixel and a pixel with value 0 (non-salient background).
This operation is essentially finding salient peaks, or the centers of salient regions, as the pixels which are farthest from a non-salient pixel are the ones in the center of a large salient 53 region. Denote this distance transform as D (shown in Figure 3.12). Figure 3.12: Distance transformation D: the center points of the salient regions are exactly white (255), as they are the farthest from a non-salient (black) pixel. 6. Determine the set of pixels which are likely to be salient by applying a binary threshold to the distance transform, where t, the threshold, is set to c∗max(D), where c is a constant factor which we set to 0.7. The goal is to isolate those pixels which are far from any non-salient pixels, as we can be confident that these are salient pixels. Denote this threshold Ms, shown in Figure 3.13. Figure 3.13: Threshold Ms, the white areas (values of 255) represent the areas of the saliency map we have high confidence are salient. 7. Pixels which are not known to be either salient or non-salient can be found by Mu = Ms −Mn. This subtraction is shown in Figure 3.14. 54 Figure 3.14: Mu, the result of subtracting the matrix of known background areas from the matrix of known foreground areas. the white areas (values of 255) represent the unknown areas between salient and non-salient (background) regions. 8. Each distinct (disconnected) region of salient pixels in Ms needs to be labeled from 2...n + 1, where n is the number of distinct regions. The background, or non-salient-pixels, must be labeled as 1. This is accomplished by performing a connected component analysis on Ms with 8-connectivity, yielding Mlabeled, a matrix with consecutively labeled connected components. (This matrix is shown in Figure 3.14.) Label the unknown region with 0; this is the region in which watershed will draw a segmentation line to determine the final boundary around the salient regions. Specifically, ∀pij ∈Mu | p = 0,Ml[i, j] = 0. Figure 3.15: Mlabeled, where dark blue is known non-salient background, purple is unknown, and yellow, green and turquoise are each a specific known salient region. 9. Run the watershed algorithm on S, using Mlabeled as markers. The returned 55 matrix Mw will have labeled all pixels as non-salient (1) or as belonging to a salient region (2...n+1). The result is shown in Figure 3.16. Figure 3.16: Mw, the result of the watershed algorithm. The grey region is non-salient background, and each of the colored regions is a distinct salient region. 10. Calculate the bounding box around each salient region in Mw; these are the saliency map’s salient regions. The final bounded salient regions are show in Figure 3.17. Figure 3.17: The final salient bounding boxes. Quantifying Landmark Uniqueness The semantic uniqueness of a landmark is an important factor in its saliency [10]. Even for pipelines that leverage human-based saliency detection, and therefore do not componentize the saliency score, uniqueness is still used for tie breaking purposes. 56 We use the lexical description of a landmark to derive its uniqueness as compared to the rest of the candidate landmark set. Our approach is based on word embeddings, where a word is represented as a high-dimensional vector in vector space [20]. The value in such a framework stems from the Distributional Hypothesis, which contends that words which are semantically similar will be distributionally similar as well, appearing together in the same written contexts [26]. 
The goal in creating vectorizations of a set of words is to represent semantically similar words with similar, i.e. close, points in high-dimensional space. This approach allows us to determine the similarity of words by comparing the Euclidean distance between the points or the cosine similarity between the corresponding (normalized) vectors.

Word2Vec

Predictive modeling is a common method for generating the vector representations of a set of words, wherein a machine learning algorithm learns to accurately predict a word's context, or words that are likely to appear around it, given only the word [4]. One such model, Word2Vec, is trained to predict a nearby word given another word, effectively internalizing a representation of which words appear in the same contexts [39]. The algorithm uses a neural network with a single hidden layer of size equal to the desired dimensionality of the word embedding (often 300). Using a large corpus of text, and a selected vocabulary of important words therein, the network is trained to accurately predict the probability of each word in the vocabulary occurring within a small window of other words in the vocabulary, within the text of the corpus. In doing so, the algorithm generates a v x d weight matrix (from the hidden layer to the output layer) which acts as a function mapping a word to P, where v is the size of the vocabulary, d is the dimensionality of the embeddings and P is a vector of probabilities for each word in the vocabulary. After training, each row in this matrix represents the embedding for a word in the vocabulary. Intuitively, if two words are similar, they are likely to be surrounded by similar words, per the Distributional Hypothesis. Thus, they will have learned similar weights, so as to generate similar probability distributions over the vocabulary.

We use a pretrained word2vec model [22] with 300-dimensional word embeddings, trained on the Google News corpus and containing 300 million vocabulary words. We use cosine similarity as a measure of similarity between word embeddings, meaning that our similarity measure is bounded between [-1, 1], with 1 indicating complete similarity and -1 indicating complete lack of similarity. To find the similarity between two landmark description phrases, we compute a description vector, which is the sum of the vectors of each word in the description. We then calculate the similarity between the two description vectors. Given two candidate landmarks c_1 and c_2, the similarity between these landmarks is defined as

\mathrm{pairSimilarity}(c_1, c_2) = c\!\left( \sum_{w \in c_1.\mathrm{description}} \mathrm{embedding}(w),\ \sum_{w \in c_2.\mathrm{description}} \mathrm{embedding}(w) \right)    (3.7)

where embedding is the word2vec vector for the given word and c is the cosine similarity function of two vectors v_1 and v_2:

c(v_1, v_2) = \cos(\theta)    (3.8)
            = \frac{v_1 \cdot v_2}{\lVert v_1 \rVert\, \lVert v_2 \rVert}    (3.9)

where \theta is the angle between the two vectors. To find the similarity of a landmark c as compared to all other landmarks in a set of candidate landmarks C:

\mathrm{totalSimilarity}(c) = \sum_{k \in C} \mathrm{pairSimilarity}(k, c)    (3.10)

A brief sketch of this computation with a pretrained embedding model is given below, following the pipeline overview.

Pipeline Specifics

Machine-Machine

Input: X0 = (latitude, longitude, bearing)
Output: The most salient landmark, including a description for use in navigation instructions
Candidate selection: Description

Figure 3.18: The pipeline structure of the Machine-Machine pipeline.
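As referenced above, the description-uniqueness computation can be sketched with a pretrained embedding model. This is an illustration only, using the gensim library; the vector file name, the lowercasing and the out-of-vocabulary handling are assumptions rather than Torchbearer's exact implementation.

import numpy as np
from gensim.models import KeyedVectors

# Pretrained 300-dimensional Google News vectors (file name assumed).
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

def description_vector(description):
    """Sum the word vectors of each in-vocabulary word in a description."""
    words = [w for w in description.lower().split() if w in vectors]
    if not words:
        return np.zeros(vectors.vector_size)
    return np.sum([vectors[w] for w in words], axis=0)

def pair_similarity(desc_a, desc_b):
    """Cosine similarity of two description vectors (equations 3.7-3.9)."""
    a, b = description_vector(desc_a), description_vector(desc_b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def total_similarity(candidate, others):
    """Similarity of one candidate against all others (equation 3.10)."""
    return sum(pair_similarity(candidate, other) for other in others)

print(pair_similarity("Starbucks Coffee Shop", "red brick coffee house"))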
Step 1: Load Streetview Image Input: X1 = (latitude, longitude, bearing) 59 Output: Y1 = (latitude, longitude, bearing, [image urls]) This step consists of querying the Google Streetview API for street-level imagery at distances of 25, 50 and 100 feet from the given coordinate at an angle opposite of the bearing, as shown in Figure 3.4. (We refer to these distances relatively as “at”, “just before” and “before”.) We store returned images on Amazon Simple Storage Service (S3), and include the S3 URL of each in a tuple which is included in the output of this step. Step 2a: Computer Vision Saliency Detection Input: X2a = Y1 = (latitude, longitude, bearing, [image urls]) Output: Y2a = (latitude, longitude, bearing, [image urls], [saliency maps]) Implementing the machine approach methodology outlined in the Saliency section, this step uses the SalNet deep learning architecture to compute a saliency map for each image in the tuple of images in X2a. The processing of the images happens in parallel and consists of feeding the the street-level image through SalNet. For each image, this step yields a one-dimensional matrix of the same shape as the input image, with values ranging between 0 and 255, inclusive. We add each matrix to the output tuple provided to subsequent pipeline steps. Step 2b: Computer Vision Landmark Search Input: X2b = Y1 = (latitude, longitude, bearing, [image urls]) Output: Y2b = (latitude, longitude, bearing, [image urls], [candidate landmarks]) This step uses the Faster RCNN-based object recognition algorithm, described in the Saliency section, to detect candidate landmarks in each maneuver point image. The network has been trained to detect stop signs and stop lights; it returns, for each object it detects, a tuple consisting of the coordinates of the objects bounding box within the image, a confidence score between 0 and 1 and a description (label) for the 60 object. We discard any objects with a confidence score less than 0.8 to avoid false detections, based on the notion that it is better from a usability standpoint to not provide a landmark description in an instruction than it is to provide a description of a nonexistent landmark. The remaining objects are converted into candidate landmark tuples, with a semantic saliency score of 1.0. (We assume that all users are fully aware of what a stop sign or stoplight looks like, thus no other landmark can be more semantically salient than a landmark detected by this step.) These landmarks are included in the output of this step. Step 2c: Data-driven Landmark Search Input: X2c = Y1 = (latitude, longitude, bearing, [imageurls]) Output: Y2c = (latitude, longitude, bearing, [image urls], [candidate landmarks], [saliency maps]) This step uses FourSquare, described in Section 3, to find candidate landmarks by searching for venues within a 100-foot radius of the maneuver point, as detailed in the Saliency section. Candidate landmarks are included in the output tuple. Step 3: Visual Saliency Scoring Input: X3 = Y2a ∪ Y2b ∪ Y2c = (latitude, longitude, bearing, [image urls], [saliency maps], [candidate landmarks]) Output: Y3 = (latitude, longitude, bearing, [image urls], [candidate landmarks]) This step assigns a quantitative score to each candidate landmark to designate its visual saliency in the context of the maneuver point image. The computer- vision based saliency detection approach (Step 2a) is not landmark-aware; that is, it determines relative saliency at the pixel-level. 
This step aggregates these pixel-level values into a score for the entire landmark. Given the bounding box coordinates x1, x2, y1, y2 of a candidate landmark and the saliency map S of the maneuver point, the visual saliency score of that candidate is calculated as

\[
\mathrm{score} = \frac{\sum_{i=x_1}^{x_2} \sum_{j=y_1}^{y_2} S_{ij}}{\sum_{i,j} S_{ij}} \qquad (3.11)
\]

That is, the visual saliency score is the sum of the submatrix contained within the bounding box divided by the sum of the entire saliency map. This gives two desirable properties: first, the larger a landmark is, the higher its score. Second, the more high-saliency pixels contained within a landmark bounding box, the higher its score.

While candidate landmarks detected by the object detection (Step 2b) include bounding boxes, and can therefore be correlated directly with a region in the saliency map, those returned by the data-driven approach (Step 2c) do not. For these candidates, only the relative bearing between the maneuver point and landmark is known. In order to estimate which rectangular region of the saliency map corresponds to these landmarks, we must first locate salient regions within the saliency map and then determine if one of those regions lies on the given bearing. To locate salient regions, we use the watershed-based approach described previously. This yields a set of bounding boxes, each containing a salient region within the saliency map. To determine if one of these salient regions represents our candidate landmark, we consider two points about the street-level image from which the saliency map was created: first, the pitch of the image is zero degrees, meaning that the horizon line, where a venue would be, is roughly in the vertical center of the image. Second, the field of view of the image is 90 degrees, and is not distorted or warped. Given the relative bearing between the maneuver point and the landmark, we check if there exists a salient region at this bearing in the vertical middle of the saliency map. (See Figure 3.19.) If there is, we use this region as the bounding box for the candidate, and calculate the visual saliency score as above. If not, we assign a score of 0, as we have no evidence as to the visual saliency of this landmark.

Figure 3.19: Left: a landmark saliency map, with bounding boxes of salient regions. The intersection between the relative bearing parallel and the vertical middle is within a salient region (shaded), and identifies the landmark within the saliency matrix. Right: A bird's eye view of an intersection. Our street-level images are a rectilinear projection of a spherical image covering a 90 degree field of view.

Step 4: Select Most Salient Landmark
Input: X4 = Y3 = (latitude, longitude, bearing, [image urls], [saliency maps], [candidate landmarks])
Output: Y4 = (latitude, longitude, bearing, best landmark)

At this point in the pipeline, we have a set of candidate landmarks, each complete with both a visual and a semantic saliency score. In order to determine the best, most salient landmark, we must first determine the uniqueness saliency score for each candidate, calculated via the method described in Section 3. Next, we normalize each of the three saliency scores to a value between 0 and 1. Given a set of candidate landmarks C, the normalized score for a given saliency component (visual, semantic or structural) for a given landmark c can be found by

\[
\mathrm{score}_{\mathrm{component}} = \frac{c_{\mathrm{score}}}{\max_{i \in C}(i_{\mathrm{score}})} \qquad (3.12)
\]

where i_score is the score for landmark i for the given component.
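As an illustration of the scoring just described (not the Torchbearer worker code), the sketch below assumes NumPy, treats bounding boxes as hypothetical (x1, y1, x2, y2) pixel tuples, and computes the bounding-box visual saliency score of Equation 3.11 and the per-component normalization of Equation 3.12.

```python
import numpy as np

def visual_saliency_score(saliency_map, bbox):
    """Eq. 3.11: saliency inside the bounding box divided by total saliency.

    saliency_map -- 2D array of per-pixel saliency values (0-255)
    bbox         -- (x1, y1, x2, y2) pixel coordinates (end-exclusive in this sketch)
    """
    x1, y1, x2, y2 = bbox
    region = saliency_map[y1:y2, x1:x2]      # rows index y, columns index x
    return float(region.sum()) / float(saliency_map.sum())

def normalize_component(scores):
    """Eq. 3.12: scale one saliency component to [0, 1] across all candidates."""
    top = max(scores)
    return [s / top if top else 0.0 for s in scores]
```

Dividing by the per-component maximum puts the visual, semantic and uniqueness scores on comparable scales before they are summed into the total score below.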
The total saliency score for a candidate is then the sum of the three normalized scores:

\[
S = S_v + S_s + S_u \qquad (3.13)
\]

where Sv is the visual saliency score, Ss the semantic saliency score, and Su the uniqueness score. The candidate landmark with the highest summed score is the best, most salient landmark, and is the output of this step. The description of this landmark will be included in navigation instructions spoken to the user.

Step 5: Cleanup
Input: X5 = Y4 = (latitude, longitude, bearing, best landmark)
Output: Y5 = (best landmark)

This final step consists of system cleanup tasks. All intermediate images (namely, street-level imagery) stored on S3 are removed. The best landmark is stored in a database, associated with the maneuver point and pipeline identifier for future retrieval.

Human-Machine

Input: X0 = (latitude, longitude, bearing)
Output: The most salient landmark, including a description for use in navigation instructions
Candidate selection: Description

Figure 3.20: The pipeline structure of the Human-Machine pipeline. Steps: (1) Load Streetview Image; (2a) Human Saliency Detection (MTurk); (2b) Computer Vision Landmark Search; (2c) Data-Driven Landmark Search; (3) Human Saliency Scoring; (4) Select Best Landmark; (5) Cleanup.

Step 1: Load Streetview Image
Input: X1 = X0 = (latitude, longitude, bearing)
Output: Y1 = (latitude, longitude, bearing, [image urls])

This step is implemented in the same manner as Step 1 of the Machine-Machine pipeline.

Step 2a: Human Saliency Detection
Input: X2a = Y1 = (latitude, longitude, bearing, [image urls])
Output: Y2a = (latitude, longitude, bearing, [image urls], [saliency maps])

This step generates saliency matrices for the maneuver point, one for each of the street-level images found in Step 1. This implementation uses the crowdsourcing approach described in Section 3, and leverages human intuition about what constitutes a good landmark. The generated saliency map is therefore not specific to a single component of landmark saliency (visual, semantic or structural) but comprises the entire saliency metric. The output of this step is a matrix of the same dimensions as the input maneuver point image; each element is a value between 0 and 255 indicating the relative saliency at that point in the image.

Step 2b: Computer Vision Landmark Search
Input: X2b = Y1 = (latitude, longitude, bearing, [image urls])
Output: Y2b = (latitude, longitude, bearing, [image urls], [candidate landmarks])

This step is implemented in the same manner as Step 2b of the Machine-Machine pipeline.

Step 2c: Data-driven Landmark Search
Input: X2c = Y1 = (latitude, longitude, bearing, [image urls])
Output: Y2c = (latitude, longitude, bearing, [image urls], [candidate landmarks], [saliency maps])

This step is implemented in the same manner as Step 2c of the Machine-Machine pipeline, except that the semantic saliency gleaned from the geosocial database is not used. (In this pipeline, the human-based saliency detection serves as the entire basis of saliency.) Rather, this step is used to generate candidate landmarks, which are correlated with the human-created saliency map in Step 3.

Step 3: Human Saliency Scoring
Input: X3 = Y2a ∪ Y2b ∪ Y2c = (latitude, longitude, bearing, [image urls], [saliency maps], [candidate landmarks])
Output: Y3 = (latitude, longitude, bearing, [image urls], [candidate landmarks])

This step is implemented in the same manner as Step 3 of the Machine-Machine pipeline.
Step 4: Select Most Salient Landmark
Input: X4 = Y3 = (latitude, longitude, bearing, [image urls], [saliency maps], [candidate landmarks])
Output: Y4 = (latitude, longitude, bearing, best landmark)

The candidate landmark with the highest human saliency score is the best, most salient landmark, and is the output of this step. If a tie exists, uniqueness, calculated as described in Section 3, is used as a tie-breaker. The description of this landmark will be included in navigation instructions spoken to the user.

Step 5: Cleanup
Input: X5 = Y4 = (latitude, longitude, bearing, best landmark)
Output: Y5 = (best landmark)

This step is implemented in the same manner as Step 5 of the Machine-Machine pipeline.

Machine-Human

Input: X0 = (latitude, longitude, bearing)
Output: The most salient landmark, including a description for use in navigation instructions
Candidate selection: Saliency

Figure 3.21: The pipeline structure of the Machine-Human pipeline. Steps: (1) Load Streetview Image; (2) Computer Vision Saliency Detection; (3) Saliency Map Landmark Search; (4a) Mark Landmarks; (4b) Visual Saliency Scoring; (5) Human Landmark Description (MTurk), for every landmark in the candidate set; (6) Select Best Landmark; (7) Cleanup.

Step 1: Load Streetview Image
Input: X1 = X0 = (latitude, longitude, bearing)
Output: Y1 = (latitude, longitude, bearing, [image urls])

This step is implemented in the same manner as Step 1 of the Machine-Machine pipeline.

Step 2: Computer Vision Saliency Detection
Input: X2 = Y1 = (latitude, longitude, bearing, [image urls])
Output: Y2 = (latitude, longitude, bearing, [image urls], [saliency maps])

This step is implemented in the same manner as Step 2a of the Machine-Machine pipeline.

Step 3: Find candidate landmarks within saliency map
Input: X3 = Y2 = (latitude, longitude, bearing, [image urls], [saliency maps])
Output: Y3 = (latitude, longitude, bearing, [image urls], [saliency maps], [candidate landmarks])

Using the watershed algorithm described in Section 3, we search the machine-generated saliency maps from Step 2 for salient regions, which compose the candidate landmark set.

Step 4a: Create annotated maneuver point images
Input: X4a = Y3 = (latitude, longitude, bearing, [image urls], [saliency maps], [candidate landmarks])
Output: Y4a = (latitude, longitude, bearing, [image urls], [saliency maps], [candidate landmarks], [annotated image urls])

In order for human workers to provide written descriptions for candidate landmarks, they need to see an image of the maneuver point with the candidate landmark outlined. We choose to show workers an annotated image of the entire maneuver point, as opposed to a cropped image containing only the candidate landmark, so that workers can incorporate context into their descriptions. For example, we have observed descriptions which incorporate the landmark's surroundings, such as "one story blue house next to the oak tree" and "stop sign near the crosswalk". For each candidate c in the set of candidate landmarks C, we generate an image which contains a 3-pixel thick red border drawn around the bounding box of c. We store these images on S3, and include the relevant URLs in the output of this step.
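A minimal sketch of this annotation step, assuming the Pillow imaging library and hypothetical file names and bounding-box tuples; uploading the result to S3 is omitted.

```python
from PIL import Image, ImageDraw

def annotate_candidate(image_path, bbox, out_path):
    """Draw a 3-pixel-thick red border around a candidate landmark's bounding box.

    bbox -- (x1, y1, x2, y2) pixel coordinates of the candidate landmark
    """
    image = Image.open(image_path).convert("RGB")
    ImageDraw.Draw(image).rectangle(bbox, outline=(255, 0, 0), width=3)
    image.save(out_path)
    return out_path

# Hypothetical usage: one annotated image per candidate landmark.
# for i, candidate in enumerate(candidates):
#     annotate_candidate("maneuver_point.jpg", candidate["bbox"], f"annotated_{i}.jpg")
```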
Step 4b: Visual Saliency Scoring
Input: X4b = Y4a = (latitude, longitude, bearing, [image urls], [saliency maps], [candidate landmarks], [annotated image urls])
Output: Y4b = (latitude, longitude, bearing, [image urls], [saliency maps], [candidate landmarks], [annotated image urls])

This step is implemented in the same manner as Step 3 of the Machine-Machine pipeline, except that the landmark search (watershed) component is not needed, as all candidate landmarks include bounding boxes.

Step 5: Human-based Landmark Description
Input: X5 = Y4a ∪ Y4b = (latitude, longitude, bearing, [image urls], [saliency maps], [candidate landmarks], [annotated image urls])
Output: Y5 = (latitude, longitude, bearing, [image urls], [saliency maps], [candidate landmarks], [annotated image urls])

For each landmark c in the set of candidate landmarks C, we utilize the human-based description method described in Section 3 to obtain a lexical description of c. These descriptions are included in the given candidate landmark tuple in the output of this step. This step does not complete until all candidate landmarks have been processed through MTurk. Note that it is possible for the description of a candidate landmark to fail, if workers are unable to agree upon the accuracy of a description within three attempts, or if workers agree that the landmark is invalid due to being temporary or irrelevant. (This process of description and verification is described in Section 3.) If description fails for a candidate, it is removed from the candidate set.

Step 6: Select Most Salient Landmark
Input: X6 = Y5 = (latitude, longitude, bearing, [image urls], [saliency maps], [candidate landmarks], [annotated image urls])
Output: Y6 = (latitude, longitude, bearing, best landmark)

This step is implemented in the same manner as Step 4 of the Machine-Machine pipeline.

Step 7: Cleanup
Input: X7 = Y6 = (latitude, longitude, bearing, best landmark)
Output: Y7 = (best landmark)

This step is implemented in the same manner as Step 5 of the Machine-Machine pipeline.

Human-Human

Input: X0 = (latitude, longitude, bearing)
Output: The most salient landmark, including a description for use in navigation instructions
Candidate selection: Description

Figure 3.22: The pipeline structure of the Human-Human pipeline. Steps: (1) Load Streetview Image; (2) Human Saliency Detection (MTurk); (3) Saliency Map Landmark Search; (4a) Mark Landmarks; (4b) Visual Saliency Scoring; (5) Human Landmark Description (MTurk), for every landmark in the candidate set; (6) Select Best Landmark; (7) Cleanup.

Step 1: Load Streetview Image
Input: X1 = X0 = (latitude, longitude, bearing)
Output: Y1 = (latitude, longitude, bearing, [image urls])

This step is implemented in the same manner as Step 1 of the Machine-Machine pipeline.

Step 2: Human Saliency Detection
Input: X2 = Y1 = (latitude, longitude, bearing, [image urls])
Output: Y2 = (latitude, longitude, bearing, [image urls], [saliency maps])

This step is implemented in the same manner as Step 2a of the Human-Machine pipeline.

Step 3: Find candidate landmarks within saliency map
Input: X3 = Y2 = (latitude, longitude, bearing, [image urls], [saliency maps])
Output: Y3 = (latitude, longitude, bearing, [image urls], [saliency maps], [candidate landmarks])

Using the watershed algorithm described in Section 3, we search the saliency maps generated in Step 2 for salient regions, which compose the candidate landmark set.
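The saliency map landmark search in Step 3 relies on the watershed-based region extraction described earlier in this chapter. As a rough stand-in for that procedure (not the actual implementation), the sketch below, assuming NumPy and scikit-image, thresholds a saliency map with Otsu's method, labels the connected regions, and returns their bounding boxes; the true watershed approach separates touching regions more carefully.

```python
import numpy as np
from skimage.filters import threshold_otsu
from skimage.measure import label, regionprops

def salient_region_boxes(saliency_map, min_area=500):
    """Approximate the salient-region search: return (x1, y1, x2, y2) boxes.

    saliency_map -- 2D array of per-pixel saliency values (0-255)
    min_area     -- ignore small, noisy regions below this pixel count
    """
    mask = saliency_map > threshold_otsu(saliency_map)
    boxes = []
    for region in regionprops(label(mask)):
        if region.area < min_area:
            continue
        min_row, min_col, max_row, max_col = region.bbox
        boxes.append((min_col, min_row, max_col, max_row))  # reorder to x/y
    return boxes
```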
Step 4a: Create annotated maneuver point images
Input: X4a = Y3 = (latitude, longitude, bearing, [image urls], [saliency maps], [candidate landmarks])
Output: Y4a = (latitude, longitude, bearing, [image urls], [saliency maps], [candidate landmarks], [annotated image urls])

This step is implemented in the same manner as Step 4a of the Machine-Human pipeline.

Step 4b: Visual Saliency Scoring
Input: X4b = Y3 = (latitude, longitude, bearing, [image urls], [saliency maps], [candidate landmarks])
Output: Y4b = (latitude, longitude, bearing, [image urls], [candidate landmarks])

This step is implemented in the same manner as Step 3 of the Machine-Machine pipeline, except that the landmark search (watershed) component is not needed, as all candidate landmarks include bounding boxes.

Step 5: Human-based Landmark Description
Input: X5 = Y4a ∪ Y4b = (latitude, longitude, bearing, [image urls], [saliency maps], [candidate landmarks], [annotated image urls])
Output: Y5 = (latitude, longitude, bearing, [image urls], [saliency maps], [candidate landmarks], [annotated image urls])

This step is implemented in the same manner as Step 5 of the Machine-Human pipeline.

Step 6: Select Most Salient Landmark
Input: X6 = Y5 = (latitude, longitude, bearing, [image urls], [saliency maps], [candidate landmarks], [annotated image urls])
Output: Y6 = (latitude, longitude, bearing, best landmark)

This step is implemented in the same manner as Step 4 of the Human-Machine pipeline.

Step 7: Cleanup
Input: X7 = Y6 = (latitude, longitude, bearing, best landmark)
Output: Y7 = (best landmark)

This step is implemented in the same manner as Step 5 of the Machine-Machine pipeline.

RESULTS

Our aim with these analyses is to understand the effectiveness of human versus machine methodologies for landmark selection and to determine the efficacy of the overall system for improving drivers' cognitive load and performance during navigation. We analyze the Torchbearer system on two fronts: first, we examine the differences between pipelines on a performance and efficiency level, comparing execution cost, execution time and similarity between results. Second, we perform a field study with real drivers using the Torchbearer system to navigate along a route unknown to them, comparing cognitive load, driving performance and perceived task difficulty between all four pipelines and a control. We leverage ANOVA-based analyses throughout this section to determine if pipeline has a significant effect on the variable of interest. Note that in all statistical analyses used throughout this section, the requirement of a normal distribution is tested by visual analysis of the Q-Q plot. Homogeneity of variance is tested via Levene's Test at a significance level of 0.05. If either of these assumptions fails, we utilize the Kruskal-Wallis analysis in place of ANOVA.

Pipeline Comparison

To evaluate the differences in efficiency, cost and solution overlap, we created a test set of 400 maneuver points in San Francisco, California, using an existing dataset of geographic coordinates for all intersections in the city [41]. Maneuver points were created at random by selecting an intersection and a route leading into it; the bearing for the maneuver point was computed by measuring the angle between the two points closest to the intersection in a polyline representation of the route (see Figure 4.1).
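For illustration (this is not necessarily the exact computation used to build the test set), the standard forward-azimuth formula below gives the bearing, in degrees clockwise from due north, from one latitude/longitude point to another; applying it to the two polyline points closest to the intersection yields a maneuver point bearing.

```python
import math

def initial_bearing(lat1, lon1, lat2, lon2):
    """Bearing from point 1 to point 2, in degrees clockwise from due north."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    x = math.sin(dlon) * math.cos(phi2)
    y = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(dlon)
    return (math.degrees(math.atan2(x, y)) + 360.0) % 360.0

# Hypothetical usage with the two route points nearest the intersection:
# bearing = initial_bearing(prev_lat, prev_lon, intersection_lat, intersection_lon)
```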
Each maneuver point was processed through each of the four Torchbearer pipelines, resulting in a balanced result set of 1,600 pipeline executions.

Figure 4.1: Left: The Google Streetview image of the intersection of Mission and Cesar Chavez in San Francisco, part of the SF test set. Right: A map view of this intersection. The grey line is a polyline representation of the selected route leading into the intersection. To find the bearing value for the Torchbearer maneuver point, we calculate the angle with respect to due north between the two points outlined in black.

Marginal Cost

Torchbearer pipelines incur monetary cost when they use MTurk to gather human input. In an effort to compare the drawbacks and benefits of each pipeline, it is important to have an understanding of the differences in expenditure. As a pipeline executes, task workers record in the Torchbearer database the cost incurred for the processing of each maneuver point; anytime a HIT is submitted to MTurk, the cost is increased by nc, where n is the number of workers who will complete the HIT and c is the amount to be paid to each worker. For this experiment we paid workers $0.05 for a saliency selection HIT, $0.05 for a landmark description HIT and $0.03 for a landmark verification HIT. These amounts were selected based on observational analysis of Mechanical Turk pricing for similar object-detection-related tasks; we aimed to offer above-average pay for each type of HIT to avoid low pay as a confounding variable in work quality. The marginal pipeline costs (the cost of processing an additional maneuver point) are shown in Figure 4.2.

Figure 4.2: Marginal cost by pipeline (cost in USD).

Based on the results of a one-way Analysis of Variance (ANOVA) test, we find that the mean marginal cost differs significantly by pipeline (F(3, 396) = 154.59, p < 0.001). A post-hoc analysis using Tukey Honest Significant Differences (HSD) reveals that, at p < 0.05, the marginal cost differs significantly between all pipelines except for the Machine-Machine and Human-Machine pipelines.

The marginal cost of the Machine-Machine pipeline is extremely low, with a mean cost per maneuver point of $0.00004. The Machine-Machine pipeline requires no human input, therefore the cost is entirely a result of computational resource usage: this pipeline takes on average six seconds to execute from end to end, as will be seen in the following section, and the price of the AWS node upon which Torchbearer runs is $0.02 per hour. The results for the Human-Machine pipeline are similarly deterministic: this pipeline requests a single saliency detection HIT with a fixed number of worker assignments (5 in our experiment). The Machine-Human pipeline exhibits not only the highest average cost, but also the highest variance. Both of these traits are due to the description verification component, which has the potential to repeat the entire description step, introducing non-determinism and increasing the cost of an execution significantly. This non-determinism due to verification is also a likely explanation of the variance observed in the Human-Machine pipeline. However, that variance is less than that of the Machine-Human pipeline, which we attribute to humans' apparent ability to select more meaningful landmarks during the saliency step than the SalNet-based saliency approach.
In other words, it is possible that the machine approach to saliency sometimes selects salient regions which do not contain an object that can be easily described, creating contention among the describing workers and the verification workers. This leads to more "loops" of the description step when workers do not agree, and therefore a higher execution cost.

The relatively high costs of the human-based pipelines may not render them impractical, however. Since the street-level imagery used by Torchbearer is not (currently) realtime, but is instead updated on a scale of years, a given maneuver point only needs to be processed by Torchbearer relatively infrequently. Thus, if a particular pipeline proves to be expensive, but highly useful for drivers, it might be worth bearing that cost on an n-year cycle. Of course, if realtime imagery is used, the Machine-Machine pipeline may be the only economically viable option. Torchbearer is able to amortize costs by storing landmark descriptions for every maneuver point it processes. Thus, only the first request for a given maneuver point/pipeline combination will require processing by the pipeline. Costs are amortized over the number of requests received between updates of the street-level imagery source.

Execution Time

Along with monetary cost, execution time is a cost of using a given pipeline. Using our San Francisco test set, we measure both end-to-end processing time and task-wise execution time.

End-to-End Execution Time

We record the start and end timestamps for each execution; the differences between these timestamps are shown in Figure 4.3.

Figure 4.3: End-to-end execution time by pipeline.

Based on the results of an ANOVA test, we find that the mean end-to-end execution time differs significantly by pipeline (F(3, 396) = 117.24, p < 0.001). A post-hoc analysis using Tukey HSD reveals that, at p < 0.05, the end-to-end execution time differs significantly between all pipelines. The Machine-Machine pipeline exhibits the lowest mean end-to-end execution time by an extreme margin, with very low variance. The pipelines which incorporate human input are, unsurprisingly, slower on the order of tens of minutes. These pipelines also exhibit significant variance, which is expected given the relative unpredictability of the human pipeline tasks. Likely for the same reasons we observe a higher marginal cost, we see a longer mean execution time for the Machine-Human pipeline than we do for the Human-Machine pipeline. The mean execution times for the Human-Human and Machine-Human pipelines are similar, again implying that the machine approach to saliency results in more looping, or contention, at the human description step. The Human-Human pipeline has the largest variance, due to its heavy reliance on human work, and also the highest mean time. Based on these results, it is likely that the only pipeline capable of executing at realtime speeds is the Machine-Machine pipeline. However, there are a couple of nuances to consider: first, Mechanical Turk has the potential to become faster as Torchbearer continues to build up a pool of workers. (The more workers, the more likely an already-qualified worker will be at the ready when a Torchbearer HIT is submitted.)
During the SF test set simulation, 47 workers completed saliency HITs, 37 completed description HITs and 64 completed verification HITs. Over time, as the reputation of Torchbearer as a fair, well-paying requester grew, and more workers completed the qualification process, more parallelization could occur at the human worker level. Second, as noted in regard to pipeline cost, the benefits of a realtime pipeline will not be realized unless the street-level imagery source is also realtime, which Google Streetview is not. Indeed, most landmarks are permanent fixtures in the environment, and thus a Torchbearer pipeline only needs to process a given maneuver point whenever the street-level imagery is updated. This means that processing could be batched: an entire city's landmarks could be reevaluated at one time, and the per-maneuver-point execution time of a pipeline is not relevant. Third, for long trips it may be that a 20 to 30 minute processing time is acceptable for some landmarks. If a route consists of an hour of freeway driving, followed by several maneuver points in the destination city, the latter maneuver points can be processed while the user is on the freeway.

Execution Time By Task

For each pipeline, we evaluate the mean time required to complete each task. This gives insight into any bottlenecks that might exist in a pipeline, as well as the effectiveness of any task parallelization that was implemented. Each plot below represents an "average timeline" of execution. The length of the horizontal bar shows the average execution time for the given task, laid out in the order of execution. The plot is arranged such that tasks which execute in parallel are shown with the same start time.

Machine-Machine

Figure 4.4 shows that the lion's share of processing time in the Machine-Machine pipeline is devoted to the computer vision saliency (SalNet) task. This is unsurprising, as SalNet is a computationally intensive algorithm, consisting of convolutional filters being applied across the street-level image many times. We also observe noticeable time reductions by parallelizing the saliency detection, computer vision search and data-driven search tasks.

Figure 4.4: Execution time by task (Machine-Machine pipeline); time in minutes.

Machine-Human

Figure 4.5 makes clear that the human landmark description task is the bottleneck in the Machine-Human pipeline, accounting for approximately 98% of total execution time. Parallelization does not provide significant benefits in terms of end-to-end execution time.

Figure 4.5: Execution time by task (Machine-Human pipeline); time in minutes.

Figure 4.6: Execution time by task (Human-Machine pipeline); time in minutes.

Human-Machine

Figure 4.6 shows that the Human-Machine pipeline suffers from a single bottleneck in the form of the human saliency task, which is to be expected as all other tasks required no human input.
While some tasks are parallelized, the effect of this on the overall execution time is negligible.

Human-Human

In Figure 4.7, it is clear that the two human-based tasks comprise the majority of execution time. The duration of the saliency task is somewhat longer than that of the description task, which we attribute to the number of workers required, as well as the difficulty of each task: the saliency task requires a sample of five workers, each of whom had to make a somewhat involved decision about where to draw a box. The description task, on the other hand, requires a single worker to write a description, and three more to simply approve of what she wrote. While the description task does have the potential to "loop" if the description is rejected, in the single-iteration case this task requires fewer workers, performing an easier task, than the saliency task does.

Figure 4.7: Execution time by task (Human-Human pipeline); time in minutes.

Selected Landmark Overlap

Every pipeline eventually selects a landmark, inclusive of a bounding box within the street-level image outlining its location. By comparing the intersection-over-union (IoU) between two landmark bounding boxes we can see to what degree the bounding boxes are selecting the same area. IoU is the ratio of the area overlapped by both bounding boxes to the area encompassed by both bounding boxes; thus an IoU of 1 signifies complete agreement, or overlap, and an IoU of 0 indicates no overlap. IoU is expressed as

\[
\mathrm{IoU} = \frac{\mathrm{area}(\mathrm{intersection}(b_1, b_2))}{\mathrm{area}(\mathrm{union}(b_1, b_2))} \qquad (4.1)
\]

where b1 and b2 are the bounding boxes of two selected landmarks. Figure 4.8 shows the intersection and the union of two hypothetical bounding boxes.

Figure 4.8: The intersection (right) and union (center) of a pair of hypothetical bounding boxes (left). The black area represents the area of the given metric.

For each maneuver point in the SF test set, we compute the IoU between the selected landmarks returned from each pipeline. Table 4.1 shows the mean IoU between each pair of pipelines across all maneuver points.

Table 4.1: Mean Intersection Over Union of Selected Landmark

                   Machine-Human   Human-Machine   Human-Human
Machine-Machine    0.35            0.65            0.08
Machine-Human                      0.07            0.09
Human-Machine                                      0.05

This is essentially a measure of how likely two pipelines were to select the same landmark, or, looked at another way, the agreement between two pipelines in terms of landmark saliency. The mean IoU between landmarks selected by the Machine-Machine and Human-Machine pipelines is the highest, at 0.65, which we largely attribute to the pipelines' identical method of selecting candidate landmarks: object detection via Faster-RCNN and FourSquare venue search. Interestingly, the methods used for determining saliency, and selecting the best landmark, vary: while Human-Machine considers only saliency as determined by human workers, Machine-Machine considers a componentized saliency score with input from SalNet and semantic saliency based on check-ins and ubiquity. This implies that, at least to some degree, humans agree with our componentized saliency method in regard to what makes the best landmark.
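A minimal sketch of Equation 4.1 for axis-aligned bounding boxes given as (x1, y1, x2, y2) tuples; it is an illustration rather than the evaluation code used here.

```python
def iou(b1, b2):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(b1[0], b2[0]), max(b1[1], b2[1])
    ix2, iy2 = min(b1[2], b2[2]), min(b1[3], b2[3])
    intersection = max(0, ix2 - ix1) * max(0, iy2 - iy1)  # zero when boxes do not overlap
    area1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
    area2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
    union = area1 + area2 - intersection
    return intersection / float(union) if union else 0.0
```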
A high IoU also exists between the Machine-Machine and Machine-Human pipelines, suggesting agreement between the salient regions generated by SalNet, which define the candidate landmark set for the Machine-Human pipeline, and the object detection algorithm and/or the FourSquare venue search, which together build the candidate set for the Machine-Machine pipeline. For the other pipeline combinations, the mean IoU is low enough that it is unlikely to represent more than random chance. However, even though different pipelines identify different landmarks, they could still provide utility to drivers. This is examined in the following section.

Field Experiments

To evaluate the efficacy of each of our approaches for reducing driver cognitive load and improving driving performance, we conduct an Institutional Review Board-approved instrumented-vehicle driving study (real drivers, real vehicles, real roads) in which subjects navigate along a route unknown to them using the Torchbearer system. (The Human Subjects Consent Form for these experiments can be found in Appendix B.) It must be noted up front that, due to constraints on time and resources, a full-scale human factors study is outside the scope of this work. While the experimental design we discuss could be applied to a larger sample and potentially yield significant results, here we use a sample size of five human subjects. Along with contributing an experimental design for future work, this small-scale study provides exploratory evidence as to the effectiveness of the Torchbearer system.

Experimental Design

We evaluate each of Torchbearer's four pipelines against a control pipeline which delivers instructions containing no landmarks. The control pipeline is comparable to a mainstream navigation application, such as Google Maps, which provides only street names and distances in its instructions. Using a within-subjects design, five subjects drove an identical route through downtown Bozeman, Montana, using only the Torchbearer app for navigation (shown in Figure 4.9). The route was selected due to its grid (city block) layout, offering many locations for turns and a wide variety of landmarks (residential, business, and street infrastructure). It allowed for incorporating a large number of maneuvers into the allocated 60-minute experiment time frame. This route was divided into five legs, with a different pipeline being used for navigation of each leg. After the completion of each leg, the subject was asked to complete the NASA-TLX survey, to measure perceived task load for that leg and pipeline. A sample of landmarks used for each pipeline and route leg can be seen in Appendix A. The subject was given no information about the route prior to the start of driving; the only information they were given throughout the drive was spoken by the Torchbearer app.

Figure 4.9: The route driven by subjects through Bozeman, Montana. Each color represents a different leg. Each leg is navigated using a different pipeline.

Subjects were all white; two were male and three were female. All indicated they had at least some familiarity with the area of Bozeman in which the test route was located. Subjects were not compensated. Our experiment has two sources of nuisance variability, or blocking factors: the route leg and the subject (driver). Each leg of the route is likely to have differences in road type, normal traffic levels and availability of good landmarks.
Subjects vary in their driving abilities, driving style (tendency to brake hard, turn quickly, etc.) and preexisting knowledge of the area in which the experiment is conducted, as well as in global factors such as the time of day, weather, or traffic levels at the time the subject completed the trials. All of these characteristics can have an undesired effect on the variable of interest. To control for these two blocking factors, we use a Latin squares design, which allows for controlling two sources of variation (subject and route leg) and isolates the treatment effect (pipeline). This is accomplished by requiring that each pipeline be analyzed on all route legs an equal number of times, and also that each subject be treated with each pipeline an equal number of times. A Latin square can be thought of as an n by n matrix, where rows represent a subject and columns represent a route leg, and n is equal to the number of pipelines, subjects and route legs (five). The standard Latin squares design does not control for the effects of treatment order (the carryover effect of subjects always being treated with pipeline x after pipeline y), so we use a counterbalanced Latin square, which carries the additional stipulation that each pipeline must be preceded by and followed by every other pipeline an equal number of times. That is, if py is preceded by px for one subject, py must be followed by px for exactly one subject. Because we have an odd number of treatments (four Torchbearer pipelines and one control pipeline), it is not possible to achieve the counterbalancing stipulation within an n by n Latin square. Instead, two n by n Latin squares, with the second being a vertical reflection of the first, are required. This results in a 2n by n matrix, still with n route legs and n pipelines, but now requiring 2n = 10 subjects. Because the scope of our study is limited to 5 subjects, we counterbalance the 5 by 5 square to the greatest extent possible, but still have some immediate orderings which do not have the reverse represented in the square. This is a weakness of our study, and an argument in favor of future work with a larger pool of subjects, but we argue it will not threaten validity to a greater extent than the small sample size. Our Latin square design is displayed in Table 4.2. Using the Latin square design, we arrive at the following statistical model:

\[
Y_{ijk} = \bar{Y} + P_i + R_j + S_k + e_{ijk} \qquad (4.2)
\]

where Ȳ is the grand mean, Pi is the pipeline (treatment) effect for a particular pipeline i, Rj is the route leg effect for a particular route leg j, Sk is the subject effect for a particular subject k, eijk is the error term and Yijk is an observation for a particular subject, route leg and pipeline.

Peripheral Detection Task

To measure the effect of pipelines' landmark descriptions on cognitive load, we use a peripheral detection task (PDT). This secondary task consists of subjects wearing a headset, which positions an LED light approximately 15 degrees to the left of the center of vision and 2 degrees above the horizon. This light blinks at a uniform random interval of between 3 and 5 seconds, for a duration of between 200 and 1000 milliseconds [35]. A button is attached to the subject's finger, which can be pressed against the steering wheel. The subject is asked to depress the button as quickly as possible whenever they see the light blink.
The average delay between light blink and button depression is recorded, along with a miss rate: if the subject fails to press the button within 2 seconds of a light blink, it counts as a miss.

Table 4.2: Counterbalanced Latin Squares Design

           Leg 1            Leg 2            Leg 3            Leg 4            Leg 5
Subject 1  No landmarks     Human-Human      Machine-Machine  Human-Machine    Machine-Human
Subject 2  Human-Human      Human-Machine    No landmarks     Machine-Human    Machine-Machine
Subject 3  Human-Machine    Machine-Human    Human-Human      Machine-Machine  No landmarks
Subject 4  Machine-Human    Machine-Machine  Human-Machine    No landmarks     Human-Human
Subject 5  Machine-Machine  No landmarks     Machine-Human    Human-Human      Human-Machine

Intuitively, the more cognitive effort the subject must expend on the primary task of navigation and vehicle operation, the less effort they can put towards the PDT. Thus, a more cognitively intensive primary task will result in a higher miss rate and a longer button press delay. The PDT must be evaluated on two levels: first, the miss rate, the probability of a subject never pressing the button within 2 seconds of the LED blinking; and second, the mean response time for non-missed blinks.

Using a Kruskal-Wallis evaluation based on the linear model in Equation 4.2, where Y is the PDT response time, we found no evidence supporting a difference in mean PDT response time between pipelines (F(4, 12) = 1.38, p = 0.29). We use Kruskal-Wallis in place of ANOVA because the normality assumption is violated (by visual analysis of the Q-Q plot). Figure 4.10 shows that differences in the mean are small relative to the large interquartile range. We also found no evidence of pipeline affecting PDT miss rate (F(4, 12) = 0.46, p = 0.76), using the same analysis as for response time. (The normality assumption was violated for this data as well.) Figure 4.11 shows the distribution of miss rate by pipeline.

Figure 4.10: PDT response time by pipeline (response time in ms).

Figure 4.11: PDT miss rate by pipeline.

Gravitational Force Events

We also monitor for erratic, harsh, potentially dangerous driving patterns by counting instances of high lateral (X) and longitudinal (Y) gravitational forces (G-forces). These G-force spikes, which we call excessive force events, can signify harsh braking, rapid acceleration or swerving. Specifically, we count the number of times during a route leg that the vehicle experienced a G-force of greater magnitude than the thresholds set forth in Table 4.3. G-forces are measured in the X and Y directions using a Freematics ONE vehicle data logger, which includes 3-axis acceleration data and is anchored to the vehicle frame via the vehicle's OBD-II port.

Table 4.3: Gravitational Force Event Thresholds (Naturalistic Teenage Driving Study [56])

Event Type          Axis   Threshold (G)
Harsh acceleration  Y      > 0.35
Hard braking        Y      < -0.45
Right swerve        X      > 0.05
Left swerve         X      < -0.05

Using the same analysis as for the PDT metrics, we found no evidence that pipeline affects the number of excessive force events occurring during a drive (F(4, 12) = 1.44, p = 0.28). See Figure 4.12.
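The blocked comparisons reported in this section can be sketched as follows. This is a hedged illustration, assuming statsmodels and SciPy and a hypothetical data frame df with one row per subject/leg observation (columns response, pipeline, leg, subject); it is not the analysis script used for the thesis, and the Kruskal-Wallis function shown compares pipeline groups directly rather than through the blocked model.

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy import stats

def blocked_anova(df):
    """Fit the Latin-square model of Eq. 4.2 and return its ANOVA table."""
    model = smf.ols("response ~ C(pipeline) + C(leg) + C(subject)", data=df).fit()
    return sm.stats.anova_lm(model, typ=2)

def kruskal_by_pipeline(df):
    """Nonparametric check used when the normality assumption is violated."""
    groups = [group["response"].values for _, group in df.groupby("pipeline")]
    return stats.kruskal(*groups)
```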
Surveys

Lastly, we survey subjects using the NASA-TLX survey [27] and our own Likert-scale survey to analyze perceived task difficulty, as well as perceived landmark goodness, navigation confidence, and navigation difficulty between pipelines. Both surveys were administered for each pipeline, immediately following the completion of each leg.

Figure 4.12: Gravitational force events by pipeline (event count).

The NASA-TLX survey consists of six sub-scales, which when combined aim to measure the total workload induced by the task, in this case navigating a route leg from start to finish using a particular pipeline for navigation. The scales are ordinal, with 20 levels ranging from very low to very high. (See Appendix C for a full copy of the NASA-TLX survey.) The sub-scales are mental demand, which measures the mental and perceptual acuity required to complete the task; physical demand, which gauges how strenuous the task was; temporal demand, which measures perceived time pressure or rush to complete the task; overall performance, which indicates the subject's opinion of how successful she was at completing the task; effort, a combined measure of mental and physical exertion; and frustration level, how annoyed and irritated the subject felt during the task [27]. It is very important to note that the Performance sub-scale considers level 0 to equate to total success and level 20 to total failure, the opposite of what one might expect. We evaluate each sub-scale independently, so that individual effects can be parsed out. Figure 4.13 shows the score distributions by pipeline, across each sub-scale. Because the scales are ordinal, we use the non-parametric Kruskal-Wallis analysis of variance to test for differences between pipelines. Table 4.4 lists the results across each sub-scale. We found no evidence to suggest that pipeline affects any of the NASA-TLX sub-scales.

Figure 4.13: NASA-TLX scores by sub-scale (Mental Demand, Physical Demand, Temporal Demand, Performance, Effort, Frustration).

Table 4.4: Kruskal-Wallis analysis of variance by pipeline for NASA-TLX survey

Sub-Scale               χ²             p
Mental Demand           χ²(4) = 2.53   0.64
Physical Demand         χ²(4) = 0.25   0.99
Temporal Demand         χ²(4) = 1.01   0.91
Perceived Performance   χ²(4) = 1.99   0.74
Effort                  χ²(4) = 0.92   0.92
Frustration             χ²(4) = 3.59   0.46

In addition to the NASA-TLX survey, after each route leg we administered a three-question Likert-scale survey addressing the quality of landmarks selected by the pipeline and the confidence subjects felt at navigational decision points. Each question is a statement with which the subject indicates their agreement, selecting from strongly agree, agree, not sure, disagree, and strongly disagree. The statements are as follows: "The landmarks I was told about helped me find turns"; "I knew what each landmark was going to look like when I heard its description";
and "I felt confident in where to perform each maneuver (turn) on the route." In Figure 4.14 we show the distribution of agreement with each statement by pipeline. Each color represents a different score, with the lightest equating to "strongly disagree" and the darkest to "strongly agree". The more width a color occupies, the more subjects gave that answer as their response. In Table 4.5 we analyze these results using a Kruskal-Wallis test to determine if there are significant differences in the distribution of answers between pipelines. For each survey question, we rejected the null hypothesis at a significance level of 0.10; there is evidence that pipeline does affect participant responses to each of these questions.

Table 4.5: Kruskal-Wallis analysis of variance by pipeline for landmark survey (* denotes significance at the 0.1 level)

Question                                           χ²              p
Landmarks helped find turns                        χ²(4) = 8.11    0.08*
Landmarks could be visualized from descriptions    χ²(4) = 10.64   0.03*
Subject was confident at decision points           χ²(4) = 13.83   0.01*

In order to determine which pipelines are significantly different from others in terms of their effect on each survey question, we use Dunn's test with a Bonferroni adjustment for post-hoc analysis. Dunn's is a pairwise comparison test which, for each combination of pipelines (a, b), tests the null hypothesis that the probability of drawing a larger value from a than from b is 0.5. The alternative hypothesis is that one group stochastically dominates the other: the chance of sampling a larger value from that group is greater than 0.5. The Bonferroni adjustment adjusts p-values to account for the multiple comparisons performed. For the "confidence at decision points" metric, we find that the Machine-Machine pipeline is significantly more likely to have a higher (more agreeable) score than the control pipeline (Z = -2.78, p = 0.05) as well as the Human-Machine pipeline (Z = -3.21, p = 0.01). For the "landmarks helped find turns" metric, we find no evidence of significant differences in distribution between any two specific pipelines. For the "landmark descriptions" metric, we find that the Machine-Machine pipeline is significantly more likely to have a higher (more agreeable) score than the Human-Machine pipeline (Z = -3.02, p = 0.02).

Figure 4.14: Landmark effectiveness survey scores ("Landmarks Helped With Navigation", "Able to Visualize Landmarks", "Confidence At Decision Points").

Discussion

Contrary to existing literature, we did not find the inclusion of landmark descriptions in navigation instructions to have a significant effect on drivers' cognitive load, erratic driving behavior, or perceived task load. We did find that instructions inclusive of landmark descriptions generated entirely by machine (the Machine-Machine pipeline) led to increased driver confidence at decision points as compared to navigation instructions which included only street names and distances (the control pipeline). This finding is in line with participants' subjective written comments, which indicated that including stop lights and stop signs in instructions was helpful.
Without a larger study, it is not possible to definitively say whether or not any of Torchbearer's pipelines were effective in terms of reducing cognitive load, harsh G-force events or perceived task load. While we found no evidence of such effects in our small field study, a larger study, preferably consisting of 30 participants, would offer more definitive insight.

Threats to Validity

As alluded to previously, the principal threat to validity is the extremely small sample size employed in our field experiments. However, even within this small-scale study there are potential biases: first, study participants had relatively high familiarity with the area of the route, given that all were residents of Bozeman, Montana. If landmark descriptions are more helpful in terms of cognitive load, erratic driving reduction or reduced task load in areas drivers are unfamiliar with, we would significantly underestimate the effect. Due to time constraints, and the use of an instrumented-vehicle experiment as opposed to a simulated one, each leg of the test route did not include a large number of maneuver points. Additionally, there was little variation in terms of road and environment type (surface versus highway, urban versus rural). A simulator-based experiment could allow for efficiently varying the driving environment. Subjects may have been predisposed to "like" the concept of including landmarks in navigation instructions, even if there was no observable effect in terms of workload or driving behavior. This could potentially bias the results of the "confidence at maneuver points" survey: if subjects felt like they were "supposed" to like landmarks, they may have been inclined to indicate an increased sense of confidence.

CONCLUSION

We proposed Torchbearer, a system that uses multiple pipeline-based approaches to automatically generate landmark descriptions for use in navigation instructions. Each pipeline leveraged a different combination of crowd-sourced human input and algorithmic approaches, including object detection, deep saliency detection and geosocial data mining. Together with a mobile application, each of these pipelines can be used to provide spoken turn-by-turn driving directions, inclusive of landmark descriptions. While the goal of Torchbearer was to reduce cognitive load, erratic driving behavior and perceived workload for drivers, our field study did not find evidence of any significant effect on these metrics between Torchbearer pipelines and a street-name-only control pipeline. We suspect that a larger study is needed, with better controls for prior route knowledge, to accurately determine if such an effect exists. The primary direction for future work centers on additional field evaluation, with more subjects, and a driving simulator to analyze different road types and environments. Additionally, experiments should be undertaken regarding landmark location, including the efficacy of including landmarks along the leg of a route to indicate to a driver that she is on the correct route. The object detection algorithm should be trained to recognize additional types of road infrastructure, such as crosswalks.

Future Work

A count-based approach should be investigated, where the edge between two maneuver points is analyzed for recurring salient landmarks of the same type, such as stop lights, and an instruction of the form "turn left onto <street> at the <nth> stop light" is presented.
The opposite approach could also be investigated, where the recurrence of a landmark type along an edge counts against its saliency, such that a landmark would only be chosen if the driver will not encounter another of that type until the maneuver point. Further insight into semantic saliency can be gained by additional mining of geosocial data: while we currently consider overall check-in data, a data source such as Facebook or Instagram could be used to determine the relevance of a landmark to an individual driver. For example, if a Walgreens pharmacy is a candidate landmark, its saliency score could account for the fact that the driver has visited a Walgreens store on n previous occasions. While Torchbearer currently uses fixed distances from maneuver points for locating landmarks, it is possible that speed of travel affects the optimal position of landmarks. Further study should be done to determine if increasing landmark distance from the intersection as speed increases is beneficial. In an attempt to improve our ability to analyze the effects of pipeline on cognitive load, an arithmetic task can be incorporated into the field experiment, where a subject is asked to solve math problems during the drive. This consumes more of the subject's available cognitive capacity, leaving less to put towards the PDT; this can help yield significant effects in PDT metrics by making differences between pipelines more apparent. Additionally, other metrics may provide insight into the potential benefits of Torchbearer, such as the total time taken to drive a leg, the amount of time the subject's eyes leave the road and the subject's willingness to pay for the technology provided by a given Torchbearer pipeline. Many of these additional areas of investigation will alter only the portion of the Torchbearer system which selects the best landmark; existing methods for finding and describing landmarks will be reused. In this way, Torchbearer has provided a robust base against which future landmark-based navigation systems can be built.

REFERENCES CITED

[1] Image segmentation with watershed algorithm, Oct 2017.
[2] Agarwal, P., Burgard, W., and Spinello, L. Metric localization using google street view. In Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on (2015), IEEE, pp. 3111–3118.
[3] Barnes, R., Lehman, C., and Mulla, D. Priority-flood: An optimal depression-filling and watershed-labeling algorithm for digital elevation models. Computers & Geosciences 62 (2014), 117–127.
[4] Baroni, M., Dinu, G., and Kruszewski, G. Don't count, predict! a systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (2014), vol. 1, pp. 238–247.
[5] Bayly, M., Young, K. L., and Regan, M. A. Sources of distraction inside the vehicle and their effects on driving performance. Driver distraction: Theory, effects, and mitigation (2008), 191.
[6] Beeharee, A. K., and Steed, A. A natural wayfinding exploiting photos in pedestrian navigation systems. In Proceedings of the 8th conference on Human-computer interaction with mobile devices and services (2006), ACM, pp. 81–88.
[7] Birrell, S. A., and Young, M. S. The impact of smart driving aids on driving performance and driver distraction. Transportation research part F: traffic psychology and behaviour 14, 6 (2011), 484–493.
[8] Burnett, G. "Turn right at the traffic lights": The requirement for landmarks in vehicle navigation systems. The Journal of Navigation 53, 3 (2000), 499–510.
[9] Burnett, G. E., and Joyner, S. An assessment of moving map and symbol-based route guidance systems. Ergonomics and safety of intelligent driver interfaces (1997), 115–137.
[10] Caduff, D., and Timpf, S. On the assessment of landmark salience for human navigation. Cognitive processing 9, 4 (2008), 249–267.
[11] Choudhary, P., and Velaga, N. R. Modelling driver distraction effects due to mobile phone use on reaction time. Transportation Research Part C: Emerging Technologies 77 (2017), 351–365.
[12] Microsoft Corporation. Visual object tagging tool (VoTT). https://github.com/Microsoft/VoTT, 2018.
[13] Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09 (2009).
[14] Edquist, J., Horberry, T., Hosking, S., and Johnston, I. Effects of advertising billboards during simulated driving. Applied ergonomics 42, 4 (2011), 619–626.
[15] Elias, B., and Brenner, C. Automatic generation and application of landmarks in navigation data sets. In Developments in spatial data handling. Springer, 2005, pp. 469–480.
[16] Facebook. React native.
[17] Fingas, J. Google maps uses landmarks to provide natural-sounding directions, Apr 2018.
[18] National Center for Statistics and Analysis. 2016 fatal motor vehicle crashes: Overview. Report DOT HS 812 456, National Highway Traffic Safety Administration, 2017.
[19] National Center for Statistics and Analysis. Distracted driving 2016. Report DOT HS 812 517, National Highway Traffic Safety Administration, 2017.
[20] Goldberg, Y., and Levy, O. word2vec explained: Deriving mikolov et al.'s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722 (2014).
[21] Golledge, R. G. Human wayfinding and cognitive maps. In The Colonization of Unfamiliar Landscapes. Routledge, 2003, pp. 49–54.
[22] Google. Google news corpus word2vec model.
[23] Harbulk, J. L., and Noy, I. Y. The impact of cognitive distraction on driver visual behavior and vehicle control. Report 13889 E, Ergonomics Division, Road Safety Directorate and Vehicle Regulation Directorate, 2002.
[24] Harel, J., Koch, C., and Perona, P. Graph-based visual saliency. In Advances in neural information processing systems (2007), pp. 545–552.
[25] Harms, L., and Patten, C. Peripheral detection as a measure of driver distraction. A study of memory-based versus system-based navigation in a built-up area. Transportation Research Part F: Traffic Psychology and Behaviour 6, 1 (2003), 23–36.
[26] Harris, Z. S. Distributional structure. Word 10, 2-3 (1954), 146–162.
[27] Hart, S. G., and Staveland, L. E. Development of NASA-TLX (Task Load Index): Results of empirical and theoretical research. In Advances in psychology, vol. 52. Elsevier, 1988, pp. 139–183.
[28] Hile, H., Vedantham, R., Cuellar, G., Liu, A., Gelfand, N., Grzeszczuk, R., and Borriello, G. Landmark-based pedestrian navigation from collections of geotagged photos. In Proceedings of the 7th international conference on mobile and ubiquitous multimedia (2008), ACM, pp. 145–152.
[29] Klippel, A., and Winter, S. Structural salience of landmarks for route directions. In International Conference on Spatial Information Theory (2005), Springer, pp. 347–362.
[30] Kulkarni, A. P., Can, M., and Hartmann, B. Turkomatic: automatic recursive task and workflow design for mechanical turk. In CHI'11 Extended Abstracts on Human Factors in Computing Systems (2011), ACM, pp. 2053–2058.
APPENDICES

APPENDIX A

FIELD EXPERIMENT ROUTE AND LANDMARKS

[Map of the test route showing the Start and Stop locations and maneuver points 1 through 5.]

Figure A.1: The test route driven by subjects in Bozeman, Montana. Subjects drive each leg using a different pipeline for navigation.
Table A.1: Leg 1: Instructions and Landmarks By Pipeline
(Landmark phrases are listed in column order: Machine-Machine, Machine-Human, Human-Machine, Human-Human.)

Turn left onto West College Street: before The Daily coffee shop / at the red and white yield sign / before The Daily coffee shop / at the roundabout
Turn right onto South 8th Avenue: before the Loaf n' Jug gas station / at the stop sign / at the stop sign
Continue left onto West Harrison Street: at the Hapner Hall college residence hall / at the stop sign / at the Jake Jabs College of Business and Entrepreneurship college hall / at the brick building with windows
Turn right onto South 7th Avenue: at the crosswalk sign / at the crosswalk
You have arrived at your destination: Hannon Dining hall college dining hall / crosswalk / brick building / brick building with windows

APPENDIX B

HUMAN SUBJECTS CONSENT FORM

SUBJECT CONSENT FORM FOR PARTICIPATION IN HUMAN RESEARCH AT MONTANA STATE UNIVERSITY

Using Landmarks To Provide Better Driving Directions

You are being asked to participate in a driving study. This study may help us obtain a better understanding of which types of navigation instructions are easiest for drivers to follow. You were identified as a potential subject because you 1) have a valid driver license, 2) have minimum motor vehicle liability insurance as required under Montana law and 3) have access to a vehicle.

Procedures Involved
Participation is voluntary and you can choose to not answer any questions you do not want to answer and/or you can stop at any time. If you are a student, participation or non-participation will not affect your grade or class standing. If you agree to participate you will be asked to:
● Drive your own vehicle on streets in Bozeman, following driving directions spoken to you by a computerized voice on a mobile phone. These directions will tell you where and when to turn, similar to how Google Maps or Apple Maps provides spoken driving directions. You will not know anything about the route before you begin driving.
● Wear a headset which has an LED light visible only in your peripheral vision, and a button on your finger which you can press against the steering wheel. The light will blink at random intervals as you drive. Each time the light blinks, you will be asked to press the button.
● Complete a short survey about your experience using the system.
● The entire study will take about 1 hour.

Risks
You will be subject to the normal risks involved in everyday driving. The task of watching for a blinking light and pressing a button might be distracting, which could cause you to pay less attention to operating the vehicle.

Benefits
The study is of no benefit to you.

Alternatives available
There is no effect on you if you decide not to participate in this study.

Source of Funding
N/A

Cost to Subject
None

Confidentiality
Your personal information will be kept private and secure. Any results which are published or made publicly available will not include any personally identifiable information. All data which can be linked to you will be stored on a password-protected computer or stored on an encrypted, restricted-access cloud storage provider. If you sustain any bodily harm during this study, you will be referred to a trained caregiver and emergency medical care will be summoned if needed. However, there is no compensation available from MSU for injury. There is no compensation available from MSU related to motor vehicle liability, or for damages to your vehicle or personal property.
Should you have any questions about this research, please contact Fred Vollmer at (360) 927-5124 or [fredric.vollmer@msu.montana.edu]. If you have additional questions about the rights of human subjects please contact the Chair of the Institutional Review Board, Mark Quinn, (406) 994-4707 [mquinn@montana.edu].

--------------------------------------------------------------------------------------------

AUTHORIZATION: I have read the above and understand the discomforts, inconvenience and risk of this study. I, ____________________________ (name of subject), agree to participate in this research. I understand that I may later refuse to participate and that I may withdraw from the study at any time. I have received a copy of this consent form for my own records.

Signed: ____________________
Investigator: ________________
Date: ______________________

APPENDIX C

NASA-TLX SURVEY

Name ____________  Task ____________  Date ____________

Mental Demand: How mentally demanding was the task? (Very Low to Very High)
Physical Demand: How physically demanding was the task? (Very Low to Very High)
Temporal Demand: How hurried or rushed was the pace of the task? (Very Low to Very High)
Performance: How successful were you in accomplishing what you were asked to do? (Perfect to Failure)
Effort: How hard did you have to work to accomplish your level of performance? (Very Low to Very High)
Frustration: How insecure, discouraged, irritated, stressed, and annoyed were you? (Very Low to Very High)

Figure 8.6 NASA Task Load Index: Hart and Staveland's NASA Task Load Index (TLX) method assesses work load on six 7-point scales. Increments of high, medium and low estimates for each point result in 21 gradations on the scales.

APPENDIX D

MECHANICAL TURK SAMPLE QUALIFICATION EXAM

Qualify for Image Landmark Selection HITs

Click the button below to go through the quick tutorial, then answer the questions below to instantly qualify. NOTE: There is only ONE correct answer for each question. Your score will be out of 100%. In order to pass the test, please do the quick tutorial!

1. Where would you draw the landmark selection box in the following image?
(The stop light)
(The car)
(The middle of the intersection)

2. Where would you draw the landmark selection box in the following image?
(The pedestrian)
(The car)
(The building)

3. Where would you draw the landmark selection box in the following image?
(The telephone pole)
(The house)
(The house)

4. Where would you draw the landmark selection box in the following image?
(The crosswalk sign)
(The garbage cans)
(The cars)

5. Where would you draw the landmark selection box in the following image?
(The restaurant sign)
(The tree)
(The cars)