RAMP – the Repository Analytics
and Metrics Portal
A prototype web service that accurately counts
item downloads from institutional repositories
Patrick OBrien and Kenning Arlitsch
Library, Montana State University, Bozeman, Montana, USA
Jeff Mixter
OCLC Online Computer Library Center Inc, Dublin, Ohio, USA
Jonathan Wheeler
Library, University of New Mexico, Albuquerque, New Mexico, USA, and
Leila Belle Sterman
Library, Montana State University, Bozeman, Montana, USA
Abstract
Purpose – The purpose of this paper is to present data that begin to detail the deficiencies of log file analytics
reporting methods that are commonly built into institutional repository (IR) platforms. The authors propose a
new method for collecting and reporting IR item download metrics. This paper introduces a web service
prototype that captures activity that current analytics methods are likely to either miss or over-report.
Design/methodology/approach – Data were extracted from DSpace Solr logs of an IR and were
cross-referenced with Google Analytics and Google Search Console data to directly compare Citable Content
Downloads recorded by each method.
Findings – This study provides evidence that log file analytics data appear to grossly over-report due to traffic
from robots that are difficult to identify and screen. The study also introduces a proof-of-concept prototype that
makes the research method easily accessible to IR managers who seek accurate counts of Citable Content Downloads.
Research limitations/implications – The method described in this paper does not account for direct
access to Citable Content Downloads that originate outside Google Search properties.
Originality/value – This paper proposes that IR managers adopt a new reporting framework that classifies
IR page views and download activity into three categories that communicate metrics about user activity
related to the research process. It also proposes that IR managers rely on a hybrid of existing Google Services
to improve reporting of Citable Content Downloads and offers a prototype web service where IR managers
can test results for their repositories.
Keywords Web analytics, Assessment, Google Analytics, Institutional repositories, Google Search Console,
Log file analytics
Paper type Research paper
Library Hi Tech, Vol. 35 No. 1, 2017, pp. 144-158
Emerald Publishing Limited, 0737-8831
DOI 10.1108/LHT-11-2016-0122
Received 4 November 2016; Revised 4 November 2016; Accepted 26 November 2016
© Patrick OBrien, Kenning Arlitsch, Jeff Mixter (OCLC), Jonathan Wheeler, Leila Sterman, Susan Borda. This article is published under the Creative Commons Attribution (CC BY 4.0) licence. Anyone may reproduce, distribute, translate and create derivative works of this article (for both commercial and non-commercial purposes), subject to full attribution to the original publication and authors. The full terms of this licence may be seen at: http://creativecommons.org/licences/by/4.0/legalcode
The authors wish to express their gratitude to the Institute of Museum and Library Services, which funded this research (Arlitsch et al., 2014). The authors would also like to thank Bruce Washburn, Consulting Software Engineer at OCLC Research, for his assistance in developing RAMP, and Susan Borda, Digital Technologies Librarian at Montana State University, for her help with data extraction.
Introduction
Institutional repositories (IR) disseminate scholarly papers in an open access environment and have become a core function of the modern research library. IR run on a variety of software platforms, with great diversity in installation, configuration, and support systems,
and many libraries attempt to track file downloads as a metric of IR success. This metric is
most meaningful if the measurements are consistent and accurate, and if they measure
human rather than robot traffic.
Prior research published by the authors demonstrated that “up to 58% of all human-
generated IR activity goes unreported by Google Analytics” (OBrien et al., 2016a, b),
a service that is used by approximately 80 percent of academic libraries[1]. Google Analytics
is a “page tagging” analytics service that relies on tracking code in HTML pages to
register visits. The tracking code is bypassed when users are sent directly to the
downloadable file (usually a PDF) in the IR, as is often the case when Google Scholar (GS) is
the user’s discovery service of choice. This results in significant undercounting of
high-value IR file downloads.
Conversely, over-counting as a result of robot traffic can occur when “log file analytics”
are utilized with open source IR platforms such as DSpace. Robots (also known as “bots”)
account for almost 50 percent of all internet traffic (Zeifman, 2015) and 85 percent of IR
downloads (Information Power Ltd, 2013). Libraries may not have the resources needed to
maintain appropriate filtering mechanisms for this overwhelming robot traffic, particularly
as the bots themselves are continually changing. Libraries dependent on commercial IR
platforms that utilize log file analytics must trust that the vendor has sufficient skill and
resources to detect and filter robot traffic.
Several projects are being developed in the European library community to set
standards and develop tools for IR statistics reporting. These include: PIRUS2, which is
now funded as the IRUS-UK service (Needham and Stone, 2012); a German project called
Open-Access-Statistics (Haeberli-Kaul et al., 2013); Statistics on the Usage of Repositories
(SURE) in the Netherlands (Verhaar, 2009); and OpenAIRE, a project of the European
Union in support of open access publications, including the development of usage
statistics (Rettberg and Schmidt, 2012). Some of these services provide COUNTER-
compliant statistics[2] processed through their infrastructure, and make data visible that
can be used for national benchmarking. As of this writing, no such service exists for North
American IR.
Prior research (OBrien et al., 2016b) identified and defined three types of IR downloads or
views: Ancillary Pages, Item Summary Pages, and Citable Content Downloads. Only the
last, Citable Content Downloads, can be considered an effective measure of IR impact since
they represent file downloads of the actual articles, presentations, etc., that comprise
the intellectual content of the IR. Ancillary Pages are defined as the HTML pages that users
click through to navigate to the content, and Item Summary Pages are also HTML pages
that contain metadata, abstracts, and the link that leads to the publication file. Statistics that show views of Ancillary Pages and Item Summary Pages are of limited value in the effort to demonstrate the impact of IR on the scholarly conversation.
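For illustration, such a classification could be implemented along the following lines; this is a minimal sketch, and the URL patterns are assumptions based on a typical DSpace installation rather than definitions from the original study:

# Sketch: classify an IR page view or download into the three reporting
# categories described above. The URL patterns are assumptions for a
# typical DSpace installation and may need adjusting for other platforms.
import re

def classify_ir_request(url):
    """Return 'Citable Content Download', 'Item Summary Page', or 'Ancillary Page'."""
    if re.search(r"/bitstream/handle/\d+/\d+/", url):
        # A direct file (e.g. PDF) request counts as a Citable Content Download.
        return "Citable Content Download"
    if re.search(r"/handle/\d+/\d+/?$", url):
        # An item record page with metadata, abstract, and the file link.
        return "Item Summary Page"
    # Everything else (home page, browse, search, community/collection pages).
    return "Ancillary Page"

if __name__ == "__main__":
    example = ("http://scholarworks.montana.edu/xmlui/bitstream/handle/1/9943/"
               "IR-Undercounting-preprint_2016-07-19.pdf?sequence=6&isAllowed=y")
    print(classify_ir_request(example))  # -> Citable Content Download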
This research compares the log files of a DSpace IR with data compiled from Google
Analytics and Google Search Console (GSC). The results show a large discrepancy between
these two methods. To address the significant inaccuracies of current reporting methods,
this paper introduces a prototype web service that we believe provides an accurate and
simple measure of Citable Content Downloads. We call this prototype web service the
Repository Analytics and Metrics Portal (RAMP). RAMP is easy to use and provides a
proof-of-concept solution to acquire data that are normally difficult to access and
cumbersome to maintain without considerable programming skills. Prior research
confirmed that the proposed method improved Citable Content Downloads reporting by
more than 800 percent for two of the four IR in the study. The other two IR study
participants were unaware that 100 percent of their Citable Content Downloads were
missing from their Google Analytics reporting. This “miss” amounted to 299,662 downloads
in a 134-day period (OBrien et al., 2016b).
Research statement
Reports of IR activity should reflect human use. The web analytics packages built into IR
software platforms rely on log file analysis and are heavily biased toward over-counting
item downloads. Reasons for this include extensive access to IR content by bots, and the
lack of tools necessary to identify and filter bot activity from usage reports. A prototype
web service called RAMP is presented as a partial solution to the difficulty of accurately
measuring IR use and impact. The RAMP prototype extracts relevant GSC data that can be
combined with Google Analytics to produce accurate counts of Citable Content Downloads.
Literature review
The amount of data that search engines must mine from the web is large and increasing, as
is the number of queries they try to resolve. The indexed web is currently estimated to
contain nearly five billion pages (de Kunder, 2016; van den Bosch et al., 2016), and Google
revealed in 2016 that it now handles “at least two trillion searches per year” (Sullivan, 2016).
While it is difficult to ascertain the total size of stored data on the web, total internet traffic is
an easier measure and is projected to surpass one zettabyte (1,000 exabytes) by the end of
2016 (Cisco, 2016).
Search engines could not exist without robots, also known as “crawlers” or “spiders,”
which constantly scour websites and retrieve information to add to search engine indices.
When indexing sites, “crawlers start with a list of seed URLs and branch out by extracting
URLs from the pages visited” (Zineddine, 2016). Content providers depend on these robots to
help gather website content into general search engines like Google and specialty search
engines like GS. Frequent crawler visits to IR are necessary for harvesting new content.
The activity of these indexing robots is considered beneficial, as it is “in part a positive
metric, an indication of site success” (Huntington et al., 2008). While sites must allow and
even encourage bots to crawl and index their pages, usage reporting of IR is only
meaningful if bot activity can be filtered.
Although robots are essential to the effective functioning of search engines, not all robot
traffic is well intentioned. Some robots scrape content to replicate it elsewhere, a relatively
benign, if potentially unethical activity. Some are designed as malware and have entirely
nefarious purposes, as evidenced by the October 2016 distributed denial of service (DDoS)
attack on Dyn, an internet infrastructure company that offers Domain Name System
services to resolve web addresses into IP addresses (Newman, 2016).
This research project is concerned with the sheer volume of robot traffic and the
difficulty in distinguishing it from human traffic. Huntington et al. (2008) estimated that
robots accounted for 40 percent of all web traffic, and by 2015 that number had risen to
50 percent (Zeifman, 2015). Worse, for those attempting to accurately report the use of IR,
nearly 85 percent of all IR downloads are estimated to be triggered by robots (Information
Power Ltd, 2013). One of the most difficult issues in dealing with robots is simply
detecting their presence as non-human action. “Every new crawler stays unknown for a
while and it is up to the detection techniques to ensure that such period is as short as
possible” (Lourenço and Belo, 2006). The published research that is most aligned with ours
was a two-year bot detection study that found 85 percent of the traffic to an IR was from
robots (Greene, 2016). This was the first benchmark study of its kind for Open Access IR,
but it focused exclusively on bot detection rather than achieving accurate counts of human
download activity.
Google’s share of the explicit search engine market has hovered around 65 percent
(comScore Inc., 2016) for at least the past five years (comScore Inc., 2011), and the company’s
specialized academic search engine, GS, has become very popular among those seeking
scholarly content. It is difficult to determine GS market share because “GS usage
information is not available to participating institutions or libraries” (Herrera, 2011).
However, its ease of use and broad coverage have contributed to its popularity (Nicholas et al.,
2009) and its growth. A University of Minnesota survey of 1,141 graduate students found
that over half used GS at least a few times each month (Cothran, 2011), and a San Francisco
State University study found that GS was the top SFX source for requests in 2011
(Wang and Howard, 2012). A report from JISC found that “30% of doctoral students used
Google or Google Scholar as their main source of research information they sought,”
but more specifically, the study found Google sources were “strongly favored above other
sources by arts and humanities, social science and engineering and computer science
students” (Carpenter, 2012). GS is also very popular among academic faculty and
professional scientists. A Nature survey of 3,000 scholars showed that over 60 percent of
scientists and engineers and over 70 percent of scholars in social sciences, arts, and
humanities use GS on a regular basis (Van Noorden, 2014).
Some research shows that GS may be less reliable, updated less frequently, and more reliant on web traffic-based ranking than other academic indexing services, such as Scopus,
Web of Science, and PubMed (Falagas et al., 2008). While some studies have noted a
100 percent retrieval of sources from replicated systematic reviews (Gehanno et al., 2013),
others find that although the coverage in GS is quite good (~95 percent retrieval of
biomedical research) it is less efficient than searching Web of Science, Scopus, or PsycINFO
(Giustini and Boulos, 2013). GS is still often recommended to medical professionals for
serendipitous discovery (Gehanno et al., 2013).
Questions have also persisted about the parent company’s commitment to the GS search
engine. Although GS’s Chief Engineer, Anurag Acharya, “declines to reveal usage figures,
he claims that the number of users is growing worldwide, particularly in China. And the
Google Scholar team is expanding, not contracting” (Bohannon, 2014). While GS has
detractors, its scope and use are growing.
Market conditions and incentive
The method and prototype service introduced in this paper leverage Google tools because
they are among the best available, due to market incentives. A strategic market force that
works in the library community’s favor is the fact that 90 percent of Google’s US$75 billion
2015 revenue was generated by its proprietary advertising network, which is based on a
Pay Per Click (PPC) advertising model (Alphabet Inc., 2015). PPC relies on advertisers
bidding to display ads based upon the keywords used in search queries. Every time an
advertisement is clicked, the advertiser typically pays Google between US$0.05 and
US$50.00, with the price determined by an efficient market facilitated by Google’s real-time
customer bidding system (Edelman et al., 2007). Google's customers would be unwilling to pay over US$900 per click for the most expensive PPC keywords (Lake, 2016) without some certainty that potential customers (not robots) are clicking their advertisements. As a result, Google is one of the best in the world at robot detection and screening. We can also assume that Google's incentives to invest in bot detection will remain strong if its advertising revenue continues to grow. In short, market conditions provide
Google with the incentive and resources to invest in bot detection that far exceeds the
abilities of the library community.
Tools used in this study
GSC is a free diagnostic tool that was previously known as Google Webmaster Tools; it was
rebranded in 2015 to broaden its appeal and use (Google Inc., 2015). GSC captures data
about queries related to websites, alerts webmasters to problems encountered by Google’s
crawlers, and provides a management interface for monitoring these sites. “Search Console
provides actionable reports, tools, and learning resources designed to get your content on
Google Search” (Google Inc., 2016a). The GSC provides various metrics for reporting,
including clicks, impressions, date, device, etc. GSC data can be accessed directly through
the “Search Console” dashboard in the Google Webmaster Tools site[3], queried through the
API using Python or Java, or through the Query Explorer[4]. GSC records item downloads
from all Google search properties.
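A minimal sketch of such an API query, assuming the google-api-python-client library and an already-authorized credentials object (the site URL, dates, and row limit are placeholders):

# Sketch: pull Search Console (GSC) click data for an IR through the
# Search Analytics API. Assumes google-api-python-client is installed and
# `credentials` is an authorized OAuth2 or service-account credentials object.
from googleapiclient.discovery import build

def fetch_gsc_clicks(credentials, site_url, start_date, end_date):
    service = build("webmasters", "v3", credentials=credentials)
    request_body = {
        "startDate": start_date,          # e.g. "2016-01-05"
        "endDate": end_date,              # e.g. "2016-05-17"
        "dimensions": ["date", "page", "device"],
        "rowLimit": 5000,
    }
    response = service.searchanalytics().query(
        siteUrl=site_url, body=request_body).execute()
    # Each row carries keys (date, page, device) plus clicks and impressions.
    return response.get("rows", [])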
Apache Solr is used in DSpace to index the item-level metadata as well as the usage log
data that contain page view and download statistics. Solr can be queried through the web UI
using a “localhost” setup or from the command line using “curl.” The Solr data contain
statistics on item page-level usage as well as file-level usage. Each record is a “click” as
defined by Google (Google Inc., 2016b). However, Solr data are raw: they provide no definitions concerning timeouts, double clicking, etc., and make no attempt to tally or screen activity (i.e. downloads) per bitstream per day. The metrics Solr
provides include URL, date, device, city, country, IP address, referrer, id, handle, etc.
(Diggory and Lawrence, 2016).
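A query against the statistics core might look like the following sketch; the host, port, core name, and exact field names are assumptions that vary by DSpace version and local configuration:

# Sketch: query the DSpace Solr statistics core for download ("view") events
# on ORIGINAL bundle bitstreams, excluding records Solr has flagged as bots.
# Host, port, core name, and field names are assumptions for a typical
# DSpace 5.x installation; adjust for the local setup.
import requests

SOLR_URL = "http://localhost:8080/solr/statistics/select"

params = {
    "q": "type:0 AND bundleName:ORIGINAL AND isBot:false AND statistics_type:view",
    "fq": "time:[2016-01-05T00:00:00Z TO 2016-05-17T23:59:59Z]",
    "fl": "time,ip,referrer,owningItem,id",
    "wt": "json",
    "rows": 10000,
    "start": 0,
}

response = requests.get(SOLR_URL, params=params, timeout=60)
docs = response.json()["response"]["docs"]
print(len(docs), "raw download events returned")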
Research method
The research team built two data sets from different sources to allow direct comparisons of
Citable Content Downloads events. The first source was a combination of Google Analytics
and GSC download events, and will henceforth be referred to as the GA/GSC data set.
The second data set was compiled from the Solr log files built into the DSpace platform.
The GA/GSC event data include page URLs containing bitstreams that could be parsed to
create an index of Citable Content Downloads for each item based on its DSpace handle.
For example, the URL below has a handle of “1/9943” and is a good representation of raw
data compiled from GA/GSC events (http://scholarworks.montana.edu/xmlui/bitstream/
handle/1/9943/IR-Undercounting-preprint_2016-07-19.pdf?sequence=6&isAllowed=y).
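The handle can be parsed out of such bitstream URLs with a short routine like this sketch:

# Sketch: extract the DSpace handle (e.g. "1/9943") from a bitstream URL so
# GA/GSC download events can be indexed by item.
import re

HANDLE_PATTERN = re.compile(r"/bitstream/handle/(\d+/\d+)/")

def handle_from_url(url):
    match = HANDLE_PATTERN.search(url)
    return match.group(1) if match else None

url = ("http://scholarworks.montana.edu/xmlui/bitstream/handle/1/9943/"
       "IR-Undercounting-preprint_2016-07-19.pdf?sequence=6&isAllowed=y")
print(handle_from_url(url))  # -> 1/9943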
Data were collected from January 5 through May 17, 2016 (n = 134 days) and consist of
three primary sources related to the Montana State University (MSU) IR, called ScholarWorks:
(1) “Undercounting File Downloads from Institutional Repositories” data set (OBrien
et al., 2016a). This data set consists of daily Citable Content Downloads events
(n = 45,158) collected for each URL by GSC.
(2) Citable Content Downloads events initiated though a DSpace web page and recorded
by Google Analytics events. These records (n = 5,640) were extracted via the Google Analytics API (a query sketch appears after this list) and included the following dimensions and metrics:
• Event Category (ga:totalEvents).
• Page (ga:pagePath).
• Unique Events (ga:uniqueEvents).
(3) Disaggregated log data were extracted from the ScholarWorks DSpace Solr indexes
(Masár, 2015). DSpace Solr indexes are divided into multiple parts consisting of a
statistics core and a search core. Approximately 1.8 million records were extracted
from the ScholarWorks Solr statistics core for the 134-day research period.
The search core includes metadata about communities, collections, and items. Search
core data were extracted and joined with the statistics core data to associate logged
Citable Content Downloads events with corresponding DSpace item handles. Listed
below are the critical fields and settings required for joining the DSpace Solr statistics core and search core data:
• Time = January 5, 2016-May 17, 2016
• IsBot = False
• StatisticsType = view
• BundleName = ORIGINAL
• Type = 0
• OwningItem: *
• Id: *
• Search.ResourceType = 2
• Handle = *
• Search.ResourceId = *
• StreamSourceInfo = *
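As noted in item (2), the Google Analytics download events were extracted through the Google Analytics Core Reporting API; a minimal sketch of such a query, assuming the google-api-python-client library, an authorized credentials object, and a placeholder view ID:

# Sketch: extract Citable Content Downloads events recorded by Google
# Analytics (Core Reporting API v3). The view ID is a placeholder.
from googleapiclient.discovery import build

def fetch_ga_download_events(credentials, view_id, start_date, end_date):
    service = build("analytics", "v3", credentials=credentials)
    result = service.data().ga().get(
        ids="ga:" + view_id,
        start_date=start_date,            # e.g. "2016-01-05"
        end_date=end_date,                # e.g. "2016-05-17"
        metrics="ga:totalEvents,ga:uniqueEvents",
        dimensions="ga:date,ga:pagePath,ga:eventCategory",
        max_results=10000,
    ).execute()
    return result.get("rows", [])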
These three data sets[5] were combined into a single data set (n = 130,384) representing the
Citable Content Downloads events recorded by GA/GSC and the MSU DSpace IR for each
handle on a daily basis.
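A sketch of that combination step, assuming each extract has been flattened into per-day, per-handle counts (file names and column labels are illustrative):

# Sketch: combine the GSC, Google Analytics, and DSpace Solr extracts into a
# single daily data set keyed on (date, handle). File names and column names
# are illustrative assumptions, not the study's actual artifacts.
import pandas as pd

gsc = pd.read_csv("gsc_daily_downloads.csv")     # columns: date, handle, gsc_clicks
ga = pd.read_csv("ga_daily_downloads.csv")       # columns: date, handle, ga_events
solr = pd.read_csv("solr_daily_downloads.csv")   # columns: date, handle, log_downloads

combined = (
    gsc.merge(ga, on=["date", "handle"], how="outer")
       .merge(solr, on=["date", "handle"], how="outer")
       .fillna(0)
)

# GA/GSC total vs raw log counts for each handle on each day.
combined["ga_gsc_downloads"] = combined["gsc_clicks"] + combined["ga_events"]
print(combined.head())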
Limitations
The main limitation of the method described in this paper is that it does not account for
Citable Content Downloads that originate outside Google Search properties. For instance,
non-Google search engines (e.g. Yahoo!, Bing, Yandex, etc.) may also send users directly to
the PDF file of the article in the IR, and these cases would not be recorded with the method in
this study. However, Google’s 65 percent market share for its combined search engine
properties (comScore Inc., 2016) is quite high, and the number of direct links to
Citable Content Downloads in IR from non-Google properties is therefore likely to be small.
This is another area we are studying, and preliminary results have confirmed that the IR in
our study receive very few direct links from Yahoo or Bing. A tool similar to GSC is now available for Bing and Yahoo, which may allow most of the remaining commercial search engine traffic to be tested in a future study (Microsoft Inc., 2016).
Other Citable Content Downloads that are not included are direct links that are
exchanged in e-mail or text messages, or that have been posted on web pages or on
non-Google social media sites like Facebook. Again, the numbers of these links are likely to
be small for a given IR, because these referrers are less likely to serve as intellectual
discovery services for scholars.
Finally, the research is limited by the team’s lack of resources to enhance features, such
as integration with Google Analytics API data and IR metadata, and the team may not be able to
maintain the prototype service beyond the IMLS “Measuring Up” grant project expiration
date of December 2017.
Findings
The findings of this study demonstrate that there is an enormous disparity between the
reporting methods (log file, GA, and GSC), and that they are not comparable in any way.
Specifically, log file data capture all Citable Content Downloads activity in the IR and apply a
standard filter used by most software vendors to screen out known bots; Google Analytics
Download Events reflect all Citable Content Downloads activity that originates fromAncillary
and Item Summary HTML pages internal to the IR; and GSC provides all Citable Content
Downloads events that occurred via direct link from a Google search property.
The two biggest unknown factors are:
(1) How many Type I errors (false positives of non-human Citable Content Downloads
events) are included in metrics generated from log data?
(2) How many Type II errors (false negatives of human Citable Content Downloads
events) in IR metrics are excluded by the GA/GSC from direct links that originate
from non-Google search properties?
Most notable in the descriptive statistics describing the data grouped by date (Table I) is the
overwhelming disparity that shows the potential instability and lack of predictability of the
activity reported by the two methods. The most concerning result is the very large kurtosis, which indicates that the log data have infrequent but large deviations, typically known as outliers. The Google kurtosis, by contrast, is slightly negative, representing a smooth distribution of activity over time. The finding that the log data have a standard deviation larger than their mean and median is consistent with the kurtosis. The data gathered through GA/GSC, on the other hand, show a relatively low standard deviation, indicating a more normal distribution of activity. The following conclusions can be drawn from Table I: data from the log files could be interpreted to claim that the MSU IR averages nearly four times (roughly 380 percent of) the daily Citable Content Downloads drawn from GA/GSC; and the standard deviation of the log file data indicates that user activity swings wildly from day to day, while the data drawn from GA/GSC are relatively stable and predictable over time.
Table I. Descriptive statistics for Citable Content Downloads by date

                    Log data        Google
Mean                2,405.72        626.39
Standard error      249.11          10.22
Median              1,835.00        640.00
Mode                1,728.00        704.00
SD                  2,883.68        118.35
Variance            8,315,621.23    14,006.28
Kurtosis            26.17           −0.99
Skewness            5.10            −0.20
Range               19,429          500
Minimum             492             370
Maximum             19,921          870
Sum                 322,366         83,936
Count               134             134

Table II. Descriptive statistics by handle

                    Log data        Google
Mean                38.01           9.90
Standard error      1.51            0.63
Median              14              2
Mode                1               0
SD                  138.64          58.32
Sample variance     19,222.20       3,401.35
Range               4,872.00        2,060.00
Minimum             0               0
Maximum             4,872           2,060
Sum                 322,366         83,936
Count               8,482           8,482

Table II provides another look at the data aggregated by unique identifier (i.e. DSpace handle) over the entire 134-day period. Again, the descriptive statistics indicate a large difference between the two tracking methods. Because of the log data outliers, the median is more relevant than the mean. However, the log data median is also significantly larger than the Google median. The log data standard deviation and range are almost
140 percent more than Google’s. This indicates that the issue is probably more involved than
detecting and removing a few obvious robot outliers.
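For reference, the descriptive statistics reported in Tables I and II can be reproduced with a few lines of pandas; a minimal sketch, assuming the combined per-day, per-handle data set sketched in the Research method section (file and column names are illustrative):

# Sketch: descriptive statistics (including skewness and excess kurtosis) for
# daily Citable Content Downloads counts, along the lines of Table I.
# The input file and column names are illustrative.
import pandas as pd

combined = pd.read_csv("combined_daily_downloads.csv")  # date, handle, log_downloads, ga_gsc_downloads
daily = combined.groupby("date")[["log_downloads", "ga_gsc_downloads"]].sum()

for column in daily.columns:
    series = daily[column]
    print(column,
          "mean=%.2f" % series.mean(),
          "median=%.2f" % series.median(),
          "sd=%.2f" % series.std(),
          "skew=%.2f" % series.skew(),
          "kurt=%.2f" % series.kurt())  # Fisher (excess) kurtosis, as in Table I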
The paired two-sample t-tests for means on the data aggregated by date (Table III) or by handle (Table IV) indicate a strong rejection of the hypothesis that the two methods produce similar results. Tables III and IV also contain the F-test two-sample-for-variance results, which confirm that we can reject the hypothesis that the two methods are the same.
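A minimal sketch of those two tests with SciPy, using the same illustrative daily totals:

# Sketch: paired two-sample t-test for means and F-test for equality of
# variances on the daily totals, mirroring Tables III and IV. Input file and
# column names are illustrative.
import pandas as pd
from scipy import stats

combined = pd.read_csv("combined_daily_downloads.csv")
daily = combined.groupby("date")[["log_downloads", "ga_gsc_downloads"]].sum()
log_data = daily["log_downloads"]
google = daily["ga_gsc_downloads"]

# Paired (dependent) t-test: the two methods observe the same days.
t_stat, t_pvalue = stats.ttest_rel(log_data, google)

# F-test: ratio of sample variances compared against an F distribution under H0.
f_stat = log_data.var() / google.var()
f_pvalue = stats.f.sf(f_stat, len(log_data) - 1, len(google) - 1)

print("t = %.2f, p = %.4f" % (t_stat, t_pvalue))
print("F = %.2f, p = %.4f" % (f_stat, f_pvalue))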
When graphing the data by weekday (Figure 1), one would expect to see similar-sized
weekday totals and trend lines. However, the two methods are divergent, with the log data indicating that Wednesday produces 422 percent more Citable Content Downloads events than were tracked with GA/GSC.
Table III. Date F and t-test pairwise two sample for means

                                Log data        Google
Mean                            2,405.72        626.39
Variance                        8,315,621.23    14,006.28
Observations                    134.00          134.00
Pearson correlation             0.22
Hypothesized mean difference    –
df                              133.00
t stat                          7.20
P(T⩽t) one-tail                 0.00
t critical one-tail             1.66
P(T⩽t) two-tail                 0.00
t critical two-tail             1.98
F                               593.71
P(F⩽f) one-tail                 0.00
F critical one-tail             1.33

Table IV. Handle F and t-test pairwise two sample for means

                                Log data        Google
Mean                            38.01           9.90
Variance                        19,222.20       3,401.35
Observations                    8,482.00        8,482.00
Pearson correlation             0.83
Hypothesized mean difference    0.00
df                              8,481.00
t stat                          26.94
P(T⩽t) one-tail                 0.00
t critical one-tail             1.65
P(T⩽t) two-tail                 0.00
t critical two-tail             1.96
F                               5.65
P(F⩽f) one-tail                 0.00
F critical one-tail             1.04

Discussion
Citable Content Downloads events reported by IR are initiated by humans seeking information. Including robot activity (Type I errors) in reported IR metrics does more harm than good for strategic and operational decision making. Library stakeholders are better served by excluding some human IR activity (Type II errors) from IR
reporting metrics if those reports can also exclude bot activity (Type I errors). While
Type II errors are not desirable, Type I errors give the impression of poor integrity
and jeopardize the public trust and goodwill the library community has with its
stakeholders.
Google Analytics is the most prevalent web analytics service being used in academic
libraries, and for good reason: it is powerful, free, and relatively easy to implement.
However, incorporating the most important metric for IR – Citable Content Downloads – is a disjointed process; the data are difficult to access and limited to a moving 90-day window. Without the
skills to systematically access the GSC API, IR managers do not have access to a persistent
data set of their largest and most important metric.
We advocate the creation of a single data store similar to the IRUS-UK initiative,
although our method uses a web service based on the GA/GSC platforms. Our research demonstrates that the proposed method has a very high potential of providing a better long-term solution than what the library community might build and maintain on its own.
There are a few caveats related to privacy and long-term access costs that require further
investigation and discussion. Because our team is preparing to submit data and analysis on
Google Analytics privacy for publication, we are limiting our scope of discussion in this
paper to long-term access and cost.
The log file analytics packages built into IR platforms offer a solution that initially
appears to capture 284 percent more Citable Content Downloads, but closer inspection
reveals that much of that traffic is not generated by humans. The capacity to accurately
filter robot traffic from these log file analytics packages is beyond most libraries. Worse, it is
impossible to tell whether a platform that claims to effectively filter robot traffic is actually
able to do so as well as Google.
RAMP web service prototype
The RAMP prototype offers a simple web interface for IR managers to view and download
Citable Content Downloads event data from their IR for a given date. The service automates
the daily aggregation of Google Search API analytics and stores them as daily file dumps.
It also provides single-day statistics and visualizations.
Figure 1. Citable Content Downloads events by weekday (bar chart comparing log data and Google (GA/GSC) weekday totals, with linear trend lines for each method)
The research team reused and refactored Python code originally developed for the
previously published research (OBrien et al., 2016b) to experiment with the proof-of-concept
prototype. RAMP provides the following capabilities:
(1) persistent access to Citable Content Downloads event data over time;
(2) no significant investment in training or system configuration; and
(3) the potential to aggregate IR metrics across organizations for consistent
benchmarking and analysis interpretation.
The landing page (Figure 2) provides a list of current organizations that have registered
with RAMP as of February 2017. The service provides two ways of collecting/reviewing
GSC data. The first is real-time daily statistics, where users select a date and the service
conducts a query of the Google Search API to retrieve data (see Figure 3). This method is
limited to 90 days of data that GSC stores.
For the daily statistics, users can view the data in two different ways. The first is to
download the data, which includes total clicks on Citable Content Downloads and total
number of clicks per device. Downloaded data are available in TSV format and could be
compiled locally for use in spreadsheets or other applications.
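For example, a batch of downloaded daily TSV files could be compiled locally along these lines (the directory layout and column names are assumptions about the export, not a documented format):

# Sketch: compile RAMP's daily TSV downloads into one table of clicks per URL.
# The directory layout and column names (url, clicks, device) are assumptions
# about the export format.
import glob
import pandas as pd

frames = []
for path in sorted(glob.glob("ramp_exports/*.tsv")):
    day = pd.read_csv(path, sep="\t")
    day["source_file"] = path
    frames.append(day)

all_days = pd.concat(frames, ignore_index=True)
totals = all_days.groupby("url")["clicks"].sum().sort_values(ascending=False)
print(totals.head(20))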
The second method of viewing data is through RAMP’s visualization feature, which
currently shows how many Citable Content Downloads URLs were clicked in a given day as
well as the type of device (Mobile, Tablet, or Desktop) that accessed the document (Figure 4).
Figure 2. Landing page for the RAMP service
T)
In addition to querying daily statistics, we recommend users set up the service to
automatically collect logs every day at midnight to ensure access to their data past the
90-day window imposed by Google. It should be noted that the Google Search API has a
three-day delay before statistics are available for download. Consequently, the daily
statistics query actually pulls data from three days prior to the current date.
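A scheduled collection job can account for that delay by computing the lagged target date before querying and writing each day's rows to a dated file; a minimal sketch (the file naming and row layout follow the earlier hypothetical fetch_gsc_clicks example):

# Sketch: helpers for a nightly job that accounts for GSC's three-day
# reporting delay. lagged_target_date() computes the date the daily query
# should request; write_daily_tsv() stores already-fetched GSC rows
# (e.g. from the fetch_gsc_clicks sketch above) in a dated TSV file.
import csv
import datetime

def lagged_target_date(lag_days=3):
    """Date the daily query should request, given GSC's reporting delay."""
    return (datetime.date.today() - datetime.timedelta(days=lag_days)).isoformat()

def write_daily_tsv(day, rows, directory="."):
    """Write one day's GSC rows (dicts with 'keys', 'clicks', 'impressions')."""
    path = "%s/gsc_%s.tsv" % (directory, day)
    with open(path, "w", newline="") as output:
        writer = csv.writer(output, delimiter="\t")
        writer.writerow(["date", "page", "device", "clicks", "impressions"])
        for row in rows:
            page, device = row["keys"][1], row["keys"][2]
            writer.writerow([day, page, device, row["clicks"], row["impressions"]])

if __name__ == "__main__":
    print("Tonight's query should request data for", lagged_target_date())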
Figure 3. Organization page with administrative login
Figure 4. Daily statistics visualization by device – access by device: Desktop accounted for 371 clicks (80 percent), with the remainder from Mobile and Tablet

Registering for access to RAMP
The prototype currently runs on Google’s Cloud Platform and leverages Google’s APIs. The service requires no installations, configurations, special skills, or training, and once
approved, IR managers can have access to data in a matter of minutes if they already run
Google Analytics or GSC. The only local requirement is that the Google Account
administrator for the IR adds RAMP’s e-mail to Google Analytics or GSC, as would be done
for any other “user” in the organization.
IR managers should send an e-mail[6] to our research team, requesting access to RAMP.
Below is the process for accessing the RAMP service.
Once authorization has been provided, RAMP developers will create a registration entry
for the participating institution and forward a special RAMP service account e-mail address.
To begin downloading and analyzing data, the IR manager will need to authorize RAMP to
access their repository’s GSC data by adding the RAMP service account e-mail address[7].
Next, they will input the repository’s base URL in the “Sites to Analyze” form (Figure 5).
The base URL should correspond to the web property for which the service account is
authorized as described above. It is possible to add multiple URLs in order to include both
HTTP and HTTPS protocols, where applicable.
Once the websites are added, the IR manager is able to view real-time daily statistics for
the past 90 days as well as initiate daily logging. Because the RAMP application has no
access to any personally identifiable, proprietary, or confidential information,
anyone with access to the RAMP system can search or download the daily statistics
collected for any of the participating institutions. Although the system is open for all
accepted participants to read, participating institutions can stop the logging of their data
any time. In this pilot, MSU will have the authority to start and stop daily logging and
add or modify the URLs that are used by RAMP.
Conclusion
Web server logs for IR platforms are excellent at tracking all activity (i.e. page views, visits,
item downloads, etc.). However, analytics reports using log analysis are heavily biased
toward over-reporting due to excessive utilization of IR content by robots. The problems
with existing methods of reporting Citable Content Downloads from IR can be summarized
into three statements: page tagging analytic services grossly undercount, locally
administered log file analytics grossly over-count, and it is difficult to ascertain whether
commercial services that offer log file analysis packages can manage the rapidly changing
robot environment quickly enough to provide consistent measurement.
Currently, the open source repository community does not have access to a reasonable
solution for identifying and filtering out this unwanted bot activity consistently. Solutions
developed and tested by the community appear effective on obvious and less sophisticated
robots. However, 50 percent of bot activity is due to “bad bots” (Zeifman, 2015) that require
highly advanced machine learning risk assessment algorithms for real-time bot detection.
Figure 5. Managing URLs used by the RAMP service
These methods were pioneered and deployed by former academics who now work at Google and other leading companies (The Economist, 2016). Without this advanced technology, the
statistical sensitivity and specificity of any solutions developed by the library community are
questionable, at best.
The RAMP prototype described in this paper utilizes Google services for a solution that is
accessible and free. It relies on Google’s platforms to ensure that all participants are using
clearly defined metrics and terminology. Enhancements made to Google’s platform pose little
technical risk for participant implementation, and any changes in Google services affect all
participants at the same time. This ensures that all participant data are comparable over time
and allows the library community to aggregate data for benchmarking and best practice
identification with confidence. Finally, the method provides high accuracy and robustness.
Notes
1. Privacy research in progress by the authors shows 80 percent of academic libraries that are
members of ARL, DLF, or the OCLC Research Libraries Partnership use Google Analytics.
Publication expected in 2017.
2. Project counter – www.projectcounter.org/code-of-practice-sections/general-information/
3. Google Search Console dashboard – www.google.com/webmasters/tools/search-analytics
4. Google Analytics Query Explorer – https://ga-dev-tools.appspot.com/query-explorer/
5. Google Search Console + Google Analytics + DSpace Solr Cores (statistics + search)
6. Send e-mail to Jeff Mixter, Software Engineer at OCLC – mixterj@oclc.org
7. https://support.google.com/webmasters/answer/2453966?hl=en
References
Alphabet Inc. (2015), “Consolidated revenues”, Form 10K, United States Securities and Exchange
Commission, Washington, DC, available at: www.sec.gov/Archives/edgar/data/1288776/000
165204416000012/goog10-k2015.htm#s2A481E6E5C511C2C8AAECA5160BB1908 (accessed
October 28, 2016).
Arlitsch, K., OBrien, P., Kyrillidou, M., Clark, J.A., Young, S.W.H., Mixter, J., Chao, Z., Freels-Stendel, B.
and Stewart, C. (2014), “Measuring up: assessing accuracy of reported use and impact of digital
repositories”, Funded grant proposal, Institute of Museum and Library Services, Washington,
DC, available at: http://scholarworks.montana.edu/xmlui/handle/1/8924 (accessed July 15, 2016).
Bohannon, J. (2014), “Google Scholar wins raves – but can it be trusted?”, Science Magazine, January 3,
p. 14.
Carpenter, J. (2012), “Researchers of tomorrow: the research behaviour of Generation Y doctoral
students”, Information Services and Use, Vol. 32 Nos 1-2, pp. 3-17, doi: 10.3233/ISU-2012-0637.
Cisco (2016), “The zettabyte era – trends and analysis”, Cisco, Cisco Visual Networking Index, available
at: www.cisco.com/c/en/us/solutions/collateral/service-provider/visual-networking-index-vni/
vni-hyperconnectivity-wp.html
comScore Inc. (2011), “comScore releases June 2011 US search engine rankings”, July 13, available at:
www.comscore.com/Press_Events/Press_Releases/2011/7/comScore_Releases_June_2011_U.S.
_Search_Engine_Rankings (accessed August 10, 2011).
comScore Inc. (2016), “comScore releases February 2016 US desktop search engine rankings”, March 16,
available at: www.comscore.com/Insights/Rankings/comScore-Releases-February-2016-US-
Desktop-Search-Engine-Rankings (accessed October 2, 2016).
Cothran, T. (2011), “Google scholar acceptance and use among graduate students: a quantitative
study”, Library & Information Science Research, Vol. 33 No. 4, pp. 293-301, doi: 10.1016/j.
lisr.2011.02.001.
de Kunder, M. (2016), “The size of the World Wide Web (the internet)”, October 2, available at: www.
worldwidewebsize.com
Diggory, M. and Lawrence, A. (2016), “SOLR statistics”, DuraSpace, Dspace Documentation Wiki,
July 11, available at: https://wiki.duraspace.org/display/DSDOC5x/SOLR+Statistics (accessed
October 28, 2016).
Edelman, B., Ostrovsky, M. and Schwarz, M. (2007), “Internet advertising and the generalized second-
price auction: selling billions of dollars worth of keywords”, The American Economic Review,
Vol. 97 No. 1, pp. 242-259, doi: 10.1257/000282807780323523.
Falagas, M.E., Eleni, I.P., Malietzis, G.A. and Pappas, G. (2008), “Comparison of PubMed, Scopus, Web
of Science, and Google Scholar: strengths and weaknesses”, The FASEB Journal, Vol. 22 No. 2,
pp. 338-342, doi: 10.1096/fj.07-9492LSF.
Gehanno, J.-F., Rollin, L. and Darmoni, S. (2013), “Is the coverage of Google Scholar enough to be used
alone for systematic reviews”, BMC Medical Informatics and Decision Making, Vol. 13 No. 1,
doi: 10.1186/1472-6947-13-7, available at: http://bmcmedinformdecismak.biomedcentral.com/
articles/10.1186/1472-6947-13-7
Giustini, D. and Boulos, M.N.K. (2013), “Google Scholar is not enough to be used alone for systematic
reviews”, Online Journal of Public Health Informatics, Vol. 5 No. 2, pp. 1-9, doi: 10.5210/ojphi.
v5i2.4623.
Google Inc. (2015), “Announcing Google Search Console – the new webmaster tools”, Google
Webmaster Central Blog, May 20, available at: https://webmasters.googleblog.com/2015/05/
announcing-google-search-console-new.html (accessed October 29, 2016).
Google Inc. (2016a), “Using Search Console with your website”, Google Search Console Help, available
at: https://support.google.com/webmasters/answer/6258314?hl=en&ref_topic=3309469
(accessed October 28, 2016).
Google Inc. (2016b), “What are impressions, position, and clicks? – Search Console Help”, available at:
https://support.google.com/webmasters/answer/7042828#click (accessed October 28).
Greene, J. (2016), “Web robot detection in scholarly open access institutional repositories”, Library Hi
Tech, Vol. 34 No. 3, pp. 500-520, available at: http://hdl.handle.net/10197/7682
Haeberli-Kaul, J., Beucke, D., Hitzler, M., Holtz, A., Mimkes, J., Riese, W., Herb, U., Recke, M., Schmidt, B.,
Schulze, M., Henneberger, S. and Stemmer, B. (2013), “Standardised usage statistics for open access
repositories and publication services”, DINI – Deutsche Initiative für Netzwerkinformation E.V.,
Göttingen (Trans byA. Rennison), available at: http://nbn-resolving.de/urn:nbn:de:kobv:11-100217555
Herrera, G. (2011), “Google Scholar users and user behaviors: an exploratory study”, College & Research
Libraries, Vol. 72 No. 4, pp. 316-330, doi: 10.5860/crl-125rl.
Huntington, P., Nicholas, D. and Jamali, H.R. (2008), “Web robot detection in the scholarly information
environment”, Journal of Information Science, Vol. 34 No. 5, pp. 726-741, doi: 10.1177/
0165551507087237.
Information Power Ltd (2013), “IRUS download data – identifying unusual usage”, IRUS Download
Report, available at: www.irus.mimas.ac.uk/news/IRUS_download_data_Final_report.pdf
(accessed July 1, 2016).
Lake, C. (2016), “The most expensive 100 Google Adwords keywords in the US”, Search Engine Watch,
May 31, available at: https://searchenginewatch.com/2016/05/31/the-most-expensive-100-google-
adwords-keywords-in-the-us/ (accessed November 2, 2016).
Lourenço, A.G. and Belo, O.O. (2006), “Catching web crawlers in the act”, Proceedings of the 6th
International Conference on Web Engineering, ACM Press, Palo Alto, CA, pp. 265-272, doi: 10.1145/
1145581.1145634, available at: http://portal.acm.org/citation.cfm?doid=1145581.1145634
Masár, I. (2015), “Solr – DSpace – Duraspace wiki”, Dspace Documentation Wiki, December 11,
available at: https://wiki.duraspace.org/display/DSPACE/Solr#Solr-Bypassinglocalhostrestri
tiontemporarily (accessed July 1, 2016).
Microsoft Inc. (2016), “Search keywords report”, Bing Webmaster Tools, available at: www.bing.com/
webmaster/help/search-keywords-report-20a352af (accessed November 3, 2016).
Needham, P. and Stone, G. (2012), “IRUS-UK: making scholarly statistics count in UK repositories”,
Insights: The UKSG Journal, Vol. 25 No. 3, pp. 262-266, doi: 10.1629/2048-7754.25.3.262.
Newman, L.H. (2016), “What we know about Friday’s massive east coast internet outage”, Wired,
October 21, available at: www.wired.com/2016/10/internet-outage-ddos-dns-dyn/ (accessed
October 23, 2016).
Nicholas, D., Clark, D., Rowlands, I. and Jamali, H.R. (2009), “Online use and information seeking
behaviour: institutional and subject comparisons of UK researchers”, Journal of Information
Science, Vol. 35 No. 6, pp. 660-676, doi: 10.1177/0165551509338341.
OBrien, P., Arlitsch, K., Sterman, L., Mixter, J., Wheeler, J. and Borda, S. (2016a), Data Set Supporting the
Study Undercounting File Downloads from Institutional Repositories, Montana State University,
Bozeman, MT, available at: http://scholarworks.montana.edu/xmlui/handle/1/9939
OBrien, P., Arlitsch, K., Sterman, L., Mixter, J., Wheeler, J. and Borda, S. (2016b), “Undercounting file
downloads from institutional repositories”, Journal of Library Administration, Vol. 56 No. 7,
pp. 854-874, doi: 10.1080/01930826.2016.1216224.
Rettberg, N. and Schmidt, B. (2012), “OpenAIRE – building a collaborative open access infrastructure
for European researchers”, LIBER Quarterly, Vol. 22 No. 3, pp. 160-175.
Sullivan, D. (2016), “Google now handles at least 2 trillion searches per year”, Search Engine Land,
May 24, available at: http://searchengineland.com/google-now-handles-2-999-trillion-searches-
per-year-250247 (accessed October 23, 2016).
The Economist (2016), “Million-dollar babies”, The Economist, April 2, available at: www.econom
ist.com/news/business/21695908-silicon-valley-fights-talent-universities-struggle-hold-their
(accessed November 2, 2016).
van den Bosch, A., Bogers, T. and de Kunder, M. (2016), “Estimating search engine index size
variability: a 9-year longitudinal study”, Scientometrics, Vol. 107 No. 2, pp. 839-856, doi: 10.1007/
s11192-016-1863-z.
Van Noorden, R. (2014), “Online collaboration: scientists and the social network”, Nature, Vol. 512
No. 7513, pp. 126-129, available at: www.nature.com/news/online-collaboration-scientists-and-the-
social-network-1.15711
Verhaar, P. (2009), “SURE: statistics on the usage of repositories”, SURF Foundation, available at:
http://docplayer.net/750695-Sure-statistics-on-the-usage-of-repositories.html (accessed
November 3, 2016).
Wang, Ya and Howard, P. (2012), “Google Scholar usage: an academic library’s experience”, Journal of
Web Librarianship, Vol. 6 No. 2, pp. 94-108, doi: 10.1080/19322909.2012.672067.
Zeifman, I. (2015), “2015 bot traffic report: humans take back the web, bad bots not giving any ground”,
Incapsula Blog, December 9, available at: www.incapsula.com/blog/bot-traffic-report-2015.html
(accessed June 30, 2016).
Zineddine, M. (2016), “Search engines crawling process optimization: a webserver approach”,
Internet Research, Vol. 26 No. 1, pp. 311-331, doi: 10.1108/IntR-02-2014-0045.
Corresponding author
Kenning Arlitsch can be contacted at: kenning.arlitsch@montana.edu