%\documentclass[article]{jss}
\documentclass[nojss]{jss}
\usepackage{amsmath}
\newcommand{\class}[1]{``\code{#1}''}

\author{Kenneth Knoblauch\\
  Inserm, U846 \And Laurence T.\ Maloney\\
  New York University}
\Plainauthor{Kenneth Knoblauch, Laurence T. Maloney}

\title{\pkg{MLDS}: Maximum Likelihood Difference Scaling in \proglang{R}}
\Plaintitle{MLDS: Maximum Likelihood Difference Scaling in R}

\Abstract{
This introduction to the \proglang{R} package \pkg{MLDS} is a modified and
updated version of \citet{KnoblauchMaloney2008}, published in the
\emph{Journal of Statistical Software}.  The \pkg{MLDS} package in the
\proglang{R} programming language can be used to estimate perceptual scales
based on the results of psychophysical experiments using the method of
difference scaling.  In a difference scaling experiment, observers compare two
supra-threshold differences $(a,b)$ and $(c,d)$ on each trial.  The approach is
based on a stochastic model of how the observer decides which perceptual
difference (or interval), $(a,b)$ or $(c,d)$, is greater, and the parameters of
the model are estimated using a maximum likelihood criterion.  We also propose
a method to test the model by evaluating the self-consistency of the estimated
scale.  The package includes an example in which an observer judges the
differences in correlation between scatterplots.  The example may be readily
adapted to estimate perceptual scales for arbitrary physical continua.
}
\Keywords{difference scaling, sensory magnitude, proximity, psychophysics, signal detection theory, GLM}

\Address{
  Kenneth Knoblauch\\
  Inserm, U846, Bron, France\\
  Stem Cell and Brain Research Institute\\
  Department of Integrative Neurosciences\\
  Universit\'e de Lyon, Lyon 1\\
  18 avenue du Doyen L\'epine\\
  69500 Bron, France\\
  Telephone: +33/472913477\\
  Fax: +33/472913461\\
  E-mail: \email{ken.knoblauch@inserm.fr}\\
  URL: \url{http://www.sbri.fr/members/kenneth-knoblauch.html}\\
  Laurence T.\ Maloney\\
  Department of Psychology\\
  Center for Neural Science\\
  New York University\\
  6 Washington Place, 8th Floor\\
  New York, NY 10011, United States of America\\
  Telephone: +1/212/9987851\\
  E-mail: \email{ltm1@nyu.edu}\\
  URL: \url{http://www.psych.nyu.edu/maloney/}
}

%% need no \usepackage{Sweave.sty}
%\VignetteIndexEntry{MLDS: Maximum Likelihood Difference Scaling in R}
%\VignetteKeywords{difference scaling, sensory magnitude, proximity, psychophysics, signal detection theory, GLM, R}
%\VignettePackage{MLDS}

\begin{document}

\section{Introduction}

\begin{Scode}{echo=FALSE,results=hide}
library(MLDS)
\end{Scode}

Difference scaling is a psychophysical procedure used to estimate
supra-threshold perceptual differences for stimuli distributed along a physical
continuum.  On each of a set of trials, an observer is presented with a
quadruple, $(I_{a},I_{b},I_{c},I_{d})$, drawn from an ordered set of stimuli,
$\{I_{1} < I_{2} < \ldots < I_{p}\}$.  For convenience, the two pairs
$(I_{a},I_{b})$ and $(I_{c},I_{d})$ are often ordered so that
$I_{a} < I_{b}$ and $I_{c} < I_{d}$, but this is not necessary.  On each trial,
the observer judges which pair shows the greater perceptual difference.  The
data are used to estimate scale values $\psi_{1}, \psi_{2}, \ldots, \psi_{p}$
corresponding to the stimuli; these values should capture the observer's
judgments in the sense that, whenever the difference of the pair
$(I_{a},I_{b})$ is judged greater than that of $(I_{c},I_{d})$,
\begin{equation}
|\psi _{b}-\psi _{a}| > |\psi _{d}-\psi _{c}|,
\label{equ1}
\end{equation}%
that is, the difference scale predicts judgment of perceptual difference.
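As a small numerical illustration of the deterministic rule in
Equation~\ref{equ1} (with made-up scale values rather than estimates from any
experiment), the following \proglang{R} lines check which of two pairs is
predicted to be judged as showing the larger difference.
\begin{Scode}{eval=FALSE}
## hypothetical scale values for p = 5 stimuli
psi <- c(0, 0.1, 0.35, 0.7, 1)
## for the quadruple (a, b; c, d) = (1, 3; 2, 4):
## is |psi_b - psi_a| > |psi_d - psi_c| ?
abs(psi[3] - psi[1]) > abs(psi[4] - psi[2])
\end{Scode}
Here the comparison evaluates to \code{FALSE}, so the second pair,
$(I_{2}, I_{4})$, is predicted to be judged as showing the larger difference.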
It is unlikely that human observers would be so reliable in judgment as to
satisfy the criterion just given, particularly if the perceptual differences
$| \psi_b - \psi_a |$ and $| \psi_d - \psi_c |$ are close.  \citet{MalYang03}
proposed a stochastic model of difference judgment that allows the observer to
exhibit some stochastic variation in judgment.

Let $L_{ab} = | \psi_b - \psi_a |$ denote the unsigned perceived length of the
interval $I_a, I_b$.  The proposed decision model is an equal-variance Gaussian
signal detection model \citep{GrSw74} in which the signal is the difference in
the lengths of the intervals,
\begin{equation}
\delta(a, b; c, d) = L_{cd} - L_{ab} = | \psi_d - \psi_c | - | \psi_b - \psi_a |.
\label{equ2}
\end{equation}
If $\delta$ is positive, the observer should choose the second interval as
larger; when it is negative, the first.  When $\delta$ is small, negative or
positive, relative to the Gaussian standard deviation, $\sigma$, we expect the
observer, presented with the same stimuli, to give different, apparently
inconsistent judgments.  The decision variable employed by the observer is
assumed to be
\begin{equation}
\Delta(a, b; c, d) = \delta(a, b; c, d) + \epsilon = L_{cd} - L_{ab} + \epsilon,
\label{equ3}
\end{equation}
where $\epsilon \sim \mathcal{N}(0, \sigma^2)$.  Given the quadruple
$\left( a, b\,; c, d \right)$, the observer selects the pair $I_c, I_d$
precisely when
\begin{equation}
\Delta(a, b\,; c, d) > 0.
\label{equ4}
\end{equation}

\subsection{Direct maximization of likelihood}

In each experimental condition the observer completes $n$ trials, each based on
a quadruple $\mathbf{q}^k = \left( a^k, b^k ; c^k, d^k \right)$,
$k = 1, \ldots, n$.  The observer's response is coded as $R_k = 0$ (the
difference of the first pair is judged larger) or $R_k = 1$ (second pair judged
larger).  We denote the outcome ``$cd$ judged larger than $ab$'' by
$cd \succ ab$ for convenience.  We fit the parameters
$\mathbf{\Psi} = \left( \psi_1, \psi_2, \ldots, \psi_p \right)$ and $\sigma$ by
maximizing the likelihood,
\begin{equation}
L(\mathbf{\Psi}, \sigma) = \prod^n_{k= 1}\Phi \left (\frac{\delta\left (%
\mathbf{q}^k \right)}{\sigma} \right )^{R_k} \left (1 - \Phi \left (\frac{%
\delta\left (\mathbf{q}^k \right)}{\sigma} \right ) \right )^{1 - R_k},
\label{like}
\end{equation}
where $\Phi(x)$ denotes the cumulative standard normal distribution and
$\delta \left (\mathbf{q}^k \right ) = \delta \left (a^k, b^k ; c^k, d^k \right )$
as defined in Equation~\ref{equ2}.

At first glance, it would appear that the stochastic difference scaling model
just presented has $p + 1$ free parameters: $\psi_1, \ldots, \psi_p$ together
with the standard deviation of the error term, $\sigma$.  However, any linear
transformation of the $\psi_1, \ldots, \psi_p$ together with a corresponding
scaling by $\sigma^{-1}$ results in a set of parameters that predicts exactly
the same performance as the original parameters.  Without any loss of
generality, we can set $\psi_1 = 0$ and $\psi_p = 1$, leaving us with the
$p - 1$ free parameters, $\psi_2, \ldots, \psi_{p-1}$ and $\sigma$.  When scale
values are normalized in this way, we describe them as standard scales.

Equation~\ref{like} can be recognized as the likelihood for a Bernoulli
variable.  Taking the negative logarithm allows the parameters to be estimated
simply with a minimization function such as \code{optim} in \proglang{R}
\citep[for example][p.~445]{VR02}.

\subsubsection{Reparameterization}

In practice, we define the minimization functions to work with the parameters
on transformed scales: the $p-2$ free scale values by a logit transformation,
\begin{equation}
\log \left( \frac{x}{1-x}\right),
\label{logistic}
\end{equation}%
so that they are constrained to lie in the interval $(0,1)$, and $\sigma$ by
the logarithm, so that it remains positive.  These transformations have no
theoretical significance; they are used to avoid problems in numerical
optimization.  Maximum likelihood methods are invariant under such
reparameterization \citep[pp.~284--285]{MoodGray74}.
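To make the direct approach concrete, the following unevaluated sketch defines
such a negative log-likelihood and passes it to \code{optim}.  It is not the
implementation used by \code{mlds}; it assumes that the trial data are stored
in a data frame with a binary response \code{resp} and columns
\code{S1}--\code{S4} giving the indices of the four stimuli presented on each
trial (treat these column names as an assumption here), and it uses the
reparameterization just described.
\begin{Scode}{eval=FALSE}
## Minimal sketch of direct maximization; not the mlds() implementation.
## d: data frame with columns resp (0/1) and S1, S2, S3, S4 giving the
##    indices of the four stimuli on each trial; nstim: number of levels.
negLogLik <- function(par, d, nstim) {
    psi <- c(0, plogis(par[seq_len(nstim - 2)]), 1)  # psi_1 = 0, psi_p = 1
    sigma <- exp(par[nstim - 1])                     # log scale keeps sigma > 0
    del <- (abs(psi[d$S4] - psi[d$S3]) -
            abs(psi[d$S2] - psi[d$S1])) / sigma
    ## resp == 1 (second pair judged larger) has probability Phi(delta/sigma)
    -sum(ifelse(d$resp == 1, pnorm(del, log.p = TRUE),
                             pnorm(-del, log.p = TRUE)))
}
nstim <- 11                        # 11 stimulus levels, as in the kk example
start <- c(qlogis(seq(0.1, 0.9, len = nstim - 2)), log(0.2))
fit <- optim(start, negLogLik, d = kk, nstim = nstim, method = "BFGS")
psi.hat <- c(0, plogis(fit$par[seq_len(nstim - 2)]), 1)
sigma.hat <- exp(fit$par[nstim - 1])
\end{Scode}
The back-transformed \code{psi.hat} is then on the standard scale described
above.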
\subsection{GLM method}

In this section, we show how the above formulation can be framed as a GLM.  A
generalized linear model \citep[GLM;][]{McCNeld89} is described by
\begin{equation}
\eta( \E [ Y ] ) = X \beta,
\end{equation}
where $\eta$ is a (link) function transforming the expected value of the
elements of the response vector, $Y$, to the scale of a linear predictor given
by the product of the model matrix, $X$, and a vector of coefficients, $\beta$,
and the elements of $Y$ are distributed as a member of the exponential family.
In the present case, the responses of the observer can be considered as
Bernoulli variables and thus can be modeled with the binomial distribution,
which belongs to this family.  The canonical link function for the binomial
case is the logistic transformation, Equation~\ref{logistic}.  However, other
links are possible; the inverse cumulative Gaussian, or quantile function,
corresponds to Equation~\ref{like} described above and would be equivalent to a
probit analysis.  In this section, we assume that we have re-ordered each
quadruple $\left( a, b; c, d \right)$ so that $a < b$ and $c < d$.  With this
ordering, the absolute values in Equation~\ref{equ2} can be dropped, so that
the decision variable is a linear function of the scale values, as required for
the linear predictor of a GLM.

\subsection{Diagnostics of the estimated scale}

The structure of the object returned by the bootstrap diagnostics of the
probit-link fit to the \code{kk} data set, \code{kk.diag.prob}, is shown below.
\begin{verbatim}
> str(kk.diag.prob)
List of 5
 $ NumRuns  : int [1:10000] 195 193 205 199 177 177 201 211 187 209 ...
 $ resid    : num [1:10000, 1:990] -5.25 -4.79 -4.77 -4.56 -4.52 ...
 $ Obs.resid: Named num [1:990] 0.458 0.164 0.413 1.155 -1.610 ...
  ..- attr(*, "names")= chr [1:990] "1" "2" "3" "4" ...
 $ ObsRuns  : int 173
 $ p        : num 0.0159
 - attr(*, "class")= chr [1:2] "mlds.diag" "list"
\end{verbatim}
\code{NumRuns} is an integer vector giving the number of runs in the sorted
deviance residuals for each simulation.  \code{resid} is a numeric matrix, each
row of which contains the sorted deviance residuals from a simulation.
\code{Obs.resid} is a numeric vector providing the residuals from the obtained
fit.  \code{ObsRuns} is the number of observed runs in the sorted deviance
residuals, and, finally, \code{p} gives the proportion of runs from the
simulations less than the observed number.  A \code{plot} method is supplied to
visualize the results of the two analyses, which are shown in
Figure~\ref{diagnostics} for each of the link functions.

The distributions of the residuals for the probit and logit links seem
reasonable, based on 10000 simulations.  Tendencies toward deviation from the
envelopes are small, in any case.  These contrast with the cauchit link, which
displays systematic deviations from the bootstrapped envelope.  The histograms
indicate that there are too few runs in the residuals using the probit link.
For the logit, the observed number falls well within the distribution of
bootstrapped values, while the cauchit value, given its performance with the
previous diagnostic, is debatable.  The proportion of simulated runs less than
the observed value for each link is given in the third column of
Table~\ref{GOF}.
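As a check on how these components relate, the reported \code{p} can be
recomputed directly from the object displayed above (a one-line illustration;
it assumes \code{kk.diag.prob} is available in the session).
\begin{Scode}{eval=FALSE}
## proportion of simulated runs falling below the observed number of runs
with(kk.diag.prob, mean(NumRuns < ObsRuns))
\end{Scode}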
Two points on the far left of the cdfs of Figure~\ref{diagnostics} stand out as
having unusually large residual deviances.  These points, as well as a third
one, are also flagged by the diagnostics generated by the \code{plot} method
for \code{glm} objects.  The three trials are easily extracted from the data
set.
\begin{Scode}{}
kk[residuals(kk.mlds$obj) < -2.5, ]
\end{Scode}
Interestingly, if these points are removed from the data set, the value of
\code{p} for the probit link increases to 0.24, more in line with that obtained
using the logit link.  The number of runs does not change in the observed data,
but the bootstrapped distribution of the number of runs shifts to a mean near
171.

Judging from the estimated scale as well as the stimuli, it seems surprising
that the observer would have judged the correlation difference between 0 and
0.1 to be greater than that of 0.3 (or 0.4) and 0.9.  We suspect that these
correspond to non-perceptual errors on the part of the observer, such as finger
slips, lack of concentration or momentary confusion about the instructions.  A
few such errors nearly always occur in psychophysical experiments, even with
practiced and experienced observers.  In modeling data from detection
experiments, it has proven useful to include a nuisance parameter to account
for these occasional errors \citep{WichHill01}.  The error rates are modeled by
modifying the lower and upper asymptotes of the inverse link function.  We can
get a sense of the impact of adding these two nuisance parameters by using the
links \code{mafc.probit} and \code{probit.lambda} from the \pkg{psyphy}
package, which permit specifying these asymptotes differently from 0 and 1
\citep{psyphy}.  Preliminary experimentation on the full data set indicates
that the AIC is reduced by 48 if the lower estimate is set to about 0.06 and
the upper to 0.007.  These values also lower the number of runs in the
distributions of bootstrapped residuals, so that the observed value yields a
\code{p} $= 0.7$.

\subsubsection{Bias-reduced MLDS}

When using the default argument \code{method = "glm"} with \code{mlds}, a
fitting method can be passed on to \code{glm} with the argument
\code{glm.meth}.  Its default value is \code{"glm.fit"}, but other methods are
possible.  For example, the \pkg{brglm} package \citep{Kosmidis2007} contains
the function \code{brglm.fit}, which performs a bias-reduced fit based on
\citet{Firth93}:
\begin{Scode}{eval=FALSE}
library(brglm)
mlds(kk2, glm.meth = brglm.fit)
\end{Scode}

\subsection{The six-point test}

Performing a six-point test on these data with 10000 simulations requires about
15 minutes on the same machine indicated above.
\begin{CodeChunk}
\begin{CodeInput}
kk.6pt <- simu.6pt(kk.mlds, 10000)
str(kk.6pt)
\end{CodeInput}
\begin{CodeOutput}
List of 4
 $ boot.samp: num [1:10000] -488 -539 -531 -502 -447 ...
 $ lik6pt   : num -425
 $ p        : num 0.848
 $ N        : num 10000
\end{CodeOutput}
\end{CodeChunk}
Examination of the structure of the returned list with \code{str} shows the
log-likelihood for the number of six-point violations in the data and the
associated p-value, and indicates that the observer did not make a
significantly greater number of six-point violations than an ideal observer.
Figure~\ref{boot6pt} shows a histogram of the log-likelihoods from such a
simulation, with the observed log-likelihood indicated by a thick vertical
line.  These results support the appropriateness of the scale as a description
of the observer's judgments.
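For reference, the six-point condition evaluated by this test can be stated as
follows (our paraphrase of the standard condition; see \citealp{MalYang03}):
for two ordered triples of stimuli $a < b < c$ and $a' < b' < c'$,
\begin{displaymath}
ab \succ a'b' \quad \textrm{and} \quad bc \succ b'c'
\quad \Longrightarrow \quad ac \succ a'c'.
\end{displaymath}
Any difference scale necessarily satisfies this implication, since adding the
premises $\psi_b - \psi_a > \psi_{b'} - \psi_{a'}$ and
$\psi_c - \psi_b > \psi_{c'} - \psi_{b'}$ yields
$\psi_c - \psi_a > \psi_{c'} - \psi_{a'}$; the simulation asks whether the
observer violates the condition more often than would an ideal observer whose
judgments follow the fitted model.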
\begin{figure}[htb!]
\begin{center}
\includegraphics[width=0.6\textwidth]{Figures/boothist}
\caption{Histogram of log-likelihood values from the six-point test applied to
the data set \code{kk}.  The thick vertical line indicates the observed
six-point log-likelihood for the data set.}
\label{boot6pt}
\end{center}
\end{figure}

\subsection{Fitting a parametric difference scale}

The results of the difference scaling experiment with scatterplots suggest that
the perception of correlation increases quadratically with correlation
(Fig.~\ref{figplot}).  We can refit the data while constraining the estimated
difference scaling curve to be a parametric curve, such as a power function,
using the formula method for \code{mlds}.  Under the hypothesis that perception
of correlation depends on $r^2$, we would expect the exponent to be
approximately $2$.
\begin{Scode}{keep.source=TRUE}
kk.frm <- mlds(~ (sx/0.98)^p[1], p = c(2, 0.2), data = kk)
\end{Scode}
We specify the parametric form for the difference scale with a one-sided
formula.  The operations take on their mathematical meaning here, as in
formulas for \code{nls}, but unlike in formulas for \code{lm} and \code{glm},
where they must be isolated (for example, with \code{I()}) to be interpreted
arithmetically.  The stimulus scale is indicated by a variable \code{sx} and
the parameters to estimate by a vector, \code{p}.  The fitting procedure
adjusts the parameters to best predict the judgments of the observer on the
standard scale.  The equation is normalized by the highest value tested along
the stimulus dimension, $r = 0.98$, so that the fitted curve passes through
$1.0$ at this value.  The optimization is performed by \code{optim}, and
initial values of the parameters are provided by the vector given as the second
argument.  The last element of this vector is always the initial value for
\code{sigma}.  Finally, a data frame with the experimental results is also
required.

The estimated parameters are returned in the component \code{par} of the model
object and the judgment variability, as usual, in the component \code{sigma}.
Standard errors for each of these estimates can be obtained from the Hessian
matrix in the \code{hess} component.
\begin{Scode}{}
c(p = kk.frm$par, sigma = kk.frm$sigma)
sqrt(diag(solve(kk.frm$hess)))
\end{Scode}
The standard error for the exponent provides no evidence to reject an exponent
of $2$.

The parametric fit with $2$~parameters is nested within the $10$~parameter
nonparametric fit.  This permits us to test the parametric fit with a
likelihood ratio test.  We extract the log likelihoods from each model object
with the \code{logLik} method.  The degrees of freedom of each fit are obtained
from the \code{df} attribute of the \class{logLik} object.
\begin{Scode}{keep.source=TRUE}
ddf <- diff(c(attr(logLik(kk.frm), "df"), attr(logLik(kk.mlds), "df")))
pchisq(-2 * c(logLik(kk.frm) - logLik(kk.mlds)), ddf, lower.tail = FALSE)
\end{Scode}
The results reject the power function description of the data.  A predicted
curve under the parametric model fit to the data is obtained from the
\code{func} component of the model object, which takes two arguments: the
estimated parameters, obtained from the \code{par} component of the model
object, and a vector of stimulus values at which to calculate predicted values.
The plot of the predicted values against the values obtained by \code{glm}
indicates that the power function fit provides a reasonable description of the
data for correlations greater than 0.4 and a lack of fit for lower
correlations, for which this observer shows no evidence of discrimination
(Fig.~\ref{fig:mldsFormMeth}).

\begin{figure}[h!]
\begin{center}
\setkeys{Gin}{width=0.65\textwidth}
\begin{Scode}{fig=TRUE,eps=FALSE,keep.source=TRUE}
plot(kk.mlds, standard.scale = TRUE,
     xlab = expression(r^2),
     ylab = "Difference Scale Value")
xx <- seq(0, 0.98, len = 100)
lines(xx, kk.frm$func(kk.frm$par, xx))
\end{Scode}
\setkeys{Gin}{width=0.8\textwidth}
\caption{Difference scales estimated for the perception of correlation obtained
using the default (points) and formula (line) methods to fit the data.  The
formula used was a power function.}
\label{fig:mldsFormMeth}
\end{center}
\end{figure}

\section{The Method of Triads}

In the Method of Quadruples, observers compare the intervals $(a,b)$ and
$(c,d)$, and it is not difficult to see how the resulting difference scale
would lend itself to predicting such judgments.  We can also use a slightly
restricted method for presenting stimuli that we refer to as the Method of
Triads.  On each trial, the observer sees three stimuli $(a, b, c)$ with
$a < b < c$ and judges whether the perceptual difference between the pair
$(a,b)$ is greater than the perceptual difference between the pair $(b,c)$.
For $p > 3$ stimuli, there are
\begin{equation}
\left( \begin{array}{c} p \\ 3 \end{array} \right) =\frac{p!}{3!\:(p-3)!}
\end{equation}
possible non-overlapping triads.  If $p=11$, as in the example, there are
\begin{equation}
\left( \begin{array}{c} 11 \\ 3 \end{array} \right) =\frac{11!}{3!\:8!}=165
\end{equation}
different, non-overlapping triads.  Of course, the number of trials judged by
the observer affects the accuracy of the estimated difference scale
\citep[see][]{MalYang03}, as we illustrate below.  Judging two repetitions of
all 165 triads (330 trials in total) requires the same number of trials as one
repetition of all 330 non-overlapping quadruples.
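The two counts used above are easy to verify directly in \proglang{R} (a
trivial check, included only for completeness):
\begin{Scode}{eval=FALSE}
choose(11, 3)  # 165 non-overlapping triads for p = 11 stimuli
choose(11, 4)  # 330 non-overlapping quadruples for p = 11 stimuli
\end{Scode}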
\section{Future directions}

There are several directions in which \pkg{MLDS} might be developed, three of
which will be mentioned here.  First, we do not know what would be the most
efficient choice of stimuli along a continuum or of quadruples for a particular
application of MLDS.  These depend on the observer's actual scale and judgment
uncertainty $\sigma$, but given pilot data for the observer or previous results
for other observers, it would be interesting to work out methods for selecting
stimuli and quadruples.  For example, having seen the results for one observer
in the scatterplot example, we might consider stimuli that are equally spaced
along the scale $r^2$ and not the scale $r$ in future experiments.

Second, we plan to develop a more systematic method of assessing the asymptotic
probabilities of the inverse link function to take into account unsystematic
errors by the observers.  The difficulty is that these parameters are not part
of the linear predictor.  One possibility is to profile the nuisance parameters
(as we did here) or, alternatively, to develop a method that switches back and
forth between adjusting the nuisance parameters and the coefficients of the
linear predictor.

Third, it would be useful to incorporate random effects that influence the
scale when an observer repeats the experiment or to account for variations
between individuals.  Such heterogeneity is indeed apparent if we compare the
three scales obtained on different days in the data set \code{kk}.  For these
data, there is only one random factor, the Run.  It might be possible to treat
this as an effect due to a randomized block \citep[p.~295]{VR02}.  However, the
ratio of the scale values between any two of the repetitions is approximately
constant across the physical scale, which suggests that the estimate of
$\sigma$ across runs, or equivalently the maximum scale value, is a more likely
candidate to explain this source of variability, acting as a multiplicative
rather than as an additive effect.  We develop some approaches to this problem
using the \pkg{lme4} package elsewhere (Knoblauch and Maloney, \emph{Modeling
Psychophysical Data in \proglang{R}}, Springer, in preparation).

%\section*{Computational details}
%
%The bivariate point distributions were generated
%with the \code{mvrnorm} function from the package \pkg{MASS} (version 7.2-40) \citep{VR02}.
%The \pkg{ggplot} package (version 0.4.2) \citep{ggplot} was used to create Figure~\ref{overlap}.
%All \proglang{R} code for generating the figures is available in an \proglang{R}
%script along with this paper

\section*{Acknowledgments}

This research was funded in part by Grant EY08266 from the National Institutes
of Health (LTM).

\bibliography{MLDS}

\end{document}