Note that two things are done before any calculation. First the function checks whether the required argument x is missing, then it checks whether x is a data frame or matrix. This avoids the messy errors that would result from calling the function without any arguments, or with an object that makes no sense to the function. If either check fails, a simple usage message is printed to inform the user.
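The shape of that preamble might look something like the following skeleton. It is an illustrative sketch, not the author's listing; the exact usage message is not shown in this excerpt:

dstats_skeleton <- function(x) {
    # refuse to go on unless x was supplied and is a data frame or matrix
    if(missing(x) || !(is.data.frame(x) || is.matrix(x))) {
        cat("Usage: dstats(x)\n where x is a data frame or matrix\n")
        return(invisible(NULL))
    }
    # ... the calculations described below would go here ...
    cat("x passed the checks\n")
}

Calling dstats_skeleton() with no argument, or with a plain vector, now produces the usage line rather than an obscure error.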
After all this preparation, what does the function do? First, it works out which columns of the input data are numeric. It then computes the vector of means as above. Then it makes almost exactly the same call, except using the function var() to get the variances. You won't find the third function in the R help listing:
# count the values in datavec that are not NA
validn <- function(datavec) { return(sum(!is.na(datavec))) }
All it does is sum a vector of TRUE and FALSE values (which count as 1 and 0 when summed). is.na() returns TRUE for each missing value and the NOT (!) operator reverses these, so the sum is the number of valid (not NA) observations in the object passed to it, which will be each column of the data coming into dstats(). So validn() had to be written before dstats() would run. This is a good time to consider when it is worth writing a separate function to do something. validn() has the advantage that it returns a useful value that might be used by many other functions. It is also slightly easier to pass a function name to sapply() than to write the function definition inline, as shown in the sketch below. Its disadvantage is that it is a one-liner that could simply be written into the code of the other functions rather than called. Your own style of programming will probably dictate at what point you begin to write separate functions.
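To see the difference in practice, here is a small sketch using a made-up data frame (the names and values are purely illustrative):

testdata <- data.frame(age = c(23, 31, NA, 45), parity = c(1, NA, NA, 3))
# passing the function name is compact and readable
sapply(testdata, validn)
# the same thing with the definition written out inline
sapply(testdata, function(datavec) sum(!is.na(datavec)))

Both calls return the same named vector of valid counts; the first is simply easier to read once validn() exists.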
dstats() then calls rbind() to stick the three vectors together into a matrix, row by row, and calls rownames() to assign the desired row labels to the matrix. Finally, dstats() assigns the class "dstat", which allows the objects it produces to be recognized as a specific sort of object. The object dstat is then returned, which means that it will be displayed at the console unless the result is assigned to another object.
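Putting the steps just described together, the whole function might look something like the sketch below. This is a reconstruction from the description, not the author's listing, so the usage message, the na.rm handling and the row labels are assumptions:

dstats <- function(x) {
    # check that we were given a data frame or matrix before doing anything else
    if(missing(x) || !(is.data.frame(x) || is.matrix(x))) {
        cat("Usage: dstats(x)\n where x is a data frame or matrix\n")
        return(invisible(NULL))
    }
    x <- as.data.frame(x)
    # keep only the numeric columns
    x <- x[, sapply(x, is.numeric), drop = FALSE]
    # means, variances and valid ns, column by column
    means <- sapply(x, mean, na.rm = TRUE)
    vars <- sapply(x, var, na.rm = TRUE)
    ns <- sapply(x, validn)
    # stack the three vectors into a matrix, one statistic per row
    dstat <- rbind(means, vars, ns)
    rownames(dstat) <- c("mean", "var", "n")
    # mark the result so that other functions can recognize it
    class(dstat) <- "dstat"
    return(dstat)
}

With the testdata frame from the earlier sketch, dstats(testdata) returns a 3 x 2 matrix of class "dstat", one column per variable.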
Assume that you would like to have more control over which variables are included in the output. You know that the mean is a suitable description for age and parity. dstats() as written allows you to pass a vector of either integers or names that will specify the variables to be used.
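The calling convention for that selection is not shown in this excerpt; with the one-argument sketch above, the same effect can be had by subsetting the data frame in the call, either by name or by column number:

# add a column we do not want summarized
testdata$group <- c("a", "b", "a", "b")
# select the variables of interest by name ...
dstats(testdata[, c("age", "parity")])
# ... or by column position
dstats(testdata[, c(1, 2)])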
So, why do we go to all this trouble to define a special object and write a function to produce it? I find that an object containing means, variances and valid ns lets me easily display a summary and produce a basic plot, including error bars and so on. It also allows functions to recognize the object as the correct sort without extensive testing of its contents. Apparently other people have noticed the same thing, and there is at least one other "dstat" object out there. One of the really valuable things about being able to program your statistics is that you can customize such objects to suit the sort of work that you do. I therefore encourage readers to devise and refine their own objects; it is a good way to learn the capabilities of R and, more importantly, to get exactly what you want.
Suppose, for example, that you want to add the median and the mode to the dstat object. You
could add a call to the median() function and include that in the object returned. The mode is a bit more difficult. We can use the tabulate() function to calculate it, provided the data are all positive integers:
> x <- sample(1:10, 30, TRUE)
> xt <- tabulate(x)
> xmode <- which(xt == max(xt))
> if(sum(xt == max(xt)) > 1) xmode <- NA
xmode will be the category whose frequency equals the highest frequency in the table; because the data here are the integers 1 to 10, that category index is also the value itself. If more than one category has the maximum frequency, xmode will be set to NA.
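For the mode, the tie handling above can be wrapped into a small helper in the same spirit as validn(), so that it can be applied column by column with sapply(). The name valid_mode() and its details are my own illustration, not from the text:

# mode of a vector of positive integers; NA if the maximum frequency is shared
valid_mode <- function(datavec) {
    xt <- tabulate(datavec)
    xmode <- which(xt == max(xt))
    # the same tie check as above: more than one category at the maximum frequency
    if(length(xmode) > 1) xmode <- NA
    return(xmode)
}

Rows for the median and the mode could then be added to the rbind() call in dstats(), for example with sapply(x, median, na.rm = TRUE) and sapply(x, valid_mode), with matching labels added in rownames().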
For more information, see An Introduction to R: Vector arithmetic.