Formulae in R bear a passing resemblance to other formulae with which you may be familiar. Their primary purpose is to specify a statistical model, but they may also be used to specify other sorts of relationships between variables. This helps to simplify the user interface, although it does put some pressure on the beginner to learn the syntax of formulae.
The most straightforward use of formulae is in specifying a linear model, such as the following:
y = ax1 + bx2 + cx3 + i
where the x terms are variables and the a, b, c and i terms are numeric constants (a, b and c are the coefficients and i is the intercept). Asking the lm() function in R to fit this model would look something like this:
lm(y ~ x1 + x2 + x3, data = mydata.df)
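Note that the variable on the left of the tilde ('~') is the response (or dependent) variable, and those on the right are the terms of the model (sometimes called the independent variables). The constants a, b, c and i do not appear in the formula, because lm() estimates them from the data. So far, so good.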
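To make the lm() call above concrete, here is a minimal sketch that can be run as is. The data are simulated purely for illustration: the column names match the formula above, but the "true" coefficients (2, -1, 0.5) and intercept (3) are assumptions chosen for this example, not values from any real data set.

```r
# Simulated data; the coefficients and intercept below are illustrative
# assumptions, not taken from any real data set.
set.seed(42)
mydata.df <- data.frame(
  x1 = rnorm(100),
  x2 = rnorm(100),
  x3 = rnorm(100)
)
mydata.df$y <- 2 * mydata.df$x1 - 1 * mydata.df$x2 + 0.5 * mydata.df$x3 + 3 +
  rnorm(100, sd = 0.1)

# Fit the linear model described by the formula
fit <- lm(y ~ x1 + x2 + x3, data = mydata.df)

# The estimated intercept (i) and coefficients (a, b, c)
coef(fit)
```

The estimates returned by coef() should be close to the values used to simulate y, with the intercept reported under the name "(Intercept)".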
Because formulae are central to specifying models, the syntax is rich and sometimes confusing. It allows for the usual interactions between variables and for terms that control the details of the calculation, such as dropping the intercept.
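A few of the more common pieces of formula syntax can be explored without fitting anything, by asking terms() how R reads a formula. The sketch below uses only base R; the variable names (y, x1, x2) are placeholders and no data are needed.

```r
# Placeholder variable names; no data frame is needed to inspect a formula
f_cross <- y ~ x1 * x2        # main effects plus their interaction
f_noint <- y ~ x1 + x2 - 1    # "- 1" removes the intercept
f_asis  <- y ~ x1 + I(x2^2)   # I() protects arithmetic from formula syntax

# x1 * x2 expands to x1, x2 and the interaction x1:x2
attr(terms(f_cross), "term.labels")

# 0 means the model has no intercept, 1 means it has one
attr(terms(f_noint), "intercept")

# The squared term is kept as a single term, I(x2^2)
attr(terms(f_asis), "term.labels")
```

Inspecting a formula this way is a cheap check that the operators mean what you expect before the formula is passed to a modelling function.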
The reader will have noticed that the simple formula construction was used to specify how to break down a variable by a number of factors in the brkdn() function. In this case, the formula was a convenient way to describe the breakdown to the function rather than a linear model.
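The same idea can be sketched with base R alone. The example below does not use brkdn(); it substitutes the base function aggregate(), which also accepts a formula of the form "variable ~ factors" to describe a breakdown. The data frame survey.df and its columns are invented for illustration.

```r
# Hypothetical data: a score measured across two grouping factors
set.seed(1)
survey.df <- data.frame(
  sex   = rep(c("F", "M"), each = 10),
  group = rep(c("a", "b"), times = 10),
  score = rnorm(20, mean = 50, sd = 10)
)

# Here the formula means "break score down by sex and group",
# not "fit a linear model of score on sex and group"
means <- aggregate(score ~ sex + group, data = survey.df, FUN = mean)
means
```

The result has one row for each combination of sex and group, with the mean score in the last column.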
As with the xtab() function, formulae may be used to specify a number of relationships between variables in R. It is best to make sure you know how a particular function uses its formula, as simply sticking one together and sending it to the function often results in particularly confusing error messages.
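One way to reduce the guesswork is to inspect a formula before passing it on. The sketch below uses only base R and, as a stand-in for xtab(), the base function xtabs(), which expects a one-sided formula (nothing to the left of the tilde) when counting cases. The data frame and variable names are invented for illustration.

```r
# A two-sided formula (with a response) and a one-sided formula (without one)
f_two <- y ~ x1 + x2
f_one <- ~ sex + group

inherits(f_two, "formula")  # TRUE: this really is a formula object
length(f_two)               # 3: the tilde, the left side, the right side
length(f_one)               # 2: the tilde and the right side only
all.vars(f_two)             # the variable names the formula refers to

# Hypothetical data for a simple cross-tabulation
survey.df <- data.frame(
  sex   = rep(c("F", "M"), each = 10),
  group = rep(c("a", "b"), times = 10)
)

# xtabs() reads the one-sided formula as "count cases by sex and group"
tab <- xtabs(f_one, data = survey.df)
tab
```

Checking whether a function wants a one-sided or a two-sided formula, and which variables the formula names, catches many mistakes before they turn into an obscure error.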
For more information, see An Introduction to R: Defining statistical models; formulae.