Preface -
As part of the Factor Analysis course I took this Winter Quarter, I had
to write a 2-page definition of factor analysis as if I were writing for an
encyclopedia being produced for those involved in quantitative research.
This piece is an elaboration of that homework assignment, restoring the
material I had to cut for reasons of space and correcting certain things
that the instructor pointed out to me.
The first time any bit of technical terminology is used, I'll type it in
ALL CAPS to help set it off (I used boldface in the original). A few of
these terms don't get much explanation in context, so I put in an endnotes
section to cover them.
FACTOR ANALYSIS -
Put simply, factor analysis is a method for analyzing multiple
measurements and looking for underlying causes for any relationships between
the measurements. Put more simply, it's a way (but not the only way) to take
a bunch of tests and see what they have in common. Of course, this
simplification misses many important details, but it's a good start.
When comparing the results of several tests, one can find CORRELATIONS
between the results of any two tests taken together, which suggests the two
tests have something in common. The theory behind factor analysis is that
these two tests are correlated because both are trying to get at an
unmeasurable FACTOR...since the factor influences both tests, a subject's
score on one test can tell you something about their score on the other.
A factor is a trait of some sort, like intelligence, emotional stability
or artistic ability, which cannot be known exactly. Any test can only
approximate a given factor, and usually contains measurements of more than
one factor at a time. Methods of factor analysis try to figure out how much
a particular factor influences a given test without actually knowing what the
value of the factor is.
So, in general, factor analysis tries to use what can be measured
(called MANIFEST VARIABLES) to make some sort of statement about how the
tests are linked to what cannot be measured (called factors or LATENT
VARIABLES). This statement is in the form of FACTOR LOADINGS, a measure of
how much someone's score on a test is influenced by their unmeasurable
ability in a given factor.
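To make this concrete, here's a small Python sketch (the loadings, noise levels
and test names are all invented) showing how a single latent factor can produce
a correlation between two tests, even though the factor itself is never
measured:

```python
import numpy as np

# Invented example: one latent factor drives two tests. Each test also
# has its own "unique" noise. The loadings (0.8 and 0.6) are made up,
# and the noise scales are chosen so each test has total variance 1.
rng = np.random.default_rng(0)
n = 5000

factor = rng.normal(size=n)                 # the unmeasurable factor
test_a = 0.8 * factor + rng.normal(scale=0.6, size=n)
test_b = 0.6 * factor + rng.normal(scale=0.8, size=n)

# The two tests correlate only because they share the factor; with
# standardized tests the expected correlation is 0.8 * 0.6 = 0.48.
r = np.corrcoef(test_a, test_b)[0, 1]
print(round(r, 2))
```

The correlation comes out close to the product of the two loadings, which is
exactly the relationship factor analysis runs in reverse: it sees the
correlations and tries to recover the loadings.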
Specifically, factor analysis makes the assumption that a subject's
deviation from the mean score on a test is due to three different types of
unmeasurable factor:
COMMON FACTORS: These are factors which influence more than one test in
a battery (group of tests). Finding the factor loadings of these latent
variables is the main goal of most factor analysis methods.
SPECIFIC FACTORS: These are latent variables that only influence one
test in a battery. A single specific factor may actually be the result of
several traits, so long as none of these traits affects any other test. A
specific factor may become a common factor if a new test, influenced by that
factor, is added to the battery.
ERROR FACTORS: While common and specific factors measure things that
presumably remain fairly constant in a subject across administrations of a
particular test, error factors represent the effect of unreliable influences,
such as the subject's mood, state of readiness or environment.
Because specific and error factors relate only to performance on a
single test, they are usually grouped together as a single UNIQUE FACTOR for
each test. Each specific factor only affects a single test, and each error
factor only affects a specific administration of that test, so there is no
correlation between unique factors.
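The three-way split above can be sketched numerically. In this invented
simulation, a score deviation is built as common + specific + error parts,
and because the parts are independent, the test's variance is just the sum
of the three contributions:

```python
import numpy as np

# Invented decomposition: deviation = common + specific + error.
# All three loadings (0.7, 0.4, 0.5) are made-up numbers.
rng = np.random.default_rng(1)
n = 10000

common = rng.normal(size=n)     # shared with other tests in the battery
specific = rng.normal(size=n)   # stable, but belongs to this test only
error = rng.normal(size=n)      # varies with each administration

score = 0.7 * common + 0.4 * specific + 0.5 * error

# Independent parts means the variances simply add:
# 0.7^2 + 0.4^2 + 0.5^2 = 0.49 + 0.16 + 0.25 = 0.90
total = float(np.var(score))
print(round(total, 2))
```

The specific and error pieces (0.16 + 0.25 here) are what get lumped together
as the unique variance for this test.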
Many methods of factor analysis exist, but all have in common that they
try to separate the common factors from the unique factors so as to explain
correlations between the manifest variables. The two major groups these
methods are usually split into are EXPLORATORY and CONFIRMATORY factor
analysis.
Exploratory Factor Analysis involves attempting to find the model which
best fits the available data, without the influence of any prior theory.
Because there are infinitely many solutions which qualify as "best,"
separated only by ORTHOGONAL TRANSFORMATION or ROTATION, the results given by
a single solution may not be interpretable, and often will look useless. The
goal of rotation is to find a solution that can be interpreted. Software
packages exist which will do all of this for you, but be warned: the most
commonly coded method for obtaining a rotated solution has also been shown
repeatedly to be highly dubious...it survives due to a form of academic
inertia. The best rotations are not orthogonal, but OBLIQUE, as they don't
assume that factors are uncorrelated. Without getting too much further into
material which can't be adequately treated in this short paper, suffice to
say that if your computer program offers a choice between orthogonal and
oblique rotation (Varimax is the most common orthogonal, Direct Quartimin a
frequently-used oblique rotation), pick oblique.
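For the curious, here's roughly what an orthogonal rotation does under the
hood. This is a textbook sketch of the Varimax criterion in Python (real
packages add Kaiser normalization and other refinements), and the loading
matrix is invented:

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-6):
    """Orthogonally rotate a loading matrix toward 'simple structure'.

    A bare-bones sketch of the Varimax criterion; the rotation matrix R
    stays orthogonal, so each test's communality is unchanged.
    """
    L = loadings.copy()
    p, k = L.shape
    R = np.eye(k)
    var = 0.0
    for _ in range(max_iter):
        Lam = L @ R
        # SVD step of the standard Varimax iteration
        u, s, vt = np.linalg.svd(
            L.T @ (Lam**3 - Lam @ np.diag((Lam**2).sum(axis=0)) / p))
        R = u @ vt
        new_var = s.sum()
        if new_var < var * (1 + tol):   # stop when improvement is tiny
            break
        var = new_var
    return L @ R

# Invented two-factor loading matrix that is hard to read as-is.
raw = np.array([[0.7, 0.5], [0.6, 0.4], [0.5, -0.6], [0.4, -0.5]])
rotated = varimax(raw)
print(np.round(rotated, 2))
```

An oblique method like Direct Quartimin works on the same loading matrix but
drops the requirement that the rotation be orthogonal, which is why it can
fit correlated factors.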
Aside from the question of rotation, exploratory methods also require
the researcher to pick how many factors to keep. In general, adding more
factors will make the model fit better, but will also make it harder to
interpret and generally less useful. A number of rules of thumb exist and
are even programmed into software packages, but consider how smart your
average thumb is. The best advice is to use several criteria for picking the
number of factors to keep, and analyze the problem in terms of the situation
at hand. The final criterion is to look at the final, rotated solution and
see if it can be interpreted sensibly.
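As an example of one such rule of thumb, Kaiser's criterion keeps factors
whose eigenvalues of the correlation matrix exceed 1. The correlation matrix
below is invented, and again, treat the rule as a starting point rather than
an answer:

```python
import numpy as np

# Invented correlation matrix for four tests: the first two hang
# together, the last two hang together, with weak cross-links.
corr = np.array([
    [1.0, 0.6, 0.1, 0.1],
    [0.6, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.5],
    [0.1, 0.1, 0.5, 1.0],
])

# Kaiser's criterion: keep factors with eigenvalue > 1.
eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]
n_keep = int(np.sum(eigenvalues > 1))
print(eigenvalues.round(2), n_keep)
```

Here the rule suggests keeping two factors, which happens to match how the
matrix was built; on real data, check it against other criteria and against
whether the rotated solution makes sense.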
Confirmatory Factor Analysis starts with some theory about how the
factors should relate to the manifest variables, such as grouping all math
tests under one factor and all language tests under another factor. By doing
this, it avoids both the need for rotation and the need to pick the number of
factors after seeing the data. Extremely impractical to perform by hand, it
can be done fairly easily by computers now, with the right software (SYSTAT
for Windows 95 has a program in it called RAMONA which performs some
confirmatory factor analysis). These methods, while not giving fits as good
as exploratory factor analysis, yield statistical measures of how well the
model fits. To use an analogy, exploratory factor analysis is like getting
the mean very accurately, while confirmatory factor analysis is like getting
the mean with slightly less accuracy, but also knowing the standard
deviation. One should always devise models before running confirmatory
factor analysis, since using the data to drive a confirmatory factor analysis
(i.e. changing your model after seeing that it doesn't fit well) is cheating.
Finally, it should be noted that there is a method known as PRINCIPAL
COMPONENTS ANALYSIS which shares many of the mathematical methods of factor
analysis, but which should not be mistaken for factor analysis. Principal
components analysis allows "unique factors" to be correlated, which means
they're no longer actually unique factors, and the factor loadings obtained
from the analysis will not explain the correlations between the manifest
variables as accurately. Principal Components Analysis is a tool for reducing
a large set of data to a few manageable numbers, not for explaining relations.
Be very cautious when you see principal components analysis presented as
being factor analysis...while it often gives good results, the fact that it
violates the underlying assumptions of the factor analysis model is good
reason to be careful.
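To show the distinction, here's a minimal sketch of what principal components
analysis actually computes: an eigendecomposition of the correlation matrix,
with components ranked by variance explained. The data are invented, and note
that no unique factors appear anywhere in the model:

```python
import numpy as np

# Invented data: three measurements, with the second deliberately
# made to depend on the first.
rng = np.random.default_rng(2)
data = rng.normal(size=(300, 3))
data[:, 1] += data[:, 0]

# PCA on the correlation matrix: eigenvectors are the components,
# eigenvalues are the variance each one carries.
corr = np.corrcoef(data, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Fraction of total variance carried by the first component.
explained = float(eigvals[0] / eigvals.sum())
print(round(explained, 2))
```

This is pure data reduction: every bit of variance, unique noise included,
gets packed into the components, which is precisely why it shouldn't be read
as a factor model.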
Endnotes:
Here's some elaboration on a few of the capitalized terms which weren't
defined in the text above, and might not be clear from context to those who
haven't studied statistics and linear algebra. Let me know if you think this
section should be added to.
Correlation: Suppose you have results of two tests on your class. The
stronger the correlation between the two tests, the better you can predict
a student's score on test A from the score on test B. Correlations are
bounded between -1 and +1, with 0 meaning there's no relationship at all
between the two sets of data. When two things are highly correlated (close
to either +1 or -1), this doesn't necessarily mean that one causes the other,
but it suggests that they at least have a common cause. Factors are a way of
representing these common causes.
Orthogonal Transformation: In two dimensions, this just means rotating
your axes. The idea is that if you graphed all your data, you hope it will
cluster in such a way that you can rotate the axes and have each clump on an
axis. For example, if you take students' results on their midterm as the
x-axis and results on the final as the y-axis, you might find that most of
the points lie on one line. You can rotate your axes so that one axis is
along that line, and name the axis "Physics Ability" or something else
appropriate (or even "ability to take tests well").
Oblique Transformation: When you get up to higher dimensions (more
tests), you might get tight clusters of data which lie on lines that aren't
orthogonal to each other. Simply rotating the axes will not catch all of
these clusters very well. So you relax the requirement that
the axes be orthogonal, and just move them to where they fit the data best.
Since the axes represent factors, this means that you have factors which are
correlated (the projection of one onto another is not zero), implying a
deeper factor underlying the current set.
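As a tiny numerical illustration (both axes invented): the correlation
between two oblique factor axes is just the cosine of the angle between
them, which is zero only when the axes are orthogonal:

```python
import numpy as np

# Two made-up factor axes in a 2-D loading space.
axis_1 = np.array([0.9, 0.3])
axis_2 = np.array([0.5, 0.8])

# Cosine of the angle = normalized projection of one axis on the other.
cos_angle = float(
    axis_1 @ axis_2 / (np.linalg.norm(axis_1) * np.linalg.norm(axis_2)))
print(round(cos_angle, 2))
```

A nonzero value like this is what an oblique solution reports as the
correlation between the two factors.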
Dave Van Domelen
Physics Education Research Group
The Ohio State University
dvandom@pacific.mps.ohio-state.edu