Simulation Methods for Probit and Related Models Based on Convenient Error Partitioning
by
Kenneth E. Train
Department of Economics
University of California, Berkeley
Abstract
Two probit simulators are described that are conceptually and computationally simple. The first is based on simulating the utilities of the non-chosen alternatives and calculating the probability that the chosen alternative's utility exceeds the maximum of these simulated utilities. This simulator is apparently new. The second, which is implicit in the discussions of McFadden (1989) and Bolduc (1992), is applicable when the covariance among utilities arises from random parameters and/or error components that are common across alternatives. The parameters and common error components are simulated, and the probability that the observed event occurs is then calculated conditional on these values. Both simulators are unbiased, strictly positive, and continuous. The second is twice-differentiable, while the first has points of non-differentiability. Both are easy to program and can be expected to be very fast computationally.
Key words: Probit, Simulation
JEL#: C13, C15, C25
June 1, 1995
I. Introduction
Probit models take the form $P = \int 1(\varepsilon \in A)\, f(\varepsilon)\, d\varepsilon$, where P is the probability of the observed event, $\varepsilon$ is a vector with a multivariate normal distribution denoted f, A is the set of $\varepsilon$'s for which the observed event occurs, and $1(\cdot)$ is an indicator that $\varepsilon$ is in A. The difficulty with probit models is that the integration cannot be performed explicitly to obtain a closed form for P. Numerical integration is required. Various simulation methods (discussed below) have been proposed that provide numerical approximations to the integral. In this paper, we present two simulators that might prove attractive in certain circumstances.

The two simulators utilize the concept of partitioning the random terms, though in different ways. The general approach is as follows. Partition $\varepsilon$ into two subvectors $\varepsilon_1$ and $\varepsilon_2$ with marginal and conditional distributions $f_1(\varepsilon_1)$ and $f_2(\varepsilon_2 \mid \varepsilon_1)$, respectively, where the partitioning is chosen to provide desirable properties of the resulting simulator. Then $P = \int p(\varepsilon_1)\, f_1(\varepsilon_1)\, d\varepsilon_1$, where $p(\varepsilon_1) \equiv \int 1((\varepsilon_1, \varepsilon_2) \in A)\, f_2(\varepsilon_2 \mid \varepsilon_1)\, d\varepsilon_2$ is the probability that the observed event occurs conditional on $\varepsilon_1$. If drawing $\varepsilon_1$ from its marginal distribution is easy, and $p(\varepsilon_1)$ is easy to calculate, then P can be readily simulated with this error partitioning. The simulation is performed by (1) drawing a value of $\varepsilon_1$ from its distribution, (2) calculating $p(\varepsilon_1)$ for this value of $\varepsilon_1$, and (3) repeating this process for repeated draws of $\varepsilon_1$ and averaging the results. (A minimal sketch of these steps is given below.)

The simulator is unbiased by construction. Its other properties are determined by the characteristics of $p(\varepsilon_1)$. Ideally, a partitioning will provide a $p(\varepsilon_1)$ that is (a) strictly positive, so that the simulated probability can be inserted into the log-likelihood function for parameter estimation, (b) continuous in the underlying parameters and variables, to avoid "jumps" in the simulated log-likelihood function and in the calculation of arc elasticities, and (c) differentiable, so that derivatives of the simulated probability can be used in the search for the maximum of the simulated log-likelihood function and in the calculation of point elasticities.
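To make the three steps concrete, here is a minimal sketch of the error-partitioning simulator in Python (this code is not from the paper); `draw_eps1` and `conditional_prob` are hypothetical stand-ins for the two model-specific pieces: a draw from $f_1$ and the closed-form $p(\varepsilon_1)$.

```python
import numpy as np

def simulate_probability(draw_eps1, conditional_prob, R, rng):
    # Average p(eps1) over R draws of eps1 from its marginal f1;
    # the average is unbiased for P by construction.
    return np.mean([conditional_prob(draw_eps1(rng)) for _ in range(R)])
```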
The two simulators described below differ in their properties. Both are strictly positive, continuous, and very easy to calculate. The first simulator has points of non-differentiability, while the second is twice-differentiable at all points. Each is applicable to particular types of models. The first is applicable to cross-sectional probit with a general covariance structure for the errors. The second is applicable to random-parameters, error-components probits. It can be used with cross-sectional or panel data and can be easily adapted to non-normal distributions. Both methods are simple conceptually and easy to program.
The fact that neither of these simulators is applicable to all latent variables models with normal errors is, in a sense, a characteristic of simulators created through error partitioning. A successful partitioning (that is, a partitioning that results in a simulator with desirable properties) exploits the structure of the model. By utilizing the structure, the simulator has the potential to be simple and fast; however, its reliance on the structure prevents it from being used in situations without that structure. In this regard, the approach of error partitioning differs from the standard approach to developing simulators, which is to define simulators that operate for all types of probit models.
II. Simulating the Maximum Utility for the Non-Chosen Alternatives
In standard probit, the agent chooses the alternative with the highest utility. The chosen alternative's utility is therefore higher than the maximum of the utilities of the non-chosen alternatives. Using this fact, the probabilities can be simulated as follows. Simulate the utilities for the non-chosen alternatives, drawing from their unrestricted normal distribution. Conditional on these utilities for the non-chosen alternatives, calculate explicitly the probability that the utility for the chosen alternative exceeds the maximum utility for the non-chosen alternatives. This probability is a one-dimensional cumulative normal, which is easy to calculate. With only one repetition, the simulated probability is strictly positive, unbiased for the true choice probability, and continuous in the underlying parameters. The method is described below and then compared with other simulators.
An agent faces a choice among J alternatives and chooses the one that provides the greatest utility. From the researcher's perspective, the utility vector U is distributed $N(V, \Omega)$. Denote the chosen alternative as i, with utility $U_i$. The utility vector for the non-chosen alternatives is denoted $U_N$, which is U with the i-th element removed. The chosen alternative's utility is higher than the maximum of the utilities of the non-chosen alternatives: $U_i > K(U_N)$, where $K(U_N) \equiv \max\{U_j,\ j=1,\dots,J;\ j \neq i\}$. The probability that alternative i is chosen is therefore:
$P_i = \int \big[\mathrm{Prob}(U_i > K(U_N) \mid U_N)\big]\, f(U_N)\, dU_N$
where $f(U_N)$ is the distribution of $U_N$ marginal over $U_i$. The term in brackets is a one-dimensional cumulative normal, which is easy to calculate, and the integral over $U_N$ can be approximated by simulation. Conditional on $U_N$, the distribution of $U_i$ is normal with mean $\alpha(U_N) \equiv V_i + \Omega_{i,N}(\Omega_N)^{-1}(U_N - V_N)$ and variance $\sigma^2 = \Omega_{ii} - \Omega_{i,N}(\Omega_N)^{-1}(\Omega_{i,N})'$, where $\Omega_{i,N}$ is the i-th row of $\Omega$ with the i-th element removed, $\Omega_N$ is $\Omega$ with the i-th row and column removed, and $\Omega_{ii}$ is the i-th diagonal element of $\Omega$. Therefore
$\mathrm{Prob}(U_i > K(U_N) \mid U_N) = 1 - \Phi\big((K(U_N) - \alpha(U_N))/\sigma\big) = \Phi\big((\alpha(U_N) - K(U_N))/\sigma\big).$
The simulation is performed as follows. Draw J independent standard normal deviates, and label the vector of these $\varepsilon^1$ to denote that this is the first simulation. Calculate the associated utilities: $U^1 = V + \Gamma\varepsilon^1$, where $\Gamma$ is the lower-triangular Choleski factor of $\Omega$. (Note that all utilities are simulated, though only the simulated utilities for the non-chosen alternatives will be used; instead, the utilities of only the non-chosen alternatives could be simulated. Computationally, it is probably faster to simulate all utilities than to locate the chosen alternative and not simulate it. Also, conceptually, simulating all utilities and ignoring the chosen alternative's utility is a direct way of drawing utilities for the non-chosen alternatives from their marginal distribution.) Observe the largest of the simulated utilities for the non-chosen alternatives and label it $K^1$. Calculate $\alpha^1 \equiv V_i + \Omega_{i,N}(\Omega_N)^{-1}(U_N^1 - V_N)$, and then calculate $\Phi\big((\alpha^1 - K^1)/\sigma\big)$. Repeat this process R times, labeling each repetition appropriately. The simulated approximation to $P_i$ is the average of the R cumulative normal probabilities:
$SP_i = (1/R)\sum_{r=1}^{R} \Phi\big((\alpha^r - K^r)/\sigma\big).$
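As an illustration, the following sketch implements this simulator in Python with numpy and scipy; the function name and argument layout are mine, not the paper's. Here V is the mean-utility vector, Omega is the covariance matrix $\Omega$ (a numpy array), i is the chosen alternative, and rng is a numpy random generator.

```python
import numpy as np
from scipy.stats import norm

def simulate_prob_max_utility(V, Omega, i, R, rng):
    # SP_i = (1/R) sum_r Phi((alpha^r - K^r)/sigma)
    J = len(V)
    N = [j for j in range(J) if j != i]          # indices of non-chosen alternatives
    Gamma = np.linalg.cholesky(Omega)            # lower-triangular Choleski factor of Omega
    Omega_N_inv = np.linalg.inv(Omega[np.ix_(N, N)])
    Omega_iN = Omega[i, N]                       # i-th row of Omega, i-th element removed
    # Conditional std. dev. of U_i given U_N; does not vary across draws
    sigma = np.sqrt(Omega[i, i] - Omega_iN @ Omega_N_inv @ Omega_iN)
    total = 0.0
    for _ in range(R):
        U = V + Gamma @ rng.standard_normal(J)   # simulate all utilities; U[i] is ignored
        K = U[N].max()                           # max simulated utility among non-chosen
        alpha = V[i] + Omega_iN @ Omega_N_inv @ (U[N] - V[N])
        total += norm.cdf((alpha - K) / sigma)
    return total / R
```

As a sanity check, with four alternatives having equal means and $\Omega = I$, `simulate_prob_max_utility(np.zeros(4), np.eye(4), 0, 10000, np.random.default_rng(0))` should approach 1/4 as the number of draws grows.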
$SP_i$ is an unbiased estimator of $P_i$ for a finite number of repetitions; it suffices to show this for R=1:

$E(SP_i) = E\big\{\Phi\big((\alpha^1 - K^1)/\sigma\big)\big\} = \int \Phi\big((\alpha^1 - K^1)/\sigma\big)\, f(U_N)\, dU_N = P_i.$
$SP_i$ is strictly positive for any R, including R=1, and is continuous in the underlying parameters and variables. It is not globally differentiable in parameters and variables. In particular, it has kinks at the points where the identity of the non-chosen alternative that has the highest utility changes: that is, where $K(U_N)$ goes from being the utility of one non-chosen alternative to that of a different non-chosen alternative. However, these are kink points, not jumps, and consequently do not appreciably hamper numerical procedures for parameter estimation and forecasting.
Essentially, the simulated probability is based on the recognition that $P_i$ is actually a three-dimensional integral, rather than, as usually expressed, a (J-1)-dimensional integral. That is, $P_i$ is the value of $\Phi\big((\alpha - K)/\sigma\big)$, which is itself an integral, integrated over all possible values of the maximum utility for the non-chosen alternatives, K, and the conditional mean of the utility of the chosen alternative, $\alpha$:

$P_i = \int\!\!\int \Phi\big((\alpha - K)/\sigma\big)\, f(\alpha, K)\, d\alpha\, dK.$
The joint distribution of K and $\alpha$ is not normal and, to my knowledge, has not been derived. As a result, this expression cannot be integrated analytically. However, it can serve as the basis of simulation without knowing the distribution explicitly, since draws from this distribution can be obtained from the distribution of utilities for the non-chosen alternatives. That is, values of $\alpha$ and K are drawn from their joint distribution by drawing values of the utilities of the non-chosen alternatives from their joint distribution and calculating the associated values of K and $\alpha$.
The proposed simulator is closely related to accept/reject methods (Lerman and Manski, 1981) and maintains their conceptual simplicity. As in accept/reject, the utility of each non-chosen alternative is simulated; however, instead of simulating the utility of the chosen alternative and determining whether it exceeds that of the non-chosen alternatives (as in accept/reject), the probability of this event is calculated explicitly. The accept/reject probability can be zero for a finite number of draws, if no "accept" is obtained. Also, the relation between the accept/reject probability and the underlying parameters and variables is a step function: as a parameter or variable changes, there is no change in the accept/reject probability until a simulation moves from being a "reject" to an "accept" or vice versa. Numerical procedures for parameter estimation and for forecasting the impact of changes in explanatory variables (such as calculating arc elasticities numerically) are hampered as a result. These limitations are avoided in the proposed procedure by recognizing that, for any values of the utilities of the non-chosen alternatives, there is a strictly positive and continuous probability that the utility of the chosen alternative is higher than those of the non-chosen alternatives.
The procedure has the flavor of the GHK probability simulator, in that a truncated normal is calculated for one unobserved term conditional on simulated values of the other unobserved terms (Hajivassiliou and McFadden, 1992; Börsch-Supan and Hajivassiliou, 1993). However, the recursive triangulation of GHK is not followed. GHK has been found in Monte Carlo studies to perform better than other simulators, including accept/reject (Hajivassiliou, McFadden, and Ruud, 1992; Börsch-Supan and Hajivassiliou, 1993). The proposed approach provides two computational simplifications relative to GHK: (1) utilities are drawn simultaneously from an unrestricted normal rather than sequentially from truncated normals, and (2) only one factorization is required, rather than one for each chosen alternative. For GHK, the utility of the chosen alternative is subtracted from that of each non-chosen alternative, such that the covariance matrix of these utility differences, and hence the factorization used in drawing them, depends on which alternative is chosen. The utility differences are drawn sequentially, each from a truncated normal whose truncation point depends on the previously drawn utility differences. The truncation point assures that the utility difference is negative (meaning that the non-chosen alternative has lower utility than the chosen alternative), given that each previously simulated utility difference is also negative. (This last consideration captures the correlations among utility differences.) By contrast, the proposed procedure operates on utilities rather than utility differences and draws simultaneously from an unrestricted normal. The distribution from which utilities are drawn does not depend on which alternative is chosen, so that one Choleski factorization is applicable for all agents.
Finally, the procedure is conceptually similar to the Clark algorithm (Clark, 1961; Daganzo, Bouthelier, and Sheffi, 1977) in that both emphasize the comparison of the chosen alternative's utility with the maximum utility for the non-chosen alternatives. For the Clark algorithm, the notion that the maximum of two normal variables is approximately normal is applied sequentially over the non-chosen alternatives to obtain a normal distribution that approximates the true distribution of the maximum utility for the non-chosen alternatives. The probability that the chosen alternative's utility is higher than the maximum non-chosen utility is then calculated explicitly by integrating over this one-dimensional normal distribution. The proposed approach uses simulation of the maximum non-chosen utility, rather than the Clark algorithm, to reduce the dimension of integration to one. The simulations are drawn from the true distribution for the maximum non-chosen utility, though for any finite number of draws the distribution of simulated values is an approximation to the true distribution. In effect, the two procedures simply approximate the distribution of the maximum non-chosen utility in different ways. Horowitz, Sparmann, and Daganzo (1982) show that the Clark algorithm can provide a poor approximation under various circumstances. With enough draws, the proposed procedure is necessarily more accurate.
III. Simulating Random Parameters and Error Components
In many situations, the correlation in the unobserved portion of utility arises from random parameters, which Hausman and Wise (1978) give as a motivation for probit over logit. The correlations might also reflect error components that are common to all alternatives in a group, a specification that provides a probit analog to nested logit (McFadden, 1978). In these situations, probit probabilities can be simulated in a straightforward and computationally attractive way. The procedure has been recognized for some time but does not seem to have been written down. Given the multitude of empirical situations that can be meaningfully represented by random-parameter/error-component probits, and the simplicity of the simulator, an explicit description seems warranted.
Let there be J alternatives with utility vector $U = W\alpha + D\mu + \varepsilon$, where $\alpha$ is a normally distributed vector of L random parameters associated with variables W; D is a matrix composed of Q indicator variables whose element $d_{jq}=1$ if alternative j is a member of group q; $\mu$ is a vector of length Q with zero mean whose element $\mu_q$ is the common unobserved term that enters the utility of all alternatives in group q; and $\varepsilon$ is a vector of J independent normal deviates with possibly unequal variances representing the unobserved factors that are specific to each alternative separately. Correlations across alternatives in the unobserved portion of utility are captured in (a) the common effects $\mu$ for alternatives within a group and (b) the correlations in explanatory variables across alternatives, which enter the unobserved portion of utility because the coefficients $\alpha$ are random.
The error components are actually random coefficients on dummy variables that identify groups of alternatives, with the means of the coefficients constrained to zero. The utility vector can therefore be expressed more succinctly as $U = X\beta + \varepsilon$, where X is a matrix of K explanatory variables for the J alternatives, including dummies for groups of alternatives, such that K=L+Q; and $\beta$ is distributed normal with mean b (with some elements perhaps constrained to zero) and covariance $\Lambda$. The proposed procedure is to simulate $\beta$ (which is the source of correlation among unobserved utility) as well as the element of $\varepsilon$ for the chosen alternative, and then explicitly integrate over the remaining elements of $\varepsilon$. Since these elements of $\varepsilon$ are independent normals, this remaining integration is computationally simple. The simulated probabilities are strictly positive and smooth (continuous and twice-differentiable) in the underlying parameters and variables.
Let M be a Choleski factor of $\Lambda$. (If $\Lambda$ is diagonal, representing no covariances among the random parameters and error components, then the elements of M are the square roots of the elements of $\Lambda$; of course, even with diagonal $\Lambda$, the unobserved portion of utility is correlated over alternatives.) Define T as the diagonal matrix whose elements are the standard deviations of the elements of $\varepsilon$. The utility vector can be expressed as:
$U = Xb + XM\eta + T\mu$
where $\eta$ and $\mu$ are vectors of iid standard normal deviates, with lengths K and J, respectively. Subscripts are used to denote rows of these arrays, such that the utility of alternative j is $U_j = X_j b + X_j M\eta + t_j\mu_j$, where $t_j$ is the standard deviation of $\varepsilon_j$.
The probability that alternative i is chosen is:
$P_i = \mathrm{Prob}(U_i > U_j\ \ \forall\, j\neq i)$

$= \int\!\!\int 1\big(X_i b + X_i M\eta + t_i\mu_i > X_j b + X_j M\eta + t_j\mu_j\ \ \forall\, j\neq i\big)\, f(\mu)\, f(\eta)\, d\mu\, d\eta$

$= \int\!\!\int \mathrm{Prob}\big[t_j\mu_j < (X_i b + X_i M\eta + t_i\mu_i) - (X_j b + X_j M\eta)\ \ \forall\, j\neq i \mid \eta, \mu_i\big]\, f(\mu_i)\, f(\eta)\, d\mu_i\, d\eta$

$= \int\!\!\int \textstyle\prod_{j\neq i} \Phi(ARG_{i\text{-}j})\, f(\mu_i)\, f(\eta)\, d\mu_i\, d\eta$

where $ARG_{i\text{-}j} = \big((X_i b + X_i M\eta + t_i\mu_i) - (X_j b + X_j M\eta)\big)/t_j$ and the product is over all $j\neq i$. The partitioning of errors occurs in the third equality: the probability $\mathrm{Prob}[\cdot]$ is the explicit integral over $\mu_j$ for all $j\neq i$, conditional on the values of $\eta$ and $\mu_i$. As given in the last line, this integral is easy to calculate and, since $\eta$ and $\mu_i$ are independent standard normal, drawing from their distributions is easy.
This expression for $P_i$ is simulated as follows. (1) Draw a value of $\eta$ and of $\mu_i$; that is, draw K+1 independent standard normal deviates. (2) Calculate $\Phi(ARG_{i\text{-}j})$ for all $j\neq i$, and take their product. (3) Repeat this process for numerous draws and average the results. The simulated probability obtained in this way is, with as few as one draw, unbiased for the true probability, strictly positive, and continuous and twice-differentiable in the underlying parameters and variables.
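A sketch of steps (1)-(3) follows, again assuming Python with numpy/scipy; the names are illustrative. Here X is the $J \times K$ variable matrix, b and M are the parameters, and t is the vector of standard deviations of $\varepsilon$ (a numpy array).

```python
import numpy as np
from scipy.stats import norm

def simulate_prob_random_params(X, b, M, t, i, R, rng):
    # X: (J x K) variables; b: mean parameters (K,); M: Choleski factor
    # of Lambda (K x K); t: std. deviations of epsilon (J,); i: chosen alt.
    J, K = X.shape
    nonchosen = [j for j in range(J) if j != i]
    total = 0.0
    for _ in range(R):
        eta = rng.standard_normal(K)         # step (1): draw eta ...
        mu_i = rng.standard_normal()         # ... and mu_i (K+1 deviates in all)
        V = X @ (b + M @ eta)                # systematic plus random-parameter part
        arg = (V[i] + t[i] * mu_i - V[nonchosen]) / t[nonchosen]
        total += norm.cdf(arg).prod()        # step (2): product of Phi(ARG_{i-j})
    return total / R                         # step (3): average over draws
```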
Denote with a superscript r the value of $ARG_{i\text{-}j}$ obtained with the r-th draw of $\eta$ and $\mu_i$, and define $S_i^r \equiv \prod_{j\neq i}\Phi(ARG_{i\text{-}j}^r)$. Then the simulated probability is $SP_i = (1/R)\sum_r \big[\prod_{j\neq i}\Phi(ARG_{i\text{-}j}^r)\big] = (1/R)\sum_r S_i^r$. The derivatives of the simulated probability with respect to the underlying parameters, with $\phi(\cdot)$ denoting the standard normal density, are:

$\partial SP_i/\partial b = (1/R)\sum_r S_i^r \sum_{j\neq i} \big[\phi(ARG_{i\text{-}j}^r)/\Phi(ARG_{i\text{-}j}^r)\big]\,\big[(X_i - X_j)/t_j\big]$

$\partial SP_i/\partial M = (1/R)\sum_r S_i^r \sum_{j\neq i} \big[\phi(ARG_{i\text{-}j}^r)/\Phi(ARG_{i\text{-}j}^r)\big]\,\big[(X_i - X_j)/t_j\big]\,\eta^r$

$\partial SP_i/\partial t_j = -(1/R)\sum_r S_i^r\, \big[\phi(ARG_{i\text{-}j}^r)/\Phi(ARG_{i\text{-}j}^r)\big]\,\big[ARG_{i\text{-}j}^r/t_j\big] \quad\text{for } j\neq i$

$\partial SP_i/\partial t_i = (1/R)\sum_r S_i^r \sum_{j\neq i} \big[\phi(ARG_{i\text{-}j}^r)/\Phi(ARG_{i\text{-}j}^r)\big]\,\big[\mu_i^r/t_j\big].$
To aid numerical search, the second derivatives can also be calculated.
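For illustration, the probability and its gradient with respect to b can be computed from the same draws; the sketch below (hypothetical names, numpy/scipy assumed) implements the formula for $\partial SP_i/\partial b$ given above. Comparing its second output against a finite difference of the first provides a useful check.

```python
import numpy as np
from scipy.stats import norm

def simulate_prob_and_grad_b(X, b, M, t, i, R, rng):
    # Returns SP_i and dSP_i/db from the same draws of eta and mu_i.
    # (A production version would guard against Phi values near zero.)
    J, K = X.shape
    nc = np.array([j for j in range(J) if j != i])   # non-chosen alternatives
    P, grad = 0.0, np.zeros(K)
    for _ in range(R):
        eta = rng.standard_normal(K)
        mu_i = rng.standard_normal()
        V = X @ (b + M @ eta)
        arg = (V[i] + t[i] * mu_i - V[nc]) / t[nc]
        cdf = norm.cdf(arg)
        S_r = cdf.prod()                             # S_i^r = prod_j Phi(ARG_{i-j}^r)
        # sum over j != i of [phi/Phi] * (X_i - X_j)/t_j, weighted by S_i^r
        w = norm.pdf(arg) / cdf / t[nc]
        grad += S_r * (w[:, None] * (X[i] - X[nc])).sum(axis=0)
        P += S_r
    return P / R, grad / R
```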
This simulator is essentially an example of Stern's (1992) decomposition simulator that utilizes the special structure of a random-parameters/error-components model. Stern suggests, for a utility vector U distributed normal with covariance matrix $\Omega$, finding a $\lambda$ such that U can be decomposed into $Y + W$, where Y is normal with covariance $\lambda^2 I$ and W is normal with covariance $\Omega - \lambda^2 I$. That is, the utility vector is decomposed into a part that carries the correlation and a part that is uncorrelated over alternatives. Given this decomposition, the values of W are simulated through repeated draws from its distribution, and the integration over Y given W is performed explicitly. This integration is easy to perform numerically since the elements of Y are independent. For general $\Omega$, the Stern decomposition can be difficult, since finding a $\lambda$ that provides a positive definite $\Omega - \lambda^2 I$ is computationally burdensome, and the task must be performed for each trial value of the parameters. As pointed out by Hajivassiliou, McFadden, and Ruud (1992), the accuracy of the general Stern simulator deteriorates when the number of alternatives is large and the eigenvalues of $\Omega$ are uneven. This difficulty is most likely the reason that the Stern simulator performed poorly in the Monte Carlo experiments of these authors (since, given $\lambda$, the simulator is very fast to compute). However, when $\Omega$ takes the form implied by a random-parameters/error-components model, calculation of $\lambda$ is not needed: the structure of the model provides a natural decomposition. The computationally difficult part of the Stern simulator is avoided while its extremely fast aspects are retained. Since many empirical situations can be adequately represented by random parameters and error components, the numerical and conceptual ease of the approach in these situations is particularly appealing.
The simulator allows straightforward generalization to other distributions for the error terms. The parameters $\beta$ can have any distribution that allows $\beta$ to be expressed as $b + M\eta$, with b and M being parameters and $\eta$ being a vector of random variables whose distribution (1) does not depend on unknown parameters and (2) is easy to draw from. Multivariate normal $\beta$ can obviously be expressed in this way, with the elements of $\eta$ being standard normal deviates. Other possibilities for $\eta$ are gamma, which is useful when the parameters are known to take only one sign, and uniform, which is useful primarily because of its simplicity. The independent random elements $\varepsilon$ can have other distributions that allow for simple integration (either numerical or analytic) conditional on $\beta$. A prime candidate is iid extreme value, for which simulation is even easier than with iid normal $\varepsilon$. Conditional on $\eta$, the probability that $U_i > U_j\ \forall\, j\neq i$ is given by the logit formula, such that:

$P_i = \int \big[\exp(V_i)/\textstyle\sum_j \exp(V_j)\big]\, f(\eta)\, d\eta$

where $V_j = X_j b + X_j M\eta$. Simulation is performed by drawing values of $\eta$, calculating the logit formula in brackets for each draw, and averaging the results. This simulation can be expected to be faster than with normal $\varepsilon$ because: (1) the i-th element of $\varepsilon$ does not need to be simulated, and (2) the logit formula is easier to calculate than the product of cumulative normals.
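A sketch of this logit-kernel variant under the same assumptions follows; the max-subtraction inside the exponential is a standard numerical guard, not part of the text.

```python
import numpy as np

def simulate_prob_logit_kernel(X, b, M, i, R, rng):
    # Average the logit formula over draws of eta: P_i ~ (1/R) sum_r L_i(eta^r).
    J, K = X.shape
    total = 0.0
    for _ in range(R):
        eta = rng.standard_normal(K)      # could instead be gamma or uniform, per the text
        V = X @ (b + M @ eta)
        expV = np.exp(V - V.max())        # subtract max for numerical stability
        total += expV[i] / expV.sum()
    return total / R
```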
Non-normal distributions for the random parameters and error components do not change the derivatives, since their distribution is captured by drawing $\eta$ from a parameter-free distribution and constructing $\beta$ the same way as when $\eta$ is normal. (That is, the distribution of $\eta$ affects the values of the simulated probability but not its formula conditional on $\eta$.) The derivatives are different, of course, when the distribution of $\varepsilon$ is changed.
The simulator is applicable to panel data when the dependence over time in the unobserved portion of utility is due to parameters and/or error components that are constant over time for each agent (time dependence due to customer heterogeneity; Heckman, 1981). Let subscript n denote the agent and t the time period. The utility vector is specified as $U_{nt} = X_{nt}\beta_n + \varepsilon_{nt} = X_{nt}b + X_{nt}M\eta_n + T\mu_{nt}$. That is, the variables vary over time and agents; the parameters associated with the variables and the error components that are common to groups of alternatives (which represent the agent's "taste" for that group) vary over agents but are constant over time for each agent; and the alternative-specific errors vary independently over time and agents. Let i(nt) be the chosen alternative of agent n in period t, and let $P_n^*$ be the probability of agent n's sequence of choices. Since $\eta$ is constant over time for each agent, the choices in each period conditional on $\eta$ depend only on the value of $\mu$, which is independent over time and agents. We have
$P_n^* = \int\!\!\int \prod_t \Big[\prod_{j\neq i(nt)} \Phi(ARG_{nt,i(nt)\text{-}j})\Big]\, f(\mu_{i(nt)})\, f(\eta)\, d\eta\, d\mu_{i(nt)}$

where $ARG_{nt,i(nt)\text{-}j} = \big((X_{nt,i(nt)}b + X_{nt,i(nt)}M\eta + t_{i(nt)}\mu_{i(nt)}) - (X_{nt,j}b + X_{nt,j}M\eta)\big)/t_j$. The probability is simulated for each agent as follows. (1) Draw a value of $\eta$ and one $\mu_{i(nt)}$ for each time period. (2) Calculate $\Phi(ARG_{nt,i(nt)\text{-}j})$ for each time period, using the draw of $\mu_{i(nt)}$ for that period and the single draw of $\eta$ for all time periods. (3) Take the product of these cumulative normals over all non-chosen alternatives in all time periods. (4) Repeat the process for numerous draws and average the results.
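A sketch of the panel simulator under the same assumptions (names are illustrative): one draw of $\eta$ per agent held fixed across periods, and a fresh draw of $\mu_{i(nt)}$ each period.

```python
import numpy as np
from scipy.stats import norm

def simulate_panel_prob(X_seq, chosen, b, M, t, R, rng):
    # X_seq: list of (J x K) arrays, one per period; chosen: chosen index per period.
    # t: length-J numpy array of alternative-specific standard deviations.
    K = len(b)
    total = 0.0
    for _ in range(R):
        eta = rng.standard_normal(K)              # tastes: constant over the agent's periods
        prob = 1.0
        for X_t, i_t in zip(X_seq, chosen):
            mu_it = rng.standard_normal()         # chosen alternative's error, redrawn each period
            V = X_t @ (b + M @ eta)
            nc = [j for j in range(X_t.shape[0]) if j != i_t]
            arg = (V[i_t] + t[i_t] * mu_it - V[nc]) / t[nc]
            prob *= norm.cdf(arg).prod()          # product over non-chosen alternatives
        total += prob                             # product over periods, then average
    return total / R
```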
References
Bolduc, D., 1992, "Generalized Autoregressive Errors in the Multinomial Probit Model," Transportation Research, Vol. 26B, No. 2, pp. 155-170.
Börsch-Supan, A., and V. Hajivassiliou, 1993, "Smooth Unbiased Multivariate Probability Simulators for Maximum Likelihood Estimation of Limited Dependent Variable Models," Journal of Econometrics, Vol. 58, pp. 347-368.
Clark, C., 1961, "The Greatest of a Finite Set of Random Variables," Operations Research, Vol. 9, pp. 145-162.
Daganzo, C., F. Bouthelier, and Y. Sheffi, 1977, "Multinomial Probit and Qualitative Choice: A Computationally Efficient Algorithm," Transportation Science, Vol. 11, pp. 338-358.
Hajivassiliou, V., and D. McFadden, 1992, "The Method of Simulated Scores for the Estimation of LDV Models," forthcoming, Econometrica.
Hajivassiliou, V., D. McFadden, and P. Ruud, 1992, "Simulation of Multivariate Normal Rectangle Probabilities and their Derivatives: Theoretical and Computational Results," forthcoming, Journal of Econometrics.
Hausman, J., and D. Wise, 1978, "A Conditional Probit Model for Qualitative Choice: Discrete Decisions Recognizing Interdependence and Heterogeneous Preferences," Econometrica, Vol. 46, No. 2, March, pp. 403-426.
Heckman, J., 1981, "Statistical Models for Discrete Panel Data," in C. Manski and D. McFadden, eds., Structural Analysis of Discrete Data with Econometric Applications, MIT Press.
Horowitz, J., J. Sparmann, and C. Daganzo, 1982, "An Investigation of the Accuracy of the Clark Approximation for the Multinomial Probit Model," Transportation Science, Vol. 16, pp. 383-401.
Lerman, S. and C. Manski, 1981, "On the Use of Simulated Frequencies to Approximate Choice Probabilities," in C. Manski and D. McFadden, eds., Structural Analysis of Discrete Data with Econometric Applications, MIT Press.
McFadden, D., 1978, "Modelling the Choice of Residential Location," in A. Karlqvist et al., eds., Spatial Interaction Theory and Planning Models, Amsterdam: North-Holland Publishing Company.
McFadden, D., 1989, "A Method of Simulated Moments for Estimation of the Multinomial Probit Without Numerical Integration," Econometrica, Vol. 57, pp. 995-1026.
Stern, S., 1992, "A Method for Smoothing Simulated Moments of Discrete Probabilities in Multinomial Probit Models," Econometrica, Vol. 60, pp. 943-952.