ABSTRACT: Spatial econometrics has become a prominent topic in the recent scientific literature, and it is now used in research as well as in undergraduate and graduate econometrics teaching. GeoDaSpace is a software package for the estimation and testing of spatial econometric models in an intuitive, easy-to-use point-and-click environment. It is still an alpha release, freely downloadable from the GeoDa Center (Arizona State University), and it incorporates a wide range of estimation methods (OLS, 2SLS, ML, GM/GMM) and models (spatial lag, spatial error, spatial lag and error, spatial regimes), with options for spatial and non-spatial diagnostics, non-spatial endogenous variables and heteroskedasticity/HAC covariance estimators. GeoDaSpace is a useful teaching resource for both teachers and students.
Keywords: spatial autocorrelation models, endogenous variables, heteroskedasticity, spatial regimes, GeoDaSpace.
RESUMEN: La econometría espacial se ha convertido en un tema de gran relevancia en la literatura científica reciente. Por este motivo, se está empleando no sólo en la investigación sino también en la docencia, tanto de cursos de pregrado como de posgrado. GeoDaSpace es un paquete informático especializado en la estimación y contraste de los modelos econométricos espaciales dentro del entorno intuitivo y amigable del "apuntar-y-cliquear". Se trata aún de una versión alfa, que está disponible gratuitamente desde el GeoDa Center (Universidad del Estado de Arizona), que incorpora una amplia gama de métodos de estimación (MCO, MC2E, MV, GM/GMM) y modelos (retardo espacial, error espacial, retardo y error espacial, regímenes espaciales), con opciones para la obtención de contrastes espaciales y no espaciales, tratamiento de regresores estocásticos no espaciales y de estimaciones robustas a la heteroscedasticidad en los errores (HAC). GeoDaSpace es un recurso docente útil que puede ser utilizado tanto por estudiantes como profesores.
Palabras clave: modelos de autocorrelación espacial, variables endógenas, heteroscedasticidad, regímenes espaciales, GeoDaSpace.
(ProQuest: ... denotes formulae omitted.)
1. Introduction
The explosive diffusion of geographic information systems (GIS) technology and the associated availability of geo-coded socioeconomic data sets have created a need for specialized methods to deal with the distinguishing characteristics of such geographic data. For this reason, spatial econometrics, which specializes in the treatment of spatial data, has become a prominent topic in all branches of the social sciences and economics, particularly in regional science, urban and real estate economics, economic geography and environmental economics (see for example Anselin et al., 2004 for a review). Accordingly, spatial regression techniques are now becoming an established component of the applied econometrics toolbox, as witnessed by the increasing attention given to this topic in standard econometrics textbooks (Maddala, 2001; Wooldridge, 2002; Gujarati, 2003; Kennedy, 2003; Baltagi, 2008). These methods are also being incorporated into the formal curricula of undergraduate and postgraduate university programs as part of the econometrics courses in economics, business administration, marketing, geography, environmental studies or epidemiology, among others.
GeoDaSpace is a useful teaching resource that can be used by both teachers and students in the classroom. This software package, which is still an alpha release freely downloadable from the GeoDa Center at Arizona State University (https://geodacenter.asu.edu), is a stand-alone program designed for the estimation and testing of spatial econometric models in an intuitive, easy-to-use point-and-click environment. The package is part of the PySAL project developed by Professors Luc Anselin and Serge Rey (Rey and Anselin, 2010), from the GeoDa Center, as an open source, cross-platform modular library of spatial analytical functions (running on Windows, Macintosh and Linux) written in the Python scripting language (http://pysal.org). PySAL is conceived as the foundational framework to deliver spatial analytical functionality in many different forms. In particular, GeoDaSpace is the front-end GUI (Graphical User Interface) for the spatial econometric routines contained in the PySAL library (Anselin, 2012). These routines include the functionality for input/output data formats (csv, dbf, shp), the creation and transformation of spatial weight matrices and the regression modules, which include state-of-the-art spatial econometric estimation methods (OLS, 2SLS, ML¹, GM/GMM) and models (spatial lag, spatial error, spatial lag and error, spatial regimes), with options for spatial and non-spatial diagnostics, non-spatial endogenous variables and heteroskedasticity/HAC covariance estimators.
Since GeoDaSpace is intended for users who are somewhat familiar with the methods but prefer a point-and-click environment to the command line, it offers only the most common options, fewer than other spatial econometrics routines available in Stata (Drukker et al. 2013), Matlab (LeSage and Pace 2009) or R (Bivand 2002, Piras 2010).
The rest of the paper is organized as follows. Section 2 provides some background information on basic characteristics of the design and functionality of GeoDaSpace. Sections 3 and 4 present the spatial weight matrix and spatial regression menus, which are the bulk of this program. Section 5 illustrates the previous items with an empirical application. Section 6 provides a brief summary and some concluding remarks.
2. Design and functionality
GeoDaSpace is geared to the econometric analysis of discrete geospatial data, that is, objects characterized by their location in space either as points (point coordinates) or polygons (polygon boundary coordinates). Unlike its predecessor, GeoDa, GeoDaSpace is not designed as an interactive environment that combines maps with statistical graphs through dynamically linked windows. The full set of functions of this software is listed in Table 1 and can be classified into three categories:
* spatial data utilities: data input and output,
* spatial weight matrix: creation and manipulation,
* spatial regression: estimation and diagnostics of linear spatial regression models.
The functionality of GeoDaSpace is invoked by clicking toolbar buttons located in different parts of the main screen, as illustrated in Figure 1. Once a data file is invoked, the program opens a new menu with the complete set of variables, which must be dragged to the corresponding boxes in the specification menu.
There are six toolbars with different subdivisions:
(1) Main menu, in the upper part of the main screen:
(a) Open (new or existing) model files.
(b) Save model files.
(c) Open the variable list.
(d) Show the results window.
(e) Show advanced settings for the model estimation:
(i) Computation of the standard deviation of the coefficients.
(ii) GMM settings.
(iii) Instruments settings.
(iv) Output files options.
(v) Spatial regime models.
(vi) Other estimation options.
(2) Data file menu, which allows opening dBase (dbf) and Comma Separated Values (csv) input databases.
(3) Model spatial weights matrix menu:
(a) Creation of a spatial weights matrix:
(i) Input map file (shp).
(ii) Select/add an ID variable for the weights file.
(iii) Contiguity matrix: queen, rook, higher orders of contiguity, inclusion of lower orders.
(iv) Distance-based matrix: distance metrics (Euclidean, arc distance in miles and kilometres), k-nearest neighbors (number of neighbors), binary distance band (cut-off point) and inverse distance (power and cut-off point).
(b) Open an existing spatial weights matrix: ArcGIS (dbf, swm, txt), dat, gal, geoBUGS, gwt, kwt, MatLab, MatrixMarket, STATA text files.
(c) Properties for selected weights: name, transform (B: binary, R: row-standardization, which is the default, D: double standardizations, V: variance stabilizing, O: restore original transformation), islands, neighbors of each observation, cardinalities, ids, histogram and neighborhood map viewer.
(4) Kernel spatial weights matrix menu:
(a) Creation of a kernel weights matrix:
(i) Input map file (shp).
(ii) Select/add an ID variable for the weights file.
(iii) Adaptive kernels: distance metrics (Euclidean, arc distance in miles and kilometres), kernel function (uniform, triangular, Epanechnikov or quadratic, quartic or bisquare, Gaussian), number of neighbors.
(b) Open an existing spatial weights matrix: ArcGIS (dbf, swm, txt), dat, gal, geoBUGS, gwt, kwt, MatLab, MatrixMarket, STATA text files.
(c) Properties for selected weights: name, transform (B: binary, R: row-standardization, which is the default, D: double standardizations, V: variance stabilizing, O: restore original transformation), islands, neighbors of each observation, cardinalities, ids, histogram and neighborhood map viewer.
(5) Model specification menu:
(a) Dependent variable box.
(b) Independent variables box.
(c) Endogenous variables (regressors) box.
(d) Instrumental variables box.
(e) Regime indicators (dummy) variable box.
(6) Model estimation:
(a) Model type: standard, spatial lag, spatial error, spatial lag+error.
(b) Method: OLS (2SLS), spatial GMM (and ML in the near future).
(c) Standard errors: White, HAC, Kelejian-Prucha HET.
In the following sections, some applications are highlighted, focusing on some distinctive features of this software: spatial weight matrices, specification of the model and estimation methods.
3. Spatial weight matrix
One of the major distinguishing characteristics of spatial data analysis (as opposed to a mere non-spatial analysis) is that the spatial arrangement of the observations is taken into account. This is formally expressed in a spatial weights matrix, W, with elements $w_{ij}$, where the ij index corresponds to each observation pair. The conceptual idea of spatial weights is that in this n × n matrix the diagonal elements ($w_{ii}$) are set to zero by definition and the non-zero off-diagonal cells ($w_{ij}$) capture the potential for spatial interaction. That is to say, for a spatial data set composed of n locations (points, areal units, network edges, etc.), the spatial weights matrix expresses the potential for interaction between observations at each pair ij of locations. There is a rich variety of ways to specify the structure of these weights, and GeoDaSpace supports the creation, manipulation and analysis of spatial weights matrices across three general types (see http://pythonhosted.org/PySAL/users/tutorials/weights.html):
* Contiguity based weights: ij locations interact when sharing a common border.
* Distance based weights: ij locations interact when being within a critical distance band.
* Kernel weights: only ij locations within a critical distance band interact, though following a distance-decay function (triangular, uniform, quadratic, etc.)
For the moment, only the first two types (contiguity and distance based) can be used in GeoDaSpace for the specification of spatial models (spatial lag, spatial error, spatial lag+error) and the LM tests. The kernel spatial weights matrices can be used in the specification of the spatial HAC covariance matrix.
Spatial weight matrices tend to be fairly sparse (i.e. many cells contain zeros), and hence a full n × n array would not be an efficient representation, especially for large datasets, because of computer memory constraints. For this reason, GeoDaSpace increases computation speed by storing the spatial weights matrix in a compact format in which only the spatial neighbors of each observation (each identified by its ID) are listed, together with their corresponding weight values. For example, in the case of a simple contiguity weight matrix, if observation '03' has three neighbors sharing a common border (observations '01', '02' and '04'), their corresponding weight value is 0.33. Instead of including the complete set of zero-one values for all the observations present in the system, the compact spatial weight matrix only lists the following entry for observation '03': {'01': 0.33, '02': 0.33, '04': 0.33}
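To make the storage argument concrete, the following minimal sketch contrasts the compact entry for observation '03' with the full row it stands for. The five-unit ID list and the helper `dense_row` are illustrative assumptions, not part of GeoDaSpace or PySAL.

```python
# Compact (sparse) storage: each observation maps to its neighbors and weights.
# Only the row for observation '03' from the text is filled in; the list of
# five IDs is a hypothetical system used for illustration.
ids = ["01", "02", "03", "04", "05"]
compact = {"03": {"01": 1 / 3, "02": 1 / 3, "04": 1 / 3}}


def dense_row(weights, focal, all_ids):
    """Expand one compact row into the corresponding row of the n x n matrix."""
    neighbors = weights.get(focal, {})
    return [neighbors.get(j, 0.0) for j in all_ids]


row_03 = dense_row(compact, "03", ids)
stored = len(compact["03"])  # values kept in the compact form
implied = len(row_03)        # values a dense row would hold
print(row_03, stored, implied)
```

In this toy case the compact form keeps three values where the dense row holds five; the saving grows with n, since the number of neighbors per unit usually stays small.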
3.1. Contiguity based spatial weights matrix
When working with polygonal spatial data, contiguity is defined as having a common border. In the specific case of data arranged on a regular square or rectangular lattice (or grid), the contiguity structure can be defined in three ways: having a common border (rook criterion), having a common corner (bishop criterion), and having either a border or a corner in common (queen criterion). Typically, this results in four contiguous grid cells according to the rook or bishop criteria, and eight contiguous grid cells following the queen criterion. GeoDaSpace only includes the rook and queen options, since the bishop matrix can be computed as the difference between the queen and the rook cases.
To illustrate this issue, we construct a contiguity matrix for the set of seven districts located in downtown Madrid (Spain). In the model weights menu of GeoDaSpace, we must select the name of the map file where the district polygons are stored, as well as the option for the weights matrix: contiguity - queen. Finally, the familiar Windows dialog requests the file name of the GAL (Geographic Algorithm Library) weights matrix, ALMONDIS.gal, and the name of this file then appears in the model weights menu box. In the weights properties editor menu, it is possible to obtain (and edit) some useful information about the selected spatial weights matrix, as shown in Figure 2.
Besides the name of the file and the ID variable, the menu shows the type of transformation of the weights matrix, which is by default R: row-standardization (global sum = n). Row standardization of the weights matrix is necessary to yield a meaningful interpretation of the results of many spatial statistics, and particularly for the estimation of spatial regression models, since they are all built with spatial weight matrices. Row standardization consists of dividing each element in a row by the corresponding row sum. Each element in the new matrix thus becomes:
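The three lattice criteria can be illustrated with a short sketch. It assumes cells of a regular grid numbered row by row from zero; the function name `lattice_neighbors` is hypothetical and is not the routine used by GeoDaSpace.

```python
def lattice_neighbors(n_rows, n_cols, criterion="queen"):
    """Neighbor lists on a regular grid; cells are numbered row by row."""
    rook_steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]       # shared borders
    bishop_steps = [(-1, -1), (-1, 1), (1, -1), (1, 1)]   # shared corners
    steps = {
        "rook": rook_steps,
        "bishop": bishop_steps,
        "queen": rook_steps + bishop_steps,
    }[criterion]
    neighbors = {}
    for r in range(n_rows):
        for c in range(n_cols):
            neighbors[r * n_cols + c] = sorted(
                (r + dr) * n_cols + (c + dc)
                for dr, dc in steps
                if 0 <= r + dr < n_rows and 0 <= c + dc < n_cols
            )
    return neighbors


rook = lattice_neighbors(3, 3, "rook")
queen = lattice_neighbors(3, 3, "queen")
# Bishop contiguity recovered as the difference between queen and rook.
bishop_from_difference = {i: sorted(set(queen[i]) - set(rook[i])) for i in queen}
print(len(rook[4]), len(queen[4]), bishop_from_difference[4])
```

For the interior cell of a 3 × 3 grid, this gives four rook neighbors, eight queen neighbors and four bishop neighbors, in line with the counts stated above; edge and corner cells have fewer.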
... (1)
Next, for each observation, the menu indicates its corresponding neighbors. Since there are 7 districts in downtown Madrid, one might expect the weights to be stored in a 7 × 7 matrix. Nevertheless, due to the sparseness of these matrices, the contiguity relations are stored in a compact form and only the non-zero elements (and weight values) of each row are shown. In our example, district '01' has 4 neighboring districts with either a border or a corner in common (queen criterion): '02', '03', '04' and '07', so that $w_{ij} = 0.25$ (one fourth each). This interaction between district '01' and its 4 neighbors can be visualized with the neighbourhood map viewer (Figure 1).
It is also possible to obtain information about the existence of islands (unconnected observations) and the cardinality of the neighbour relations, i.e. the number of neighbors of each district: {'02': 2, '03': 3, '01': 4, '06': 2, '07': 4, '04': 4, '05': 3}. In this case, there are no islands, since all the districts have at least one neighbour (in fact, the least connected districts are '02' and '06', with two neighbors each). Finally, there is also a connectivity histogram attribute, which is a set of tuples giving the number of observations for each cardinality. In this case: [(2, 2), (3, 2), (4, 3)], which means that there are 2 districts with 2 neighbors ('02' and '06'), another 2 districts with 3 neighbors ('03' and '05'), and 3 districts with 4 neighbors ('01', '04' and '07'). This histogram is very important for detecting strange features of the distribution, which may affect spatial autocorrelation statistics and spatial regression specifications. Two features in particular warrant some attention. One is the occurrence of islands, the other a bimodal distribution, with some locations having very few (such as one) and others very many neighbors (Anselin, 2005).
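The row standardization described above can be sketched on the compact representation as follows. The helper `row_standardize` is a hypothetical name, and only the neighbor list of district '01' reported in the text is used.

```python
def row_standardize(neighbors):
    """Divide every weight by its row sum; rows without neighbors stay empty."""
    standardized = {}
    for i, row in neighbors.items():
        row_sum = sum(row.values())
        standardized[i] = {j: w / row_sum for j, w in row.items()} if row_sum else {}
    return standardized


# Binary queen contiguity for district '01', as reported in the text.
binary = {"01": {"02": 1.0, "03": 1.0, "04": 1.0, "07": 1.0}}
w = row_standardize(binary)
print(w["01"])
```

Each of the four neighbors receives a weight of 0.25, so the row sums to one, and with n rows the global sum equals n.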
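As a small check of the figures above, the connectivity histogram, the islands and the least connected districts can be derived directly from the cardinalities reported for the seven Madrid districts. This is a plain-Python sketch, not the PySAL attribute itself.

```python
from collections import Counter

# Cardinalities (number of neighbors) of the seven districts, as given in the text.
cardinalities = {"02": 2, "03": 3, "01": 4, "06": 2, "07": 4, "04": 4, "05": 3}

# Histogram: (number of neighbors, number of districts with that many neighbors).
histogram = sorted(Counter(cardinalities.values()).items())

# Islands are observations without any neighbor.
islands = sorted(i for i, k in cardinalities.items() if k == 0)

fewest = min(cardinalities.values())
least_connected = sorted(i for i, k in cardinalities.items() if k == fewest)

print(histogram, islands, least_connected)
```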
3.2. Distance based spatial weights matrix
Interactions or neighborhood relations between spatial units can also be defined as a function of the distance that separates them, using information on latitude (X coordinate) and longitude (Y coordinate)². Two units are then considered to be neighbors if they are less than a specified critical distance apart. In practice, when working with polygonal spatial units, this distance is computed between their centroids or other meaningful points (e.g. mean centers, capital cities, etc.). GeoDaSpace provides two distance metrics: Euclidean distance (based on the straight line) and arc distance on a sphere, or great circle distance (in miles or kilometres), which is the appropriate measure when the spatial scale of the data analysis is global. The arc distance $d_{ij}$ between two locations i and j is calculated as follows:
... (2)
where X and Y are first transformed to radians, as $X = (90 - LAT)\,\pi/180$ and $Y = LON\,\pi/180$.
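Since the arc distance expression itself is omitted in this version of the text, the sketch below should be read as an assumption: it uses the standard spherical law of cosines with latitude converted to colatitude, X = (90 − LAT)·π/180, and longitude to Y = LON·π/180. The Earth radius of 6371 km is likewise an assumed constant.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def arc_distance(lat_i, lon_i, lat_j, lon_j, radius=EARTH_RADIUS_KM):
    """Great circle distance between two points given in decimal degrees."""
    x_i = math.radians(90.0 - lat_i)  # colatitude of i, in radians
    x_j = math.radians(90.0 - lat_j)  # colatitude of j, in radians
    y_i = math.radians(lon_i)
    y_j = math.radians(lon_j)
    cos_angle = (math.cos(x_i) * math.cos(x_j)
                 + math.sin(x_i) * math.sin(x_j) * math.cos(y_i - y_j))
    # Guard against rounding slightly outside [-1, 1] before taking acos.
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return radius * math.acos(cos_angle)


quarter_circle = arc_distance(0.0, 0.0, 0.0, 90.0)
print(round(quarter_circle, 1))
```

Two points on the equator that are 90 degrees of longitude apart are a quarter of a great circle from each other, which offers a simple sanity check for the function.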
In GeoDaSpace, three types of distance based spatial weights matrices can be constructed:
(1) Distance band weights: these yield a simple contiguity matrix based on a critical distance cut-off point; i.e. the neighbour set of each spatial unit consists of the units falling within a threshold distance (or distance band) of the focal unit. By default, the software proposes the largest of the nearest-neighbour distances (that of the most isolated unit), which is the minimum threshold that ensures at least one neighbour per unit. In this case, the number of neighbors is likely to vary across observations.
(2) K-nearest neighbors (knn) weights: these also yield a simple contiguity matrix, taking as neighbors of a given unit its k closest locations, with k (1, 2, 3, etc.) defined in advance. This option is more appropriate when the minimum threshold distance is driven by a few points that lie far away from the others. In such cases, that threshold is not representative of the rest of the distribution, whereas the knn weights matrix ensures the same number of neighbors for all observations.
(3) Inverse distance weights: these use any integer power of the inverse distance between two observations as the weights. This matrix can be full (i.e. computed for the whole set of units), or it can be specified as a distance band weights matrix taking on continuous values (rather than binary, as in the first case), with the values set to the inverse distance separating each pair within a given threshold distance. This can be used to construct measures of potential interaction between two observations as squared inverse distance weights, in accordance with the gravity model of spatial interaction: $w_{ij} = 1/d_{ij}^{2}$.
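The three distance-based constructions listed above can be sketched as follows. The four point coordinates and all function names are hypothetical; Euclidean distance is used, and the band is taken as inclusive of the threshold so that the default cut-off leaves no unit without neighbors.

```python
import math

# Hypothetical point coordinates (e.g. centroids), for illustration only.
points = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (0.0, 2.0), "d": (5.0, 0.0)}


def pairwise(pts):
    """Euclidean distances from each point to every other point."""
    return {i: {j: math.hypot(p[0] - q[0], p[1] - q[1])
                for j, q in pts.items() if j != i}
            for i, p in pts.items()}


def default_threshold(dist):
    """Largest nearest-neighbor distance: smallest band with no islands."""
    return max(min(row.values()) for row in dist.values())


def distance_band(dist, threshold):
    """Binary weights for pairs within the threshold (inclusive)."""
    return {i: {j: 1.0 for j, d in row.items() if d <= threshold}
            for i, row in dist.items()}


def knn(dist, k):
    """Binary weights for the k closest units of each observation."""
    return {i: {j: 1.0 for j in sorted(row, key=row.get)[:k]}
            for i, row in dist.items()}


def inverse_distance(dist, threshold, power=2):
    """Continuous weights 1 / d**power for pairs within the threshold."""
    return {i: {j: 1.0 / d ** power for j, d in row.items() if d <= threshold}
            for i, row in dist.items()}


dist = pairwise(points)
threshold = default_threshold(dist)
band = distance_band(dist, threshold)
nearest_two = knn(dist, 2)
gravity = inverse_distance(dist, threshold, power=2)
print(threshold, band["d"], nearest_two["d"], gravity["a"])
```

In this toy layout the remote point 'd' drives the default threshold, which is the situation in which the text suggests switching to k-nearest neighbors.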
3.3. Kernel spatial weights matrix
Distance band weighting schemes suffer from the problem of discontinuity over the study area: it seems unnatural that the spatial association between units should end so abruptly beyond a specific cut-off point. In order to solve this problem, it is possible to specify $w_{ij}$ as a continuous and monotonically decreasing function of $d_{ij}$. Kernel functions, or kernels, have been suggested for constructing this kind of weights, in which a constant value (h) provides some control over the range of the circle of influence of each observation i (Chasco et al., 2008). They can be considered as a combination of distance-based thresholds together with continuously valued weights. The distance threshold is here equivalent to the maximum of the k-nearest neighbour distances over all observations, which constitutes what is called the 'bandwidth' attribute: $h = \max_i(dknn_i)$, where dknn is a vector of the k-nearest neighbor distances (the distance to the k-th nearest neighbor for each observation). The weights of the kernel functions are then defined as $w_{ij} = K(d_{ij}/h)$. As can be seen, h produces a decay of influence with distance.
In this case, the bandwidth (or threshold distance) h is fixed across observations, and one optimum spatial kernel is determined and applied uniformly across the study area. Such an approach, however, suffers from the potential problem that in some parts of the region, where data are sparse, the local regressions might be based on relatively few data points. To offset this problem, spatially adaptive weighting functions can be used to define different bandwidths (distances) in terms of the number or proportion of observations to retain within the weighting kernel "window", irrespective of distance: relatively small bandwidths in areas where the data points are densely distributed and relatively large bandwidths where the data points are sparsely distributed. In other words, the kernels adapt their size to variations in the density of the data, so that they have larger bandwidths where the data are sparse and smaller ones where the data are plentiful. In these cases, the bandwidth is adaptive in size.
GeoDaSpace specifies adaptive bandwidths from a predefined number of k-nearest neighbour observations (this number of neighbors can be changed by the user). In this case, the weights of the kernel functions are defined as $w_{ij} = K(d_{ij}/h_i)$, $h_i$ being the adaptive bandwidth for observation i, which produces a decay of influence with distance. The form of the kernel function determines the distance decay in the derived continuous weights. The program includes the following kernel functions, where $u = d_{ij}/h_i$:
(1) Uniform (the default in GeoDaSpace): $K(u) = 1/2$ if $|u| < 1$, and 0 otherwise
(2) Triangular: $K(u) = 1 - |u|$ if $|u| < 1$, and 0 otherwise
(3) Quadratic or Epanechnikov: $K(u) = (3/4)(1 - u^2)$ if $|u| < 1$, and 0 otherwise
(4) Quartic or bisquare: $K(u) = (15/16)(1 - u^2)^2$ if $|u| < 1$, and 0 otherwise
(5) Gaussian: $K(u) = \exp(-u^2/2)$ if $|u| < 1$, and 0 otherwise
If i and j coincide, the weighting of data at that point will be unity: $w_{ii} = 1$. The weighting of other data will decrease according to the kernel curve as the distance between i and j increases. For data a long way from i, the weighting will fall to virtually zero. Changing the bandwidth (e.g. by a change in the number of nearest neighbors) results in a different decay profile, which in turn produces weights that vary more or less rapidly over space.
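A minimal sketch of adaptive kernel weights follows. It assumes the standard textbook forms of the five kernels (uniform equal to 1/2, quartic with a squared term, Gaussian without a normalizing constant), a bandwidth equal to the distance to the k-th nearest neighbor of each observation, and a unit diagonal; the point coordinates and function names are hypothetical.

```python
import math

# Kernel forms assumed here; all are evaluated at u = d_ij / h_i.
KERNELS = {
    "uniform": lambda u: 0.5,
    "triangular": lambda u: 1.0 - abs(u),
    "quadratic": lambda u: 0.75 * (1.0 - u * u),
    "quartic": lambda u: 15.0 / 16.0 * (1.0 - u * u) ** 2,
    "gaussian": lambda u: math.exp(-u * u / 2.0),
}


def kernel_value(name, u):
    """Kernel weight for a scaled distance u; zero outside |u| < 1."""
    return KERNELS[name](u) if abs(u) < 1 else 0.0


def adaptive_kernel_weights(points, k, kernel="triangular"):
    """Kernel weights with one bandwidth per observation (k-th neighbor distance)."""
    weights = {}
    for i, p in points.items():
        dist = {j: math.hypot(p[0] - q[0], p[1] - q[1])
                for j, q in points.items() if j != i}
        h_i = sorted(dist.values())[k - 1]  # adaptive bandwidth of observation i
        row = {j: kernel_value(kernel, d / h_i) for j, d in dist.items()}
        weights[i] = {j: w for j, w in row.items() if w > 0}
        weights[i][i] = 1.0  # unit weight when i and j coincide
    return weights


# Hypothetical coordinates, for illustration only.
points = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (0.0, 2.0), "d": (5.0, 0.0)}
w = adaptive_kernel_weights(points, k=2, kernel="triangular")
print(w["a"])
```

With a strict cut-off at |u| < 1, the k-th neighbor itself sits exactly on the edge of the window and receives a zero weight; implementations often inflate the bandwidth slightly to avoid this, a refinement left out of the sketch.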
4. Spatial regression
GeoDaSpace includes two menus for spatial regression, "specification" and "model estimation", as well as the submenu "show advanced settings" in the main menu. The specification menu presents a set of boxes or windows into which the different kinds of variables are inserted: dependent, independent, endogenous, instrumental and spatial regimes (see Figure 1). The model estimation menu includes three boxes: one for the model type (spatial lag, spatial error and spatial lag+error), another for the estimation method (OLS/2SLS, spatial GMM and, in the near future, ML), and a last one for the methods that deal with both heteroskedasticity and spatial autocorrelation in the residuals (White, HAC and Kelejian-Prucha HET).
These menus focus on the implementation of tests for spatial autocorrelation in models that may include endogenous variables and on generalized method of moments (GMM) estimation. The suite of routines includes implementations of Kelejian and Prucha's recent techniques to deal with both spatial autocorrelation and heteroskedasticity, including a heteroskedasticity and autocorrelation consistent (HAC) estimator.
4.1. Non-spatial models: estimation methods and diagnosis
The traditional basic econometric model is specified as a linear relationship between a dependent variable (y) and a set of explanatory variables (X) as follows:
... (3)
where y is the dependent variable (in vector form, with N rows), X is a matrix with observations on K explanatory variables (with N rows and K columns), $\beta$ is a vector with K regression coefficients (i.e., of dimension K by 1), $\varepsilon$ is a random error term (in vector form, with N rows), $\sigma^2$ is the population error variance, and I is an identity matrix of dimension N by N.
4.1.1. Ordinary least squares (OLS) estimation of the basic model
According to a number of criteria, the method of ordinary least squares (OLS) estimation accomplishes two main objectives: finding a good match or fit between the predicted values $X\hat{\beta}$ and the observed values of the dependent variable, and determining which variables explain it significantly in the linear relationship. The OLS estimators ($\hat{\beta}$), which are found by minimizing the sum of the squared prediction errors, are BLUE (Best Linear Unbiased Estimators). In order to achieve these good properties in the estimators, we must make certain assumptions about the random error:
* The random error has mean zero (i.e., there is no systematic misspecification or bias in the population regression equation): $E(\varepsilon_i) = 0, \forall i$.
* The random error terms are uncorrelated and have a constant variance (homoskedastic): $E(\varepsilon_i \varepsilon_j) = 0, \forall i \neq j$ and $E(\varepsilon_i \varepsilon_j) = \sigma^2, \forall i = j$.
* The random error term follows a normal distribution: $\varepsilon_i \sim N(0, \sigma^2)$.
The first two assumptions are crucial to obtain the unbiasedness and efficiency of the OLS estimates, while the third one is needed in order to carry out hypothesis tests and to assess the significance of the regression coefficients. These assumptions introduce an additional parameter to be estimated (in addition to the regression coefficients $\beta$), i.e., the error variance $\sigma^2$. GeoDaSpace reports both an unbiased and a maximum likelihood estimate for $\sigma^2$, as well as their square roots (the standard deviation of the error term). It also reports some measures of fit, such as the $R^2$, the adjusted $R^2$ and some measures based on the maximum likelihood (ML) estimation method (log-likelihood, Akaike and Schwarz information criteria).
We can practice with GeoDaSpace by estimating by OLS a linear model of a production function for a sample of 1,171 European NUTS3 regions, based on the application presented in Chasco et al. (2012). In this model, the ratio of the Gross Domestic Product (GDP) per area for 2006, in logarithms (LG06), is explained by a set of "first nature" geographical variables, such as Southerly latitude (FSLAT), Westerly latitude (FSWEST), elevation over the sea level (FSEA) and a dummy variable for the presence (or not) of mineral extraction sites (miner), as well as two "second nature" (man-made agglomeration) variables, also in logarithms: population (PI06) and productivity or GDP per employee (DEL06). As a benchmark, we begin with the OLS estimation of a base model, which must be performed as shown in Figure 3: after opening the dBase file of the map (NUTS306.DBF), we must drag the variables into their corresponding boxes in the specification menu of GeoDaSpace.
By pressing the "Run" button, we can easily obtain a full summary of the results of the OLS regression in an output file that is nicely formatted and ready to be printed (Figure 4). We obtain a very significant relationship between each explanatory variable and the log of GDP per area, as pointed out by the high values (in absolute terms) of their corresponding t-statistics (and low p-values). Regarding the measures of fit, the $R^2$ indicates that only 45.57% of the variance of the dependent variable is captured by these regressors.
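The difference between the two variance estimates reported by the program can be illustrated with a small pure-Python OLS sketch. It solves the normal equations directly and uses the standard definitions, residual sum of squares divided by N − K (unbiased) or by N (maximum likelihood); the data and function names are made up for illustration and this is not the PySAL routine.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(y, x):
    """OLS via the normal equations (X'X) b = X'y, with basic fit measures."""
    n, k = len(y), len(x[0])
    xtx = [[sum(row[a] * row[b] for row in x) for b in range(k)] for a in range(k)]
    xty = [sum(row[a] * yi for row, yi in zip(x, y)) for a in range(k)]
    beta = solve(xtx, xty)
    resid = [yi - sum(b * v for b, v in zip(beta, row)) for row, yi in zip(x, y)]
    ssr = sum(e * e for e in resid)
    y_bar = sum(y) / n
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return {
        "beta": beta,
        "sigma2_unbiased": ssr / (n - k),  # divides by N - K
        "sigma2_ml": ssr / n,              # divides by N
        "r2": 1.0 - ssr / sst,
    }


# Hypothetical data: a constant term plus one regressor.
x = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.1, 2.9, 5.2, 6.8]
fit = ols(y, x)
print(fit)
```

The ML estimate is always the smaller of the two, and the gap shrinks as N grows relative to K, so with 1,171 observations the two values are practically identical.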
4.1.2. Specification diagnostics of the OLS estimation of the basic model
GeoDaSpace contains some statistics to test for the existence of potential misspecification problems in the OLS estimation of spatial linear regression models, such as multicollinearity, non-normality, heteroskedasticity and spatial autocorrelation.
(1) Multicollinearity takes place with the existence of a strong linear relation (i.e. correlation) between the explanatory variables included in the regression specification, which in principle, should be totally uncorrelated. As a consequence, the OLS estimates will have very large estimated variances and, therefore, very few coefficients will be found to be significant (low t statistics), even though the regression as a whole may seem to achieve a reasonable fit (high R2). GeoDaSpace includes the condition number that, as a rule of thumb, should take values lower than 20 or 30 (a total lack of multicollinearity yields a condition number of 1).
(2) Non-normality: normality of the error terms is the basis for most hypothesis tests and a large number of regression diagnostics. The GeoDaSpace regression output includes the results of the Jarque and Bera test (the statistic and its associated probability): a low probability indicates a rejection of the null hypothesis of a normal error.
(3) Heteroskedasticity is a common situation in cross-section regression models, where the random regression error does not have a constant variance over all observations. As a consequence, while the OLS estimates are still unbiased, they will no longer be most efficient and, more importantly, inference based on the usual t and F statistics will be misleading, and the $R^2$ measure of goodness-of-fit will be wrong. In spatial data analysis, this problem is frequently encountered when using data for irregular spatial units (with different areas) or when there are systematic regional differences in the modelled relationships (i.e., spatial regimes). Hence, an indication of heteroskedasticity may point to the need for a more explicit incorporation of spatial effects, in the form, for instance, of spatial regimes. GeoDaSpace includes some tests against heteroskedasticity, in which the null hypothesis is always homoskedasticity: the Breusch-Pagan Lagrange Multiplier (LM) test (which is not powerful for non-normal errors in small samples), its studentized version, the Koenker-Bassett test (the best option when dealing with non-normal errors), and the White test, which is robust to any unspecified form of heteroskedasticity.
(4) Spatial autocorrelation, or more generally, spatial dependence, is the situation where the dependent variable or error term at each location is correlated with observations on the dependent variable or values of the error term at other locations. The consequences of ignoring spatial autocorrelation in a regression model, when it is in fact present, depend on the form of the alternative hypothesis: either a spatial lag model (OLS estimates are biased and inference will be incorrect, as in the omitted variables problem) or a spatial error model (OLS estimates are unbiased but no longer efficient, as with heteroskedasticity). GeoDaSpace contains six tests for spatial dependence, three of which pertain to the spatial error case: an extension of Moran's I for the residuals, a simple LM test against the presence of a spatial error model and its robust version. There are also two LM tests against the presence of a spatial lag model (simple and robust versions) and a last LM test against the joint presence of both a spatial lag and a spatial error model (SARMA test).
In Figure 5, we present the specification diagnostics for the OLS estimation of the GDP per area model, which shows no multicollinearity problems. However, the residuals are clearly non-normal and heteroskedastic, and they exhibit a high degree of spatial autocorrelation in both spatial lag and spatial error forms (though the slightly higher value of the robust LM-error test seems to point to a spatial error model). In any case, the strong non-normality of the error terms could cast doubt on the LM tests, although the sample size of this model is not small. In order to compute the LM tests for spatial autocorrelation, we first created and specified a contiguity (queen) spatial weights matrix in the "model weights" menu of GeoDaSpace.
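Two of the diagnostics listed above are simple enough to sketch in plain Python: the Jarque-Bera statistic, JB = (n/6)·(S² + (K − 3)²/4), with S the skewness and K the kurtosis of the residuals, and the Moran's I coefficient, I = (n/S0)·(e'We)/(e'e), with S0 the sum of all weights. Only the statistics are computed; their p-values, and the adjustments needed for inference on regression residuals, are left out. The four-unit chain and the residual values are hypothetical.

```python
def jarque_bera(resid):
    """Jarque-Bera statistic from the sample skewness and kurtosis."""
    n = len(resid)
    mean = sum(resid) / n
    m2 = sum((e - mean) ** 2 for e in resid) / n
    m3 = sum((e - mean) ** 3 for e in resid) / n
    m4 = sum((e - mean) ** 4 for e in resid) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    return n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)


def spatial_lag(weights, values):
    """Weighted sum of neighboring values for each observation (W e)."""
    return {i: sum(w * values[j] for j, w in weights[i].items()) for i in values}


def morans_i(weights, resid):
    """Moran's I coefficient for residuals stored as {id: value}."""
    n = len(resid)
    s0 = sum(sum(row.values()) for row in weights.values())
    lag = spatial_lag(weights, resid)
    numerator = sum(resid[i] * lag[i] for i in resid)
    denominator = sum(e * e for e in resid.values())
    return (n / s0) * numerator / denominator


# Hypothetical chain of four units with row-standardized weights.
w = {1: {2: 1.0}, 2: {1: 0.5, 3: 0.5}, 3: {2: 0.5, 4: 0.5}, 4: {3: 1.0}}
alternating = {1: 1.0, 2: -1.0, 3: 1.0, 4: -1.0}  # neighbors always differ in sign
jb = jarque_bera([-1.0, 0.0, 1.0])
moran = morans_i(w, alternating)
print(jb, moran)
```

The alternating residuals represent the extreme case of negative spatial autocorrelation; clustered residuals of the same sign would instead push the coefficient toward positive values.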
4.1.3. Two-stage least squares estimation (2SLS) of the basic model with endogenous regressors
Another misspecification problem for OLS estimation in linear regression models is the non-deterministic nature of one or more explanatory variables, which are then endogenous (or stochastic) regressors, as follows:
... (4)
where Y is a matrix with observations on G endogenous explanatory variables (with N rows and G columns) and γ is a vector of G regression coefficients (of dimension G by 1).
The inclusion of endogenous variables among the regressors, as in systems of simultaneous equations, invalidates the OLS estimates, which will be biased. The instrumental variables (IV) estimator, or two-stage least squares (2SLS) estimator, is consistent, but not necessarily very efficient. The principle of instrumental variables estimation is based on the existence of a set of instruments (Q) that are strongly correlated with the original endogenous explanatory variables (Y), but asymptotically uncorrelated with the error term. Once these instruments are identified, they are used to construct a proxy for the endogenous variables (Y), which consists of their predicted values (Ŷ) in a regression on the instruments (Q) and the exogenous variables (X). This proxy (Ŷ) is then used in a standard least squares regression. The instruments (Q) are other "excluded" exogenous variables or time lags of the exogenous variables, since they are contemporaneously uncorrelated with the error terms.
Besides the 2SLS estimation results and measures of fit, GeoDaSpace also reports the Anselin-Kelejian test (Anselin and Kelejian 1997), which allows us to test for the presence of remaining spatial autocorrelation in the residuals of the 2SLS estimation. This is a version of the classical Lagrange Multiplier error test adapted to the case of residuals from a 2SLS regression. It is computed from the Moran's I statistic for the residuals of a 2SLS estimation and is asymptotically distributed as a χ² with one degree of freedom.
In the previously shown example for the EU regions, one problem is that the second-nature variables, population (PI06) and productivity (DEL06), are endogenous and determined simultaneously with GDP, leading to a simultaneity bias in the OLS estimators. For this reason, we propose using as instruments both time-lagged variables (LPO04, LPO05, LPR04, LPR05) and spatially lagged variables (WLPO06, WLPR06), since they are highly correlated with the stochastic regressors but asymptotically uncorrelated with the error terms.
The 2SLS estimates are quite similar to the OLS ones, except for those of the stochastic regressors, which are higher in value. The analysis of the residuals with the Anselin-Kelejian test does not allow us to accept the null hypothesis of no spatial autocorrelation, as in the OLS case.
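The two stages described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the GeoDaSpace implementation: the function name `two_stage_least_squares` and the simulated data (one exogenous regressor, one endogenous regressor, one instrument) are hypothetical.

```python
import numpy as np

def two_stage_least_squares(y, X, Y_end, Q):
    """2SLS: X exogenous (incl. constant), Y_end endogenous, Q excluded instruments."""
    H = np.hstack([X, Q])                      # full instrument set
    Z = np.hstack([X, Y_end])                  # original regressors
    # Stage 1: predicted values of the endogenous variables
    Y_hat = H @ np.linalg.lstsq(H, Y_end, rcond=None)[0]
    # Stage 2: least squares with the proxy in place of Y_end
    Z_hat = np.hstack([X, Y_hat])
    coef = np.linalg.lstsq(Z_hat, y, rcond=None)[0]
    resid = y - Z @ coef                       # residuals use the original Y_end
    return coef, resid

# Simulated example: the shock v enters both the regressor and the error term
rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)
q = rng.normal(size=n)
v = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
Q = q[:, None]
Y_end = (1.0 + 2.0 * q + v)[:, None]
y = 1.0 + 0.5 * x + 1.5 * Y_end[:, 0] + v + rng.normal(size=n)

coef_2sls, _ = two_stage_least_squares(y, X, Y_end, Q)
coef_ols = np.linalg.lstsq(np.hstack([X, Y_end]), y, rcond=None)[0]
print("2SLS:", coef_2sls)
print("OLS: ", coef_ols)
```

Note that the residuals are computed with the original endogenous variables rather than with their predicted values; this is the usual convention when the residuals are later used for inference or diagnostics.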
4.1.4. OLS/2SLS plus spatial HAC of the basic model without/with endogenous regressors.
When spatial heteroskedasticity and autocorrelation in the error terms are very significant in a model, it is possible to carry out robust inference based on the estimated OLS or 2SLS residuals. It is similar to the well-known White adjusted variance, which consists of estimating the covariance matrix of the OLS parameter estimators from the covariance matrix of the heteroskedastic error terms, in order to perform inference that is robust to this problem. Kelejian and Prucha (2007) develop a spatial HAC estimator for a situation in which the error terms are not only heteroskedastic but also spatially autocorrelated. They model spatial dependence in terms of a spatial weighting matrix. In the spatial HAC estimation literature, an economic distance is commonly employed to characterize the decaying pattern of the spatial dependence.
The covariance of random variables at locations i and j is a function of $d_{ij}$, the economic distance between them; as the economic distance increases, the covariance decreases in absolute value, and vice versa. The existence of such an economic distance enables GeoDaSpace to use the (adaptive) kernel method for the standard error estimation, which is a weighted sum of sample covariances with weights depending on the relative distances, that is, $d_{ij}^{*} = d_{ij}/h$ for some bandwidth parameter $h$.
In this way, robust inference based on the 2SLS estimates presented in Figure 6 leads to the results in Figure 7. Though the 2SLS coefficients are the same, their estimated standard errors are different and robust to the already demonstrated existence of both heteroskedasticity and spatial autocorrelation. The HAC standard errors have been computed with the help of an adaptive Gaussian kernel function of the distances between the EU NUTS3 region centroids, for 11 observations (kernelgauss.kwt file).
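The kernel-weighted sum of sample covariances can be sketched for the OLS case as follows. This is only an illustration under simplifying assumptions: a Gaussian kernel with a single fixed bandwidth is used instead of an adaptive one, distances are Euclidean distances between simulated points, and the function names are hypothetical rather than taken from GeoDaSpace.

```python
import numpy as np

def gaussian_kernel(z):
    """Gaussian kernel evaluated at the relative distance z = d_ij / h."""
    return np.exp(-0.5 * z ** 2)

def spatial_hac_cov(X, resid, dist, bandwidth):
    """Sandwich covariance of OLS coefficients with kernel-weighted cross products."""
    K = gaussian_kernel(dist / bandwidth)      # n x n kernel weights, K_ii = 1
    Xe = X * resid[:, None]                    # rows are x_i * e_i
    meat = Xe.T @ K @ Xe                       # sum_ij K_ij e_i e_j x_i x_j'
    bread = np.linalg.inv(X.T @ X)
    return bread @ meat @ bread

# Simulated locations and data (illustrative assumption)
rng = np.random.default_rng(2)
n = 100
coords = rng.uniform(size=(n, 2))
dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)
b = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ b

V_hac = spatial_hac_cov(X, resid, dist, bandwidth=0.2)
print("HAC standard errors:", np.sqrt(np.diag(V_hac)))
```

When the bandwidth shrinks towards zero, only the diagonal terms keep a non-zero weight and the expression collapses to the White heteroskedasticity-consistent covariance matrix, which shows how the spatial HAC estimator generalizes it.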
4.2. Spatial models
4.2.1. Spatial two-stage least squares (S2SLS) estimation of the spatial lag model without/with endogenous regressors
The spatial lag model, or mixed regressive spatial autoregressive model, includes a spatially lagged dependent variable, Wy, as one of the explanatory variables:
... (5)
where Wy is an N by 1 vector of spatial lags for the dependent variable, ρ is the spatial autoregressive coefficient, and ε is an N by 1 vector of normally distributed random error terms, with mean 0 and constant (homoskedastic) variance σ².
The presence of the spatial lag is similar to the inclusion of endogenous variables on the RHS in systems of simultaneous equations. For this reason, this model is often referred to as the simultaneous spatial autoregressive model. Typically the ρ coefficient is unknown and must be estimated jointly with the regression coefficients (β). The main consequence of the inclusion of Wy on the RHS of the specification is that OLS is no longer consistent, as happens in systems of simultaneous equations. Instead of OLS, estimation must be based on an instrumental variables approach, proposed by Anselin (1988), leading to what is called spatial two-stage least squares (S2SLS), or on an explicit maximization of the likelihood function when the error terms are normally distributed (this method will be implemented in GeoDaSpace soon).
The S2SLS estimation method consists of the construction of a proper instrument for the spatial lag variable. As in 2SLS, the resulting estimate (ρ̂) is consistent, but not necessarily very efficient. In the spatial lag model, a number of suggestions have been formulated for the choice of the best instruments for the spatial lag. Kelejian and Robinson (1993) showed that the proper set could be a series of spatially lagged exogenous variables, for first order and higher order contiguity matrices (though in practice this series may be truncated so that only the first order spatially lagged explanatory variables are included).
By the inclusion of Wy in addition to other explanatory variables, the spatial lag specification can be interpreted in two different ways. First, it can be a way to assess the degree of spatial dependence while controlling for the effect of these other variables; hence, the main interest is in the spatial effect. Alternatively, it allows assessing the significance of the other (non-spatial) variables after the spatial dependence is controlled for. In this second case, it is important to note that the interpretation of, and inference on, the coefficients of the exogenous variables in a spatial lag model is not straightforward, as they do not correspond to marginal effects (see LeSage and Pace 2009 for a further review).
In the application for the EU regions, we introduce the spatial lag of the dependent variable (GDP per area) with the aim of assessing the significance of the first and second nature variables while controlling for this variable (Figure 8).
Besides the previously defined instrumental variables for the second nature variables (population and productivity), the S2SLS method needs to include the spatially lagged exogenous variables as new instruments for the spatial autoregressive parameter (ρ): W_FSEA, W_FSLAT, W_FWEST and W_MINER. As we can see in the output window of GeoDaSpace, though the autoregressive coefficient seems to be relevant to GDP per area, it cannot completely capture the spatial autocorrelation effect in the residuals, since the Anselin-Kelejian test is still very significant. This result confirms the outcome of the LM spatial autocorrelation tests in the OLS estimation of the basic model, which pointed to a joint spatial lag and spatial error model (Figure 5).
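A minimal sketch of S2SLS with first order spatially lagged exogenous variables as instruments is given below. It is an illustration rather than GeoDaSpace's code: the function name `spatial_2sls`, the ring-shaped weights matrix and the simulated data are hypothetical, and the first column of X is assumed to be the constant (its spatial lag is dropped because it is collinear with the constant under row-standardized weights).

```python
import numpy as np

def ring_weights(n):
    """Hypothetical row-standardised weights: units on a ring, two neighbours each."""
    W = np.zeros((n, n))
    for i in range(n):
        W[i, (i - 1) % n] = 0.5
        W[i, (i + 1) % n] = 0.5
    return W

def spatial_2sls(y, X, W):
    """S2SLS for y = rho*Wy + X*beta + e, using WX as instruments for Wy."""
    Wy = W @ y
    WX = W @ X[:, 1:]                          # lags of the non-constant columns
    Z = np.column_stack([X, Wy])               # regressors, spatial lag last
    H = np.column_stack([X, WX])               # instruments
    Z_hat = H @ np.linalg.lstsq(H, Z, rcond=None)[0]
    coef = np.linalg.lstsq(Z_hat, y, rcond=None)[0]
    return coef[:-1], coef[-1]                 # (beta, rho)

# Simulated spatial lag process (illustrative assumption)
rng = np.random.default_rng(3)
n = 200
W = ring_weights(n)
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true, rho_true = np.array([1.0, 2.0, -1.0]), 0.5
eps = rng.normal(size=n)
y = np.linalg.solve(np.eye(n) - rho_true * W, X @ beta_true + eps)

beta_hat, rho_hat = spatial_2sls(y, X, W)
print("beta:", beta_hat, "rho:", rho_hat)
```

Higher order lags (W²X and so on) could be appended to the instrument matrix H in the same way if a longer series of instruments were wanted.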
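To see why the coefficients are not marginal effects, one can compute the average direct, indirect and total impacts in the sense of LeSage and Pace (2009), which are based on the matrix (I − ρW)⁻¹β_k for each exogenous variable k. The sketch below assumes a row-standardized weights matrix; the function name and the numerical values of ρ and β_k are hypothetical.

```python
import numpy as np

def lag_model_impacts(rho, beta_k, W):
    """Average direct, indirect and total impacts of one exogenous variable."""
    n = W.shape[0]
    S = beta_k * np.linalg.inv(np.eye(n) - rho * W)   # matrix of partial derivatives
    direct = np.trace(S) / n                   # mean own-location effect
    total = S.sum() / n                        # mean row sum
    return direct, total - direct, total

# Hypothetical ring-shaped, row-standardised weights with two neighbours per unit
n = 20
W = np.zeros((n, n))
for i in range(n):
    W[i, (i - 1) % n] = 0.5
    W[i, (i + 1) % n] = 0.5

direct, indirect, total = lag_model_impacts(rho=0.4, beta_k=2.0, W=W)
print(direct, indirect, total)
```

With row-standardized weights the average total impact equals β_k/(1 − ρ), so for positive ρ it exceeds the estimated coefficient; the difference with respect to the direct impact is the spillover to other locations.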
4.2.2. S2SLS plus GM/GMM/spatial HAC estimation of the spatial error model without/with endogenous regressors
The spatial error model is a special case of a so-called non-spherical error model, i.e., a regression specification for which the assumptions of homoskedastic (constant variance) and/or uncorrelated errors are not satisfied. The spatial dependence in the error term can take on a number of different forms. In the current version of GeoDaSpace, only a spatial autoregressive process for the error term can be estimated. This model is the standard regression specification with a spatial autoregressive error term:
... (6)
where Wε is the spatial lag of the errors, λ is the autoregressive parameter and ξ is a "well-behaved" error, with mean 0 and variance matrix σ²I.
The consideration of non-spatial endogenous explanatory variables (Y) in the model implies the estimation of a new group of γ coefficients, leading to the following expression: ..., (7)
The principles of the implementation of generalized method of moments (GMM) estimation of the spatial error model were originally presented in Kelejian and Prucha (1998, 1999), and more recently generalized in a series of papers by Kelejian and Prucha (2010), Arraiz et al. (2010) and Drukker et al. (2011), jointly referred to in what follows as K-P-D. The estimation strategy outlined by K-P-D consists of two major steps. The first one has to do with the estimation of the model coefficients using a feasible generalized least squares approach (Spatially Weighted Least Squares), which is a kind of spatial Cochrane-Orcutt transformation, in order to obtain a consistent (though not yet efficient) estimate of the spatial autoregressive parameter λ. The second step deals with obtaining an efficient estimate of λ (see Anselin et al., 2012 for technical details). This is the method employed by GeoDaSpace (Figure 9) in order to estimate a spatial error model with endogenous explanatory variables by S2SLS and GMM (GS2SLS). Additionally, we have computed a robust estimate of the estimator covariance matrix in the presence of both spatial heteroskedasticity and autocorrelation (KP-HET), as proposed by Kelejian and Prucha (2010).
4.2.3. S2SLS plus GM/GMM/spatial HAC estimation of the joint spatial lag and spatial error model without/with endogenous regressors
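The spatial Cochrane-Orcutt transformation mentioned above can be sketched for the case without endogenous regressors. The example takes a value of the autoregressive parameter λ as given (in practice it would come from the GM/GMM moment conditions, which are not reproduced here) and runs least squares on the spatially filtered variables. The function name, the weights matrix and the simulated data are hypothetical.

```python
import numpy as np

def spatial_filter_ols(y, X, W, lam):
    """Least squares on spatially filtered variables y - lam*Wy and X - lam*WX."""
    y_f = y - lam * (W @ y)
    X_f = X - lam * (W @ X)
    beta = np.linalg.lstsq(X_f, y_f, rcond=None)[0]
    return beta, y_f, X_f

# Hypothetical ring-shaped, row-standardised weights with two neighbours per unit
rng = np.random.default_rng(4)
n = 100
W = np.zeros((n, n))
for i in range(n):
    W[i, (i - 1) % n] = 0.5
    W[i, (i + 1) % n] = 0.5

# Simulated spatial error process: u = lam*W*u + xi
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true, lam_true = np.array([1.0, 2.0]), 0.6
xi = rng.normal(size=n)
u = np.linalg.solve(np.eye(n) - lam_true * W, xi)
y = X @ beta_true + u

beta_hat, y_f, X_f = spatial_filter_ols(y, X, W, lam_true)
print("beta:", beta_hat)
```

With the true λ, the filtered model has the "well-behaved" innovation as its error term, which is why least squares on the filtered variables regains efficiency.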
The joint spatial lag and spatial error model has the following specification:
... (8)
and the inclusion of additional endogenous explanatory variables leads to the following model:
... (9)
The estimation strategy outlined by K-P-D consists of three major steps (see Anselin, 2011 for technical details). The first one has to do with the estimation of the model coefficients using a feasible generalized least squares approach (Spatially Weighted Least Squares), which is a kind of spatial Cochrane-Orcutt transformation, in order to obtain a consistent (though not yet efficient) estimate of the spatial autoregressive parameter λ. The second step deals with obtaining an efficient estimate of λ by means of a weighting matrix that is necessary to obtain the optimal (consistent and efficient) GMM estimate of λ in the second iteration. A third step consists of estimating the regression coefficients (β and ρ) in a spatially weighted regression, using filtered variables that incorporate the optimal GMM estimate of λ.
Again, it is also possible to perform robust inference on the coefficient covariance matrix in the presence of both spatial heteroskedasticity and autocorrelation (KP-HET), as proposed by Kelejian and Prucha (2010).
As shown in Figure 10, both the ρ and λ autoregressive parameters are very significant (especially λ), as are the rest of the coefficients, which do not change greatly with respect to the previous estimations.
4.3. Spatial regimes model
This model is suitable in instances in which the assumption of a fixed relation between the explanatory variables and the dependent variable across the complete system is not tenable. Instead, heterogeneity may be present, in the form of different intercepts and/or slopes in the regression equation for subsets of the data. This is often referred to as structural instability or structural change in the econometric literature and may be expressed in the form of switching regression models. When the different subsets in the data correspond to regions or spatial clusters, it is called a spatial regimes model (Anselin, 1988).
In GeoDaSpace, models of spatial regimes are implemented by jointly estimating the coefficients for all the predefined spatial regimes. An augmented matrix of observations on the explanatory variables is constructed, of dimension N by MK (with M as the number of regimes), by transforming each explanatory variable into as many new variables as there are regimes. The new variables are zero for all observations that do not fall in the regime to which they correspond.
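The construction of the augmented N by MK matrix can be sketched as follows. This is an illustrative NumPy version, not GeoDaSpace's internal routine; the function name `regimes_design` and the small example data are hypothetical.

```python
import numpy as np

def regimes_design(X, regimes):
    """Expand an N x K matrix into N x (M*K): one block of K columns per regime."""
    regimes = np.asarray(regimes)
    labels = np.unique(regimes)                # sorted regime identifiers
    # Each block keeps the rows of its own regime and sets all other rows to zero
    blocks = [X * (regimes == r)[:, None] for r in labels]
    return np.hstack(blocks), labels

# Four observations, K = 2 (constant and one variable), M = 2 regimes
X = np.array([[1.0, 10.0],
              [1.0, 20.0],
              [1.0, 30.0],
              [1.0, 40.0]])
regimes = [1, 1, 2, 2]
X_aug, labels = regimes_design(X, regimes)
print(X_aug)
```

Running least squares on the augmented matrix yields a separate set of K coefficients for each regime while the model is still estimated jointly.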
In all other respects, a model with spatial regimes is treated as a regression model, allowing the full range of estimation methods (OLS, ML, 2SLS, S2SLS, GS2SLS, spatial HAC, KP-HET, etc.). The only condition is the introduction of the variable name for a categorical indicator variable, with the spatial regimes information, in the corresponding box of the GeoDaSpace "specification" menu. This indicator variable may only take on consecutive integer values, with each value corresponding to a regime. The software internally allocates the observations to their proper regimes.
GeoDaSpace also implements a test on the stability of the regression coefficients over the regimes. This is a spatial Chow test on the null hypothesis that the coefficients are the same in all regimes: $\beta_1 = \beta_2 = \dots = \beta_M$. In the case of the spatial models, only one autoregressive parameter (ρ and/or λ) is estimated for the whole model. This test is implemented for all coefficients jointly, as well as for each coefficient separately.
Figure 10 shows the complete set of estimates of a joint spatial lag and spatial error spatial regimes model of the log of GDP per area, with two endogenous non-spatial explanatory variables (population and productivity) and KP-HET inference robust to heteroskedasticity and spatial autocorrelation. The spatial regimes are 4 groups of European regions located in the center-core of the EU (regime #1), Northern periphery (regime #2), Eastern periphery (regime #3) and Southern periphery (regime #4). The spatial Chow test is very significant both in its global version and for each of the coefficients separately, which justifies the consideration of the regimes in order to control for spatial heterogeneity in this spatial system.
The spatial autoregressive coefficients are also very significant. The coefficients for the second nature variables, population (PI06) and productivity (DEL06), are also very significant in the four regimes, although they change considerably in size. Whereas the population coefficients are especially high in the Northern and Southern peripheries, the productivity ones are higher in the center-core and Eastern periphery, revealing different patterns of man-made agglomeration effects in the location of production in the EU.
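The logic of the joint stability test can be illustrated with the classical Chow F statistic for an OLS model, which compares the residual sum of squares of the pooled regression with that of the regimes regression. This is a simplified stand-in, assumed here for illustration only: it is not the spatial Chow test as computed by GeoDaSpace, and the function names and simulated data are hypothetical.

```python
import numpy as np

def rss(y, X):
    """Residual sum of squares of a least squares fit."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    r = y - X @ b
    return r @ r

def chow_test(y, X, regimes):
    """Classical Chow F statistic for equal coefficients across all regimes."""
    regimes = np.asarray(regimes)
    labels = np.unique(regimes)
    n, k = X.shape
    m = labels.size
    X_aug = np.hstack([X * (regimes == r)[:, None] for r in labels])
    rss_r = rss(y, X)                          # restricted: same coefficients
    rss_u = rss(y, X_aug)                      # unrestricted: one set per regime
    df1, df2 = (m - 1) * k, n - m * k
    F = ((rss_r - rss_u) / df1) / (rss_u / df2)
    return F, df1, df2

# Two simulated regimes with opposite slopes (illustrative assumption)
rng = np.random.default_rng(5)
n_half = 50
x = rng.normal(size=2 * n_half)
regimes = np.repeat([1, 2], n_half)
slope = np.where(regimes == 1, 1.0, -1.0)
X = np.column_stack([np.ones(2 * n_half), x])
y = slope * x + 0.1 * rng.normal(size=2 * n_half)

F, df1, df2 = chow_test(y, X, regimes)
print(F, df1, df2)
```

Under the null hypothesis and classical assumptions, the statistic follows an F distribution with (M − 1)K and N − MK degrees of freedom; a large value points to coefficient instability across regimes.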
5. Conclusions
The expansion of spatial analysis, and particularly of spatial econometrics, has been outstanding in all branches of the social sciences and economics. Accordingly, spatial regression techniques are now becoming an established component of the applied econometrics toolbox and even of standard econometrics textbooks. For this reason, these methods are also becoming part of the econometrics and statistics courses in economics, business administration, marketing, environmental studies, epidemiology and many other sciences.
GeoDaSpace is a recent software package that has proved very useful for both teachers and students in the classroom, since it is an easy-to-use point-and-click program, which is freely downloadable from the GeoDa Center. It is still an alpha release that is constantly incorporating new features. The methods announced for the near future are Maximum Likelihood estimation of the spatial models and the estimation and diagnostics of spatial panel data models.
Acknowledgements
The author acknowledges financial support from the Spanish Ministry of Economics and Competitiveness (Grant No. ECO2012-36032-C03-01), Project UAM-Santander and Xunta de Galicia (Grant No. 10SEC201032PR).
1 Maximum Likelihood estimation is not present in the last version (version 0.8.1, May 2013), though it will come soon.
2 The coordinates must be stored as a decimal number. Therefore, if you start from information on the coordinates in degrees you must first convert it to decimal format with the following linear combination: decimal = degrees x 1 + minutes x 0.01666667 + seconds x 0.00027778.
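The conversion in footnote 2 can be written as a small helper. The function name `dms_to_decimal` is hypothetical, and the sign handling for negative (western or southern) coordinates is an added assumption that the footnote does not discuss.

```python
def dms_to_decimal(degrees, minutes=0.0, seconds=0.0):
    """Convert degrees, minutes and seconds to decimal degrees.

    Assumption: the sign is carried by the degrees value, with minutes and
    seconds given as non-negative magnitudes. Coordinates between 0 and -1
    degrees cannot be signed this way and need separate handling.
    """
    sign = -1.0 if degrees < 0 else 1.0
    return sign * (abs(degrees) + minutes / 60.0 + seconds / 3600.0)

latitude = dms_to_decimal(40, 30, 0)
longitude = dms_to_decimal(-3, 30, 0)
print(latitude, longitude)
```

The factors 1/60 and 1/3600 are the exact values behind the rounded constants 0.01666667 and 0.00027778 given in the footnote.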
References
1. L. Anselin, R.J.G.M. Florax and S. Rey, Econometrics for spatial models: Recent advances, in Advances in spatial econometrics: Methodology, tools and applications, eds. L. Anselin, R.J.G.M. Florax and S. Rey (Springer Verlag, New York, 2004), pp. 1-25.
2. G. Maddala, "Econometrics" (McGraw-Hill, New York, 2001).
3. J. Wooldridge, "Econometric analysis of cross section and panel data" (MIT Press, Cambridge, MA, 2002).
4. D. Gujarati, "Basic Econometrics", fourth edition (McGraw-Hill, New York, 2003).
5. P. Kennedy, "A guide to econometrics", fifth edition (Blackwell Publishers, Oxford, 2003).
6. B. Baltagi, "Econometric analysis of panel data" (John Wiley and Sons, Chichester, 2008).
7. S. Rey and L. Anselin, PySAL: A Python library of spatial analytical methods, in Handbook of applied spatial analysis: Software tools, methods and applications, eds. M. Fischer and A. Getis (Springer, Berlin, 2010).
8. L. Anselin, From SpaceStat to CyberGIS. Twenty Years of Spatial Data Analysis Software, International Regional Science Review 35 (2012) 131-157.
9. D. M. Drukker, P. Egger and I. Prucha, On two-step estimation of a spatial autoregressive model with autoregressive disturbances and endogenous regressors, Econometric Reviews 32 (2013) 686-733.
10. J. LeSage and R.K. Pace, "Introduction to spatial econometrics" (Chapman and Hall/CRC, Boca Raton, FL, 2009).
11. R. Bivand, Spatial econometrics functions in R: Classes and methods, Journal of Geographical Systems 4 (2002) 405-421.
12. G. Piras, Spatial models with heteroskedastic innovations in R, Journal of Statistical Software 35 (2010) 1-21.
13. L. Anselin, "Exploring Spatial Data with GeoDa(TM): A Workbook" (University of Illinois at Urbana-Champaign, 2005).
14. C. Chasco, J. Vicéns and I. Garcia, Modeling spatial variations in household disposable income with geographically weighted regression, Estadística Española 50 (2008) 321-360.
15. C. Chasco, A. López and R. Guillain, The influence of geography on the spatial agglomeration of production in the European Union, Spatial Economic Analysis 7 (2012) 247-263.
16. L. Anselin and H. H. Kelejian, Testing for spatial error autocorrelation in the presence of endogenous regressors, Int. Regional Sci. Rev. 20 (1997) 153-182.
17. H. H. Kelejian and I. R. Prucha, HAC estimation in a spatial framework, J. Econometrics 140 (2007) 131-154.
18. L. Anselin, "Spatial econometrics: Methods and models" (Kluwer Academic Press, Dordrecht, 1988).
19. H. H. Kelejian and D. Robinson, A suggested method of estimation for spatial interdependent models with autocorrelated errors, and an application to a county expenditure model, Pap. Reg. Sci. 72 (1993) 297-312.
20. H. H. Kelejian and I. R. Prucha, A generalized spatial two-stage least squares procedure for estimating a spatial autoregressive model with autoregressive disturbances, J. Real Estate Financ. 17 (1998) 99-121.
21. H. H. Kelejian and I. R. Prucha, A generalized moments estimator for the autoregressive parameter in a spatial model, Int. Econ. Rev. 40 (1999) 509-533.
22. H. H. Kelejian and I. R. Prucha, Specification and estimation of spatial autoregressive models with autoregressive and heteroskedastic disturbances, J. Econometrics 157 (2010) 53-67.
23. I. Arraiz, D. M. Drukker, H. H. Kelejian and I. R. Prucha, A spatial Cliff-Ord type model with heteroskedastic innovations: Small and large sample results, J. Regional Sci. 50 (2010) 592-614.
24. D. M. Drukker, I. R. Prucha and R. Raciborski, "A command for estimating spatial-autoregressive models with spatial-autoregressive disturbances and additional endogenous variables", Technical Report (Stata Corp, College Station, TX, 2011).
25. L. Anselin, "GMM estimation of spatial error autocorrelation with and without heteroskedasticity", Note (GeoDa Center, Arizona State University, 2011).
26. L. Anselin, P. V. Amaral and D. Arribas-Bel, "Technical aspects of implementing GMM estimation of the spatial error model in PySAL and GeoDaSpace", Working Paper 2/12 (GeoDa Center, Arizona State University, 2012).
CORO CHASCO
coro.chasco@uam.es
Departamento de Economía Aplicada, Universidad Autónoma de Madrid
Avenida de Francisco Tomás y Valiente 5, 28049 Madrid
Copyright Ramón Sala Garrido 2013