
Identifying high-risk locations for vehicle crashes along highways is important for understanding the causes of crashes and for determining effective countermeasures. This paper presents a GIS approach that examines the spatial patterns of vehicle crashes and determines whether they are spatially clustered, dispersed, or random. Moran’s I and the Getis-Ord G_{i}* statistic are employed to examine spatial patterns, map clusters of vehicle crash data, and identify high-risk locations along highways. Kernel Density Estimation (KDE) is used to generate crash concentration maps that show the density of crashes along roads. The proposed approach is evaluated using 2013 vehicle crash data from the state of Indiana. Results show that the approach is efficient and reliable in identifying vehicle crash hot spots and unsafe road locations.

Identifying vehicular crash high risk locations along highways is a useful tool that can help transportation agencies allocate limited resources more efficiently, and find effective countermeasures. A crash hot spot is a location showing concentration of incidents, and hot spot analysis is a method for analyzing the spatial tendency between points or events within this location [

• A constant distance that represents the weight for any two different locations.

• A fixed weight for all observations within a specified distance.

• The k nearest neighbors receive a fixed weight, and all other (non-neighbor) locations receive a weight of zero.

• Weight could be proportional to the inverse distance, or inverse distance squared.
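The weighting schemes listed above can be sketched as follows. This is a minimal illustration using only the Python standard library; the function names and the four sample coordinates are hypothetical and are not taken from the study.

```python
import math

def pairwise_distances(points):
    """Euclidean distance matrix for a list of (x, y) coordinates."""
    return [[math.dist(p, q) for q in points] for p in points]

def fixed_distance_weights(dist, d):
    """Weight 1.0 for every pair of distinct locations within distance d, else 0.0."""
    n = len(dist)
    return [[1.0 if i != j and dist[i][j] <= d else 0.0 for j in range(n)]
            for i in range(n)]

def knn_weights(dist, k):
    """Weight 1.0 for each location's k nearest neighbors, 0.0 for all others."""
    n = len(dist)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        others = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        for j in others[:k]:
            w[i][j] = 1.0
    return w

def inverse_distance_weights(dist, power=1.0):
    """Weight 1 / distance**power; power=2 gives inverse distance squared."""
    n = len(dist)
    return [[1.0 / dist[i][j] ** power if i != j and dist[i][j] > 0 else 0.0
             for j in range(n)] for i in range(n)]

# Hypothetical crash locations along a straight road (coordinates in km).
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (6.0, 0.0)]
dist = pairwise_distances(points)
w_band = fixed_distance_weights(dist, d=2.0)
w_knn = knn_weights(dist, k=1)
w_inv = inverse_distance_weights(dist, power=2.0)
```

Note that the k-nearest-neighbor matrix is generally not symmetric: in this example the third point's nearest neighbor is the second point, but the second point's nearest neighbor is the first.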

There are a number of indices or statistics that attempt to measure spatial autocorrelation for continuous data, such as Moran’s I, Geary’s C, and Getis-Ord G_{i} statistic [_{i} statistic in a spot indicate a high spatial clustering (hot spot), whereas low and significant values indicate a low spatial clustering (cold spot). The type of clustering and its statistical significance is evaluated based on a confidence level and on the output z-scores and the correspondent p-values. These will determine whether a data point or a location belongs to a hot spot (denoted by High-High, HH), cold spot (denoted by Low-Low, LL) or an outlier (a high data value surrounded by low data values or vice versa, denoted by High-Low, HL or Low-High, LH). Other methods for studying spatial patterns of crash data as point events have recently been developed. One of the most widely used is the Kernel Density Estimation (KDE). The goal of KDE is to develop a continuous surface of density estimates of discrete events such as road crashes by summing the number of events within a search bandwidth. Many recent studies have used the 2-D planar KDE for hot spot analysis. However, this method has been criticized in relation to the fact that road crashes usually happen on the road links and need to be considered in a road network space represented by 1-D dimension. Therefore, some studies have extended the planar KDE to network spaces, which estimates the crash density over a distance unit in a 1-D measurement instead of an area unit [

Vehicle crashes have been investigated from different spatial and temporal perspectives by different researchers using varied procedures. The Black and Thomas [

Moran’s I [_{i} a mean

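$$I = \frac{n}{S_{0}} \cdot \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}\,(X_{i}-\bar{X})(X_{j}-\bar{X})}{\sum_{i=1}^{n}(X_{i}-\bar{X})^{2}} \qquad (1)$$

where,

n: is the number of observations;

\bar{X}: is the mean of the variable X;

w_{ij}: are the elements of the weight matrix;

S_{0}: is the sum of the elements of the weight matrix:

$$S_{0} = \sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}$$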

Values of this index typically range from −1.0 to +1.0, where a value of −1.0 indicates negative spatial autocorrelation and a value of +1.0 indicates positive spatial autocorrelation. When nearby points or segments have similar values, their cross-product is high. Conversely, when nearby points or segments have dissimilar values, their cross-product is low. The expectation of Moran’s I is:

$$E(I) = -\frac{1}{n-1} \qquad (2)$$

where n is the number of observations. A Moran’s I value larger than E(I) indicates positive spatial autocorrelation, and a value less than E(I) indicates negative spatial autocorrelation. In Moran’s formulation, the weight variable, w_{ij}, is a contiguity matrix. If zone j is adjacent to zone i, the product receives a weight of 1.0. Otherwise, the product receives a weight of 0.0. The z-score of Moran’s I can be computed by Equation (3):

$$z(I) = \frac{I - E(I)}{\sqrt{V(I)}} \qquad (3)$$

where E(I) is the expected value of I, and V(I) is the variance of I, as shown in Equation (4):

$$V(I) = E(I^{2}) - [E(I)]^{2} \qquad (4)$$

The distribution of the z-scores is assumed to be approximately normal with a mean of 0.0 and a variance of 1.0 (Cliff and Ord 1981). A statistically significant positive z-score indicates that the distribution of the observations is spatially autocorrelated producing High-High (HH) clusters, whereas a negative z-score indicates that the observations tend to be more dissimilar producing Low-Low (LL) clusters. A z-score close to zero indicates that observations are randomly and independently distributed in space. By assuming a z-score is from a standard normal distribution, their associated p-value can be obtained, and can be used to determine the significance of the index at each location [_{0} is that there is no spatial autocorrelation among the observations. The null hypothesis can be rejected, if the p-value shows that the z-score is significant.
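The computation of the global index, its expectation, the z-score, and the p-value can be sketched as below. This is an illustrative sketch, not the GIS tool used in the study: the function name `global_morans_i` and the four-segment toy data are hypothetical, and the variance is computed under the normality assumption, which is an assumption of this sketch.

```python
import math

def global_morans_i(x, w):
    """Return (I, E(I), V(I), z-score, two-sided p-value).

    x is a list of values; w is an n-by-n list of spatial weights.
    V(I) uses the normality-assumption form (an assumption of this sketch).
    """
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    s0 = sum(sum(row) for row in w)
    cross = sum(w[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    i_stat = (n / s0) * cross / sum(d * d for d in dev)

    expected = -1.0 / (n - 1)
    s1 = 0.5 * sum((w[i][j] + w[j][i]) ** 2 for i in range(n) for j in range(n))
    s2 = sum((sum(w[i]) + sum(w[j][i] for j in range(n))) ** 2 for i in range(n))
    e_i2 = (n * n * s1 - n * s2 + 3 * s0 * s0) / ((n * n - 1) * s0 * s0)
    variance = e_i2 - expected ** 2

    z = (i_stat - expected) / math.sqrt(variance)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value under N(0, 1)
    return i_stat, expected, variance, z, p

# Four road segments in a row; adjacent segments are neighbors (contiguity weights).
w = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
crashes = [1, 2, 3, 4]
i_stat, expected, variance, z, p = global_morans_i(crashes, w)
print(f"I={i_stat:.4f}  E(I)={expected:.4f}  V(I)={variance:.4f}  z={z:.4f}  p={p:.4f}")
```

In this toy case the index is 1/3 against an expectation of −1/3, so the pattern leans toward positive autocorrelation, but with only four segments the z-score is too small to reject the null hypothesis at the 99% confidence level.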

The Getis-Ord G_{i} statistic is another index of spatial autocorrelation [_{i} statistic computes a single statistic for the entire study area, while the local G_{i} statistic is an indicator for local autocorrelation for each data point. There are two types of G_{i} statistics, although almost the two types produce identical results [_{i}, does not include the autocorrelation of a zone with itself, whereas the _{i} statistic does not include the value of X_{i} itself, but only the neighborhood values, but _{i} as well as the neighborhood values), and formally both can be computed by the formulae [

where d is the neighborhood (threshold) distance and w_{ij} is a binary weight matrix: w_{ij} is 1.0 if location j is within distance d of location i, and 0.0 if it is beyond that distance. In these formulae, the cross-product of the values of X at locations i and j is therefore counted only when the two locations are no farther apart than the threshold distance d. The G statistic can vary between 0.0 and 1.0. The statistical significance of the local autocorrelation between each point and its neighbors is assessed by the z-score test and the p-value. The expected G value for a threshold distance, d, is defined as:

where, W is the sum of weights for all pairs of locations (

n is the number of observations. Assuming normal distribution, the variance of G(d) is defined as [

The standard error of G(d) is the square root of the variance of G(d). Therefore, a z-test can be computed by:

$$z(G) = \frac{G(d) - E[G(d)]}{\sqrt{\mathrm{Var}[G(d)]}}$$

Where, a positive z-value indicates spatial clustering of high values, while a negative z-value indicates spatial clustering of low values. Sometimes, the G statistic may not follow a normal standard error, and the distribution of the statistic may not be normally distributed, such as the case of a skewed variable with some points having very high values while the majority of other points having low values. In this case, a permutation type simulation should be used [_{0}). This will maintain the distribution of the variable z but will estimate the value of G under random assignment of this variable, and the user can take the usual 95% or 99% confidence intervals based on the level used.
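A permutation-type simulation of this kind might look like the sketch below. It assumes the general G statistic is the ratio of weighted cross-products to all cross-products (for j ≠ i) and that E[G] = W / (n(n − 1)); the function names, the skewed toy data, and the one-sided pseudo p-value are illustrative choices, not the procedure used in the study.

```python
import random

def general_g(x, w):
    """General G: weighted cross-products over all cross-products, for j != i."""
    n = len(x)
    num = sum(w[i][j] * x[i] * x[j] for i in range(n) for j in range(n) if i != j)
    den = sum(x[i] * x[j] for i in range(n) for j in range(n) if i != j)
    return num / den

def expected_g(w):
    """E[G] = W / (n (n - 1)), with W the sum of weights over all pairs i != j."""
    n = len(w)
    total = sum(w[i][j] for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))

def permutation_p_value(x, w, n_perm=999, seed=0):
    """One-sided pseudo p-value: share of random reassignments of the values
    whose G is at least as large as the observed G."""
    rng = random.Random(seed)
    observed = general_g(x, w)
    shuffled = list(x)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # keeps the distribution of x, breaks its spatial order
        if general_g(shuffled, w) >= observed:
            count += 1
    return observed, (count + 1) / (n_perm + 1)

# Six road segments in a row with binary adjacency weights and a skewed variable.
n = 6
w = [[1.0 if abs(i - j) == 1 else 0.0 for j in range(n)] for i in range(n)]
crashes = [50, 40, 1, 1, 1, 2]
observed, p_sim = permutation_p_value(crashes, w)
print(f"G={observed:.4f}  E[G]={expected_g(w):.4f}  pseudo p={p_sim:.3f}")
```

Because the values are only reshuffled among locations, each simulated G is computed from the same skewed distribution, so no normality assumption is needed to judge the observed value.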

Kernel Density Estimation is a non-parametric method for estimating the probability density function of a variable; it produces a smooth density surface of point events over a 2-D geographic space (i.e. planar space). Kernel density estimates are closely related to histograms, but can be given properties such as smoothness or continuity by using a suitable kernel. The disadvantages of histograms provide the motivation for kernel estimation. When constructing a histogram, we need to choose the width of the bins into which the whole data interval is divided, and the end points of the bins. As a result, histograms are not smooth and depend on these choices. Kernel density estimation alleviates these problems by centering a kernel function at each data point [

where λ(s) is the density at location s, r is the search radius (bandwidth) of the KDE, and k is the weight of a point i at distance d_{is} from location s. The kernel function k is usually expressed as a function of the ratio between d_{is} and r. As a result, the longer the distance between a point and location s, the less that point is weighted in the overall density. The weighted contributions of all points within the bandwidth r of location s are summed to obtain the density at s. A number of functions can be used for the spatial weights k, such as the Gaussian, Quartic, Conic, minimum variance, negative exponential, and Epanechnikov functions [

The Gaussian function:

The Quartic function:

where K is a scaling factor that ensures the total volume under the Quartic curve is 1.0; it is usually set to ¾.

The minimum variance function:

To find the KDE value, two key parameters must be chosen: the kernel function k; and the search radius (bandwidth) r. Many studies have found that the type of the distribution of the kernel function k has a very little effect on the results compared to the choice of search bandwidth r [

where,

SD: the standard distance

D_{m}: the median distance

n: the number of points if no population field is used, or if a population field is supplied, n is the sum of the population field values, and min: means that whichever of the two options that results in a smaller value will be used. Silverman [

where, the SD is the standard deviation of the samples provided that the kernel function is Gaussian type and that samples follow normal distribution. The cell size depends on the user choice and the dataset. Okabe et al. [
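The two bandwidth rules discussed above can be sketched as follows. Both formulas are assumptions of this sketch rather than reproductions of the study's equations: the default search radius is taken as 0.9 · min(SD, √(1/ln 2) · D_{m}) · n^{−0.2}, which matches the variables listed above, and Silverman's rule of thumb is taken in its Gaussian form 1.06 · SD · n^{−1/5}. The function names and sample data are hypothetical.

```python
import math
import statistics

def default_search_radius(points):
    """Assumed default radius for unweighted (x, y) points:
    0.9 * min(SD, sqrt(1 / ln 2) * D_m) * n ** -0.2."""
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    dists = [math.hypot(p[0] - cx, p[1] - cy) for p in points]
    sd = math.sqrt(sum(d * d for d in dists) / n)  # standard distance
    d_m = statistics.median(dists)                 # median distance from the mean center
    return 0.9 * min(sd, math.sqrt(1.0 / math.log(2)) * d_m) * n ** -0.2

def silverman_bandwidth(samples):
    """Assumed Gaussian rule of thumb: 1.06 * SD * n ** (-1/5)."""
    n = len(samples)
    return 1.06 * statistics.stdev(samples) * n ** -0.2

# Hypothetical crash coordinates (km) and a hypothetical 1-D sample.
crash_points = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
radius = default_search_radius(crash_points)
h = silverman_bandwidth([1.0, 2.0, 3.0, 4.0, 5.0])
print(f"default search radius = {radius:.4f} km, Silverman bandwidth = {h:.4f}")
```

Taking the minimum of the standard distance and the scaled median distance makes the radius less sensitive to a few crashes located far from the rest.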

In the real world, there are many kinds of network-constrained events, such as traffic crashes, street crimes, leakages in gas pipelines along roadways, and river contamination. In planar KDE, the space is treated as a 2-D homogeneous Euclidean space, and density is usually estimated at a large number of locations that are regularly spaced over a grid. However, when analyzing hot spots of network-constrained events, the assumption of a homogeneous 2-D space does not hold, and planar KDE methods may produce biased results [

Instead of calculating the kernel density over an area unit, the equation estimates the density over a linear unit, and any of the different forms of kernel functions k may be used.
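The difference between the planar and the network form can be illustrated with a short sketch. It assumes that the planar estimator divides the summed kernel weights by the area πr² and the network estimator by the length r, and that the distances d_{is} are already available (Euclidean for the planar case, shortest-path along the road for the network case). The function names are hypothetical, and the squared form of the quartic kernel is an assumption of this sketch; only the scaling factor ¾ comes from the text.

```python
import math

def gaussian_kernel(u):
    """Gaussian weight for the ratio u = d_is / r."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def quartic_kernel(u, scale=0.75):
    """Quartic-type weight scale * (1 - u^2)^2; the squared form is an assumption."""
    return scale * (1 - u * u) ** 2

def planar_density(distances, r, kernel=quartic_kernel):
    """Density per unit area at s from Euclidean distances d_is of nearby points."""
    return sum(kernel(d / r) for d in distances if d <= r) / (math.pi * r * r)

def network_density(distances, r, kernel=quartic_kernel):
    """Density per unit length at s from network (shortest-path) distances d_is."""
    return sum(kernel(d / r) for d in distances if d <= r) / r

# Hypothetical distances (km) from location s to four crashes, with a 1 km bandwidth.
euclidean_d = [0.0, 0.3, 0.5, 2.0]
network_d = [0.0, 0.4, 0.9, 2.6]   # path distances are at least the straight-line ones
print(f"planar : {planar_density(euclidean_d, r=1.0):.4f} crashes/km^2")
print(f"network: {network_density(network_d, r=1.0):.4f} crashes/km")
```

Crashes farther than r from s contribute nothing in either form; the two estimates differ in their units and in which distance is used.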

The analysis is conducted on a dataset that presents a road network in the state of Indiana as shown in

events on the network.

The Global Moran’s I evaluates whether the overall network crashes are clustered, dispersed, or random, and assesses the overall spatial pattern of the crash data. The GIS spatial statistics tool is used to compute the Global Moran’s I, and five values are generated from running this tool: The Moran’s I Index, the Expected Index, the Variance, the z-score, and the p-value as shown in

The results of the analysis are interpreted within the context of the null hypothesis. For the Global Moran’s I statistic, the null hypothesis states that the attributes being analyzed (i.e. crashes) are randomly distributed among the features in the study area (i.e. no global spatial autocorrelation exists for the entire network). Since the generated p-value is less than 0.01 (at a confidence level of 99%), the Global Moran’s I spatial autocorrelation is significant. Hence, we can reject the null hypothesis and state that the spatial distribution of the overall network crashes is likely the result of a clustering pattern, with less than a 1% probability that this pattern could be the result of a random process.

Similarly, the Global (General) Getis-Ord

Since the generated p-value is less than 0.01 (at a confidence level of 99%), the General G_{i} statistic spatial autocorrelation is significant. Hence, we can reject the null hypothesis and state that the spatial distribution of the overall network crashes is likely the result of clustering patterns, with less than a 1% probability that this pattern could be the result of a random process. This result is analogous to the Global Moran’s I in determining the overall clustering pattern of the crashes.

Global Moran’s I | Expected Index | Variance | z-score | p-value | Decision |
---|---|---|---|---|---|
0.135847 | −0.000335 | 0.000006 | 53.817107 | 0.000000 | significant |

General Gi | Expected Index | Variance | z-score | p-value | Decision |
---|---|---|---|---|---|
0.128449 | 0.106109 | 0.000001 | 19.233837 | 0.000000 | significant |
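The decision rule applied to both global statistics can be sketched as below, using the z-scores reported in the two tables. The helper names are hypothetical, and the two-sided p-value assumes the z-score follows a standard normal distribution.

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a z-score under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))

def decision(z, alpha=0.01):
    """Reject the null hypothesis of spatial randomness when p < alpha."""
    return "significant" if two_sided_p(z) < alpha else "not significant"

# z-scores as reported in the tables for the Indiana crash data.
reported = {"Global Moran's I": 53.817107, "General G": 19.233837}
for name, z in reported.items():
    print(f"{name}: z = {z:.2f}, p = {two_sided_p(z):.6f}, {decision(z)}")
```

For z-scores this large the p-value is far below the precision of six decimal places, which is why it appears as 0.000000 in the tables.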

Next, the statistically significant hot spots, cold spots, and spatial outliers are identified using the Anselin Local Moran’s I, and the local

In addition, the extent and locations of hot spots, cold spots and outliers differ from one method to the other. For example, cluster # 1 is identified by the

The planar Kernel Density Estimation is computed with the ArcMap Spatial Analyst tools. The Kernel Density tool calculates the density of point features around each output raster cell by fitting a smoothly curved surface over each point. Density surfaces show where point features are concentrated. The surface value is highest at the location of the point being analyzed and decreases with increasing distance from the point, reaching zero at the search radius distance from the point.

ties per km^{2}. It can be seen that the planar KDE has identified seven clusters with different crash densities. For example, cluster # 1 contains 8 density levels ranging from the highest density value of 147 crashes/km^{2} to the lowest density value of 16 crashes/km^{2}. Clusters # 2 and # 3 contain 3 density levels ranging from the highest density of 65 crashes/km^{2} to the lowest density of 16 crashes/km^{2}. Clusters # 4 and # 5 contain 2 density levels that decrease from the highest value of 48 crashes/km^{2} to the lowest value of 16 crashes/km^{2}. Clusters # 6 and # 7 contain only one density level of at least 16 crashes/km^{2}. The remaining white raster area contains the minimum density (between 0 and 15 crashes/km^{2}).

Method | Hot Spot HH | Cold Spot LL | Outlier HL | Outlier LH |
---|---|---|---|---|
Anselin Moran’s I | 102 | 287 | 79 | 82 |
G_{i}* statistic | 157 | 307 | 48 | 0 |

The network-constrained Kernel Density Estimation is determined using the SANET V4.1 software [

Since each method has identified different clustering patterns, we recommend using a combination of these methods in hot spot analysis. Comparing the results can support more diverse and flexible interpretations of the clustering patterns, whereas using only one method can lead to misleading conclusions. For example, as we saw above, cluster # 1 is identified as a pure HH hot spot in

and high density spot in network KDE (density from 4.8 to 5.7 crashes/km), while it is identified as a mixed HH and LH in Moran’s I. Likewise, cluster # 2 is identified as a pure non-significant spot in both Moran’s I and

Hot spot analysis focuses on highlighting areas that have a higher than average incidence of events, and it is a valuable technique for visualizing the concentration of events on networks. This paper presented two methods: Moran’s I and Getis-Ord

Moran’s I | G_{i}* statistic | Network KDE |
---|---|---|
Measures spatial correlation; identifies hot spots with high-high values and cold spots with low-low values | Measures spatial correlation; identifies hot spots with high-high values and cold spots with low-low values | Measures probability density function; identifies only hot spots in terms of density per linear distance unit |
Identifies outliers (dispersed incidents) with high-low values and low-high values | Does not identify outliers | Does not identify outliers |
Looks at the value in the context of its neighbors’ values, weighted by the inverse distance between locations | Looks at the value in the context of its neighbors’ values that fall within a specified distance of each other | Conducts density calculation based on the user-specified search radius and raster cell size |
Does not include the interaction of a zone with itself but only with its neighborhoods in measuring spatial correlation | Includes the interaction of a zone with itself in addition to its neighborhoods in measuring spatial correlation | Does not include the interaction of a zone with itself but only with its neighborhoods in measuring kernel density |
Reports an index value and a z-score | Reports a combined index value and z-score | Reports a linear density value, and a z-score |
Reports a p-value | Reports a p-value | Does not report a p-value |
Presents the statistical significance of clustering | Presents the statistical significance of clustering | Does not present the statistical significance of clustering |
Ranges from −1.0 to +1.0 | Ranges from 0.0 to +1.0 | Any positive value |

vehicle crashes and determines if they are spatially clustered, dispersed, or random using the 2013 vehicle crash data in the state of Indiana. The Global values of both Moran’s I and

Abdulhafedh, A. (2017) Identifying Vehicular Crash High Risk Locations along Highways via Spatial Autocorrelation Indices and Kernel Density Estimation. World Journal of Engineering and Technology, 5, 198-215. https://doi.org/10.4236/wjet.2017.52016