Monday, August 5, 2013

Linear Discriminant Analysis

Linear Discriminant Analysis (LDA) is a classification algorithm used in machine learning and pattern recognition. For linearly separable data, it classifies the data faster than a Support Vector Machine. It also allows for dimensionality reduction if required: when two or more features convey the same information, the redundant features are not used in deciding the classification criteria. In essence, LDA maximizes the ratio of the between-class variance to the within-class variance.

For practical purposes, LDA seems to be used more often than the SVM, so we tried this classification method on our data. We obtained the following results for a single-sensor LDA.

LDA classification results
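For reference, here is a minimal sketch of how a single-sensor LDA might be set up with scikit-learn. The names X_train/X_test (FFT-bin feature vectors) and y_train/y_test are hypothetical, assumed to be built as in the SVM posts below, and the class lives at this path in current scikit-learn versions (older releases exposed it as sklearn.lda.LDA):

    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    # X_*: FFT-bin feature vectors; y_*: 1 for quake-time windows, 0 for quiet-time
    lda = LinearDiscriminantAnalysis()
    lda.fit(X_train, y_train)
    print(lda.score(X_test, y_test))   # fraction of test windows classified correctly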


Picking on Quiet Data too!

The idea of running the KSigma algorithm on the filtered data seemed promising after seeing the plots in the previous post. What we had not realized all this while was that if the algorithm picks more readily during quake time, it will pick more readily during quiet time too. We were enhancing the frequencies that distinguish quiet time from quake time, and that seems to be the reason why there were also more picks during quiet time.

Below is a plot showing this effect. The top subplot is the original data from the SAC file. The two blue lines are the TauP estimates of the P and S wave arrival times; the first is for the P wave and the second for the S wave.

Quiet time picking

Friday, July 19, 2013

Trying to get cleaner picks!

The single-sensor SVM seemed to work fine, at least for the station that we picked randomly. We are currently trying the same for the other stations. Also, the previous results were for a single channel (N); we are now trying to combine channels.

Dr. Chandy suggested that a computationally inexpensive way to derive the frequency characteristics is to pass the data through a bank of filters; this gives essentially the same information as the FFTs. Now that the SVM has been trained and the weights for the different features have been derived, the data obtained from each filter can be weighted with the weight learned for the corresponding feature (the FFT bins and the filters should cover the same frequency intervals). Since the data has now been tweaked to better separate quake from non-quake times, we expect it to pick better, and the kSigma picks should have a higher confidence value.
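A minimal sketch of this filter-bank weighting, assuming tr is an ObsPy Trace, bands lists the frequency intervals matching the SVM feature bins, and weights holds the SVM weight learned for each bin (all of these names are placeholders):

    import numpy as np

    bands = [(0.0, 2.0), (2.0, 4.0), (4.0, 6.0), (6.0, 8.0)]   # must match the FFT bins
    weighted = np.zeros(len(tr.data))
    for (fmin, fmax), w in zip(bands, weights):
        band_tr = tr.copy()
        # ObsPy's bandpass needs a non-zero lower corner, so clamp 0 Hz to a small value
        band_tr.filter("bandpass", freqmin=max(fmin, 0.1), freqmax=fmax, corners=2)
        weighted += w * band_tr.data
    # 'weighted' would then be fed to the kSigma picker in place of the raw trace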

Below is an illustration of the algorithm.

Algorithm illustration

Thursday, July 18, 2013

Classification Results for the SVM of a single sensor

We implemented an SVM in Python to classify positive and negative picks for a single sensor, using the following algorithm.

The data in the SAC files is divided at the point of the first pick. Suppose the first pick occurs at 34 seconds for some earthquake. The algorithm takes an FFT of each 5 second interval before and after the pick, i.e. 0-5, 5-10, 10-15, 15-20, 20-25, 25-30 seconds for the non-quake period and 35-40, 40-45, 45-50, 50-55, 55-60 seconds for the quake period (and similarly for 10 second intervals). These are stored in an array, and the array is then randomly shuffled.

One concern was that the SVM seemed to be heavily biased towards whichever side (positive or negative) had more training examples, even if it had only one more.

We tried this for the following bin sizes.
  • 0-1, 1-2, 2-3, 3-4, 4-5, 5-6, 6-7, 7-8 Hz
  • 0-1, 1-2, 2-4, 4-8 Hz
  • 0-2, 2-4, 4-6, 6-8 Hz
These were all tried for both 5 second and 10 second intervals, using both the linear and the RBF kernels. To obtain the results below, we kept the number of training examples for positive and negative picks exactly equal.
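To make the procedure concrete, here is a minimal sketch of the windowing, binning and training steps described above. The names data, fs and pick_idx (the single-channel samples, the 50 Hz sampling rate, and the sample index of the first pick) are placeholders, and the exact window bookkeeping may differ from our actual script:

    import numpy as np
    from sklearn.svm import SVC

    def bin_energies(segment, fs, edges=(0, 2, 4, 6, 8)):
        # FFT one window and sum the spectral amplitude into the chosen frequency bins
        freqs = np.fft.rfftfreq(len(segment), d=1.0 / fs)
        amps = np.abs(np.fft.rfft(segment))
        return [amps[(freqs >= lo) & (freqs < hi)].sum()
                for lo, hi in zip(edges[:-1], edges[1:])]

    win = int(5 * fs)                                         # 5 second windows
    X, y = [], []
    for start in range(0, pick_idx - win + 1, win):           # quiet-time windows
        X.append(bin_energies(data[start:start + win], fs))
        y.append(0)
    for start in range(pick_idx, len(data) - win + 1, win):   # quake-time windows
        X.append(bin_energies(data[start:start + win], fs))
        y.append(1)

    clf = SVC(kernel="linear").fit(X, y)                      # or kernel="rbf"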

The tables below show the results for the 2 Hz bins (0-2, 2-4, 4-6, 6-8 Hz), for both the linear and RBF kernels and for both 5 second and 10 second data intervals.


For bins: 0-2, 2-4, 4-6, 6-8 Hz
5 second intervals: 56 training samples, 8 test samples
Training Data

                   5 s, linear kernel (%)    5 s, RBF kernel (%)
True Positive            71.42                     25
False Positive            7.2                       0
True Negative            92.8                      100
False Negative           28.58                      75

Test Data

                   5 s, linear kernel (%)    5 s, RBF kernel (%)
True Positive             100                       0
False Positive              0                       0
True Negative             100                      100
False Negative              0                      100
10 second intervals: 24 training samples, 4 test samples
Training Data

                   10 s, linear kernel (%)   10 s, RBF kernel (%)
True Positive            71.42                     25
False Positive            7.2                       0
True Negative            92.8                      100
False Negative           28.58                      75

Test Data

                   10 s, linear kernel (%)   10 s, RBF kernel (%)
True Positive              50                      100
False Positive              0                       0
True Negative             100                      100
False Negative             50                       0

Thursday, July 11, 2013

Issues with Spectrogram

Since our last post on spectrograms and their utility, we have spent some time thinking about the current challenges of the approach and how to refine our spectrograms to meet our expectations.

Interestingly, the answers are all contained in the way ObsPy (or any tool, for that matter) creates a spectrogram. Here is a good link which explains some of the technical aspects of a spectrogram.

In our last post concerning spectrograms, we talked about the poor resolution of the spectrograms we created. It appears that the whole issue was due to some important parameters of ObsPy's spectrogram that we had failed to understand clearly:
  1. wlen (window length): Conceptually, every spectrogram is created by dividing the entire data stream into small windows; the FFT of each window is taken, and the results are stacked side by side to produce the plot. 'wlen' defines the size of each such window. A very large wlen means fewer windows (hence poorer time resolution), while a very small wlen means too little data for each FFT (hence poorer resolution on the colour scale). We had to carefully tweak this setting for our purpose.
  2. Overlap: To produce better quality spectrograms, two consecutive windows usually share some overlapping data points, but a very high overlap can disturb the x-axis scaling. (A sketch of how these parameters are passed is shown just below.)
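For illustration, this is roughly how the two parameters are passed to ObsPy's spectrogram function; the values shown are placeholders, not the settings we finally settled on:

    from obspy.imaging.spectrogram import spectrogram

    # wlen is the window length in seconds, per_lap the fraction of overlap
    spectrogram(tr.data, tr.stats.sampling_rate, wlen=5.0, per_lap=0.9)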
We had to test these settings one by one. By tweaking the two parameters above, we were able to create somewhat better spectrograms, like the one displayed below.
Several spectrograms stacked one above another for the LADWP building, with decreasing floor numbers from top to bottom

Wednesday, July 10, 2013

Issues with a single sensor SVM

It would be very useful if we could train an SVM on a single sensor and detect picks on that sensor correctly. There are a few issues with doing that. 

We classify non-quake times as negative picks and quake times as positive picks. The training data is separated at the first time-domain kSigma pick. Suppose it occurs at 34 seconds for some earthquake and some station; the algorithm then takes an FFT of each 5 second interval before and after the pick, i.e. 0-5, 5-10, 10-15, 15-20, 20-25, 25-30 seconds for the non-quake period and 35-40, 40-45, 45-50, 50-55, 55-60 seconds for the quake period. We're not sure at this point whether this is in fact a good idea; it might not capture the information correctly. Also, since 5 seconds is a very short interval and therefore contains little data, even slight abnormalities can lead to disastrous results.

We also face data insufficiency issues. There are only a few sensors for which we have data for all earthquakes during the past few years. Even among those, a few stations did not pick for some of the earthquakes. Also, some stations picked at the 54th second of the 60 seconds of data, which leaves more quiet-time data than quake-time data. The SVM, in such cases, is heavily biased towards negative picks and is very likely to predict a positive pick as a negative one.

Tuesday, July 9, 2013

Restrictions due to renaming of Phidgets

The SVM trained on the non-normalized data gave better results than expected. There is a very low probability that this happened by chance, since the training and test data were chosen such that the possibility of good results by mere chance is largely eliminated. Nevertheless, to be sure, it could be tested on a single sensor.

The availability of data is an issue here. The CSN, being a relatively new project, has recorded only a limited number of earthquakes. Also, once a Phidget is unplugged and set up again on a different machine, it is assigned a new identification number, so it becomes very difficult to figure out which pair of identification numbers actually belongs to the same Phidget. Perhaps the CSN will develop a workaround in the future.

Normalization of the FFT histograms

Dr. Bunn suggested normalizing the FFT histograms so that the SVM would generalize better. This seemed logical and we tried it out. Surprisingly, the results were not better; in fact, for quake predictions, the accuracy dropped below what a random guess would achieve (less than 50%). The number of false positives, however, did go down considerably.
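The exact normalization is not important for the argument here, but as an illustration, scaling each FFT histogram to unit sum is one plausible choice (X is a hypothetical feature matrix with one row per window):

    import numpy as np

    X = np.asarray(X, dtype=float)
    X_norm = X / X.sum(axis=1, keepdims=True)   # each row (FFT histogram) now sums to 1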

Results:
Training Data:
Even the training data was not fit well by the SVM trained on it.
True positive: 48.90%
True Negative: 97.23%
False Positive: 2.77%
False Negative: 51.10%

Test Data:
Since even the training data was not fit well by the SVM, we did not expect the test data results to look good either.
True positive: 30%
True Negative: 100%
False Positive: 0%
False Negative: 70%

Support Vector Machines

For a proof of concept, we tried to train a support vector machine to distinguish between quiet time and quake time periods. The results are promising!

The support vector machine was trained on 7 different earthquakes for a set of stations, using both quiet-time and quake-time data. Currently, the first pick is considered the boundary between the quiet-time and quake-time periods. A set of random data points (3 random stations for each quake, for both quiet time and quake time) was chosen as the training data.

We had a dataset of 2240 data points in all.

Results:

True positive: Correctly predicted quake time as quake time
True Negative: Correctly predicted quiet time as quiet time
False Positive: Incorrectly predicted quiet time as quake time
False Negative: Incorrectly predicted quake time as quiet time
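For reference, here is a minimal sketch of how these percentages can be computed from true and predicted labels with scikit-learn; y_true and y_pred are hypothetical arrays with 1 for quake time and 0 for quiet time:

    from sklearn.metrics import confusion_matrix

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    true_positive = 100.0 * tp / (tp + fn)    # % of quake-time windows caught
    false_positive = 100.0 * fp / (fp + tn)   # % of quiet-time windows misflagged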

Training Data:
The Support Vector Machine seemed to be able to fit the training data well enough.
True positive: 78.93%
True Negative: 84.19%
False Positive: 15.81%
False Negative: 21.07%

Test Data:
The Support Vector Machine was also able to correctly classify the test data.
True positive: 80.00%
True Negative: 88.57%
False Positive: 11.43%
False Negative: 20.00%

The idea of detecting earthquakes by capturing the differences in frequency-domain properties seems sound now. We will try changing a few parameters to figure out the classification setup that gives optimal results.

More on the Spectrograms

In the last post we discussed how we were able to plot spectrograms using the spectrogram module in ObsPy. A few days ago we were able to show some of those plots to Dr. Monica Kohler. Our plots seemed very pixelated compared to the plots that she normally uses. We were looking for a way to plot spectrograms with good resolution and a palpable colour distinction between different intensities for the Los Angeles Department of Water and Power (LADWP) multi-story building. The following plot can be considered a perfect example of what we are aiming for.
Spectrogram plotted by Ming Hei, a graduate student, for the Millikan Tower, a multi-story building at Caltech. Note the intense horizontal lines, which correspond to the harmonics of the building.



Dr. Kohler offered a few suggestions that could help us improve our plots:
  1. Apply a 2-pass, 2-pole Butterworth bandpass filter from 0.1 to 20 Hz before calculating the spectrograms.
  2. Change the colour scheme such that the maximum spectral amplitudes appear bright red (see the spectrogram above).
We will try to incorporate these suggestions.
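The first suggestion maps directly onto ObsPy's Trace.filter; a minimal sketch, assuming tr is an ObsPy Trace (zerophase=True runs the 2-pole Butterworth filter forwards and backwards, i.e. 2 passes):

    tr_filt = tr.copy()                        # keep the raw trace untouched
    tr_filt.filter("bandpass", freqmin=0.1, freqmax=20.0, corners=2, zerophase=True)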

Monday, July 1, 2013

The Spectrograms

Less than a week ago, we were able to show some of our work to Dr. Monica D. Kohler, a Senior Research Fellow in Mechanical and Civil Engineering at Caltech. Being the seismic expert in the CSN group, she was interested in the effects of seismic waves on tall buildings, and in whether it was possible to observe the oscillations of a tall building in the frequency transforms. She suggested the use of spectrograms: time-frequency distributions of the seismic data from the SAC files.

Fortunately, the ObsPy package itself contains the 'obspy.imaging.spectrogram' module, which lets spectrograms be plotted directly from the data read out of SAC files. This speed-optimized method takes the trace data and a bunch of other (plotting) arguments as input, and plots the spectrogram of 15 minutes of SAC data within a second.
A spectrogram showing activity during the Anza (Mar 11, 2013) tremors for all 3 channels. Notice the red and yellow shades marking the beginning of the quake.


Some work has to be done, though, if you wish to transfer these plots to matplotlib.pyplot or arrange them as subplots, because ObsPy uses its own plotting methods, which do not draw on a subplot by default.
The workaround is to use 'plt.gca()' to get the axes and then let ObsPy plot onto them. Also make sure to pass 'show=False' to spectrogram, to avoid a window popping up for every subplot. A sample snippet for a single subplot looks like this:
    import matplotlib.pyplot as plt
    from obspy.imaging.spectrogram import spectrogram

    plt.subplot(311)                  # first of three stacked panels
    ax_HNN = plt.gca()                # axes for ObsPy to draw on
    spectrogram(tr_HNN.data, tr_HNN.stats.sampling_rate, show=False, axes=ax_HNN)

We are planning to plot spectrograms for some CSN sensors in the Los Angeles Department of Water & Power (LADWP) building, to look for a correlation between the frequency content and the floor on which each sensor is installed.





Friday, June 28, 2013

The TauP curves

TauP is a seismic travel-time calculator: given the depth and location of the seismic activity, it takes the composition of the earth into account to estimate the velocity of the seismic waves and thereby the estimated arrival time of each wave at a particular location. It predicts the travel times, and hence the arrival times, of the P wave and the S wave. ObsPy also has a package for plotting the P wave arrival times; you can find the details here. We could not locate one for plotting the S wave.
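For reference, current ObsPy versions ship a TauP interface of their own; a minimal sketch is shown below, with placeholder depth and distance values. This is not the CSN code we ported, and the ObsPy interface available in 2013 differed from this one:

    from obspy.taup import TauPyModel

    model = TauPyModel(model="iasp91")
    arrivals = model.get_travel_times(source_depth_in_km=10.0,   # placeholder depth
                                      distance_in_degree=0.5,    # placeholder distance
                                      phase_list=["P", "S"])
    for arr in arrivals:
        print(arr.name, arr.time)    # predicted travel time in seconds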

To verify the accuracy of the ObsPy package, we ported the TauP code used by the CSN from Java to Python. The plot below shows the actual P and S wave picks along with the estimated P and S wave arrival curves.
The TauP plot for the Anza earthquake (4.7M)



The red points show the actual picks by the CSN stations. The magenta curve is the TauP estimate for the P wave of the quake and the blue curve is the estimate for the S wave.


Friday, June 21, 2013

Undersampling causes Aliasing!

The earthquake data from the CSN stations (in SAC format) is sampled at 50 Hz. We put some thought into the possibility of something going wrong because of the sampling rate, and it turns out something can: undersampling causes aliasing!

If a 60 Hz sine wave is sampled at 50 Hz and the Discrete Fourier Transform is then taken (via the FFT algorithm), one might expect no content in the plot, since there is no frequency in the range 0 Hz - 25 Hz (25 Hz being half the 50 Hz sampling rate, the Nyquist frequency). But that is not the case. Because the data has been undersampled, the 60 Hz signal distorts the frequency transform and brings up a peak at 10 Hz (60 Hz - 50 Hz). This is demonstrated in the plot below.

Aliasing caused by the 60 Hz signal sampled at 50 Hz
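This effect is easy to reproduce with a few lines of NumPy; a minimal sketch:

    import numpy as np
    import matplotlib.pyplot as plt

    fs, f_signal = 50.0, 60.0                       # sampling rate and true signal frequency
    t = np.arange(0, 10.0, 1.0 / fs)                # 10 s of a 60 Hz sine sampled at 50 Hz
    x = np.sin(2 * np.pi * f_signal * t)

    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)     # 0 .. 25 Hz (Nyquist)
    plt.plot(freqs, np.abs(np.fft.rfft(x)))         # the peak shows up at |60 - 50| = 10 Hz
    plt.show()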

This could make a lot of the analysis go wrong: a peak at 1 Hz could have been caused by signals at 1 Hz, 49 Hz, 51 Hz, 99 Hz, 101 Hz, and so on. Further analysis in the frequency domain therefore requires more information about the frequencies the stations could possibly record.

The x-axis for the FFT plot

A nice combination of NumPy and ObsPy enables us to quickly plot FFTs from the event SAC files. Finding the FFT is the easy part, thanks to 'numpy.fft'. The main roadblock, as it turned out, was getting the x-axis scale right.

To plot an FFT, matplotlib.pyplot needs to be provided with two arrays: the y points and the corresponding x points. numpy.fft.fft gives us the former; the latter can be created with another NumPy method, 'numpy.fft.fftfreq(n=number of points, d=sample spacing)'.

We were able to locate this function quickly, but we made mistakes in using it because we were giving it the wrong arguments. The documentation describes 'd' as the sample spacing, which we took to be the spacing between two consecutive points along the x-axis in the frequency domain:
    d = sampling_freq / total_number_of_points    # wrong

This turned out to be wrong: it gives an incorrect domain to the DFT, and the scale showed ticks above the Nyquist frequency. So we looked at the source code and found that 'd' represents the spacing between two samples in the time domain:
    d = 1 / sampling_freq    # correct

This gave us the x-scale we expected, i.e. within the range of the Nyquist frequency. Below is one such plot showing the correct scale, for all three channels.
Stacked FFTs with a sampling rate of 50 Hz. Notice that the maximum frequency on the x-axis is half the sampling rate, i.e. the Nyquist frequency.
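For reference, a minimal sketch of the corrected usage, assuming tr is an ObsPy Trace:

    import numpy as np

    freqs = np.fft.fftfreq(tr.stats.npts, d=1.0 / tr.stats.sampling_rate)  # d = time between samples
    amps = np.abs(np.fft.fft(tr.data))
    # only the first half of each array (the positive frequencies) is plotted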

Monday, June 17, 2013

The picking algorithm

It seemed like the frequency transforms would make more sense if they were separated into quiet-time and quake-time data at the point of the first pick. This required us to port the current KSigma picking algorithm from Java to Python. While this seemed like an easy task, a particular mistake we made in the Python code led to disastrous results. The error, not being a syntactic one, required some debugging to track down. Once that was resolved, the code was up and running. To be sure that there were no more errors, we compared the results of the Java code with those of the Python code; both reported the same data point as the first pick for all the stations. Well, that was some progress!
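We will not reproduce the CSN implementation here, but the idea behind a k-sigma style picker can be sketched roughly as follows: flag the first sample that deviates from a quiet-time baseline by more than k standard deviations. The function below is only a toy illustration with hypothetical parameter choices, not the ported algorithm:

    import numpy as np

    def ksigma_pick(data, k=5.0, warmup=200):
        baseline = data[:warmup]                       # assume the first samples are quiet
        mean, std = np.mean(baseline), np.std(baseline)
        above = np.abs(data[warmup:] - mean) > k * std
        if not above.any():
            return None                                # no pick in this trace
        return warmup + int(np.argmax(above))          # index of the first pick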

Sunday, June 16, 2013

Frequency Transforms: Debugging!

In the last post, we described how to get started with plotting the frequency transforms of seismic data from SAC files. We also complained about the plethora of difficulties we faced in visualizing the data properly. The good news this week is that we were able to solve most of them and write a Python module that creates a frequency transform with the proper scale and visualization settings.

In this post we shall describe how we solved some of the roadblocks.

The first thing to do is to set the scale of the x-axis in the frequency domain. We can use 'numpy.fft.fft(array)' to obtain the complex FFT of the input data. However, to plot it, another array is needed, containing the x points corresponding to these y points. This can be done in two ways: manually, or by using the 'numpy.fft.fftfreq(length, sample spacing)' method. We would recommend the second one; it saves the trouble of writing a few extra lines of code. One important point to note here is that the sample spacing we used was defined on the frequency axis, which in our case was 1/(npts/sampling rate).
Detailed information on this method can be found here.

Using this function, we could set the correct scaling, but as we verified it we bumped into another issue: there were horizontal lines running all the way across the plot!

Notice blue horizontal lines in frequency domain
The culprit! Notice the line running all the way through!
We tried using a scatter plot as well, but it didn't do any good, and 'Googling' around wasn't much help either, which was pretty annoying. So we decided to have a look at the source code of 'numpy.fft.fftfreq()', since we hadn't faced this problem before. The cause then became apparent: 'fftfreq' returns a float array of frequency bins in which the first half contains the positive frequencies and the second half the negative ones; when 'pyplot.plot()' joins consecutive points, the jump from the last positive bin back to the negative bins is drawn as a line across the plot. To solve the problem, we plot only the first half of the x and y arrays.

As pointed out in the last post, the 'log' scale wasn't quite what we expected. This problem can be solved by plotting the absolute value of the spectrum instead, since all that matters is the amplitude of each frequency. At this point, we got the plots we desired.

FFT with proper scaling and windowing, plotted for half the range: a single channel (top) and all three channels (bottom)

The next big thing to be done is the picking algorithm; in particular, writing the KSigma picking algorithm in Python.


Thursday, June 6, 2013

The Frequency Transforms

It appears that seismologists have always been interested in the frequency content of an earthquake. Thinking along those lines, we tried plotting the fast Fourier transforms (FFTs) of a few earthquakes that were both significant in magnitude and properly recorded by the Phidgets.

ObsPy doesn't provide a comfortable way to create and directly plot FFTs from sources (e.g. SAC files, which contain the seismic data we use). However, NumPy does, and it is installed as an additional package when you install ObsPy.

The commands for creating FFTs of an array (or list) are very well listed here, along with sufficient description. Combined with 'matplotlib.pyplot', we were able to plot the data alongside its FFT.

There are a few things to be kept in mind though:
  1. Detrend: numpy.fft transforms only discrete data in array form, so the data in the SAC files needs to be converted into an array; while doing so, it helps to apply the detrend feature.
  2. ObsPy's plotting presents only very limited options; matplotlib, however, opens a new dimension. You just have to provide two lists, the sets of 'x' and 'y' points.
  3. Windowing: as explained by Prof. Bunn, the transform of the data will probably have a DC component, contributing to edge effects. This can be reduced by applying a window function; one can choose from several windows to treat this effect, depending on the requirements.
  4. Special attention has to be paid when selecting the number of points in the FFT, which depends closely on the number of points in the data and on the sampling rate.
  5. Using a logarithmic scale may help in distinguishing peaks. However, this depends mostly on the number of points in the FFT; in our case, when this was low, the graphs displayed improperly, as below.
    Piecewise frequency transforms, Anza '13 (S0153), showing problems due to the logarithmic scale
By applying these concepts, we were able to obtain transforms for the Anza quake which looked like the one below.
Piecewise 512-point FFT for the same event (and station); notice the disturbances in frequency towards the end, including high-frequency disturbances
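Putting the detrend and windowing points together, a minimal sketch (the file name is a placeholder):

    import numpy as np
    from obspy import read

    tr = read("event.sac")[0]                     # first trace in the SAC file
    data = tr.data - np.mean(tr.data)             # crude detrend: remove the DC offset
    window = np.hanning(len(data))                # Hann window to reduce edge effects
    amps = np.abs(np.fft.fft(data * window))
    freqs = np.fft.fftfreq(len(data), d=1.0 / tr.stats.sampling_rate)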


Further on, we shall:

  1. Try to focus on creating plots for the low-frequency band (0-10 Hz), which typically represents earthquake data. One challenge here is that there are not enough data sample points to create a sensible graph within this frequency band; the plot looks like the one below.
     FFT zoomed in to 0-10 Hz; notice that straight lines in the frequency distribution may indicate insufficient points in the sample data.
  2. Try to explain some of the high-frequency disturbances during the earthquake, which are otherwise absent. Prof. Bunn suspects this might be due to the location of the particular station.
  3. Explore 'dynamic plotting' of the data with time, i.e. a dynamic frequency plot displaying a live stream of data.
  4. Try to reduce the noise in the quiet-time data by averaging out the non-quake-time data.

Errors with the sampling rate

The plots shown in the previous post seemed to have some discrepancies; some of the subplots were not synchronized in time with the rest. Looking at the sampling rates and the number of data points in the SAC files revealed the issue: two of the stations had reported a sampling rate of 62.4999961853 Hz instead of 50 Hz, and another 6 stations reported 3002 data points instead of 3001 like the rest.

One way to resolve the first issue was simply to ignore those stations; another was to downsample the data to 50 Hz.
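The downsampling option maps directly onto ObsPy; a minimal sketch, assuming tr is one of the 62.5 Hz traces:

    tr_50 = tr.copy()
    tr_50.resample(50.0)    # Fourier-domain resampling down to the common 50 Hz rate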

For the stations reporting 3002 data points, the issue seemed trivial, and it looked like it would not make a difference to the plot. But that does not seem to be the case now that we have tried plotting it; as the plot below shows, there is clearly still an error.



Removing these 8 stations from the data set, however, resolves the issue. There should be a more elegant solution than simply eliminating data, but for now, it works!

It also took a lot of time to figure out how to get rid of the axis labels and ticks for 183 subplots while still having one set for the entire plot. It does look cleaner now.
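The tick removal boils down to something like the following matplotlib sketch; the grid shape and axis labels are only illustrative:

    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(nrows=61, ncols=3, figsize=(10, 60), sharex=True)
    for ax in axes.flat:
        ax.set_xticks([])                 # drop per-subplot ticks and labels
        ax.set_yticks([])
    fig.text(0.5, 0.04, "Time (s)", ha="center")                      # one shared x label
    fig.text(0.04, 0.5, "Counts", va="center", rotation="vertical")   # one shared y label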






Tuesday, June 4, 2013

Linux saves the day!

Having failed to get ObsPy to work on Windows 7/8, we next tried it on Linux (Ubuntu 12.04 LTS, codename Precise) on two machines; it worked on both. We thus finally have the setup ready to start analyzing the data and, hopefully, finding interesting patterns.

We had the data for the recent 4.7 magnitude quake in Anza. For each station, we plot the quake data, its fast Fourier transform, the data for the quiet time at that station and its fast Fourier transform. The plot for one such station is shown below; we are yet to analyze it.


We used an ObsPy utility to calculate the distance between the epicenter and each of the stations. We stored the station name and its distance from the epicenter in an array, and then sorted it in order of increasing distance. The data from each station was then plotted. We seem to be able to observe a pattern on looking closely. A slight modification of parameters such as the clearance between the subplots, the pixels per inch, and the thickness of the plot lines might make the visualization clearer.
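Current ObsPy versions expose this distance calculation as obspy.geodetics.gps2dist_azimuth (the 2013 release kept it under a slightly different module path); a minimal sketch with placeholder coordinates:

    from obspy.geodetics import gps2dist_azimuth

    # epicentre first, then the station; both coordinate pairs are placeholders
    dist_m, az, baz = gps2dist_azimuth(33.50, -116.46, 34.05, -118.25)
    dist_km = dist_m / 1000.0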


Data from 100 stations for the Anza earthquake plotted at 250 dpi for each subplot



Data from all 192 stations for the Anza earthquake plotted at 600 dpi for each subplot

Monday, June 3, 2013

Installation difficulties

The SAC (Seismic Analysis Code) file format was designed to help researchers with faster analysis and processing and with the generation of quality graphs. ObsPy is a Python-based open source project for processing seismological data, including data in SAC format. ObsPy has effective data manipulation and visualization tools, so it seemed a good idea to use it for the project.

We started by trying to install ObsPy on Windows 7/8, which took a while. To test whether it was working correctly, we tried the first example given in the ObsPy tutorial. The very first command, the one that imports the obspy.core module, gave the error: ImportError: No module named core.



A Google search of the error led us to a possible solution: copying the obspy.core-0.6.2-py2.7.egg file from C:\Python27\Lib\site-packages\Obspy\Lib\site-packages to C:\Python27\Lib\site-packages. It seemed that this would resolve the error, and it did.

But resolving that error did not seem worthwhile when the second command raised another: DistributionNotFound: obspy.core; this time an error that we could not resolve.


The installation was tried with Python 2.7.4 on a Windows 8 machine and with Python 2.7.5 on a Windows 7 machine; both resulted in the same error.

Friday, May 31, 2013

Project Background and Motivation

During our SURF (Summer Undergraduate Research Fellowship) at the California Institute of Technology, we will work towards understanding the different patterns in seismic data during quake and non-quake periods. We hope to apply machine learning techniques to this data to analyse these patterns and help improve early warning systems.

Earthquake data for the past several years has been collected by the California Integrated Seismic Network (CISN), the Community Seismic Network (CSN) at Caltech, and a few other such institutions. Current early warning systems, however, do not make intelligent use of this data to make the detection of earthquakes any quicker; they rely solely on picking algorithms which, although proven effective, could be made smarter with the predictive intelligence of machine learning techniques. Analysis of the patterns of past earthquakes can help select a few patterns that have been seen repeatedly over the years, to detect potential future threats.


Our research will primarily be based on the fields of data mining and machine learning, the line distinguishing the two being thin. The project will be guided by Prof. Julian Bunn, principal computational scientist at Caltech's Center for Advanced Computing Research, and Prof. Mani Chandy, professor of Computer Science.