Applying Machine Learning to identify Earthquake Patterns: June 2013

Friday, June 28, 2013

The TauP curves

TauP is a seismic time travel calculator; depending on the depth of the seismic activity and the location, it takes into account the earth composition to estimate the velocity of the seismic waves and thereby calculate the estimated arrival time of the wave at a particular location. It predicts the travel times and hence the arrival times for the P wave and the S wave. ObsPy also has a package for plotting the P wave. You can find the details here. We could not locate one for plotting the S wave.

To verify the accuracy of the ObsPy package, we ported the TauP code used by the CSN, from Java to Python. The plot below shows the actual picks in the S wave and the P wave and the estimated S wave and P wave locations.

The TauP plot for the Anza earthquake (4.7M)

The red points show the actual picks by the CSN stations. The magenta curve is the TauP estimate for the P wave of the quake and the blue curve is the estimate for the S wave.

Friday, June 21, 2013

Undersampling causes Aliasing!

The earthquake data from the CSN stations (in SAC format) is at a sampling rate of 50Hz. We put some thought into the possibility of something going wrong because of the sampling rate. We figured that it might be the case; undersampling causes aliasing!

If a 60 Hz sine wave is sampled at 50 Hz and then the Discrete Fourier Transform is taken (by the FFT algorithm), one might expect no data in the plot since there is no frequency in the range 0 Hz - 25 Hz. (25 Hz is 50/2 Hz, the Nyquist rate). But that is not the case. Since the data has been undersampled the 60 Hz signal distorts the Frequency Transform to bring up a peak at 10 Hz (60 Hz - 50 Hz). This is demonstrated in the plot below.

Aliasing caused by the 60 Hz signal sampled at 50 Hz

This could make a lot of the analysis go wrong, a peak at 1 Hz could have been caused by signals of 1 Hz, 49 Hz, 51 Hz, 99 Hz, 101 Hz etc. Further analysis in the frequency domain thus requires more information about the different frequencies the stations could possibly catch.

The x-axis for the FFT plot

A nice combination of NumPy and ObsPy enables us to quickly plot FFTs from the event SAC files. Finding FFT is the easy part, thanks to the 'numpy.fft' method. The main roadblock for us, as it turned out, was the x-axis scale part.

To plot a FFT, matplotlib.pyplot needs to be provided with two arrays, sets of y points and corresponding x points. NumPy.fft does the former for us. Latter can be created through another method of Numpy: 'fft.fft.fftfreq (n=number of points,d=spacing in sample)'.

We were able to locate this function quickly, however we committed mistakes in implementing it, all because we were giving it the wrong arguments. The documentation declares 'd' as spacing, which we thought was the spacing between two consecutive data points along the x-Axis in the frequency domain.
// d= sampling_freq/total Number of points

This turned out to be wrong, and gives incorrect domain to the the DFT, and the scale showed ticks above Nyquist frequency. So we looked at the source code, and found that 'd' represented distance between two sample in the time domain.
// d= 1/sampling_freq

This gave us the x-scale that was expected, i.e. within the range of Nyquist frequency. Below is one such plot showing the correct scale, for all the three channels.

Stacked FFTs, with sampling rate of 50Hz. Notice the max freq on X axis is half sampling rate, or the Nyquist frequency

Monday, June 17, 2013

The picking algorithm

It seemed like the frequency transforms would make more sense if they were separated into Quiet Time and the Quake Time data at the point of the first pick. This required us to port the current picking K-sigma algorithm from Java to Python. While this seemed like an easy task, a particular mistake that we had made in the Python code, lead us to disastrous results. The error, not being a syntactical one, required some debugging to be pointed out. Once that was resolved, the code was up and running. To be sure that there were no more errors, we compared the results of the Java code with that of the Python code; both reported the same data point as the one with the first pick for all the stations. Well, that was some progress!

Sunday, June 16, 2013

Frequency Transforms: Debugging!

In the last post, we stated how to get started with plotting the Frequency Transforms of seismic data from SAC files. We also complained about the plethora of difficulties faced regarding proper visualization of data. The good news, this week was that we were able to solve most of them and write a Python module that could create a frequency transform with proper scale and visualization requirements.

In this post we shall list out how we solved some of the roadblocks.

The first thing to do is to set the scale of x-axis in frequency domain. We can use 'numpy.fft.fft (array)' to return a complex FFT array of the input data. However to plot them, is needed another array, containing X-points corresponding to these y-points. This can be done in two ways, manually, or by using the 'numpy.fft.fftfreq (length,sample spacing )' method. We would recommend the second one, it saves one trouble of writing a few extra lines of code. One important point to note here is that the sample spacing is defined on the frequency axis, which in our case was 1/(npts/sampling rate).
Detailed information on this method can be found here.

Using this feature, we could set the correct scaling, and as we verified the scaling, we bumped into another issue: there were horizontal lines in the graph running across the plot dataset!

Notice blue horizontal lines in frequency domain

The culprit! Notice the line running all the way through!

We tried using the Scatter plot as well, but it didn't do any good. 'Googling' around wasn't too much help either; which was pretty annoying. So we decided have a look at source code of 'numpy.fft.fftfreq()', as we hadn't faced the problem prior to this. A simple bug was palpable in the source code. The 'fftfreq', method returns a float array, containing frequency bins, however first half of the array contains bins >0 , while the second half contains bins<0, hence to connect last and first points two halves, 'numpy.plot()' joins them, for the sake of continuity. To solve the problem, we plot half the range on x and y.

As pointed out in the last post, the 'log' scale wasn't quite what we expected. This problem can be solved by plotting the absolute value of intensity rather, since all that matters is the amplitude of a particular frequency. We got the plots like we desired, at this point.

FFT with proper scaling, windowing plot for half range: Single channel (top) and all three channels (bottom)

Next big thing to be done is the picking algorithm, and in particular, writing the K-Sigma picking algorithm in Python.

Thursday, June 6, 2013

The Frequency Transforms

It appears that seismologists have always been interested in frequency aspects of an Earthquake. Thinking on those lines, we tried plotting the fast fourier transforms(FFT) of few earthquakes, both significant in magnitude and properly recorded by Phidgets.

Obspy doesn't provide a comfortable way to create and directly plot FFT's from sources (ex:SAC files, which contains seismic data we use ). However, numpy does, and it is installed as an additional package when you install Obspy.

Commands for creating FFT's of an array (or list ) are very well listed out here. , along with sufficient description.
Combined with 'matplot.pyplot', we were able to Plot data along their FFT's.

There are few things to be kept in mind though :

Detrend: numpy.fft transforms only discrete data in array form. However data in SAC files needs to converted in an list. One way to do so use detrend feature of numpy.
Obspy Plotting presents only very limited options for plotting, matplotlib, however opens a new dimension. You just have to provide two lists, sets of 'x' and 'y' points.
Windowing: As explained by Prof. Bunn, the transform of data would probably have DC component, contributing to edge effects. This can be reduced by applying transformation windows. One can choose from several windows to apply to treat this effect, depending upon requirenments.
Special attention has to be paid in selecting number of points in your FFT plots, which closely based on points in data and Sampling Rate.
Using lograthmic scale may help in distinguishing peaks. However, this depends mostly on number of points you have for your FFT, which in my case, if low displayed imporper graphs.

Piece Wise Frequency Transforms, Anza '13 (S0153) , showing problems due to logarithmic scale

By applying thses concepts, I was able to obtain transforms for Anza quake which looked the one below.

Piece Wise 512-point FFT for same event(& station), notice disturbances in frequency towards the end, including high freqeuncy disturbances

Further on, we shall:

FFT zoomed in for 0-10 Hz, notice that straight lines in frequency distribution may indicate insufficient points in sample data.

Try to focus on creating plots for low frequency band, (0-10 Hz, which typically represents earthquake data). One challenge to this part is that not enough data sample points are present to create sensible graph within this frequency band, and plot looks like below.
Also we may try to explain some of the high frequency disturbances during earthquake, which are otherwise absent. Prof. Bunn suspects this might be due location of the particular station.
Another useful implementation of this system could be 'dynamic plotting' of data, with time, or dynamic frequency plot displaying live stream of data.
Try to reduce noise in quite time data by using averaging out the non quake time data.

Errors with the sampling rate

The plots shown in the previous post seemed to have some discrepancies; some of the subplots were not synchronized in time with the rest. Looking at the sampling rates and the number of data points for the SAC files revealed an issue. Two of the stations had reported a sampling rate of 62.4999961853 Hz instead of 50 Hz. Another 6 stations reported 3002 data points instead of 3001 like the rest.

One way to resolve the first issue was to simply ignore those stations, another was to downsample the data to 50 Hz.

For the stations reporting 3002 data points, the issue seemed trivial and it looked like it would not make a difference to the plot. But that does not seem like the case now that I have tried to plot it. The plot is shown below; clearly there is still an error.

Removing these 8 stations from the data set however resolves the issue. There should be a more elegant solution than simply eliminating data points, however for now, it works!

It also took a lot of time to figure out how to get rid of the axis labels and ticks for 183 subplots, while still having one for the entire plot. It does look cleaner now-

Tuesday, June 4, 2013

Linux saves the day!

Having failed in trying to get Obspy to work on Windows 7/8, we next tried it on Linux (Ubuntu 12.04 LTS codename precise) on two machines; it worked on both. We thus finally have all the setup ready to start analyzing the data and hopefully finding interesting patterns.

We had the data for the recent 4.7 magnitude quake in Anza. For each station, we plot the quake data, its fast Fourier transform, the data for the quiet time at that station and its fast Fourier transform. The plot for one such station is shown below; we are yet to analyze it.

We used an Obspy class- to calculate the distance between the epicenter and each of the stations. We stored the station name and corresponding distance from the epicenter in an array, and then sorted it in the order of increasing distance. The data from each station was then plotted. We seem to be able to observe a pattern on looking closely. A slight modification of different parameters like the clearance between the subplots, pixels per inch and the thickness of the plot lines might make the visualization clearer.

Data from 100 stations for the Anza earthquake plotted at 250 dpi for each subplot

Data from all 192 stations for the Anza earthquake plotted at 600 dpi for each subplot

Monday, June 3, 2013

Installation difficulties

The SAC (Seismic Analysis Code) file format was designed to be of use to researchers for faster analysis and processing and for the generation of quality graphs. A Python based open source project for processing seismological data in sac format is Obspy. Obspy has effective data manipulation and visualization tools; it seemed a good idea to use it for the project.

We started with trying to install Obspy on Windows 7/8 which took a while. To test if it was working correctly we tried the first example given in the Obspy tutorial. The very first command, the one that imports the obspy.core module gave the error- ImportError: No module named core.

A google search of the error led to us a possible solution: Copying the obspy.core-0.6.2-py2.7.egg file from C:\Python27\Lib\site-packages\Obspy\Lib\site-packages to C:\Python27\Lib\site-packages. It seemed that the error would resolve now, and it did.

But resolving the error did not seem worthwhile when the second command raised an error- DistributionNotFound: obspy.core; this time an error that we could not resolve.

The installation was tried with Python 2.7.4 on a Windows 8 machine and with Python 2.7.5 on a Windows 7 machine; both resulted in the same error.