There are two extremely useful numerical algorithms I end up using almost all the time when processing data. This was particularly true during my final-year undergraduate project and my PhD, when, like most people, I had raw data which had to be processed before conclusions could be drawn. Here I will outline how to use them in python (which I think is the most useful and fastest all-purpose language to code in) with a specific example, which is explained at the end.
Levenberg-Marquardt: Curve fitting
It is often useful to fit an analytical function to some raw, messy data. In this case, a numerical algorithm iterates through different values of the input parameters until a mathematical function lies as close as possible to the data's dependent variable. The aim here might be to extract some hypothesized physical parameters; for example, Planck discovered the constant named after him when fitting a function to black-body spectra.
In python, suppose we have two lists (of the same length) x and y. First, we define the function we think best describes the data, or one which has the most scientific relevance. Here I have chosen a linear function combined with a shifted hyperbolic tangent:

y(x) = m*x + c + B*tanh((x - x0)/w)
We want to use scipy's well-optimized curve_fit routine (which we import into our script) to fit this to our data. First, we define the right-hand side of the above equation as fit_function, where the first argument is always our independent variable (here it is x) and the others are the fit parameters. We then provide this function and the data as arguments to curve_fit:
from numpy import *
from scipy.optimize import curve_fit

def fit_function(x,m,c,B,x0,w):
    return m*x+c+B*tanh((x-x0)/w)

p,errors=curve_fit(fit_function,x,y,p0=[10.0,10.0,10.0,20.0,10.0])
y_fit=fit_function(copy(x),p[0],p[1],p[2],p[3],p[4])
We have made the algorithm's job a lot easier by specifying an initial guess p0 for the parameters. This is generally useful; without it, the fit may converge slowly, or get stuck in a basin of attraction far from the optimal result. The fitted parameters are returned in the array p, along with a covariance matrix. To plot the result, we use these parameters to define y_fit, which allows us to plot the following:
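To actually produce such a plot (and, if desired, read off rough one-sigma uncertainties from the covariance matrix), something along the following lines works; this is a minimal sketch of my own, relying on the imports above, with the plotting style chosen purely for illustration:

import matplotlib.pyplot as Plot
#One-sigma uncertainties on the fit parameters, from the diagonal of the covariance matrix
parameter_errors=sqrt(diag(errors))
#Plot the raw data as points and the fitted curve as a line
Plot.plot(x,y,'o',label='data')
Plot.plot(x,y_fit,'-',label='fit')
Plot.legend()
Plot.show()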
However, the linear coefficient m = -0.11 has come out negative. This may well be the best fit to our data, but suppose that we demand that it be positive (for physical reasons). We can then trick the algorithm by making our function return an absurd value if m is negative:
def fit_function2(x,m,c,B,x0,w):
    if m<0.0:
        return ones((len(x)))*1E100
    else:
        return m*x+B*tanh((x-x0)/w)+c
This does indeed give a positive (albeit small) value of m = 1.53E-8 and the following figure:
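As an aside, more recent versions of scipy (0.17 onwards) let curve_fit enforce such constraints directly through a bounds keyword, which switches internally to a constrained least-squares method. A minimal sketch, assuming the same data and fit_function as above:

#Constrain m to be non-negative; all other parameters are left unbounded
p_bounded,cov_bounded=curve_fit(fit_function,x,y,p0=[10.0,10.0,10.0,20.0,10.0],
                                bounds=([0.0,-inf,-inf,-inf,-inf],inf))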
Brent: Root finding
Now that we have an analytical model for our data, we can draw some conclusions from it. For example, suppose we want to predict the value of x for which y = 17.75. There is no analytic inverse function for such a mix of linear and hyperbolic tangent terms. To solve the equation numerically, we rearrange it so that we are finding the root of an expression, namely:

17.75 - y(x) = 0
The most efficient generic algorithm to do this was invented by Richard Brent. It combines the method of bisection with interpolation, meaning that it requires two bounds (a,b) on the root; for it to work, the expression should go through zero once and only once in this interval. Strictly, the algorithm looks for a change in sign of the function, which can also happen when the function passes through infinity, as 1/x does at x = 0.
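As a toy illustration of this bracketing requirement (my own aside, using the scipy routine introduced below), we can find the square root of 2 as the root of x**2 - 2, which changes sign exactly once between 0 and 2:

from scipy.optimize import brentq
#f(0)=-2<0 and f(2)=2>0, so the root is bracketed and unique in (0,2)
print(brentq(lambda x: x**2-2.0, 0.0, 2.0)) #approximately 1.41421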
Brent's algorithm is implemented in scipy in a function called brentq, which takes as its inputs a function and the two bounds. The function argument must itself take only a single input; this is a problem in our case, because we also need to supply all the parameters calculated in the previous section to our fit function. We get around this by using python's lambda notation, which lets us wrap a function of several variables into a function of just one. Our implementation is as follows:
from scipy.optimize import brentq
threshold_1775=brentq(lambda x: 17.75-fit_function2(x,p[0],p[1],p[2],p[3],p[4]), 10.0,30.0)
We have checked the graphs from earlier to get the bounds (10,30) between which our result lies, and have also checked that y = 17.75 has only a single solution in this region (otherwise there would be multiple roots and the algorithm would fail). The result is x = 24.7.
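As a quick sanity check (purely illustrative), substituting this root back into the fitted function should recover a value of about 17.75:

#Evaluate the fitted curve at the root returned by brentq
print(fit_function2(threshold_1775,p[0],p[1],p[2],p[3],p[4])) #should be ~17.75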
The data
In fact, the data shown corresponds to gender diversity in physics: here x + 1950 is the year and y is the percentage of Physics A-levels awarded to women. Many thanks to @KarenLMasters (with the raw data available here [1] [2]) for providing the following graph:
I'm using it here as an example, without getting into a discussion of its societal and cultural meaning. This is the curve fit we obtained with curve_fit:
Depressingly, as we see above, the best fit is a hyperbolic tangent (meaning that the percentage has "plateaued" at around 23%) with a negative or at best an insignificantly small linear trend on top. It is up to teachers, politicians and the rest of society to get rid of any linear transient and get B = 50%.
Extracting numerical data from screenshots
As the raw data was difficult to obtain, I ended up using a python script to extract it from the JPEG figure itself, which is also an extremely useful trick for scientists. The script takes a minimalist version of the graph in question, meaning that the image file must end at the axes of the graph - all white space and text has to be removed. I provide the script to extract the data from the following minimalist image RawData.png here:
The broad idea is to find the image co-ordinates of (in this case) any blue pixels and then transform to graph co-ordinates. Given that there is one star per year, this allows the center of each star to be identified easily (rather than, say, its points).
#Needed for PNG
import Image
from numpy import *
#Other stuff
import matplotlib.pyplot as Plot
from scipy.optimize import curve_fit
#~~# Graph data acquisition
imageFile="RawData.png" #The image here has to be the axes only (no padding, no labels)
x_limits=array([1950.0,2020.0]) #Fill in these limits from the graph labels
y_limits=array([10.0,26.0])
im1=Image.open(imageFile)
rgb_im = im1.convert('RGB')
dimensions=rgb_im.size # 0 - x, 1 - y
X=linspace(x_limits[0],x_limits[1],dimensions[0])
Y=-ones((dimensions[0]))
#Scan down each pixel column until a blue pixel (high blue, low red) is found
for idx_x in range(0,dimensions[0]):
    convergence=0
    while convergence==0 and Y[idx_x]<dimensions[1]-1:
        Y[idx_x]+=1
        if rgb_im.getpixel((idx_x,int(Y[idx_x])))[2]>220 and rgb_im.getpixel((idx_x,int(Y[idx_x])))[0]<40:
            convergence=1
#Convert pixel rows (measured from the top of the image) into graph co-ordinates
for idx_y in range(0,dimensions[0]):
    Y[idx_y]=(y_limits[1]-y_limits[0])*(dimensions[1]-Y[idx_y])/dimensions[1]+y_limits[0]
#**# Analysis
years=linspace(1951,2016,num=66)
percentage=zeros((len(years)))
def find_nearest(x,x_desired):
    #Return the index of the element of x closest to x_desired
    nearest_idx=0
    nearest_difference=abs(x[0]-x_desired)
    for idx in range(1,len(x)):
        difference=abs(x[idx]-x_desired)
        if difference<nearest_difference:
            nearest_idx=idx
            nearest_difference=difference
    return nearest_idx
#Sample the extracted curve at each year of interest
for idx in range(0,len(years)):
    percentage[idx]=Y[find_nearest(X,years[idx])]
#Output
output_file=open('ExtractedData.txt', 'w')
for output_idx in range(0,len(years)):
    print >> output_file, years[output_idx], percentage[output_idx]
output_file.close()
Plot.figure(1)
Plot.plot(X,Y)
Plot.plot(years,percentage,'o')
Plot.show()
References
Thanks again to Karen Masters for making me aware of the following data, accessed at the time of writing (6th December 2016):
[1] http://www.iop.org/policy/statistics/gender/page_67095.html
[2] http://www.gatsby.org.uk/uploads/education/reports/pdf/18-physics-in-schools-patterns-and-policies-august-2006.pdf