Jupyter, Venus, the Moon and Gibbs Hill weather
# magic
#!pip install astropy
import astropy
astropy.__githash__
I have been thinking a lot about the relative merits of free and non-free software and have started writing a piece about that.
This week the Intercept published a piece on how XKEYSCORE, the NSA's Google on steroids, works under the hood.
Micah Lee, an expert in information security and a Linux enthusiast, was heavily involved in the piece. It is always good when people who really understand the technology are involved like this.
XKEYSCORE is built on the free software LAMP stack: Linux, Apache, MySQL and Perl/Python/PHP. Red Hat provides the Linux.
So, the NSA uses free software for its critical infrastructure. Further, it depends on free software to keep its operations secure.
This revelation fundamentally changes the debate about the relative merits of free and non-free software from a security point of view.
On the one hand it is a ringing endorsement for LAMP: it is secure enough and robust enough for critical infrastructure.
Writing about information security is difficult. Every facet is double-edged, sometimes multi-edged.
Clearly the NSA engineers understand Linux very well. They are using it to develop their systems. They know how to be productive in that environment. In short, free software works.
They themselves work in a very collaborative environment. They have highly intelligent, passionate and ingenious people who love a challenge. And they use free software.
They will understand how Linux works at a pretty deep level and know how to make it run well, how to get the most out of it.
So do Google and a multitude of other organisations doing high performance computing, where shaving 5% off the resources can make a big impact. Mostly though they will be looking to cut things by an order of magnitude, or two.
We are seeing daily how insecure data stored on computers is. Then there is the question of the integrity of that data: how much of it can we really trust?
Individuals have three types of information based on how many people they wish to share that information with:
The first two are just about feasible to handle with today's technology and software.
The third is extremely difficult with today's technology. Witness all the security breaches in the news.
It may not even be a technological problem. Rather, it is a social issue. Or rather it is an immensely complex mix of social issues.
In the context of information on computers there are so many issues it is hard to know where to begin. It looks like a fairly radical re-think is in order.
The good news is there is a lot of excellent work being done in the free software community. There are many very smart people working on some very difficult problems and making good progress.
Meanwhile there are others that are fixing problems in existing systems and helping the users of their software to plug holes too. This is how free software works.
Now in the public debate we are often given stark choices. If X happens the sky will fall in.
These arguments reinforce the mistaken belief that the choices are so stark or indeed so simple.
As an example, there is a thing called homeomorphic encryption. The idea is to take some data and to turn it into something else that has the same sort of structure, but has thrown away some information in the process.
There is a whole cottage industry of people these days who will sell you tools to anonymise data, mostly for privacy reasons.
The data is invaluable for strategic planning, but it is very difficult to get the balance between respecting privacy and the good of the community. And anonymising data is really, really hard to do well.
Especially if others are anonymising related data. And they are.
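As a toy illustration of the idea, the sketch below (the names, fields and salt are all invented for the example) turns a record into something with the same shape but less identifying detail. Even this kind of transformation is easy to undermine once related datasets exist.

import hashlib

SALT = 'keep-this-secret'   # made-up secret for the example

def pseudonymise(record):
    """ Return a record with the same shape but less identifying detail. """
    digest = hashlib.sha256((SALT + record['name']).encode()).hexdigest()
    return {
        'id': digest[:12],                       # stable pseudonym, name dropped
        'age_band': 10 * (record['age'] // 10),  # 37 becomes 30
        'parish': record['parish'],              # kept, but still a quasi-identifier
    }

print(pseudonymise({'name': 'Alice Outerbridge', 'age': 37, 'parish': 'Paget'}))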
I explain homeomorphism in more detail later.
I live in Bermuda. The island is small. There are about 1.5 degrees of separation: everybody knows everybody.
Children learn from a young age how fast the Bermuda grapevine can be. If they are doing something they shouldn't be doing the news invariably gets home to mum before they do.
As a result, Bermudians tend to be very respectful of people's privacy. They have their own forms of homeomorphic encryption in the way they relate stories, often using terms such as ace-boy or ace-girl to protect identities.
Security researchers can learn a lot from the way these people handle information.
Yet another interesting twist in the world is the restoration of more normal relations between Cuba and the US.
I learned the other day that the Cubans have their own version of Ubuntu, called Nova.
Information security currently comes down to a single question: who or what are you going to trust?
For now, I am placing my trust in the free software community. It may not be perfect, but by my judgement it is by far and away the best option right now.
There is a thing in mathematics called homeomorphism. I expect the term turns up in many other areas too.
In mathematics it means some sort of transformation of an object that leaves certain properties unchanged. There are different kinds of homeomorphism in different branches of mathematics.
In topology, if the objects are made of rubber that can be deformed, and you can change one object into the other without doing things like cutting the rubber or filling in holes, then you have a homeomorphism.
Topologists are people who think a mug and a doughnut are the same thing.
For those interested in some mathematics behind information, Shannon's Theorem is a good place to start.
Work in Progress -- starting to add commentary and tidy up
I connected a BMP180 temperature and pressure sensor to a raspberry pi and have it running in my study.
I have been using this notebook to look at the data as it is generated.
The code uses the Adafruit python library to extract data from the sensor.
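For reference, a minimal sketch of the sort of logging loop involved is below. It assumes the Adafruit_Python_BMP library (its BMP085 driver also handles the BMP180) and a light.csv that already has a header row matching the columns used later in this notebook; the column order and the one minute polling interval are my assumptions, not the actual script.

import csv
import time
from datetime import datetime

import Adafruit_BMP.BMP085 as BMP085

sensor = BMP085.BMP085()

with open('light.csv', 'a') as out:
    writer = csv.writer(out)
    while True:
        writer.writerow([
            datetime.now().isoformat(),       # date
            sensor.read_temperature(),        # degrees C
            sensor.read_pressure(),           # Pascals
            sensor.read_altitude(),           # metres
            sensor.read_sealevel_pressure(),  # Pascals
        ])
        out.flush()
        time.sleep(60)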
I find plotting the data is a good way to take an initial look at it.
So, time for some pandas and matplotlib.
# Tell matplotlib to plot in line
%matplotlib inline
# import pandas
import pandas
# seaborn magically adds a layer of goodness on top of Matplotlib
# mostly this is just changing matplotlib defaults, but it does also
# provide some higher level plotting methods.
import seaborn
# Tell seaborn to set things up
seaborn.set()
# just check where I am
!pwd
infile = '../files/light.csv'
!scp 192.168.0.133:Adafruit_Python_BMP/light.csv .
!mv light.csv ../files
data = pandas.read_csv(infile, index_col='date', parse_dates=['date'])
data.describe()
# Let's look at the temperature data
data.temp.plot()
Looks like we have some bad data here. For the first few days things look OK, though. To start, let's look at the good part of the data.
data[:4500].plot(subplots=True)
That looks good. So for the first 4500 samples the data looks clean.
The pressure and sealevel_pressure plots have the same shape.
The sealevel_pressure is just the pressure recording adjusted for altitude.
Actually, since I am not telling the software what my altitude is, the two series are effectively the same.
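For what it is worth, the usual adjustment (the one in the BMP085/BMP180 datasheet) looks like the sketch below; with the default altitude of zero it returns the pressure unchanged, which is why the two curves match.

def sealevel_pressure(pressure_pa, altitude_m=0.0):
    """ Standard barometric adjustment; with altitude 0 the pressure is unchanged. """
    return pressure_pa / (1.0 - altitude_m / 44330.0) ** 5.255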
It is a bit of a mystery what is causing the bad data after this.
One possibility is I have a separate process that is talking to the sensor that I am running in a console just so I can see the current figures.
I am running this with a linux watch command. I used the default parameters and it is running every 2 seconds.
I am wondering if the sensor code, or the hardware itself, has bugs when the code polls the sensor whilst it is already being probed.
I am now (11am BDA time July 3rd) running the monitor script with watch -n 600 so it only polls every 10 minutes. Will see if that improves things.
So, let's see if we can filter out the bad data.
data.temp.plot()
# All the good temperature readings appear to be in the 25C - 32C range,
# so let's filter out anything far outside that range.
data.temp[(data.temp < 50.0) & (data.temp > 15.0)].plot()
That looks good. You can see eight days of temperatures rising through the day and then falling at night. There is only a couple of degrees' difference here in Bermuda at present.
On the third day, where there is a dip in temperature, I believe there was a thunderstorm or two, which cooled things off temporarily.
I really need to get a humidity sensor working to go with this.
Now let's see if we can spot the outliers and filter them out.
def spot_outliers(series):
    """ Compares the change in value in consecutive samples to the standard deviation.

    If the change is bigger than that, assume it is an outlier.

    Note that there will be two bad deltas, since the sample after the
    bad one will be bad too.
    """
    delta = series - series.shift()
    return delta.abs() > series.std()
outliers = spot_outliers(data)
# Plot temperature
data[~outliers].temp.plot()
data[~outliers].altitude.plot()
data[~outliers].plot(subplots=True)
data[~outliers].sealevel_pressure.plot()
def smooth(data, thresh=None):
    """ Zero out implausibly large jumps between consecutive samples. """
    if thresh is None:
        thresh = data.std()
    delta = data - data.shift()
    # keep the deltas that look sane, treat the rest as no change
    good = delta.abs() < thresh
    print(delta[good].describe())
    return delta.where(good, 0.0)
smooth(data).temp.cumsum().plot()
smooth(data).describe()
start = data[['temp', 'altitude']].iloc[0]
(smooth(data, 5.0).cumsum()[['temp', 'altitude']] + start).plot(subplots=True)
Bingo! We have clean plots. Of course, the irony is that I also seem to have found the cause of the bad data I was getting: don't have two processes querying these sensors at the same time, at least not with the current software. So the recent data no longer needs this smoothing.
So the daily rise and fall of temperature is pretty clear. There is only a 2C spread most days.
The pressure plot is more interesting. Over the last week or so it has been generally high, but there is an interesting wave feature.
The other day I was at the Bermuda Weather Service and mentioned this to Ian Currie, who immediately pointed out that air pressure is tidal.
So, my next plan is to dig out scikit-learn and some lunar data, maybe using astropy and see if we can fit a model to the pressure data for the tidal component.
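A rough sketch of what that might look like is below. It assumes the `data` DataFrame from above, uses astropy (`get_moon`, or `get_body('moon', ...)` in newer versions) to compute the Moon's altitude over Bermuda, and asks how much of the pressure variation a single linear term in that altitude can explain. The coordinates are approximate and the model choice is just a starting point.

import numpy
from astropy import units as u
from astropy.coordinates import AltAz, EarthLocation, get_moon
from astropy.time import Time
from sklearn.linear_model import LinearRegression

# approximate location of the sensor
bermuda = EarthLocation.from_geodetic(lon=-64.8 * u.deg, lat=32.3 * u.deg)

times = Time(data.index.to_pydatetime())
moon = get_moon(times).transform_to(AltAz(obstime=times, location=bermuda))

X = numpy.column_stack([moon.alt.deg])
y = data.sealevel_pressure.values
model = LinearRegression().fit(X, y)
print(model.score(X, y))   # fraction of variance explained by the Moon's altitude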
I have been thinking much about information security over the last few months.
Over the years I have thought about the relative merits of free software versus non-free.
Free as in freedom, or logiciel libre if you prefer. Free software comes with important freedoms. You are free to examine how it works, make changes and experiment.
Free software is everywhere, for example, `openssl`_ is used in many different operating systems.
Even when you are using non-free, you are almost surely using lots of free software at the same time. Depending on how that software is licensed you might be able to find just which software is used.
I find one key differentiation in licenses is in the restrictions they choose to put on the licensees.
So the most permissive licenses tend to be along the lines of, "here is some software, do what you like with it, don't sue me".
Others might insist on attribution (for example if you modify it and pass the resulting application onto somebody else) and have things to say about trademarks etc.
You can find broader discussion of licenses at the following links:
The key clause of the `General Public License`_ is that it insists you pass on the same rights you received to anyone you pass the software on to.
The idea is that recipients should have all they need to explore, modify and experiment with the code.
Things are complicated. On many levels the major effects apply regardless of the license.
Non-GPL licenses have been adopted more readily by proprietary vendors; for example, `Apple's OS is derived from BSD Unix`_.
Python has a BSD-style license. The Python project seems to want to make the language available to anyone who is interested in it. The license does say that if you create a derivative work you need to include a brief summary of the changes made to Python.
The license begins with some history of Python, and some background on `Guido van Rossum`_.
The GPL has also been used as a way to encourage users to license the product under a more permissive, commercial license instead. One notable example was `MySQL`_, which was licensed under the GPL, but that dual-licensing model enabled the company doing the bulk of the development and driving the project to earn license income from those who appreciated the support.
It allowed other businesses to build a business on top of MySQL, but with their own custom adaptations. Often these adaptations would eventually end up in the project itself. At some point it is better to share the maintenance burden of a new feature. Often that point is now.
Python makes itself accessible and hence it has become ubiquitous.
The main effect of licenses is that the non-GPL software tends to be more widely adopted by commercial organisations. The main reason is it allows them to produce derived works and not have to distribute their own customisations and improvements.
However, for smaller commercial organisations it is generally more effective to work with the project itself and share their work.
Keeping your fixes proprietary comes with costs as well as benefits. A number of questions you need to ask are:
- provided tools others can use freely.
- supports free experimentation, gives a world of ideas which you can explore with others
- created tools to help with collaborative working
- found ways to collaborate across the internet
- enabled local people to gain skills needed in their environment
- runs on old hardware
- it is a significant voice in the debate on information security and privacy. It will very likely play an important role.
Since people have access to the software and are free to experiment and explore how it works they can look for potential security holes, report the problems and perhaps fix them if they have the skills.
Of course, the bad guys also have this advantage over non-free software.
When you work with free software you are showing the world your code. This takes a certain amount of courage. Anyone who has taken part in a code review knows that showing your code to others is baring your soul.
Most effective code shops have some form of code review in their development process. Two pairs of eyes are always good. It helps share knowledge, educate both parties and create better software.
Once developers get comfortable with sharing with their peers, sharing with the wider world becomes less daunting.
Positive feedback can be very helpful here: you get an immediate benefit for sharing.
Many software engineers (and I include myself in this) are quite insecure about their code. Most programmers are `sort of average`_, but knowledgeable enough to know that with more time and research their code could be better.
Regardless, code that is shared openly is likely to be of a `higher quality`_, depending on the metric you choose to use to measure quality. The authors of the code will likely care about their reputation in the free software community and hence take care to share quality work, or at least identify the code as a quick hack or whatever.
With non-free software you are working with a black box. You get to choose inputs and observe outputs. If you are lucky you can learn something about how the code works, but it is much harder than with free software.
If you have an executable there are tools that will allow you to take the binary code and create human readable assembly code. This is generally missing comments and variable names. It is a low level description of the code, closer to the final op codes that a computer runs.
However, for those with skill and experience, reverse engineering is a powerful technique.
Sometimes the license will explicitly say you must not reverse engineer the code. Of course, bad guys will not necessarily obey the license.
However, security researchers will often decide not to break the license. The result is that only the bad guys are looking closely at the code for vulnerabilities.
This is not a place you want to be.
This week I spent two half days trying to teach about free software with raspberry pi's.
The students were all local, in full time further education in IT fields and hoping to have a career in IT.
There were 4 students, a mixed group. Fortunately for me, a previous graduate of this summer programme came along to help out.
The first session was a lesson in the problems of working with tech in education. I had made a visit to the room for the training the previous week.
I was supplying raspberry pi's and SD cards, but we needed monitors, keyboards, mice, HDMI cables and wired network connections.
Now we were only able to find monitors with DVI ports, not HDMI and only two monitors. Fortunately, I had DVI to HDMI adaptors and we managed to cobble together the remaining bits and pieces.
I'll skip the problems we had connecting to the network and perhaps cover that in a future post on security and other matters.
Free software was mostly new to the students. Where to begin? I wanted to show them linux through the command line. I wanted them to start to develop a better understanding of how a computer actually works, one of the goals the raspberry pi project shares.
My helper was fantastic. Since there were two pi workstations set up, he worked with one pair of students and me with the other.
One goal on the first day was to introduce the students to version control using git on the command line.
Now we soon hit the editor problem. There is always the dilemma between showing powerful tools with a steep learning curve and simple, quick to learn tools in this sort of training.
I wanted to give the students a glimpse of emacs, in part because it is a classic free software tool.
I first encountered emacs around 1985 when attending an Introduction to Unix course at a local technology college in the UK. The course was a couple of days and they taught us some simple C-shell and an introduction to Unix systems.
Since an editor was needed for the examples, they showed us emacs. I recall writing a review to the effect that whilst emacs seemed to be super powerful, it took a disproportionate amount of the time for the course.
My next experience with emacs was around 1989. My workplace had acquired shiny new Sun workstations. Running Unix. So the first thing I needed was an editor. A colleague explained there were two practical choices: emacs and vi. Emacs had a vi mode, so basically that sealed it. Emacs it was.
This time I was going to be using it to write code. An investment of a few hours learning how to use it well seemed worthwhile, so I read the tutorial. Soon I was hooked. This thing was so much more powerful than anything I had used before.
I am still using emacs, some 25 years later. I've used it to edit code in fortran, C, perl, tcl, python, lisp and who knows what else. I've played tetris, read email, browsed newsgroups, read twitter, run ipython notebooks, used git, read man pages and who knows what else.
For the training though, I probably should have just pointed the students at this raspberry pi page on editors.
By the end of the first session the students had created a git repository and were able to make changes to files, stage the changes and commit them to the repository.
I had a bit of a re-think after the first session. We decided to bring in some more equipment to give us more options in the room.
The students are on a 3 month programme, involving internships in local firms, with 1 week per month in training.
They also have to do some sort of project for the course. One idea they are considering is to create a website which provides information on the public transport on the island.
I decided to structure the afternoon around how they might go about this if they wanted to run it as a free software project.
This gave an opportunity to introduce the students to github and build on the introduction to git that we had started in the previous session.
The bus application is challenging here in Bermuda as much of the data needed for the application does not appear to be available in machine readable form.
For most free software bus applications, having your schedule and route data in the General Transit Feed Specification format allows you to take advantage of a lot of work done in other jurisdictions.
The good news for Bermuda is that its bus and ferry network is small, so even if this data has to be entered by hand it should not take too long. Further, the students could always concentrate on one or two key routes while they iron out the glitches.
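To give a flavour of why the GTFS route is attractive: a feed is just a zip archive of CSV text files, so standard tools can read it directly. The sketch below assumes a hypothetical bermuda-gtfs.zip; no such feed exists yet.

import zipfile

import pandas

with zipfile.ZipFile('bermuda-gtfs.zip') as gtfs:
    stops = pandas.read_csv(gtfs.open('stops.txt'))

# stop_id, stop_name, stop_lat and stop_lon are standard GTFS fields
print(stops[['stop_id', 'stop_name', 'stop_lat', 'stop_lon']].head())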
I had hoped to introduce a little python programming, but this will have to come in later sessions.
In the meanwhile, this advanced git talk from this year's PyCon may be helpful to get a better understanding of how git actually works and introduce some more advanced concepts.
The PyCon videos are all in the PyCon 2015 channel on youtube. I also recommend Jacob Kaplan-Moss's keynote to anyone unsure about whether they have the skills to be a programmer.
I am looking forward to being involved with these students over the summer. Hopefully, they are about to start doing some great things with free software.
Today started with news of a big SNAFU by a Malaysian telco. Part 2 of this excellent history of internet security explains how this giant hole in the internet has been there for a long time and been the cause of some spectacular breakages in the past.
As the article explains, it is really a consequence of the design philosophy of the internet. The aim was to create a robust network that could self-heal. The focus was on creating a system that would allow communication after a catastrophic event such as a nuclear war.
The Border Gateway Protocol controls routing of packets through the internet. If a router says it offers the best route to a particular host, the BGP is OK with that. BGP started life as some scribblings on three napkins over lunch. Now it is fundamental to the working of the entire internet.
Now, if you think about the scenario for which it was designed, back in the 1980's this is not such a bad feature: a nuclear war has destroyed lots of infrastructure, so if a node in the internet says, "Hey, I can help you out", why wouldn't you give it a go?
The problem is that most of the internet traffic is unencrypted, so if a malicious party pretends it is the best route to google.com it gets all the traffic, in clear. Not so good. And it stops packets from going where they are intended at the same time.
The first part of the history of internet security explains how the design in effect delegated responsibility to the end points on the internet. In short, every person connected to the internet is responsible for keeping it secure. Now this worked pretty well in the days when you needed an expensive computer to connect and the people running that computer were probably also doing some of the coding that keeps everything running.
But now the whole world is online, the trust model is somewhat broken. In fact, it is remarkable how well the whole thing does work. We should not lose sight of this: the system works pretty well at times. In fact, it works wonderously well for data that you do not mind sharing with everyone. If you have secrets, then it starts to get more problematic.
We hear stories such as this cyber espionage nightmare. Let's assume the account is accurate; it should be noted that this whole area of computer security is full of smoke, mirrors and snake oil salespeople, which further complicates assessing the real risk out there.
At this point, it is quite clear that most corporations and even governments are fundamentally incapable of protecting data.
Further, much sensitive data has already leaked out. Many organisations are unaware how much data has been leaked. The OPM apparently discovered their breach when a marketing team was giving a demo of their intrusion detection software.
Like the cold war that gave rise to the internet itself, there is an arms race going on. Unlike the cold war, though, once you have lost your secrets they are gone for good.
Now in the short term, the situation is not good. Many people's secrets will be leaking out to a wider audience.
Much of the damage is mitigated by the fact that most internet users are not malicious. For example, when Sony had its email archive leaked and posted online, most people, conscious of how they would feel about others trawling through their inboxes, left it well alone.
But if the party that acquires the secrets has malicious intent, then damage will be done. The cyber espionage nightmare article talks about economic espionage and the stealing of corporate secrets.
Holes in the internet will undoubtedly be plugged. Widespread use of encryption would seem to be desirable. However, at some point people need to read the actual data; it will be decrypted, and so you have to ensure that only happens on secure systems.
Such things are pretty hard to come by. Securing computers against the most determined attackers is extremely difficult with today's tools.
If your secrets are not too valuable to others then you may be OK. Further, if you are a bit more secure than other equally interesting targets you may be OK.
One solution that will always work is not to store secrets on your computers. How much of your data is really secret? Might you be better off sharing your precious intellectual property with everyone?
It turns out large numbers of people are doing just that. With free, open source software, open data and open scientific research. Rather than burdening yourself with having to keep all this information secret, work in the open. Collaborate with others, together build a better mousetrap, a better society for everyone.
The renaissance was very much driven by the sharing of knowledge generated by the invention of the printing press. The internet takes this to a whole new level. Humanity is sharing ideas like never before. New inventions come when people put together ideas from different fields.
The challenge today is not keeping your own ideas secret, rather it is keeping pace with new developments driven by the open sharing on the internet.
The internet was designed for sharing information. If you are using it to keep secrets, it probably won't work out so well for you.
The company in the cyber espionage nightmare may well have been better off sharing its knowledge and focussing on being the best in its field, benefitting from contributions from others and from not having to waste valuable resources protecting secrets it cannot hope to keep.
If you cannot compete with others who have to ship their products half way round the world to beat you then maybe you are not good enough at your business.
I am going to be doing an introduction to free software and linux for this year's Technology Leadership Forum students.
The plan is to have the students use raspberry pi's to learn about the linux platform.
I have a bunch of the new raspberry pi 2's and have been experimenting with different linux distributions on these pi's.
I was going to write up my experiences, but Swapnil Bhartiya has kindly blogged about his own experiences with Arch, Raspbian and Ubuntu Snappy Core. His conclusions were similar to my own.
I have been using Arch Linux a little of late and like many things about Arch and The Arch Way.
First of all, the Arch wiki has excellent documentation. This is critical to its success, since it does many things a little differently to the larger linux distributions.
An initial Arch install will not install much beyond the bare essentials to get you up and running. This does mean it can take a little while to get a new system just how you want it, but has the advantage you do not end up with hundreds of packages installed which you have little idea what they do. For the security conscious this is a definite plus.
One feature I love is that there are no dev packages. Anyone who has tried using any of the main linux distributions and is in the habit of compiling code on those systems will have run into the situation where code fails to build due to missing C header files.
In the major distributions these header files are in separate dev packages. The philosophy is that most people are not compiling code on these machines so do not need the header files. This choice is fine until a new user decides to try compiling some code and then is hit by the missing header file issue. Just another obstacle put in the way of potential new developers.
In contrast, Arch argues that these header files are generally tiny and including them in the main package adds little overhead and saves a lot of time for anyone doing development. It would be good if more distros made this switch.
Using Arch will present some challenges to a new user, but given the excellent state of the documentation it is also an excellent way to gain a thorough understanding of how everything works.
Raspbian appears to be the most widely used distribution on the raspberry pi. It is based on Debian and has over 35,000 packages available.
Since I have mostly used Debian based distributions this seems a good place to start.
Trying different distros can get a little time consuming, between downloading images and copying them onto SD cards. Further, different images are needed for the older pi's and the pi 2.
The simplest way to get a Raspbian system up and running is to download the Raspbian image from raspberrypi.org.
To install, just copy it to the SD card device using dd:
dd if=raspbian_image_you_downloaded.img of=/dev/SDCARD
Finding the device for your SD card can be tricky. lsblk shows you all the block devices and with luck your SD card will be there.
On my Ubuntu system it is /dev/mmcblk0, on an Arch machine it showed up as /dev/sdb. You can usually figure things out using the SIZE of the device.
Note also that you want the block with TYPE disk, not any of the partitions it might have.
Another approach is to use this Raspbian installer. One clear advantage is that it is a small download, a mere 11MB and a small copy onto your SD card.
You then just plug it into the pi, turn on the pi with a wired network connection and the install happens by magic. It takes 20 minutes or so, with a reasonable internet connection.
Other advantages include:
The down side is that the base install comes without a GUI environment. Depending on what you intend to do with the pi this may not be a problem.
I am going to experiment with customising the installer to see if I can get it to install lxde, emacs, git and some other goodies I like to have around on my systems.
I cloned the git repository for the raspbian net installer:
git clone git@github.com:debian-pi/raspbian-ua-netinst.git
I then added an installer-config.txt to specify some extra packages to install.
I then followed the instructions in BUILD.md to rebuild the image and installed from there.
This did not go as well as I had hoped, since although the extra packages got installed their dependencies did not, at least that is what I think happened.
It also took me a couple of goes; installer-config.txt needs to be given execute permission, e.g.:
chmod 755 installer-config.txt
to make this work.
I decided to try using post-install.txt instead. By the time this runs the install is pretty much complete.
My first attempt with this was just to add the following apt-get call:
apt-get install -y git emacs aptitude xserver-xorg-video-fbdev lxde curl htop nmap
But this failed to do anything. Time to track down the actual install script and see what is going on.
The actual install script is in scripts/etc/init.d/rcS
Now I understand. The installer basically boots a minimal linux kernel and then uses a single init script, run during the boot to do the install.
Reading that script it becomes clear what I need to do. The new operating system is actually mounted on /rootfs and I need to use the chroot command to make sure apt-get runs with that as the root filesystem. So I ended up with post-install.txt looking like this:
#!/bin/bash
# install some extra goodies. Do it here rather than in packages to
# pull in dependencies
chroot /rootfs /usr/bin/apt-get install -y \
    git emacs aptitude xserver-xorg-video-fbdev lxde curl htop nmap
Bingo! It works, modulo having to hit enter three times to say OK to some dialogs that the lxde desktop environment displays.
One other trick I used in all this was mounting the SD card's first partition after copying the installer onto it. This allows me to copy over a new installer-config.txt or post-install.txt without having to rebuild the full image.
Overall, I am liking this installer.
The PyData, Dallas videos are now up on youtube under PyData TV.
Lots of interesting stuff to view there. So far, I have watched Luis Miguel Sanchez talk about modelling insurance linked securities (ILS) all in Jupyter notebooks. Shows how things that used to take days and weeks can now be done in a matter of hours.
I am also enjoying Peter Wang talking about the state of the Py.
Gustave Dore Ancient Mariner Illustration, licensed under Public Domain via Wikimedia Commons.
I started looking at this data after being inspired by Chris Waigl's PyCon talk on Satellite mapping for everyone. Chris gave an introduction to some of the python tools you can use to look at this data.
She mentioned some websites where you can find data from various satellite missions and showed how to work with the data and just what sort of data is available.
There really is a wonderful array of data available. The process of exploring all this can be time consuming, but fun. There is a lot of fascinating work being done.
I was hoping to find images in and around the time of the hurricanes. With high enough resolution I believe it should be possible to use image processing software to help with damage surveys. For example, it should be possible to spot blue tarpaulins placed over damaged roofs, or indeed the roof damage itself.
I picked Landsat, pretty much based on Chris's talk. I found two days, either side of the hurricanes that hit Bermuda last October, for which images were available.
So far so good. But to get the higher resolution data you have to register and obtain a key to gain access to the downloads. These are large, around 1GB per image, so it is reasonable for whichever agency is supplying the images to know a little about those downloading the data, and in particular to be able to contact them if they are putting an unreasonable load on the servers.
Now for Bermuda we only really need about 1% of the data which covers the few square miles around the island. This is much more manageable, so it feels it would be good to be able to host the data locally. This is something I expect I will return to.
There is still much to do here. Chris noted that you will learn coordinate systems. Up to now I have managed to avoid this, just slicing the numpy arrays that rasterio gives me as I read the images.
I need to learn how to pull out a window from one of these images by specifying the lat/lon box defining the area to extract. Better still, I could do with a number of pre-defined boxes that pull out interesting areas of Bermuda, for example each Parish.
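A sketch of how that windowing might work with the current rasterio API is below (the exploratory code later on just slices the numpy arrays instead). The bounding box for Bermuda is approximate and purely illustrative.

import rasterio
from rasterio.crs import CRS
from rasterio.warp import transform as warp_coords
from rasterio.windows import from_bounds

def read_latlon_box(path, lon_min, lat_min, lon_max, lat_max, band=1):
    """ Read only the pixels covering a lon/lat box from a GeoTIFF. """
    with rasterio.open(path) as src:
        # Landsat scenes use a projected CRS (UTM), so convert the lon/lat
        # corners into the dataset's own coordinates first
        xs, ys = warp_coords(CRS.from_epsg(4326), src.crs,
                             [lon_min, lon_max], [lat_min, lat_max])
        window = from_bounds(min(xs), min(ys), max(xs), max(ys),
                             transform=src.transform)
        return src.read(band, window=window)

# a rough bounding box around Bermuda (west, south, east, north)
bermuda = read_latlon_box('../data/LC80060382014275LGN00_B2.TIF',
                          -64.95, 32.20, -64.60, 32.45)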
Most satellite data is collected by expensive national and international space missions. Many of the missions are aimed at creating a global resource, but generally focussed on the nations that funded the missions.
The Global Precipitation Measurement mission has a number of data sets available. These are typically at 0.1 degree resolution, which corresponds to about 7 miles on the ground. The temporal resolution is good: a new image is available at 30 minute intervals. Further, the project aims to:
intercalibrate, merge, and interpolate "all" satellite microwave precipitation estimates, together with microwave-calibrated infrared (IR) satellite estimates, precipitation gauge analyses, and potentially other precipitation estimators at fine time and space scales for the TRMM and GPM eras over the entire globe.
For Bermuda, the spatial resolution is not quite enough to do a detailed analysis, but it is very useful to understand the severity of storms hitting the island.
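A quick back-of-the-envelope check on that claim:

# one degree of latitude is roughly 111 km, so a 0.1 degree grid cell is
# about 11 km (roughly 7 miles) across -- wider than the island in most places
km_per_degree_lat = 111.32
print(0.1 * km_per_degree_lat)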
Another good source of data is weather station data. It is possible to build a DIY weather station for $200-300. The project in the link had some special constraints: it was intended to create a weather station that would show the conditions at a lake a 2 hour drive from the person that built it, so it needed to be robust against system glitches.
There are a number of weather stations here in Bermuda that are connected to the Weather Underground network of stations.
Another project is Open Weather Map. This provides an API that you can use to connect your weather station to the network.
For risk modelling purposes we ideally need historical data for the times when the larger storms have hit the island. The sites mentioned above are primarily focussed on weather forecasting, rather than collecting data for subsequent analysis, although they do also do this.
Unfortunately, access to historical data is limited without a paid subscription. The sites have costs to cover, so small charges for access to data is one way to continue to provide the service.
Full access to Open Weather Map historical data costs $2000 per month. If we want to create an environment where interested parties can explore their ideas then removing these cost barriers is an important step to take.
If Bermuda had a network of 100-200 weather stations it would open lots of powerful modelling opportunities. For example, machine learning could be used to try to tease out the relationship between the winds recorded at each station and parameters such as height, distance from the coast, local topography, land use and whatever other parameters are available.
If such a model can be fitted to the data it can then be used to estimate windspeed for any point on the island. In this way we can create a detailed windfield model for Bermuda.
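A minimal sketch of that kind of model with scikit-learn is below. The stations.csv file and every column name in it are hypothetical, just to show the shape of the approach.

import pandas
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

stations = pandas.read_csv('stations.csv')    # one row per weather station
features = ['height_m', 'distance_from_coast_m', 'slope', 'land_use_code']
X = stations[features]
y = stations['peak_gust_mph']                 # observed peak gust in one storm

model = RandomForestRegressor(n_estimators=200, random_state=0)
print(cross_val_score(model, X, y, cv=5))     # does it generalise?
model.fit(X, y)

# the fitted model can now estimate the peak gust anywhere on the island
# where the same features can be computed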
Furthermore, most of the tools needed to do this are already available as open source projects. The pieces just need to be glued together.
Finally, any work done here in Bermuda can easily be generalised and applied to other similar jurisdictions.
If we also have a detailed, post event, damage survey we can also use the same machine learning techniques to develop damage models relating the hazard at each location to the damage it creates.
There are good open source tools, such as scikit-learn and scikit-image that can be used for this modelling.
Open Street Map is an open mapping project that has been running for many years:
OpenStreetMap is built by a community of mappers that contribute and maintain data about roads, trails, cafes, railway stations, and much more, all over the world
The project has a number of related projects centred around the core mapping project.
The Humanitarian OpenStreetMap Team works on mapping damage to help with relief work following natural disasters, such as the recent Nepal earthquakes.
Whilst this work is focussed on disaster relief in the immediate aftermath of a disaster it is producing valuable data which can be used to better understand damage. It could be a key input into new models that can be used to explore mitigation measures for future events.
With the world facing unprecedented challenges, such as climate change and increased earthquake risk, due both to human activities such as fracking and to the melting ice caps changing the stresses on tectonic plates, there is a humanitarian need to be able to model and explore the potential impacts on delicate eco-systems such as small island communities.
At PyCon in Montreal (https://us.pycon.org/2015/) Chris Waigl gave a talk about satellite mapping and some of the python tools that help with this.
Following the talk I decided to take a look to see what satellite data is available around the time of the hurricanes Fay and Gonzalo, back in October 2014.
The hope was to be able to find suitable before and after images at a high enough resolution to use image processing software to help with damage analysis.
Chris's talk is on youtube (along with all the other PyCon talks) and embedded below.
from IPython import display
# Chris Waigl, Satellite mapping for everyone.
display.YouTubeVideo('MCHpt1FvblI')
A little googling turned up this gem from NASA's Tropical Rainfall Measuring Mission.
Image Credit: NASA/SSAI, Hal Pierce
This is a seven day animation, covering the period of Fay and Gonzalo.
Assuming rainfall is a good proxy for storm intensity, you can see how Fay intensified as it reached the island and how Gonzalo followed a very similar path, just six days later.
The key question with respect to Bermuda is whether this sort of data is available at higher resolution.
The article does mention that
Global Precipitation Measurement (GPM) mission product in late 2014 will supersede the TRMM project.
The Nasa GPM page has some wonderful animations of the sort of thing that is possible with GPM.
# 3-D animation of a typhoon from the GPM project
display.YouTubeVideo('kDlTZxejlbI')
A major challenge with satellite data is finding just what images are available.
Landsat has a well documented site created by the USGS
However it is still time consuming to see what is available.
Downloads can be large, roughly 1GB per satellite image. These images generally contain multiple layers for different parts of the spectrum.
To download the larger files you need to register and get an API key.
Once registered I downloaded a couple of images, either side of the October storms.
Below are my attempts to extract and plot the data.
# let's start with matplotlib
%matplotlib inline
from matplotlib import pyplot
# Chris recommended the rasterio library
import rasterio
infile = '../data/LC80060382014275LGN00_B2.TIF'
# This is pretty simple, just open the TIFF file and you have
# an object that can tell you all sorts of things about the image
data = rasterio.open(infile)
data.width, data.height
# take a look at the meta data
data.meta
# read the bands in the file, there will be as many bands as
# the count above
bands = data.read()
# take a look at the data -- numpy arrays with 16 bit values
bands
# so we have a 3D array, first dimension is the band
bands[0].shape
img = bands[0]
# just take every 10th pixel for now -- imshow does not handle
# large images well.
img = img[::10, ::10]
img.shape
# now plot the thing.
pyplot.imshow(img)
So we have succeeded in downloading and plotting one of these bands.
Now time to play spot Bermuda. First impressions are this particular data is likely not high enough resolution to be useful.
A second thing to note is that the NASA sites are, understandably, quite US-centric. To do comprehensive studies of satellite data for Bermuda it looks like it will be worthwhile to create local mirrors of the key data.
In particular, whilst some of these images are quite large, the part covering Bermuda will generally be much more manageable.
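One way to do that, sketched below with the current rasterio API, is to crop each downloaded scene to a window around Bermuda and write the result out as a much smaller GeoTIFF. The window offsets and output path are illustrative.

import rasterio
from rasterio.windows import Window

window = Window(col_off=4000, row_off=1300, width=1200, height=1000)

with rasterio.open('../data/LC80060382014275LGN00_B2.TIF') as src:
    cropped = src.read(1, window=window)
    profile = src.profile
    profile.update(width=window.width, height=window.height,
                   transform=src.window_transform(window))

with rasterio.open('../data/bermuda_B2.TIF', 'w', **profile) as dst:
    dst.write(cropped, 1)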
# put it all together
def plot_image(infile, box=None, axes=None):
    """ Plot the first band of an image, cropped to the given box. """
    if axes is None:
        # no axes passed in, so create a figure with a single set of axes
        fig, axes = pyplot.subplots(figsize=(8, 8))
    if box is None:
        box = 1000, 2200, 3700, 5500
    top, bottom, left, right = box
    data = rasterio.open(infile)
    bands = data.read()
    img = bands[0]
    axes.imshow(img[top:bottom, left:right])
# plotting images either side of the hurricane
fig, axes = pyplot.subplots(1, 2, figsize=(8,8))
#pyplot.subplot(1,2,1)
x = 3
top = 1300
left = 4000
width = 1200
height = 1000
box = (top, top + height, left, left + width)
infile = '../data/LC80060382014275LGN00_B%d.TIF' % x
plot_image(infile, box=box, axes=axes[0])
#pyplot.subplot(1,2,2)
infile = '../data/LC80060382014307LGN00_B%d.TIF' % x
plot_image(infile, box=box, axes=axes[1])