Working with large datasets

A demo says more than a thousand words:

http://graphdemo.inqbus.de/diagramview/PBL_20160101.h5
Login: admin:password

There you can navigate a large dataset:
Ceilometer data of planetary boundary layer heights between 1 January 2016 and 1 July 2016 at 15-second resolution, with potentially three data points per timestamp.
There are about 1 million timestamps, so in total the data basis is roughly 3 million data points, residing in an HDF5 file on our VM server.
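
The size estimate above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it reads the dates as 1 January and 1 July 2016, and the variable names are ours, not part of DSS.

```python
from datetime import datetime

# Date range and sampling interval as stated in the text.
start = datetime(2016, 1, 1)
end = datetime(2016, 7, 1)
resolution_s = 15          # one timestamp every 15 seconds
max_points_per_stamp = 3   # "potentially three data points per timestamp"

# Number of 15-second steps in the range, and the upper bound on data points.
timestamps = int((end - start).total_seconds() // resolution_s)
max_points = timestamps * max_points_per_stamp

print(f"{timestamps:,} timestamps, up to {max_points:,} data points")
```

This prints 1,048,320 timestamps and up to 3,144,960 data points, which matches the rounded figures of 1 million and 3 million.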

  • The backend machinery is not optimized yet. Each AJAX request hitting the VM reopens the HDF5 file; there is no caching yet.
  • The VM runs together with 5 others on a 7-year-old host.
  • We have two WSGI threads running under gunicorn.

To give you a perspective:

  • If you load one day of this data using the usual Bokeh methods, you will have to wait 20 seconds for the initial load. Afterwards you can manipulate the data quite quickly.
  • If you load one month of this data using the usual Bokeh methods, you will blow up your browser (out of memory).

What does DSS do? Basically three things:
1) Hooking intelligently into Bokeh events. DSS reacts to changes of the plot's axes, but in a filtered way: if DSS sees a number of consecutive events within a small time window, it reacts only to the last one.
2) Transferring metadata to the server and receiving new data from the server for the Bokeh data sources. For this task we invented a new protocol based on HTML5 binary transport. It sends multidimensional complex data as a single chunk of binary data that is de-marshalled on the client side into typed JS arrays, which go straight into Bokeh. That is FAST transfer.
3) Intelligently scaling and filtering the data. DSS shows you only what you are capable of seeing.
Example:
Your diagram has a visible resolution of 600 pixels along the X-axis, and you have 600,000 data points in the chosen X-interval. So you can only plot a reasonable number of about 600 data points without cluttering your display. In this case DSS filters the data by regridding it with an averaging operator.
In the example, chunks of 1000 data points are averaged in X and in Y to form one new regridded data point. Error bars can also be obtained in this fashion, but are not shown in the demo.
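
The regridding in the example can be sketched in plain Python. This is a minimal illustration under our own assumptions, not the DSS code: the function name `regrid_mean` is hypothetical, the chunk size is chosen by ceiling division, and the error bar is taken to be the standard deviation of Y within each chunk.

```python
import math
from statistics import fmean

def regrid_mean(xs, ys, n_pixels):
    """Average consecutive chunks of (x, y) so that at most n_pixels points remain.

    Returns three lists: mean x, mean y and the standard deviation of y per chunk
    (the standard deviation as error bar is an assumption of this sketch).
    """
    if n_pixels <= 0:
        raise ValueError("n_pixels must be positive")
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    if n <= n_pixels:
        # Fewer points than pixels: nothing to reduce.
        return list(xs), list(ys), [0.0] * n
    chunk = -(-n // n_pixels)  # ceiling division, so we never exceed n_pixels chunks
    out_x, out_y, out_err = [], [], []
    for i in range(0, n, chunk):
        cx = xs[i:i + chunk]
        cy = ys[i:i + chunk]
        mean_y = fmean(cy)
        out_x.append(fmean(cx))
        out_y.append(mean_y)
        out_err.append(math.sqrt(fmean((v - mean_y) ** 2 for v in cy)))
    return out_x, out_y, out_err

# The case from the text: 600,000 points shown on 600 pixels (synthetic data).
xs = list(range(600_000))
ys = [float(i % 1000) for i in xs]
rx, ry, rerr = regrid_mean(xs, ys, 600)
print(len(rx), "regridded points from", len(xs), "original points")
```

With 600,000 points and 600 pixels the chunk size comes out as 1000, as in the example. A real backend would probably use array operations for this rather than Python lists.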

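Point 2 above describes sending several data columns as one chunk of binary data. The following sketch shows the general idea from the server side. The byte layout (a length prefix, a small JSON header, then raw little-endian float64 values) and the function names are invented for illustration; they are not the DSS protocol.

```python
import json
import struct

def pack_columns(columns):
    """Pack {name: list of floats} into a single bytes object.

    Hypothetical layout: 4-byte little-endian header length, a JSON header
    listing column names and lengths, then the float64 values of each column.
    """
    header = json.dumps([[name, len(values)] for name, values in columns.items()])
    header = header.encode("utf-8")
    # Pad the header so the payload starts at a multiple of 8 bytes. A JS
    # Float64Array view on an ArrayBuffer needs an 8-byte-aligned offset.
    header += b" " * ((-(4 + len(header))) % 8)
    parts = [struct.pack("<I", len(header)), header]
    for values in columns.values():
        parts.append(struct.pack(f"<{len(values)}d", *values))
    return b"".join(parts)

def unpack_columns(blob):
    """Inverse of pack_columns; stands in for the client-side de-marshalling."""
    (header_len,) = struct.unpack_from("<I", blob, 0)
    header = json.loads(blob[4:4 + header_len].decode("utf-8"))
    offset = 4 + header_len
    columns = {}
    for name, count in header:
        columns[name] = list(struct.unpack_from(f"<{count}d", blob, offset))
        offset += 8 * count
    return columns

# Made-up sample values, only to show the round trip.
columns = {"x": [0.0, 15.0, 30.0], "y": [812.5, 790.0, 805.25]}
blob = pack_columns(columns)
print(len(blob), "bytes;", "round trip ok:", unpack_columns(blob) == columns)
```

In the browser, the same blob could be read by creating typed array views directly on the received buffer, so the values never pass through a text format.
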
DSS works acceptably. It has some flaws that we will address in the coming weeks:
1) We have not taken care of boundary effects yet. When zooming in, the curve loses its connection to the data outside the viewport.
2) IE does not work. Firefox and Chrome are working.
3) Problems with Bokeh. After some zooming and panning, the plotting area shifts down and to the right, leaving a growing gray streak at the top left. We have no clue yet where this comes from.
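
For the boundary effect in point 1, one common remedy is to return one extra data point on each side of the visible interval, so the line can still be drawn to the edge of the plot. The sketch below only illustrates that idea; it is an assumption on our part, not necessarily how DSS will solve it, and `visible_slice` is a hypothetical helper.

```python
from bisect import bisect_left, bisect_right

def visible_slice(xs, x_min, x_max):
    """Return (start, stop) indices covering [x_min, x_max] plus one neighbour per side.

    xs must be sorted in ascending order.
    """
    start = max(bisect_left(xs, x_min) - 1, 0)
    stop = min(bisect_right(xs, x_max) + 1, len(xs))
    return start, stop

xs = [0, 15, 30, 45, 60, 75, 90]
start, stop = visible_slice(xs, 20, 50)
print(xs[start:stop])  # [15, 30, 45, 60]
```

The points at 15 and 60 lie outside the requested interval of 20 to 50, but keeping them lets the curve run through the viewport border instead of ending at the first and last visible point.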

Once the code has reached some maturity, we will release it as an open source extension to Bokeh. But there is still a long way to go.