Hi Chris,
Last time I asked, you mentioned that you were in the process of reconsidering the binary format in which MWorks data is saved.
A fellow affiliated with Neurodata Without Borders suggested HDF5 as a very versatile format. I hear that is what Open Ephys uses as well.
Are you considering that, or something else?
It would be good to know, as we would like to standardize the data formats in our lab.
Best
Najib
Hi Najib,
Yes, HDF5 is an obvious, widely-used option, and it was, in fact, the first format I considered. However, after investigating it some, my feeling is that it probably isn’t the best choice. There are two main reasons for this:
- The HDF5 data model is just bafflingly complex. Even the Introduction to HDF5 sets my head spinning. While I’m sure that, with some effort, I could come to terms with it, it certainly doesn’t inspire confidence.
- The format is not a very good fit for MWorks data. As far as I understand, HDF5 is primarily designed for storing large amounts of highly uniform data (e.g. large arrays of fixed-size integer or floating-point values), along with some affiliated, less structured metadata. However, MWorks events are basically all loosely structured data (very similar to JSON objects). While the event times and codes would fit neatly into uniform arrays in an HDF5 file, the associated event data would all have to be tacked on as metadata of some sort (see the sketch below). At the very least, this would make for a curiously lopsided HDF5 file. Also, there may be performance implications for storing large amounts of data this way (though I haven’t investigated that).
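As a rough sketch of what I mean (purely illustrative, with made-up events, written with h5py), the “lopsided” layout might look something like this: the times and codes become tidy uniform datasets, while each loosely structured payload has to be serialized into an opaque variable-length string.

```python
import json

import h5py
import numpy as np

# Hypothetical MWorks-like events: (time, code, loosely structured data).
events = [
    (1000, 3, {"eye_x": 0.12, "eye_y": -0.04}),
    (1001, 7, "trial_start"),
    (1002, 3, {"eye_x": 0.15, "eye_y": -0.02}),
]

with h5py.File("events.h5", "w") as f:
    # Times and codes fit neatly into uniform, fixed-size datasets...
    f.create_dataset("time", data=np.array([e[0] for e in events], dtype=np.int64))
    f.create_dataset("code", data=np.array([e[1] for e in events], dtype=np.int32))
    # ...but the loosely structured payloads do not.  One workaround is to
    # serialize each payload (here as JSON text) into a variable-length
    # string dataset, which HDF5 treats as essentially opaque metadata.
    f.create_dataset(
        "data",
        data=[json.dumps(e[2]) for e in events],
        dtype=h5py.string_dtype(),
    )
```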
In addition, this article on a lab’s reasons for abandoning HDF5 details some other issues with the format and is worth reading.
Having considered and rejected HDF5, my current idea is for each MWorks event file to be an SQLite database, with event data encoded via MessagePack (which we already use when sending events between client and server). SQLite is widely used and well supported, and Python includes a standard module for reading and writing SQLite databases. MessagePack has a straightforward specification, and implementations of it are available for many programming languages, including Python and MATLAB.
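As a rough sketch (the actual schema is still to be determined), reading such a file from Python might look something like this, assuming a hypothetical single table, events(code, time, data), with the payload stored in the data column as a MessagePack-encoded blob:

```python
import sqlite3

import msgpack  # pip install msgpack


def read_events(filename):
    """Yield (code, time, data) tuples from a hypothetical SQLite event file."""
    conn = sqlite3.connect(filename)
    try:
        # Assumed schema: one table, events(code, time, data), with the event
        # payload stored as a MessagePack-encoded BLOB in the data column.
        for code, time, blob in conn.execute("SELECT code, time, data FROM events"):
            yield code, time, msgpack.unpackb(blob, raw=False)
    finally:
        conn.close()


for code, time, data in read_events("example_events.db"):
    print(code, time, data)
```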
However, before settling on this (or any other) format, I’ll need to do some testing. Specifically, I need to see how my proposed format compares to the current MWK format in terms of both file size and read-access speed. I’m hopeful that SQLite+MessagePack will perform well on both counts, but we’ll have to see.
Cheers,
Chris
Chris,
I’m the fellow Najib mentioned who is interested in building a converter from MWorks to Neurodata Without Borders, which uses an HDF5 back-end. It is clear that you have given this topic considerable thought, and I appreciate the concerns you bring up in your post. Let me just comment on them briefly.
- Yes, HDF5 is very complex. It is a very versatile format, with many tools that can be leveraged for efficient and expressive data storage, which naturally results in more complexity. I think you’ll find, however, that the Python (h5py) and MATLAB tools for working with HDF5 files are much easier to learn and use.
- I have done some poking around in MWorks data, and I see what you mean: the data associated with an event can be a freeform dictionary or XML, which is very different from the philosophy of HDF5. However, there are some data types, e.g. eye position, that I think would be more natural in HDF5 than in their current form (see the sketch below).
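For example, here is a rough h5py sketch (purely illustrative, with placeholder values; this is not NWB’s actual layout) of how a regularly sampled signal like eye position maps naturally onto HDF5: one uniform N-by-2 float dataset plus a matching vector of timestamps.

```python
import h5py
import numpy as np

# Placeholder eye-position samples and timestamps (in microseconds); in
# practice these would be pulled out of the MWorks event stream.
n_samples = 1000
eye_xy = np.zeros((n_samples, 2), dtype=np.float32)
timestamps = np.arange(n_samples, dtype=np.int64)

with h5py.File("eye_position.h5", "w") as f:
    data = f.create_dataset("eye_position/data", data=eye_xy)
    data.attrs["units"] = "degrees"
    f.create_dataset("eye_position/timestamps", data=timestamps)
```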
But my purpose here is not to convince you that MWorks should adopt NWB or HDF5 as its core back-end. The goal of NWB is to provide a common data structure across neurophysiology labs, and I am working now on stress-testing it with different types of experimental data from different acquisition systems. Tony Movshon’s lab has asked me to look into interoperability between MWorks and NWB and to build a converter from MWorks to NWB. This would allow a common set of analysis and visualization scripts to run on data collected by MWorks and by other acquisition systems. I am excited to try it out and see what obstacles arise, especially given the differences in format philosophies you mentioned. I also think that building a highway from the MWorks data format to NWB could be an asset both for the NWB community and for all groups using MWorks.
I’ve started to build these tools, but I have come up against a few practical questions regarding the interpretation of data in the MWorks system. Could you provide the contact info of someone I could ask about these types of questions? Or should I just post them on the support forum?
Thanks,
Ben
Hi Ben,
Could you provide the contact info of someone I could ask about these types of questions?
There’s no one to ask other than me. If you want to start a new discussion on the support site, that would be ideal.
Cheers,
Chris