Stimulus hashing

Hi Chris,

I’m following up on a previous conversation we had about using large sets of images, specifically with more than 8,000 images. I have a few questions I’d appreciate your help with, and I’m copying Ani, Guy and Jim since we’re verifying our analysis code on Guy’s experiment involving a large image set.

  1. Hashing Policy
    Could you provide more details about MWorks’ hashing policy for the stimulus set? Does MWorks use pixel-hashing, or does it generate unique hashes from filenames? Given the potential for redundant images in large stimulus sets, is there anything users should be cautious about?

  2. Accessing MWorks’ Hash Code
    I’m developing a neural experiment pipeline and would like to ensure my code handles stimulus metadata accurately. Currently, I log the stimulus ID based on the MWorks experiment code, like the “stimulus_presented” variable in your RSVP example. Is there a way to access the hash table MWorks uses? Would doing so improve the reliability of managing stimulus metadata (compared to relying solely on the stimulus ID variable)? For example, if a user mapped their stimulus ID incorrectly in MWorks, could I still recover the exact filenames associated with the hash codes used in each RSVP trial (fixation)?

  3. Handling Large Lists of Images
    I recall that Chong ran an experiment with a large image set. He chose to “vectorize” all repeated stimuli instead of repeating unique images—presenting 16,000 images once instead of showing 8,000 unique images twice. Is there a difference in maintaining image order between these two approaches?

  4. Lastly, I recently conducted an experiment involving 8,400 images presented twice. The parsed stimulus events and neural responses appear consistent, but I’d appreciate it if you could take a look at the MWorks code to ensure it is handling image ordering properly. Here is the Dropbox link (2 GB) containing the code and image set.

Thanks again for your time and help!

Yoon

Hi Yoon,

Could you provide more details about MWorks’ hashing policy for the stimulus set? Does MWorks use pixel-hashing, or does it generate unique hashes from filenames? Given the potential for redundant images in large stimulus sets, is there anything users should be cautious about?

For details, see this article. There’s also a bit more info in this discussion.

In short:

  • Only image file stimuli have hashes recorded.
  • The hash is computed from the raw bytes of the file. This means that the filename and file access/modification times do not factor into the hash value, but anything in the file data itself (including any embedded metadata) does.

The only thing I’d be cautious about is changes to metadata (e.g. Exif tags), since those will cause otherwise-identical image files to produce different hash values.

Currently, I log the stimulus ID based on the MWorks experiment code, like the “stimulus_presented” variable in your RSVP example. Is there a way to access the hash table MWorks uses? Would doing so improve the reliability of managing stimulus metadata (compared to relying solely on the stimulus ID variable)?

My RSVP demo doesn’t include a “stimulus_presented” variable. Are you just storing the current index into a stimulus group in “stimulus_presented”? If so, then I’d say that keeping track of the filename and file hash would be a much more robust way of establishing image-file identity. Those values aren’t directly available to experiment code, but they’re recorded in the event file, as described in the previously cited docs.

For example, if a user mapped their stimulus ID incorrectly in MWorks, could I still recover the exact filenames associated with the hash codes used in each RSVP trial (fixation)?

Yes, that’s the idea. As long as you have the file hashes from your event file, and you have access to the image files used in your experiment (so that you can compute each file’s hash), you can robustly associate each image stimulus presentation with the image file presented. You can also compare images used across multiple experiments, even when the experiments define the “stimulus_presented” variable differently (or don’t define it at all).

I recall that Chong ran an experiment with a large image set. He chose to “vectorize” all repeated stimuli instead of repeating unique images—presenting 16,000 images once instead of showing 8,000 unique images twice. Is there a difference in maintaining image order between these two approaches?

That was a poor experimental design choice, and as I recall, it caused problems for Sarah and Jon. If the goal is to present each image twice without insisting that every image is presented once before any image is repeated, then MWorks can do that without the experiment defining two different image stimuli for each image file. Please see this discussion for details.

I recently conducted an experiment involving 8,400 images presented twice. The parsed stimulus events and neural responses appear consistent, but I’d appreciate it if you could take a look at the MWorks code to ensure the code is handling image-ordering properly.

I’m happy to take a look, but I’m not sure how you’re defining “properly”. Can you be more specific about what you’d like me to check?

Thanks,
Chris

Hi Yoon,

I just noticed that the experiment files you shared via Dropbox are the same ones you asked me to review previously. Let me take a look and get back to you.

Cheers,
Chris

Hi Yoon,

Your experiment code looks OK to me.

I did find it a little confusing that you’re using a variable named reward_duration to specify a volume in milliliters to dispense via the syringe pump (and, in the actions attached to variable reward, waiting for that numeric amount of time). But I see that the value of reward_duration is 0.15us (aka just 0.15), so if the intent is to dispense 0.15 mL as the reward, I believe everything should work correctly (and you’re not likely to care about the 0.15 microsecond wait when you assign to reward).

As to whether you want the image set repetitions to be independent (as they are now) or interleaved (like Chong did), that’s entirely up to you. Switching to the interleaved approach would require only minor changes to your experiment code, as detailed in this discussion. I can help you make that change, if you want to go that way.

Cheers,
Chris

Hi Chris,

Thank you for the informative feedback, very useful!

I am new to the software, and am trying to understand how to retrieve the logged image hash from the .mwk2 file.

Looking at the variables’ names as parsed from the events in the .mwk2 file, I can’t identify anything that looks like that hash. Could you help me understand how to parse it?

dict_keys(['#allowAltFailover', 'key_r_pressed', '#state_system_mode', 'saccade', '#announceMessage', 'fixation_pos_x', '#stimDisplayUpdate', 'key_p_pressed', '#stimDisplayCapture', 'eye_in_window', '#experimentLoadProgress', 'stim_on_time', '#loadedExperiment', 'inter_trial_interval', '#announceSound', 'stimulus_size', '#announceCalibrator', 'stim_on_delay', '#requestCalibrator', 'mouse_button_pressed', '#announceCurrentState', 'reward', '#announceTrial', 'ignore_time', '#announceBlock', 'mouse_y', '#announceAssertion', 'correct_fixation_list', '#serverName', 'stimuli_shown', '#mainScreenInfo', 'stimulus_set_repetitions', '#warnOnSkippedRefresh', 'stimulus_presented', '#stopOnError', 'stimulus_pos_y', '#realtimeComponents', 'key_c_pressed', 'animal_name', 'correct_fixation', 'project', 'stimulus_pos_x', 'eye_h_raw', 'eye_v_raw', 'pupil_size_r', 'eye_h_calibrated', 'eye_v_calibrated', 'eye_h', 'eye_v', 'fixation_window_size', 'fixation_color_r', 'fixation_color_g', 'fixation_color_b', 'fixation_point_size_min', 'fixation_point_size_max', 'fixation_point_visible', 'fixation_pulse_period', 'fixation_pulse_start_time', 'fixation_pos_y', 'experiment_state_line', 'trial_start_line', 'stim_start_line', 'reward_line', 'mouse_x', 'key_x_pressed', 'key_spacebar_pressed', 'stim_off_time', 'stimulus_set_repeat_count', 'stimuli_per_trial', 'stimulus_presented_list', 'reward_duration', 'num_stims_shown', 'miss_count', 'success', 'failure', 'ignore', 'sync', 'cal_fixation_duration', 'cal_fix_pos_x', 'cal_fix_pos_y', 'RSVP_test_stim_index', '#privateCalibratoreye_calibrator'])

Relatedly, if an end user wishes to compute hashes for arbitrary images (to implement robust comparison with hashes parsed from the events, as you described), what is the recommended way/API for doing that in Python?

Just FYI, our issues with image indexing and correspondence to neural recordings may have been related to RSVP_test_stim_index values running from 1 to stimulus_set_size, rather than from 0 to stimulus_set_size - 1 as they should, in the experiment’s .mwel file (a different file from the one you reviewed and found correct).

Many thanks for your support,

Guy

Hi Guy,

Looking at the variables’ names as parsed from the events in the .mwk2 file, I can’t identify anything that looks like that hash. Could you help me understand how to parse it?

There isn’t a separate variable for the file hashes. The variable you need to look at is #stimDisplayUpdate. The value of this variable is a (possibly empty) list of dictionaries. Each dictionary contains the name and parameters of a stimulus that was presented in the display update associated with the #stimDisplayUpdate event.

For an image stimulus, the dictionary will contain the keys filename and file_hash. The filename value is an absolute path into a temporary cache directory used by MWServer, but the final path component is the filename you want. The file_hash value is simply the hash value I described previously, as a hexadecimal string.

As an example, here’s the value of a #stimDisplayUpdate event (in Python) generated by my RSVP demo experiment, for a display update that included both an image and a fixation point:

[{'filename': '/var/folders/xh/yvhmgx1d4nngtzs7_j45nbc00000gn/T/MWorks/Experiment '
              'Cache/_Users_cstawarz_Documents_Work_McGovern_mworks_mworks_examples_Examples_RSVPDemo_RSVPDemo.mwel/tmp/images/RSVP_images/OSImage_78.png',
  'alpha_multiplier': 1.0,
  'rotation': 0.0,
  'size_x': 5.400000095367432,
  'size_y': 5.400000095367432,
  'pos_y': 5.0,
  'pos_x': 0.0,
  'file_hash': '72614194523ad5b4bdc728cc562dee103c6096ae',
  'name': 'OSImage_78',
  'type': 'image',
  'action': 'draw'},
 {'name': 'fixation_point',
  'pos_x': 0.0,
  'active_when_hidden': False,
  'pos_y': 0.0,
  'type': 'circular_fixation_point',
  'size_x': 0.20000000298023224,
  'alpha_multiplier': 1.0,
  'action': 'draw',
  'center_x': 0.0,
  'size_y': 0.20000000298023224,
  'color_r': 1.0,
  'color_g': 0.0,
  'color_b': 0.0,
  'center_y': 0.0,
  'rotation': 0.0,
  'width': 2.0}]
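
Assuming you’ve already pulled the #stimDisplayUpdate event values out of the .mwk2 file (e.g. with the MWorks Python data tools), a small sketch of the filtering step — the function name is mine:

```python
import os

def image_presentations(stim_display_update_values):
    """Extract (name, filename, file_hash) for every image stimulus drawn.

    Each input item is the value of one #stimDisplayUpdate event: a
    (possibly empty) list of per-stimulus dictionaries, as shown above.
    """
    results = []
    for value in stim_display_update_values:
        for stim in value:
            if stim.get('type') == 'image':
                results.append((stim['name'],
                                os.path.basename(stim['filename']),
                                stim['file_hash']))
    return results
```

Non-image stimuli (like the fixation point above) have no filename or file_hash, so filtering on the type key keeps only the entries you need.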

Related, if an end user wishes to compute hashes for arbitrary images (to implement robust comparison with hashes parsed from the events, as you described), what is the way/API for doing that in python?

Here’s a Python function that will do it:

import hashlib

def file_hash(filename):
    # SHA-1 digest of the file's raw bytes (this is the hash MWorks records)
    with open(filename, 'rb') as fp:
        return hashlib.sha1(fp.read()).hexdigest()

Just FYI, our issues with image indexing and correspondence to neural recordings may have been related to RSVP_test_stim_index values running from 1 to stimulus_set_size, rather than from 0 to stimulus_set_size - 1 as they should, in the experiment’s .mwel file (a different file from the one you reviewed and found correct).

Starting the index at one instead of zero can work correctly, as long as any experiment code that uses the index to select stimuli from a stimulus group subtracts one from the index value. But I see your point: If some experiments start at zero while others start at one, and the pipeline analysis code is expecting them all to use the same starting index, then that’s going to be a serious problem.
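
One way to guard against that in pipeline code is to normalize everything to zero-based indices up front, with the starting index recorded per experiment (this is just a sketch; how you track each experiment’s starting index is up to you):

```python
def normalize_indices(indices, first_index):
    """Convert stimulus indices to zero-based, given the experiment's
    starting index (0 or 1)."""
    if first_index not in (0, 1):
        raise ValueError('expected a 0- or 1-based index scheme')
    return [i - first_index for i in indices]
```

For example, normalize_indices([1, 2, 3], first_index=1) and normalize_indices([0, 1, 2], first_index=0) both yield [0, 1, 2], so downstream analysis code sees a single convention.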

Chris

Thank you for getting back to me, Chris. Your comments were helpful, especially for sorting out some suboptimal usage in the past.

Just wanted to follow up on the value/units associated with the 'syringe pump' (this is actually the name of the product). Your assessment is correct: I assigned 0.15 as a duration rather than a volume, and I calibrated this value according to the hardware specifications of the syringe pump.

Lastly, I prefer to set image repetitions to be independent. I am wary of repeating stimuli in an interleaved fashion.

Best,
Yoon

Hi Yoon,

Just wanted to follow up on the value/units associated with the 'syringe pump' (this is actually the name of the product). Your assessment is correct: I assigned 0.15 as a duration rather than a volume, and I calibrated this value according to the hardware specifications of the syringe pump.

I’m sure it’s working as you intended, but I have to say that I’m even more confused now. If you configure the pump to run at 100 mL/min, and you tell it to pump 0.15 mL, isn’t it going to run for 90 milliseconds (not 0.15 microseconds)? If running the pump for 90 ms is the goal, it would be a lot clearer if the value of reward_duration was 90ms (instead of 0.15us).
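
For reference, the run time follows directly from volume and rate (the 100 mL/min figure is only the illustrative value above, not a claim about your pump’s configuration):

```python
def pump_run_time_ms(volume_ml, rate_ml_per_min):
    """Milliseconds the pump runs to dispense volume_ml at rate_ml_per_min."""
    return volume_ml / rate_ml_per_min * 60_000

# pump_run_time_ms(0.15, 100) -> approx. 90 ms
```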

Chris

Hi Chris,

The “duration” you see in MWEL does not equal the physical volume. The variable labelled “reward_duration” was carried over from when Alina’s rig was still using NI’s DAQ + solenoid; we wanted to keep the same code to minimize confusion.

In terms of the syringe pump in rig 1, I followed your instructions: the real volume is calibrated via additional properties such as the flow rate, syringe diameter, etc.
I settled on the value 0.15us after measuring hundreds of pumps to calibrate/measure the real volume. For instance, after testing two devices (NE500), I found that a value of 0.15us leads to approx. 0.1 mL per pump. This lets us dispense a total of 200 mL of liquid across a typical recording comprising 2,000 fixations (~2-3 hours).

Yoon