Open Data FAQs for chemists

The following FAQs have been asked by members of the Department of Chemistry and answered by members of the Open Data team at the University.

If you have any amendments or further questions, please contact the Librarian at the Department of Chemistry, Clair Castle, at library@ch.cam.ac.uk, in the first instance.

The Open Data team can also be contacted at info@data.cam.ac.uk or via http://www.data.cam.ac.uk/.

FAQs

What would open data for a typical synthetic organic chemistry paper look like?

For a synthetic paper you might include, for example, the output files from NMR, UV/Vis, and IR measurements. These should be in a format that others can use and manipulate; images of graphs, especially of NMR experiments, wouldn’t meet this criterion. So for an NMR measurement you should include the processed data as, for example, a .csv file so that future users can replot the data for themselves.
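As a rough illustration of what a replottable .csv file could contain, the sketch below writes a processed 1D spectrum as two columns using Python's standard csv module. The function name, column headers, file name and numbers are all made up for the example; they are not a format required by the repository or by any funder.

```python
import csv

def write_spectrum_csv(path, shifts_ppm, intensities):
    """Write a processed 1D spectrum as two columns: chemical shift and intensity."""
    if len(shifts_ppm) != len(intensities):
        raise ValueError("shifts and intensities must have the same length")
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        # Hypothetical column names; choose headers that make the units explicit.
        writer.writerow(["chemical_shift_ppm", "intensity"])
        writer.writerows(zip(shifts_ppm, intensities))

# Illustrative values only, not real measurements.
example_shifts = [7.30, 7.26, 7.22]
example_intensities = [0.12, 1.00, 0.15]
write_spectrum_csv("example_spectrum.csv", example_shifts, example_intensities)
```

A file like this can be opened in a spreadsheet or plotting package without any vendor software, which is the point of sharing the numbers rather than an image.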

Lab books form an important record of the experiments, so where possible they should also be included in the data record (or at least the detailed methodology for the experiment, so that it can be reproduced).

However, if it would be too time-consuming and costly to digitise the lab books, you can simply create a metadata record in the repository so that future users can contact you to physically access your lab books.

What would open data for a typical molecular dynamics based paper look like?

For a computational paper you might include the input and output files from the calculations. Whether you need to include binary output files (which are often produced but hardly ever analysed) is left at your discretion, but if you feel that these files are necessary for the interpretation of the results then they should also be included.

If you have performed a whole suite of experiments, all of which are similar, then it might only be necessary to provide the input files and a couple of example output files. Future researchers can then scrutinize a sample of your output and then re-run all your input files if they wish to do so.

How raw should the deposited data be? Do funders have a view on, for example, whether I should deposit an NMR spectrum or the actual fid which can then be processed to give the spectrum?

As a minimum, you should share research data which is needed to validate findings described in your publication. You, the researcher, are the expert on your own research data and you are in the best position to decide which data is valuable to others, and needed to validate your findings. In the specific case of NMR measurements, it’s probably sufficient to share your results in a .csv file format. You should be able to reproduce 1D and 2D NMR quite readily in this format. Obviously this isn’t the original fid (if you think the fid is important to share, then by all means share it), but it is a format which other chemists (and physicists) can use to replot and reanalyse your results. Sharing images of graphs as a form of data is not sufficient; you should provide the data underlying the graph.
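To show why the underlying numbers are more useful than an image, here is a minimal sketch of how another researcher might read a shared two-column spectrum and reanalyse it, in this case by locating the strongest peak. The column names, the helper functions and the inline data are assumptions made for the example, not a prescribed layout.

```python
import csv
import io

def read_spectrum(handle):
    """Return (shifts, intensities) as lists of floats from a two-column CSV."""
    reader = csv.DictReader(handle)
    shifts, intensities = [], []
    for row in reader:
        shifts.append(float(row["chemical_shift_ppm"]))
        intensities.append(float(row["intensity"]))
    return shifts, intensities

def strongest_peak(shifts, intensities):
    """Return the (shift, intensity) pair with the highest intensity."""
    index = max(range(len(intensities)), key=intensities.__getitem__)
    return shifts[index], intensities[index]

# Inline stand-in for a downloaded .csv file; the values are illustrative only.
example_csv = io.StringIO(
    "chemical_shift_ppm,intensity\n"
    "7.30,0.12\n"
    "7.26,1.00\n"
    "7.22,0.15\n"
)
shifts, intensities = read_spectrum(example_csv)
peak = strongest_peak(shifts, intensities)
```

None of this is possible from a picture of the spectrum without digitising it by hand.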

What about iterative experiments? If I quote a yield of 80% for a synthesis, should I deposit data only for that synthesis or also for all the iterated syntheses that led to the final one?

Again, as a minimum, you should share research data which is needed to validate findings described in your publication. If in your publication you only describe the process that led to the yield of 80% for a synthesis, as a minimum you should deposit data only for that synthesis.

Do I have to use the University's repository, Apollo?

No, you can use any repository you wish. Only researchers funded by the ESRC and NERC need to use ESRC’s and NERC’s data repositories, respectively. We provide guidance on how to choose a data repository here: http://www.data.cam.ac.uk/repository.

What is a realistic maximum size for a dataset?

We can accept big file submissions from you. Individual files submitted via Symplectic Elements (www.data.cam.ac.uk/upload) cannot exceed 1GB, and the total size of all your files needs to be below 20GB. Bigger submissions are possible via external hard drives (arrange this by emailing info@data.cam.ac.uk).

However, when it comes to sharing big files via the repository, remember that the end user will need to download your files in order to re-use them. In other words, in order to be re-usable, your data needs to be downloadable. To help the end user, consider ways of trying to reduce the size of your files before sharing. You can for example compress your files or, if possible, divide your dataset into smaller, downloadable files.
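One possible way to divide a large file into smaller, downloadable pieces is sketched below. It is only an illustration: the function name, the part-naming scheme and the tiny demonstration file are assumptions, and where your file format has a natural way of being split (for example by experiment or by time step) that is usually preferable, because each piece then stays readable on its own.

```python
import os
import tempfile

def split_file(path, max_part_bytes, buffer_bytes=1024 * 1024):
    """Split `path` into numbered parts no larger than max_part_bytes; return part paths."""
    part_paths = []
    with open(path, "rb") as source:
        index = 0
        data = source.read(min(buffer_bytes, max_part_bytes))
        while data:
            index += 1
            part_path = f"{path}.part{index:03d}"  # hypothetical naming scheme
            written = 0
            with open(part_path, "wb") as part:
                while data:
                    part.write(data)
                    written += len(data)
                    remaining = max_part_bytes - written
                    if remaining == 0:
                        # This part is full; read ahead for the next part.
                        data = source.read(min(buffer_bytes, max_part_bytes))
                        break
                    data = source.read(min(buffer_bytes, remaining))
            part_paths.append(part_path)
    return part_paths

def read_bytes(path):
    """Read a whole file into memory (fine for the small demonstration below)."""
    with open(path, "rb") as handle:
        return handle.read()

# Demonstration on a 25-byte file split into parts of at most 10 bytes.
with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, "demo.bin")
    with open(demo_path, "wb") as demo:
        demo.write(bytes(range(25)))
    parts = split_file(demo_path, max_part_bytes=10, buffer_bytes=4)
    part_sizes = [os.path.getsize(p) for p in parts]
    rejoined = b"".join(read_bytes(p) for p in parts)
```

Concatenating the parts in order restores the original file, so it is worth telling end users in your description how the pieces fit together.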

Can one dataset support several papers?

Yes, we can link the same dataset to several publications in the repository.

What is metadata? Can you give some examples?

Metadata is the description of data. We provide a detailed explanation of what metadata is here: http://www.data.cam.ac.uk/data-management-guide/organising-your-data#Metadata.

Discipline-specific examples of metadata are provided by the Digital Curation Centre, and can be found here: http://www.dcc.ac.uk/resources/metadata-standards.
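As a loose illustration of the idea, a metadata record for a small chemistry dataset might look something like the sketch below. Every field name and value here is hypothetical and chosen only for the example; it is not the repository's schema or any published metadata standard, so consult the links above for what is actually expected.

```python
import json

# Hypothetical metadata record; field names are illustrative, not a real schema.
metadata = {
    "title": "Processed NMR spectra supporting <publication title>",
    "creators": ["<name of researcher>"],
    "description": "Processed 1D NMR spectra as .csv files, one file per compound.",
    "file_formats": [".csv"],
    "licence": "CC BY",
    "related_publication": "<DOI of the paper, once known>",
}

# Fields that this example treats as essential (an assumption for the sketch).
REQUIRED_FIELDS = ["title", "creators", "description", "licence"]

def missing_fields(record, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or empty in the record."""
    return [field for field in required if not record.get(field)]

missing = missing_fields(metadata)
metadata_json = json.dumps(metadata, indent=2)
```

The aim is simply that someone who finds the record can tell what the data is, who created it, and how it may be re-used.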

Can you give examples of appropriate statements to add to publications?

We provide sample statements here: http://www.data.cam.ac.uk/faq-0/data-repositories (under the question: How do I link to my data? Do you have any template statements that I could adapt for my publication?).

Do I need a DOI for everything?

You need a DOI for each record in the repository. Note that every record can contain several items. Typically researchers create a separate dataset record (with a DOI assigned) for each publication.

How and when does an author get a DOI that they can add to their publication note regarding where the data has been deposited?

After a data submission is received via the website upload form (www.data.cam.ac.uk/upload), the placeholder DOI link to your data will be sent to you automatically within 5-10 minutes.

Can I get a DOI before the data is completely ready?

You will be able to get a placeholder DOI. This DOI will be reserved for your dataset, and you will be able to add that link to your publication. Once your dataset is complete, we will then register your DOI. After that, you will not be able to make further modifications to the record.

How can I get a placeholder DOI for my dataset?

Simply go to your Symplectic Elements account and follow the steps for adding a new dataset. At the question about the status of your dataset, select ‘Placeholder record’. You will be sent a placeholder DOI for your dataset automatically after you have uploaded your placeholder dataset. For more information about how to do this, click here.

How do I know that I have uploaded my data successfully?

You should receive an e-mail containing the placeholder DOI for your dataset. This also confirms that the files have been successfully received.

The DOI link for my placeholder dataset does not resolve - what should I do?

You need to go back to Symplectic Elements, finalise your dataset and mark the status of your data submission as ‘final’. We will then review your dataset and once approved, the DOI link will start resolving. For more information about how to do this, click here.

What can other people do with my data?

This will depend on the licence that you choose for your data. When you submit your files to the University of Cambridge data repository, we will ask you how you want to license your data. It is important that you think about this carefully, as this will determine what others can and cannot do with your data. Our recommended licence is CC BY. CC BY requires end users to cite your data but also allows your dataset to be re-used for multiple purposes (thus maximising the impact of your dataset and the potential number of citations). You can read more about available licences here: http://www.data.cam.ac.uk/data-faq/licensing-competition-and-data-misuse.

Will publishers regard data deposition as prior publication?

Most of the time when a publisher accepts a manuscript for a publication they will ask you to sign a copyright transfer agreement. In practice these agreements usually mean that you will transfer your copyright to the publisher and you will no longer own any copyright over the published version of your article.

However, if you submit your dataset supporting the publication to the University repository, the University is the publisher of your dataset, but NOT the publisher of your corresponding paper. This means that you can decide under what conditions to make your dataset available.

If you have any questions about data licensing, please contact info@data.cam.ac.uk.

How does open data relate to open publication?

Both open data and open publication are part of Open Access to scholarly outputs. Many funders have required publications to be made available open access for some time. However, in order to facilitate cultural change, stimulate knowledge exchange, and help science move forward, funders have now also recognised that the research data underpinning publications should be openly available. Hence the new requirement for open data.

Your website states that there is a “one-off” fee for data storage charged for by the University. Does this fee apply to each paper that is stored there?

This “one-off” fee applies to each data submission (every submission made via our website upload form: www.data.cam.ac.uk/upload) of 20GB or more.

How expensive is "Too expensive to share"?

The EPSRC does not provide a cut-off for when sharing becomes too expensive.

The EPSRC policy is focused on sharing research data which:

Is necessary to validate findings described in your publication;
Might be valuable to others;
Cannot be re-generated (for example, data coming from environmental observations).

So if your data can be easily re-generated, and it is expensive to share, it is probably worth considering sharing a representative sample, instead of sharing all of your research data.

What about international collaborations? I will not be able to compel foreign co-workers to participate in Open Data, particularly if I am not the corresponding author on a paper. Is the funders’ view that I should just make the data pertinent to my bit of a paper available?

Ideally (and in future collaborations) you should inform your potential collaborators that due to your public funding, you are expected to share research data as openly as possible. With your current research project you should determine with your collaborator which data can be shared, which cannot, and describe this in your data management plan. If (some) research data need to be restricted, then you should provide an appropriate statement in your publication explaining the reason why access to data is restricted.

The EPSRC by default would like all research data to be shared. This comes from the principle of supporting a global cultural change and global advancement of science. Note also that the requirement for data sharing is not limited to UK funders. Like the UK Research Councils, the European Commission, the NIH, the Bill & Melinda Gates Foundation and many other funders have policies on research data sharing.

That's why the default expectation is that data underpinning a publication should be shared, to make results available for scrutiny and validation. If your collaborators are disinclined to share research data, you should determine with them which data can be shared and under what conditions (so altering the 'default' state).

The EPSRC's checks will be based on looking for statements about research data in publications. If the statement clearly explains that, according to the collaboration agreement, access to some research data had to be restricted (and why), I personally doubt they will look for proof of where each piece of data comes from. From all our conversations it is quite clear that the aim of the EPSRC's policy is to drive a cultural change and move towards greater openness. They also admitted that they want to see whether researchers show willingness to share research data, even if sometimes not everything (or nothing) can be shared.

Do I need to share data underpinning my PhD thesis?

PhD students are encouraged to share research data from their PhD research, provided that:

The research process is not damaged by premature and/or inappropriate release of research data.
The research data has been generated in accordance with the University’s Research Policies, the University’s Research Integrity and Ethics guidelines and in accordance with policies of research funders.

In general it is advised that supervisors are always consulted before any research data underpinning PhD research is released.

Is open data a problem or an opportunity?

At the University of Cambridge we think that open data is a great opportunity for a cultural change in research, and a move towards better transparency and openness in science. Of course, there will be problems during the initial implementation of open data; however, we do believe that this will be a learning process for everyone and ultimately beneficial for science, and for society.