N.B. You should be familiar with the submitters' Checklist for Deposit before checking a Submission. Log in with your university login and click on 'Submissions' to find the Submission, then take it out of the pool, so that another curator won't spend time starting to look at it while you're already working on it.

Once you are familiar with the contents, you can often just refer to the 'easy guide' created by Stefano, which is a good simple reference.

DataShare Easy Guide


Things to check

Q1 Is DataShare the right place for this? 

    1. Is it research data? Check whether the submitted data fits with our definition of a Research Data dataset - see http://www.ed.ac.uk/information-services/research-support/data-library/data-repository/definitions . If the submission contains only a document(s), it will be more appropriate to suggest the depositor submit this to ERA. If there is data, ensure it is submitted in a more reusable format eg CSV or Excel.

    2. Is this data eligible/appropriate for DataShare? For example, if the dataset contains any personal, sensitive data, we would require it to be completely anonymised before deposit because DataShare is publicly accessible. The DataVault may be more appropriate, in association with encryption and DataSync for restricting access. If the depositor is an undergrad or a postgrad from another institution, we should ask the project supervisor, who should be an academic member of staff, for an assurance that the data would be of value to the research community. N.B. this could be a dissertation project or extracurricular project, but should be a UoE project.
    3. Is any part of the dataset potentially infringing copyright? If so, question the depositor so you can decide whether to accept or reject the deposit, or require some content to be removed. 


Files

Q2 File formats - Are the file formats Supported?

Go to Full Item View and check the Format field and the filename extension for each submitted file.

Use the guidance we provide to users on file formats and digital preservation (https://www.ed.ac.uk/information-services/research-support/research-data-service/after/data-repository/choosing-file-formats) to gauge whether you need to ask the depositor to consider converting the files to a different format, for reasons of long-term accessibility and interoperability. In particular if they have submitted jpegs, we might be able to convert these to JPEG-2000 for them.

Be aware - the support category (Unknown, Known or Supported) is recorded in the DSpace file format registry, but not displayed against the files at any time. If the filename extension is not present in DataShare's file format registry, it will be identified by the system as Unknown; so the 'Format' field will say 'Unknown'. And likewise if the filename extension is recorded in the format registry against an entry where the MIME type is empty or is "application/octet-stream" (i.e. it is undefined) then the file will be labelled in the Full Item view and in the curation interface with 'Format: Unknown'. Whereas, if the filename extension is recognised the system will display the MIME type (aka Media type) or an associated name in the Format field. This information should reflect the result of the system's automatic checking of the filename extensions against the DataShare File Registry, which can be viewed or edited at: http://datashare.is.ed.ac.uk/admin/format-registry , and a further lookup - the text displayed in the Format field is drawn from this messages list: https://github.com/edina/DSpace/blob/datashare-dspace-5.2/dspace/modules/xmlui/src/main/webapp/i18n/messages.xml#L2328 . Any new MIME types need to be given a name in this messages list.


If the 'Format' field says 'Unknown': this may be because the filename extension is not in the file format registry, OR because the registry entry contains no MIME type, OR the MIME type is "application/octet-stream"; you need to establish which of these is the case. Therefore you need to check the registry by searching the full text copy on the wiki page at: File format registry (just use Ctrl+F in your browser to search the whole page; sadly the DSpace registry is not searchable).
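
The three possible causes of 'Format: Unknown' can be sketched as a small lookup. This is only an illustration of the decision described above, not DSpace's own code: the REGISTRY dictionary is a made-up excerpt (the real entries live in the file format registry), and the function name is hypothetical.

```python
# Hypothetical, hand-made excerpt of the format registry: extension -> MIME type.
# The real entries must be looked up in the DataShare file format registry.
REGISTRY = {
    "csv": "text/csv",
    "pdf": "application/pdf",
    "dat": "application/octet-stream",
    "xyz": "",
}


def unknown_reason(filename, registry=REGISTRY):
    """Return why a file would show 'Format: Unknown', or None if recognised."""
    # Take the text after the last dot as the extension (empty if there is no dot).
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    if ext not in registry:
        return "extension not in registry"
    mime = registry[ext]
    if not mime:
        return "registry entry has no MIME type"
    if mime == "application/octet-stream":
        return "MIME type is application/octet-stream"
    return None
```

For example, unknown_reason("results.dat") reports the octet-stream case, which tells you the extension is registered and no new registry entry is needed.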

If the filename extension is attached to an existing entry in the registry, use the information in the registry to decide whether you need to open a dialogue with the depositor about their chosen file format e.g. if the format is merely 'Known' rather than 'Supported', you should decide whether to engage in a dialogue with the user about whether they can enhance the sustainability of their deposit by converting the files to a supported format. No action is necessary vis-a-vis the registry, unless of course you spot some information that could be updated.

For example, for .jpg files, a more sustainable (but less popular, less interoperable) format is JPEG-2000. If there are a large number of .jpg files, it may be a good idea to use ImageMagick on Linux as follows...

Example: file called "#3.jpg"

magick convert \#3.jpg number3.jp2

... notice the use of the backslash to escape the hash character (#), which the shell would otherwise treat as the start of a comment. The file can now be viewed using GIMP or Adobe Acrobat Pro.
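
Where there really are a large number of .jpg files, the same command can be repeated for each file from a short script. The following is a minimal sketch, assuming ImageMagick's 'magick' executable is on the PATH and was built with JPEG-2000 support; the function names are hypothetical. Unlike the example above it keeps the original file stem rather than renaming the file, and because the arguments are passed as a list (no shell involved) a '#' in a filename needs no escaping.

```python
from pathlib import Path
import subprocess


def jp2_commands(folder):
    """Build one ImageMagick command per .jpg file found directly in folder."""
    commands = []
    for jpg in sorted(Path(folder).glob("*.jpg")):
        target = jpg.with_suffix(".jp2")  # same name, .jp2 extension
        commands.append(["magick", "convert", str(jpg), str(target)])
    return commands


def convert_all(folder):
    """Run each command in turn; stops with an error if a conversion fails."""
    for command in jp2_commands(folder):
        subprocess.run(command, check=True)
```

It is worth printing the output of jp2_commands() and checking it before calling convert_all() on a depositor's files.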


If on the other hand the file format is not in the registry, follow the Procedure: New File Formats in DataShare Submissions.  

It may be worthwhile to sound out the depositor as to whether they would be content for us to do some work with them, with our colleagues who have digital preservation expertise, to enhance the dataset for long-term reusability. This might consist of us working through the kind of digital preservation planning found in the DPC's 'Making Progress with Digital Preservation' course, see from page 16 of the preservation planning chapter onwards - https://www.digitalpreservationcoalition.co.uk/knowledge-base/training/training-resources

Q3. Number of files - Are there too many files?

Check that the number of files is no more than 200 (where any zip file = 1 file). At much more than 200 bitstreams in an item, it becomes unmanageable for us to maintain the item e.g. changing the order in which the files are listed or deleting files becomes very time-consuming or in some cases impossible, causing the browser to crash. If the number is over 200, the depositor should consider re-grouping them into more than one item, and/or zipping some subset(s) together. WARNING: Do not approve the item with the intention of simply adding a zipped-up copy of the files and deleting the individual ones after approval; when the number of files is too large, it will be impossible to edit the item this way after approval. It will therefore be necessary in most such cases to 'reject' the submission.

Q4. Corrupted files - Are the files corrupted? 

We have observed a bug that occasionally arises as a result of asynchronous transfer which means some files, although uploaded completely, are not openable by the user, but open a duplicate instead, because some files have gone into a second 'bundle: original'. The issue can be solved by firstly getting a copy of the file using the Download-all button or the 'view' link on the edit bitstreams page, then deleting the file, then re-adding it. N.B. this can only be done after approval. 

If there are any files containing a "%" symbol in the filename, this can cause a bug where our browsers mis-order the files. So any such submission should be sent back to the submitter using 'Reject' with a request to rename the file(s) with "percent" instead of "%".

If any two files have the same name, send the submission back to the depositor to remove one of them, rename it and add it back in again. Files with the same name inside one or more zip files are fine and won't cause a problem. The reason we need to avoid files with the same name is that DataShare zips the files in the Item together to make the zip file for the download-all button; if two files have the same name, DataShare will add only one of them to the zip file. In other words, if the files are different in their content but have the same name, users who use download-all will miss some of the content because one of the files won't be included. If on the other hand the files are duplicates (this can happen as a result of the libraries used by the HTML upload page in the submission form) they may cause confusion as well as taking up unnecessary space.
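
If you have the list of top-level filenames to hand (e.g. copied from the Full Item View), the file count, "%" and duplicate-name checks above can be run in one go. This is a sketch only; the function name and the wording of the messages are made up, and the limit of 200 is taken from Q3, read here as "more than 200 is too many".

```python
from collections import Counter

MAX_FILES = 200  # threshold from Q3; a zip file counts as one file


def file_listing_problems(filenames):
    """Return a list of human-readable problems found in the top-level filenames."""
    problems = []
    if len(filenames) > MAX_FILES:
        problems.append(f"{len(filenames)} files: over the limit of {MAX_FILES}")
    for name in filenames:
        if "%" in name:
            problems.append(f"'%' in filename: {name}")
    # Counter keeps first-seen order, so duplicates are reported in listing order.
    for name, count in Counter(filenames).items():
        if count > 1:
            problems.append(f"duplicate filename ({count} copies): {name}")
    return problems
```

An empty list means none of these three problems was found; it says nothing about whether the files open correctly.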

Check that all the files can be opened/read. If there are spreadsheets (inc .xlsx), ask the submitter if it would be appropriate and if they'd be happy for a .csv version to be added, for accessibility/sustainability. Likewise, if there are .doc or .docx Word files, ask the submitter if they'd be happy for a pdf version to be added for accessibility and sustainability. If the deposit includes .xls or .doc files, contact the depositor and suggest that they save the files in the newer .xlsx and .docx formats respectively, since these are more open and xml-based and therefore we consider them more sustainable in the long run. If there is concern about compatibility issues with regard to Excel 2010, please consult http://office.microsoft.com/en-gb/excel-help/use-office-excel-2010-with-earlier-versions-of-excel-HA010342994.aspx#BM5a. If the documentation is contained in a file within a zip archive, try to encourage the depositor to agree to you adding a copy of the file in the top level of the item for ease-of-access; user feedback suggests this is a worthwhile step.

If the file has many thousands of lines and is too large for a text editor to open the whole file (this is common with GWAS files we receive, which are plain text but will make Notepad++ complain), Windows PowerShell may be a convenient way to check the files, i.e. using the 'more' command to page through, or the 'Get-Content' cmdlet e.g. "Get-Content data_file.txt -TotalCount 30" to view the first thirty lines (the Unix 'head' command is not available in Windows PowerShell by default).
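
If Python is available on your machine, the same quick look at the top of a very large text file can be done without loading the whole file. A minimal sketch, assuming the file is plain text; the function name is hypothetical, and undecodable bytes are replaced rather than raising an error.

```python
from itertools import islice


def first_lines(path, count=30):
    """Return the first `count` lines of a text file without reading the rest."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        # islice stops after `count` lines, so even huge files return quickly.
        return [line.rstrip("\n") for line in islice(handle, count)]
```

For example, print("\n".join(first_lines("data_file.txt"))) shows the first thirty lines.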

Metadata

Q5. Descriptive metadata - Is the metadata sufficient?

Check the metadata are readable, contain no major spelling errors or other mistakes, and are reasonably complete – remember, metadata (and file content) will be indexed by search engines. N.B. our metadata schema is based on Dublin Core, but we also need a certain level of alignment with DataCite's metadata schema, to retain our authority to mint DOIs, so curators should be familiar with both schemas.

NOTE: During the submission review stage there is currently an issue preventing edits to free text fields (e.g. Title; Description (Abstract & TOC); Data Publisher; etc) from being saved when using the 'Save and exit' button. If making edits to affected text fields, when checking a submission you should advance to the next submission step (i.e. 'File upload') using the 'Next >' button before clicking 'Save and exit'. This issue does not affect metadata fields which have an 'Add' button (e.g. Data Creator; Source), and only occurs during the submission review stage - metadata fields can be successfully modified and saved after items have been approved.

  • Data Creator is a mandatory field in DataCite, so if this field is empty, check with user what the reason is, and if they insist no data creator should be listed, consult with the team. Otherwise, we will get an error the day after approval, when DataCite's software refuses to create a DOI for the dataset. 
  • If there are any full stops in the data creators' names, remove them; our style is initials do not have full stops e.g. "Armstrong, J Douglas" or "Armstrong JD". If any of the data creators has a previous deposit, try to apply consistency, so that the faceted browsing will allow users to find all the deposits where this data creator is cited at once, e.g. if submitted metadata says "Bastin, ME", please expand to "Bastin, Mark E".

  • If the dataset is deposited by a student, make sure the principal investigators of the related research project are named as data creators. Otherwise, when the student has autonomously conducted research, they must be left as the only data creators.

  • Make sure the parts of the publisher name are separated by full stops, with no full stop at the end e.g. "University of Edinburgh. University of Edinburgh. School of Biological Sciences. Institute of Cell Biology". If the user has added more levels of organisation than necessary, it is best to remove them so that the Publisher name does not describe the research group, nor the department for example, just the University, School and perhaps a research centre.

  • Dataset Description (Abstract and 'TOC'). Carriage returns are not displayed so best to add three spaces to any new lines (in case some day DSpace renders Markdown). Likewise, if the depositor has entered a list with carriage returns, add an asterisk and a space at the start of each line to make Markdown bullet points, and improve readability (N.B. Markdown also recognises a dash at the start of a line as a bullet).

  • If the 'Funder' field says 'Other', email the depositor to ask them for the name of the funder(s) so you can overwrite that information. If it seems a reasonable supposition that there may be future submissions linked to the same funder, ask the tech team to add this funder to the dropdown controlled vocabulary (this is now a quick and easy config change for them to do, with no restart required, so no need for a Jira).

  • 'Funder': If possible, in discussion emails with the depositor ask for the grant numbers. Although this will not go in the DataShare metadata it is useful for linking to the appropriate project funding in Pure.

  • If there is a DOI or other information about a pre-print or publication, then where possible:

    • Include a URI for the paper in the "Is Referenced By" field; the script will pick up the "https" start of the string and convert that into a hyperlink on the Summary page. 

    • If the paper is not yet published in its final version put a reference in the data description and readme file; e.g. McAra et al. "Paper title if you have it" (In Submission) ... OR Higgs et al., Physics & Astronomy Letters (Accepted manuscript). NB we used to include this info in the IsReferencedBy field, but we have changed our practice to better fit with the DataCite standard.  
  • Do you need to add a subject category? Check the subject category and keywords (i.e. discovery metadata) - if empty, try to complete the subject category and add at least one or two keywords, consulting with the depositor if necessary.

  • Does the dataset and/or documentation language match the dc.language.iso field? Although dc.language.iso is not a mandatory field, we should always make sure that the language ISO code matches the language of the dataset and/or documentation. This is particularly relevant and recurrent for linguistic deposits. When the language code is evidently different from the language of the deposit, enquire with the user. They might have left the field set as default ("eng") either because they forgot to change it or they couldn't find the relevant language in the list. Indeed, our list contains just about 40 languages known in DataShare at present out of almost 8000 listed in the ISO 639-3 catalogue ( https://iso639-3.sil.org/code_tables/639/data ). When we realise that a language is missing, these are the steps we need to undertake:
    • Confirm the exact language with the user.
    • Find the related ISO code for the new language by visiting the ISO 639-3 catalogue above.
    • Edit the dc.language.iso metadata field accordingly.
    • Get in touch with our tech team to add the new language to DataShare's dropdown list.
    • Edit the "Current metadata schema" Wiki page accordingly (language list section).
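
Going back to the style rule for Data Creator names earlier in this list, removing full stops from a batch of names can be sketched as below. The function name is hypothetical; it only strips full stops and tidies spacing, and does not expand initials to match a creator's previous deposits, which still needs a curator's judgement.

```python
def tidy_creator_name(name):
    """Remove full stops from a creator name and collapse repeated spaces."""
    without_stops = name.replace(".", "")
    return " ".join(without_stops.split())
```

For example, "Armstrong, J. Douglas" becomes "Armstrong, J Douglas".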

Q6. Rights - Does the licence make sense?

Check the rights statement. If the depositor has written their own rights statement, read carefully to check whether their intentions are clear. Probably a good idea to remind the depositor that any restrictions they've placed on re-use may be difficult to police. 

NB DataCite does not allow more than 2,000 characters in this field, so any text over this length needs to instead be put in a file, so the dc.rights field length can be shortened eg "See terms-and-conditions.txt". 
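
A quick way to check a long rights statement against the limit quoted above, rather than counting by eye. A sketch only: the constant and function names are made up, and the 2,000 figure is simply the one given in this guidance.

```python
DATACITE_RIGHTS_LIMIT = 2000  # character limit quoted in the guidance above


def rights_too_long(rights_text, limit=DATACITE_RIGHTS_LIMIT):
    """Return True if the rights statement exceeds the character limit."""
    return len(rights_text) > limit
```

If this returns True, move the full text into a file and shorten dc.rights as described.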

  • NOTE: Since the update to the DataCite metadata schema (Jan 2019) it appears that when a user selects 'no license' (and adds their own rights statement), after the item is approved the DOI for the item is minted but not assigned (DataCite reports an error with the item metadata). This is due to the dc.rights metadata field. In order to assign the DOI, the rights field (dc.rights) must first be deleted - make sure that you keep a copy of the original rights statement. Once the DOI has been added to the item record, you must go into the item record and add the rights statement back in (i.e. Edit Item > Item Metadata > Add New Metadata > dc.rights).

Q7. Access - Is the embargo appropriate?

Check the embargo date:

  • If it is more than a year in the future, ask the depositor why they have chosen such a long date, and whether they are confident they are complying with their funder's policy and the university's policy about sharing data.
  • Do not allow them to set the embargo to Hogmanay (31st of December) or the 1st of January; explain to them that they'll receive an automatic reminder email one week in advance to say it will expire automatically, but we’ll all be on annual leave by that time, so it’ll be too late if they realise at that point that they want us to extend the embargo for them.
  • Do not allow an incomplete date such as '2027' (instead of '2027-07-31' for example). Such a date (YYYY eg '2027') is ambiguous, which is a bad thing in itself, but even worse, it is interpreted by DataShare as the 1st of January, which we do not allow. So for example you should amend '2027' to the YYYY-MM-DD form, using a date outside the winter shutdown such as '2027-01-31', and then make the depositor aware and give them the opportunity to choose a different date.
  • The system will not allow us to approve a submission with an embargo date in the past. If the date falls today or very soon: make sure you change it if the curation process takes so long that the date then falls in the past. (The submission form prevents users from entering an embargo date in the past, so this only causes a problem when the curation process has taken several days).
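
The embargo rules in the list above can be summarised as a short check. This is a sketch under stated assumptions, not DataShare's own validation: the function name and messages are hypothetical, and "more than a year" is approximated as 365 days from today.

```python
import re
from datetime import date, timedelta


def embargo_problems(text, today=None):
    """Return a list of problems with an embargo date given as text."""
    today = today or date.today()
    # Insist on the full YYYY-MM-DD form; a bare year such as '2027' is rejected.
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", text):
        return ["not a complete YYYY-MM-DD date"]
    try:
        embargo = date.fromisoformat(text)
    except ValueError:
        return ["not a valid calendar date"]
    problems = []
    if (embargo.month, embargo.day) in {(12, 31), (1, 1)}:
        problems.append("falls on 31 December or 1 January")
    if embargo < today:
        problems.append("date is in the past")
    elif embargo > today + timedelta(days=365):
        problems.append("more than a year away: ask the depositor why")
    return problems
```

Passing today explicitly, e.g. embargo_problems("2027", today=date(2026, 6, 1)), makes it easy to re-check a date when curation has taken several days.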

N.B. when we receive the email messages to say the embargoes are one week away from expiring, we should check whether each deposit links to a publication or pre-print; if not, check whether there is a new publication out; if so, update the metadata in DataShare and Pure accordingly. And then email round the other curators to say you've done this so we're not duplicating the effort.

Q8. Documentation - Is there enough documentation?

Check that there is appropriate documentation - for example usually there should be a file listing in a documentation/readme file containing at least some brief description of the file(s). Check that the files in the submission correspond to the list in the readme file; sometimes a user might not notice that DataShare dropped some files because the upload did not complete, or that folders which they've described in their readme file are no longer there because the web form won't allow the user to upload a folder. In either of these two cases you would probably need to reject the submission so the submitter can fix the fileset.
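
Comparing the readme's file listing with what actually arrived is a simple two-way set comparison. A minimal sketch, assuming you have typed or pasted both lists of names by hand; the function and key names are hypothetical.

```python
def compare_with_readme(readme_listing, submitted_files):
    """Compare names listed in the readme with the files actually submitted."""
    listed = set(readme_listing)
    submitted = set(submitted_files)
    return {
        # Described in the readme but absent: possibly a dropped upload or a folder.
        "listed_but_missing": sorted(listed - submitted),
        # Present in the item but not described in the readme.
        "submitted_but_unlisted": sorted(submitted - listed),
    }
```

Anything under "listed_but_missing" is the case that would probably call for a 'Reject' so the submitter can fix the fileset.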

Q9. Where is the right place within DataShare, which Collection? 

Check whether the Submission has been added to a Collection other than the Default Holding Collection. If not, you will need to ask the Submitter to agree to a title for a new Collection which you will create, and provide some descriptive text for it, or to identify which existing Collection the Item can be added to. You will also need to ask the Submitter to identify which Community or Sub-Community the Collection can be linked to, and if it does not already exist, you will need to create it. N.B. you might want to check Unidesk in case the submitter has emailed to tell us the Collection name they want, as the documentation tells them. NB in the vast majority of cases, this is not mandatory information for approval of the submission, since we can always select or create an appropriate Collection for the dataset ourselves. Therefore, when a prospective submitter contacts us to request the creation of a collection, we ask them to first submit their data, and we only create the new collection after we have received a dataset submission. Thus we avoid creating collections which might then never hold any deposits. 

Workflow - sequence of events

If something is not as it should be, you may need to go back and ask the Depositor to amend something or provide more information. You might need to find their email address by searching for their surname at http://datashare.is.ed.ac.uk/admin/epeople (N.B. the first name may not be recorded as part of the e-person, so it is better to search by surname).

WARNING: The Depositor will not be able to amend the Item directly in DataShare (neither the files nor the metadata), once the item has been submitted to the administration task pool. You will have to “Reject” the Item, and they will have to make any changes to it on their machine and then resubmit their Submission. You should explain tactfully in the email that this is for technical reasons, and does not mean the submission will not be accepted.

On the other hand, you may edit the metadata before Approval.

If you need to add a file e.g. a readme file, you cannot do so during the approval process, but you can do so after approval. 

You've approved it... now what? 

After approving a deposit - what next?
