The CLARIN-IS repository does not usually accept entries without data (i.e. without bitstreams attached to the entry). The guidelines below describe how deposited language resources should be structured, which formats the CLARIN-IS repository accepts, and which standards should be used as annotation formats in textual language resource files.

Basic aspects

Names of files and directories

Filenames and directory names should contain only ASCII letters, digits, and the hyphen ("-") and period (".") characters. They should not contain spaces, underscores, brackets, quotes, dollar signs, slashes, colons, or any other punctuation characters, nor accented letters or other non-ASCII characters. Examples of good filenames are "news.v1.zip", "ParlaMint-IS.xml", "TextNormalization-statistics.tsv".
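The naming rule above can be checked automatically before depositing. The following is a minimal sketch in Python; the function name is_valid_name is only an illustration and is not part of any repository tooling.

```python
import re

# Allowed characters: ASCII letters, digits, hyphen and period.
VALID_NAME = re.compile(r"[A-Za-z0-9.-]+")


def is_valid_name(name: str) -> bool:
    """Return True if a single file or directory name uses only allowed characters."""
    return VALID_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for name in ["news.v1.zip", "ParlaMint-IS.xml", "my file.txt", "data_v2.tsv"]:
        print(name, "OK" if is_valid_name(name) else "not allowed")
```

The check applies to one path component at a time, so a full path should be split into its directory and file names first.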

File extensions

Standard or commonly recognised file extensions should be used, such as ".txt", ".xml", ".jpg". Double extensions can be used (e.g. "igc.tei.xml" or "igc.TEI.zip") to indicate that the file is in a standard encoding, or that an archive file contains files of a certain type.

In the rest of this document, the preferred extensions are given next to the file types.

File compression

When resources deposited in the repository are compressed, a complete directory should be compressed, and the name of the compressed file should be the same as that of the directory it unpacks into. For example, the file "IGC-Parla.21.05.zip" should unpack into the directory "IGC-Parla.21.05/", which then contains the files and possibly subdirectories. It is recommended that the directory also contains a README text file, which gives the title of the resource and its handle as well as a short description.

CLARIN.IS prefers ZIP (.zip) files, but also accepts gzip-compressed TAR (.tgz) archives or, for single files, GNU ZIP (.gz).
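One way to produce a ZIP file that unpacks into a directory of the same name is sketched below, using only the Python standard library. The function names zip_directory and unpacks_into_own_directory are hypothetical helpers introduced for this example.

```python
import zipfile
from pathlib import Path


def zip_directory(directory: str) -> Path:
    """Create <directory>.zip next to the directory, with the directory as top-level folder."""
    src = Path(directory).resolve()
    # Plain string concatenation, because names such as "IGC-Parla.21.05" contain periods.
    archive = src.parent / (src.name + ".zip")
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(src.rglob("*")):
            if path.is_file():
                # Paths are stored relative to the parent, so each entry starts with "<directory>/".
                zf.write(path, path.relative_to(src.parent).as_posix())
    return archive


def unpacks_into_own_directory(archive: str) -> bool:
    """Check that every entry lies under a folder named like the archive (assumes a .zip name)."""
    expected = Path(archive).name[: -len(".zip")] + "/"
    with zipfile.ZipFile(archive) as zf:
        names = zf.namelist()
    return bool(names) and all(name.startswith(expected) for name in names)
```

Calling zip_directory("IGC-Parla.21.05") would then write "IGC-Parla.21.05.zip" beside the directory, and unpacks_into_own_directory can be used to check an existing archive before upload.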

Controlled values

Language codes

When the data (or filename) needs to refer to a certain language, language codes should be used, rather than names of languages. When they exist, the two-letter ISO 639-1 language codes should be used, while for languages that do not have a two-letter code, the three-letter code from ISO 639-3 should be used.
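The preference order can be illustrated with a small sketch. The three-entry table below is filled in by hand purely for illustration; a real script would load the complete ISO 639 code tables instead.

```python
# Illustrative table only: (ISO 639-1 code or None, ISO 639-3 code).
CODES = {
    "Icelandic": ("is", "isl"),
    "English": ("en", "eng"),
    "Old Norse": (None, "non"),  # has no two-letter code
}


def language_code(name: str) -> str:
    """Prefer the two-letter code; fall back to the three-letter code."""
    two_letter, three_letter = CODES[name]
    return two_letter if two_letter is not None else three_letter


if __name__ == "__main__":
    for language in CODES:
        print(language, language_code(language))
```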

Dates and times

All dates and times that appear in a machine-processable context should follow ISO 8601, i.e. “2020-12-28” for a date, “23:21:21” for a time, and “2020-12-28T23:21:21” for a combination of the two.
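Most programming languages can produce and parse these forms directly. As a short sketch, Python's standard datetime module does so as follows (the values are arbitrary examples):

```python
from datetime import date, datetime, time

d = date(2020, 12, 28)
t = time(23, 21, 21)
dt = datetime.combine(d, t)

print(d.isoformat())   # 2020-12-28
print(t.isoformat())   # 23:21:21
print(dt.isoformat())  # 2020-12-28T23:21:21

# The same strings can be parsed back into date and time objects.
parsed = datetime.fromisoformat("2020-12-28T23:21:21")
```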

Accepted binary formats

Below is a list of the formats accepted by CLARIN.IS. The formats have been grouped into functional domains. Each item in the list is followed by a link to further information about the format, usually the one given in CLARIN’s Standards Information System (SIS), accessible at https://standards.clarin.eu/.

Audiovisual Annotation

Folker https://clarin.ids-mannheim.de/standards/views/view-format.xq?id=fFLN
TeiSpoken https://clarin.ids-mannheim.de/standards/views/view-format.xq?id=fTEISpoken
Transcriber https://clarin.ids-mannheim.de/standards/views/view-format.xq?id=fTRS

Audiovisual Source Language Data

MP3 https://clarin.ids-mannheim.de/standards/views/view-format.xq?id=fMP3
AIFF https://clarin.ids-mannheim.de/standards/views/view-format.xq?id=fAIFF
AVI https://clarin.ids-mannheim.de/standards/views/view-format.xq?id=fAVI
FLAC https://clarin.ids-mannheim.de/standards/views/view-format.xq?id=fAVI
MPEG-4 video https://clarin.ids-mannheim.de/standards/views/view-format.xq?id=fMP4
Wave https://clarin.ids-mannheim.de/standards/views/view-format.xq?id=fWave

Documentation

HTML https://clarin.ids-mannheim.de/standards/views/view-format.xq?id=fHTML
PDF/A https://clarin.ids-mannheim.de/standards/views/view-format.xq?id=fPDFA
TEI https://clarin.ids-mannheim.de/standards/views/view-format.xq?id=fTEI
XML https://clarin.ids-mannheim.de/standards/views/view-format.xq?id=fXML

GeoData

GML https://clarin.ids-mannheim.de/standards/views/view-format.xq?id=fGML
KML https://clarin.ids-mannheim.de/standards/views/view-format.xq?id=fKML

Image Source Language Data

GIF https://clarin.ids-mannheim.de/standards/views/view-format.xq?id=fJPEG
JPEG https://clarin.ids-mannheim.de/standards/views/view-format.xq?id=fJPEG
PDF/A https://clarin.ids-mannheim.de/standards/views/view-format.xq?id=fPDFA
PNG https://clarin.ids-mannheim.de/standards/views/view-format.xq?id=fPNG
SVG https://clarin.ids-mannheim.de/standards/views/view-format.xq?id=fSVG
TIFF https://clarin.ids-mannheim.de/standards/views/view-format.xq?id=fTIFF

Lexical Resource

CSV https://clarin.ids-mannheim.de/standards/views/view-format.xq?id=fCSV
LMF https://en.wikipedia.org/wiki/Lexical_Markup_Framework
TSV https://whatis.techtarget.com/fileformat/TSV-Tab-separated-values-file

Statistical Data

R https://www.r-project.org/about.html
SPSS .dat and .sps https://clarin.ids-mannheim.de/standards/views/view-format.xq?id=fSPSS.data-and-setup

Text Annotation

CoNLL-U https://universaldependencies.org/format.html
TEI https://clarin.ids-mannheim.de/standards/views/view-format.xq?id=fTEI
XML https://clarin.ids-mannheim.de/standards/views/view-format.xq?id=fXML

Textual Source Language Data

Plain text https://clarin.ids-mannheim.de/standards/views/view-format.xq?id=fTextPlain

XML schemas

RelaxNG https://relaxng.org/

Compression and packaging

GZIP https://en.wikipedia.org/wiki/Gzip
TAR https://en.wikipedia.org/wiki/Tar_(computing)
ZIP https://en.wikipedia.org/wiki/ZIP_(file_format)

Encoding of textual files

As most repository submissions involve files that are essentially text files (including numeric data, program source files, XML files, etc.), this section explains in more detail how such files should be encoded.

Character encoding

CLARIN.IS accepts only Unicode files. We do not accept files with 8-bit encodings, such as ISO 8859 or Windows code pages. Unicode files should be encoded in UTF-8. The only exception is text files in scripts not based on the Latin alphabet, such as Japanese, which may use UTF-16.
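Whether a file is valid UTF-8 can be tested by trying to decode it, and a file in a known 8-bit encoding can be converted before depositing. The sketch below uses hypothetical helper names (is_utf8, convert_to_utf8) and assumes that the depositor knows the source encoding, since it cannot be detected reliably.

```python
from pathlib import Path


def is_utf8(path: str) -> bool:
    """Return True if the file's bytes decode as UTF-8 (pure ASCII files also pass)."""
    try:
        Path(path).read_bytes().decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def convert_to_utf8(src: str, dst: str, source_encoding: str) -> None:
    """Re-encode a file from a known encoding (e.g. "iso-8859-1") to UTF-8."""
    text = Path(src).read_bytes().decode(source_encoding)
    Path(dst).write_bytes(text.encode("utf-8"))
```

For example, convert_to_utf8("old.txt", "new.txt", "iso-8859-1") would rewrite a Latin-1 file as UTF-8. Decoding with the wrong source encoding silently produces wrong characters, so the result should be inspected.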

Plain text files

For unstructured text, we accept plain text files (.txt). Simple formatting conventions, for example that a line break indicates a new paragraph or that text in square brackets indicates a transcriber comment, can also be used, as long as they are explained in a README file.

Tabular data

For spreadsheet or database-like data, we accept commonly used formats such as tab-separated (.txt/.tsv/.tab) and comma-separated (.txt/.csv) values. The tabular files should contain a header row, and the data should be accompanied by a README file explaining the meaning of the columns.
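A tab-separated file with a header row can be written and read with Python's standard csv module, as in the sketch below. The column names (form, lemma, count) and the counts are invented for illustration only.

```python
import csv
import io


def write_tsv(rows, fieldnames, stream):
    """Write dict rows as tab-separated values, starting with a header row."""
    writer = csv.DictWriter(stream, fieldnames=fieldnames, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


def read_tsv(stream):
    """Read tab-separated values back into dicts keyed by the header row."""
    return list(csv.DictReader(stream, delimiter="\t"))


if __name__ == "__main__":
    rows = [
        {"form": "hestur", "lemma": "hestur", "count": 3},  # invented counts
        {"form": "hesta", "lemma": "hestur", "count": 1},
    ]
    buffer = io.StringIO()
    write_tsv(rows, ["form", "lemma", "count"], buffer)
    print(buffer.getvalue())
```

When writing to a real file instead of a StringIO buffer, open it with encoding="utf-8" and newline="" so that the csv module controls the line endings.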

Annotated corpora can be submitted in the CoNLL-U format (.conllu) used by the Universal Dependencies project.
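A CoNLL-U file holds one token per line with ten tab-separated fields, comment lines starting with "#", and a blank line after each sentence. The following is a minimal reader sketch, not a full validator; the sample sentence and its annotation were written by hand for this example.

```python
FIELDS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS", "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]


def read_conllu(text):
    """Split CoNLL-U text into sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
        elif line.startswith("#"):
            continue  # sentence-level comment such as "# text = ..."
        else:
            values = line.split("\t")
            if len(values) != len(FIELDS):
                raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}: {line!r}")
            current.append(dict(zip(FIELDS, values)))
    if current:
        sentences.append(current)
    return sentences


SAMPLE = (
    "# text = Dogs bark.\n"
    "1\tDogs\tdog\tNOUN\t_\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)

if __name__ == "__main__":
    for token in read_conllu(SAMPLE)[0]:
        print(token["ID"], token["FORM"], token["UPOS"], token["DEPREL"])
```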

HTML documents

We do not accept HTML (.html/.htm) documents as primary data. However, they can be used to document the entry, e.g. to explain the structure of the data or its linguistic annotation. Such HTML documents should be valid according to some version of HTML (preferably XHTML) and self-sufficient, i.e. if CSS is used, it should preferably be embedded in the HTML file(s) or stored together with them.

XML documents

By far the most common format of submissions is XML (.xml), which allows for richly and hierarchically structured text data. CLARIN.IS accepts any valid XML documents, where:

  • the schema that is used to validate a document is well known and publicly available, together with its documentation, from a stable location, e.g. RDF/XML (.rdf) or ELAN (.eaf);
  • or the schema, including its documentation, is a part of the repository entry.

We accept schemas in any XML schema language, i.e. DTD (.dtd), RelaxNG (.rng/.rnc) and W3C XML Schema (.xsd), as well as Schematron (.xml).
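Before validating against a schema, it is worth confirming that each document is at least well-formed XML. The sketch below does this with Python's standard library. It does not perform schema validation, which requires a separate validator (for example a third-party library such as lxml, or a command-line validator); the function name is_well_formed is chosen only for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as XML. This does not check any schema."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print(is_well_formed("<text><p>Halló</p></text>"))  # properly nested
    print(is_well_formed("<text><p>Halló</text>"))      # <p> is never closed
```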

TEI documents

The preferred XML encoding of CLARIN.IS repository entries is TEI (.tei/.xml), i.e. using the Text Encoding Initiative Guidelines for encoding structured language resources, such as language corpora, machine-readable dictionaries, text-critical editions, etc.

When the type of the deposited language resource is covered by one of the standard or best-practice customisations of the TEI, such as ISO 24624:2016 for transcriptions of spoken language, TEI Lex-0 for dictionaries, or Parla-CLARIN for corpora of parliamentary debates, that schema should be used in preference to a bespoke or generic TEI encoding.

If the deposited TEI documents use only standard modules of the TEI, in particular if they can be validated against the CLARIN.SI TEI schema, it is sufficient to deposit the TEI XML files alone. Document encodings that incorporate any extensions to the TEI, however, should always be accompanied by the TEI ODD, the generated XML schemas (in particular RelaxNG .rng and .rnc), and documentation in at least HTML.

Linguistic annotation vocabularies

Most language corpora are annotated on various levels with linguistic categories. These categories must be documented, either at stable external URLs or together with the repository entry, i.e. in included files or, especially with TEI-encoded corpora, as part of the corpus document itself.