CSV on the Web Working Group Charter

The term 'open data' has evolved rapidly since it came into common usage less than a decade ago. From presidential executive orders to pan-national directives; from cultural heritage to scientific data, it's clear that more and more of the data that historically has been hidden away is now coming into the open. There is widespread consensus that this is a good thing.

A large percentage of the data published on the Web is tabular data, commonly published as comma separated values (CSV) files. CSV files may be of a significant size but they can be generated and manipulated easily, and there is a significant body of software available to handle them. Indeed, popular spreadsheet applications (Microsoft Excel, iWork’s Numbers, or OpenOffice.org) as well as numerous other applications can produce and consume these files. However, although these tools make conversion to CSV easy, some publishers resist it because CSV is a much less rich format that cannot express important detail that they want to convey, such as annotations or the meaning of identifier codes. As experience of publishing and using CSV on the Web grows, it is clear that this is one of many issues that need to be addressed.

Existing formats for tabular data are format-oriented and hard to process (e.g. Excel); un-extensible (e.g. CSV/TSV); or they assume the use of particular technologies (e.g. SQL dumps). None of these formats allow developers to pull in multiple data sets, manipulate, visualize and combine them in flexible ways. Other information relevant to these datasets, such as access rights and provenance, is not easy to find. CSV is a very useful and simple format, but to unlock the data and make it portable to environments other than the one in which it was created, there needs to be a means of encoding and associating relevant metadata.

Now is a good time for standardization because there has already been significant experimentation by organizations such as the Open Knowledge Foundation and Google but the relevant parties are not (yet) locked in to particular solutions.

The mission of the CSV on the Web Working Group, part of the Data Activity, is to provide technologies whereby data-dependent applications on the Web can achieve higher interoperability when working with datasets using the CSV (Comma-Separated Values) or similar formats. As well as single CSV files, the group will define mechanisms for interpreting a set of CSVs as relational data. This will include the definition of a vocabulary for describing tables expressed as CSV and locatable on the Web, and the relationships between them.

In this way, it will be possible to see CSVs as 5 star data, that is, data in a non-proprietary format that can be linked to other sources, using the Web as an intelligent data platform rather than as a simple distribution system for files containing inaccessible data.

Join the CSV on the Web Working Group.

End date	31 August 2015
Confidentiality	Proceedings are public
Initial Chairs	Dan Brickley (Google), Jeni Tennison (Open Data Institute)
Initial Team Contacts (FTE %: 20)	Ivan Herman
Usual Meeting Schedule	Teleconferences: weekly; face-to-face: once annually

Scope

Tabular data has many usage scenarios. It is often used directly, without any manipulation of the data. For example, CSV content may be extremely large and its conversion might be unnecessarily time-consuming; or its usage is restricted to the data itself without the necessity to integrate with other datasets (a typical example may be the medical prescription data that a hospital or a national health service publishes on the Web and an application that only publishes the results of an analysis of the data). In other cases, the CSV content must be converted to other data formats to be used with data sources of a different origin, or with the application developer’s specific tools.

Whether converted to other formats or not, there is a need to describe the content of the CSV file: its structure, datatypes used in a specific column, language used for text fields, access rights, provenance, etc. This means that metadata should be available for the dataset, relying on standard vocabulary terms, and giving the necessary information for applications. The metadata can also be used for the conversion of the CSV content to other formats like RDF or JSON; it can enable automated loading of the data as objects; or it can provide additional information that search engines may use to gain a better understanding of the content of the data. It should also be possible to encode this metadata in RDF, thereby providing a “glue” to bind the CSV content to other datasets in different formats and to Linked Data in general. In this way CSV datasets become linkable on the Web via the metadata, even if the content remains in its original tabular format.
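As an illustration of how per-column metadata could enable automated loading of CSV data as typed objects, the following Python sketch applies a small metadata description to a CSV document. The metadata structure, its key names ("language", "columns", "name", "datatype"), and the sample data are hypothetical and chosen only for this example; they are not terms from any vocabulary defined by the Working Group.

```python
import csv
import io
from datetime import date

# Hypothetical metadata: the key names are illustrative assumptions,
# not terms from any standard vocabulary.
METADATA = {
    "language": "en",
    "columns": [
        {"name": "practice", "datatype": "string"},
        {"name": "items", "datatype": "integer"},
        {"name": "cost", "datatype": "number"},
        {"name": "period", "datatype": "date"},
    ],
}

# One converter per (assumed) datatype name.
CONVERTERS = {
    "string": str,
    "integer": int,
    "number": float,
    "date": date.fromisoformat,
}


def load_typed_rows(csv_text, metadata):
    """Parse CSV text and convert each cell using the column metadata."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    for record in reader:
        typed = {}
        for column in metadata["columns"]:
            convert = CONVERTERS[column["datatype"]]
            typed[column["name"]] = convert(record[column["name"]])
        rows.append(typed)
    return rows


# Made-up sample data, loosely modelled on the prescription example above.
SAMPLE = "practice,items,cost,period\nA81001,12,34.5,2013-11-01\n"
rows = load_typed_rows(SAMPLE, METADATA)
```

Without the metadata, every cell would remain an untyped string; with it, an application can distinguish counts, amounts, and dates without guessing.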

While the primary focus of this WG is on associating metadata with CSV files, to the extent possible the WG should consider and favor design choices that make the vocabulary useful with other tabular structured data use cases and in particular HTML tables.

To provide the necessary infrastructure for CSV on the Web, the Working Group will develop the following building blocks:

A metadata vocabulary describing the content of tabular data, like row and column structure, headings, datatypes, etc. A careful consideration of common use cases should generate the set of requirements that the metadata vocabulary should fulfill. The vocabulary should be defined, or should have an encoding, in standard RDF and, wherever possible and appropriate, should refer to, and reuse, existing vocabularies developed elsewhere.
Standard method(s) to find the metadata for a specific dataset. Several approaches should be explored and possibly specified: possibilities include the inclusion of a metadata reference in the HTTP header for the dataset’s URI; the extension of the CSV structure with special rows and/or columns containing the metadata or a URI pointing at the metadata; or the definition of a standard packaging format containing the CSV data as well as its metadata. The Working Group will explore these different possibilities, cross-check them with use cases, and recommend one or more mechanisms.
Standard method(s) to find the data described by the metadata. Closely allied to the previous point, the working group will define a simple link mechanism from the metadata to the data. Metadata may be created alongside the data it describes or completely independently by a third party. Either way, such a link may of course dereference to a static file or it may encode a query for a service available on the Web, such as an OData service or other API. Definition of the structure of those links/queries is out of scope.
Standard mapping mechanisms transforming CSV to other formats (e.g., RDF, JSON, or XML).
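
One of the discovery options listed above is a metadata reference in the HTTP header returned for the dataset's URI. The Python sketch below shows how a client might extract such a reference from a Link header value. It is a simplified illustration only: the choice of the "describedby" link relation, the example URIs, and the function name are assumptions made for this example, and the parsing does not cover every form of Link header permitted by HTTP.

```python
import re
from urllib.parse import urljoin

# Simplified patterns: a link target in angle brackets followed by its parameters.
LINK_PATTERN = re.compile(r'<(?P<target>[^>]*)>(?P<params>[^<]*)')
PARAM_PATTERN = re.compile(r';\s*(?P<key>[\w-]+)\s*=\s*"?(?P<value>[^";,]*)"?')


def find_metadata_uri(data_uri, link_header, rel="describedby"):
    """Return the absolute URI of the first link with the given relation, or None.

    The default relation name is an assumption for this sketch.
    """
    for match in LINK_PATTERN.finditer(link_header):
        params = {
            m.group("key").lower(): m.group("value")
            for m in PARAM_PATTERN.finditer(match.group("params"))
        }
        if rel in params.get("rel", "").split():
            # Resolve a relative target against the URI of the CSV file.
            return urljoin(data_uri, match.group("target"))
    return None


# Hypothetical header value and dataset URI.
HEADER = (
    '<prescriptions-metadata.json>; rel="describedby"; type="application/json", '
    '<https://example.org/licence>; rel="license"'
)
metadata_uri = find_metadata_uri("http://example.org/data/prescriptions.csv", HEADER)
```

The other options mentioned in the list (special rows or columns inside the CSV file, or a packaging format) would need different client logic; this sketch covers the header-based approach only.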

Notes:

This charter refers to “CSV”, i.e., “Comma Separated Values”; a similar, though less frequently used, format is “TSV”, for “Tab Separated Values”. All the specifications developed for CSV can be adapted to TSV easily; the Working Group may decide to take the TSV format into account, too, when formalizing the technologies.

The term “mapping mechanism” does not necessarily mean a physical conversion of the data to another format; it may mean providing, for example, an RDF “view” of the tabular data by means of a specialized server-side application.

The XForms 2.0 draft specification includes the definition of XPath expressions for CSV; the Working Group will consider whether this fulfills the requirements for an XML mapping or whether a separate specification is indeed necessary.

The output of the mapping mechanism for RDF MUST be consistent with either the RDF Direct Mapping or R2RML so that if a table from a relational database is exported as CSV and then mapped it produces semantically identical data.

Although not specified by W3C, the Working Group should also contribute to a newer version of RFC 4180 to ensure better interoperability of CSV content. An example is the current restriction in the RFC to use ASCII only, which is inappropriate from the point of view of proper internationalization on the Web and is not in line with the practice of CSV deployment.
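
The ASCII restriction mentioned in the last note can be illustrated with a short Python sketch. It assumes a made-up CSV file containing non-ASCII text and encoded as UTF-8; the sample content and the helper name are hypothetical. A reader that insists on ASCII fails on such content, whereas a reader that decodes it as UTF-8 succeeds.

```python
import csv
import io

# Bytes of a small, made-up CSV file containing non-ASCII text, encoded as UTF-8.
RAW = "city,country\nKøbenhavn,Danmark\n東京,日本\n".encode("utf-8")


def read_csv_bytes(raw, encoding):
    """Decode raw bytes with the given encoding and parse them as CSV."""
    text = raw.decode(encoding)
    return list(csv.reader(io.StringIO(text, newline="")))


# An ASCII-only reader cannot process this file at all.
try:
    read_csv_bytes(RAW, "ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Decoding as UTF-8 yields the header row and both data rows.
rows = read_csv_bytes(RAW, "utf-8")
```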

Input documents

The following documents may be considered by this Working Group:

Non W3C documents

Linked CSV, Unofficial Draft, ODI (2013), J. Tennison, (ed.)
Common Format and MIME Type for Comma-Separated Values (CSV) Files, RFC 4180, The Internet Society (2005), Y. Shafranovich, (ed.)
A JSON Media Type for Describing the Structure and Meaning of JSON Documents, Draft, The Internet Society (2011), K. Zyp, G. Court, (eds.)
URI Fragment Identifiers for the text/csv Media Type, Draft, The Internet Society (2013), M. Hausenblas, E. Wilde, and J. Tennison (eds.)
Simple Data Format (SDF), Draft, OKFN Labs, (2013), R. Pollock (ed.)
Dataset Publishing Language, Draft, Google Developers, (2013), O. Benjelloun (ed.)
CSV Meta Description, Draft, Talis (2011), A. Dix (ed.)
CSV2RDF Application (Ivan Ermilov, Sören Auer, Claus Stadler)
TARQL Mapping Language (Richard Cyganiak)
GREL Functions
OData Version 4.0 Part 1: Protocol (Michael Pizzo, Ralf Handl, Martin Zurmuehl)

W3C documents

PROV-AQ: Provenance Access and Query, W3C Working Group Note, (2013), G. Klyne and P. Groth (eds.)
The RDF Data Cube Vocabulary, W3C Draft, (2013), R. Cyganiak and D. Reynolds (eds.)
SPARQL 1.1 Query Results CSV and TSV Formats, W3C Recommendation, (2013), A. Seaborne (ed.)
R2RML RDB to RDF Mapping Language, W3C Recommendation, (2012), S. Das, S. Sundara, and R. Cyganiak (eds.)
A Direct Mapping of Relational Data to RDF, W3C Recommendation, (2012), M. Arenas, A. Bertails, E. Prud’hommeaux, and J. Sequeda (eds.)
Indexed Database API, W3C Candidate Recommendation, (2013), N. Mehta, J. Sicking, E. Graff, A. Popescu, J. Orlow, J.

Success Criteria

To advance to Proposed Recommendation, the metadata vocabulary should be shown to be used by at least two independent applications; “use” may also mean that the vocabulary is extended by other, more specialized vocabularies.

To advance to Proposed Recommendation, each conversion specification is expected to have at least two independent implementations of each feature defined in the specification. Similarly, each mechanism for the discovery of metadata should be implemented by at least two independent implementations.

During its lifetime the group has to produce enough outreach material so that the community at large is actively encouraged to use the technologies defined by the Working Group. It is expected that implementations will include parsers/importers for NoSQL databases such as MongoDB, as well as SQL databases, and triplestores. Exporters and converters should also be implemented from those same databases and from spreadsheet software. Validators, visualizers and other generic tools are also foreseen since these tools are at the heart of the open data movement.

Out of Scope

Methodologies for the discovery of (CSV) data on the Web in general. While a major challenge, this group should concentrate only on how to find metadata associated with a specific CSV file (e.g., if its URI is known), and not discovery in general.

Deliverables

The titles of the deliverables are not final; the Working Group will have to decide on the final titles as well as the structures of the documents. The Working Group may also decide to merge some deliverables into one document or produce several documents that together constitute one of the deliverables.

CSV Data on the Web: Use Cases and Requirements (Working Group Note)
Metadata vocabulary for CSV data (Recommendation)
Access methods for CSV Metadata (Recommendation)
Mapping mechanisms for transforming CSV into various formats (e.g., RDF, JSON, or XML) (Recommendation)

Other Deliverables

Contribution to the future version of RFC 4180
Test suites for the different mappings
Outreach materials (in the form of presentation slides, possible guidelines, relevant wiki pages, etc.)

Milestones

Specification	FPWD	LC	CR	PR	Rec/Note
CSV Data on the Web (UCR) (Note)	March 2014	n/a	n/a	n/a	November 2014 (Note)
Metadata Vocabulary for CSV Data (Recommendation)	April 2014	November 2014	February 2015	April 2015	June 2015 (Rec.)
Access Methods for CSV Metadata (Recommendation)	April 2014	November 2014	February 2015	April 2015	June 2015 (Rec.)
Mapping of CSV to Other formats (Recommendation)	April 2014	November 2014	February 2015	April 2015	June 2015 (Rec.)

Note: The group will document significant changes from this initial schedule on the group home page.

Timeline View Summary

December 2013: First teleconference
February 2014: First face-to-face meeting
March 2014: FPWD for the Recommendation track documents
November 2014: Second face-to-face meeting
November 2014: LCWD for the Recommendation track documents
February 2015: CR for the Recommendation track documents
June 2015: REC for the Recommendation track documents

Dependencies and Liaisons

W3C Groups

Forms Working Group: Coordinate on the possible usage of CSV XPath expressions
Internationalization Activity: Ensure that multilinguality concerns are properly reflected in the CSV metadata work
Privacy Interest Group: Ensure that the privacy concerns are properly included in the CSV metadata either via reference or via direct vocabulary elements
Semantic Web Coordination Group (Member only; note that the SWCG is likely to be replaced by a Data Activity Coordination Group): Ensure that the metadata vocabulary development follows the guidelines of vocabulary specifications in general; also ensure that the resulting specifications work well with other forms of data on the Web, e.g., Linked Data
XML Activity: Ensure that the CSV to XML mapping (if specified) is in line with the general requirements of the XML activity and future developments

Furthermore, CSV on the Web Working Group expects to follow these W3C Recommendations:

External Groups

Internet Engineering Task Force (IETF): Contributing to a new version of RFC 4180 and to the ongoing work on defining URI Fragments for CSV content.
Open Knowledge Foundation (OKFN): OKFN is a leading source of information and experience in using data on the Web, including data in CSV formats; the use cases and technologies defined by this Working Group may draw on OKFN’s experience and community.

Participation

To be successful, the CSV on the Web Working Group is expected to have 10 or more active participants for its duration. To get the most out of this work, participants should expect to devote several hours a week; for budgeting purposes, we recommend at least half a day a week. For chairs and document editors the commitment will be higher, say, 1-2 days a week. Participants who follow the work less closely should be aware that if they miss decisions through inattention further discussion of those issues may be ruled out of order. However, most participants follow some areas of discussion more closely than others, and the time needed to stay in good standing therefore varies from week to week. The Working Group will also allocate the necessary resources for building Test Suites for each specification.

Communication

This group primarily conducts its work on the public mailing list public-csv-wg@w3.org (archives). Administrative tasks may be conducted in Member-only communications.

Information about the group (deliverables, participants, face-to-face meetings, teleconferences, etc.) is available from the CSV on the Web Working Group home page.

Decision Policy

As explained in the Process Document (section 3.3), this group will seek to make decisions when there is consensus. When the Chair puts a question and observes dissent, after due consideration of different opinions, the Chair should record a decision (possibly after a formal vote) and any objections, and move on.

A formal vote should allow for remote asynchronous participation—using, for example, email and/or web-based survey techniques. Any resolution taken in a face-to-face meeting or teleconference is to be considered provisional until 5 working days after the publication of the resolution in draft minutes sent to the group's mailing list.

This charter is written in accordance with Section 3.4, Votes of the W3C Process Document and includes no voting procedures beyond what the Process Document requires.

Patent Policy

This Working Group operates under the W3C Patent Policy (5 February 2004 Version). To promote the widest adoption of Web standards, W3C seeks to issue Recommendations that can be implemented, according to this policy, on a Royalty-Free basis.

For more information about disclosure obligations for this group, please see the W3C Patent Policy Implementation.

About this Charter

This charter for the CSV on the Web Working Group has been created according to section 6.2 of the Process Document. In the event of a conflict between this document or the provisions of any charter and the W3C Process, the W3C Process shall take precedence.

Ivan Herman, Phil Archer

$Date: 2013-12-11 08:31:51 $