The term 'open data' has evolved rapidly since it came into common usage less than a decade ago. From presidential executive orders to pan-national directives; from cultural heritage to scientific data, it's clear that more and more of the data that historically has been hidden away is now coming into the open. There is widespread consensus that this is a good thing.
A large percentage of the data published on the Web is tabular data, commonly published as comma separated values (CSV) files. CSV files may be of a significant size but they can be generated and manipulated easily, and there is a significant body of software available to handle them. Indeed, popular spreadsheet applications (Microsoft Excel, iWork’s Numbers, or OpenOffice.org) as well as numerous other applications can produce and consume these files. However, although these tools make conversion to CSV easy, some publishers resist it because CSV is a much less rich format that can't express important detail that the publishers want to express, such as annotations or the meaning of identifier codes. As experience of publishing and using CSV on the Web grows, it's clear that this is one of many issues that need to be addressed.
Existing formats for tabular data are format-oriented and hard to process (e.g. Excel); un-extensible (e.g. CSV/TSV); or they assume the use of particular technologies (e.g. SQL dumps). None of these formats allow developers to pull in multiple data sets, manipulate, visualize and combine them in flexible ways. Other information relevant to these datasets, such as access rights and provenance, is not easy to find. CSV is a very useful and simple format, but to unlock the data and make it portable to environments other than the one in which it was created, there needs to be a means of encoding and associating relevant metadata.
Now is a good time for standardization because there has already been significant experimentation by organizations such as the Open Knowledge Foundation and Google but the relevant parties are not (yet) locked in to particular solutions.
The mission of the CSV on the Web Working Group, part of the Data Activity, is to provide technologies whereby data-dependent applications on the Web can provide higher interoperability when working with datasets using the CSV (Comma-Separated Values) or similar formats. As well as single CSV files, the group will define mechanisms for interpreting a set of CSVs as relational data. This will include the definition of a vocabulary for describing tables expressed as CSV and locatable on the Web, and the relationships between them.
In this way, it will be possible to see CSVs as 5 star data - data in a non-proprietary format that can be linked to other sources - using the Web as an intelligent data platform rather than as a simple distribution system for files containing inaccessible data.
End date | 31 August 2015 |
---|---|
Confidentiality | Proceedings are public |
Initial Chairs | Dan Brickley (Google), Jeni Tennison (Open Data Institute) |
Initial Team Contacts (FTE %: 20) | Ivan Herman |
Usual Meeting Schedule | Teleconferences: Weekly; Face-to-face: Once Annually |
Tabular data has many usage scenarios. It is often used directly, without any manipulation of the data. For example, CSV content may be extremely large and its conversion might be unnecessarily time-consuming; or its usage is restricted to the data itself without the necessity to integrate with other datasets (a typical example may be the medical prescription data that a hospital or a national health service publishes on the Web and an application that only publishes the results of an analysis of the data). In other cases, the CSV content must be converted to other data formats to be used with data sources of a different origin, or with the application developer’s specific tools.
Whether converted to other formats or not, there is a need to describe the content of the CSV file: its structure, datatypes used in a specific column, language used for text fields, access rights, provenance, etc. This means that metadata should be available for the dataset, relying on standard vocabulary terms, and giving the necessary information for applications. The metadata can also be used for the conversion of the CSV content to other formats like RDF or JSON; it can enable automated loading of the data as objects; or it can provide additional information that search engines may use to gain a better understanding of the content of the data. It should also be possible to encode this metadata in RDF, thereby providing a “glue” to bind the CSV content to other datasets in different formats and to Linked Data in general. In this way CSV datasets become linkable on the Web via the metadata, even if the content remains in its original tabular format.
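To make the idea of metadata-driven processing concrete, the following Python sketch loads CSV rows as typed objects and serializes them as JSON. It is only an illustration under stated assumptions: the metadata keys (`language`, `columns`, `name`, `datatype`), the sample data, and the helper `rows_as_objects` are all hypothetical and are not a vocabulary or API defined by the Working Group.

```python
import csv
import io
import json

# Hypothetical metadata: these keys are illustrative only and are
# not a vocabulary defined by the Working Group.
METADATA = {
    "language": "en",
    "columns": [
        {"name": "practice", "datatype": "string"},
        {"name": "items", "datatype": "integer"},
        {"name": "cost", "datatype": "number"},
    ],
}

# Map the (assumed) datatype names to Python conversion functions.
CASTS = {"string": str, "integer": int, "number": float}


def rows_as_objects(csv_text, metadata):
    """Yield one typed dict per CSV data row, driven by the column metadata."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row; names are taken from the metadata
    for row in reader:
        yield {
            col["name"]: CASTS[col["datatype"]](cell)
            for col, cell in zip(metadata["columns"], row)
        }


# Invented sample data in the spirit of the prescription example above.
SAMPLE = "practice,items,cost\nA81001,12,34.5\nA81002,3,7.25\n"

objects = list(rows_as_objects(SAMPLE, METADATA))
print(json.dumps(objects, indent=2))
```

The CSV text itself is left untouched; only the separate metadata tells the application how to interpret each column, which is the separation of concerns the paragraph above describes.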
While the primary focus of this WG is on associating metadata with CSV files, to the extent possible the WG should consider and favor design choices that make the vocabulary useful with other tabular structured data use cases and in particular HTML tables.
To provide the necessary infrastructure for CSV on the Web, the Working Group will develop the following building blocks:
Notes:
This charter refers to “CSV”, i.e., “Comma Separated Values”; a similar, though less frequently used format is “TSV”, for “Tab Separated Values”. All the specifications developed for CSV can be adapted to TSV easily; the Working Group may decide to take the TSV format into account, too, when formalizing the technologies.
The term “mapping mechanism” does not necessarily mean a physical conversion of the data to another format; it may mean providing, for example, an RDF “view” of the tabular data through a specialized server-side application.
The XForms 2.0 draft specification includes the definition of XPath expressions for CSV; the Working Group will consider whether this fulfills the requirements for an XML mapping or whether a separate specification is indeed necessary.
The output of the mapping mechanism for RDF MUST be consistent with either the RDF Direct Mapping or R2RML so that if a table from a relational database is exported as CSV and then mapped it produces semantically identical data.
Although not specified by W3C, the Working Group should also contribute to a newer version of RFC 4180 to ensure a better interoperability of CSV content. An example is the current restriction in the RFC to use ASCII only, which is inappropriate from the point of view of proper Internationalization on the Web and is not in line with the practice of CSV deployment.
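As a small illustration of the internationalization concern in the last note, the Python sketch below writes CSV rows containing non-ASCII text, round-trips them through UTF-8, and shows that the same content cannot be encoded as ASCII. The sample rows are invented, and Python's standard `csv` module is used only as a stand-in for any CSV tool; nothing here is prescribed by the Working Group or by RFC 4180.

```python
import csv
import io

# Invented sample rows containing non-ASCII characters.
rows = [["city", "country"], ["Zürich", "Schweiz"], ["東京", "日本"]]

# Serialize to CSV text (the csv module emits CRLF line endings by default).
buf = io.StringIO(newline="")
csv.writer(buf).writerows(rows)
text = buf.getvalue()

# Round-trip through UTF-8 bytes, as a file published on the Web might be.
utf8_bytes = text.encode("utf-8")
decoded = list(csv.reader(io.StringIO(utf8_bytes.decode("utf-8"), newline="")))

# The same content cannot be represented in ASCII.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(decoded == rows)  # the UTF-8 round trip preserves every cell
print(ascii_ok)         # False: ASCII cannot carry this data
```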
The following documents may be considered by this Working Group:
To advance to Proposed Recommendation, the metadata vocabulary should be shown to be used by at least two independent applications; “use” may also mean that the vocabulary has been extended into other, more specialized vocabularies.
To advance to Proposed Recommendation, each conversion specification is expected to have at least two independent implementations of each feature defined in the specification. Similarly, each mechanism for the discovery of metadata should be implemented by at least two independent implementations.
During its lifetime the group has to produce enough outreach material so that the community at large is actively encouraged to use the technologies defined by the Working Group. It is expected that implementations will include parsers/importers for NoSQL databases such as MongoDB, as well as SQL databases and triplestores. Exporters and converters should also be implemented from those same databases and from spreadsheet software. Validators, visualizers and other generic tooling are also foreseen since these tools are at the heart of the open data movement.
Methodologies for the discovery of (CSV) data on the Web in general. While a major challenge, this group should concentrate only on how to find metadata associated with a specific CSV file (e.g., if its URI is known), and not discovery in general.
The titles of the deliverables are not final; the Working Group will have to decide on the final titles as well as the structures of the documents. The Working Group may also decide to merge some deliverables into one document or produce several documents that together constitute one of the deliverables.
Specification | FPWD | LC | CR | PR | Rec/Note |
---|---|---|---|---|---|
CSV Data on the Web (UCR) (Note) | March 2014 | n/a | n/a | n/a | November 2014 (Note) |
Metadata Vocabulary for CSV Data (Recommendation) | April 2014 | November 2014 | February 2015 | April 2015 | June 2015 (Rec.) |
Access Methods for CSV Metadata (Recommendation) | April 2014 | November 2014 | February 2015 | April 2015 | June 2015 (Rec.) |
Mapping of CSV to Other formats (Recommendation) | April 2014 | November 2014 | February 2015 | April 2015 | June 2015 (Rec.) |

Note: The group will document significant changes from this initial schedule on the group home page.
Furthermore, the CSV on the Web Working Group expects to follow these W3C Recommendations:
To be successful, the CSV on the Web Working Group is expected to have 10 or more active participants for its duration. To get the most out of this work, participants should expect to devote several hours a week; for budgeting purposes, we recommend at least half a day a week. For chairs and document editors the commitment will be higher, say, 1-2 days a week. Participants who follow the work less closely should be aware that if they miss decisions through inattention further discussion of those issues may be ruled out of order. However, most participants follow some areas of discussion more closely than others, and the time needed to stay in good standing therefore varies from week to week. The Working Group will also allocate the necessary resources for building Test Suites for each specification.
This group primarily conducts its work on the public mailing list public-csv-wg@w3.org (archives). Administrative tasks may be conducted in Member-only communications.
Information about the group (deliverables, participants, face-to-face meetings, teleconferences, etc.) is available from the CSV on the Web Working Group home page.
As explained in the Process Document (section 3.3), this group will seek to make decisions when there is consensus. When the Chair puts a question and observes dissent, after due consideration of different opinions, the Chair should record a decision (possibly after a formal vote) and any objections, and move on.
A formal vote should allow for remote asynchronous participation, using, for example, email and/or web-based survey techniques. Any resolution taken in a face-to-face meeting or teleconference is to be considered provisional until 5 working days after the publication of the resolution in draft minutes sent to the group's mailing list.
This charter is written in accordance with Section 3.4, Votes of the W3C Process Document and includes no voting procedures beyond what the Process Document requires.
This Working Group operates under the W3C Patent Policy (5 February 2004 Version). To promote the widest adoption of Web standards, W3C seeks to issue Recommendations that can be implemented, according to this policy, on a Royalty-Free basis.
For more information about disclosure obligations for this group, please see the W3C Patent Policy Implementation.
This charter for the CSV on the Web Working Group has been created according to section 6.2 of the Process Document. In the event of a conflict between this document or the provisions of any charter and the W3C Process, the W3C Process shall take precedence.
Copyright © 2013 W3C ® (MIT, ERCIM, Keio, Beihang), All Rights Reserved.
$Date: 2013-12-11 08:31:51 $