Indicator Description

Available to anyone online

Users can access the dataset online without submitting requests or being required to register or identify themselves.

Data is meaningfully open if it is available on the internet to the widest range of users, and therefore this indicator is the foundation for the others. Typical barriers to availability are: 1) requirements to submit requests, and 2) mandatory registration.

Under ‘Freedom of Information’ legislation in most jurisdictions, information is disclosed only when a user requests it. However, many open data standards and assessment tools advocate that authorities should proactively disclose data, instead of waiting for requests from the public.

Another barrier is the requirement for registration, commonly encountered when authorities ask individuals to identify themselves in order to obtain the data. Such requirements may dissuade some individuals from using open data for fear that their activity could be monitored or that they may be subject to reprisals.

Access without registration is also known as the “non-discriminatory” principle, because it ensures that anyone can use the data without discrimination based on ethnicity, nationality, profession, and so on.

Free of Charge

The dataset is available free of charge.

Access to government information sometimes comes with a charge, which deters the public from using it. Governments justify such fees on various grounds; a 2003 European Union directive, for example, allows charging for the “cost of collection, production, reproduction and dissemination”. However, the Ten Principles for Opening Up Government Information (2010), launched by the Sunlight Foundation, argued that “the existence of user fees has little to no effect on whether the government gathers the data in the first place.” A 2017 study commissioned by the EU to review the directive likewise suggested that “the trend towards zero charges should be strengthened.”

Bulk access/API

The dataset is downloadable in bulk and provided with an Application Programming Interface (API) when applicable.

Providing bulk access and an API are two common methods to open up datasets for the public, though their advantages and disadvantages depend on the circumstances.

Bulk access refers to putting all of the data into a file or a set of files, so that the entire dataset can be acquired with a few simple downloads. Compared with querying a database in a language such as SQL or accessing it through an API, bulk access is easier to use and carries fewer restrictions: no programming is required.
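As a minimal sketch (the portal URL and filename here are hypothetical), bulk access reduces to a single download, whether performed in a browser or scripted:

    # A minimal sketch of bulk access; the portal URL is hypothetical.
    # The entire dataset arrives as one file, with no query language involved.
    import urllib.request

    BULK_URL = "https://data.example.gov/datasets/air-quality/all.csv"  # hypothetical

    urllib.request.urlretrieve(BULK_URL, "air_quality_all.csv")
    print("Downloaded the full dataset to air_quality_all.csv")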

An API is a technical agreement that allows two applications to exchange information in a defined way. For open data, an API allows users to retrieve a specific slice of the data by placing queries, and makes it possible for programmers to automate the data access process. When data changes in real time, as with weather temperatures and traffic, the automation potential of an API makes it a better release channel than bulk files.
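As a minimal sketch of API access, assuming a hypothetical JSON endpoint and query parameters (real portals define their own), a program requests only the slice it needs:

    import json
    import urllib.parse
    import urllib.request

    API_URL = "https://data.example.gov/api/v1/weather"  # hypothetical endpoint
    params = urllib.parse.urlencode({"station": "central", "hours": 24})

    # Only the requested slice of the dataset is returned, so the call can
    # be repeated automatically as the underlying data changes.
    with urllib.request.urlopen(f"{API_URL}?{params}") as response:
        readings = json.load(response)

    for r in readings:
        print(r["timestamp"], r["temperature"])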

Open license

The dataset is released under an open license which is presented in an explicit manner.

An open license is one which grants permission to access, reuse, and redistribute a work with few or no restrictions. Almost every established open data standard advocates that public data be given an open license; some initiatives even stipulate that the authorisation conditions should be clearly evident.

One of the most popular licensing systems in the world is Creative Commons (CC), featuring a set of visualised labels that help users quickly understand their rights to use the licensed works.

Machine-readable

The dataset is provided in machine-readable formats and organised in a structured or standardised manner.

Machine-readability has become inseparable from any discussion of open data. ‘Machine’ refers to the computer, and the property is also called ‘machine-processability’.

A machine-readable document must fulfil two criteria. First, the document format must be one that is ‘readable’ by a computer. Image formats such as jpg and gif, or scanned copies in pdf format, do not meet this criterion. Second, the data in the document must be structured or standardised. XML is a typical format for machine-readable documents, but simply transforming information from pdf to xml does not necessarily make it easier for a computer to analyse. Instead, a matrix of numbers with clearly defined column and row titles is more meaningful to a machine. A standardised data format is a series of guidelines that define the way in which data should be collected or recorded, supporting compatibility and interoperability between datasets.
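A minimal sketch of the difference (the figures are invented for illustration): the same numbers expressed as free text cannot be processed directly, while a small matrix with defined column titles can be:

    import csv
    import io

    # The same figures as free text: a computer cannot reliably extract them.
    free_text = "Spending rose to 1.2 million in 2023, up from 0.9 million in 2022."

    # As a structured matrix with column titles, they are directly usable.
    structured = io.StringIO("year,spending_millions\n2022,0.9\n2023,1.2\n")
    for row in csv.DictReader(structured):
        print(row["year"], float(row["spending_millions"]))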

Open format

The dataset is provided in an open format, which can be processed with at least one non-proprietary application.

According to the 'Open Definition' of the Open Knowledge Foundation, an ‘open format’ is “one which places no restrictions, monetary or otherwise, upon its use and can be fully processed with at least one free/libre/open-source software tool.” At the core of the definition is that no one should own exclusive rights to the format, and so it is also called ‘non-proprietary’.

Typical proprietary formats are those developed for commercial software, for example xls and doc by Microsoft, and pdf by Adobe. Proprietary formats shut out people who cannot afford the corresponding software. To conform to the open format principle, xls can be substituted with csv, while doc and pdf can be substituted with odf or xml.
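As a minimal sketch (the data is invented), csv can be produced and consumed with free tools alone, here Python’s standard library, with no licensed software involved:

    import csv

    rows = [["station", "pm2_5"], ["central", 12.4], ["harbour", 18.1]]

    # Writing an open format requires nothing proprietary; any free tool
    # can read the result back.
    with open("readings.csv", "w", newline="") as f:
        csv.writer(f).writerows(rows)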

Primary

The dataset is released at the finest possible level of granularity available, not in aggregate or modified forms.

This indicator states that data should be collected from the primary source and published in its original, unmodified form without aggregation. A critical value of open data is to increase government transparency and hold governments accountable, so it must allow the public to carry out analyses based on raw data rather than second-hand information processed or screened by governments. Except where privacy or security concerns require it, data should not be aggregated.

In circumstances where aggregation is inevitable (e.g., the census), the data should be disaggregated to the lowest level possible, for example by gender, age, or income.
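A minimal sketch of why granularity matters (the records are invented): any aggregate can be derived from raw records, but raw records cannot be recovered from an aggregate:

    from collections import Counter

    raw_records = [  # one row per permit application (hypothetical)
        {"district": "north", "outcome": "approved"},
        {"district": "north", "outcome": "rejected"},
        {"district": "south", "outcome": "approved"},
    ]

    # The public can build whatever aggregate they need from the raw data...
    print(Counter(r["outcome"] for r in raw_records))
    # ...but a published total of "2 approved, 1 rejected" cannot be broken
    # back down by district.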

Timely

The data should be released as soon as possible after collection, and updated in a timely manner whenever there are changes.

The principle of timeliness is two-layered: 1) data should be released as soon as it is gathered; and 2) the dataset should be regularly updated. The purpose is to preserve the value of data, as elaborated in the International Open Data Charter (2015): “Effective and timely access to data helps individuals and organizations develop new insights and innovative ideas that can generate social and economic benefits, improving the lives of people around the world.”

Timeliness depends on the nature of the data. For example, public transport data should ideally be available in real time, economic performance figures are announced monthly, whereas the census is conducted every few years.
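A minimal sketch of a freshness check under such expectations (the timestamp and cadence are hypothetical metadata values):

    from datetime import datetime, timedelta, timezone

    last_updated = datetime(2024, 1, 3, tzinfo=timezone.utc)  # from dataset metadata
    expected_cadence = timedelta(days=31)                     # e.g. monthly statistics

    if datetime.now(timezone.utc) - last_updated > expected_cadence:
        print("Dataset is stale: an update is overdue.")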

Metadata

The dataset is provided with core metadata and with accompanying documentation describing the context.

Metadata was originally used in library catalogues to enable users to find books. In the context of open data, ‘metadata’ provides information that defines and explains a dataset, so that users can easily find a specific category of data by searching on the internet or within a data portal.

A typical form is core metadata, which provides fundamental information about the dataset, including its title, source, publication date, and format, as well as other elements that can reveal the meaning of the data.
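A minimal sketch of a core metadata record (the field names loosely follow common catalogue vocabularies such as DCAT, and the values are invented):

    core_metadata = {
        "title": "Monthly Air Quality Readings",
        "source": "Environmental Protection Department",
        "publication_date": "2024-01-15",
        "format": "csv",
        "description": "Monthly files of hourly PM2.5 readings by station.",
    }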

Some open data standards also advocate that datasets be accompanied by documentation providing context, so that users can understand the background, analytical limitations, and security requirements of the dataset, as well as how to process the data.

Permanent

The published dataset is stored at a stable online location as a historical archive.

This indicator shows whether the published datasets are archived on the internet.

‘Permanent’ has three layers of meanings: 1) retaining copies of all published datasets available online; 2) stable formats with version-tracking; and 3) stable online locations.

Permanent availability and stable formats enable comparative analysis over time. Permanent web addresses help the public share documents with others by pointing directly to the source rather than providing instructions on how to find it.
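A minimal sketch of the archiving practice this implies (the paths are hypothetical, and readings.csv is assumed to be a previously published release): keep every snapshot at a date-stamped location and record a checksum so versions can be tracked:

    import hashlib
    import os
    import shutil
    from datetime import date

    os.makedirs("archive", exist_ok=True)
    snapshot = os.path.join("archive", f"air_quality_{date.today().isoformat()}.csv")

    # Retain a copy of each published release at a stable, date-stamped path.
    shutil.copyfile("readings.csv", snapshot)

    # A checksum recorded per snapshot supports version tracking over time.
    with open(snapshot, "rb") as f:
        print(snapshot, hashlib.sha256(f.read()).hexdigest())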

Identifier

The dataset is provided with a Uniform Resource Identifier (URI) to denote its key elements.

Tim Berners-Lee, inventor of the World Wide Web, coined the concept of ‘linked data’ in 2006, advocating the use of URIs to identify things and to link them to each other for automated information sharing.

A Uniform Resource Identifier is a string of characters that identifies a resource, similar to an identity card number or a car registration plate. Identical identifiers appearing in different datasets can be taken to refer to the same entity, and the datasets thereby become interrelated. For datasets that do not involve any privacy concerns (such as trees, rivers, and streets), open identifiers can assist data users in conducting research and analysis.
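A minimal sketch of such linking (the URIs and figures are invented): because two datasets use the same URI for the same street, their records can be joined automatically:

    # Two independent datasets keyed by the same hypothetical street URI.
    trees = {"https://data.example.gov/id/street/42": {"tree_count": 18}}
    noise = {"https://data.example.gov/id/street/42": {"avg_noise_db": 61.5}}

    # A shared URI means the records describe the same real-world street.
    for uri in trees.keys() & noise.keys():
        print(uri, {**trees[uri], **noise[uri]})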

Human-readable

The data is written in plain and clear language that can be understood by the general public.

Human-readability is a newer principle.

The International Open Data Charter (2015) placed equal emphasis on human-readability and machine-readability. This indicator aims to ensure that anyone can access and use open data, regardless of their programming skills. Furthermore, both the information in the data and the dataset’s accompanying documentation should be written in plain and clear language.
