Difference between revisions of "RSF Comprehensive Description"

From Madagascar

Latest revision as of 10:07, 25 February 2010

Introduction

The Regularly Sampled Format (RSF) is a specific arrangement of information defined by the behavior of the implementation of the C API of Madagascar. Most Madagascar programs expect their input in RSF and structure their output in it. While an intuitive introduction to RSF already exists, this document attempts to describe RSF exhaustively, for programmers who need to interface Madagascar with other packages.

RSF originated as a representation of a discrete set of values of a single-valued function defined on an n-dimensional space. A real-world example is the set of pressure values in a space-time volume through which an acoustic wavefield propagates. Each dimension of the space is either discrete and regular, or continuous but sampled discretely and regularly. In this context, a set of reals or integers is regular if its members are consecutive integer multiples of one fixed, finite quantity of the same kind. RSF can be visualized as "matrices with physical dimensions".

The physicality of dimensions, while useful for jargon, for explanations and for the definition of parameters below, is not compulsory. Ultimately RSF is just a sane way of storing n-dimensional arrays on disk. Just as programming languages use intrinsic methods and user-defined procedures to work with arrays held in memory, Madagascar uses both the programs in its main distribution and user-written programs to work with data stored in RSF. RSF datasets are just out-of-core arrays.

Encoded information

An RSF dataset consists of: (1) the sequence of numerical values in the array and (2) information about this "out-of-core array" (metainformation, i.e. data about data). This section defines what information an RSF dataset contains, without concerning itself with how that information is encoded; the encoding is covered in the sections that follow.

A discrete sequence of numbers is quite simple and needs no further explanation. The metainformation requires a more detailed definition. Its distinct components will be assigned names in italics, to help define the encoding later.

To the extent that is technically practical, each program that acted on the array will record:

  • Name of program (prog)
  • Directory in which the program was run (dir)
  • User that ran the program (user)
  • Short hostname of the machine on which the program was run (host). For example, this would be machine instead of machine.university.edu
  • Date and time (up to seconds) at which the program was started (datetime)
  • A pointer to the binary data (pointer)
  • Data type, i.e. integer, real or complex (type)
  • Data encoding, i.e. name of protocol for representing the data (format)
  • Size of each data sample in bytes (esize)
  • For each dimension # in the dataset (1 <= # <= 9), specify:
    • Number of elements in that dimension: n#, i.e.: n1, n2, n3... (nelem_axis#)
    • Origin on that axis: o#, i.e.: o1, o2, o3... (orig_axis#)
    • Sampling interval on that axis: d#, i.e.: d1, d2, d3... (sampl_axis#)
    • Label for that axis: label#, i.e.: label1, label2, label3... (label_axis#)
    • Physical unit for that axis: unit#, i.e.: unit1, unit2, unit3... (unit_axis#)
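
As a hedged illustration of what the per-axis parameters describe, the sketch below assumes that sample i on an axis lies at coordinate o# + i*d#, which is the usual reading of an origin and a sampling interval. The function name axis_coordinates is hypothetical and is not part of Madagascar.

```python
def axis_coordinates(n, o, d):
    """Return the coordinates of the n regularly spaced samples on one axis.

    Assumption of this sketch: sample i is located at o + i*d.
    """
    if n <= 0:
        raise ValueError("axis length must be greater than zero")
    return [o + i * d for i in range(n)]


# Hypothetical time axis: 5 samples, origin 0.0, sampling interval 0.5
print(axis_coordinates(5, 0.0, 0.5))  # [0.0, 0.5, 1.0, 1.5, 2.0]
```

The coordinates are never stored in the dataset; only n#, o# and d# are, which is what makes the sampling "regular".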

RSF components

The information to be encoded in an RSF dataset, described in the previous section, consists of two distinct parts: the metainformation and the data sequence. Depending on the context, the two parts can be in a single file, in a stream following each other, or in two separate files. So, depending on the context, RSF can be a file format, a protocol or a metaformat, hence the preference for the weaselly term "dataset". This section defines the context-independent content of each of the two parts, and the sections that follow show how they come together in various contexts.

ASCII header

The previous section mentioned that each program will record metainformation "to the extent that is technically practical". The current interpretation of this statement is that in a linear workflow (e.g. prog1 | prog2 | prog3...) all of the programs record their metainformation. However, in a merged workflow (e.g. prog1 > file1.rsf; prog2 > file2.rsf; prog3 file1 file2 >file3.rsf ) only the metainformation entries from one file (usually the one going into stdin) are kept.

Using the nomenclature described in the previous section, a metainformation entry made by a program is formatted as follows:

prog dir: user@host datetime

        in="pointer"
        data_format="format_type"
        esize=esize
        n1=nelem_axis1
        o1=orig_axis1
        d1=sampl_axis1
        unit1="unit_axis1"
        label1="label_axis1"

Datasets of dimensions higher than 1 will have values for parameters describing the other axes such as n2, n3, etc.

The header must contain only ASCII printable characters.

All entries must be of the form: key=val, with no spaces adjacent to the equal sign. Lines that

The mandatory fields are in, data_format, esize and n1. These are the minimum necessary for a sequence of data values to be read. The first line of the recording (prog dir: user@host datetime) is written by all Madagascar programs, but it is currently not used in any way, so a header can be read without it. Indentation of the fields after the first line and the presence of the blank lines are not mandatory either.

The entries after the first line may appear in any order.

The pointer can be either a path to a separate file holding the data and only the data, or the string stdin (when the RSF dataset is contained in a single file or transmitted over a stream).

The valid values of format are native, xdr and ascii, and the valid values of type are short, int, float, double, complex, uchar and byte. These correspond to the similarly-named C language data types. The two are joined by an underscore to form the value of data_format, as in native_float, which is the data_format with the best support and the most optimization throughout Madagascar.

Given the encodings currently implemented in Madagascar, esize can be 1, 2, 4 or 8.

Axis length values (n1, n2, ... n9) must be greater than zero. Axis lengths for dimensions higher than the intrinsic dimensionality of the data may be left unspecified. For example, for a 1000x20x500 data cube, n4, n5, ... n9 do not need to be given any values. The higher-dimension axis lengths may also be set explicitly to 1. Gaps in the axis length sequence are not allowed: for example, if n4 is given a value greater than one, then n1, n2 and n3 must be specified, even if their values are set to one.
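
The rules above can be turned into a small consistency check. The sketch below is only an illustration: it assumes a binary encoding in which the data sequence holds exactly n1*n2*...*n9 samples of esize bytes each, treats unspecified higher-dimension axis lengths as 1, and takes the header fields as an already-parsed dictionary. The function name data_size_bytes is hypothetical.

```python
from math import prod


def data_size_bytes(fields):
    """Expected size of the data sequence in bytes: esize * n1 * n2 * ... * n9.

    Assumption of this sketch: unspecified axis lengths above n1 count as 1.
    """
    if "n1" not in fields:
        raise KeyError("n1 is mandatory")
    lengths = [int(fields.get(f"n{axis}", 1)) for axis in range(1, 10)]
    if any(n <= 0 for n in lengths):
        raise ValueError("axis lengths must be greater than zero")
    return int(fields["esize"]) * prod(lengths)


# The 1000x20x500 cube from the text, stored as 4-byte samples
print(data_size_bytes({"esize": "4", "n1": "1000", "n2": "20", "n3": "500"}))  # 40000000
```

Comparing this number with the actual size of the data file is a cheap way to catch a header that does not match its data.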

Programs write to the ASCII header only those non-compulsory fields whose values have been modified. For example, a program performing summation of a 3-dimensional array over axis 3 will write to the header only n3=1, but not the values for n1 and n2, which have not been changed. Madagascar programs read the ASCII headers from the beginning to the end, and overwrite existing fields if new values appear.
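
The "read from beginning to end, later values win" behavior can be sketched in a few lines. This is a minimal illustration, not the parser used by the Madagascar C API: it assumes every entry has the form key=val with no spaces around the equal sign, that string values are enclosed in double quotes, and that text without an equal sign (such as the first line of each recording) can be skipped. The program names, paths and date format in the sample header are made up.

```python
import re

# key=val with no spaces around "="; val is a "quoted string" or a bare token
ENTRY = re.compile(r'(\w+)=("[^"]*"|\S+)')


def parse_header(text):
    """Collect key=val entries from header text into a dict.

    Entries are visited from beginning to end, so a later value for a
    key replaces an earlier one.
    """
    fields = {}
    for key, val in ENTRY.findall(text):
        fields[key] = val.strip('"')
    return fields


# Hypothetical history: one program creates a cube, a second sums over axis 3
header = '''sfprog1 /home/user: user@machine Thu Feb 25 10:07:00 2010

        in="/data/cube.rsf@"
        data_format="native_float"
        esize=4
        n1=1000
        n2=20
        n3=500

sfprog2 /home/user: user@machine Thu Feb 25 10:08:00 2010

        in="/data/sum.rsf@"
        n3=1
'''
fields = parse_header(header)
print(fields["n1"], fields["n2"], fields["n3"])  # 1000 20 1
print(fields["in"])  # /data/sum.rsf@
```

All values are kept as strings here; converting them to numbers is left to the caller, since the type depends on the key.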

Data sequence

Like arrays stored in computer memory, the numerical values are written as one linear sequence in which the first dimension varies fastest, then the second, and so on. For example, the array

a11 a12 a13
a21 a22 a23
a31 a32 a33

will be written as:

a11 a21 a31 a12 a22 a32 a13 a23 a33

An array with three dimensions (here 3x3x2) will be ordered as follows:

a111 a211 a311 a121 a221 a321 a131 a231 a331 a112 a212 a312 a122 a222 a322 a132 a232 a332

In this example dimension 1 is defined as the column direction, dimension 2 as the row direction, and dimension 3 as the "out of page" direction. This example follows the linear algebra / Fortran convention. Note that only the actual ordering of the numerical values in the file matters; the mental picture with "vertical" and "horizontal" directions does not (the C language actually uses the inverse order for the dimensions). Thus, dimension 1 in an RSF dataset is in general defined as the dimension along which the first two elements in the data file are neighbors, dimension 2 as the dimension adjacent to it, and so on.
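
The ordering can also be stated as a formula: with 0-based indices, the element (i1, i2, i3, ...) sits at position i1 + n1*(i2 + n2*(i3 + ...)) in the data sequence. The sketch below implements that formula and reproduces the 3x3 example from above; the function name linear_offset is hypothetical.

```python
def linear_offset(index, shape):
    """Position in the data sequence of the element with 0-based `index`.

    Axis 1 varies fastest: offset = i1 + n1*(i2 + n2*(i3 + ...)).
    """
    offset = 0
    # Work from the slowest axis down to the fastest one (Horner's scheme)
    for i, n in zip(reversed(index), reversed(shape)):
        if not 0 <= i < n:
            raise IndexError("index out of range")
        offset = offset * n + i
    return offset


# Rebuild the order in which the 3x3 array from the text is written
n1, n2 = 3, 3
sequence = [None] * (n1 * n2)
for i2 in range(n2):
    for i1 in range(n1):
        sequence[linear_offset((i1, i2), (n1, n2))] = f"a{i1 + 1}{i2 + 1}"
print(" ".join(sequence))  # a11 a21 a31 a12 a22 a32 a13 a23 a33
```

In the 3x3x2 example, a112 has 0-based index (0, 0, 1), so linear_offset((0, 0, 1), (3, 3, 2)) gives 9, i.e. the tenth value in the sequence.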

Multi-file RSF

When a Madagascar program writes an RSF dataset to disk (e.g. sfprog >file.rsf), it will create by default:

  • a file with the ASCII header and the extension ".rsf"
  • a file with the data sequence, in the encoding specified by the header, and the extension ".rsf@". Temporary files created during piping operations are allowed to have no extension. The data file is usually located in the directory specified by the $DATAPATH environment variable. The in field in the ASCII header specifies the absolute path of the data file.
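
To make the two-file layout concrete, the following sketch writes a tiny header file and data file into a temporary directory and reads the values back by following the in field. It is an illustration under several assumptions, not the Madagascar implementation: it handles only a one-dimensional native_float dataset, it assumes that native_float means C floats in the byte order of the machine running the code, and all names (read_native_float, demo.rsf, demo.rsf@) are hypothetical.

```python
import os
import re
import struct
import tempfile


def read_native_float(header_path):
    """Read a 1-D native_float dataset stored as header file + data file."""
    with open(header_path, "r", encoding="ascii") as f:
        text = f.read()
    # Later entries override earlier ones, so the dict keeps the last value
    fields = {k: v.strip('"') for k, v in re.findall(r'(\w+)=("[^"]*"|\S+)', text)}
    if fields["data_format"] != "native_float":
        raise ValueError("this sketch only handles native_float")
    n1 = int(fields["n1"])
    # The "in" field holds the path of the file with the data sequence
    with open(fields["in"], "rb") as f:
        raw = f.read(n1 * int(fields["esize"]))
    return list(struct.unpack(f"{n1}f", raw))


# Round trip in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    data_path = os.path.join(tmp, "demo.rsf@")
    header_path = os.path.join(tmp, "demo.rsf")
    with open(data_path, "wb") as f:
        f.write(struct.pack("3f", 1.0, 2.0, 3.0))
    with open(header_path, "w", encoding="ascii") as f:
        f.write(f'\tin="{data_path}"\n\tdata_format="native_float"\n\tesize=4\n\tn1=3\n')
    values = read_native_float(header_path)
print(values)  # [1.0, 2.0, 3.0]
```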

Single-stream RSF

In a single-stream RSF dataset, the last value of in in the ASCII header is "stdin". The ASCII header is followed by a sequence of three special characters, EOL EOL EOT (\014\014\004 in octal, i.e. two ASCII form-feed characters and one ASCII end-of-transmission character), followed by the data sequence. This stream can be written to a single file with the standard ".rsf" extension if a special option is passed to the Madagascar program. It can also be piped in the standard Unix manner.
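
A reader of such a stream has to find the three-character terminator before it can interpret anything else. The sketch below shows one way to do that on a stream already held in memory; it assumes the terminator does not occur inside the header (which holds only printable characters), and the function name split_stream and the sample header are hypothetical.

```python
import struct

HEADER_END = b"\x0c\x0c\x04"  # octal \014 \014 \004


def split_stream(stream_bytes):
    """Split a single-stream RSF dataset into (header text, data bytes)."""
    # partition() cuts at the first occurrence, so terminator-like bytes
    # inside the binary data are left untouched
    header, sep, data = stream_bytes.partition(HEADER_END)
    if not sep:
        raise ValueError("header terminator not found")
    return header.decode("ascii"), data


# Hypothetical stream: a minimal header followed by two native floats
stream = (
    b'\tin="stdin"\n\tdata_format="native_float"\n\tesize=4\n\tn1=2\n'
    + HEADER_END
    + struct.pack("2f", 1.5, -2.0)
)
header_text, data = split_stream(stream)
print(struct.unpack("2f", data))  # (1.5, -2.0)
```

A program reading from a real pipe would read incrementally rather than load the whole stream, but the split point is found in the same way.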