💡 Pitch

(?P<named>capture.*group) parameters on DirectoryFormats

Evan Bolyen
Note to reader: To better understand the purpose of this proposal, it may be beneficial to scan the Future Application section below before diving into the gritty details which follow.
Detailed description of a freshly caught file-format by an early bioinformatician.

Goals 

  1. Provide the information needed to convert a DirectoryFormat into a web-form (or equivalent)
  2. Create a 1-to-1 mapping between file paths and a new set of described parameters. (Thereby making the path_maker obsolete, or at least automatic.)
  3. Support better automated documentation for directory formats

Appetite 

6 weeks

Background 

QIIME 2 stores all artifact data using a DirectoryFormat. This directory format has class attributes which define collections of files which must be present in the directory for it to be a valid member of that format.

Currently, QIIME 2 supports describing these filepaths as regular expressions (regex), which means that programmatically constructing a filepath requires a separate function (the creatively named path_maker). Unfortunately, the arguments to this path_maker are not easily introspected, and it is typically annoying to write because it serves relatively little purpose: without introspection, nothing can use it.

Instead, it would be advantageous if the filepath were broken into regular segments with named capture groups. Each named capture group could then be further described, permitting both introspection and automated parameterization. This would improve interfaces' ability to generate import machinery, as they would have a deeper knowledge of how the directory format is constructed. This is especially helpful because these details are some of the most opaque to the user.

Terms
  • DirectoryFormat: The python model of a filesystem directory schema. This is the contents of the `/data/` directory in a QIIME 2 Archive.
  • model-attribute: An attribute of a DirectoryFormat that describes its contents. These include `model.File` and `model.FileCollection`, and perhaps in the future `model.Directory` and `model.Zarr`.
  • pathspec: Path Specification, the string that a member of a model-attribute would match. This is sometimes a regex, but not consistently.
  • path_maker: The function which is in charge of converting some arbitrary uninspectable keyword parameters into a filepath.
  • named capture group: An extension of a capture group ( `([0-9]+)` ) which is named ( `(?P<name_here>[0-9]+)` )

Approximate Syntax 

[Image: sample syntax for defining and annotating named capture groups.]
This particular syntax would require validation to occur in the meta-type's `__new__` to verify all capture groups are described and that any other constraints are met. Alternatively, additional arguments could be added to model-attributes instead.

The specific parameters of the `path_param` factories are given only as an example, but would permit some interesting things to be done automatically. For example:
  • `unique` could be checked by the `validate()` method
  • `alias` could be used to simplify interfaces and clarify segments
  • `all_or_none` would indicate that each `sample_id` should possess all-or-none of "R1" and "R2" 
  • `compression` could permit the automatic detection and correction of uncompressed files.
  • `default` (not used above), could indicate that this file matches the regex of the pathspec, but would allow a direct conversion of a FileFormat into the DirectoryFormat without specifying the path details for this segment. The advantage of this parameter is in making importing a directory simpler as the regex could be permissive, without compromising the ability to write these directory formats when a nameless FileFormat is provided.
Each of these parameters provides some defined and actionable information about how these paths are related to each other, and reduces the burden on the developer to program these constraints manually in `_validate_()`.
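
As a rough illustration of how one such parameter could be acted on automatically, the sketch below checks an `all_or_none` constraint. The `PathParam` class, its fields, and `check_all_or_none` are hypothetical names invented for this example (they are not an existing QIIME 2 API), and the pathspec assumes `ext` carries the leading period.

```python
import re
from collections import defaultdict
from dataclasses import dataclass


# Hypothetical stand-in for a path_param description; not an existing QIIME 2 API.
@dataclass(frozen=True)
class PathParam:
    name: str
    description: str = ""
    all_or_none: bool = False


PATHSPEC = re.compile(r"(?P<sample_id>.+)_(?P<direction>R[12])(?P<ext>\.fastq\.gz)")


def check_all_or_none(filenames, param, group_by, expected):
    """Return the `group_by` values that have some, but not all, of `expected`."""
    seen = defaultdict(set)
    for filename in filenames:
        match = PATHSPEC.fullmatch(filename)
        if match is None:
            continue  # not a member of this collection
        seen[match[group_by]].add(match[param.name])
    return sorted(key for key, values in seen.items() if values != set(expected))


direction = PathParam("direction", "R1 for forward, R2 for reverse", all_or_none=True)
files = ["s1_R1.fastq.gz", "s1_R2.fastq.gz", "s2_R1.fastq.gz"]
incomplete = check_all_or_none(files, direction, "sample_id", {"R1", "R2"})
print(incomplete)  # ['s2']
```

A framework-level `validate()` could run a check like this for every parameter flagged `all_or_none`, so that individual formats would not need to write it by hand.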

Minimum Implementation for Parameters

As there are a potentially endless number of possible parameters for the `path_param`s, implementers should use their best judgement in prioritizing a useful subset. This subset does not need to be complete or even entirely useful, rather it is more important that it is straight-forward to add new terms in the future.

A survey of existing formats and possible generalizations would be helpful in populating an initial set of path_param parameters, but this will take time away from implementation and is not ultimately that necessary for this project (future projects can tackle that).

Automated Documentation

To improve documentation of formats, it would be useful if the path components could be extracted, and explained. For example, suppose there was a pathspec which looked like this:
(?P<sample_id>.+)_(?P<direction>R[12])(?P<ext>\.fastq\.gz)

It would be ideal to construct documentation that resembled this:

Docstring of the class would go here

File collection: sequences
   :file format: :ref:`FastqGZFormat`
   :path: <sample_id>_<direction>.fastq.gz
   :sample_id: This is the name of your sample, it may contain any characters
   :direction: This is the direction of the read, R1 for forward, R2 for 
               reverse. Each sample_id must have a corresponding R1 and/or R2
               when they are present.

File: something_else
   :file format: :ref:`WhateverFormat`
   :path: some_fixed_path.txt

In the above, the regex capture groups have been replaced with more readable placeholders, each possessing its own description. The extension `ext` is inlined as it is static. (It would therefore be convenient if `<ext>` was a specifically required capture group and if it was simple to identify if alternative extensions are allowed.)

This documentation does not need to be templated by the format or the framework, but there should be sufficient information accessible that such a template could be made by an external documentation system (e.g. the library and q2galaxy).
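
A minimal sketch of the placeholder substitution described above, assuming capture groups never contain parentheses of their own. The helper names (`readable_path`, `describe`) and the `inline` argument are hypothetical and chosen only for illustration.

```python
import re

# Simplified group matcher: assumes no parentheses inside a capture group.
GROUP = re.compile(r"\(\?P<(?P<name>\w+)>(?P<body>[^()]*)\)")


def readable_path(pathspec, inline=None):
    """Swap named groups for <name> placeholders, inlining fixed values."""
    inline = inline or {}

    def replace(match):
        name = match["name"]
        return inline.get(name, f"<{name}>")

    templated = GROUP.sub(replace, pathspec)
    # Drop the backslash from escaped literals such as "\." in static text.
    return re.sub(r"\\(.)", r"\1", templated)


def describe(label, pathspec, descriptions, inline=None):
    """Render one field-list entry in the style of the example above."""
    lines = [f"File collection: {label}",
             f"   :path: {readable_path(pathspec, inline)}"]
    for name, text in descriptions.items():
        lines.append(f"   :{name}: {text}")
    return "\n".join(lines)


spec = r"(?P<sample_id>.+)_(?P<direction>R[12])(?P<ext>\.fastq\.gz)"
print(describe("sequences", spec,
               {"sample_id": "This is the name of your sample",
                "direction": "R1 for forward, R2 for reverse"},
               inline={"ext": ".fastq.gz"}))
```

Here the caller decides which groups to inline. With described parameters, that decision could instead come from the parameter itself (for example, an `ext` with exactly one allowed value).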

SDK Affordances 

Currently it is difficult to determine if a path is fixed (no parameters or variation in the name), or a regex. `q2galaxy` uses the heuristic of finding an escaped period (`\.`) to indicate that a regex is present, however this is not particularly robust, as not all periods are correctly escaped. There should be some mechanism to more authoritatively identify if a given pathspec is a regex or not. Maintaining backwards compatibility here may be difficult, so a breaking change could be the outcome to correct this.

There should be a mechanism to extract the sub-expressions from a capture group. i.e. it should be possible to create a generic `<input type="text"/>` and validate that component without constructing the entire path. This regex is capable of that (up to a limit of course): https://regexr.com/5lceo
\(\?P\<(?P<named_group>[_a-zA-Z]+[_a-zA-Z0-9]*)\>(?P<regex>.+?)(?<!(?<!(?<!(?<!(?<!(?<!(?<!(?<!(?<!\\)\\)\\)\\)\\)\\)\\)\\)\\)\)
Ideally these sub-regexes are available as attributes on the path_param objects.
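
The sketch below applies the expression quoted above with Python's `re` module to pull out each sub-regex and validate a single component on its own. The function names are hypothetical, the nested lookbehinds are assembled programmatically (nine levels, as in the quoted expression) purely for readability, and the example pathspec assumes `ext` carries the leading period.

```python
import re

# The expression quoted above: a named group, a lazy body, and a closing
# parenthesis that is not escaped (nine nested negative lookbehinds).
NAMED_GROUP = re.compile(
    r"\(\?P\<(?P<named_group>[_a-zA-Z]+[_a-zA-Z0-9]*)\>(?P<regex>.+?)"
    + "(?<!" * 9 + r"\\)" * 9
    + r"\)"
)


def sub_expressions(pathspec):
    """Map each named capture group to the regex it contains."""
    return {m["named_group"]: m["regex"] for m in NAMED_GROUP.finditer(pathspec)}


def validate_component(pathspec, name, value):
    """Check one form field against its own sub-regex, without a full path."""
    return re.fullmatch(sub_expressions(pathspec)[name], value) is not None


spec = r"(?P<sample_id>.+)_(?P<direction>R[12])(?P<ext>\.fastq\.gz)"
print(sub_expressions(spec))
print(validate_component(spec, "direction", "R1"))  # True
print(validate_component(spec, "direction", "R3"))  # False
```

As noted, this only works up to a limit: a group whose body contains an unescaped `)` (a nested group, or `[)]`) would be cut short by the lazy match.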

Anything which is not identified to be a named capture group, must be a static string, and the match boundaries will collectively identify those static strings. Using these boundaries, it should be possible to convert a dictionary of arguments (keyed by capture group name) to a unique filepath. This functionality should take the place of the existing `path_maker`.
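
One way this could look, as a sketch only: `make_path` below is a hypothetical replacement for `path_maker` that uses the match boundaries to find the static text, validates each argument against its group, and confirms the result parses back to the same arguments. It uses a simplified group matcher that assumes no parentheses inside a capture group.

```python
import re

# Simplified group matcher: assumes no parentheses inside a capture group.
GROUP = re.compile(r"\(\?P<(?P<name>\w+)>(?P<body>[^()]*)\)")


def unescape(static):
    """Turn escaped literals such as "\\." back into plain characters."""
    return re.sub(r"\\(.)", r"\1", static)


def make_path(pathspec, **params):
    """Build a filepath from capture-group arguments, validating each one."""
    parts, cursor = [], 0
    for match in GROUP.finditer(pathspec):
        # Text between match boundaries is treated as static.
        parts.append(unescape(pathspec[cursor:match.start()]))
        name, value = match["name"], params[match["name"]]
        if re.fullmatch(match["body"], value) is None:
            raise ValueError(f"{value!r} does not match {name}: {match['body']}")
        parts.append(value)
        cursor = match.end()
    parts.append(unescape(pathspec[cursor:]))
    path = "".join(parts)
    # Round-trip check: the path must parse back to exactly these arguments.
    parsed = re.fullmatch(pathspec, path)
    if parsed is None or parsed.groupdict() != params:
        raise ValueError(f"{path!r} does not map back to {params!r}")
    return path


spec = r"(?P<sample_id>.+)_(?P<direction>R[12])(?P<ext>\.fastq\.gz)"
path = make_path(spec, sample_id="sample-1", direction="R1", ext=".fastq.gz")
print(path)  # sample-1_R1.fastq.gz
```

The round-trip check is what ties this to the 1-to-1 mapping goal: a set of arguments is only accepted if the resulting path would be read back as the same arguments.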

It should also be possible to verify that the substrings between capture groups are static by searching for regex control sequences. This itself could be challenging as there are many unusual cases (such as `\\.` having an infinite number of matches whereas `\.` has only a single match). Processing such escape sequences first may be important to determining what control sequences actually exist. `re.escape` may be helpful here, but indirectly.
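
A conservative sketch of that check: escapes are resolved first, and anything that could be a control sequence causes the segment to be rejected. The function name `static_text` and the choice to reject every backslash-plus-alphanumeric escape (even harmless ones) are assumptions made for this example.

```python
# Characters with special meaning outside a character class.
REGEX_META = set(".^$*+?{}[]|()")


def static_text(segment):
    """Return the literal string a segment matches, or None if it is not static."""
    literal, i = [], 0
    while i < len(segment):
        char = segment[i]
        if char == "\\":
            # Resolve the escape first, so an escaped backslash followed by
            # a period is seen as literal backslash + wildcard.
            if i + 1 == len(segment):
                return None  # dangling backslash is not a valid pattern
            escaped = segment[i + 1]
            if escaped.isalnum():
                return None  # conservatively reject \d, \w, \1 and similar
            literal.append(escaped)
            i += 2
        elif char in REGEX_META:
            return None
        else:
            literal.append(char)
            i += 1
    return "".join(literal)


print(static_text(r"_"))        # _
print(static_text(r"\.fastq"))  # .fastq
print(static_text(r"\\."))      # None
```

Applied to a whole pathspec, the same function gives a stricter answer to "is this a regex or a fixed path?" than looking for an escaped period: a pathspec is fixed exactly when `static_text` returns a string.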

Potential Difficulties

The `plugin.model` API is constructed with the use of meta-classes. While these are not particularly difficult (once understood), it is a part of the Python language which is rarely used, so most are unfamiliar with the precise mechanics. This means some of the time budget will be spent on learning how to use meta-classes and then designing a sufficient `path_param` API.

It is believed that the regex above can solve this particular problem; however, if it turns out to be incomplete, a major assumption of this proposal falls apart. This could necessitate properly parsing a regex into a syntax tree that can be manipulated, which, while absolutely possible, will take far longer to research and implement.

Similarly, if detecting static substrings in a regex proves impossible with readily available tools, then a similar parsing and syntax tree approach may be needed, with the same downsides as above.




Future Application 

To illustrate some of the expectations for this design, and to clarify the intention, the following examples show what could be done with this project once finished. These are not meant to represent work to be done as a part of this project.

Replace the "FASTQ Manifest" file formats with a generalized "directory manifest".
For example, consider this `PairedEndFastqManifestPhred64V2` format:
sample-id     forward-absolute-filepath               reverse-absolute-filepath
sample-1      $PWD/some/filepath/sample1_R1.fastq.gz  $PWD/some/filepath/sample1_R2.fastq.gz
sample-2      $PWD/some/filepath/sample2_R1.fastq.gz  $PWD/some/filepath/sample2_R2.fastq.gz
Could be generalized to a DirectoryFormat with a pathspec of:
(?P<sample_id>.+)_(?P<direction>R[12])(?P<ext>\.fastq\.gz)
Combine this with a description which implied:
  • `sample_id` is a String
  • `direction` is `{'R1': 'forward', 'R2': 'reverse'}`
  • `ext` is exactly `'.fastq.gz'`
Then you could, from the above, infer a file with the following shape:
sequences:               sample_id  direction
/some/filepath1.forward  sample-1   forward
/some/filepath1.reverse  sample-1   reverse
/some/filepath2.forward  sample-2   forward
/some/filepath2.reverse  sample-2   reverse

metadata:                some_field
/another/path            label2
This looks a bit like one of the older V1 manifest formats, but the difference is that this particular file doesn't need a transformer to be understood (it is also file-oriented, rather than sample-oriented). Instead, the fields provided can be interpreted based on their reference to the DirectoryFormat's model-attributes, and the arguments can be validated using the capture group's regex and any additional type information. (In this particular example, the unescaped colon after the name in the header makes it clear that it is a label and not a path.)

Once validated, this file contains all of the arguments to the pathspec, and can be reversed into an operation like:
cp /some/filepath1.forward  /tmp/import_dir/sample-1_R1.fastq.gz
cp /some/filepath1.reverse  /tmp/import_dir/sample-1_R2.fastq.gz
cp /another/path           /tmp/import_dir/metadata.label2.json
This `/tmp/import_dir` directory is then imported as a fully formed DirectoryFormat and as a consequence, all of the members of that directory will be correctly check-summed by provenance.

(Minor note: `qiime2.util.duplicate` would be ideally used as it would hard-link files when possible, instead of copying.)
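
To make the "reversal" concrete, here is a sketch that reads a directory manifest shaped like the one above and lists the copy operations for the `sequences` collection. The parser, the whitespace-delimited layout, and the `DIRECTION_CODES` mapping are all assumptions for illustration. A real implementation would derive the destination from the pathspec and its described parameters rather than from a hard-coded template.

```python
from pathlib import PurePosixPath

# Hypothetical reverse of the `direction` description: label -> path segment.
DIRECTION_CODES = {"forward": "R1", "reverse": "R2"}


def parse_manifest(text):
    """Group rows under the header (first field ends in a colon) above them."""
    sections, current = {}, None
    for line in text.splitlines():
        fields = line.split()  # assumes no whitespace inside paths or values
        if not fields:
            continue
        first = fields[0]
        if first.endswith(":") and not first.endswith("\\:"):
            current = sections.setdefault(first[:-1], {"params": fields[1:], "rows": []})
        elif current is None:
            raise ValueError(f"row before any header: {line!r}")
        else:
            current["rows"].append((first, dict(zip(current["params"], fields[1:]))))
    return sections


def plan_sequence_copies(manifest, import_dir):
    """List (source, destination) pairs for the `sequences` collection."""
    plan = []
    for source, args in manifest["sequences"]["rows"]:
        name = f"{args['sample_id']}_{DIRECTION_CODES[args['direction']]}.fastq.gz"
        plan.append((source, str(PurePosixPath(import_dir) / name)))
    return plan


MANIFEST = """\
sequences:               sample_id  direction
/some/filepath1.forward  sample-1   forward
/some/filepath1.reverse  sample-1   reverse

metadata:                some_field
/another/path            label2
"""

manifest = parse_manifest(MANIFEST)
plan = plan_sequence_copies(manifest, "/tmp/import_dir")
for source, destination in plan:
    print("cp", source, destination)
```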

Converting the regex itself
Instead of specifying every parameter individually, an individual familiar with regex could generate the above description with an alternative regex which contained the same named capture groups. For instance, suppose the target is (as before):
(?P<sample_id>.+)_(?P<direction>R[12])(?P<ext>\.fastq\.gz)
However the files are named like this:
KerbalLab.sample1.R1.fastq.gz
KerbalLab.sample2.R2.fastq.gz
...
A regex such as:
KerbalLab\.(?P<sample_id>.*)\.(?P<direction>R[12])(?P<ext>\.fastq\.gz)
would be sufficient to parameterize each file for import.
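
A small sketch of that idea using only Python's `re` module: the user-supplied regex extracts the parameters, and a template of the target path rebuilds the name. `TARGET`, `SOURCE`, and `rename_plan` are hypothetical names, and the template is written by hand here rather than derived from the target pathspec.

```python
import re

# Readable form of the target pathspec, written by hand for this example.
TARGET = "{sample_id}_{direction}{ext}"

# User-supplied regex with the same named capture groups as the target.
SOURCE = re.compile(
    r"KerbalLab\.(?P<sample_id>.*)\.(?P<direction>R[12])(?P<ext>\.fastq\.gz)"
)


def rename_plan(filenames):
    """Map each matching filename to its target name; collect the rest."""
    plan, unmatched = {}, []
    for filename in filenames:
        match = SOURCE.fullmatch(filename)
        if match is None:
            unmatched.append(filename)
        else:
            plan[filename] = TARGET.format(**match.groupdict())
    return plan, unmatched


plan, unmatched = rename_plan([
    "KerbalLab.sample1.R1.fastq.gz",
    "KerbalLab.sample2.R2.fastq.gz",
    "README.txt",
])
print(plan)
print(unmatched)  # ['README.txt']
```

Reporting the unmatched files separately gives the user something to review before anything is imported.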

It may be advantageous to write the above "directory manifest" as the result of this process instead of directly importing, as it allows for the manual review of the regex's performance and makes it simple to correct minor/intermittent issues.

Improving q2galaxy's import wizard
Currently, q2galaxy's mechanism for importing data still requires somewhat deep knowledge of the expected directory structure, just as if you were importing on the command line. Furthermore, as the data is generally not local to the user, they cannot use the "fastq manifest" formats.

Here are two examples of associating individual files with a given filename. Sometimes, the filename is simple and the intention fairly obvious. At other times, this is not the case.
Suppose instead, we generated a simple form which avoided the regex entirely (same directory format, but showing before and after):
This is q2galaxy's analogue to the "directory manifest" from above. This should be straight-forward to accomplish if each pathspec was matched via a capture group and additional information was provided about the capture groups as described in this pitch.

(The "directory manifest" cannot be used for similar reasons to the "fastq manifests", as filepaths are hidden away from the user. Although such a manifest could be generated from a regex on a Galaxy collection then reviewed in a two-action workflow.)

The real deal import wizard.
It looks like you're importing your data. Would you like help?

Take all of the above and put it together into a small graphical interface with input forms which can be prefilled by a regex statement or by consuming a "directory manifest". All guided by a questionnaire using automated documentation to narrow down what exactly you want to do.

It's simple really.

(Actually, it would be really cool to do a global alignment of the filepath names and then implement a multi-select based on the high-entropy loci. This would give you a "graphical regex" by just highlighting relevant segments of the filename.)