rushd package

Submodules

rushd.flow module

Common function for analyzing flow data in Pandas Dataframes.

Allows users to specify custom metadata applied via well mapping. Combines user data from multiple .csv files into a single DataFrame.

exception rushd.flow.GroupsError[source]

Bases: RuntimeError

Error raised when there is an issue with the data groups DataFrame.

exception rushd.flow.MOIinputError[source]

Bases: RuntimeError

Error raised when there is an issue with the provided dataframe.

exception rushd.flow.MetadataWarning[source]

Bases: UserWarning

Warning raised when the passed metadata is possibly incorrect, but valid.

exception rushd.flow.RegexError[source]

Bases: RuntimeError

Error raised when there is an issue with the file name regular expression.

exception rushd.flow.YamlError[source]

Bases: RuntimeError

Error raised when there is an issue with the provided .yaml file.

rushd.flow.load_csv_with_metadata(data_path, yaml_path, filename_regex=None, *, columns=None)[source]

Load .csv data into DataFrame with associated metadata.

Generates a pandas DataFrame from a set of .csv files located at the given path, adding columns for metadata encoded by a given .yaml file. Metadata is associated with the data based on well IDs encoded in the data filenames.

Parameters:
  • data_path (str or Path) – Path to directory containing data files (.csv)

  • yaml_path (str or Path) – Path to .yaml file to use for associating metadata with well IDs. All metadata must be contained under the header ‘metadata’.

  • filename_regex (str or raw str (optional)) – Regular expression to use to extract well IDs from data filenames. Must contain the capturing group ‘well’ for the sample well IDs. If not included, the filenames are assumed to follow this format (default export format from FlowJo): ‘export_[well]_[population].csv’

  • columns (Optional list of strings) – If specified, only the specified columns are loaded out of the CSV files. This can drastically reduce the amount of memory required to load flow data.

Return type:

A single pandas DataFrame containing all data with associated metadata.
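For illustration, the default filename format can be expressed as a regular expression containing the required ‘well’ named capturing group. This is a sketch of such a pattern, not necessarily the exact default rushd uses internally:

```python
import re

# Illustrative filename_regex with the required named 'well' capturing group,
# matching default FlowJo-style export names like 'export_A1_singlets.csv'.
# This is a sketch; rushd's internal default pattern may differ in detail.
pattern = r"export_(?P<well>[A-P]\d+)_(?P<population>.+)\.csv"

match = re.match(pattern, "export_A1_singlets.csv")
well = match.group("well")              # well ID extracted from the filename
population = match.group("population")  # extra groups can become metadata
```

Any additional named groups (like ‘population’ here) give you a way to pull extra metadata directly out of filenames.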

rushd.flow.load_groups_with_metadata(groups_df, base_path='', filename_regex=None, *, columns=None)[source]

Load .csv data into DataFrame with associated metadata by group.

Each group of .csv files may be located at a different path and be associated with additional user-defined metadata.

Parameters:
  • groups_df (Pandas DataFrame) – Each row of the DataFrame is evaluated as a separate group. Columns must include ‘data_path’ and ‘yaml_path’, specifying absolute or relative paths to the group of .csv files and metadata .yaml files, respectively. Optionally, regular expressions for the file names can be specified for each group using the column ‘filename_regex’ (this will override the ‘filename_regex’ argument).

  • base_path (str or Path (optional)) – If specified, path that data and yaml paths in input_df are defined relative to.

  • filename_regex (str or raw str (optional)) – Regular expression to use to extract well IDs from data filenames. Must contain the capturing group ‘well’ for the sample well IDs. Other capturing groups in the regex will be added as metadata. This value applies to all groups; to specify different regexes for each group, add the column ‘filename_regex’ to groups_df (this will override the ‘filename_regex’ argument). If not included, the filenames are assumed to follow this format (default export format from FlowJo): ‘export_[well]_[population].csv’

  • columns (Optional list of strings) – If specified, only the specified columns are loaded out of the CSV files. This can drastically reduce the amount of memory required to load flow data.

Return type:

A single pandas DataFrame containing data from all groups with associated metadata.

rushd.flow.load_well_metadata(yaml_path)[source]

Load a YAML file and convert it into a well mapping.

Parameters:

yaml_path (str or Path) – Path to the .yaml file to use for associating metadata with well IDs.

Return type:

A dictionary that contains a well mapping for all metadata columns.
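As an illustration, a metadata .yaml file of the shape these loaders expect might look like the following. The column names and values here are hypothetical; the key structure (everything nested under ‘metadata’, with each column mapping values to well specifications) follows the description above and the well_mapper specification below:

```yaml
metadata:
  virus:              # becomes a 'virus' column in the DataFrame
    mCherry: A1-D12
    GFP: E1-H12
  dox:                # becomes a 'dox' column
    0: A1-H3
    100: A4-H12
```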

rushd.flow.moi(data_frame, color_column_name, color_cutoff, output_path=None, summary_method='median', *, scale_factor=1.0)[source]

Calculate MOI information from FlowJo data with appropriate metadata.

Generates a pandas DataFrame of virus titers from a pandas DataFrame of FlowJo data.

Parameters:
  • data_frame (pd.DataFrame) –

    The pandas DataFrame to analyze. It must have the following columns:

    condition: the conditions/types of virus being analyzed
    replicate: the replicate of the data (can have all data as the same replicate)
    starting_cell_count: the number of cells in the well at the time of infection
    scaling: the dilution factor of each row
    max_virus: the maximum virus added to that column

    scaling times max_virus should result in the volume of virus stock added to a well.

  • color_column_name (str) – The name of the column on which to gate infection.

  • color_cutoff (float) – The level of fluorescence on which to gate infection.

  • output_path (str or Path (optional)) – The path to the output folder. If None, all plots are printed to screen instead. Defaults to None.

  • summary_method (str (optional)) – Whether to return the calculated titer as the mean or median of the replicates.

  • scale_factor (float (optional)) – Factor by which the maximum of the Poisson fit is scaled down.

Return type:

A single pandas DataFrame containing the titer of each condition in TU per uL.
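The titer calculation rests on the Poisson relationship between the fraction of infected (gated-positive) cells and the MOI. A minimal sketch of that relationship for a single well, not rushd's full fitting procedure across the dilution series:

```python
import math

def estimate_titer(frac_infected, starting_cell_count, virus_volume_ul):
    """Sketch of the Poisson-based titer estimate underlying an MOI fit.

    Assuming infection events are Poisson-distributed, the fraction of
    infected cells f relates to the MOI m by f = 1 - exp(-m), so
    m = -ln(1 - f). The titer in TU/uL is then the number of infectious
    units (MOI times cell count) divided by the virus volume added.
    This is a single-well illustration, not rushd's implementation.
    """
    moi = -math.log(1.0 - frac_infected)
    transducing_units = moi * starting_cell_count
    return transducing_units / virus_volume_ul

# 39.3% infected cells, 1e5 cells at infection, 10 uL of virus stock:
titer = estimate_titer(0.393, 1e5, 10.0)  # roughly 5000 TU/uL
```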

rushd.io module

A submodule implementing common IO handling mechanisms.

Rationale

File and folder management is a common problem when handling large datasets. You often want to separate large data from your code. How do you keep track of where your data is, especially when moving between different computers or clusters?

rushd.io adds convenience functions to handle common cases, as well as writing metadata with your output files that identify input files.

exception rushd.io.NoDatadirError[source]

Bases: RuntimeError

No datadir.txt file found.

Error raised when rushd is unable to locate a datadir.txt path in the current file.

rushd.io.cache_dataframe(cache_path)[source]

Wrap caching functionality around a dataframe-generating function.

Notes

If you wrap a function that contains an invalidate keyword, this keyword will be removed when passed to your function!

Parameters:

cache_path (str or Path) – The path at which the dataframe cache should be saved

Return type:

Callable[..., Callable[..., DataFrame]]

Returns:

A function that generates a dataframe with optional caching. An extra keyword argument, ‘invalidate’, is added that invalidates the cache if needed.
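The decorator's shape and the ‘invalidate’ keyword can be sketched with a stand-in that pickles plain Python objects; rushd itself caches pandas DataFrames and may use a different storage format:

```python
import functools
import pickle
import tempfile
from pathlib import Path

def cache_result(cache_path):
    """Sketch of cache_dataframe's behavior, using pickle on plain objects.

    Shows the decorator shape and the 'invalidate' keyword, which is
    consumed here and never passed to the wrapped function.
    """
    cache_path = Path(cache_path)

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, invalidate=False, **kwargs):
            if cache_path.exists() and not invalidate:
                return pickle.loads(cache_path.read_bytes())
            result = func(*args, **kwargs)
            cache_path.write_bytes(pickle.dumps(result))
            return result
        return wrapper
    return decorator

cache_file = Path(tempfile.mkdtemp()) / "result.pkl"

@cache_result(cache_file)
def expensive_load():
    return {"rows": 3}

first = expensive_load()                  # computed and written to cache
second = expensive_load()                 # read back from the cache file
fresh = expensive_load(invalidate=True)   # recomputed, cache rewritten
```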

rushd.io.git_version()[source]

Return the current version control state as a string.

The state is a string {hash}, with {-dirty} appended if there are edits that have not been saved. Returns None if the current working directory is not contained within a git repository.

Return type:

Optional[str]
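A sketch of equivalent behavior using the git command line; rushd's actual implementation is not shown here and may use a different mechanism:

```python
import subprocess
from typing import Optional

def git_version_sketch() -> Optional[str]:
    """Approximate git_version using the git CLI.

    Returns '{hash}', with '-dirty' appended if there are unsaved edits,
    or None outside a git repository (or when git is unavailable).
    """
    try:
        commit = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        status = subprocess.run(
            ["git", "status", "--porcelain"],
            capture_output=True, text=True, check=True,
        ).stdout
    except (subprocess.CalledProcessError, FileNotFoundError):
        return None
    return commit + ("-dirty" if status.strip() else "")

version = git_version_sketch()
```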

rushd.io.infile(filename, tag=None, should_hash=True)[source]

Wrap a filename, marking it as an input data file.

Passthrough wrapper around a path that (optionally) hashes and adds the file to an internally tracked list. This list accumulates files that potentially went into the creation of an output file.

Parameters:
  • filename (str or Path) – The filename of the input file to open.

  • tag (str (optional)) – A user-defined tag that organizes opened files.

  • should_hash (bool) – If the input file should be hashed. You may want to skip this if the file is extremely large.

Return type:

A Path object that represents the same file as filename.

rushd.io.outfile(filename, tag=None)[source]

Wrap a filename, declaring it as a tracked output file.

Passthrough method that writes a YAML file defining which files went into creating a certain output file.

Any needed subdirectories will be created if the outfile is relative to datadir or rootdir.

Parameters:
  • filename (str or Path) – An output filename to write data to.

  • tag (str) – A user-defined string that groups input and output files together.

Return type:

Path

Returns:

A Path object that represents the same file as filename.

Side effects:

For output file out.txt, writes a YAML file out.txt.yaml that encodes metadata of the following form:

```yaml
type: tracked_outfile
name: out.txt
date: 2022-01-31
git_version: 13a81aa2a7b1035f6b59c2323b0a7c457eb1657e
dependencies:
  - file: some_infile.csv
    path_type: datadir_relative
```
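A sketch of producing this kind of sidecar file by hand, with the YAML hand-formatted rather than written by a YAML library; the field names follow the example above, but the exact set rushd writes may differ:

```python
import datetime
import tempfile
from pathlib import Path

def write_sidecar(out_path, dependencies, git_version=None):
    """Write an out.txt.yaml-style sidecar next to an output file.

    Hand-formats the flat YAML shown above; an illustrative sketch of the
    side effect, not rushd's implementation.
    """
    out_path = Path(out_path)
    lines = [
        "type: tracked_outfile",
        f"name: {out_path.name}",
        f"date: {datetime.date.today().isoformat()}",
        f"git_version: {git_version}",
        "dependencies:",
    ]
    for dep in dependencies:
        lines.append(f"  - file: {dep}")
        lines.append("    path_type: datadir_relative")
    sidecar = out_path.with_name(out_path.name + ".yaml")
    sidecar.write_text("\n".join(lines) + "\n")
    return sidecar

out = Path(tempfile.mkdtemp()) / "out.txt"
sidecar = write_sidecar(out, ["some_infile.csv"])
```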

rushd.plot module

Various helper plotting functions that make data analysis easier.

rushd.plot.adjust_subplot_margins_inches(subfig, *, left=0.0, bottom=0.0, top=0.0, right=0.0)[source]

Adjust subplot margins to specified margins in inches.

This adjusts the extent of all subplot axes so that they are placed the specified number of inches from the edges of the subfigure.

Parameters:
  • subfig (matplotlib SubFigure) – Subfigure to be adjusted

  • left (float) – Left margin, in inches

  • bottom (float) – Bottom margin, in inches

  • top (float) – Top margin, in inches

  • right (float) – Right margin, in inches

Return type:

None; modifies the subfigure in place.

rushd.plot.debug_axes(fig)[source]

Add debug artists that show subfigure axis alignment to a figure.

Parameters:

fig (matplotlib Figure) – Figure that contains subfigures with axes to show debug info for.

Return type:

None; modifies the figure in place.

rushd.plot.generate_xticklabels(df_labels, ax_col, label_cols, *, ax=None, align_ticklabels='center', align_annotation='right', linespacing=1.2)[source]

Create table-like x-axis tick labels based on provided metadata.

Parameters:
  • df_labels (Pandas DataFrame) – DataFrame of metadata related to original xticklabels. Columns are metadata values, including the x-axis value.

  • ax_col – Column of ‘df_labels’ that contains the original xticklabels. For seaborn plots, this should be equivalent to the column passed to the x variable.

  • label_cols (List) – List of columns of ‘df_labels’ to use to replace the xticklabels.

  • ax (Optional Matplotlib axes) – Axes to edit, uses current axes if none specified.

  • align_ticklabels ('left', 'center', or 'right') – Text alignment for multi-line xticklabels.

  • align_annotation ('left', 'center', or 'right') – Text alignment for multi-line annotations comprising the columns of ‘df_labels’ used to replace the xticklabels. These appear to the bottom left of the plot, with the bounding box right-aligned with the right of the yticklabels and aligned vertically center with the vertical center of the xticklabels.

  • linespacing (float) – Spacing between rows of the new xticklabels, as a multiple of the font size. Keeps matplotlib default (1.2) if not specified.

Return type:

None; modifies the axes in place.

rushd.plot.plot_mapping(mapping, *, plate_size=None, fig=None, style=None)[source]

Plot a single well mapping as projected onto a 96 well plate.

Parameters:
  • mapping (Dict[str, Any]) – The mapping from well names to condition values to plot.

  • plate_size (Optional[Tuple[int, int]] (default: None)) – The width and height of the plate, in number of wells. Defaults to a 96- or 384-well plate, depending on the size of the mapping.

  • fig (Optional[Figure] (default: None)) – A matplotlib Figure to use to plot on. If not specified, a new figure is created.

Return type:

A matplotlib.figure.Figure object encoding the plot.

rushd.plot.plot_well_metadata(filename, *, output_dir=None, plate_size=None, columns=None, style=None)[source]

Plot the specified metadata columns listed in a YAML file.

Parameters:
  • filename (str or pathlib.Path) – The path to the YAML file containing the mapping

  • output_dir (optional pathlib.Path) – If given, outputs plate maps as PNGs, PDFs, and SVGs into this folder. If not given, plots are plt.show’d interactively.

  • plate_size (optional Tuple[int,int]) – The width and height of the plate, in number of wells. Defaults to a 96- or 384-well plate.

  • columns (optional List[str]) – The list of columns to plot. If not specified, all metadata columns are plotted.

rushd.well_mapper module

Converts user-specified plate specifications into well maps.

Rationale

Helper module that parses plate specifications of the form:

```yaml
MEF-low: A1-E1
MEF-bulk: F1-H1, A2-H2, A3-B3
retroviral: A1-H12
```

and returns a dictionary that lets you map from well number to a plate specification.

This format allows for robust and concise description of plate maps.

Specification

While these plate maps can be concisely defined inside YAML or JSON files, this specification does not define an underlying format; it only deals with how to handle the specification.

A well specification is a string containing a comma-separated list of region specifiers. A region specifier is one of two forms, a single well form:

```
A1
B05
```

or a rectangular region form:

```
A1-A12
B05-D8
B05 - C02
```

As seen in these examples, the rectangular region form is distinguished by the presence of a hyphen between two single-well identifiers. Whitespace and leading zeros are allowed.

A well specification is first normalized by the software, where all whitespace characters are removed. The resulting string is split by commas, and further parsed as one of the region specifiers.

Within a single specifier, duplicate entries are ignored. That is, the following specifiers are all equivalent:

```
A5-B7
A5,A6,A7,B5,B6,B7
A5-B7,B6
A5-B7,B5-B7
```
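The normalization and expansion rules above can be sketched as follows. This is an illustrative parser, not rushd's actual implementation:

```python
import re

def expand_spec(spec):
    """Expand a well specification into a set of normalized well IDs.

    Follows the rules above: remove whitespace, split on commas, expand
    rectangular regions, and ignore duplicates.
    """
    wells = set()
    for region in "".join(spec.split()).split(","):
        endpoints = region.split("-")
        # Parse row letter and column number, tolerating leading zeros;
        # a single-well region has identical first and last endpoints.
        parsed = [re.fullmatch(r"([A-P])0*(\d+)", p)
                  for p in (endpoints[0], endpoints[-1])]
        (r1, c1), (r2, c2) = [(m.group(1), int(m.group(2))) for m in parsed]
        for row in range(min(ord(r1), ord(r2)), max(ord(r1), ord(r2)) + 1):
            for col in range(min(c1, c2), max(c1, c2) + 1):
                wells.add(f"{chr(row)}{col:02d}")
    return wells

# The four equivalent specifiers from the example above:
specs = ["A5-B7", "A5,A6,A7,B5,B6,B7", "A5-B7,B6", "A5-B7,B5-B7"]
expanded = [expand_spec(s) for s in specs]
print(all(e == expanded[0] for e in expanded))  # True
```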

A plate specification is either a dictionary (if order is not important) or a sequence of dictionaries (if order is important). Expressed in an underlying YAML format, the difference is:

```yaml
test: A5-A7
test2: A5-A9
```

which yields {‘test’: ‘A5-A7’, ‘test2’: ‘A5-A9’} and

```yaml
- test: A5-A7
- test2: A5-A9
```

which yields [{‘test’: ‘A5-A7’}, {‘test2’: ‘A5-A9’}]

This module reads either of these formats. It iterates over each of the well specifications, building up a dictionary that maps wells to conditions. If multiple well specifications overlap, then condition names are merged in the order in which they appear, separated by a separator (by default, a period). This allows very concise condition layouts, such as the following:

```yaml
conditions:
  MEF: A1-C12
  293: D1-F12
  untransformed: A1-D3
  experimental: A4-D12
```

which will return a well map of the form:

`{'A1': 'MEF.untransformed', ..., 'C10': '293.experimental'}`

Both the non-normalized (e.g. no leading zeros, A1) and normalized (e.g. with leading zeros, A01) forms are returned for mapping.
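The overlap-merging behavior can be sketched independently of region parsing. Here, hypothetical pre-expanded well sets stand in for the parsed specifications:

```python
# Hypothetical pre-expanded well sets standing in for parsed specifications;
# condition names are merged in order of appearance, joined by the separator.
conditions = {
    "MEF": {"A1", "A2"},
    "293": {"C10"},
    "untransformed": {"A1", "A2"},
    "experimental": {"C10"},
}

def build_mapping(conditions, separator="."):
    """Merge overlapping condition assignments, as described above."""
    mapping = {}
    for name, wells in conditions.items():
        for well in wells:
            mapping[well] = (
                mapping[well] + separator + name if well in mapping else name
            )
    return mapping

mapping = build_mapping(conditions)
# e.g. mapping['A1'] == 'MEF.untransformed', mapping['C10'] == '293.experimental'
```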

rushd.well_mapper.well_mapping(plate_spec, separator='.')[source]

Generate a well mapping given a plate specification.

Parameters:
  • plate_spec (dict or iterable) – Either a single dictionary containing well specifications, or an iterable (list, tuple, etc) that returns dictionaries or well specifications as items.

  • separator (str) – The separator to use for overlapping plate specifications

Return type:

A dictionary that maps wells to conditions.

Module contents

rushd: data management for humans.

Collection of helper modules for maintaining robust, reproducible data management.

rushd.infile(filename, tag=None, should_hash=True)[source]

Wrap a filename, marking it as an input data file.

Passthrough wrapper around a path that (optionally) hashes and adds the file to an internally tracked list. This list accumulates files that potentially went into the creation of an output file.

Parameters:
  • filename (str or Path) – The filename of the input file to open.

  • tag (str (optional)) – A user-defined tag that organizes opened files.

  • should_hash (bool) – If the input file should be hashed. You may want to skip this if the file is extremely large.

Return type:

A Path object that represents the same file as filename.

rushd.outfile(filename, tag=None)[source]

Wrap a filename, declaring it as a tracked output file.

Passthrough method that writes a YAML file defining which files went into creating a certain output file.

Any needed subdirectories will be created if the outfile is relative to datadir or rootdir.

Parameters:
  • filename (str or Path) – An output filename to write data to.

  • tag (str) – A user-defined string that groups input and output files together.

Return type:

Path

Returns:

A Path object that represents the same file as filename.

Side effects:

For output file out.txt, writes a YAML file out.txt.yaml that encodes metadata of the following form:

```yaml
type: tracked_outfile
name: out.txt
date: 2022-01-31
git_version: 13a81aa2a7b1035f6b59c2323b0a7c457eb1657e
dependencies:
  - file: some_infile.csv
    path_type: datadir_relative
```