rushd package
Submodules
rushd.ddpcr module
Common functions for analyzing ddPCR data in Pandas Dataframes.
Extracts data and metadata from .ddpcr files. Allows users to specify custom metadata applied via well mapping.
- exception rushd.ddpcr.DataPathError
Bases: RuntimeError
Error raised when the path to the data is not specified correctly.
- exception rushd.ddpcr.YamlError
Bases: RuntimeError
Error raised when there is an issue with the provided .yaml file.
- rushd.ddpcr.calculate_copy_number(df, exp_channel, ref_channel, gates, *, ref_copy_num=2.0)
Calculates copy number of an experimental target relative to a reference target.
Adds a column to the DataFrame with this computed value. Math is based on … TODO
- Parameters:
df (DataFrame) – Data on which to calculate. Must contain columns corresponding to the experimental and reference channels.
exp_channel (str) – Column in df containing measurements for the experimental target.
ref_channel (str) – Column in df containing measurements for the reference target.
gates (Dict[str, float]) – Gates specifying threshold for positive droplets, one for each experimental and reference channel.
ref_copy_num (float (default: 2.0)) – Known copy number of the reference gene. If not specified, defaults to 2.0 (diploid).
- Return type:
DataFrame
- Returns:
The original DataFrame with a new column ‘copy_num’ containing the computed values.
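The docstring above leaves the underlying math unspecified, so the following is only a sketch of one common ddPCR approach, not necessarily what rushd does. It assumes droplets are loaded according to Poisson statistics, so the mean copies per droplet is -ln(1 - fraction positive), and scales the experimental/reference ratio by the known reference copy number. The function names and the droplet counts are hypothetical.

```python
import math


def poisson_copies_per_droplet(n_positive, n_total):
    """Mean target copies per droplet, assuming Poisson loading (assumption)."""
    if n_total <= 0 or not 0 < n_positive < n_total:
        raise ValueError("need 0 < n_positive < n_total")
    return -math.log(1.0 - n_positive / n_total)


def copy_number_sketch(exp_positive, ref_positive, n_total, ref_copy_num=2.0):
    """Copy number of the experimental target relative to the reference."""
    lam_exp = poisson_copies_per_droplet(exp_positive, n_total)
    lam_ref = poisson_copies_per_droplet(ref_positive, n_total)
    return ref_copy_num * lam_exp / lam_ref


# Hypothetical counts: equal positives in both channels give the reference copy number.
example = copy_number_sketch(exp_positive=5000, ref_positive=5000, n_total=20000)
```

In this sketch the positive-droplet counts would come from applying the per-channel gates to the droplet amplitudes.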
- rushd.ddpcr.load_ddpcr(data_path, yaml_path=None, *, extract_metadata=True)
Load ddPCR data into DataFrame with associated metadata.
Generates a pandas DataFrame from a .ddpcr file, which is the file type for experiments on the BioRad QX100/QX200 machines. Adds columns for metadata encoded by a given .yaml file. Metadata is associated with the data based on well IDs extracted from the experiment data.
- Parameters:
data_path (str | Path) – Path to .ddpcr file.
yaml_path (str | Path | None (default: None)) – Path to .yaml file to use for associating metadata with well IDs. All metadata must be contained under the header ‘metadata’.
extract_metadata (bool | None (default: True)) – Whether to extract metadata from the .ddpcr file. If True, adds a subset of the metadata associated with each well in the BioRad software, namely sample names (numbered ‘Sample description’ fields, returned as numbered ‘condition’ keys) and targets for each channel/dye (returned as ‘[channel]_target’ keys).
- Return type:
DataFrame
- rushd.ddpcr.load_ddpcr_metadata(unzipped_path)
Load well metadata from an unzipped .ddpcr file.
Generates a metadata dict in the same format as the YAML well mapping, i.e., key -> {well -> value}. The columns are a subset of the metadata associated with each well in the BioRad software, namely sample names (numbered ‘Sample description’ fields, returned as numbered ‘sample_description’ keys) and targets for each channel/dye (returned as ‘[channel]_target’ keys).
- Parameters:
unzipped_path (Path) – Path to unzipped .ddpcr file.
- Return type:
Dict[Any, Any]
- Returns:
A dictionary that contains a well mapping for metadata extracted from the .ddpcr experiment.
rushd.flow module
Common functions for analyzing flow data in Pandas Dataframes.
Allows users to specify custom metadata applied via well mapping. Combines user data from multiple .csv files into a single DataFrame.
- exception rushd.flow.ColumnError
Bases: RuntimeError
Error raised when the data is missing a column specifying well IDs.
- exception rushd.flow.DataPathError
Bases: RuntimeError
Error raised when the path to the data is not specified correctly.
- exception rushd.flow.GroupsError
Bases: RuntimeError
Error raised when there is an issue with the data groups DataFrame.
- exception rushd.flow.MOIinputError
Bases: RuntimeError
Error raised when there is an issue with the provided dataframe.
- exception rushd.flow.MetadataWarning
Bases: UserWarning
Warning raised when the passed metadata is possibly incorrect, but valid.
- exception rushd.flow.RegexError
Bases: RuntimeError
Error raised when there is an issue with the file name regular expression.
- exception rushd.flow.YamlError
Bases: RuntimeError
Error raised when there is an issue with the provided .yaml file.
- rushd.flow.load_csv(data_path, filename_regex=None, *, columns=None, csv_kwargs={})
Load .csv data into DataFrame without additional metadata.
Generates a pandas DataFrame from a set of .csv files located at the given path, adding columns for metadata encoded in the data filenames.
- Parameters:
data_path (str | Path) – Path to directory containing data files (.csv).
filename_regex (str | None (default: None)) – Regular expression to use to extract metadata from data filenames. Any named capturing groups will be added as metadata. If not included, the filenames are assumed to follow this format (default export format from FlowJo): ‘export_[condition]_[population].csv’
columns (List[str] | None (default: None)) – If specified, only these columns are loaded out of the CSV files. This can drastically reduce the amount of memory required to load flow data.
csv_kwargs (Dict[str, Any] | None (default: {})) – Additional kwargs to pass to pandas ‘read_csv’. For instance, to skip rows or to specify alternate delimiters.
- Return type:
DataFrame
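To illustrate how named capturing groups can turn a filename into metadata, here is a minimal sketch using only the standard `re` module. The pattern below is a hypothetical stand-in that mirrors the ‘export_[condition]_[population].csv’ layout; it is not the regular expression rushd uses internally, and `metadata_from_filename` is an illustrative helper rather than part of the package.

```python
import re

# Hypothetical pattern for the 'export_[condition]_[population].csv' layout.
# Each named group becomes one metadata key.
FILENAME_REGEX = r"^export_(?P<condition>[^_]+)_(?P<population>[^_]+)\.csv$"


def metadata_from_filename(filename, pattern=FILENAME_REGEX):
    """Return a dict of named-group values, or None if the name does not match."""
    match = re.match(pattern, filename)
    if match is None:
        return None
    return match.groupdict()


meta = metadata_from_filename("export_A1_singlets.csv")
unmatched = metadata_from_filename("notes.txt")
```

Note that this simple pattern does not allow underscores inside the condition or population names; a custom regex would be needed for such filenames.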
- rushd.flow.load_csv_with_metadata(data_path, yaml_path, filename_regex=None, *, columns=None, csv_kwargs={})
Load .csv data into DataFrame with associated metadata.
Generates a pandas DataFrame from a set of .csv files located at the given path, adding columns for metadata encoded by a given .yaml file. Metadata is associated with the data based on well IDs encoded in the data filenames.
- Parameters:
data_path (str | Path) – Path to directory containing data files (.csv).
yaml_path (str | Path) – Path to .yaml file to use for associating metadata with well IDs. All metadata must be contained under the header ‘metadata’.
filename_regex (str | None (default: None)) – Regular expression to use to extract well IDs from data filenames. Must contain the capturing group ‘well’ for the sample well IDs. If not included, the filenames are assumed to follow this format (default export format from FlowJo): ‘export_[well]_[population].csv’
columns (List[str] | None (default: None)) – If specified, only these columns are loaded out of the CSV files. This can drastically reduce the amount of memory required to load flow data.
csv_kwargs (Dict[str, Any] | None (default: {})) – Additional kwargs to pass to pandas ‘read_csv’. For instance, to skip rows or to specify alternate delimiters.
- Return type:
DataFrame
- rushd.flow.load_groups_with_metadata(groups_df, base_path='', filename_regex=None, *, columns=None, csv_kwargs={})
Load .csv data into DataFrame with associated metadata by group.
Each group of .csv files may be located at a different path and be associated with additional user-defined metadata.
- Parameters:
groups_df (DataFrame) – Each row of the DataFrame is evaluated as a separate group. Columns must include ‘data_path’ and ‘yaml_path’, specifying absolute or relative paths to the group of .csv files and metadata .yaml files, respectively. Optionally, regular expressions for the file names can be specified for each group using the column ‘filename_regex’ (this will override the ‘filename_regex’ argument).
base_path (str | Path | None (default: '')) – If specified, the path that the data and yaml paths in groups_df are defined relative to.
filename_regex (str | None (default: None)) – Regular expression to use to extract well IDs from data filenames. Must contain the capturing group ‘well’ for the sample well IDs. Other capturing groups in the regex will be added as metadata. This value applies to all groups; to specify different regexes for each group, add the column ‘filename_regex’ to groups_df (this will override the ‘filename_regex’ argument). If not included, the filenames are assumed to follow this format (default export format from FlowJo): ‘export_[well]_[population].csv’
columns (List[str] | None (default: None)) – If specified, only these columns are loaded out of the CSV files. This can drastically reduce the amount of memory required to load flow data.
csv_kwargs (Dict[str, Any] | None (default: {})) – Additional kwargs to pass to pandas ‘read_csv’. For instance, to skip rows or to specify alternate delimiters.
- Return type:
DataFrame
- rushd.flow.load_well_metadata(yaml_path)
Load a YAML file and convert it into a well mapping.
- Parameters:
yaml_path (str | Path)
- Return type:
Dict[Any, Any]
- rushd.flow.moi(data_frame, color_column_name, color_cutoff, output_path=None, summary_method='median', *, scale_factor=1.0)
Calculate MOI information from FlowJo data with appropriate metadata.
Generates a pandas DataFrame of virus titers from a pandas DataFrame of FlowJo data.
- Parameters:
data_frame (DataFrame) – The pandas DataFrame to analyze. It must have the following columns:
  - condition: the conditions/types of virus being analyzed
  - replicate: the replicate of the data (can have all data as the same replicate)
  - starting_cell_count: the number of cells in the well at the time of infection
  - scaling: the dilution factor of each row
  - max_virus: the maximum virus added to that column
  scaling times max_virus should result in the volume of virus stock added to a well.
color_column_name (str) – The name of the column on which to gate infection.
color_cutoff (float) – The level of fluorescence on which to gate infection.
output_path (str | Path | None (default: None)) – The path to the output folder. If None, instead prints all plots to screen.
summary_method (Literal['mean'] | Literal['median'] (default: 'median')) – Whether to return the calculated titer as the mean or median of the replicates.
scale_factor (float (default: 1.0)) – Whether to scale down the Poisson fit by the given scale factor maximum.
- Return type:
DataFrame
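The documentation mentions a Poisson fit but does not spell out the model, so the sketch below shows only the textbook Poisson relationship between the fraction of infected cells and MOI, as an assumption about the kind of math involved. It is not the fitting procedure rushd performs across dilutions, and the helper names and numbers are hypothetical.

```python
import math


def moi_from_fraction_infected(fraction_infected):
    """MOI under a Poisson model: fraction infected = 1 - exp(-MOI)."""
    if not 0.0 <= fraction_infected < 1.0:
        raise ValueError("fraction_infected must be in [0, 1)")
    return -math.log(1.0 - fraction_infected)


def titer_sketch(fraction_infected, starting_cell_count, virus_volume):
    """Infectious units per unit volume of virus stock (single-well estimate)."""
    moi = moi_from_fraction_infected(fraction_infected)
    return moi * starting_cell_count / virus_volume


# Hypothetical well: about 63.2% infected cells corresponds to an MOI of 1.
example_moi = moi_from_fraction_infected(1 - math.exp(-1))
example_titer = titer_sketch(1 - math.exp(-1), starting_cell_count=10000, virus_volume=2.0)
```

Here `virus_volume` plays the role of scaling times max_virus, and the fraction infected would be the share of cells above color_cutoff in color_column_name.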
rushd.io module
A submodule implementing common IO handling mechanisms.
Rationale
File and folder management is a common problem when handling large datasets. You often want to separate out large data from your code. How do you keep track of where your data is, especially if moving between different computers/clusters?
rushd.io adds convenience functions to handle common cases, as well as writing metadata with your output files that identify input files.
- exception rushd.io.NoDatadirError
Bases: RuntimeError
No datadir.txt file found.
Error raised when rushd is unable to locate a datadir.txt path in the current file.
- rushd.io.cache_dataframe(cache_path)
Wrap caching functionality around a dataframe-generating function.
Notes
If you wrap a function that contains an invalidate keyword, this keyword will be removed when passed to your function!
- Parameters:
cache_path (Path | str) – The path at which the dataframe cache should be saved.
- Return type:
Callable[..., Callable[..., DataFrame]]
- Returns:
A function that generates a dataframe with optional caching. An extra keyword argument, ‘invalidate’, is added that invalidates the cache if needed.
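The decorator pattern described above can be sketched with the standard library alone. This is an illustration of the general technique (a decorator factory that consumes an `invalidate` keyword before calling the wrapped function), not rushd's implementation: it pickles an arbitrary Python object instead of a DataFrame, and `cache_result` and `build_table` are hypothetical names.

```python
import functools
import pickle
import tempfile
from pathlib import Path


def cache_result(cache_path):
    """Decorator factory: cache the wrapped function's return value on disk."""
    cache_path = Path(cache_path)

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, invalidate=False, **kwargs):
            # 'invalidate' is consumed here and never forwarded to func.
            if cache_path.exists() and not invalidate:
                with cache_path.open("rb") as f:
                    return pickle.load(f)
            result = func(*args, **kwargs)
            cache_path.parent.mkdir(parents=True, exist_ok=True)
            with cache_path.open("wb") as f:
                pickle.dump(result, f)
            return result

        return wrapper

    return decorator


calls = []
with tempfile.TemporaryDirectory() as tmp:

    @cache_result(Path(tmp) / "table.pkl")
    def build_table():
        calls.append(1)  # record each real (uncached) computation
        return [{"well": "A1", "value": 1.0}]

    first = build_table()                  # computed and written to the cache
    second = build_table()                 # read back from the cache
    third = build_table(invalidate=True)   # recomputed, cache overwritten
```

Because the wrapper swallows `invalidate`, a wrapped function that declares its own `invalidate` parameter would never receive it, which matches the warning in the Notes.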
- rushd.io.git_version()
Return the current version control state as a string.
The state is a string {hash}, with {-dirty} appended if there are edits that have not been saved. Returns None if the current working directory is not contained within a git repository.
- Return type:
str | None
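A comparable state string can be assembled from two standard git commands, as in the sketch below. This is an assumption about one reasonable way to do it, not a description of how rushd queries git; in particular, `git status --porcelain` also reports untracked files, which rushd may or may not count as dirty. `git_state` is a hypothetical helper.

```python
import subprocess


def git_state(cwd="."):
    """Return '<hash>' or '<hash>-dirty', or None outside a git repository."""
    try:
        commit = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=cwd, capture_output=True, text=True, check=True,
        ).stdout.strip()
        status = subprocess.run(
            ["git", "status", "--porcelain"],
            cwd=cwd, capture_output=True, text=True, check=True,
        ).stdout
    except (subprocess.CalledProcessError, FileNotFoundError):
        # Not a repository, no commits yet, or git is not installed.
        return None
    return commit + ("-dirty" if status.strip() else "")


state = git_state()
```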
- rushd.io.infile(filename, tag=None, should_hash=True)
Wrap a filename, marking it as an input data file.
Passthrough wrapper around a path that (optionally) hashes and adds the file to an internally tracked list. This list accumulates files that potentially went into creation of an output file.
- Parameters:
filename (str | Path) – The filename of the input file to open.
tag (str | None (default: None)) – A user-defined tag that organizes opened files.
should_hash (bool (default: True)) – If the input file should be hashed. You may want to skip this if the file is extremely large.
- Return type:
Path
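To show why hashing can be slow for very large inputs, the sketch below digests a file in fixed-size chunks, which reads every byte once without loading the whole file into memory. The choice of SHA-256 and the helper name are assumptions for illustration; the documentation does not state which hash rushd uses.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest_sketch(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks (SHA-256 is an assumed choice)."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as f:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "some_infile.csv"
    sample.write_bytes(b"well,value\nA1,1.0\n")
    sample_digest = file_digest_sketch(sample)
```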
- rushd.io.outfile(filename, tag=None)
Wrap a filename, declaring it as a tracked output file.
Passthrough method that writes a YAML file defining which files went into creating a certain output file.
Any needed subdirectories will be created if the outfile is relative to datadir or rootdir.
- Parameters:
filename (str | Path) – An output filename to write data to.
tag (str | None (default: None)) – A user-defined string that groups input and output files together.
- Return type:
Path
- Returns:
A Path object that represents the same file as filename.
Side effects
For output file out.txt, writes a YAML file out.txt.yaml that encodes the following type of metadata:
type: tracked_outfile
name: out.txt
date: 2022-01-31
git_version: 13a81aa2a7b1035f6b59c2323b0a7c457eb1657e
dependencies:
  - file: some_infile.csv
    path_type: datadir_relative
rushd.plot module
Various helper plotting functions that make data analysis easier.
- rushd.plot.adjust_subplot_margins_inches(subfig, *, left=0.0, bottom=0.0, top=0.0, right=0.0)
Adjust subplot margins to specified margins in inches.
This adjusts the extent of all subplot axes so that they are placed the specified number of inches from the edges of the subfigure.
- Parameters:
subfig (matplotlib SubFigure) – Subfigure to be adjusted
left (float) – Left margin, in inches
bottom (float) – Bottom margin, in inches
top (float) – Top margin, in inches
right (float) – Right margin, in inches
- Return type:
None; modifies the subfigure in place.
- rushd.plot.debug_axes(fig)
Add debug artists to a figure that shows subfigure axis alignment.
- Parameters:
fig (Figure) – Figure that contains subfigures with axes to show debug info for.
- Return type:
None; modifies the figure in place.
- rushd.plot.generate_xticklabels(df_labels, ax_col, label_cols, *, ax=None, align_ticklabels='center', align_annotation='right', linespacing=1.2)
Create table-like x-axis tick labels based on provided metadata.
- Parameters:
df_labels (DataFrame) – DataFrame of metadata related to original xticklabels. Columns are metadata values, including the x-axis value.
ax_col – Column of ‘df_labels’ that contains the original xticklabels. For seaborn plots, this should be equivalent to the column passed to the x variable.
label_cols (List) – List of columns of ‘df_labels’ to use to replace the xticklabels.
ax (Axes | None (default: None)) – Axes to edit, uses current axes if none specified.
align_ticklabels (Literal['left'] | Literal['center'] | Literal['right'] | None (default: 'center')) – Text alignment for multi-line xticklabels.
align_annotation (Literal['left'] | Literal['center'] | Literal['right'] | None (default: 'right')) – Text alignment for multi-line annotations comprising the columns of ‘df_labels’ used to replace the xticklabels. These appear to the bottom left of the plot, with the bounding box right-aligned with the right of the yticklabels and aligned vertically center with the vertical center of the xticklabels.
linespacing (float | None (default: 1.2)) – Spacing between rows of the new xticklabels, as a multiple of the font size. Keeps matplotlib default (1.2) if not specified.
- Return type:
None; modifies the axes in place.
- rushd.plot.plot_mapping(mapping, *, plate_size=None, fig=None, style=None)
Plot a single well mapping as projected onto a 96 well plate.
- Parameters:
mapping (Dict[str, Any]) – The mapping from well names to condition values to plot.
plate_size (Tuple[int, int] | None (default: None)) – The width and height of the plate, in number of wells. Defaults to a 96- or 384-well plate, depending on the size of the mapping.
fig (Figure | None (default: None)) – A matplotlib Figure to use to plot on. If not specified, a new figure is created.
- Return type:
A matplotlib.figure.Figure object encoding the plot.
- rushd.plot.plot_well_metadata(filename, *, output_dir=None, plate_size=None, columns=None, style=None)
Plot the specified metadata columns listed in a YAML file.
- Parameters:
filename (Path | str) – The path to the YAML file containing the mapping.
output_dir (Path | None (default: None)) – If given, outputs plate maps as PNGs, PDFs, and SVGs into this folder. If not given, plots are plt.show’d interactively.
plate_size (Tuple[int, int] | None (default: None)) – The width and height of the plate, in number of wells. Defaults to a 96- or 384-well plate.
columns (List[str] | None (default: None)) – The list of columns to plot. If not specified, all metadata columns are plotted.
rushd.qpcr module
Common functions for analyzing qPCR data in Pandas Dataframes.
Allows users to specify custom metadata applied via well mapping.
- exception rushd.qpcr.ColumnError
Bases: RuntimeError
Error raised when the data is missing a required column.
- exception rushd.qpcr.DataPathError
Bases: RuntimeError
Error raised when the path to the data is not specified correctly.
- exception rushd.qpcr.GroupsError
Bases: RuntimeError
Error raised when there is an issue with the data groups DataFrame.
- exception rushd.qpcr.InputError
Bases: RuntimeError
Error raised when there is an issue with an argument type.
- exception rushd.qpcr.RegexError
Bases: RuntimeError
Error raised when there is an issue with the file name regular expression.
- exception rushd.qpcr.YamlError
Bases: RuntimeError
Error raised when there is an issue with the provided .yaml file.
- rushd.qpcr.calculate_input_amount(y, fit)
Given a cycle count (Cp, aka Ct value) and a linear regression fit, compute the amount of input.
Note that the linear regression fit is expected to have been performed on the log10-transform of the input amounts. Units of the returned value match those of the non-transformed input amount data.
- Parameters:
y (float) – Cycle count (Cp, aka Ct value).
fit (LinregressResult | List[float]) – Linear fit to use. Accepts either the output of a call to scipy.stats.linregress or a list of the fit values [slope, intercept].
- Return type:
float
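As a worked illustration of inverting a standard curve, the sketch below assumes the fit has the form Cp = slope × log10(amount) + intercept, i.e. that Cp is the dependent variable, which the parameter name `y` suggests but the documentation does not state outright. The helper name and the fit values are hypothetical.

```python
def input_amount_sketch(cp, slope, intercept):
    """Invert Cp = slope * log10(amount) + intercept for the amount."""
    return 10 ** ((cp - intercept) / slope)


# Hypothetical fit: slope -3.32 cycles per decade, intercept 30 cycles.
# A Cp of 23.36 then sits two decades above the unit amount.
amount = input_amount_sketch(23.36, slope=-3.32, intercept=30.0)
```

The result is in the same units as the amounts used to build the standard curve, consistent with the note above.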
- rushd.qpcr.calculate_standard(df, amt_col, cp_col, ax=None)
Calculate a standard curve for qPCR data.
For the given data, treats the values in ‘amt_col’ as input amounts and values in ‘cp_col’ as the corresponding cycle counts (Cp, aka Ct) from the qPCR output. Computes a linear regression on log10(amount) vs Cp, and returns this fit as well as the efficiency.
If axes are passed, plots the linear fit on the data, annotating the R^2 value and efficiency.
- Parameters:
df (DataFrame) – Data to use to fit.
amt_col (str) – Name of column containing input amounts.
cp_col (str) – Name of the column containing Cp values.
ax (matplotlib Axes | None (default: None)) – Axes on which to plot the data and fit.
- Return type:
List[scipy.stats._stats_py.LinregressResult, float]
- Returns:
A tuple of the fit (output of a call to scipy.stats.linregress) and the calculated efficiency (float).
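The fitting step can be sketched without scipy using ordinary least squares on log10(amount) versus Cp. The efficiency formula used here, E = 10^(-1/slope) - 1, is the conventional qPCR definition and is an assumption; the documentation does not say which definition rushd applies or whether it reports a fraction or a percentage. The function name and the dilution series are hypothetical.

```python
import math


def fit_standard_curve_sketch(amounts, cps):
    """Least-squares fit of Cp against log10(amount), plus efficiency."""
    xs = [math.log10(a) for a in amounts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(cps) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, cps))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Conventional definition (assumed): 1.0 means perfect doubling per cycle.
    efficiency = 10 ** (-1.0 / slope) - 1.0
    return slope, intercept, efficiency


# Hypothetical ten-fold dilution series with perfect doubling each cycle.
amounts = [1e1, 1e2, 1e3, 1e4]
cps = [30.0 - math.log2(a) for a in amounts]
slope, intercept, efficiency = fit_standard_curve_sketch(amounts, cps)
```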
- rushd.qpcr.convert_mass_to_moles(mass, length)
For a given mass of DNA, use its length to calculate the amount in moles.
Formula from NEB: mol = g / (bp x 615.94 g/mol/bp + 36.04 g/mol)
moles dsDNA = mass of dsDNA (g) / (molecular weight of dsDNA (g/mol))
molecular weight of dsDNA = (number of base pairs of dsDNA x average molecular weight of a base pair) + 36.04 g/mol
average molecular weight of a base pair = 615.94 g/mol, excluding the water molecule removed during polymerization and assuming deprotonated phosphate hydroxyls
the additional 36.04 g/mol accounts for the 2 -OH and 2 -H added back to the ends
bases are assumed to be unmodified
- Parameters:
mass (float | List[float]) – Mass of dsDNA in grams.
length (float | List[float]) – Number of base pairs of the dsDNA (or average length of a heterogeneous sample).
- Return type:
float | List[float]
- rushd.qpcr.convert_moles_to_mass(moles, length)
For a given amount of DNA in moles, use its length to calculate its mass.
Formula from NEB: g = mol x (bp x 615.94 g/mol/bp + 36.04 g/mol)
mass of dsDNA (g) = moles dsDNA x (molecular weight of dsDNA (g/mol))
molecular weight of dsDNA = (number of base pairs of dsDNA x average molecular weight of a base pair) + 36.04 g/mol
average molecular weight of a base pair = 615.94 g/mol, excluding the water molecule removed during polymerization and assuming deprotonated phosphate hydroxyls
the additional 36.04 g/mol accounts for the 2 -OH and 2 -H added back to the ends
bases are assumed to be unmodified
- Parameters:
moles (float | List[float]) – Amount of dsDNA in moles.
length (float | List[float]) – Number of base pairs of the dsDNA (or average length of a heterogeneous sample).
- Return type:
float | List[float]
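The two NEB formulas quoted in these docstrings are inverses of each other, which the following sketch makes concrete for scalar inputs. The constants come directly from the text above; the function names are hypothetical stand-ins, and list inputs are not handled here.

```python
import math

BP_WEIGHT = 615.94   # g/mol per base pair, from the formula above
END_WEIGHT = 36.04   # g/mol for the 2 -OH and 2 -H at the ends


def dsdna_molecular_weight(length):
    """Molecular weight (g/mol) of unmodified dsDNA with the given length in bp."""
    return length * BP_WEIGHT + END_WEIGHT


def moles_to_mass_sketch(moles, length):
    return moles * dsdna_molecular_weight(length)


def mass_to_moles_sketch(mass, length):
    return mass / dsdna_molecular_weight(length)


# 1 pmol of a 1000 bp fragment, converted to grams and back.
mass_g = moles_to_mass_sketch(1e-12, 1000)
round_trip = mass_to_moles_sketch(mass_g, 1000)
```

For this 1000 bp example the molecular weight is 615976.04 g/mol, so 1 pmol weighs roughly 0.616 µg.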
- rushd.qpcr.load_plates_with_metadata(groups_df, base_path='', filename_regex=None, *, well_column='well', columns=None, csv_kwargs={}, is_default=False)
Load data from multiple plates into a DataFrame with associated metadata.
Each plate is a .csv file with well IDs encoded in one of the data columns.
- Parameters:
groups_df (DataFrame) – Each row of the DataFrame is evaluated as a separate plate. Columns must include ‘data_path’ and ‘yaml_path’, specifying absolute or relative paths to the .csv files and metadata .yaml files, respectively. Optionally, regular expressions for the file names can be specified for each file using the column ‘filename_regex’ (this will override the ‘filename_regex’ argument).
base_path (str | Path | None (default: '')) – If specified, the path that the data and yaml paths in groups_df are defined relative to.
filename_regex (str | None (default: None)) – Regular expression to use to extract metadata from data filenames. This value applies to all groups; to specify different regexes for each group, add the column ‘filename_regex’ to groups_df (this will override the ‘filename_regex’ argument). If not included, filename information will not be added as metadata.
well_column (str | None (default: 'well')) – Name of the column containing well IDs.
columns (List[str] | None (default: None)) – If specified, only these columns are loaded out of the .csv files. This can drastically reduce the amount of memory required to load flow data.
csv_kwargs (Dict[str, Any] | None (default: {})) – Additional kwargs to pass to pandas ‘read_csv’. For instance, to skip rows or to specify alternate delimiters.
is_default (bool | None (default: False)) – If True, will override ‘well_column’, ‘columns’, and ‘csv_kwargs’ with defaults for plates with the format exported from Roche LightCycler 480II.
- Return type:
DataFrame
- rushd.qpcr.load_single_csv_with_metadata(data_path, yaml_path, *, well_column='well', columns=None, csv_kwargs={}, is_default=False)
Load .csv data into DataFrame with associated metadata.
Generates a pandas DataFrame from a single .csv file located at the given path, adding columns for metadata encoded by a given .yaml file. Metadata is associated with the data based on well IDs encoded in one of the data columns.
Note that this uses pandas ‘read_csv’, so it is compatible with .tsv and .txt files with the appropriate kwargs.
- Parameters:
data_path (str | Path) – Path to directory containing data files (.csv or similar).
yaml_path (str | Path) – Path to .yaml file to use for associating metadata with well IDs. All metadata must be contained under the header ‘metadata’.
well_column (str | None (default: 'well')) – Name of the column containing well IDs.
columns (List[str] | None (default: None)) – If specified, only these columns are loaded out of the .csv files. This can drastically reduce the amount of memory required to load flow data.
csv_kwargs (Dict[str, Any] | None (default: {})) – Additional kwargs to pass to pandas ‘read_csv’. For instance, to skip rows or to specify alternate delimiters.
is_default (bool | None (default: False)) – If True, will override ‘well_column’, ‘columns’, and ‘csv_kwargs’ with defaults for plates with the format exported from Roche LightCycler 480II.
- Return type:
DataFrame
rushd.well_mapper module
Converts user-specified plate specifications into well maps.
Rationale
Helper module that parses plate specifications of the form:
```yaml
MEF-low: A1-E1
MEF-bulk: F1-H1, A2-H2, A3-B3
retroviral: A1-H12
```
and returns a dictionary that lets you map from well number to a
plate specification.
This format allows for robust and concise description of plate maps.
Specification
While these plate maps can be concisely defined inside YAML or JSON files, this specification does not define an underlying format; it only deals with how to handle the specification.
A well specification is a string containing a comma-separated list of region specifiers. A region specifier is one of two forms, a single well form:
or a rectangular region form:
As seen in these examples, the rectangular region form is distinguished by the presence of a hyphen between two single-well identifiers. Whitespace and leading zeros are allowed.
A well specification is first normalized by the software, where all whitespace characters are removed. The resulting string is split by commas, and further parsed as one of the region specifiers.
Within a single specifier, duplicate entries are ignored. That is, the following specifiers are all equivalent:
A plate specification is either a dictionary (if order is not important) or a sequence of dictionaries (if order is important). The difference between these in a YAML underlying format is:
```yaml
test: A5-A7
test2: A5-A9
```
which yields {‘test’: ‘A5-A7’, ‘test2’: ‘A5-A9’} and
```yaml
- test: A5-A7
- test2: A5-A9
```
which yields [{‘test’: ‘A5-A7’}, {‘test2’: ‘A5-A9’}]
This module reads either of these formats. It iterates over each of the well specifications, building up a dictionary that maps wells to conditions. If multiple well specifications overlap, then condition names are merged in the order in which they appear, separated by a separator (by default, a period). This allows very concise condition layouts, such as the following:
```yaml
MEF: A1-C12
293: D1-F12
untransformed: A1-D3
experimental: A4-D12
```
will return a well map of the form:
```
{'A1': 'MEF.untransformed', ..., 'D10': '293.experimental'}
```
Both the non-normalized (e.g. no leading zeros, A1) and normalized (e.g. with leading zeros, A01) forms are returned for mapping.
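The parsing and merging rules in this specification can be sketched in a few dozen lines. The code below is an independent illustration written from the description above, not rushd's implementation: the function names are hypothetical, reversed rectangle corners are tolerated as an assumption, and the normalized form is assumed to be a two-digit zero-padded column.

```python
import re


def parse_well(text):
    """Split a single-well identifier such as 'B05' into ('B', 5)."""
    match = re.fullmatch(r"([A-Za-z])0*([1-9]\d*)", text)
    if match is None:
        raise ValueError(f"not a single-well identifier: {text!r}")
    return match.group(1).upper(), int(match.group(2))


def expand_spec(spec):
    """Expand a well specification into an ordered list of unique wells."""
    wells = []
    # Normalize by removing all whitespace, then split into region specifiers.
    for region in re.sub(r"\s+", "", spec).split(","):
        corners = [parse_well(part) for part in region.split("-")]
        if len(corners) > 2:
            raise ValueError(f"bad region specifier: {region!r}")
        rows = [ord(row) for row, _ in corners]
        cols = [col for _, col in corners]
        # A single well is treated as a 1x1 rectangle.
        for row in range(min(rows), max(rows) + 1):
            for col in range(min(cols), max(cols) + 1):
                well = f"{chr(row)}{col}"
                if well not in wells:  # duplicates within a spec are ignored
                    wells.append(well)
    return wells


def well_mapping_sketch(plate_spec, separator="."):
    """Map wells to merged condition names, in order of appearance."""
    items = plate_spec if isinstance(plate_spec, (list, tuple)) else [plate_spec]
    mapping = {}
    for spec_dict in items:
        for condition, spec in spec_dict.items():
            for well in expand_spec(spec):
                name = str(condition)
                mapping[well] = f"{mapping[well]}{separator}{name}" if well in mapping else name
    # Also expose the zero-padded form (e.g. 'A01') for every well.
    for well, value in list(mapping.items()):
        mapping[f"{well[0]}{int(well[1:]):02d}"] = value
    return mapping


layout = {"MEF": "A1-C12", 293: "D1-F12", "untransformed": "A1-D3", "experimental": "A4-D12"}
wells = well_mapping_sketch(layout)
```

With this layout, rows E and F fall only inside the 293 region, so those wells map to the bare name '293' with no separator.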
- rushd.well_mapper.well_mapping(plate_spec, separator='.')
Generate a well mapping given a plate specification.
- Parameters:
plate_spec (Dict[Any, str] | List[Dict[Any, str]] | Tuple[Dict[Any, str]]) – Either a single dictionary containing well specifications, or an iterable (list, tuple, etc) that returns dictionaries or well specifications as items.
separator (str (default: '.')) – The separator to use for overlapping plate specifications.
- Return type:
Dict[str, Any]
Module contents
rushd: data management for humans.
Collection of helper modules for maintaining robust, reproducible data management.
- rushd.infile(filename, tag=None, should_hash=True)
Wrap a filename, marking it as an input data file.
Passthrough wrapper around a path that (optionally) hashes and adds the file to an internally tracked list. This list accumulates files that potentially went into creation of an output file.
- Parameters:
filename (str | Path) – The filename of the input file to open.
tag (str | None (default: None)) – A user-defined tag that organizes opened files.
should_hash (bool (default: True)) – If the input file should be hashed. You may want to skip this if the file is extremely large.
- Return type:
Path
- rushd.outfile(filename, tag=None)
Wrap a filename, declaring it as a tracked output file.
Passthrough method that writes a YAML file defining which files went into creating a certain output file.
Any needed subdirectories will be created if the outfile is relative to datadir or rootdir.
- Parameters:
filename (str | Path) – An output filename to write data to.
tag (str | None (default: None)) – A user-defined string that groups input and output files together.
- Return type:
Path
- Returns:
A Path object that represents the same file as filename.
Side effects
For output file out.txt, writes a YAML file out.txt.yaml that encodes the following type of metadata:
type: tracked_outfile
name: out.txt
date: 2022-01-31
git_version: 13a81aa2a7b1035f6b59c2323b0a7c457eb1657e
dependencies:
  - file: some_infile.csv
    path_type: datadir_relative