Loading flow data
rushd provides several convenient ways to load flow data.
We recommend that you put all metadata into a YAML file and load it based
on auto-generated well-ID information, but you can also load metadata
directly from CSV filenames.
Metadata in YAML
You can load extra metadata using a YAML file that defines which wells had which conditions / treatments / cell lines / etc.
For example, this YAML file specifies circuit syntax and dox treatment:
metadata:
  syntax:
    - tandem: A1-A12
    - convergent: B1-B12
    - divergent: C1-C12
  dox_ng:
    - 0: A1-C6
    - 1000: A7-C12
All metadata conditions must be placed inside a top-level
key called metadata. Within it, you can define arbitrary mappings
from condition values onto ranges of wells.
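To build intuition for how these range strings map onto wells, here is a simplified sketch of rectangular range expansion (illustrative only; rushd performs this mapping internally, and this helper is not part of its API):

```python
import string

def expand_range(well_range: str) -> list:
    """Expand a rectangular well range like "A1-C6" into individual well IDs.

    Every well whose row and column fall between the two corners is included.
    """
    start, end = well_range.split("-")
    row0, col0 = start[0], int(start[1:])
    row1, col1 = end[0], int(end[1:])
    rows = string.ascii_uppercase
    return [
        f"{rows[r]}{c}"
        for r in range(rows.index(row0), rows.index(row1) + 1)
        for c in range(col0, col1 + 1)
    ]

# In the YAML above, wells A1-C6 (rows A-C, columns 1-6) receive dox_ng = 0
print(expand_range("A1-C6"))
```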
To load a plate of flow data, call rushd with a path to the folder containing
the .csv files and a path to this YAML file:
df = rd.flow.load_csv_with_metadata(rd.datadir/"exp01"/"csv_export", rd.datadir/"exp01"/"metadata.yaml")
Alternatively, you can specify a .zip file containing the CSV data. Let’s assume you zipped all of the CSVs into
one file, called csvs.zip. Then, you can load this as:
df = rd.flow.load_csv_with_metadata(rd.datadir/"exp01"/"csvs.zip", rd.datadir/"exp01"/"metadata.yaml")
Finally, you can zip the metadata and CSVs together. Let’s say you have the following zip file:
exp01.zip/
├── metadata.yaml
└── export/
├── export_A1_singlets.csv
├── export_A2_singlets.csv
├── ...
└── export_G12_singlets.csv
You can load this by specifying each path as a tuple of the zip file and the path within it:
df = rd.flow.load_csv_with_metadata((rd.datadir/"exp01.zip", "export"), (rd.datadir/"exp01.zip", "metadata.yaml"))
Check out the documentation for this function for more options, like loading only certain columns:
- rushd.flow.load_csv_with_metadata(data_path, yaml_path, filename_regex=None, *, columns=None, csv_kwargs=None)[source]
Load .csv data into DataFrame with associated metadata.
Generates a pandas DataFrame from a set of .csv files located at the given path, adding columns for metadata encoded by a given .yaml file. Metadata is associated with the data based on well IDs encoded in the data filenames.
- Parameters:
data_path (location of the .csv files) – Either a directory containing .csv files, a zip file containing .csv files, or a path within a zip file containing .csv files.
yaml_path (either a path to a .yaml file or a path within a zip file to a .yaml) – Path to .yaml file to use for associating metadata with well IDs. All metadata must be contained under the header ‘metadata’.
filename_regex (str or raw str (optional)) – Regular expression to use to extract well IDs from data filenames. Must contain the capturing group ‘well’ for the sample well IDs. If not included, the filenames are assumed to follow this format (default export format from FlowJo):
export_[well]_[population].csv
columns (list of strings (optional)) – If specified, only these columns are loaded out of the .csv files. This can drastically reduce the amount of memory required to load flow data.
csv_kwargs (dict (optional)) – Additional kwargs to pass to pandas
read_csv. For instance, to skip rows or to specify alternate delimiters.
- Returns:
A single pandas DataFrame containing all data with associated metadata.
- Return type:
DataFrame
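If your exported filenames do not follow the default FlowJo pattern, you can pass a custom filename_regex; it must contain a named capturing group called well. A quick way to sanity-check a regex before loading (the filename format below is hypothetical):

```python
import re

# Hypothetical export format: <date>_<well>_<population>.csv
filename_regex = r"(?P<date>\d{8})_(?P<well>[A-H]\d+)_(?P<population>.+)\.csv"

match = re.match(filename_regex, "20230101_A1_singlets.csv")
print(match.group("well"))  # "A1" — the group rushd uses to join metadata
```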
Finally, you can use any of these data loading techniques with the multi-plate loading function:
- rushd.flow.load_groups_with_metadata(groups_df, base_path='', filename_regex=None, *, columns=None, csv_kwargs=None)[source]
Load .csv data into DataFrame with associated metadata by group.
Each group of .csv files may be located at a different path and be associated with additional user-defined metadata.
- Parameters:
groups_df (Pandas DataFrame) – Each row of the DataFrame is evaluated as a separate group. Columns must include ‘data_path’ and ‘yaml_path’, specifying absolute or relative paths to the group of .csv files and metadata .yaml files, respectively. Optionally, regular expressions for the file names can be specified for each group using the column ‘filename_regex’ (this will override the filename_regex argument).
base_path (str or Path (optional)) – If specified, path that data and yaml paths in groups_df are defined relative to.
filename_regex (str or raw str (optional)) – Regular expression to use to extract well IDs from data filenames. Must contain the capturing group well for the sample well IDs. Other capturing groups in the regex will be added as metadata. This value applies to all groups; to specify different regexes for each group, add the column ‘filename_regex’ to groups_df (this will override the filename_regex argument). If not included, the filenames are assumed to follow this format (default export format from FlowJo):
export_[well]_[population].csv
columns (list of strings (optional)) – If specified, only these columns are loaded out of the .csv files. This can drastically reduce the amount of memory required to load flow data.
csv_kwargs (dict (optional)) – Additional kwargs to pass to pandas
read_csv. For instance, to skip rows or to specify alternate delimiters.
- Returns:
A single pandas DataFrame containing data from all groups with associated metadata.
- Return type:
DataFrame
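For example, to load two plates that each have their own metadata file, you might build groups_df like this (paths are hypothetical, and the final load call assumes those files exist on disk):

```python
import pandas as pd

# Each row describes one group: where its CSVs live and which YAML applies.
groups = pd.DataFrame({
    "data_path": ["exp01/csv_export", "exp02/csv_export"],
    "yaml_path": ["exp01/metadata.yaml", "exp02/metadata.yaml"],
    # Extra columns are carried into the output as per-group metadata.
    "replicate": [1, 2],
})

# df = rd.flow.load_groups_with_metadata(groups, base_path=rd.datadir)
```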
Metadata in filenames
If you ran a tube experiment or otherwise have metadata specified in filenames, you can use a function that loads CSVs and extracts the metadata directly from the filenames.
Let’s say that we have some files that have metadata in their filenames, like:
export_BFP_100_singlets.csv
export_GFP_1000_singlets.csv
where we want to extract the construct and the dox concentration. Developing the regex is beyond the scope here;
use https://regex101.com to evaluate the regex. In this case, a regex that works is ^.*export_(?P<construct>.+)_(?P<dox>[0-9]+)_(?P<population>.+)\.csv
regex = r"^.*export_(?P<construct>.+)_(?P<dox>[0-9]+)_(?P<population>.+)\.csv"
df = rd.flow.load_csv(rd.datadir/"exp02", regex)
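Before loading, it can help to confirm the regex captures what you expect on one of your filenames. Note that regex captures arrive as strings, so numeric metadata like dox typically needs an explicit conversion after loading:

```python
import re

regex = r"^.*export_(?P<construct>.+)_(?P<dox>[0-9]+)_(?P<population>.+)\.csv"

# Check the regex against one of the example filenames from above
m = re.match(regex, "export_GFP_1000_singlets.csv")
print(m.groupdict())

# After loading, convert the captured string to a number, e.g.:
# df["dox"] = df["dox"].astype(int)
```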
You can see more details of this function below:
- rushd.flow.load_csv(data_path, filename_regex=None, *, columns=None, csv_kwargs=None)[source]
Load .csv data into DataFrame without additional metadata.
Generates a pandas DataFrame from a set of .csv files located at the given path, adding columns for metadata encoded in the data filenames.
- Parameters:
data_path (str or Path or a tuple with a str/Path to a zip file and a str folder inside) – Path to a directory, zip file, or folder within a zip file containing the data files (.csv)
filename_regex (str or raw str (optional)) – Regular expression to use to extract metadata from data filenames. Any named capturing groups will be added as metadata. If not included, the filenames are assumed to follow this format (default export format from FlowJo):
export_[condition]_[population].csv
columns (list of strings (optional)) – If specified, only these columns are loaded out of the .csv files. This can drastically reduce the amount of memory required to load flow data.
csv_kwargs (dict (optional)) – Additional kwargs to pass to pandas
read_csv. For instance, to skip rows or to specify alternate delimiters.
- Returns:
A single pandas DataFrame containing all data with associated filename metadata.
- Return type:
DataFrame