API Reference

Library Prep

Library Prep Pipeline

exception beers.library_prep.library_prep_pipeline.BeersLibraryPrepValidationException[source]
class beers.library_prep.library_prep_pipeline.LibraryPrepPipeline[source]

The class runs all the steps in the library_prep pipeline as described and wired together in the configuration file. The point of entry into this class is the static method main().

execute(configuration: dict, global_config: dict, packet_file: str, output_quant_file: str, input_quant_file: str, log_paths: str, input_molecule_packet: MoleculePacket, rng: Generator, full_logs: bool)[source]

Performs the Sequence Pipeline on a MoleculePacket.

Parameters:
  • configuration – json-like object of the configuration data relevant to this pipeline stage.

  • global_config – json-like object of the full configuration data

  • packet_file – path to write out the packet file to

  • output_quant_file – path to write out the quantified counts of the lib prep output

  • input_quant_file – path to write out the quantified counts file to

  • log_paths – list of paths to output the pipeline step logs to, in the same order as the steps appear in the config file

  • input_molecule_packet – the molecule packet to run through this pipeline stage

  • rng – random number generator to use

  • full_logs – whether to output molecule logs at intermediate steps

static main(seed: str, configuration: str, global_configuration: str, packet_file: str, output_quant_file: str, input_quant_file: str, log_paths: str, molecule_packet_filename: str | None, packet_id: str, sample_id: str, distribution_directory: str | None = None, molecules_per_packet_from_distribution: int = 10000, full_logs: bool = False)[source]

This method would be called by a command line script in the bin directory. It sets a random seed, loads a directory containing the relevant parts of the user’s configuration file, unmarshalls a molecule packet from the provided molecule packet filename, initializes and validates the library prep pipeline stage and then executes it for the molecule packet.

Parameters:
  • seed – value to use as the seed for the random number generator

  • configuration – the json string containing the configration data specific to the library prep pipeline

  • global_configuration – json string containing the configuration data for the whole run

  • packet_file – path to write out the packet file to

  • output_quant_file – path to write out the quantified counts of the lib prep output

  • input_quant_file – path to write out the quantified counts file to

  • log_paths – list of paths to output the pipeline step logs to, in the same order as the steps appear in the config file

  • molecule_packet_filename – the file from which to unmarshall the molecule packet

  • packet_id – id number to assign the packet

  • sample_id – id number of the sample

  • distribution_directory – directory of CAMPAREE output distribution datas from which to generate molecules (default None, must supply molecule_packet_filename instead)

  • molecules_per_packet_from_distribution – packet size to generate if using distributions

  • full_logs – whether to output all logs for intermediate steps If True, output may be very large

print_summary(original_ids: set[str], sample: list[beers_utils.molecule.Molecule], elapsed_time: float | None = None)[source]

Output a summary of the sample (number of molecules, time taken, etc.).

original_ids:

Molecule IDs from input molecule packet

sample:

list of molecules in the sample

elapsed_time:

elapsed time to display, if not None

static validate(configuration: dict, global_config: dict)[source]

Static method to run each step validate process to identify errant parameters. If any errors are found, a BeersLibraryPrepValidationException is raised. Prints error messages to stderr.

Parameters:
  • configuration – json-like object containing the SequencePipeline-specific configuration

  • global_config – json-like object containing the full config of the BEERS run

PolyA Selection Step

class beers.library_prep.polya_step.PolyAStep(parameters, global_config)[source]

This step simulates the polyA selection step intended to separate mRNA from all other RNA. It includes biases but idealized behavior is available by not specifying parameter values and the default values provide idealized behavior. Any parameter not specified via the configuration file will default to its idealized setting.

Configuration Example:

# Chance per base of fragmentation prior to selection
# Increasing this above 0 induces a 3' bias.
# A value of 0.001 induces a reasonably high 3' bias
breakpoint_prob_per_base: 0.0

# Probability of retention is computed as a value between
# min_retention_prob and max_retention_prob
# For every base of the polyA tail beyond min_polya_tail_length
# the probability increases linearly from min_retention_prob
# by length_retention_prob, up to a max of max_polya_tail_length.
max_retention_prob: 1.0
min_retention_prob: 0.0
min_polya_tail_length: 40
length_retention_prob: 0.05
apply_three_prime_bias(molecule: Molecule, tail_length: int, note: str, rng: Generator) str[source]

Model for polyA selection 3’ bias. Assuming that polyA tail plays no role in truncation of 5’ end.

Parameters:
  • molecule – molecule to evaluate for truncation

  • tail_length – length of polyA tail, which may be 0

  • note – comment added to log

  • rng – random number generator

Returns:

note to add to log

Return type:

str

First Strand Synthesis Step

class beers.library_prep.first_strand_synthesis_step.FirstStrandSynthesisStep(parameters, global_config)[source]

First Strand Synthesis Step

A library preparation step that places random primers onto the RNA molecules and then extends the primers to obtain cDNA.

Primer location can be biased by read sequence (see the ‘position probability matrix’ parameter). Even when not biased, priming can be either perfect or imperfect. Perfect means that the first base will always have a primer bind to it and therefore the entire molecule will be perfectly duplicated as cDNA. Imperfect means that primers are placed randomly (at a rate of ‘primers_per_kb’) and the 5’-most one is extended to create the molecule, potentially losing some of the 5’ end of the molecule.

Configuration example:

parameters:
    # If 'perfect_priming' is true, then we always prime
    # exactly on the 5' most end of the molecule and the
    # cDNA matches perfectly the original molecule
    # If true, all other parameters are ignored.
    perfect_priming: false

    # PPM matrix describing biases of the priming sites
    # Values must sum to 1 in each base. Length of the PPM
    # determines the primer lengths.
    # See Kasper et all, 2010 PMID: 20395217 for example
    # observation of these biases in fragment start positions.
    # If no bias is desired, set all values to 0.25.
    position_probability_matrix:
        A: [0.50, 0.1, 0.40, 0.30, 0.25, 0.15]
        C: [0.20, 0.5, 0.3 , 0.25, 0.25, 0.15]
        G: [0.15, 0.1, 0.15, 0.25, 0.25, 0.20]
        T: [0.15, 0.3, 0.15, 0.20, 0.25, 0.50]

    # Rate of priming: approximate number of primers that
    # will bind (if bias-free) per kilobase of the fragment
    # Higher numbers will loose less of the 5' side.
    primes_per_kb: 50

Second Strand Synthesis Step

class beers.library_prep.second_strand_synthesis_step.SecondStrandSynthesisStep(parameters, global_config)[source]

Second Strand Synthesis Step

A library preparation step that places random primers onto the first strand cDNA and then extends the primers to obtain double-stranded DNA.

Primer location can be biased by read sequence (see the ‘position probability matrix’ parameter). Even when not biased, priming can be either perfect or imperfect. Perfect means that the first base will always have a primer bind to it and therefore the entire molecule will be perfectly duplicated as cDNA. Imperfect means that primers are placed randomly (at a rate of ‘primers_per_kb’) and the 5’-most one is extended to create the molecule, potentially losing some of the 5’ end of the molecule.

Fragmentation Step

class beers.library_prep.fragment_step.FragmentStep(parameters, global_config)[source]

Fragmentation step breaks each molecule into pieces

Two fragmentation methods are available:
  1. ‘uniform’ fragments at each base equally and is the default. Note that fragment length distributions will not be uniform. Instead, the fragment locations are chosen uniformly.

  2. ‘beta’ biases fragment location by position within the fragment and can bias not fragmenting already short fragments NOTE: using ‘beta’ can generate unusual coverage plots since there are significant edge effects around the transcript. However, it can be used to give a more realistic fragment distribution.

Configuration

The ‘lambda’ parameter is the rate parameter for an exponential distribution which determines the time it takes until a molecule to fragments. Molecules which would take longer then ‘runtime’ to fragment, do not fragment. Molecules may also fragment multiple times if lambda is high enough.

Since fragmentation generates many very small fragments that will be quickly lost in the following steps, we have the option to drop those fragments immediately. Setting min_frag_size can significantly decrease runtime and memory requirements.

If method == ‘beta’, then the fragmentation sites can be biased within the molecule, according to a beta distribution. If using ‘beta’, you must also specify the following:

The parameters for the beta distribution, beta_A and beta_B. Setting these so that A = B > 0 to bias towards fragmentation in the middle of the fragment, with larger values biasing further towards the middle. If A > B, then would bias towards the 5’ end instead.

And the beta_N factor allows a non-linear fragmentation rate depending upon the length of the molecule. Values >1 indicate that longer molecules are more likely to fragment than smaller ones, biasing towards larger fragment sizes.

Config example:

parameters:
    # Fragmentation methods are available.
    # 'uniform' fragments at each base equally and is the default
    method: uniform
    # The 'lambda' parameter is the rate parameter for an exponential distribution
    # which determines the time it takes until a molecule to fragments.
    # Molecules which would take longer then 'runtime' to fragment, do not fragment.
    # Molecules may also fragment multiple times if lambda is high enough.
    lambda: 0.005
    runtime: 1
    # Since fragmentation generates many very small fragments that will
    # be quickly lost in the following steps, we have the option to
    # drop those fragments immediately. Setting this value can significantly
    # decrease runtime and memory requirements.
    min_frag_size: 20

    # If method == 'beta', then the fragmentation sites can be biased within
    # the molecule, according to a beta distribution.
    # NOTE: using 'beta' can generate unusual coverage plots since there are
    # significant edge effects around the transcript. However, it can be used
    # to give a more realistic fragment distribution.
    #
    # If using 'beta', you must also specify the following:
    #
    # The parameters for the beta distribution. Set these so that
    # A = B > 0 to bias towards fragmentation in the middle of the fragment,
    # with larger values biasing further towards the middle.
    # If A > B, then would bias towards the 5' end instead.
    # beta_A: 3.0
    # beta_B: 3.0
    #
    # And the N factor allows a non-linear fragmentation rate depending
    # upon the length of the molecule. Values >1 indicate that longer molecules
    # are more likely to fragment than smaller ones, biasing towards larger
    # fragment sizes.
    # beta_N: 2.0

Sizing Step

class beers.library_prep.sizing_step.SizingStep(parameters, global_config)[source]

This step simulates filtering molecules by size.

Molecules are selected with probability that ranges piecewise linearly from 0 at ‘min_length’ up to 1 between ‘select_all_start_length’ and ‘select_all_end_length’ and then down to 0 again at ‘max_length’.

Diagram:

Probability of retention by length
                 __________________
   1|        ___/                  \___
p   |   ____/                          \____
   0|__/                                    \___
    ---------------------------------------------
       |          |               |          |
       min_length |               |          max_length
                  |   select_all  |
                start            end

If select_all_start_length or select_all_end_length are not provided then they are set to min_length and max_length respectively and the probability jumps from 0 to 1 immediately.

Config Example:

parameters:
    # See above diagram for meanings of these
    min_length: 100
    max_length: 400
    select_all_start_length: 200
    select_all_end_length: 300

PCR Amplification Step

class beers.library_prep.pcr_amplification_step.PCRAmplificationStep(parameters, global_config)[source]

PCR amplification step creates many copies of each molecule Since the number of molecules generated is quite large and since in real sequencing, only a small fraction of the molecules prepped end up forming clusters on the flowcell and being sequenced, we give the option to remove most of the PCR generated molecules prior to the end of the step. This provides a very large increase in speed and decrease in memory usage.

Config Example:

parameters:
    # Number of cycles to use
    # Generates 2^n fragments for each input fragment
    number_cycles: 10

    # Retain only this percent of the data
    # NOTE: scale is 0-100, not 0-1, so 0.08 means 0.0008 of the generated molecules are kep
    # This value can be approximated from real data with UMI tags by the following:
    # Compute the PCR duplication rate using the UMI tags.
    # Let N be the number of PCR steps. Assuming that each molecule generates 2^N
    # PCR descendants, if all were sequenced, we would expect all molecules to be duped
    # 2^N times. Instead, choose retention_percentage with the following strategy:
    #
    # '''
    # import scipy.stats
    # at_least_two = scipy.stats.binom(p=retention_percentage / 100, n=2**N).sf(1)
    # at_least_one = scipy.stats.binom(p=retention_percentage / 100, n=2**N).sf(0)
    # dupe_rate = at_least_two / at_least_one
    # # Choose retention_percentage so that dupe_rate is approximately the observed PCR dupe rate
    # '''
    #
    # For example, retention_percentage = 0.08 and number_cycles=10 gives dupe rate
    # of about 35%, which is pretty typical. Note that we are defining dupe rate as
    # the fraction of all reads after deduping that had at least one duplicate.
    # If number_cycles changes, this should be modified too to maintain dupe rate.
    retention_percentage: 0.08

    # Induce a GC bias by discarding some PCR duplicates
    # according to their GC bias. All fragments overall
    # GC content is computed and then the following three
    # parameters are used to compute a probability of retention
    # gc_bias_constant + gc_bias_linear*(gc - 0.5) + gc_bias_quadratic*(gc - 0.5)^2
    # clipped to always be within 0 and 1
    # For no bias, set to 1, 0, and 0 for const, linear, and quadratic, respectively.
    # To bias towards GC=0.5 content, set gc_bias_quadratic to be negative,
    # such as -100 for a large GC bias.
    gc_bias_constant: 1.0
    gc_bias_linear: 0.0
    gc_bias_quadratic: -100

    # During PCR, we have some chance of mis-copying the molecules
    # These are specified here as probability per-base, so 0.001 is one error per 1kb
    deletion_rate: 0.0001
    insertion_rate: 0.0001
    substitution_rate: 0.001
assign_id(new_molecule, ancestor_id)[source]

Molecule ids read as A.B.C.D and so on where the dot indicates a new step in which new molecules are generated. Since we do not know how many former steps have created molecules and since this step will only add one dot, a molecule created in this step will have an id of its parent plus a dot plus a number. Original molecules (parent molecule) will only have its original id. So we need to back up at most one “.nnn” to find the original id and we can use that to do a lookup and increment in the sample id counter held in this step. :param new_molecule: freshly created molecule :param ancestor_id: parent id of the new molecule (which may or may not be from the input sample)

execute(molecule_packet, rng, log)[source]

For each cycle, duplicates each molecule from the current list of molecules (including those created in prior cycles) and assigns a unique id. :param molecule_packet: packet containing sample info and molecules from the prior step :return: original molecule packet but with molecules from this step

static validate(parameters, global_config)[source]

Insure that the parameters supplied for this step are appropriate. In this case, the number of cycles parameter should be a counting integer no greater than the number of maximum cycles allowed. :return: True if validation passed, False otherwise

beers.library_prep.pcr_amplification_step.hypergeometric(ngood: int, nbad: int, nsamp: ndarray, rng: Generator)[source]

Sample from the hypergeometric distribution

See np.random.hypergeometric but allows nsamp to have zeros

Adapter Ligation Step

class beers.library_prep.adapter_ligation_step.AdapterLigationStep(parameters, global_config)[source]

Adpater Ligation attaches adapters to each end of the molecules These are used for the PCR step to initiate PCR and include the sample identifying barcodes.

This uses the adapters specified in the resources section as well as the sample i5/i7 barcodes in the samples section Each adapter flanks the corresponding sample barcode, which all flanks the original molecule So the molecule ends up looking like (5’ to 3’):

(pre_i5_adapter) (i5) (post_i5_adapter) (molecule sequence) (pre_i7_adapter) (i7) (post_i7_adapter)

The example config adapters and barcodes have been obtained from: https://support-docs.illumina.com/SHARE/AdapterSeq/Content/SHARE/AdapterSeq/TruSeq/UDIndexes.htm

Config Example:

samples:
    '1':
        # Sample one configuration
        barcodes:
            # Barcodes are used by AdapterLigationStep
            # Each sample should have unique (i5, i7) pair of barcodes
            # These get sequenced to determine which sample the read came from
            # Typical examples of these can be founda at:
            # https://support-docs.illumina.com/SHARE/AdapterSeq/Content/SHARE/AdapterSeq/TruSeq/UDIndexes.htm# TODO: set this value
            i5: AGCGCTAG
            i7: AACCGCGG
    '2':
        # Sample 2 configuration, same as above
        barcodes:
            i5: GATATCGA
            i7: TTATAACC


resources:
    pre_i5_adapter: AATGATACGGCGACCACCGAGATCTACAC
    post_i5_adapter: ACACTCTTTCCCTACACGACGCTCTTCCGATCT
    pre_i7_adapter: GATCGGAAGAGCACACGTCTGAACTCCAGTCAC
    post_i7_adapter: ATCTCGTATGCCGTCTTCTGCTTG

Sequence

Cluster

class beers.cluster.Cluster(cluster_id: int, molecule: ~beers_utils.molecule.Molecule, lane: int, coordinates: tuple[int, int, int], molecule_count: int = 1, diameter: int = 0, called_sequences: list[str] = <factory>, called_barcode: str | None = None, quality_scores: list[str] = <factory>, read_starts: list[int] = <factory>, read_cigars: list[str] = <factory>, read_strands: list[str] = <factory>, base_counts: ~numpy.ndarray | None = None)[source]

The cluster object represents initially, a single molecule bound to a flowcell which is ultimately amplified and read from either or both directions. As such it contains base counts for each position in the original molecule’s sequence, the sequence(s) called as a result of reading those base counts in either or both directions, and the sample barcodes (both 5’ and 3’ ends). A collection of these objects forms a cluster_packet.

cluster_id

The unique id of this cluster

Type:

int

molecule

The original molecule object from which the cluster is derived.

Type:

beers_utils.molecule.Molecule

lane

The flowcell lane where this cluster is found

Type:

int

coordinates

The other flowcell coordinates (tile, x, y) identifying the cluster’s location in the flowcell

Type:

tuple[int, int, int]

molecule_count

The total number of molecules contained within the cluster (starts a 1 and geometrically increases with each bridge amplification cycle

Type:

int

diameter

The physical space take up by the cluster (currently unused)

Type:

int

called_sequences

An array of sequences that result from single or paired end reads. The first sequence in the array is the forward read. The second is the reverse read.

Type:

list[str]

called_barcode

A string representing the read 5’ barcode + 3’ barcode - used in the flowcell header

Type:

str | None

quality_scores

An array of quality score sequences corresponding to the called sequences array.

Type:

list[str]

read_starts

An array of alignment start position, one for each read, aligning them to the reference genome

Type:

list[int]

read_cigars

An array of alignment cigar strings, one for each read, aligning them to the reference genome

Type:

list[str]

read_strands

An array of alignment strand strings, one for each read, aligning them to the reference genome

Type:

list[str]

base_counts

An 4xn array of counts for each of the 4 nt bases for each position in the original molecule. In the order of A, C, G, T counts. The initial base counts array exactly matches the original molecule sequence.

Type:

numpy.ndarray | None

assign_coordinates(coordinates: tuple[int, int, int])[source]

Assign a lane coordinates object to the cluster.

Parameters:

coordinates – (tile, x, y)

static deserialize(data: str, skip_base_counts: bool = False)[source]

Method to re-render the result of the serialize method above back into a complete cluster object (without losing or scrambling state)

Parameters:
  • data – The string data containing the serialized version of the object

  • skip_base_counts – If True, don’t load the base_count data (for memory efficiency)

Returns:

The Cluster object created from the data provided.

Return type:

Cluster

encode_sequence_identifier() str[source]

Creates a sequence identifier for the cluster based upon the information contained in the cluster. The instrument is BEERS. The run id is the identifier set by the user when running the software. The lane and the coordinates define the location on the flowcell. The flowcell id is just filler for now.

Return type:

The sequence identifier as defined for the FASTQ format used here.

generate_fasta_header(direction: int)[source]

Assembles the cluster data into a FASTQ header for the given direction (forward or reverse)

Parameters:

direction (int, either 1 or 2) – Direction of the read. 1 is the ‘forward’ read, 2 is the ‘reverse’ read. If single ended reads, just use 1.

get_base_counts_by_position(index: int) tuple[int, int, int, int][source]

Conveninece method to gather base counts by index (position)

Parameters:

index – index (position) of interest

Returns:

tuple of base counts in ACGT order.

Return type:

ndarray

initialize_base_counts()[source]

Generates basecounts from the starting molecule

serialize() str[source]

Method to serialize data into a string. The initial line contains single value attrbitutes and is preceded by a ‘#’. The second line contains serialized versions of the coordinates and the molecules and is also preceded by a ‘#’. Next, may be 0 to 2 lines preceded by ‘##’, each containing a called sequence and a called quality score. The remaining line have no preceding hashes but list the base counts for each position on the molecule.

Returns:

The serialized string output.

Return type:

str

class beers.cluster_packet.ClusterPacket(cluster_packet_id: int, sample: Sample, clusters: list[beers.cluster.Cluster])[source]

The cluster packet object is the medium of exchange between steps in the sequence pipeline. Each step execution in the sequence pipeline accepts a cluster packet as input and returns a chuster packet, usually modified somehow, as output. The cluster packet is primarily composed of clusters that are derived from molecules. The cluster packet itself is derived from the molecule packet that contained those precursor molecules to begin with. At the end of the pipeline, when FASTQ reports are being created the data contained in the cluster packet is used to populate those FASTQ reports.

static deserialize(file_path: str, skip_base_counts: bool = False) ClusterPacket[source]

Deserialize the compressed data found in the gzipped file located via the given file path, into a fully restored object of the ClusterPacket class.

Parameters:
  • file_path – The locatation of the gzipped file containing the serialized object.

  • skip_base_counts – if True, don’t load base counts (for memory efficiency)

Return type:

ClusterPacket

serialize(file_path: str)[source]

Cluster packets are serialized and saved to the file system and likewise de-serialized from the file system. The serialized data is written, in compressed form, to a gzip file. The first line contains the cluster packet id and the serialized sample data, prepended with a ‘#’. Following that, each cluster is serialized and added to the file.

Parameters:

file_path – location of the file into which the serialized, compressed data is to go.

Flowcell

class beers.flowcell.Flowcell(parameters: dict, rng: ~numpy.random._generator.Generator, min_coords: dict = <factory>, max_coords: dict = <factory>, available_lanes: list[int] = <factory>, lanes_to_use: list[int] = <factory>, flowcell_lanes: dict[str, beers.flowcell.FlowcellLane] = <factory>, coordinate_generators: dict[str, typing.Generator] = <factory>)[source]

Models a flowcell. MoleculePackets can be loaded and are converted to ClusterPackets with assigned coordinates for all the clusters.

A flowcell’s geometry (i.e., coordinate ranges for lane, tile, x, y) are either provided via the configuration The flowcell retention rate is specified as a percentage in the configuration file. The user may restrict which lanes of the flowcell to use via the configuration. Otherwise, all the lanes described by the flowcell’s geometry are used. The individual flowcell lanes to be used are instantiated as FlowcellLane objects, each of which keeps track of the coordinates used for its given lane. Each lane has its own coordinate generator to produce new coordinates for each newly retained molecule.

convert_molecule_pkt_to_cluster_pkt(molecule_packet)[source]

Takes the molecule packet containing only retained molecules and evenly apportions those molecules across the flowcell lanes available for use (i.e., first group is attached to the first lane to use, the seocnd group to the second lane to use and so on). Each molecule is then used to create a cluster, identified by the lane and the tile,x, and y coordinates as provided by the lane’s coordinate generator. The clusters are rolled up into a cluster packet, which again picks up the molecule packet’s sample information. So the cluster packet, like the molecule packet from which it was derived, also applies to one sample only. :param molecule_packet: The molecule packet now containing only those molecule retained by the flowcell. :return: A cluster packet containing the retained molecules as cluster objects with a lane and set of coordinates identified.

generate_coordinates_distinct(lane)[source]

A Python generator that randomly selects tile, x, and y coordinates and uses them to create a coordinates tuple (tile, x, y). The new lane coordinates object is compared with all the consumed coordinates in the given lane. If the object is in the consumed coordinates, it is discarded and the random selection repeated. Once a selection is found to be unique, it is added to the consumed coordinates for the given lane and the object is returned. If no set of unique coordinates can be found in 100 attempts, an exception is thrown. :param lane: the lane for which a unique combination of tile, x, and y cooredinates is to be selected :return: a unique combination of coordinates for the lane given.

generate_coordinates_random(lane)[source]

A Python generator that randomly selects tile, x, and y coordinates and uses them to create a coordinates tuple (tile, x, y). May generate the same coordinate location multiple times, so not appropriate for simulations that are intended for uses that depend upon coordinates. Lower memory usage than generate_coordinates_distinct. :param lane: the lane for which a unique combination of tile, x, and y cooredinates is to be selected :return: a unique combination of coordinates for the lane given.

load_flowcell(molecule_packet)[source]

The entry point into the flowcell object. Aside from the instantiation and validation, all other bound methods are helper methods. The portion of the molecules of the molecule packet retained by the flowcell replaces the original molecule collection and that diminished molecule packet is converted into a cluster packet, where each molecule is converted into a cluster having a unique set of flowcell coordinates. :param molecule_packet: incoming molecule packet (from library prep or a simulated library prep output) :return: cluster packet contained those retained molecules affixed to the flowcell via unique coordinates.

set_flowcell_coordinate_ranges()[source]

Determines the minimum and maximum range for each flowcell coordinate, including the lane. If the ranges are supplied via the configuration file, those ranges (min and max) are rendered as dictionaries. Otherwise, each input FASTQ file is applied to obtain the same ranges.

validate()[source]

Insures the the configuration file parameters relating to the flowcell are appropriate. Otherwise one or more error messages are crafted. :return: A tuple indicating whether or not the flowcell configuration is valid and a diagnostic message for the user.

class beers.flowcell.FlowcellLane(lane: int, consumed_coordinates: set = <factory>)[source]

These objects partly compose the flowcell object, representing the lanes that the current BEERS run is using. The object keeps track of all used coordinates to prevent duplication.

lane

Lane number

Type:

int

consumed_coordinates

set of coordinates already used

Type:

set

Sequence Pipeline

exception beers.sequence.sequence_pipeline.BeersSequenceValidationException[source]
class beers.sequence.sequence_pipeline.SequencePipeline[source]

The class runs all the steps in the sequence pipeline as described and wired together in the configuration file. The point of entry into this class is the static method main().

execute(configuration: dict, global_config: dict, input_cluster_packet: ClusterPacket, output_packet_path: str, log_paths: str, rng: Generator)[source]

Performs the Sequence Pipeline on a ClusterPacket.

Parameters:
  • configuration – json-like object of the configuration data relevant to this pipeline stage.

  • global_config – json-like object of the full configuration data

  • input_cluster_packet – the cluster packet to run through this pipeline stage

  • output_packet_path – the path to serialize the results to

  • log_paths – list of paths to write out the logs to, in the same order as each the steps appear in the config file

  • rng – random number generator to use

static main(seed: int, configuration: dict, global_configuration: dict, input_packet_path: str, output_packet_path: str, log_paths: str)[source]

Prepares the pipeline, loads the cluster packet, and then executes() the sequence pipeline.

Parameters:
  • seed – value to use as the seed for the random number generator.

  • configuration – the json-like object containing the configration data specific to the sequencing prep pipeline

  • global_configuration – json-like object containing the full configuratino of the entire run

  • input_packet_path – the file path to the cluster packet to read in

  • output_packet_path – the file path to output the packet to

  • log_paths – list of paths to write out the logs to, in the same order as each the steps appear in the config file

static validate(configuration: dict, global_config: dict)[source]

Static method to run each step validate process to identify errant parameters. If any errors are found, a BeersSequenceValidationException is raised. Prints error messages to stderr

Parameters:
  • configuration – json-like object containing the SequencePipeline-specific configuration

  • global_config – json-like object containing the full config of the BEERS run

Bridge Amplification Step

class beers.sequence.bridge_amplification_step.BridgeAmplificationStep(step_log_file_path: str, parameters: dict, global_config: dict)[source]

This step serves to simulate the amplification of the molecule in each cluster on the flowcell. Introducing substitutions is allowed via a substitution_rate parameter.

Config Example:

# Number of cycles to perform. Generates 2^N molecules in the cluster
# via a PCR-like bridge amplification process.
cycles: 10
# Substitution rate per base in the cluster formation
# No indels are generated here, for computation simplicity.
substitution_rate: 0.01
execute(cluster_packet: ClusterPacket, rng: Generator) ClusterPacket[source]

Amplifies the molecule in the cluster for each cluster in the cluster packet by the given number of cycles and keeps track of which base positions have incurred substitutions

Parameters:
  • cluster_packet – The input cluster packet

  • rng – Random number generator instance

Return type:

The updated cluster packet

beers.sequence.bridge_amplification_step.multinomial(n: ndarray, p: list, rng: Generator)[source]

Partially vectorized version of np.random.multinomial, see https://stackoverflow.com/questions/55818845/fast-vectorized-multinomial-in-python

Parameters:
  • n – 1-d array of non-negative integers denoting the ‘number of experiments’

  • p – 1-d array of probabilities of each possibility being drawn

Returns:

A 2-d array where the first dimension corresponds to the possible draws (length of p) and the second to the length of n.

Return type:

ndarray

Sequence by Synthesis Step

class beers.sequence.sequence_by_synthesis_step.SequenceBySynthesisStep(step_log_file_path: str, parameters: dict, global_config: dict)[source]

Sequence by Synthesis Step

Given a bridge-amplified cluster packet, generates the reads from them. Reads are determined by flourescence as bases are added one-by-one. We approximate this, allowing for phasing (where some molecules are further ahead or behind others in the sequencing steps).

Flourescence reads are then turned into base calls. Then the values are filled in on the Clusters for the reads, in: - called_sequences - called_barcodes - quality_scores - read_starts - read_cigars - read_strands Which together give all the read information necessary to make a fastq and the ideal true alignment necessary to report as a SAM file with true alignment.

Config Example:

parameters:
    # Determines which read is the forward and which is the reverse read
    # NOTE: currently only 'true' is implemented
    forward_is_5_prime: true
    # Whether to sequence both ends or just one
    # NOTE: currently only 'true' is implemented
    paired_ends: true
    # Length of the reads to generate, in bases
    read_length: 100
    # The rate of phasing, either forward (skip) or backwards (drop)
    # per base per molecule
    skip_rate: 0.002
    drop_rate: 0.002
call_bases(flourescence: ndarray, epsilon_est: float, cross_talk_est_inv: ndarray) tuple[str, str][source]

From a flourescence reading, call sequences bases and quality score From an approximation of the Bustard algorithm

Parameters:
  • flourescence – 2d array of shape (4, read_length) with flourescence values for each of the 4 frequencies for each base read

  • epsilon_est – estimate for EPSILON, the noise size in the flourescence imaging

  • cross_talk_est_inv – 4x4 inverse of the estimate of the 4x4 cross talk matrix

Return type:

called sequence and quality scores (as phred score string)

execute(cluster_packet: ClusterPacket, rng: Generator) ClusterPacket[source]

Execute the Sequence By Synthess step on one packet :param cluster_packet: The input cluster packet :param rng: Random number generator instance

Return type:

The updated cluster packet

read_flourescence(cluster: Cluster, range_start: int, range_end: int, direction: str, rng: Generator) tuple[numpy.ndarray, int, str, str][source]

Generates flourescence readings from sequence-by-synthesis over the range given (specified from 5’ to 3’ if direction = ‘+’ else from 3’ to 5’). Use call_bases to then get bases and quality scores from the read

Parameters:
  • cluster – cluster to read by sequence-by-synthesis

  • range_start – The starting position (from the 5’ end), 0-based

  • range_end – The ending position (exlusive, from the 5’ end), 0-based

  • direction – ‘+’ if reading from the 5’-3’ direction, ‘-’ if reading from 3’-5’ direction

  • rng – The random number generator

Return type:

(flourescence, read_start, read_cigar, read_strand)

beers.sequence.sequence_by_synthesis_step.get_frac_skipped_py(rate: float, max_skips: int, molecule_count: int, read_len: int, rng: Generator) ndarray[source]

Compute the array of length (read_len, max_skips+1) that indicate how many of the molecule_count molecules have incurred exactly i skips by the jth base in (i,j) element

Parameters:
  • rate – Positive number indicating rate of skipping

  • max_skips – Maximum number of skipps we consider

  • molecule_count – Number of molecules in the cluster

  • read_len – lenght of reads

  • rng – Random number generator instance

beers.sequence.sequence_by_synthesis_step.get_inv_phasing_matrix(read_len: int, skip_rate: float, drop_rate: float) ndarray[source]

Computes the inverse of Q, the phasing matrix

The (j,t) entry of Q gives the probability of a template terminating at position j after t cycles

Cached for speed-ups. Since reads are all the same length, they all get the same matrix and caching is very useful.

Parameters:
  • read_len – lenght of reads

  • skip_rate – Rate of skipping events (forward phasing)

  • drop_rate – Rate of skipping events (reverse phasing)

Return type:

Inverse of the phasing matrix

FastQ and SAM generation

class beers.fast_q.FastQ(flowcell: Flowcell, sample_id: str, sample_barcode: str)[source]

The FastQ object generates a FastQ report for the run for a given flowcell and for each direction in the case of paired end reads. This report function is called by the controller but ONLY when all cluster packets have been processed.

generate_report(cluster_packet_paths, output_file_paths, bad_barcode_file_paths, sort_by_coordinates=False)[source]

The principal method of this object generates one or two reports depending upon whether paired end reads are called for. All the information needed to create the FASTQ files is found in the cluster packets themselves. For each cluster packet, we identify whether there is one called sequence or two (paired ends). The first called sequence in the list is always the forward one. We find each cluster in the cluster packet that is affixed to the lanes of interest. A fasta header is generated internally for each of those remaining clusters for the given direction. The remaining clusters are sorted by their coordinates and each entry is written to the FASTQ file - header, called sequence, + quality score.

Output files are named according to barcode_S#_L#_R#.fastq specifying sample, lane and read direction numbers.

Parameters:
  • cluster_packet_paths – list of cluster packet file paths to a generate report from

  • output_file_paths – list of two (one for R1/R2 read directions) lists of sam/bam file to output to for each lane, in the same order as config.flowcell.lanes_to_use

  • bad_barcode_file_paths – list of two (one for R1/R2 read directions) list of sam/bam files to output to for each lane, in the same order as config.flowcell.lanes_to_use

  • sort_by_coordinates – Whether to sort output by coordinates, as would typically be done with a fastq file from Illumina. Default: False. Setting this to True will consume considerably more memory as all reads are read in at once.

class beers.sam.SAM(flowcell: Flowcell, sample_id: str, sample_barcode: str)[source]

The SAM object generates a SAM/BAM report for the run for a given flowcell and for each direction in the case of paired end reads. This report function is called by the controller but ONLY when all cluster packets have been processed.

generate_report(cluster_packet_paths, output_file_paths, bad_barcode_file_paths, reference_seqs, BAM=False, sort_by_coordinates=False)[source]

The principal method of this object generates one or two reports depending upon whether paired end reads are called for. All the information needed to create the SAM files is found in the cluster packets themselves. For each cluster packet, we identify whether there is one called sequence or two (paired ends). The first called sequence in the list is always the forward one. We find each cluster in the cluster packet that is affixed to the lanes of interest. The remaining clusters are sorted by their coordinates and each entry is written to the SAM file.

Parameters:
  • cluster_packet_paths – list of cluster packet file paths to a generate report from

  • output_file_paths – list of sam/bam file to output to for each lane, in the same order as config.flowcell.lanes_to_use

  • bad_barcode_file_paths – list of sam/bam files to output to for each lane, in the same order as config.flowcell.lanes_to_use

  • reference_seqs – dictionary mapping reference names to reference sequences, used for SAM header

  • BAM – if true, output in BAM format, else SAM (default)

  • sort_by_coordinates – Whether to sort output by coordinates, as would typically be done with a fastq file from Illumina. Default: False. Setting this to True will consume considerably more memory as all reads are read in at once.