Overview
Workflow
1. Copy Data
If you are running the pipeline on EBI infrastructure, it will first copy the data from the original log file location to the path you specify.
Currently, the original log files are stored in a location that can only be read from the datamover queue. So, as the first step, the pipeline copies (rsync) the log files to the location you specified, which can be accessed from the standard queue.
Once this job completes, it automatically launches the next dependent job, which processes the log files and performs the statistical analysis.
Running for the first time
The first run can take 2-3 hours to copy the log files; subsequent runs take only a few minutes, since rsync transfers only new or changed files.
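For illustration, here is a minimal sketch of what the copy step amounts to. The source and destination paths are hypothetical placeholders; the actual locations depend on where the original logs live and on the path you configure for the pipeline.

```python
import subprocess

# Hypothetical paths: the real source is the datamover-only log area,
# the real destination is the path you configured for the pipeline.
SOURCE = "/path/to/original/logs/"
DESTINATION = "/your/configured/path/logs/"

# rsync -a copies recursively and only transfers new or changed files,
# which is why the first run is slow and later runs are quick.
subprocess.run(["rsync", "-a", SOURCE, DESTINATION], check=True)
```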
2. Process Log Files
This step collects the names of the log files, processes them in parallel, and applies a series of filters to exclude unwanted data. The processed logs are stored in Parquet, a columnar storage format optimized for reading and writing large datasets.
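As an illustration of this step, the sketch below reads plain-text logs in parallel with Dask, applies one example filter, and writes the result as Parquet. The column names, delimiter, and filter condition are hypothetical; the real schema and filtering rules are defined by the pipeline.

```python
import dask.dataframe as dd

# Hypothetical log layout: whitespace-separated fields with these columns.
logs = dd.read_csv(
    "/your/configured/path/logs/*.log",
    sep=" ",
    names=["timestamp", "ip", "status", "resource"],
)

# Example filter only: keep successful requests and drop internal traffic.
filtered = logs[(logs.status == 200) & (~logs.ip.str.startswith("10."))]

# Write the filtered data as Parquet for the statistics step.
filtered.to_parquet("/your/configured/path/parquet/", write_index=False)
```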
3. Produce Statistics Report
Using the Dask framework, the Parquet files are queried and the statistics are computed. The resulting report is generated in HTML format and stored in the location you specified.
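A minimal sketch of the querying step is shown below, reusing the hypothetical `resource` column from the previous example; the statistics the pipeline actually produces are more extensive than this single aggregation.

```python
import dask.dataframe as dd

# Load the Parquet output of the processing step.
df = dd.read_parquet("/your/configured/path/parquet/")

# Example statistic: number of requests per resource, most requested first.
counts = df.groupby("resource").size().compute().sort_values(ascending=False)

# Write a simple HTML table; the real report bundles many such statistics.
counts.to_frame("requests").to_html("report.html")
```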
Detailed workflow steps can be found in the workflow documentation.