Quick Start
We use a very simple task definition file and submit it to an LSF batch system.
Hint
The default batch system currently is LSF, so if you do not change it, LSF will be used. Check out Batch Processing for more information.
Our task will be very simple: we want to create 100 files, each containing a random number. Later, we will compute the average of those numbers.
Open a code editor and create a new file
simple-example.py
with the following content:

    import b2luigi
    import random


    class MyNumberTask(b2luigi.Task):
        some_parameter = b2luigi.IntParameter()

        def output(self):
            return b2luigi.LocalTarget(f"results/output_file_{self.some_parameter}.txt")

        def run(self):
            random_number = random.random()

            with self.output().open("w") as f:
                f.write(f"{random_number}\n")


    if __name__ == "__main__":
        b2luigi.set_setting("result_dir", "results")
        b2luigi.process([MyNumberTask(some_parameter=i) for i in range(100)], workers=200)

Each building block in (b2)luigi is a b2luigi.Task. It defines (with its run function) what should be done. A task can have parameters, as in our case the some_parameter defined in line 6. Each task needs to define what it will output in its output function.
Note
We have defined a result path in the script with
b2luigi.set_setting("result_dir", "results")
You can ignore that for now - we will come back to it later.
In our run function, we generate a random number and write it to the output file, which is named after the parameter of the task and stored in a result folder.
Hint
For those of you who have already used luigi, most of this will seem familiar. In fact, b2luigi’s task is a superset of luigi’s, so you can reuse your old scripts! b2luigi does not care which one you are using. But we strongly advise you to use b2luigi’s task, as it has some additional features (see below).
Please note that we could have imported b2luigi with
import b2luigi as luigi
to make the transition between b2luigi and luigi even simpler.
Call the newly created file with python:
python simple-example.py --batch
Instead of giving the batch parameter as a command-line argument, you can also add it to the
b2luigi.process(.., batch=True)
call.
Each task will be scheduled as a batch job in your LSF queue. Using the dependency management of luigi, the batch jobs are only scheduled when all of their dependencies are fulfilled, which saves you some unneeded CPU time on the batch system. This means that although you have requested 200 workers, only 100 are needed to process the tasks, so only 100 batch jobs will be started. Only the scheduling mechanism runs on your local machine, and it needs just a small fraction of a single CPU.
Hint
If you have no LSF queue ready or you do not want to run on the batch system, you can also remove the batch argument. This will fall back to a normal luigi execution. Please see Batch Processing for more information on batch execution and a discussion of other batch systems.
After the job is completed, you will see something like:

    ===== Luigi Execution Summary =====

    Scheduled 100 tasks of which:
    * 100 ran successfully:
        - 100 MyNumberTask(some_parameter=0,1,10,11,12,13,14,15,16,17,18,...)

    This progress looks :) because there were no failed tasks or missing dependencies

    ===== Luigi Execution Summary =====

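Once the run has finished, a quick sanity check of the results can be reassuring. The following is a minimal sketch using only the Python standard library. It assumes the file naming from the first listing (results/output_file_<i>.txt); the helper name check_number_files is hypothetical and not part of b2luigi.

```python
import glob
import os


def check_number_files(result_dir="results", expected=100):
    """Read all output_file_*.txt files and check their count and value range.

    Hypothetical helper for illustration only; not part of b2luigi.
    """
    pattern = os.path.join(result_dir, "output_file_*.txt")
    numbers = []
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            numbers.append(float(f.read()))

    if len(numbers) != expected:
        raise RuntimeError(f"expected {expected} files, found {len(numbers)}")
    # random.random() returns values in the half-open interval [0, 1)
    if not all(0.0 <= x < 1.0 for x in numbers):
        raise RuntimeError("found a number outside [0, 1)")
    return numbers
```

Calling check_number_files() from the folder that contains results/ should return the 100 numbers, or raise an error if files are missing.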
The log files for each task are written to the logs folder.
After a job is submitted, b2luigi will check whether it is still running and handle failed or finished tasks correctly.
The output file names you define will in most cases depend on the parameters of the task, as you do not want tasks with different parameters to overwrite each other's files. However, this means you always need to include all parameters in the file name to keep them distinct. b2luigi can handle this cumbersome work automatically, which also helps you organize your files at no cost. This is especially useful in larger projects, where many people are defining and executing tasks.
This code listing shows the same task, but this time written using the helper functions provided by b2luigi.

    import b2luigi
    import random


    class MyNumberTask(b2luigi.Task):
        some_parameter = b2luigi.IntParameter()

        def output(self):
            yield self.add_to_output("output_file.txt")

        def run(self):
            random_number = random.random()

            with open(self.get_output_file_name("output_file.txt"), "w") as f:
                f.write(f"{random_number}\n")


    if __name__ == "__main__":
        b2luigi.set_setting("result_dir", "results")
        b2luigi.process([MyNumberTask(some_parameter=i) for i in range(100)], workers=200)

Before continuing, remove the output of the former calculation.
rm -rf results
If you now call
python simple-example.py --batch
you are basically doing the same as before, with some very nice benefits:
The parameter values are automatically added to the output file path (have a look into the results/ folder to see how it works and where the results are stored).
The outputs for different parameters are stored in different locations, so there is no need to fear overwriting results.
The format of the folder structure makes it easy to work on it using bash commands as well as automated procedures.
Other files related to your job, e.g. the submission files etc. are also placed into this folder (this is why the very first example defined it already).
The default is to use the folder where your script is located.
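As an illustration of working with the folder structure in an automated way, here is a minimal sketch in plain Python. It assumes that each directory level below the result folder is named <parameter>=<value>, for example results/some_parameter=0/output_file.txt; please check your own results/ folder to confirm the layout. The helper collect_outputs is hypothetical and not part of b2luigi.

```python
import os


def collect_outputs(result_dir, file_name="output_file.txt"):
    """Return a list of (parameters, path) pairs for every file_name found.

    Assumes directories are named "<parameter>=<value>" (an assumption about
    the layout, to be checked against your own results folder).
    """
    results = []
    for dirpath, _dirnames, filenames in os.walk(result_dir):
        if file_name not in filenames:
            continue
        relative = os.path.relpath(dirpath, result_dir)
        params = {}
        for part in relative.split(os.sep):
            key, separator, value = part.partition("=")
            if separator:
                params[key] = value  # values are kept as strings
        results.append((params, os.path.join(dirpath, file_name)))
    # Sort by path so the order is reproducible
    return sorted(results, key=lambda item: item[1])
```

Each entry pairs the parameters encoded in the path with the file location, which makes it straightforward to select results by parameter value.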
Hint
In the example, the base path for the results is defined in the python file with
b2luigi.set_setting("result_dir", "results")
Instead, you can also add a settings.json with the following content in the folder where your script lives:

    {
        "result_dir": "results"
    }

The settings.json will be used by all tasks in this folder and in each sub-folder. Alternatively, you can set the environment variable B2LUIGI_SETTINGS_JSON to the path of your settings file. You can use it to define project settings (like result folders) and specific settings for your local sub-project. Read the documentation on b2luigi.get_setting() for more information on how to use it.
Attention
The result path (as well as any other path, e.g. the log folders) is always evaluated relative to your script file. This means results will always be created in the folder where your script is, not in your current working directory. If you are unsure about the location, call
python simple-example.py --show-output
More on file systems is described in Batch Processing, which is also mostly true for non-batch calculations.
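The difference between a script-relative path and a path relative to the working directory can be sketched in a few lines of plain Python. This only illustrates the behaviour described above and is not b2luigi's own code; the function name resolve_next_to_script is hypothetical.

```python
import os


def resolve_next_to_script(script_path, relative_path):
    """Resolve relative_path against the folder that contains script_path.

    The current working directory only matters if script_path itself
    is given as a relative path.
    """
    script_dir = os.path.dirname(os.path.abspath(script_path))
    return os.path.normpath(os.path.join(script_dir, relative_path))
```

Inside a script you would typically pass __file__ as script_path, so that "results" ends up next to the script no matter where the script is started from.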
Let’s add some more tasks to our little example. We want to read the files we have just created and combine their numbers into an average. So edit your example file to include the following content:

    import b2luigi
    import random


    class MyNumberTask(b2luigi.Task):
        some_parameter = b2luigi.IntParameter()

        def output(self):
            yield self.add_to_output("output_file.txt")

        def run(self):
            random_number = random.random()

            with open(self.get_output_file_name("output_file.txt"), "w") as f:
                f.write(f"{random_number}\n")


    class MyAverageTask(b2luigi.Task):
        def requires(self):
            for i in range(100):
                yield self.clone(MyNumberTask, some_parameter=i)

        def output(self):
            yield self.add_to_output("average.txt")

        def run(self):
            # Build the mean
            summed_numbers = 0
            counter = 0
            for input_file in self.get_input_file_names("output_file.txt"):
                with open(input_file, "r") as f:
                    summed_numbers += float(f.read())
                    counter += 1

            average = summed_numbers / counter

            with open(self.get_output_file_name("average.txt"), "w") as f:
                f.write(f"{average}\n")


    if __name__ == "__main__":
        b2luigi.set_setting("result_dir", "results")
        b2luigi.process(MyAverageTask(), workers=200)

See how we defined dependencies in line 19 with the requires function. By calling clone we make sure that any parameters from the current task (there are none in our case) are copied to the dependencies.
Hint
Again, expert luigi users will not see anything new here.
By using the helper functions b2luigi.Task.get_input_file_names() and b2luigi.Task.get_output_file_name(), the output file name generation with parameters is transparent to you as a user. Super easy!
When you run the script, you will see that luigi detects the output files from your earlier run (the random numbers) and will not run those tasks again! It will only write a new file, results/average.txt, containing a number near 0.5.
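The skipping behaviour follows from luigi's default completeness check, in which a task counts as done once all of its outputs exist. The idea can be sketched in plain Python as follows. This is an illustration of the concept, not luigi's implementation, and run_unless_done is a hypothetical name.

```python
import os


def run_unless_done(output_path, produce):
    """Call produce(output_path) only if output_path does not exist yet.

    Returns True if the work was executed and False if it was skipped.
    """
    if os.path.exists(output_path):
        return False
    produce(output_path)
    return True
```

With this logic, deleting an output file is what makes the corresponding work run again on the next call.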
You are now ready to read some more documentation in API Documentation or have a look into the FAQ. Please also check out the different Run Modes.