Benchmarking environment

PyDial provides a Reinforcement Learning based Dialogue Management benchmarking environment that enables a fair comparison between different algorithms interacting with a varied set of dialogue environments. Here, we introduce the benchmarking setup presented in Casanueva et al. 2017, where 18 dialogue environments spanning different noise conditions, user behaviours and domains are introduced. In addition, 4 state-of-the-art dialogue policy optimisation algorithms are compared in these environments.

To run the benchmarking tasks, first download PyDial and install the requirements. The config files specifying the different environments can be found in


Then, run the benchmarking task selected using the train command:

python train config/pydial_benchmarks/env1-hdc-CR.cfg --seed=(0,9)

Note that in some terminals you might have to write \ before the opening and closing parentheses, i.e. --seed=\(0,9\). This command will run environment 1 in the Cambridge Restaurants domain for 10 different seeds using the handcrafted policy. To run a different environment, simply select a different config file. To run one of the benchmarked RL algorithms, open the config file and uncomment the parameters for that algorithm. Due to updates in the environment code, the results obtained might differ from the ones presented in the paper, but the difference shouldn't be statistically significant.
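Running every environment by hand quickly gets tedious. The sketch below generates the full list of train commands, assuming the config files follow the env<N>-<policy>-<domain>.cfg naming pattern of the example above; the environment numbers and domain codes used here are assumptions for illustration, so adjust them to the files actually present in the benchmark config directory.

```python
from itertools import product

# Hypothetical sweep generator. The environment range and domain codes
# below are assumptions -- check the benchmark config directory for the
# files that actually exist in your PyDial checkout.
def benchmark_commands(envs=range(1, 7), domains=("CR", "SFR", "LAP"),
                       policy="hdc", seeds=(0, 9)):
    """Yield one train command per (environment, domain) pair."""
    for env, domain in product(envs, domains):
        cfg = f"config/pydial_benchmarks/env{env}-{policy}-{domain}.cfg"
        yield f"python train {cfg} --seed=({seeds[0]},{seeds[1]})"

for cmd in benchmark_commands():
    print(cmd)
```

Each printed line can then be run directly (or piped into a job scheduler) instead of typing the commands one by one.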

To print the mean results for all the seeds, run the plot command, passing the list of log files as arguments:

python plot --noplot _benchmarklogs/env1-hdc-CR-seed*-00.1-4.train.log
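If you want to aggregate per-seed numbers yourself, reporting a mean together with a standard error makes the "statistically significant" comparison mentioned above concrete. The snippet below is a generic sketch: the reward values are made-up placeholders, not results from the paper or from any log file.

```python
import statistics

def mean_and_stderr(values):
    """Mean and standard error over a list of per-seed results."""
    m = statistics.mean(values)
    se = statistics.stdev(values) / len(values) ** 0.5
    return m, se

# Placeholder per-seed final rewards for 10 seeds (illustrative only).
rewards = [11.2, 10.8, 11.5, 10.9, 11.1, 11.4, 10.7, 11.3, 11.0, 11.2]
m, se = mean_and_stderr(rewards)
print(f"mean reward = {m:.2f} +/- {se:.2f}")
```

Two runs whose mean intervals (mean plus or minus roughly two standard errors) overlap are unlikely to differ significantly, which is the sense in which updated results should still match the paper.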

In the future, we plan to extend these benchmarks with more challenging environments. The benchmarking environment will be updated as these tasks are developed.