The following document describes how to use the PAC experimental software to measure the runtime of the PAC approach and its competitor approaches (TDB and LRU).

== Package contents

- pec.jar : The program that executes the indicated caching method on a given dataset. It writes the results of the experimental evaluation and the log information to separate files.
- pec.py : Wrapper script that invokes pec.jar. We recommend always using this script when testing PAC.

== Software requirements

- Python 2.7
- Java >= 8
- Gurobi >= 7.0.1

=== Installing Gurobi (on Linux, using an academic license)

Gurobi is a proprietary library for combinatorial optimization. We use it to solve the fragment selection problem, which is formulated as an Integer Linear Program. Below I briefly detail the steps to install Gurobi on a GNU/Linux machine. Note that I used an academic license for this purpose; the steps may differ for non-academic licenses and/or other operating systems.

1) Download the right Gurobi Optimizer at http://www.gurobi.com/downloads/download-center (you will have to create an account on the website if it is your first time using Gurobi).
2) Uncompress the package and place the uncompressed folder (called gurobi[version]) somewhere in your file system. The Gurobi files are located under gurobi[version]/linux[64|32], depending on the package you downloaded.
3) Get an academic license. This license allows a single user to run the Gurobi libraries on one computer. If you have multiple computers or users, you have to get a license for each. More information here: http://www.gurobi.com/academia/for-universities.
3.1) Visit the Free Academic License page (https://user.gurobi.com/download/licenses/free-academic) and follow the instructions on that page. Those instructions will produce a signature, e.g., ae36ac20-16e6-acd2-f242-4da6e765fa0a, that you will need to validate your license.
3.2) Get a key by running the tool 'grbgetkey' (located at, e.g., gurobi702/linux64/bin) with the signature from the previous step (e.g., grbgetkey ae36ac20-16e6-acd2-f242-4da6e765fa0a). The 'grbgetkey' program will retrieve your license key and prompt you to store it on your machine. It will also validate your academic license eligibility by confirming that you are connected to the Internet from a recognized academic domain (e.g., any '.edu' address).
4) Configure the following environment variables in your system:
- GRB_LICENSE_FILE : It should point to the location of the license key stored in step 3.2)
- GUROBI_HOME : It should point to the Gurobi installation path defined in step 2) (gurobi[version]/linux[64|32])
- PATH : Extend the variable's value to contain the value GUROBI_HOME/bin
- LD_LIBRARY_PATH : Extend the variable's value to contain the value GUROBI_HOME/lib
How to configure those variables on your computer depends on your GNU/Linux distribution.

== Running the software

Once all the software dependencies are installed, we are ready to run the PAC experimental software. To test it, run the following command in a command line:

$ ./pec.py -c CONFIG_FILE.ini

Here CONFIG_FILE.ini is a configuration file in the INI format (https://en.wikipedia.org/wiki/INI_file; the parser does not support sections). This format stores arguments of the form (one per line):

parameter: value

== The configuration file

This input configuration file defines all the arguments for the PAC experimental tool pec.py. In the following we describe the most relevant ones.

- load-instance-data: Path to the Jena TDB directory where the input dataset was loaded. See http://qweb.cs.aau.dk/pac/data/README-DATA.txt for instructions on how to load an NQ dataset into a Jena TDB database. To run more than one dataset at a time, add multiple "load-instance-data" lines, one per dataset.
- load-cube-structure: Path to the dataset's cube definition.
  We provide the schemas for both the SSB and the QBOAirbase families of datasets as TSV files.
- ilp-log-location: Log file for the fragment selection policy. Gurobi uses this file to output information about the optimization problem defined to solve the fragment selection problem.
- offline-log-location: Log file for the offline part of the experiments. The offline part covers the loading of the TDB data structures and the construction of the fragment tree. This log file will thus contain statistics and runtimes about the TDB dataset and the fragment tree.
- database-type: Always set this to tdb.
- analytical-queries-dir: Directory containing the analytical queries (SPARQL) that will be tested by the experimental software. Each query should be stored in a different file.
- budget: Memory budget in number of triples.
- budget-percentage: Memory budget as the percentage of the database's triples that should be cached.
  The file can contain multiple values for the budget and budget-percentage arguments. In that case, the experimental software will test every value provided for these parameters.
- provenance-queries-dir: Directory containing the provenance queries that will be tested by the experimental software. Each query should be stored in a different file. The queries can be either SPARQL SELECT queries or lists of provenance identifiers from the database's provenance graph. In the latter case, the file should contain one provenance identifier per line (without brackets <>). This parameter always assumes a path relative to the parent folder of the dataset defined by the parameter load-instance-data.
- add-cache: Cache setting. It can be "warm", "cold" or "tepid".
  - warm: Jena TDB will use its default cache settings (https://jena.apache.org/documentation/tdb/architecture.html#caching-on-32-and-64-bit-java-systems). This value is required to test PAC or any cache strategy under "warm settings".
    In addition, to implement a fully warm setting as presented in the paper, the OS cache should have been populated by running each query at least once.
  - cold: The Jena TDB internal cache is set to zero. This value is required to test PAC or any cache strategy under "cold settings". A fully cold setting requires purging the OS cache before running the experiments. The steps to purge the OS cache are fully platform dependent.
  - tepid: The Jena TDB internal cache will be sized according to a budget argument. This value is required to test the TDB caching strategy defined as a competitor in the paper.
- fragment-selector: The fragment selection strategy used to populate the cache. It accepts the following values:
  - ilp-distance-improved: The PAC approach as presented in the paper.
  - ilp-distance: A variation of PAC that favors bigger fragments in the selection. It performs worse than PAC on our test datasets.
  - lru: Least recently used (LRU) caching strategy.
  - tdb: TDB caching. It should be used in conjunction with "add-cache: tepid".
  - mockup: PAC using the naive query rewriting approach mentioned in the paper (no graph filtering when adding the graph labels to the analytical query).
  - dummy-lru: LRU without graph filtering.
  - dummy-tdb: TDB without graph filtering. It should be used in conjunction with "add-cache: tepid".
- timeout: Query timeout in minutes.
- evaluation-strategy: It can be either "fullMaterialization" or "basic". We recommend the first one. "basic" uses the standard Jena TDB evaluation, which proved too slow for the queries used in our experiments.
- experimental-runs: Number of times each query (combination of analytical and provenance query) will be executed.
- debug-query: If "true", the software will verify whether the queries deliver the right results when using the cache. It compares the results of a cache-based approach with the results of a standard Jena TDB execution without caching.
- optimized-query-rewriting: "true" by default. If set to "false", graph filtering is disabled. It should be set to "false" when combined with the "dummy-tdb", "dummy-lru", and "mockup" fragment selectors.
- experimental-log-location: Path to the file containing the experimental results (runtimes) reported by pec.py. Each query execution is described in a tab-separated line.
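
As an illustration of the configuration format described above, the following sketch generates a configuration file and builds the command line for pec.py. It is not part of the PAC package. The helper names (render_config, missing_gurobi_variables, build_command) are hypothetical, and every path and number in EXAMPLE_PARAMS is a placeholder that you need to replace with values for your own setup. The sketch only relies on what this document states: one "parameter: value" pair per line, no sections, and repeated "load-instance-data" lines for several datasets.

```python
import os
import tempfile


def render_config(params):
    """Render (name, value) pairs as 'parameter: value' lines.

    A list value produces one line per element, e.g. several
    'load-instance-data' lines to run more than one dataset.
    """
    lines = []
    for name, value in params:
        values = value if isinstance(value, (list, tuple)) else [value]
        for item in values:
            lines.append("{0}: {1}".format(name, item))
    return "\n".join(lines) + "\n"


def missing_gurobi_variables(environ):
    """Return the variables from installation step 4 that are unset or empty."""
    required = ["GRB_LICENSE_FILE", "GUROBI_HOME", "PATH", "LD_LIBRARY_PATH"]
    return [name for name in required if not environ.get(name)]


def build_command(config_path, script="./pec.py"):
    """Build the documented invocation: ./pec.py -c CONFIG_FILE.ini"""
    return [script, "-c", config_path]


# Hypothetical example values: all paths and numbers are placeholders.
EXAMPLE_PARAMS = [
    ("load-instance-data", ["data/dataset-a-tdb", "data/dataset-b-tdb"]),
    ("load-cube-structure", "data/schema.tsv"),
    ("database-type", "tdb"),
    ("analytical-queries-dir", "queries/analytical"),
    ("provenance-queries-dir", "provenance-queries"),
    ("budget", 100000),
    ("add-cache", "cold"),
    ("fragment-selector", "ilp-distance-improved"),
    ("evaluation-strategy", "fullMaterialization"),
    ("timeout", 10),
    ("experimental-runs", 5),
    ("optimized-query-rewriting", "true"),
    ("ilp-log-location", "logs/ilp.log"),
    ("offline-log-location", "logs/offline.log"),
    ("experimental-log-location", "logs/results.tsv"),
]

if __name__ == "__main__":
    config_text = render_config(EXAMPLE_PARAMS)
    config_path = os.path.join(tempfile.mkdtemp(), "CONFIG_FILE.ini")
    with open(config_path, "w") as handle:
        handle.write(config_text)
    print(config_text)
    print("Unset Gurobi variables: {0}".format(
        missing_gurobi_variables(os.environ)))
    print("Command to run: {0}".format(" ".join(build_command(config_path))))
```

The script only prints the command instead of executing it, so it can be tried on a machine without Gurobi or the datasets. To test a competitor approach, change the pair of settings together, for example "fragment-selector: tdb" with "add-cache: tepid", as described in the parameter list.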