Notes on the discussion about the external tool at the Debian Med meeting during January 2011.

External tool invocation discussion

Ways of calling tools are described by "use case descriptions". The use case descriptions can be held in a registry. For example,

http://usecase.taverna.org.uk/sharedRepository/index.php

The KnowARC project developed a plugin for Taverna that can read the use case descriptions and lets you include calls to a tool in a workflow.
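
As a rough sketch, a use case description pairs a command-line template with its inputs and outputs. The element and attribute names below are illustrative only, not the exact schema used by the repository:

<usecase name="example-tool">
  <!-- command line template; %%input%% and %%output%% are substituted at run time -->
  <command>example-tool --in %%input%% --out %%output%%</command>
  <input name="input" />
  <output name="output" />
</usecase>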

You need to know how/where to call the tool - the "invocation mechanism". There are currently three options for the invocation mechanism:

  • local
  • ssh to specific machines (uses Taverna's credential manager to keep the login information confidential)
  • on a KnowARC grid

In the current plugin, the setting for how to call the tool is shared by all external tool services in a workflow, e.g. all the tools are run locally.

Planned improvement

Taverna will manage a set of invocation environments that are named and identified by a UUID, for example "fred" and "62A81F2F-4C3D-4C0C-ACF1-681327130328". (The UUID may be changed to a URL.)

An external tool service in a workflow will state the invocation environment that the tool will be run in.

The invocation environments can specify the settings for the various invocation mechanisms and also which mechanism is currently to be used, e.g. ssh is set up to go to phoebus.cs.man.ac.uk but the environment currently runs tools locally. So you can have some services using environment "fred", some "bob" and some "jim".

The invocation environment manager allows users to edit the settings and also to change which mechanism is used. This makes it easy to change where a set of tools will be run, e.g. to switch all services using environment "bob" to run locally.

There is also a proposal to support the idea of test and production invocation environments. So "fred" can be set to run services locally during test and on a grid during production. The choice of whether to run a workflow in test or production mode will be made when the workflow is run.
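
A sketch of what such an invocation environment could look like; the XML format is made up for illustration, while the name, UUID, host and mechanism choices come from the notes above:

<invocationEnvironment name="fred" uuid="62A81F2F-4C3D-4C0C-ACF1-681327130328">
  <!-- settings for each available invocation mechanism -->
  <ssh host="phoebus.cs.man.ac.uk" />
  <!-- which mechanism each run mode currently uses -->
  <use mode="test" mechanism="local" />
  <use mode="production" mechanism="knowarc" />
</invocationEnvironment>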

It is not clear how the choice of mode will be shown to the user.

A workflow run may vary the data that it uses according to the run mode, e.g. to use different data during test and production. There needs to be explicit support for this in the workflow, but it is not clear how. Maybe a tee filter with two input ports and one output port, which switches the data it lets through depending on test/production mode?
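
The tee filter idea could be pictured as a processor along these lines (all names are hypothetical; nothing like this exists in Taverna yet):

<processor type="modeTee">
  <!-- data consumed during test runs -->
  <inputPort name="testData" />
  <!-- data consumed during production runs -->
  <inputPort name="productionData" />
  <!-- emits whichever input matches the current run mode -->
  <outputPort name="data" />
</processor>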

Ongoing issues

  • SSH equivalent for Windows.

External tool description

There is an XML format for specifying the use case descriptions. We have also looked at:

  • ACD, as used by EMBOSS and ?SoapLab
  • Galaxy
  • ?BioPieces
  • GIMIAS
  • BOINC

The current plan is to translate the EMBOSS ACD descriptions and put them in a repository. Future work will ensure that the external tool capability is sufficient.

A quick proof of concept, including Java source copied and modified from EMBOSS, is at git@s2.spratpix.com:acd-to-use-case.git. Ask Hajo for access.
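
As an illustration of the intended translation, EMBOSS's seqret (which reads a sequence and writes it back out) might map to a use case description roughly like this, using the same illustrative schema as above; the RE name is an assumption:

<usecase name="seqret">
  <!-- derived from seqret's ACD: one input sequence, one output sequence -->
  <command>seqret -sequence %%sequence%% -outseq %%outseq%%</command>
  <input name="sequence" />
  <output name="outseq" />
  <!-- assumed RE name marking the EMBOSS requirement -->
  <RE name="APPS/EMBOSS" />
</usecase>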

Additional invocation mechanisms

For Taverna 2.3, we will include local and ssh invocation as part of the release. The KnowARC invocation will probably be made available as a plugin.

Need to look at cloud invocation soon. Possible clouds:

Amazon, Eucalyptus, Rackspace and VMWare all use virtual machine images of various formats. Using a REST/SOAP API, one of these pre-prepared images can be started on a new virtual machine and its IP address retrieved. For Amazon, we also need to create SSH keys and use the API to register them with our new virtual machine. After that, invocation proceeds as with SSH invocation.
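
A cloud invocation mechanism would then need roughly this much configuration per environment; the format, attribute names and image id are all hypothetical:

<cloud provider="amazon">
  <!-- pre-prepared VM image to boot via the REST/SOAP API -->
  <image id="ami-1a2b3c4d" />
  <!-- SSH key pair to generate and register with the new VM (Amazon only) -->
  <sshKey name="taverna-invocation" />
</cloud>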

Amazon and Rackspace are transient in the sense that shutting down a virtual machine removes all of that VM's data. Eucalyptus and VMWare keep the data, so we could restart the VM later on to satisfy outstanding references. For BOINC, references are impossible because every user, and thus every node, might go offline at any time.

Invocation environment checking

There are currently only limited ways of specifying what needs to be present on the machine where a tool will be run, and of checking whether the tool can be run there. We need to look at more general ways of specifying this.

For all xRSL-compatible grids, we can just use the RE (runtime environment) lines we currently use. For BOINC, we need a mapping from REs to tar files, which we submit to BOINC alongside our data and job description. For Amazon/Eucalyptus/Rackspace/VMWare, the user needs to prepare VM images beforehand. We therefore require a list of URLs to these VM images, with each image tagged with the REs installed inside it.
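
The user's VM image list could be a simple tagged catalogue, sketched here with hypothetical element names and an example URL:

<vmImages>
  <!-- each image is tagged with the REs installed inside it -->
  <vmImage url="http://example.org/images/java-worker.img">
    <RE name="APPS/JAVA-1.6.1" />
  </vmImage>
</vmImages>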

So our problem children are local (on Linux) and SSH invocation. For those invocation mechanisms, we include RE blocks inside the usecase.xml:

<RE name="APPS/JAVA-1.6.1">
  <!-- shell command that checks whether this RE is available -->
  <test local="test -x /usr/bin/java" />
  <!-- package to suggest, plus a message for systems where we cannot suggest one -->
  <hint package="sun-java6-jdk">
    You're missing Java, please install.
  </hint>
</RE>

We can then automatically suggest "apt-get install sun-java6-jdk" on Debian systems; otherwise we show the user "You're missing Java, please install.". The benefit of this over having a <test ...> in every use case is that we only have to write the test once for all use cases using the same tool.

Sensible handling of data

We want to minimize the transfer of data, so data stays, where possible, with the tools that will use it. Likewise, tools are invoked where the data is.

We need to extend Taverna's data handling mechanisms to deal with this, and to improve some of the invocation mechanisms so they can better decide where to run the tools.

The plugin currently logs the job IDs of all executed KnowARC grid jobs. This means we can freely re-use references to data, because the user can clean up all their old grid jobs on demand. The user should be warned that this will invalidate any references and therefore should not be done while Taverna workflows are running.

In general, every invocation, independent of its mechanism, should be logged. Also, invocations should generally not clean up any of their data, so that we can use references. Once a reference has been downloaded, the original file can be deleted, because its data is then stored inside Taverna.
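
An invocation log entry would need to record at least the mechanism and a handle for later clean-up, e.g. (hypothetical format, with a made-up job URL):

<invocationLog>
  <!-- one entry per invocation, regardless of mechanism -->
  <invocation mechanism="knowarc"
              jobId="gsiftp://grid.example.org/jobs/12345"
              cleanedUp="false" />
</invocationLog>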

This still leaves the problem of when to clean up past invocations. Should this happen after running the workflow (which would make it impossible for the user to view unused output ports)? When closing Taverna? Only when the user requests it? After a timeout?

Windows use cases

All current use cases, and any new use case without an explicit Windows flag, are to be considered Linux/Mac.
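
The flag could be a single attribute on the use case description; the attribute name is hypothetical, and its absence means Linux/Mac:

<!-- os="windows" marks a Windows-only use case; no os attribute means Linux/Mac -->
<usecase name="windows-only-tool" os="windows">
  <command>tool.exe %%input%%</command>
  <input name="input" />
</usecase>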

Use case invocation is controlled in two places:

Every use case activity has a tab in its configuration for selecting two invocation environments, one for testing and one for production. The meaning of these invocation environments is specified in Taverna's configuration options and is therefore shared globally across all workflows.

Invocation environments are always used for Linux use cases, which means that on a Windows machine it should be impossible to select local invocation in the Taverna-wide configuration dialog.

A Windows-only use case has its tab for selecting invocation environments grayed out, because all invocation environments are Linux-only. So a Windows-only use case will always use local execution. On a Linux machine, this should produce an error during the health check or when running the workflow.

A Linux-only use case will have the aforementioned tab for selecting invocation environments, which will be used to choose one of the Taverna-wide environments. This makes sense on both Windows and Linux machines, because even when running on Windows, the user will not be able to create an invocation environment that uses local execution.

Reading and publishing tool descriptions

Every use case description is identified by UUID and NOT by name. Duplicate names are allowed. Every use case description used is embedded in the workflow.

When loading a workflow, ask all repositories for their UUID-to-timestamp mappings. Compare the retrieved timestamps with the timestamp of the embedded use case description and ask the user whether to update or keep the old version. On a server, it should either always take the latest from the repository or always the one from the workflow; taking the one embedded in the workflow is best.
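
The repository's answer to the UUID-to-timestamp query could be as small as this (hypothetical format; the timestamp value is illustrative):

<timestamps>
  <!-- compare each timestamp against the embedded copy in the workflow -->
  <usecase uuid="62A81F2F-4C3D-4C0C-ACF1-681327130328"
           timestamp="2011-01-30T16:08:29Z" />
</timestamps>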

Editing use cases is done exclusively inside the Taverna GUI. Each edit updates the timestamp. The GUI also contains a "Share" button which will upload the use case to a public use case repository. After the use case has been approved, it will then be available for download to all users of that repository.

We need a long-term plan for this, e.g. going with Heroku. Taverna would need to know about possible repositories for upload.

Talk to the ?BioCatalogue people. At a minimum, we would like it to know about ?UseCase repositories and to synchronize with them.

Before a user is able to upload a new use case, we check that they have executed it successfully inside Taverna at least once. We also ask them to upload or specify example workflows on myExperiment, so that we can have bi-directional links between myExperiment and the use case repository.