
NASC

The European Arabidopsis Stock Centre

Programming using the NASCArrays website and XML

Introduction

Many people download and analyse expression data from the NASCArrays website. Some write scripts to parse this data automatically- for instance, to load it into a database of their own.

These pages describe how to parse the NASCArrays annotation automatically. Every web page you see in NASCArrays can be used as a pseudo web service, so if there is a web page you want to use programmatically, you can do so.

On this page we will go through how to access NASCArrays automatically. We will discuss how XML is used to construct the website, and how you can use that XML to answer questions using the website.

First, a proviso- our first priority is to make a useful website, not to make useful XML. Although we think the XML can help you, in places it is better suited to generating the web pages than to other uses. If you write to us, we may be persuaded to improve it, but it is not our top priority. Also, the details described on this website are subject to change.

How does it work? It's all XML underneath

Every HTML page you view in the NASCArrays website is generated as a piece of XML. Every piece of XML is then transformed using XSLT into the XHTML pages that you see. This makes it easy for us to keep presentation and content separate, and helps when generating valid XHTML.

In this document we are going to talk a lot about XML, XHTML, XSLT, XML Schema and so on. If you are not familiar with XML technologies, you might want to read up on them before continuing. We also assume that you know how to use NASCArrays generally.

You can see this XML by appending passthru=1 to any URL in NASCArrays. For example, NASCArrays' homepage is:

http://affymetrix.arabidopsis.info/narrays/experimentbrowse.pl

and a page for an experiment is:

http://affymetrix.arabidopsis.info/narrays/experimentpage.pl?experimentid=XXXX

where XXXX is an experiment id (see later- if you want to try this, choose any experiment from the homepage and you will get an id).

Append passthru=1 to the end of the URL to give:

http://affymetrix.arabidopsis.info/narrays/experimentbrowse.pl?passthru=1
http://affymetrix.arabidopsis.info/narrays/experimentpage.pl?experimentid=XXXX&passthru=1

If you click on either of these links, various things might happen depending on your browser. For instance, Mozilla will display the bits of markup that it understands (the tags that are in the XHTML 1.0 Transitional namespace). This is probably correct behaviour, but it is not very helpful. Using "View Source" will let you see the underlying XML.

Every XHTML page in NASCArrays can be altered in this way to give XML. Already you can see that this is a more powerful way of processing the website than trying to parse the web pages- these XML files contain just the information, without all of the graphics and trimmings a web browser needs to make an attractive website.

The XML produced is ugly to human eyes- no concessions to formatting are made for ease of reading. You can make it easier to read by re-indenting it. To do this, first save the XML to a file using "Save file as..." or similar in your web browser. UNIX and Mac OS X users might want to use the wget utility like so:

wget http://affymetrix.arabidopsis.info/narrays/experimentbrowse.pl?passthru=1

Once the XML is in a file (download.xml in the example below), you can format it with a utility to make it readable. We use xmllint --format:

xmllint --format download.xml > pretty.xml

Windows users could use an XML Editor such as XMLSpy. Visual Studio from Microsoft has an XML editor built in that will format XML.
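
If you would rather do all of this from a script, here is a minimal Python sketch (our illustration, using only the standard library) that fetches the experiment browse XML and re-indents it. The URL is the one given above; the output file name is just an example.

# Fetch the experiment browse page as XML (passthru=1) and pretty-print it.
# A minimal sketch using only the Python standard library.
from urllib.request import urlopen
from xml.dom import minidom

url = "http://affymetrix.arabidopsis.info/narrays/experimentbrowse.pl?passthru=1"

with urlopen(url) as response:
    raw_xml = response.read()

# Re-indent the XML so it is easier to read, then save it.
pretty = minidom.parseString(raw_xml).toprettyxml(indent="  ")
with open("pretty.xml", "w", encoding="utf-8") as out:
    out.write(pretty)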

XSLT practice

XSLT is a language for turning one type of XML into another type of XML. Since XHTML is a type of XML, you can use XSLT to "transform" any kind of XML into XHTML. This is how all of the web pages on the NASCArrays website are made.

In the xslt/ directory of this website are all the XSLT stylesheets used in NASCArrays. Download them all and put them somewhere.

For this exercise we are going to go through, in slow motion, the process the website uses to produce the web page for an AtGenExpress experiment. The URL for the experiment page (this is an example) is:

http://affymetrix.arabidopsis.info/narrays/experimentpage.pl?experimentid=xx

Have a look at the page, and remember what it looks like!

Download the XML file as detailed above (add passthru=1 onto the end of the above URL).

You now need an XSLT processor. Run the XML you've downloaded through the ep2html.xslt stylesheet you downloaded earlier. Linux or Mac OS X users can use xsltproc. Run:

xsltproc path-to-xslt/ep2html.xslt xmlfilenamehere.xml > output.html

View the output in a web browser. You should now have exactly the same web page. This is how the website produces all of its web pages internally.

The xslt/ directory above contains all of the stylesheets currently used in the website. You should be able to reproduce any of the web pages by following the procedure above and choosing the correct stylesheet.
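
If you prefer to run the transform from inside a program rather than shelling out to xsltproc, the sketch below does the same thing in Python using the third-party lxml library. This is only a suggestion of ours; the file names are the ones from the exercise above.

# Apply ep2html.xslt to a downloaded XML file using the third-party lxml
# library (pip install lxml). Equivalent to the xsltproc command above.
from lxml import etree

stylesheet = etree.XSLT(etree.parse("ep2html.xslt"))
document = etree.parse("xmlfilenamehere.xml")

result = stylesheet(document)  # the transformed (XHTML) result tree

with open("output.html", "wb") as out:
    out.write(etree.tostring(result, pretty_print=True))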

Guide to XML produced

What you (presumably) want to do, if you have read this far, is use this XML to get information about the experiments, or to run the Spot History automatically, for example. There are two stages to this: firstly, you need to know how to construct the URL for each part of the website, so you can get the answers you want; secondly, you need to understand the format of the XML. This section describes the second part.

XML is a very flexible language- provided you obey the well-formedness rules, tags can come in any order. It is very difficult to work with XML if you do not know what markup to expect. Suppose, for example, you wanted to figure out the format of the experiment page. You could take a sample of experiment pages for different experiments and try to spot the pattern, but there would always be the possibility of surprise tags appearing that you did not expect. This makes it hard to write applications.

XML Schema is a language for describing the format of XML documents. It describes things like: a <p> tag can only occur inside a <q> tag; only numbers can occur inside a <q> tag; a <z> tag must contain three <x> tags and three <y> tags.

There are other languages for doing the same thing, notably DTDs (which are now out of date) and RELAX NG. We use XML Schema. There are various XML Schema editing applications available.

You can get the schema from http://affymetrix.arabidopsis.info/NS/CriminalNetwork/1/schema.xsd. This should give a complete description of the XML produced by the website for the most common pages. It is not yet complete (that is to say, not all pages produced by the website will validate against this schema). As well as the structure, the schema contains comments that should help you decipher the XML.
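
As a quick sanity check, you can validate a downloaded page against this schema. The sketch below is our illustration using the third-party lxml library, and assumes you have saved the schema and an experiment page locally under the file names shown.

# Validate a saved NASCArrays XML page against the published schema,
# using the third-party lxml library (pip install lxml).
from lxml import etree

schema = etree.XMLSchema(etree.parse("schema.xsd"))
document = etree.parse("experimentpage.xml")

if schema.validate(document):
    print("Document matches the schema")
else:
    print(schema.error_log)  # details of why validation failed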

All of the tags are (at the time of writing) in the XML namespace http://affymetrix.arabidopsis.info/NS/CriminalNetwork/1. If you are writing programs to work with this XML, use a technology that understands XML namespaces (like SAX2 or DOM Level 2 and up). Check the namespace of all tags when you parse the XML. We'll change the namespace when we update the tags- this will allow you to detect updates to the XML format and allow your software to die gracefully.

Some tags contain XHTML snippets- these are in the XHTML namespace http://www.w3.org/1999/xhtml.
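
A minimal namespace check with the standard library's ElementTree might look like the sketch below. The namespace URI is the one quoted above; the file name is just an example.

# Check that a downloaded document is in the namespace we expect before
# trying to interpret it. ElementTree exposes namespaces by prefixing every
# tag name with "{namespace-uri}".
import xml.etree.ElementTree as ET

NASCARRAYS_NS = "http://affymetrix.arabidopsis.info/NS/CriminalNetwork/1"

root = ET.parse("experimentpage.xml").getroot()

if not root.tag.startswith("{" + NASCARRAYS_NS + "}"):
    # The namespace has changed, so the XML format has probably changed too:
    # better to stop cleanly than to misread the data.
    raise SystemExit("Unexpected namespace on root element: " + root.tag)

print("Root element:", root.tag)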

Programmer's guide to some useful pages on the website

This section introduces some of the useful pages on the website and what you could use them for, from a programmer's point of view. All of the pages mentioned should produce XML that validates against the XML Schema described above. All of the URLs below should have passthru=1 appended to get the XML described.

Important IDs

Lots of things in NASCArrays have ID numbers (in fact, in the underlying database, everything does). There are a few important ID numbers that you will need when working with the XML.

  • The Experiment ID is a number that uniquely identifies the experiment
  • Experiments are made up of Slides- each slide has its own Slide ID
  • Each Slide has its own data- this is the ResultSet. Use the ResultSet ID to obtain data for a slide
Old Experiment Browse
URL format: http://affymetrix.arabidopsis.info/narrays/oldexperimentbrowse.pl

This page used to be NASCArrays' homepage, before we had our new tree-based page. It is still very useful to XML programmers, however. From an XML point of view, this page contains a list of the experiments in the database sorted by date. This is represented as a series of <ExperimentLite> elements, each one representing an experiment. If you wanted to find a particular experiment, you could search through this XML.

Every experiment has a unique experiment id. This experiment id can be used in other parts of the website.
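
As an illustration, the sketch below fetches the old experiment browse page and lists its ExperimentLite elements. We make no assumption here about where the experiment id lives inside each element- the code just prints each element's attributes, and you should consult the schema for the real attribute and child names.

# List the ExperimentLite elements in the old experiment browse page.
# We simply print each element's attributes; check the schema to see
# exactly where the experiment id is held.
import xml.etree.ElementTree as ET
from urllib.request import urlopen

NS = "{http://affymetrix.arabidopsis.info/NS/CriminalNetwork/1}"
url = "http://affymetrix.arabidopsis.info/narrays/oldexperimentbrowse.pl?passthru=1"

with urlopen(url) as response:
    root = ET.parse(response).getroot()

for experiment in root.iter(NS + "ExperimentLite"):
    print(experiment.attrib)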

Experiment Page
URL format: http://affymetrix.arabidopsis.info/narrays/experimentpage.pl?experimentid=XXXXXX

Once you have an experiment number from somewhere, you can get all of the details for the experiment using the experiment page. This page is the most complicated on the website.

Each Experiment (which is the root element) has a Submitter - the person who sent the experiment in to NASC. It also has a number of hybridisation sets (HybrSet) - typically just one.

In this XML, each thing a parent owns is represented as a tag nested inside the parent's tag. So in this case, you can expect a Submitter tag inside the Experiment tag, along with a series of HybrSet tags.

A HybrSet is a method of grouping Slides. So each HybrSet contains a series of Slide tags.

Each Slide contains two things: an Image, which is the scanned image of the slide, and an Extract, which is information about the sample hybridised to the slide.

Each Extract contains a Source- this is where the original plant is described.

Each Image contains a ResultSet- this is how you get the data. The ResultSet tag has a ResultSetID attribute.

Each of these tags carries information about the thing it represents (Experiment, Slide, Extract, Source, Image, ResultSet). Often this information is in a subtag called Details.
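
Putting the nesting described above together, here is a rough Python sketch that walks an experiment page and collects the ResultSetID of every slide. The element and attribute names are the ones named in the text; if they do not match what you see, consult the schema.

# Collect the ResultSetID of every slide in a saved experiment page, following
# the Experiment > HybrSet > Slide > Image > ResultSet nesting described above.
import xml.etree.ElementTree as ET

NS = "{http://affymetrix.arabidopsis.info/NS/CriminalNetwork/1}"

experiment = ET.parse("experimentpage.xml").getroot()  # the Experiment element

result_set_ids = []
for hybrset in experiment.findall(NS + "HybrSet"):
    for slide in hybrset.findall(NS + "Slide"):
        for image in slide.findall(NS + "Image"):
            for result_set in image.findall(NS + "ResultSet"):
                result_set_ids.append(result_set.get("ResultSetID"))

print(result_set_ids)  # these ids can be fed to download.pl (see below)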

Obviously this is only a very rough guide. If you would like to know more, please consult the XML Schema. If you cannot work out how to extract the information you require, please email us.

Keyword Search
URL format: http://affymetrix.arabidopsis.info/narrays/supersearch.pl?searchterms=XXXXXXX

Supply a list of "keyword" search terms that you want to search for (separated by spaces if there is more than one). Returns a list of ExperimentLite tags.

Slide Search
URL format: http://affymetrix.arabidopsis.info/narrays/search.pl?f1=ENUM&s1=TERM&f2=ENUM&s2=TERM&f3=ENUM&s3=TERM&f4=ENUM&s4=TERM&f5=ENUM&s5=TERM

Search for terms in particular fields to find particular slides. You can search on up to five terms. Each ENUM in the URL above should be replaced with a number from the list below, depending on what you want to search on:

  1. Name
  2. Alias (e.g. Col-0)
  3. Growth Conditions
  4. Development Stage (e.g. 1.0)
  5. Tissue
  6. Stock Code
  7. Genetic Background (e.g. Col-0)
  8. Genetic Variation (e.g. T-DNA)
  9. Treatment
  10. Slide Type (e.g. ATH1, AG)

The TERM should be set to whatever you want to search for. Leave spare TERMs blank, and they will be ignored.

This XML returns a list of Slide tags- see the experiment page documentation for what these contain.
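
For example, a search for ATH1 slides with a drought treatment could be built like the sketch below (the search terms themselves are invented for illustration). Note the passthru=1 parameter to get the XML back rather than the web page.

# Build and fetch a slide search URL. Field numbers come from the list above
# (10 = Slide Type, 9 = Treatment); the search terms are only examples.
from urllib.parse import urlencode
from urllib.request import urlopen

params = {
    "f1": 10, "s1": "ATH1",     # field 10 = Slide Type
    "f2": 9,  "s2": "drought",  # field 9 = Treatment
    "f3": 1,  "s3": "",         # spare terms left blank, so they are ignored
    "f4": 1,  "s4": "",
    "f5": 1,  "s5": "",
    "passthru": 1,              # return the XML rather than the web page
}

url = "http://affymetrix.arabidopsis.info/narrays/search.pl?" + urlencode(params)
with urlopen(url) as response:
    slide_xml = response.read()  # a list of Slide tags, as described above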

This is particularly useful for automatically annotating a sample if you have the slide name (for instance, from a data download spreadsheet).

Data download system

URL format: http://affymetrix.arabidopsis.info/narrays/download.pl ....

By using download.pl, you can automatically download data from the website. This may be useful if you want to download data that you have found using the XML services above.

What to download

Data can be automatically downloaded from the database for the following things:

  • An entire experiment (requires the experimentid)
  • A single slide, or a series of slides (requires ResultSetIDs)
  • Some AGI codes over all experiments (Bulk Gene Download)

You can get these by supplying the appropriate parameters to download.pl.

  • For an entire experiment, append experimentid= followed by the experiment id
  • For a series of slides, append resultsetid= once for each slide you wish to download, e.g. resultsetid=5401&resultsetid=5402
  • For the Bulk Gene Download, append agicodebox= followed by a space-separated list of AGI codes
Which format would you like?

Data can be downloaded in various formats. If you omit any of these parameters you will get the defaults. Unfortunately for reasons of backwards compatibility, some of the defaults are not very sensible. Therefore, we advise you to fill in all of these parameters if you can.

Spreadsheet format is controlled by appending format= onto the URL:

  • To get tab-delimited format, append format=tab
  • To get comma separated values (CSV) format, append format=csv
  • To get Gnumeric format, append format=gnumeric (coming soon)
  • To get MS Excel format, append format=excel (coming soon)

Annotation levels can be controlled by appending annotationtype=

  • To get full annotation, append annotationtype=full
  • To get gene-level annotation only, append annotationtype=basic
  • To get no annotation, append annotationtype=noannot
  • To get probeset names only, append annotationtype=probesets

Which data to download can be chosen by appending datacols=

  • To get all data, append datacols=all
  • To get just signal values (not Detection Call, StatPairsUsed etc.), append datacols=signalonly

Compression- you can download compressed spreadsheets by appending compression=.

  • To get ordinary uncompressed data, append compression=none
  • To get data compressed with gzip, append compression=gzip

Note- if you choose Excel file format, the files are never compressed. Gnumeric file format is a compressed format anyway, so this parameter has no effect for Gnumeric.

One final note: data downloading takes a long time, so make sure that your timeouts are set high. This is especially the case for Gnumeric, Excel and gzip-compressed files - these files have to be completely produced before data transfer begins.
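
Putting the parameters together, here is a rough Python sketch that downloads an entire experiment as a gzip-compressed, tab-delimited spreadsheet with full annotation. The experiment id is a placeholder, and the ten-minute timeout is only an illustrative choice- pick whatever suits your connection.

# Download a whole experiment as a gzip-compressed, tab-delimited spreadsheet
# with full annotation, using the download.pl parameters described above.
from urllib.parse import urlencode
from urllib.request import urlopen

params = {
    "experimentid": "XXXX",   # replace with a real experiment id
    "format": "tab",
    "annotationtype": "full",
    "datacols": "all",
    "compression": "gzip",
}
url = "http://affymetrix.arabidopsis.info/narrays/download.pl?" + urlencode(params)

# Generous timeout: the file is built in full before transfer begins.
with urlopen(url, timeout=600) as response:
    with open("experiment.txt.gz", "wb") as out:
        out.write(response.read())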

The End!

This is the end of our short guide to programming the NASCArrays website. Hopefully it has given you enough information to get started. If you want more information, please write to affy@arabidopsis.info.