Or maybe another color altogether. Then again, I could ask when tall is actually short, or a whole host of apparently contradictory questions.
What a conundrum.
No, this isn’t some fiction. It was the reality I faced when I took up the reins as head of IRRI’s Genetic Resources Center (GRC) in July 1991 and asked for a demonstration of the ‘genebank data management system’.
A large germplasm collection, or was it?
The International Rice Genebank (IRG) at IRRI holds the world’s largest and (almost certainly) most genetically diverse collection of rice: varieties of Asian rice (Oryza sativa) and African rice (O. glaberrima), and wild species (not only Oryza species, but also representatives of related genera).
Besides providing the very best conditions to ensure the long-term survival of these precious seed samples (as I blogged about recently), it’s also essential to document, curate, and easily retrieve information about the germplasm stored in the genebank. That’s quite a daunting prospect, especially for a collection as large as the International Rice Genebank Collection (IRGC), with over 126,600 samples or accessions at the last count¹. (During my tenure as head of GRC, the collection actually grew by about 25% or so, with funding for germplasm collecting from the Swiss government.)
I discovered that the three rice types—Asian, African and wild species—were being managed essentially as three separate germplasm collections, each with its own data management system. What a nightmare! It was almost impossible to get a quick answer to any simple question, such as ‘How many accessions are there in the genebank from Sri Lanka?’ It took three staff to query the databases, formulating their queries in slightly different ways because of the different database structures.
But why was it necessary to ask such questions, and require a rapid response? In 1993 the Convention on Biological Diversity (CBD) came into force. I anticipated that IRRI would receive an increasing number of requests from countries about the status and disposition of their rice germplasm conserved in the IRG. Until we had an effective data management system we would have to continue trawling through decades of paperwork to find answers. And indeed there was an increase in such requests as countries became concerned that their germplasm might be misappropriated in some way or other. I should say that the changes we subsequently implemented stood IRRI in good stead when the International Treaty on Plant Genetic Resources for Food and Agriculture came into force, with its requirements to track all germplasm flows and use. But I’m getting ahead of myself.
It made no sense to me that the rice types should be managed as separate collections, since once in the same genebank vaults seeds were stored under identical conditions. So, as I indicated elsewhere on this blog, I appointed Flora de Guzman as genebank manager with overall responsibility for the entire rice collection, and started to study various aspects of germplasm regeneration and seed conservation. Since the wild rices had a special nursery screenhouse for multiplication of seed stocks (a requirement of the Philippines Quarantine Service), another member of staff became curator of the wild species on a day-to-day basis.
The data management challenge
In 1991 the IRG had three very competent data management staff: Adel Alcantara, Vanji Guevarra, and Myrna Oliva, soon to be joined by a technical assistant, Nelia Resurreccion.
Due to the lack of oversight for data management, I realized the trio were each doing their own thing for the sativas, the glaberrimas, and the wild species, so to speak, with limited reference to what the others were doing. To make any significant improvements to data management, it would be necessary to build a single data system for all germplasm in the genebank. I thought this would be quite a straightforward undertaking, taking maybe a couple of months or so. How wrong I was! It was much more complex than I had, in my naivety, envisaged.
Back in 1991, PC technology was still in its infancy; well, maybe approaching juvenility. The databases were managed using ORACLE on a VAX mainframe. More nightmares! Fortunately, with some investment in office design and furniture (giving each member of staff a proper workstation and the means to work better as a team) and in more powerful PCs, we were able to migrate the new data management systems to local servers. We left the VAX behind, but unfortunately still had an ORACLE legacy that was far more difficult to ditch. I also wanted to develop an online data management system that would permit researchers at IRRI, and eventually around the world, to access germplasm data for themselves rather than always having to request information from genebank staff. That was the less than ideal situation when I joined IRRI. In fact, to access genebank data at that time it was necessary to make a request in writing, which had to be approved by the head of the genebank, then Dr TT Chang. I put a stop to that right away. Because the data had been accumulated using public funds, they should be made freely available henceforth to anyone. Direct and unhindered access to genebank data was my goal.
The underlying problem
However, the three databases could not ‘talk’ to one another, because their structures and data were different for the three ‘collections’. Let me explain.
There are basically two types of germplasm data, what we call passport data, and characterization and evaluation data. The passport data include such pieces of information as the identity of germplasm (often referred to as the accession number), the donor number and the collector’s number, for example. These data are, or should be, unique to a piece of germplasm or an accession. But passport data also include information about the date of acquisition, when it was first stored in the genebank, who has requested a seed sample, and when. Of course there’s a great deal more, but these examples suffice to explain something of the nature of these data.
Characterization (qualitative) and evaluation (mainly quantitative) data describe various aspects (or traits as they are known) of rice plants such as leaf and grain color, or plant height, days to flowering, and resistance or tolerance to pests and diseases, using agreed sets of descriptors and scoring codes or actual measurements. The International Board for Plant Genetic Resources (IBPGR, which became the International Plant Genetic Resources Institute, then Bioversity International) had developed these crop descriptors, and the first—for rice—was published jointly with IRRI in 1980 (and revised and updated in 2007).
An essential condition for a successful data management system therefore is that information is recorded and stored consistently. In order for the three databases to talk to each other, we had to correct any differences in database structure, such as the naming and structure of database fields, as well as consistent use of codes, units, etc. for the actual information. This is what we discovered.
Take the most basic (and one of the most important) database fields, the accession number, for example. In one database, this field was named ‘ACC_NO’, in another ‘ACCNO’. And the structure was different as well. For the sativas it was a five digit numeric field; for the glaberrimas, a six digit numeric field; and for the wild species, a seven character alphanumeric field. No wonder the databases couldn’t talk to each other at the most basic level.
The field name was easily resolved, incidentally. But why were there three structures? Well, when the collection was first established, the accession numbers from ‘00001’ to ‘99999’ were reserved for the O. sativa accessions. Then the numbers from ‘100000’ upwards were assigned to O. glaberrima and the wild species. However, thirteen wild species samples were found to be mixtures of two species. So they were divided and each given a suffix ‘A’ or ‘B’, such as ‘100569A’ and ‘100569B’ (not actual numbers, just illustrative). That meant that the wild species now had a seven character alphanumeric field. Why one component of each mixture wasn’t just assigned a new six digit number, as we did, I’ll never understand. Then we had to convert the O. sativa accession number into a six digit numeric field (‘000001’ etc.) and, with a consistent field name across databases (‘ACCNO’ perhaps), we could then link databases for the first time. In 1991, there was a gap in the sativa numbers (perhaps between ‘80000’ and ‘99999’) before the other accessions started at ‘100000’. Irrespective of rice type, we just assigned consecutive numbers as we received new samples, until there were no gaps at all in the sequence.
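For readers who think in code, the accession number clean-up can be sketched roughly as follows. This is only an illustration in Python, not what we actually ran on ORACLE at the time. The function names, the `SPLIT_MIXTURES` lookup and the numbers in it are all hypothetical, and I am assuming for the sake of the example that the ‘A’ sample keeps the base number while the ‘B’ sample receives a new one.

```python
import re

# Hypothetical lookup from legacy suffixed numbers to six digit replacements.
# The numbers are illustrative only, not real accessions.
SPLIT_MIXTURES = {"100569A": "100569", "100569B": "104321"}


def normalize_accno(raw, split_mixtures=SPLIT_MIXTURES):
    """Return the accession number as a six digit, zero-padded string."""
    value = str(raw).strip().upper()
    # Suffixed wild species numbers are resolved through the lookup table.
    if value in split_mixtures:
        return split_mixtures[value]
    # Anything else must be purely numeric and at most six digits long.
    if not re.fullmatch(r"\d{1,6}", value):
        raise ValueError(f"unrecognised accession number: {raw!r}")
    return value.zfill(6)


def harmonize_record(record):
    """Copy a record, storing the accession number under the key 'ACCNO'."""
    rec = dict(record)
    key = "ACC_NO" if "ACC_NO" in rec else "ACCNO"
    rec["ACCNO"] = normalize_accno(rec.pop(key))
    return rec
```

With every record passed through something like `harmonize_record`, a five digit sativa number such as ‘00123’ becomes ‘000123’ under the same field name as everything else, and a suffixed number that is missing from the lookup raises an error instead of slipping through unnoticed.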
White is white, yeah?
Now imagine achieving consistency right across the databases for all fields. We found that a character was often recorded/coded in different ways between rice types. So in one, the color ‘white’ might have been coded as a ‘1’, but as a ‘5’ in another. Or ‘1’ was ‘green’ in another database. And so it went on. We had to convert all codes to a meaningful and consistent description, each independent of the other. So ‘1’ was converted in one database to ‘white’ and ‘5’ to ‘white’ as well, etc. Having made all these conversions, with very careful cross checking along the way, and regular data back-ups, we finally had consistent field names and structures, and recording/coding of data for the entire germplasm collection. I don’t remember exactly how long this took, but it must have been between 18 months and two years.
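The code conversion can be pictured in the same way. Here is a minimal Python sketch, assuming one lookup table per rice type; the table contents, the `COLOR_CODES` name and the `decode` function are invented for illustration and are not the actual IRRI scoring codes.

```python
# Hypothetical legacy code tables, one per rice type (illustrative values).
COLOR_CODES = {
    "sativa": {"1": "white", "2": "red"},
    "glaberrima": {"5": "white", "1": "green"},
}


def decode(collection, code, tables=COLOR_CODES):
    """Translate a legacy code into its plain description."""
    try:
        return tables[collection][str(code)]
    except KeyError:
        # Fail loudly so an unmapped code is never converted silently.
        raise ValueError(f"no mapping for {collection!r} code {code!r}") from None


# The same code means different things depending on where it came from.
print(decode("sativa", 1))        # white
print(decode("glaberrima", 5))    # white
print(decode("glaberrima", 1))    # green
```

Raising an error for an unmapped code is a design choice in the spirit of the cross checking described above: better to stop and investigate than to write a wrong description into the merged data.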
The next step
But once completed, we could move on to the next phase of developing an online system to access genebank data, the International Rice Genebank Collection Information System (IRGCIS), with inputs from the former System-wide Genetic Resources Program (SGRP), an initiative of all the CGIAR centers with genebanks and genetic resources activities.
IRGCIS is a comprehensive system that manages the data of all rice germplasm conserved at IRRI. It is designed to manage the genebank operations more efficiently. It links all operations associated with germplasm conservation and management from acquisition of samples through seed multiplication, conservation, characterization, rejuvenation and distribution to end-users.
The system aims to:
- Assist the genebank staff in day-to-day activities.
- Facilitate recording, storage and maintenance of germplasm data.
- Allow the request of desired seeds and provide direct access to information about accessions in the genebank.
The data that are accessible are:
- Passport data.
- Morpho-agronomic descriptions.
- Evaluation data on the International Rice Genebank Collection.
- Germplasm availability.
A couple of years after IRGCIS, work began to develop the International Rice Information System (IRIS) as part of the International Crop Information System (ICIS) for the management of improved germplasm, breeding lines and the like, with full genealogy data. INGER also developed the INGERIS, but to tell the truth I’m not sure exactly where IRRI is these days with regard to cross system integration and the like.
But as I mentioned earlier, of one thing I am certain. Had we not taken the fundamental steps to clean up our data management act almost 25 years ago, we would not have had an effective platform to respond to global germplasm initiatives like the International Treaty or CBD, nor take advantage relatively easily of new data management software and hardware. It did require that broad perspective in the first instance. That I could bring to the party even though I didn’t have the technical know-how to undertake the detailed work myself.
¹ Source: the International Rice Genebank Collection Information System (IRGCIS), 8 June 2015.