Annex A:
Duplicates Management System

The first step is to prepare the input file for the program. This involves a prescan of the input file to identify the date/time range covered by the data to be processed through the duplicates management system and loaded into the database.

Once the prescan has identified the date/time range, a retrieval of data from all ocean vertical profile type databases for that time range is submitted. The data from the databases and the input file are sort/merged by date/time and the resulting file serves as input to the duplicates management program.
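The prescan, retrieval and sort/merge steps above can be sketched as follows. This is a minimal illustration, not the MEDS implementation: records are assumed to be dictionaries with a "time" field, each database is modelled as an in-memory list, and the function names prescan, retrieve and sort_merge are hypothetical.

```python
import heapq
from datetime import datetime

def prescan(input_records):
    """Return the (earliest, latest) date/time found in the input file."""
    times = [rec["time"] for rec in input_records]
    return min(times), max(times)

def retrieve(database, start, end):
    """Select the records of one database that fall in the prescan range."""
    return [rec for rec in database if start <= rec["time"] <= end]

def sort_merge(input_records, databases):
    """Merge the input file and all database retrievals into one
    time-ordered list (assumed record layout: dict with a 'time' key)."""
    start, end = prescan(input_records)
    by_time = lambda rec: rec["time"]
    streams = [sorted(input_records, key=by_time)]
    streams += [sorted(retrieve(db, start, end), key=by_time) for db in databases]
    return list(heapq.merge(*streams, key=by_time))
```

Passing the databases as a list lets the same call cover however many profile databases exist, so the duplicates program sees a single time-ordered stream.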

This process enables the duplicates management system to deal with duplicates within the input file and between the input file and the databases. For example, it can identify a CTD observation that duplicates an IGOSS TESAC received earlier and specify the de-activation of the TESAC, so that requests for temperature and salinity data do not return duplicate observations to the user.

Potential duplicates are reviewed with respect to a target message. The review is forward in time for a window of delta-t. There is no need to go backwards as the target message would already have been reviewed with respect to a previous target.

The list of potential duplicates is established by examining each message in the delta-t window with respect to the target message in terms of
i) coincidences of platform identification, date and time; and
ii) both observations occurring in a delta-t, delta-d window (15 minutes and 5 km in the initial implementation of the system).
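The forward scan and the two listing criteria can be sketched as below. The 15 minute and 5 km values come from the text; everything else is an assumption made for illustration: the record fields (platform, time, lat, lon), the function names, and the use of the haversine formula with a 6371 km Earth radius to measure separation.

```python
from datetime import datetime, timedelta
from math import radians, sin, cos, asin, sqrt

DELTA_T = timedelta(minutes=15)   # from the text
DELTA_D_KM = 5.0                  # from the text

def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance by the haversine formula (assumed method)."""
    p1, p2 = radians(lat1), radians(lat2)
    dlat = p2 - p1
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(p1) * cos(p2) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

def potential_duplicates(messages, i):
    """List potential duplicates of messages[i].

    messages must be sorted by time; only later entries are examined,
    since earlier ones were already reviewed as targets themselves.
    """
    target = messages[i]
    found = []
    for other in messages[i + 1:]:
        if other["time"] - target["time"] > DELTA_T:
            break  # past the delta-t window; nothing further can match
        # Criterion i): same platform identification, date and time.
        same_id_time = (other["platform"] == target["platform"]
                        and other["time"] == target["time"])
        # Criterion ii): within delta-d (delta-t already holds here).
        close = distance_km(target["lat"], target["lon"],
                            other["lat"], other["lon"]) <= DELTA_D_KM
        if same_id_time or close:
            found.append(other)
    return found
```

Because the input is time-ordered, the loop can stop at the first message outside the window rather than scanning the whole file.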

Once the list of potential duplicates is established with respect to the target observation and all observations within the delta-t window forward, more detailed analysis of the list occurs.

The first step is to attempt to remove entries from the list according to two criteria. The first check examines each observation once more relative to the target: if its position differs from the position of the target by more than delta-d (5 km), the observation is removed from the list. This can occur when an observation was listed because of an identification/time coincidence.

The second check examines the sub-surface information for the target and each other observation on the duplicates list.

At this point it becomes necessary to consider an additional factor: the source of the observation, which is carried in the databases as a variable named STREAM_IDENT.

The STREAM_IDENT identifies the observation source as a MEDS BATHY, a delayed mode XBT, an observation from the scientific QC stream, etc. It is relatively easy to compare sub-surface profiles from two IGOSS BATHY messages because a duplicate observation should have the same depths and temperatures, or very nearly so. However, a comparison of a BATHY trace to a delayed mode XBT trace is not straightforward.

This means that the sub-surface test can at this time only be carried out automatically on observations from the same or similar streams. Similar streams would include the delayed mode XBT and scientific QC streams as the sub-surface variables are not changed in this step.

At this time, the concept of reviewable and non-reviewable decisions by the duplicates checking program is introduced. Once the duplicates checking program has produced an output file containing all data and the database update decision, a post processor is run to permit review and alteration of "reviewable" decisions by an operator. At the post processor stage, non-reviewable decisions are accepted and are not referred to the operator.

The following are the tests, and the type of decision each produces (reviewable or non-reviewable), that are included in the sub-surface checking algorithm. Note that the algorithm must deal with cases where different profiles are attached to the two messages. This would occur for a CTD reporting salinity as well as temperature when the IGOSS message included only temperature.

1. If the observations are from non-similar streams, the profiles are assumed to be duplicates and the decision is reviewable.
2. If for all profiles, the depths and variables are the same, the profiles are assumed to be duplicates and the decision is non-reviewable.
3. If for all profiles, the depths and variables are the same down to some level, and that agreement covers more than n levels or 80% of the maximum depth range, the profiles are assumed to be duplicates and the decision is reviewable.
4. If more than 80% of depths and variables are different for all profiles, the observation is assumed not to be a duplicate and is removed from the duplicates list. The decision is non-reviewable.
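The four tests above can be sketched as a single classification function. This is an illustration under several assumptions that are not in the source: the stream names, the value chosen for n, exact equality as the comparison (the text allows "very nearly so"), comparison restricted to the profile types both messages carry, and a reviewable "duplicate" outcome for any case that none of the four tests covers.

```python
# Hypothetical stream names; groups of streams treated as "similar".
SIMILAR_STREAMS = [{"delayed_xbt", "scientific_qc"}]
N_LEVELS = 5  # hypothetical value for "n"

def similar(a, b):
    return a == b or any(a in g and b in g for g in SIMILAR_STREAMS)

def matching_prefix(p, q):
    """Count leading levels whose (depth, value) pairs agree exactly."""
    count = 0
    for a, b in zip(p, q):
        if a != b:
            break
        count += 1
    return count

def agrees_to_depth(p, q):
    """Test 3: agreement covers more than n levels or 80% of depth range."""
    k = matching_prefix(p, q)
    if k == 0:
        return False
    max_depth = max(p[-1][0], q[-1][0])
    return k > N_LEVELS or p[k - 1][0] >= 0.8 * max_depth

def fraction_different(p, q):
    same = sum(1 for a, b in zip(p, q) if a == b)
    return 1 - same / max(len(p), len(q))

def classify(target, other):
    """Return (is_duplicate, reviewable). Profiles are non-empty lists of
    (depth, value) pairs in increasing depth, keyed by profile type."""
    if not similar(target["stream"], other["stream"]):
        return True, True                      # test 1
    common = sorted(set(target["profiles"]) & set(other["profiles"]))
    pairs = [(target["profiles"][k], other["profiles"][k]) for k in common]
    if not pairs:
        return True, True                      # assumption: refer to operator
    if all(p == q for p, q in pairs):
        return True, False                     # test 2
    if all(agrees_to_depth(p, q) for p, q in pairs):
        return True, True                      # test 3
    if all(fraction_different(p, q) > 0.8 for p, q in pairs):
        return False, False                    # test 4
    return True, True                          # assumption: grey area, reviewable
```

Sending the uncovered cases to the operator is a choice made here to match the stated goal of referring grey area decisions to review; the source does not specify it.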

The goal of this strategy is to refer all grey area decisions to the operator in the post-processor phase. As capabilities in duplicate detection improve, attempts will be made to implement software to reduce the requirements for operator review.

After completion of the final duplicates list, further processing becomes a question of deciding on the action to be taken with each observation on the list. These decisions are based on a prioritization of the STREAM_IDENTs occurring in the input stream (which now contains the data from the database as well) and on whether each observation comes from the original input stream or from the database.

The next group of decisions determines, for each observation on the duplicates list, whether it is to be updated into the database, removed from the database, or have its "active status" altered. The principles are as follows.

1. Duplicates from the same or similar input streams are not entered into the database. If such a duplicate occurs, the decision depends on a control parameter set for the run, which specifies either "database priority" or "input stream priority". Under "database priority", the database copy and the duplicate in the input stream are both marked to be "ignored" at database update, which leaves the existing copy in the database. Under "input stream priority", the database copy is marked to be "deleted" from the database and the input stream copy is marked to be "updated" into the database, which replaces the copy in the database with the input stream copy.

This facility provides the ability to correct data in the database by reprocessing the data and then updating back into the database.

2. If there are duplicates from two different input streams, the observation with the highest priority in the STREAM_IDENT priority list is chosen to be the active copy. The observation(s) in the database with the lower priority will be marked to be "flagged inactive" during the update. The highest-priority observation will be marked to be "updated" if it is not already in the database, or "ignored" in the update if it is already present and is to be left there.
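The two principles above can be sketched as follows. The action names and the two control parameter values are taken from the text. The priority order, the stream names, the record fields and the function names are hypothetical. The text only states that lower-priority copies in the database are "flagged inactive"; applying the same flag to lower-priority copies that come from the input file is an assumption of this sketch.

```python
# Hypothetical STREAM_IDENT priority list, highest priority first.
STREAM_PRIORITY = ["scientific_qc", "delayed_xbt", "bathy"]

def same_stream_actions(control):
    """Principle 1: return (database_action, input_action)."""
    if control == "database priority":
        return "ignored", "ignored"    # existing database copy is kept
    if control == "input stream priority":
        return "deleted", "updated"    # input copy replaces database copy
    raise ValueError("unknown control parameter: %r" % control)

def different_stream_actions(observations):
    """Principle 2: return one action per observation, in input order.

    Each observation is a dict with 'stream' and 'origin', where origin
    is 'database' or 'input' (assumed field names).
    """
    best = min(observations, key=lambda o: STREAM_PRIORITY.index(o["stream"]))
    actions = []
    for obs in observations:
        if obs is best:
            # Active copy: leave it if already stored, otherwise load it.
            actions.append("ignored" if obs["origin"] == "database" else "updated")
        else:
            # Assumption: input-file copies are treated like database copies.
            actions.append("flagged inactive")
    return actions
```

Keeping the priority order in a single list means a change in stream ranking requires no change to the decision logic.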

Thus, all observations in the input stream to the duplicates management system (including the ones that have been extracted from the databases following the prescan) are written to an output file with flags to indicate the appropriate action to be taken at update time. This output file is passed to the post processor.

The post processor is an interactive program that presents textual and graphic information to the operator in a form that allows him or her to judge whether the decision made by the duplicates management system was appropriate. If the operator disagrees with the decision, the decision can be altered at this stage relative to the observations that were on the final duplicates list. The final product of the post processor program is a data file that is ready for input to the database update system.
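The split between reviewable and non-reviewable decisions at the post processor stage can be sketched as below. The interactive display is replaced here by a callback, ask_operator, which is a hypothetical stand-in for the operator's judgement; the record fields "action" and "reviewable" are likewise assumed.

```python
def post_process(decisions, ask_operator):
    """Return the final decisions for the database update.

    Non-reviewable decisions pass through unchanged. Reviewable ones are
    shown to the operator, whose answer (possibly the same action)
    becomes the final action.
    """
    final = []
    for decision in decisions:
        if decision["reviewable"]:
            action = ask_operator(decision)
            decision = {**decision, "action": action}
        final.append(decision)
    return final
```

A callback keeps the decision flow testable without the textual and graphic display that the real program presents.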

Note that in the MEDS implementation of the duplicates management system, there are several separate databases including a BATHY database, a TESAC database, a bottle database, an MBT/XBT database, and a CTD database. The processing systems described here open and deal with all these databases during duplicates checking and update phases of the data management system as if they were in fact one database.