Genealogy Merging Algorithms

Introduction

Most personal genealogy software packages available today suffer from weak merging functionality, which makes collaboration with others much more difficult than it needs to be. In particular, if someone shares a copy of their electronic genealogical database with a relative, and then both make additions and/or changes to the data, there is currently no reasonable way in most genealogical programs to merge these changes together.

A more powerful approach was presented in the paper entitled Graph-based Remerging of Genealogical Databases by Randy Wilson, which was presented at the 2001 Family History Technology Workshop on March 29, 2001, at Brigham Young University.

Click here for a PDF file of that paper.

The ideas in that paper and on this web page are available to be freely used by anyone who wants to use them. Including these algorithms in your genealogical software, thus allowing people to collaborate and avoid wasting more time would be payment enough to me.

Further Developments

Since the publication of that report, additional ideas have come up that could improve the usefulness of the algorithm. After reading the paper, feel free to also read these ideas, and contact me if you have any further suggestions, questions, or if you would be interested in implementing these algorithms into your own code.

Incremental Updates

The graph-based merging algorithm allows you to share your database with someone else, and then "resynchronize" them back together by using the matching individuals to determine how any non-matching individuals fit together into the relationship graph.

However, many people's modems are slow, so sending a genealogical database that is several megabytes in size may not be practical for some people. Fortunately, it is also not necessary after the first copy of the database is sent.

Once both people have a copy of the same database (on a certain date), they can both make updates independently. When someone wants to send another update, they can do an "incremental update," which works as follows.

This results in a small subset of the total database, which can then be easily e-mailed to a relative. Upon receipt, their program would do the graph-based "remerging" algorithm as usual, but there would just be fewer individuals in the incoming database. The "anchored" individuals would match, causing the new or modified individuals to fit into the relationship graph in the correct place. The thousands of unchanged individuals would not need to be e-mailed nor reviewed by the person receiving the update.

Current Status

I have not yet implemented the merging algorithm, because I do not have access to the source code of a major genealogy program (especially the ones I use), and most of the programming for me would be involved in the user interface rather than the actual merging code.

However, the makers of Personal Ancestral File (PAF) have expressed an interest in including this algorithm into their product, and I'm hoping others will as well (especially Reunion for Macintosh, which is what I am using at home).

In the mean time, I am planning on writing a utility to at least operate on GEDCOM files, though this is not ideal, since going to and from GEDCOM files tends to lose information. The very first step will be to write a "GEDDIFF" program that simply tells what is in one database that is not in the other, and what is different, so that you can at least tell what has changed, though it would still mean typing in the changes yourself.

The next step would be to allow the user to modify the text file output by GEDDIFF in order to automatically incorporate the additional information from the second database into the first, as well as resolve conflicts.

Better yet would be to have a graphical user interface to handle the process, and the best solution would be to integrate it into real genealogy programs so you can use it without messing up your original database by having to export to GEDCOM.

Contact Information

If you have any ideas related to merging genealogical databases, or if you would like to include these algorithms in your code, discuss them, volunteer to write some code, or anything else, please contact me!