Automate import of DBLP entries (1)
Import DBLP entries on a schedule
The import of DBLP entries into researchr should be automated so that the database is up-to-date all the time.
The entries are available from
Sander used (some version of) Acoda to insert these records into the database. I think this process can probably be improved.
But the basic idea is to translate each DBLP entry into an entry in the DBLP_Publication table, which is a straightforward representation of the entries. A WebDSL function then takes care of the actual conversion.
Submitted by Eelco Visser on 13 August 2010 at 13:52
I started writing a parser, which takes as input the location of the full XML file; a batch size for persisting batches to the db of researchr; and an mdate indicating from which modified date the publications should be processed.
The current implementation creates a DBLP_Publication entity for each publication it encounters in the dblp.xml file.
It does not yet persist it to the db.
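To make the three inputs concrete, here is a rough sketch of such a streaming parser in Python. It is not the actual parser from this issue. The names `DblpPublication`, `parse_dblp` and `PUBLICATION_TAGS`, and the choice of which record types count as publications, are assumptions for illustration. The real dblp.xml also relies on its DTD for character entities, which this sketch does not handle and which would likely need extra work.

```python
import io
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from datetime import date

# Record types treated as publications here (an assumption, not from the issue).
PUBLICATION_TAGS = {"article", "inproceedings", "proceedings", "book",
                    "incollection", "phdthesis", "mastersthesis"}


@dataclass
class DblpPublication:
    """Hypothetical stand-in for the DBLP_Publication entity."""
    key: str
    mdate: date
    kind: str
    title: str


def parse_dblp(source, batch_size, since):
    """Yield lists of at most `batch_size` publications whose mdate is later than `since`."""
    batch = []
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event is the start of the <dblp> root element
    for event, elem in context:
        if event != "end" or elem.tag not in PUBLICATION_TAGS:
            continue
        mdate = date.fromisoformat(elem.get("mdate"))
        if mdate > since:
            title = elem.find("title")
            batch.append(DblpPublication(
                key=elem.get("key"),
                mdate=mdate,
                kind=elem.tag,
                # itertext() also collects text inside nested markup such as <i>
                title="".join(title.itertext()) if title is not None else "",
            ))
            if len(batch) == batch_size:
                yield batch
                batch = []
        root.clear()  # drop already processed records so memory use stays bounded
    if batch:
        yield batch  # last, possibly smaller, batch
```

Each yielded batch would be the unit that gets persisted in one go, e.g. `for batch in parse_dblp("dblp.xml", 500, date(2013, 9, 20)): ...`.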
Some performance figures for parsing + constructing the entity instances on my local machine, each run with a different mdate.
JVM heap space set to 320MB. This allows parsing the whole dblp.xml of 1.3GB atm.
Note: the whole Hibernate initialization chain in WebDSL is triggered because of some code in the setters of entity properties, and it is included in these numbers.
Summary: Publications with mdate larger later than: 2000-09-20
Processed 3720862 Publications
Converted 3720862 Publications
Skipped 0 Publications
Processing time: 49831ms (49s)

Summary: Publications with mdate larger later than: 2011-09-20
Processed 3720862 Publications
Converted 1266362 Publications
Skipped 2454500 Publications
Processing time: 41616ms (41s)

Summary: Publications with mdate larger later than: 2012-09-20
Processed 3720862 Publications
Converted 596663 Publications
Skipped 3124199 Publications
Processing time: 31450ms (31s)

Summary: Publications with mdate larger later than: 2013-09-20
Processed 3720862 Publications
Converted 6357 Publications
Skipped 3714505 Publications
Processing time: 21189ms (21s)
Parsing and persisting are (almost) finished. Little left to do here: check character encoding and special symbols.
I’m now extending the researchr code:
- DBLPXMLImport entity is added for administering the date of last download, download urls, schedule, and functions for downloading and importing the xml as
- Writing (wip) org.researchr.DBLPImport.DBLPDownloader, which will download the dblp.gz to a temporary directory. It will extract the gz-archive in the same temp dir and return the location of the temp dir for further processing (read: parsing and importing by org.researchr.DBLPImport.DBLPImporter), followed by a deletion of the temp directory.
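The download, extract, import, delete flow could look roughly like the Python sketch below. This is only an illustration of the steps, not the DBLPDownloader code itself. The function names `download_and_extract` and `run_import` and the `import_xml` callback are made up for this example.

```python
import gzip
import shutil
import tempfile
import urllib.request
from pathlib import Path


def download_and_extract(url):
    """Download a gzip archive into a fresh temp dir, extract it there, return the dir."""
    temp_dir = Path(tempfile.mkdtemp(prefix="dblp-import-"))
    try:
        archive = temp_dir / "dblp.gz"
        with urllib.request.urlopen(url) as response, open(archive, "wb") as out:
            shutil.copyfileobj(response, out)  # stream to disk, no full read into memory
        with gzip.open(archive, "rb") as packed, open(temp_dir / "dblp.xml", "wb") as out:
            shutil.copyfileobj(packed, out)
    except BaseException:
        # Do not leave a half-filled temp dir behind when download or extraction fails.
        shutil.rmtree(temp_dir, ignore_errors=True)
        raise
    return temp_dir


def run_import(url, import_xml):
    """Download and extract, pass the XML path to `import_xml`, then delete the temp dir."""
    temp_dir = download_and_extract(url)
    try:
        return import_xml(temp_dir / "dblp.xml")
    finally:
        shutil.rmtree(temp_dir, ignore_errors=True)  # cleanup also runs if the import fails
```

Doing the deletion in a `finally` means a failed import does not leave a 1.3GB file lying around in the temp directory.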
auto-dblp-import-branch in its current state is now deployed on http://webdsl-test.ewi.tudelft.nl/researchr.
I will start a complete dblp import now, which overwrites existing DBLP_Publications only when mdate is higher than the current mdate of that dblp_pub. It will also skip homepage entries as requested by Eelco.
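As a small illustration of that overwrite rule, here is a Python sketch with a plain dict standing in for the DBLP_Publication table. The function name `merge_record` is hypothetical. Recognising homepage entries by a key starting with "homepages/" is an assumption on my side; the issue does not say how the actual import detects them.

```python
from datetime import date


def merge_record(store, key, mdate, fields):
    """Insert or overwrite one record in `store` (key -> (mdate, fields)).

    Returns a short string describing what happened.
    """
    if key.startswith("homepages/"):  # assumed marker for homepage entries
        return "skipped-homepage"
    existing = store.get(key)
    if existing is not None and mdate <= existing[0]:
        return "kept-existing"  # the stored record is at least as recent
    store[key] = (mdate, fields)
    return "inserted" if existing is None else "overwritten"
```

Because a record with an equal mdate is left alone, re-running the full import over an unchanged dblp.xml would not rewrite anything.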
The auto-dblp-import branch is now merged into the trunk (r779). Considering it finished.