Automate import of DBLP entries (1)
Import DBLP entries on a schedule
The import of DBLP entries into researchr should be automated so that the database is up-to-date all the time.
The entries are available from
Sander used (some version of) Acoda to insert these records into the database. I think this process can probably be improved.
But the basic idea is to translate each DBLP entry into an entry in the DBLP_Publication table, which is a straightforward representation of the entries. A WebDSL function then takes care of the actual conversion.
Submitted by Eelco Visser on 13 August 2010 at 13:52
Issue Log
I started writing a parser, which takes as input the location of the full XML file; a batch size for persisting batches to the db of researchr; and an mdate indicating from which modified date the publications should be processed.
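The three inputs described here (XML location, batch size, mdate cutoff) can be illustrated with a minimal streaming sketch. It is written in Python for brevity and is not the researchr code: the function name `import_dblp`, the `persist` callback, the record dictionary layout, and the `RECORD_TAGS` set are all assumptions made for illustration. It also assumes every record carries an `mdate` attribute and that the input contains no DTD-defined character entities, which the real dblp.xml does use.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Record element names, assumed from the public dblp.dtd.
RECORD_TAGS = {"article", "inproceedings", "proceedings", "book",
               "incollection", "phdthesis", "mastersthesis", "www"}


def import_dblp(source, batch_size, since, persist):
    """Stream records from `source`, hand those modified after `since`
    to `persist` in lists of at most `batch_size`, and return counters."""
    stats = {"processed": 0, "converted": 0, "skipped": 0}
    batch = []
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag not in RECORD_TAGS:
            continue
        stats["processed"] += 1
        mdate = date.fromisoformat(elem.get("mdate"))
        if mdate > since:
            title = elem.find("title")
            batch.append({
                "key": elem.get("key"),
                "type": elem.tag,
                "mdate": mdate,
                # itertext() also collects text inside markup such as <i>.
                "title": "".join(title.itertext()) if title is not None else None,
                "authors": [a.text for a in elem.findall("author")],
            })
            stats["converted"] += 1
        else:
            stats["skipped"] += 1
        # Drop the record's children and text once it has been handled.
        elem.clear()
        if len(batch) >= batch_size:
            persist(batch)
            batch = []
    if batch:
        persist(batch)  # flush the final, possibly smaller, batch
    return stats
```

Passing `persist` as a callback keeps the parsing loop independent of how a batch is written to the database, so the same loop can be timed with a no-op callback before persistence is in place.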
Current implementation creates a DBLP_Publication entity for each publication it encounters in the dblp.xml file. It does not yet persist it to the db.
Some figures
Some performance figures for parsing + constructing the entity instances on my local machine, each run with a different mdate cutoff.
JVM heap space is set to 320MB. This allows parsing the whole dblp.xml, which is 1.3GB at the moment.
Note: The whole hibernate initialization chain in WebDSL is triggered because of some code in the setters of entity properties, and its cost is included in these numbers.

Summary: Publications with mdate larger later than: 2000-09-20
Processed 3720862 Publications
Converted 3720862 Publications
Skipped 0 Publications
Processing time: 49831ms (49s)

Summary: Publications with mdate larger later than: 2011-09-20
Processed 3720862 Publications
Converted 1266362 Publications
Skipped 2454500 Publications
Processing time: 41616ms (41s)

Summary: Publications with mdate larger later than: 2012-09-20
Processed 3720862 Publications
Converted 596663 Publications
Skipped 3124199 Publications
Processing time: 31450ms (31s)

Summary: Publications with mdate larger later than: 2013-09-20
Processed 3720862 Publications
Converted 6357 Publications
Skipped 3714505 Publications
Processing time: 21189ms (21s)
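A small Python sketch for reading these summaries: it checks that converted plus skipped equals processed in every run and derives the converted share and overall throughput. The numbers are copied from the summaries; the `RUNS` layout and the `summarize` helper are only illustrative. Since the last run converts very few records, its time is probably close to the cost of scanning the file alone, though that is an inference from these four runs rather than a separate measurement.

```python
# (mdate cutoff, processed, converted, skipped, processing time in ms),
# copied from the four summaries.
RUNS = [
    ("2000-09-20", 3720862, 3720862, 0, 49831),
    ("2011-09-20", 3720862, 1266362, 2454500, 41616),
    ("2012-09-20", 3720862, 596663, 3124199, 31450),
    ("2013-09-20", 3720862, 6357, 3714505, 21189),
]


def summarize(run):
    """Return converted share and records per second for one run."""
    cutoff, processed, converted, skipped, millis = run
    # Every processed record must be either converted or skipped.
    if converted + skipped != processed:
        raise ValueError(f"counts do not add up for cutoff {cutoff}")
    return {
        "cutoff": cutoff,
        "converted_share": converted / processed,
        "records_per_second": processed / (millis / 1000),
    }


for run in RUNS:
    row = summarize(run)
    print(f"{row['cutoff']}: {row['converted_share']:.1%} converted, "
          f"{row['records_per_second']:,.0f} records/s")
```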
Progress update:
Parsing and persisting is (almost) finished. The only thing left to do here is to check character encoding and special symbols.
I’m now extending the researchr code:
- A DBLPXMLImport entity is added for administering the date of last download, download URLs, schedule, and functions for downloading and importing the XML as DBLP_Publications.
- Writing (wip) org.researchr.DBLPImport.DBLPDownloader, which will download the dblp.dtd and dblp.gz to a temporary directory. It will extract the gz-archive in the same temp dir and return the location of the temp dir for further processing (read: parsing and importing by org.researchr.DBLPImport.DBLPImporter), followed by a deletion of the temp directory.
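The download, extract, import, and clean-up sequence described in the second item can be sketched as follows. This is a Python illustration under stated assumptions, not the DBLPDownloader class itself: the function names `download_dblp` and `run_import`, the `importer` callback, and the URL parameters are hypothetical, and the local file names simply mirror the ones mentioned above.

```python
import gzip
import shutil
import tempfile
import urllib.request
from pathlib import Path


def download_dblp(dtd_url, gz_url):
    """Fetch dblp.dtd and dblp.gz into a fresh temp dir, unpack the
    archive next to them as dblp.xml, and return the temp dir."""
    temp_dir = Path(tempfile.mkdtemp(prefix="dblp-import-"))
    try:
        for url, name in ((dtd_url, "dblp.dtd"), (gz_url, "dblp.gz")):
            with urllib.request.urlopen(url) as response, \
                    open(temp_dir / name, "wb") as out:
                shutil.copyfileobj(response, out)  # streamed, not held in memory
        with gzip.open(temp_dir / "dblp.gz", "rb") as packed, \
                open(temp_dir / "dblp.xml", "wb") as out:
            shutil.copyfileobj(packed, out)
    except Exception:
        # Do not leave a half-filled directory behind on failure.
        shutil.rmtree(temp_dir, ignore_errors=True)
        raise
    return temp_dir


def run_import(dtd_url, gz_url, importer):
    """Download, pass the extracted XML path to `importer`, then delete
    the temp dir whether or not the import succeeded."""
    temp_dir = download_dblp(dtd_url, gz_url)
    try:
        return importer(temp_dir / "dblp.xml")
    finally:
        shutil.rmtree(temp_dir, ignore_errors=True)
```

Keeping the DTD in the same directory as the extracted XML matters for parsers that resolve the DTD relative to the document, which is presumably why both files go to the same temp dir.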
The auto-dblp-import branch in its current state is now deployed on http://webdsl-test.ewi.tudelft.nl/researchr.
I will start a complete dblp import now, which overwrites existing DBLP_Publications only when the incoming mdate is higher than the current mdate of that dblp_pub. It will also skip homepage entries as requested by Eelco.
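The overwrite rule and the homepage filter can be sketched as a small decision function. This is a Python illustration, not the researchr implementation: the in-memory `store` dictionary stands in for the DBLP_Publication table, the `upsert` and `is_homepage` names are hypothetical, and recognizing homepage entries by a key prefix of `homepages/` is an assumption about how DBLP keys its person pages.

```python
from datetime import date


def is_homepage(key):
    # Assumption: DBLP person pages have keys that start with "homepages/".
    return key.startswith("homepages/")


def upsert(store, record):
    """Apply one incoming record to `store` (key -> record) and report
    what happened: 'skipped', 'inserted', 'updated' or 'unchanged'."""
    key = record["key"]
    if is_homepage(key):
        return "skipped"
    existing = store.get(key)
    if existing is None:
        store[key] = record
        return "inserted"
    # Overwrite only when the incoming record is strictly newer.
    if record["mdate"] > existing["mdate"]:
        store[key] = record
        return "updated"
    return "unchanged"
```

Because the comparison is strict, re-running a complete import over an unchanged dump leaves every stored record untouched.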
The auto-dblp-import branch is now merged into the trunk (r779). Considering it finished.