Import DBLP entries on a schedule

The import of DBLP entries into researchr should be automated so that the database stays up to date.

The entries are available from

Sander used (some version of) Acoda to insert these records into the database. I think this process can probably be improved.

But the basic idea is to translate each DBLP entry into an entry in the DBLP_Publication table, which is a straightforward representation of the entries. A WebDSL function then takes care of the actual conversion.

Submitted by Eelco Visser on 13 August 2010 at 13:52

On 14 July 2012 at 11:19 Eelco Visser tagged 64

On 14 July 2012 at 13:25 Eelco Visser tagged 65

On 14 July 2012 at 13:25 Eelco Visser removed tag 64

On 4 January 2013 at 15:24 Eelco Visser removed tag feature

On 26 September 2013 at 15:21 Elmer van Chastelet commented:

I started writing a parser that takes three inputs: the location of the full XML file, a batch size for persisting publications to the researchr db, and an mdate cutoff. Only publications modified after that date are processed.

The current implementation creates a DBLP_Publication entity for each publication it encounters in the dblp.xml file.
It does not yet persist them to the db.
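
The approach described above could look roughly like the following. This is a minimal sketch, not the actual importer: the class name `DblpMdateFilter`, the `Summary` counters, and the `persistBatch` callback are hypothetical, and it assumes that every direct child of the `<dblp>` root is one publication record carrying `mdate` and `key` attributes. It streams the file with StAX so the whole document never has to fit in memory, and it hands record keys (standing in for DBLP_Publication entities) to the callback in batches.

```java
import java.io.Reader;
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch: stream dblp.xml and keep only records modified after a cutoff.
class DblpMdateFilter {

    // Counters corresponding to the Processed / Converted / Skipped summary lines.
    static final class Summary {
        int processed;
        int converted;
        int skipped;
    }

    static Summary parse(Reader xml, LocalDate since, int batchSize,
                         Consumer<List<String>> persistBatch) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(xml);
        Summary summary = new Summary();
        List<String> batch = new ArrayList<>();
        int depth = 0;
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                depth++;
                if (depth == 2) { // assumption: children of the root are the records
                    summary.processed++;
                    String mdate = reader.getAttributeValue(null, "mdate");
                    if (mdate != null && LocalDate.parse(mdate).isAfter(since)) {
                        summary.converted++;
                        batch.add(reader.getAttributeValue(null, "key"));
                        if (batch.size() == batchSize) {
                            // Hand over a copy so the callback may keep the list.
                            persistBatch.accept(new ArrayList<>(batch));
                            batch.clear();
                        }
                    } else {
                        summary.skipped++;
                    }
                }
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                depth--;
            }
        }
        if (!batch.isEmpty()) {
            persistBatch.accept(new ArrayList<>(batch)); // flush the last partial batch
        }
        reader.close();
        return summary;
    }
}
```

A real importer would build full entities from the child elements (authors, title, year, and so on) instead of collecting keys; the counting and batching logic would stay the same.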

Some figures

Some performance figures for parsing and constructing the entity instances on my local machine, one run per mdate cutoff.
The JVM heap space was set to 320MB, which is currently enough to parse the whole 1.3GB dblp.xml.
Note: these numbers include the whole Hibernate initialization chain in WebDSL, which is triggered by some code in the setters of entity properties.

Summary:
Publications with mdate larger later than: 2000-09-20
Processed 3720862 Publications
Converted 3720862 Publications
Skipped 0 Publications
Processing time: 49831ms (49s)
Summary:
Publications with mdate larger later than: 2011-09-20
Processed 3720862 Publications
Converted 1266362 Publications
Skipped 2454500 Publications
Processing time: 41616ms (41s)
Summary:
Publications with mdate larger later than: 2012-09-20
Processed 3720862 Publications
Converted 596663 Publications
Skipped 3124199 Publications
Processing time: 31450ms (31s)
Summary:
Publications with mdate larger later than: 2013-09-20
Processed 3720862 Publications
Converted 6357 Publications
Skipped 3714505 Publications
Processing time: 21189ms (21s)

On 26 September 2013 at 16:31 Elmer van Chastelet removed tag @sandervermolen

On 26 September 2013 at 16:32 Elmer van Chastelet tagged @elmer

On 27 September 2013 at 16:18 Elmer van Chastelet commented:

Progress update:

Parsing and persisting are (almost) finished. One small todo remains: check character encoding and special symbols.
I’m now extending the researchr code:

  • A DBLPXMLImport entity has been added for administering the date of last download, download urls, schedule, and functions for downloading and importing the xml as DBLP_Publications.
  • org.researchr.DBLPImport.DBLPDownloader (wip) will download the dblp.dtd and dblp.gz to a temporary directory, extract the gz archive in the same temp dir, and return the location of the temp dir for further processing (read: parsing and importing by org.researchr.DBLPImport.DBLPImporter). The temp directory is deleted afterwards.
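
The download step described in the list above might be sketched as follows. This is not the DBLPDownloader code: the class `DblpDownloadSketch`, its method names, and the local file names are assumptions made for illustration, and the two URIs are passed in by the caller (presumably from the stored download urls).

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;
import java.util.zip.GZIPInputStream;

// Hypothetical sketch: fetch dblp.dtd and the gzipped dump, extract, clean up later.
class DblpDownloadSketch {

    // Downloads both files into a fresh temp dir, extracts the archive next to the
    // DTD, and returns the temp dir so a parser can pick up dblp.xml from there.
    static Path fetchAndExtract(URI dtdLocation, URI gzLocation) throws IOException {
        Path tmp = Files.createTempDirectory("dblp-import");
        download(dtdLocation, tmp.resolve("dblp.dtd"));
        Path archive = tmp.resolve("dblp.gz");
        download(gzLocation, archive);
        try (InputStream in = new GZIPInputStream(Files.newInputStream(archive))) {
            Files.copy(in, tmp.resolve("dblp.xml"));
        }
        return tmp;
    }

    static void download(URI source, Path target) throws IOException {
        try (InputStream in = source.toURL().openStream()) {
            Files.copy(in, target);
        }
    }

    // Deletes the temp dir once importing is done; children are removed before parents.
    static void deleteRecursively(Path dir) throws IOException {
        try (Stream<Path> paths = Files.walk(dir)) {
            paths.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
    }
}
```

Keeping the DTD in the same directory as the extracted XML lets the parser resolve the DTD reference in dblp.xml relative to the file. The caller would wrap the import in try/finally so that `deleteRecursively` also runs when parsing fails.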

On 3 October 2013 at 10:48 Elmer van Chastelet commented:

The auto-dblp-import branch in its current state is now deployed on http://webdsl-test.ewi.tudelft.nl/researchr.
I will start a complete dblp import now, which overwrites an existing DBLP_Publication only when the incoming mdate is later than the current mdate of that dblp_pub. It will also skip homepage entries, as requested by Eelco.
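
The overwrite rule can be illustrated with a small predicate. This is a sketch under assumptions, not the deployed code: the class `DblpUpsertRule` and the in-memory map of stored mdates are hypothetical, homepage entries are assumed to be recognizable by a key starting with `homepages/`, and publications not yet in the database are assumed to be inserted.

```java
import java.time.LocalDate;
import java.util.Map;

// Hypothetical sketch of the "overwrite only when newer, skip homepages" rule.
class DblpUpsertRule {

    // Assumption: homepage entries use keys under the "homepages/" prefix.
    static boolean isHomepage(String key) {
        return key.startsWith("homepages/");
    }

    // storedMdates maps a record key to the mdate currently in the database.
    static boolean shouldWrite(String key, LocalDate incomingMdate,
                               Map<String, LocalDate> storedMdates) {
        if (isHomepage(key)) {
            return false; // homepage entries are never imported
        }
        LocalDate current = storedMdates.get(key);
        // Unknown keys are new publications; known keys are overwritten only if newer.
        return current == null || incomingMdate.isAfter(current);
    }
}
```

With this rule a repeated import of an unchanged dump writes nothing, since equal mdates do not count as newer.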


On 4 October 2013 at 16:22 Elmer van Chastelet commented:

The auto-dblp-import branch is now merged into the trunk (r779). I consider this issue finished.


On 4 October 2013 at 16:22 Elmer van Chastelet closed this issue.
