We have DBLP_Publication records that were previously imported using Acoda. Their titles and author names are badly encoded whenever they contain a special character.

A subset of these has already been converted to Publication entities; the others have not been converted yet.

The DBLPImporter and DBLPEntityPersister classes currently skip publications that are already imported, with one exception: when the newly encountered entry from the XML has a later mdate than the one already in the database, the importer overwrites all properties except DBLP_Publication.publication.
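
To make the current rule concrete, here is a minimal sketch in Python. It is not the actual DBLPImporter or DBLPEntityPersister code: the DblpRecord type, its field names, and merge_entry are hypothetical names chosen for illustration, and only the skip/overwrite decision described above is modelled.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import List, Optional


@dataclass
class DblpRecord:
    """Hypothetical stand-in for a DBLP_Publication row."""
    key: str
    mdate: date
    title: str
    authors: List[str]
    # Link to the converted Publication entity, if any (assumed to be an id here).
    publication: Optional[str] = None


def merge_entry(existing: Optional[DblpRecord], incoming: DblpRecord) -> DblpRecord:
    """Return the record that should be stored after seeing `incoming` in the XML."""
    if existing is None:
        # Not imported yet: store the new entry as-is.
        return incoming
    if incoming.mdate <= existing.mdate:
        # Already imported and not newer: skip.
        return existing
    # Newer mdate: take every property from the XML entry,
    # but keep the existing link to the converted Publication.
    return replace(incoming, publication=existing.publication)
```

Under this rule a badly encoded record is only repaired when DBLP happens to publish a later mdate for it, which is why a forced mode is being considered.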

We might add a mode that forcibly overwrites the titles and author names of all existing DBLP_Publications (are there more fields affected?). But then we also need to update the Publications that were already converted.

After the symbol fix, entries like these should display correctly:
http://researchr.org/dblp/conf%5Esf-egc%5EsX10
http://webdsl-test.ewi.tudelft.nl/researchr/dblp/conf%5Esf-egc%5EsX10

Submitted by Elmer van Chastelet on 7 October 2013 at 11:56

On 7 October 2013 at 12:14 Eelco Visser commented:

(1) apparently not all of these records were already converted

(2) even if we overwrite the DBLP_Publication record, we don’t have to re-convert the publication; we can postpone that to a later date. But at least we know the character encodings will be good.

(3) when the DBLP_Publication record has not yet been converted, it makes sense to overwrite the record anyway
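
A rough Python sketch of how points (2) and (3) could combine in a forced-overwrite mode. This is an illustration under assumptions, not the implemented behaviour: ImportedRecord, force_overwrite, and the "reconvert later" flag are hypothetical names, and only title and author names are compared.

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Tuple


@dataclass
class ImportedRecord:
    """Hypothetical stand-in for a DBLP_Publication row."""
    key: str
    title: str
    authors: List[str]
    # Link to the converted Publication entity, if any (assumed to be an id here).
    publication: Optional[str] = None


def force_overwrite(
    existing: Optional[ImportedRecord], incoming: ImportedRecord
) -> Tuple[ImportedRecord, bool]:
    """Return (record to store, whether the converted Publication is now stale)."""
    if existing is None:
        return incoming, False
    # Always take the freshly parsed fields, regardless of mdate,
    # while preserving the link to an already converted Publication.
    updated = replace(incoming, publication=existing.publication)
    changed = (existing.title, existing.authors) != (incoming.title, incoming.authors)
    # Point (2): do not re-convert now, only report that it can be done later.
    reconvert_later = existing.publication is not None and changed
    return updated, reconvert_later
```

For a record that was never converted, the flag is always False, which matches point (3): overwriting it has no downstream effect.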


On 16 October 2013 at 19:25 Elmer van Chastelet commented:

Implemented. Needs to be extended to also check the ‘school’ field.
