Recover from badly imported symbols in author names and titles of DBLP Publications
We have DBLP_Publication entities that were previously imported using Acoda, but their titles and author names are badly encoded whenever they contain a special character. A subset of these have already been converted to Publication entities; the others have not yet been converted.
The DBLPImporter and DBLPEntityPersister classes currently skip publications that are already imported, except when the mdate field of the newly encountered XML entry is higher than the mdate already in the database. In that case the importer overwrites all properties except the DBLP_Publication.publication property.
We might add a mode that forces an overwrite of the titles and author names of all existing DBLP_Publication entities (are there more fields?). But then we also need to update the Publication entities that are already converted.
After the symbol fix, entries like these should be fixed:
Submitted by Elmer van Chastelet on 7 October 2013 at 11:56
http://researchr.org/dblp/conf%5Esf-egc%5EsX10
http://webdsl-test.ewi.tudelft.nl/researchr/dblp/conf%5Esf-egc%5EsX10
Issue Log
(1) apparently not all of these records were already converted
(2) even if we overwrite the DBLP_Publication record, we don’t have to re-convert the publication; we can postpone that to a later date. But at least we know the character encodings will be good
(3) when the DBLP_Publication record has not yet been converted, it makes sense to overwrite the record anyway
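The overwrite rules described in the issue and in points (2) and (3) could be sketched roughly as follows. This is a minimal illustration, not the actual DBLPImporter or DBLPEntityPersister code: the OverwritePolicy class, the StoredRecord type, the Action enum, and the forceOverwrite flag are all hypothetical names, and the sketch assumes mdate values are ISO dates (yyyy-MM-dd).

```java
import java.time.LocalDate;

/** Hypothetical sketch of the overwrite decision; not the real importer code. */
final class OverwritePolicy {

    /** What to do with an entry encountered in the DBLP XML. */
    enum Action { INSERT, OVERWRITE, SKIP }

    /** Minimal stand-in for a stored DBLP_Publication (field names are assumptions). */
    static final class StoredRecord {
        final LocalDate mdate;   // mdate of the record already in the database
        final boolean converted; // true if a Publication entity was already created from it

        StoredRecord(LocalDate mdate, boolean converted) {
            this.mdate = mdate;
            this.converted = converted;
        }
    }

    /**
     * Decides how to handle an XML entry.
     * An OVERWRITE is assumed to leave the DBLP_Publication.publication link untouched.
     */
    static Action decide(StoredRecord stored, LocalDate xmlMdate, boolean forceOverwrite) {
        if (stored == null) {
            return Action.INSERT; // not imported before
        }
        if (forceOverwrite) {
            // Points (2) and (3): overwrite whether or not the record was converted.
            return Action.OVERWRITE;
        }
        // Existing behaviour: overwrite only when the XML entry is newer.
        return xmlMdate.isAfter(stored.mdate) ? Action.OVERWRITE : Action.SKIP;
    }

    /** Point (2): an overwritten, already converted record can be re-converted later. */
    static boolean needsLaterReconversion(StoredRecord stored, Action action) {
        return stored != null && stored.converted && action == Action.OVERWRITE;
    }
}
```

In this sketch the converted flag never blocks an overwrite; it only marks which Publication entities would still need a re-conversion at a later date.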
Implemented. Needs to be extended to also check the ‘school’ field.