Error reading Library of Congress Catalogue

1newcrossbooks
Aug. 6, 2020, 9:00 am

I've just tried adding a book from the Library of Congress but some of the title is truncated, as is the main author's name.

ISBN 9782744904332

LibraryThing reads it as:
Reines de Saba, itinaires textiles au Yen = Reinas de Saba, itinerarios textiles en el Yemen by Arnaud Mauries

But on the LoC it's catalogued as:
Reines de Saba, itinéraires textiles au Yémen = Reinas de Saba, itinerarios textiles en el Yemen / Ar̀naud Maurières

I guess it's something to do with the é and è characters - but why is it taking out the next letter as well? And is it an LoC cataloguing problem or an LT reading problem?

The LoC have catalogued Arnaud Maurières as Ar̀naud Maurières, which is incorrect, so the downgrading of r̀ to r is not so problematic.

2MarthaJeanne
Edited: Aug. 6, 2020, 11:09 am

There is a problem, but you can get a better version from Overcat, which makes me think it is a character set mismatch.

3newcrossbooks
Aug. 6, 2020, 4:05 pm

Strange because the Overcat version is also taken from the Library of Congress - so why is LT now failing to read the data correctly?

4MarthaJeanne
Aug. 6, 2020, 4:26 pm

There are two there, one with the problem, one without. This makes me think that LT has changed the character set it reads LoC with. Why? I can think of various possibilities.

Unintended result of another change somewhere.
LoC may have changed its character set at some point, but old records still use the old set.

The developers will be able to read the underlying code and see what is really happening. The behaviour of dropping not just the non-standard letter but also the next one is fairly widespread.

This is a hard enough problem when a library catalogues works in its own language, so that the character set is consistent from work to work. But the various sets seem to be tuned for specific languages, so between the LoC cataloguing books in 'other' languages and LT trying hard to deal with books in any language it is presented with, things aren't always quite right. This isn't really an LoC problem or an LT problem; it is an international standards problem. Things are a lot better than they were 10 years ago. I'm not sure when my library's emails started being clearly legible in my browser.

5newcrossbooks
Aug. 13, 2020, 6:30 am

I've tried entering two more foreign language books via the Library of Congress catalogue and have encountered the same problem:
ISBN: 3925064214
ISBN: 9782738454737

Both books could be imported correctly from the LoC via Overcat, despite no member being listed as having a copy of either book.

Checking further: if I try to add from the LoC by title and author (Äthiopien, Gerd Gräber), the correct information is provided: Äthiopien : ein Reiseführer by Gerd Gräber
If I search by ISBN (3925064214), I get: hiopien : ein Reisefrer by Gerd Grer

Same with the other book: searching by title and author (Les Ethiopiens, Thibaut Mourgues) gives the correct information, while searching by ISBN (9782738454737) gives corrupted data.

6MarthaJeanne
Edited: Aug. 13, 2020, 6:48 am

Why don't you try using German and French sources for entering German and French books?

7newcrossbooks
Edited: Aug. 13, 2020, 7:32 am

>6 MarthaJeanne: I'm not sure what that's got to do with the recent error in pulling data from the LoC catalogue but if I use French or German sources then all of the extra subject text is in French or German, rather than English.

So, for instance, my copy of "Éleveurs d'Éthiopie" has "Animal owners Ethiopia, Ethnology Ethiopia, Herders Ethiopia, Livestock Ethiopia" listed as its subject matter when downloaded from the LoC.
If I download it from the BnF - Bibliothèque nationale de France the subject matter is listed as: Ã�levage Ã�thiopie, Moeurs et coutumes Ã�thiopie, Pasteurs Ã�thiopie

OK?

8jjwilson61
Aug. 13, 2020, 1:11 pm

You said that you imported from LoC via Overcat, but that's not the same as importing from LoC as it says in the title of this thread. Did you try importing using LoC as the source?

9newcrossbooks
Aug. 13, 2020, 4:17 pm

>8 jjwilson61: I first imported directly from LoC using the ISBN number but the title was corrupted as outlined in >1 newcrossbooks:.

Following >2 MarthaJeanne: I checked using Overcat, which lists where its copies originated from. There was an uncorrupted title copy from LoC available.

I now find I can import uncorrupted titles from LoC if I search for title, author as outlined in >5 newcrossbooks:.
If I import from LoC after searching using ISBN then the titles are corrupted >5 newcrossbooks:.

10newcrossbooks
Aug. 27, 2020, 11:24 am

Here's another one, this time the book is an English title.

If I try to add ISBN 9780710304018 from the Library of Congress Catalogue, then its other author (the editor) is listed as: Khid, Mans

If I try to add by title, The call for democracy in Sudan, then its other author is listed as Khālid, Mansūr (his name is listed as Mansour Khalid on the title page)

Why are LT searches of the LoC catalogue able to pull in the accented characters if searching by title, but not if searching by ISBN?

11bnielsen
Edited: Aug. 27, 2020, 12:11 pm

I see the same problems adding from some Danish libraries. The short answer is that searching is done via a protocol called Z39.50 and LT should probably adjust the Z39.50 profile for Library of Congress. Some of us have been granted the power to add and remove sources, but there are no tools available to adjust the profiles. So someone from LT staff has to fix this.

Hmm, the connection details are given as lx2.loc.gov:210/lcdb utf-8 USMARC utf-8, which means that it uses USMARC as the record syntax and the UTF-8 (Unicode) character set for both input and output.

If I just use a Z39.50 client (i.e. I query Library of Congress directly and don't use LT at all), I get the following:

$ perl z.perl lx2.loc.gov:210/lcdb 9780710304018
Query '9780710304018' found 1 records
=== Record 1 of 1 ===
01216pam a2200313 a 4500
001 1999200
005 20190130091300.0
008 900815s1992 enk 000 0 eng
906 $a 7 $b cbc $c orignew $d 1 $e ocip $f 19 $g y-gencatlg
955 $a pc14 to bc00 08-15-90; bc27 to SCD 08-17-90; fd05 08-20-90; fr26 08-22-90; CIP ver. lh04 to SL 09-28-92
010 $a 90046459
020 $a 0710304013
035 $9 (DLC) 90046459
040 $a DLC $c DLC $d DLC
043 $a f-sj---
050 00 $a DT159.6.S73 $b G37 1992
082 00 $a 322.4/2/09624 $2 20
100 1 $a Garang, John, $d 1945-2005.
245 14 $a The call for democracy in Sudan / $c by John Garang ; edited and introduced by Mansour Khalid.
250 $a 2nd ed., rev. and enl.
260 $a London ; New York : $b Kegan Paul International, $c 1992.
300 $a xxv, 292 p. ; $c 22 cm.
500 $a Rev. ed. of: John Garang speaks. 1987.
651 0 $a South Sudan $x Politics and government.
610 20 $a Sudan People's Liberation Movement.
651 0 $a Sudan $x Politics and government $y 1985-
700 1 $a Khālid, Mansūr, $d 1931-
700 1 $a Garang, John, $d 1945-2005. $t John Garang speaks.
991 $b c-GenColl $h DT159.6.S73 $i G37 1992 $p 00015378955 $t Copy 1 $w BOOKS

As you can see, field 700 lists Khālid, Mansūr.

$ perl z.perl lx2.loc.gov:210/lcdb 'and "call" and "sudan" "democracy"'
Query 'and "call" and "sudan" "democracy"' found 1 records
=== Record 1 of 1 ===
01216pam a2200313 a 4500
001 1999200
005 20190130091300.0
008 900815s1992 enk 000 0 eng
906 $a 7 $b cbc $c orignew $d 1 $e ocip $f 19 $g y-gencatlg
955 $a pc14 to bc00 08-15-90; bc27 to SCD 08-17-90; fd05 08-20-90; fr26 08-22-90; CIP ver. lh04 to SL 09-28-92
010 $a 90046459
020 $a 0710304013
035 $9 (DLC) 90046459
040 $a DLC $c DLC $d DLC
043 $a f-sj---
050 00 $a DT159.6.S73 $b G37 1992
082 00 $a 322.4/2/09624 $2 20
100 1 $a Garang, John, $d 1945-2005.
245 14 $a The call for democracy in Sudan / $c by John Garang ; edited and introduced by Mansour Khalid.
250 $a 2nd ed., rev. and enl.
260 $a London ; New York : $b Kegan Paul International, $c 1992.
300 $a xxv, 292 p. ; $c 22 cm.
500 $a Rev. ed. of: John Garang speaks. 1987.
651 0 $a South Sudan $x Politics and government.
610 20 $a Sudan People's Liberation Movement.
651 0 $a Sudan $x Politics and government $y 1985-
700 1 $a Khālid, Mansūr, $d 1931-
700 1 $a Garang, John, $d 1945-2005. $t John Garang speaks.
991 $b c-GenColl $h DT159.6.S73 $i G37 1992 $p 00015378955 $t Copy 1 $w BOOKS

This is the exact same record, so something weird is going on.

There is a note: On 2020-04-27 at 09:14 ccatalfo wrote: "Updating per request of LoC of 4/23/20 with connection details."

But yeah, something weird that LT staff should look at. Maybe the mapping from utf-8 to whatever LT uses internally is less than perfect.
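
One thing that could be compared between the two code paths is what the record itself claims about its encoding. In MARC 21, leader position 09 is the character coding scheme: a blank means MARC-8 and `a` means UCS/Unicode. The sketch below is only an illustration, assuming the leader is available as the 24-character string at the top of the dump above; the function name is made up, and the flag says what the record declares, which is not necessarily what the bytes on the wire are.

```python
def marc_encoding_from_leader(leader: str) -> str:
    """Interpret MARC 21 leader position 09 (character coding scheme)."""
    if len(leader) < 10:
        raise ValueError("leader too short to contain position 09")
    flag = leader[9]
    if flag == "a":
        return "UCS/Unicode"
    if flag == " ":
        return "MARC-8"
    return f"unknown ({flag!r})"

# Leader copied from the record dump above.
leader = "01216pam a2200313 a 4500"
print(marc_encoding_from_leader(leader))
```

If the ISBN path and the title path ever returned different flags for the same record, that would point at the server side rather than at LT's decoding.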

12newcrossbooks
Mar. 8, 2021, 10:42 pm

This is still happening, and most annoying! I try to use the Library of Congress as my main source of cataloguing information.

Today it was ISBN 2702122612 - Somalie - La guerre perdue de l'humanitaire by Stephen Smith.

I added the title searching by ISBN and all looked fine - until I checked the publisher, who was listed as:
(Paris) : Calmann-Ly, c1993

Searching for the book by title gave the correct publisher:
(Paris) : Calmann-Lévy, c1993

Why is there a discrepancy between searching the LoC catalogue by title, or author, and searching the LoC catalogue by ISBN?

13newcrossbooks
Apr. 26, 2021, 9:46 am

On further checking it looks as though the Library of Congress catalogue uses Unicode combining characters in its catalogue. For instance it writes e acute, é, as hex 65+301 (e+ ́), rather than using character 233 (hex E9), and a with a macron, ā, as hex 61+304 (a+ ̄) rather than using character 257 (hex 101).

Has LibraryThing lost some of its ability to read, and deal with, Unicode combining characters?
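
To make the two representations concrete, here is a minimal Python sketch using only the standard `unicodedata` module. It is an illustration of the Unicode side of the question, not LibraryThing's code; the `describe` helper and the sample strings are made up for the example.

```python
import unicodedata

def describe(text):
    """Return (hex code point, Unicode name) for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

precomposed = "\u00e9"   # é as a single character (decimal 233)
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

print(describe(precomposed))
print(describe(decomposed))

# The two strings render the same but compare unequal until both
# are brought to the same normalization form.
print(precomposed == decomposed)
print(unicodedata.normalize("NFC", decomposed) == precomposed)

# NFD goes the other way: ā (decimal 257) becomes a + combining macron.
print(unicodedata.normalize("NFD", "\u0101") == "a\u0304")
```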

14bnielsen
Edited: May 4, 2021, 9:39 am

>13 newcrossbooks: On writing "character 233" and "character 257" you assume some specific character set. (I'm also thinking that having e acute as e + something and a macron as a + something sounds more like a MARC character set than utf-8.)

My native Z39.50 client returns utf-8 so I guess that is what LibraryThing should expect.

perl z.perl lx2.loc.gov:210/lcdb 2702122612 | grep ^260

260 $a Paris : $b Calmann-Lévy, $c c1993.

I think the only conclusion for the moment is that LibraryThing and LoC don't seem to agree on the character set.

ccatalfo should take a look here I think.

15ccatalfo
May 4, 2021, 10:14 am

>14 bnielsen: Yes, what's curious is that we have it set to utf-8 as the character set. On the connection side it appears to be coming in correctly. The error must be creeping in during the transfer from the raw data to the internal representation.

I'm taking a look.

16ccatalfo
May 4, 2021, 10:57 am

Yes, so it looks like we're getting a UTF-8 error on the record for

9782744904332
Reines de Saba, itinéraires textiles au Yémen = Reinas de Saba, itinerarios textiles en el Yemen

UnicodeDecodeError: 'utf8' codec can't decode byte 0xe1 in position 771: invalid continuation byte

Looking at that.
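
A byte like 0xe1 is what one would expect to see if the record arrived in MARC-8 rather than UTF-8: in MARC-8 (ANSEL) a diacritic is a single byte placed before the letter it modifies (0xE1 grave, 0xE2 acute, 0xE5 macron, 0xE8 umlaut). To a UTF-8 decoder those bytes look like the start of a three-byte sequence, so a lenient decoder that discards the whole "sequence" would lose the accent and the two letters after it. The sketch below is a guess at the mechanism, not LT's code: `naive_utf8_decode` is hypothetical and the sample bytes are hand-built, but the output does match the corrupted title in >1.

```python
def naive_utf8_decode(data: bytes) -> str:
    """Hypothetical lenient decoder: trusts the lead byte's declared length
    and silently drops any chunk that is not valid UTF-8."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            length = 1
        elif 0xC0 <= b < 0xE0:
            length = 2
        elif 0xE0 <= b < 0xF0:
            length = 3
        elif 0xF0 <= b < 0xF8:
            length = 4
        else:
            length = 1  # stray continuation byte or invalid lead byte
        chunk = data[i:i + length]
        try:
            out.append(chunk.decode("utf-8"))
        except UnicodeDecodeError:
            pass  # whole chunk is dropped, innocent ASCII letters included
        i += length
    return "".join(out)

# Hand-built MARC-8 style bytes: 0xE2 (acute) precedes the letter it modifies.
marc8_title = b"itin\xe2eraires textiles au Y\xe2emen"

try:
    marc8_title.decode("utf-8")  # strict decoding raises, as in the traceback
except UnicodeDecodeError as err:
    print(err)

print(naive_utf8_decode(marc8_title))  # itinaires textiles au Yen
```

For comparison, Python's built-in `errors="ignore"` drops only the offending byte and keeps the following letters, so if this is the cause, the two-letter loss would have to happen in some other decoding step.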

17newcrossbooks
May 4, 2021, 6:32 pm

>14 bnielsen: >16 ccatalfo:
When I said "The Library of Congress catalogue uses Unicode combining characters in its catalogue" I should really have said "LibraryThing uses Unicode combining characters when catalogue entries are taken from the Library of Congress catalogue".
And that's when searching by title; when searching the LoC catalogue by ISBN, the data is corrupted and the accented characters are stripped out.

It didn't use to do this - my entries prior to May 2020 don't have the same problem (The problem was reported in August 2020).

My latest problem is that title searches do not find e + acute when looking for e acute so I end up buying duplicate copies of titles...

Anyway, thanks for looking at it again.
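
The missed title matches are what one would expect if the search compares strings code point by code point. A minimal sketch of the usual remedy, normalizing both the query and the stored title to the same form before comparing; this is not LT's search code, and `title_matches` and the sample strings are made up for illustration.

```python
import unicodedata

def title_matches(query: str, stored_title: str) -> bool:
    """Case-insensitive substring match after normalizing both sides to NFC."""
    q = unicodedata.normalize("NFC", query).casefold()
    t = unicodedata.normalize("NFC", stored_title).casefold()
    return q in t

# Stored title uses e + combining acute; the query uses precomposed é.
stored = "Reines de Saba, itine\u0301raires textiles au Ye\u0301men"
query = "Y\u00e9men"

print(query in stored)               # plain comparison
print(title_matches(query, stored))  # normalized comparison
```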

18bnielsen
Edited: May 6, 2021, 4:15 am

>16 ccatalfo: and >17 newcrossbooks: Nice! I don't get any error retrieving the "Reines de Saba" record, but that might be thanks to the perl module I use to fetch it, so I'm not even sure where the problem is.

For the record, I much prefer to avoid "Unicode combining characters" altogether. They create all sorts of weird behaviour in editors. E.g. if I want to delete the word Yémen in this message, it takes two backspaces to delete é because the first backspace just deletes the accent.

>17 newcrossbooks: "I end up buying duplicate copies of titles" Now that's an acute problem! (Sorry about the bad pun.)

I already have a script looking for some other weirdnesses in my tsv export file (like non-breaking spaces) and non-unicode text in Subjects, so I'll see if I have any of the "Unicode combining characters" hiding out as well. I "think" that I basically have a list of allowed characters in my export file and just report any not on the list, so it should catch the e+acute in Yémen.

Hmm, that didn't work out so well since I actually allow it in words like Кышты́м. As far as I know Russians don't use accented characters, but others (like me) do if they want to indicate the pronunciation. Since Russians don't need ы́, there is no precomposed character for it in Unicode, but we can use the combining character to paint a ' on top of the ы.

Hmm, this actually found a book in my catalogue with é where I'd prefer é :-)
The title is "The adventure of the Christmas pudding : and a selection of entrées" with British Library as source.
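
A check along these lines could be sketched as follows. This is not bnielsen's script: the tab-separated layout, the `find_combining` name and the sample rows are assumptions. It flags any cell containing a combining mark and reports whether NFC normalization can fold it into a precomposed character (as with é) or not (as with ы́).

```python
import csv
import io
import unicodedata

def find_combining(tsv_text: str):
    """Yield (row number, cell, composable) for cells with combining marks."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    for row_no, row in enumerate(reader, start=1):
        for cell in row:
            if any(unicodedata.combining(ch) for ch in cell):
                nfc = unicodedata.normalize("NFC", cell)
                # If no combining mark survives NFC, a precomposed form exists.
                composable = not any(unicodedata.combining(ch) for ch in nfc)
                yield row_no, cell, composable

# Made-up sample export: row 1 has e + acute, row 2 has ы + acute.
sample = "1\tentre\u0301es\n2\tКышты\u0301м\n3\tplain title\n"

for row_no, cell, composable in find_combining(sample):
    verdict = "can be precomposed" if composable else "needs the combining mark"
    print(row_no, cell, verdict)
```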