Saturday, April 21, 2007

Wiktionary quality issues

On the Wiktionary project I run the interwiki bot. The process is simple: when an article with exactly the same spelling exists on another language's Wiktionary, I create an "interwiki" link. This lets you see the information on that other Wiktionary. The process is automated and unattended, and it works on all Wiktionaries.
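
The matching rule can be sketched roughly as follows. This is only an illustration, not the actual bot: the `interwiki_links` function and the `titles_by_wiki` sample data are hypothetical, and a real bot would query each Wiktionary instead of reading an in-memory dictionary. The only thing taken from MediaWiki itself is the `[[lang:title]]` link syntax.

```python
def interwiki_links(title, home, titles_by_wiki):
    """Return interwiki links for `title` as seen from the `home` Wiktionary.

    titles_by_wiki maps a language code to the set of article titles
    that exist on that Wiktionary (hypothetical sample data).
    """
    links = []
    for lang in sorted(titles_by_wiki):
        if lang == home:
            continue  # a page does not link to its own wiki
        # Only an exact, character-for-character title match counts.
        if title in titles_by_wiki[lang]:
            links.append(f"[[{lang}:{title}]]")
    return links


# Hypothetical sample data: which titles exist on which Wiktionary.
titles_by_wiki = {
    "pl": {"dispersion", "dom"},
    "ru": {"dispersion"},
    "vi": {"dispersion", "Dom"},
}

print(interwiki_links("dispersion", "pl", titles_by_wiki))
print(interwiki_links("dom", "pl", titles_by_wiki))
```

In this sample, "dom" gets no link to the Vietnamese Wiktionary because that wiki only has "Dom"; the spelling has to be identical, including case.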

I have received a request from the Polish Wiktionary to stop adding interwiki links to the Russian and the Vietnamese Wiktionary. The reason given is one of quality. On the Russian Wiktionary many of the articles are created by a bot and they do not provide good information. An example is dispersion; there is really nothing in there. The Vietnamese Wiktionary is more problematic because a bot was used to generate declension and conjugation tables of Russian words, and it got them wrong.

The Russian Wiktionary has some 81,000 empty shells and refuses to remove them. The Vietnamese are not willing to remove their incorrect data.

I have been asked to stop including the Russian Wiktionary and the Vietnamese Wiktionary when I run the interwiki process. To be honest, I run the bot as a service and I do not think excluding them is the right thing to do. I think the Vietnamese are wrong not to correct the wrong data that they have. I am less sure about the Russian approach; in essence each entry is a stub. However, creating a Wiktionary in this way is like stamp collecting; you can look at it but there is no information about it.

Given how the process works, I am not sure that I can exclude either the Russian or the Vietnamese Wiktionary. The bot runs explicitly on all Wiktionaries. If I exclude Russian or Vietnamese, I will probably end up removing all references to these projects. They are the third and fourth largest Wiktionaries.
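
Why exclusion would probably remove existing links can be shown with a small sketch. It rests on an assumption that is not confirmed here: that each run rewrites a page's whole set of interwiki links from the wikis it was allowed to check, rather than merging with what is already on the page. The `rebuild_links` function and the sample data are hypothetical.

```python
def rebuild_links(title, existing_langs, titles_by_wiki, excluded=()):
    """Rebuild the interwiki language list for `title` from one bot run.

    Assumption: the bot replaces the full set of links with what it
    confirmed on this run, so unchecked wikis fall out of the list.
    Returns (confirmed, removed) as sorted lists of language codes.
    """
    confirmed = sorted(
        lang
        for lang, titles in titles_by_wiki.items()
        if lang not in excluded and title in titles
    )
    # Links already on the page that this run did not confirm.
    removed = sorted(set(existing_langs) - set(confirmed))
    return confirmed, removed


# Hypothetical sample data.
titles_by_wiki = {
    "en": {"dispersion"},
    "ru": {"dispersion"},
    "vi": {"dispersion"},
}
existing = ["en", "ru", "vi"]

print(rebuild_links("dispersion", existing, titles_by_wiki))
print(rebuild_links("dispersion", existing, titles_by_wiki, excluded=("ru", "vi")))
```

Under that assumption, the second call keeps only the English link and drops the Russian and Vietnamese ones, even though those articles still exist.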

If I do not exclude the Russian and the Vietnamese Wiktionary, the bot may end up being blocked on the Polish Wiktionary. This would also kill off the interwiki process.

From my point of view, using bots to generate content in a Wiktionary only makes sense when there is at least a link to the word in the base language. When the initial creation of stubs is followed by the enrichment of these stubs it is acceptable. For having information that is completely wrong, there is no excuse.

The question is whether there will be a discussion about acceptable practices in Wiktionary. The questions are:
  • Can the Polish Wiktionary make a demand like this?
  • Is having a project that consists mainly of stubs acceptable?
  • Is having incorrect data acceptable?

Thanks,
GerardM

2 comments:

Minh Nguyễn said...

Yes, the Vietnamese Wiktionary is aware of the conjugation and declension issues. We added a warning to the affected entries last week, and a couple of us have discussed what to do. The thing is, most of the time, I’ve been the sole active contributor at the site. (Lately I haven't been contributing due to schoolwork, though.) Since I don't speak Russian, I have no way of verifying the myriad of entries that PiedBot created. (Laurent Bouvier, its maintainer, doesn’t know any Russian.) Now that I know so many of the Russian entries are incorrect, I don't know how to fix the templates except to remove them altogether. Hopefully someone who knows Russian can tell us if there are any correct Russian entries; if not, I’ll delete the templates.

It’s really sad that the Vietnamese Wiktionary has turned into a “stamp collection”, as you put it. I spent countless hours trying to add useful information to many parts of the site. Most of our English-language entries are not in fact “stamps”. But French, Russian, and Norwegian are languages that I know virtually nothing about, so there’s nothing I can do there.

The Vietnamese Wikipedia has a number of Russian speakers, but the Vietnamese Wiktionary has always been devoid of Wikipedia contributors. Developing a dictionary, it seems, is nowhere near as exciting as writing an encyclopedia. Open source and open content rely on the constant watch of a large community, but that’s not what we have. So we have two or three people tasked with gardening the Amazon Rainforest.

GerardM said...

Hoi,
I did not call the Vietnamese Wiktionary a stamp collection. That dubious honour is for the Russian Wiktionary. I am impressed with the huge amount of work done on the vi.wiktionary.

Given that Laurent wrote the Russian entries, it may be that the French Wiktionary has a problem too. When this is the case, the solution will be in collaborating on a fix. :)

The issue that I try to address is that discussions like this have to be visible. Threatening, as the Polish do, is damaging. Threatening to stop the bot is damaging. Not showing that issues get addressed is damaging. Just adding words and hoping that others will come to make them better is damaging.

There has to be a balance and this balance can only be achieved when there is communication and collaboration.

We do not use the Wiktionary-l as much as we should.

Thanks,
GerardM