User:LennardHofmann/GSoC 2022/Report 1

From Wikimedia Commons, the free media repository

It has now been two weeks since Google announced the accepted contributors for Google Summer of Code 2022. When I saw that my proposal was accepted, I was excited to jump right into the coding. In this blog post, I will share my current progress on rewriting Template:Wikidata Infobox in Lua and how I got there.

Why does the infobox need to be rewritten?

The infobox currently consists of over 600 dense lines of wikitext and over 1000 calls to Lua modules. Not only does this make the code hard to read, it also slows down Wikimedia's servers: previewing a category page on Commons often takes more than four seconds. And for really big Wikidata items like COVID-19 pandemic in Colombia (Q87483673), the infobox runs out of Lua memory, which produces script errors.

My mentor Mike Peel and I believe these problems can be fixed by fully rewriting the infobox in Lua. This will give us a free performance boost. For example, to determine whether the connected Wikidata item has a given property, we can perform a hash table lookup instead of passing around a string and performing a linear search on it every time.
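To make the difference concrete, here is a minimal sketch in plain Lua. The property list, the `|` delimiter, and both helper names are assumptions made up for illustration; this is not the infobox's actual code.

```lua
-- Hypothetical list of property IDs present on the connected item.
local pids = { "P18", "P31", "P569", "P570" }

-- Approach 1: keep the IDs in one delimited string and search it.
local pidString = "|" .. table.concat(pids, "|") .. "|"

local function hasPropertyInString(pid)
    -- Plain-text find (4th argument true) scans the string: O(n) per check.
    return pidString:find("|" .. pid .. "|", 1, true) ~= nil
end

-- Approach 2: build a set once, then every check is a single table lookup.
local pidSet = {}
for _, pid in ipairs(pids) do
    pidSet[pid] = true
end

local function hasPropertyInSet(pid)
    return pidSet[pid] == true
end

print(hasPropertyInString("P31"), hasPropertyInSet("P31"))
print(hasPropertyInString("P999"), hasPropertyInSet("P999"))
```

The string version also needs the delimiters on both sides so that a lookup for P3 does not accidentally match P31; the table version has no such pitfall.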

My journey

Because the code editor on Commons is not that powerful, I spent some time configuring my text editor. After installing lua-language-server and adding a keybinding to quickly copy the contents of the file I am working on into my clipboard, I was ready to go.

During a public online conference about the Wikidata Infobox, I heard about Module:Databox, an infobox with a similar purpose but far less complexity than the Wikidata Infobox. Since Module:Databox is very fast, I tried to copy its approach, but the Wikidata Infobox was still much slower. After a lot of trial and error, I found out why: Module:Databox only tries to fetch values for those properties that are actually used by the connected Wikidata item. So I added if entity.claims[pid] to avoid making unnecessary Wikidata requests and finally managed to match Databox's performance.
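The guard can be sketched as follows. This is a simplified stand-in that runs in plain Lua: the `entity` table only mimics the shape of a Wikibase entity (claims keyed by property ID, with plain strings instead of real statements), and `fetchValue` is a hypothetical placeholder for the expensive lookup.

```lua
-- Mock entity: claims are keyed by property ID (values simplified to strings).
local entity = {
    id = "Q42",
    claims = {
        P31 = { "human" },
        P569 = { "1952-03-11" },
    },
}

local fetchCount = 0

-- Hypothetical stand-in for an expensive value lookup.
local function fetchValue(ent, pid)
    fetchCount = fetchCount + 1
    return ent.claims[pid][1]
end

-- Properties the infobox would like to show; only two exist on the item.
local wanted = { "P31", "P569", "P570", "P18" }

local rows = {}
for _, pid in ipairs(wanted) do
    -- Skip properties the item does not have, so no lookup is wasted on them.
    if entity.claims[pid] then
        rows[#rows + 1] = pid .. ": " .. fetchValue(entity, pid)
    end
end

print(table.concat(rows, "\n"))
print("lookups performed: " .. fetchCount)
```

With the guard in place, the number of lookups depends on how many properties the item actually has, not on how many the infobox knows about.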

At this point I still thought we could drop the dependency on WikidataIB. I wrote a simple function, getAudioByLang, to replace WikidataIB's more general function getValueByLang. As a benchmark, I ran both functions 99999 times, and my function was more than twice as fast! However, the performance difference becomes unmeasurable if you run both functions only 1000 times, because the time it takes to fetch data from Wikidata fluctuates a lot (at least I think this is the source of the noise). In practice, the function is never called more than three times.
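A benchmark of this kind can be sketched with `os.clock`. The two workloads below are hypothetical placeholders, not getAudioByLang or getValueByLang, and the timings will differ from run to run, so no expected output is given.

```lua
-- Run fn the given number of times and return the elapsed CPU time in seconds.
local function benchmark(fn, iterations)
    local start = os.clock()
    for _ = 1, iterations do
        fn()
    end
    return os.clock() - start
end

-- Hypothetical "general" workload: does more work on every call.
local function general()
    local parts = {}
    for i = 1, 20 do
        parts[i] = tostring(i)
    end
    return table.concat(parts, ",")
end

-- Hypothetical "specialized" workload: does only what is needed.
local function specialized()
    return "1,2,3"
end

local generalTime = benchmark(general, 99999)
local specializedTime = benchmark(specialized, 99999)

print(string.format("general:     %.4f s", generalTime))
print(string.format("specialized: %.4f s", specializedTime))
```

Repeating the call many times is what makes the per-call difference visible; with few iterations, any fixed cost that varies between runs can easily drown it out.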

Did I waste my time by writing getAudioByLang? No, I learned how WikidataIB works and got valuable insights from the benchmark. I learned that the small performance gain obtained by getting rid of WikidataIB is not worth the additional complexity of having to render dates, quantities, and coordinates in over 200 languages. Still, there are some easy performance optimizations that don't compromise readability; you can find the results of my research here.

Recently I have looked into resyncing WikidataIB on Wikipedia and Commons, because these two versions of the module have developed somewhat separately since January 2021. While writing this blog post I also skimmed the source code of mw.wikibase (for reasons I don't remember) and noticed that it creates deep clones of Lua tables. That explains why I keep seeing recursiveClone at the top of the Lua profiler! I had an idea to improve performance by not cloning tables and wrote a comment on Phabricator, in which I tell the story of how this idea turned into a proof-of-concept implementation.
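To illustrate why deep cloning is costly, here is a generic recursive clone in plain Lua. It is not the mw.wikibase code and not my proof of concept; `deepClone`, the counter, and the tiny `item` table are assumptions for illustration only, and the sketch does not handle cyclic tables.

```lua
local tablesCopied = 0

-- Generic recursive clone: allocates a new table for every nested table.
-- Assumes the input contains no cycles.
local function deepClone(value)
    if type(value) ~= "table" then
        return value
    end
    tablesCopied = tablesCopied + 1
    local copy = {}
    for key, inner in pairs(value) do
        copy[key] = deepClone(inner)
    end
    return copy
end

-- Tiny hypothetical item; real entities contain far more nested tables.
local item = {
    labels = { en = "example", de = "Beispiel" },
    claims = { P31 = { { value = "Q5" } } },
}

local clone = deepClone(item)

print("tables copied: " .. tablesCopied)
print("shares inner tables: " .. tostring(clone.labels == item.labels))
```

Every nested table is visited and reallocated, so the cost grows with the size of the entity, even if the caller only reads a single field afterwards.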

Results

So far I have converted 60 out of the 350 property requests to Lua. The properties for humans are already fully ported; try it out by going to this page and clicking the button. The new infobox is not faster yet, but I'm optimistic this will change once it is fully rewritten in Lua.

Next post: Report 2