
Tor Application

Abstract:

This project concerns the migration of web site translation from duplicates of the WML (often leading to stale pages) to using the Pootle interface. Doing this will include integrating po4a (a bidirectional HTML/PO converter) and the Pootle translations into the page build process and site commits. Experimental improvements include using auto-translations for missing parts and adding W3C validation as an extra safeguard.

Application:

  1. What project would you like to work on?

    I'm interested in the second item from the ideas list: migration of web site translation from editing WML to using the Pootle interface. My initial proposal was to use Python to write a bidirectional HTML ↔ PO converter, but as Jérémy Bobbio pointed out on the mailing list, po4a already does this work for us (converting between either HTML/PO or WML/PO). This is fortunate since writing converters is a bit tedious and error-prone. As I see it the project has a few interesting points to address:

    1. Do we translate the WML or the HTML we're compiling it into? The po4a converter is able to handle both and in terms of workflow it makes little difference. I'm at a bit of a loss for why we'd translate the WML, and there are a couple of substantial benefits to translating HTML instead:

      • It frees us from being dependent on using WML. If the Tor site moves to another web framework in the future (PHP, Django, Rails, etc) it won't matter as much for translations.

      • It provides all the content the page would display to translators. If we try to convert the WML instead then any parts that aren't HTML (such as echoed Perl strings) would be incredibly difficult to parse. It might be possible to find all the sprintf statements and the variables they use but this opens a parsing nightmare that is best avoided. Parsing the HTML it's compiled into sidesteps this headache.

        See translation-status.wml for an example of the sort of page we couldn't translate via WML. In addition the po4a translator for WML seems to be less mature (converting this file gives buggy output when detecting DIV tags embedded in Perl, something I'll report to the po4a team later).

    2. Website translations differ substantially from Vidalia, Torbutton, or Torcheck in that instead of words or phrases we're dealing with paragraphs (to preserve context). The GNU Gettext format was intended for code so we might find that Pootle has issues presenting especially large blocks of text. Instances with Vidalia show small paragraphs aren't a problem and if we run into issues the frontend is Python so I should be able to make it more accommodating.

    3. The web site isn't going to magically keep its translations up to date. We need to assume that for the majority of languages the web site will usually be out-of-date to some extent or another. The current tactic seems to be an all or nothing approach: if a current translation is available then great! Otherwise serve a stale page.

      Automatic translation tools still aren't truly ready for prime time, but as a stopgap measure they do the trick. When exporting the page (either on a translator commit or a change in the site) we can submit untranslated text to places like Babel Fish or the Google AJAX Language API for their guess. Oddly both seem to be restricted to JavaScript, but with a quick hack, such as the one found here, we should be able to get around this. If you have any suggestions for a better API then I'm all ears!

    4. How do we handle minor changes? Unfortunately when spelling or punctuation corrections are made we can't say "hey, this is a couple characters off so we can keep 99% of the paragraph's translation" because we don't know what is off. There's a world of difference between 'Tor rocks!' and 'Tor sucks!', despite a tiny edit distance. The best solution I've come up with is that if the diff is below a certain threshold keep the translation in Pootle as a suggestion. An advantage of using diffs rather than positioning is that this will handle rearranged paragraphs more gracefully.

    5. What if there's a change to the website and translators apply an update at the same time? This will create a conflict in the SVN server Pootle points at for one of them. For web developers this window is tiny (there would need to be a translation commit in the few seconds we took to merge PO files, so try again). For translators, however, the window of conflict is as large as the translator makes it (the time between commits). As frustrating as it is we can't easily reconcile this and discarding the changes probably beats the alternatives (locks or presenting translators with conflict resolution diffs that might not even match up). As much of a CMS cliché as it is, we should advise translators to commit early and often to minimize this problem. Jacob suggested a cron job that periodically commits which is an interesting idea, but this would only shrink the window and remove their ability to discard work they're displeased with. I'm certainly open to suggestions if anyone has a magic bullet!

    6. Finally, what about errors? We need to assume invalid pages will occasionally be submitted to our scripts. I don't particularly like the idea of relying on po4a erroring out. As shown earlier it isn't infallible, and the results of failure can be more than a little... unsightly. I'd be interested in trying to integrate strict W3C validation into our build process. Looking around I haven't spotted a scripting means of doing this, but the W3C validator allows files or direct input to be checked via POST calls so I suspect with a little work I could come up with a scripted solution. However, this is funky to say the least and a rather low priority.
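
For the W3C validation idea in the last item, a scripted check might look like the sketch below. It assumes the validator's /check endpoint accepts direct input in a POST field named fragment and reports its verdict in an X-W3C-Validator-Status response header; both are assumptions that should be confirmed against the validator's API documentation. The function names are hypothetical.

```python
import urllib.parse
import urllib.request

# Assumed endpoint of the W3C markup validator; verify before relying on it.
VALIDATOR_URL = "https://validator.w3.org/check"


def build_validation_request(html, url=VALIDATOR_URL):
    """Build (but do not send) a POST request submitting html as direct input."""
    # 'fragment' is assumed to be the form field used for direct input.
    body = urllib.parse.urlencode({"fragment": html}).encode("utf-8")
    return urllib.request.Request(url, data=body, method="POST")


def is_valid(headers):
    """Interpret response headers; anything other than 'Valid' counts as a failure."""
    status = headers.get("X-W3C-Validator-Status", "")
    return status.strip().lower() == "valid"


def validate(html):
    """Send the page to the validator and return True if it is reported valid."""
    request = build_validation_request(html)
    with urllib.request.urlopen(request, timeout=30) as response:
        return is_valid(response.headers)
```

In the build this would sit behind a toggle: call validate() on each generated page and abort the commit if it returns False.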

    Workflow:

    All these issues and experimental features are well and good, but how will it work? There are three locations to be concerned with: (A) the web SVN repository (currently with the master English versions of sites, the (usually stale) translated mirrors, and build scripts), (B) the translation server with Pootle, and (C) a temporary location where the site can be built. There are two different scenarios: building the site and committing a change to the web repository. Building is pretty simple and done as follows:

    - make is called in (A), which only has the English WML pages and build scripts. It compiles the WML to HTML and puts it in the temporary location (C).

    - At the end of the make script we copy the most recent site translations from (B) to (C) and run:
    po4a-gettextize -f xhtml -m <file>.html -p <locale>/<file>.po
    This should merge the newest translations with the generated PO files. However, the po4a-gettextize man page warns that this has very little smarts, so I might need to write a helper script for the merge. Still, for the first version this will do. There are no security concerns for (B) since this only requires read access.

    - If we're using auto-translation this is the point where we access a 3rd party API and populate missing entries in the PO files with the strings we collect.

    - Finally we run:
    po4a-translate -f xhtml -m <file>.html -p <locale>/<file>.po -l <locale>/<file>.html
    to generate the translated versions of the web pages at (C).
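
The merge helper and the auto-translation fill from the second and third steps could be sketched roughly as follows. To stay self-contained this models a PO file as a plain dict from msgid to msgstr rather than parsing real PO files, and machine_translate is a hypothetical callback standing in for whichever third party API ends up being used.

```python
def merge_translations(old_entries, new_msgids):
    """Carry translations over by msgid rather than by position.

    old_entries: dict of msgid -> msgstr taken from the translation server (B).
    new_msgids: msgids extracted from the freshly built English page.
    Returns a dict covering every new msgid; untranslated ones map to "".
    """
    return {msgid: old_entries.get(msgid, "") for msgid in new_msgids}


def fill_missing(entries, machine_translate):
    """Fill empty msgstrs using machine_translate (a hypothetical callback).

    Returns the filled entries plus the set of msgids that were
    auto-translated, so they can be flagged for translator review.
    """
    filled = dict(entries)
    auto_translated = set()
    for msgid, msgstr in entries.items():
        if not msgstr:
            filled[msgid] = machine_translate(msgid)
            auto_translated.add(msgid)
    return filled, auto_translated


# Example with made-up entries and a stand-in translator.
old = {"Download Tor": "Tor herunterladen", "Old paragraph": "Alter Absatz"}
merged = merge_translations(old, ["Download Tor", "New paragraph"])
filled, auto = fill_missing(merged, lambda text: "[auto] " + text)
```

Keeping track of which entries were machine translated makes it easy to disable the feature or mark those entries as needing review if translator feedback is negative.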

    Commits are just as simple. We'll need to assume anyone able to check in site changes has write access to Pootle translations.

    - A web developer has just finished his masterful work and is ready to commit changes to (A).

    - We don't want to get stuck in an inconsistent state so first we run error checks. If W3C validation is available then great! Otherwise we'll need to depend on po4a erroring out if a page has problems. It'd probably be best to abort the commit if it'll be problematic.

    - We execute the first and second steps of the build as above to generate the merged PO files in a temporary location (C) (probably '/tmp').

    - Provide our credentials to (B) and upload the merged PO files for changed sites. The basic case is simple enough - check in the new PO files to the Pootle SVN, then tell the Pootle client to refresh (calling svn update). However, as mentioned above this can get pretty ugly for translators if there's a conflict.

    - While uploading changed PO files try to salvage translations for subtly changed entries as suggestions. This will require some work with Pootle but is probably quite doable.
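
Salvaging translations for subtly changed entries might be sketched like this. It swaps in Python's difflib similarity ratio for the "diff below a certain threshold" test; the 0.9 cutoff and the function names are assumptions for illustration, and PO files are again modelled as plain dicts from msgid to msgstr.

```python
import difflib


def similarity(a, b):
    """Return a similarity ratio between 0 and 1 for two strings."""
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()


def salvage_suggestions(old_entries, new_msgids, threshold=0.9):
    """Map each changed msgid to the translation of its closest old msgid.

    Entries whose msgid is unchanged are skipped since their translation
    carries over directly. A suggestion is only offered when the closest
    old msgid scores at or above the threshold and has a translation.
    """
    suggestions = {}
    for msgid in new_msgids:
        if msgid in old_entries:
            continue
        best_msgid, best_score = None, 0.0
        for old_msgid in old_entries:
            score = similarity(old_msgid, msgid)
            if score > best_score:
                best_msgid, best_score = old_msgid, score
        if best_msgid is not None and best_score >= threshold:
            if old_entries[best_msgid]:
                suggestions[msgid] = old_entries[best_msgid]
    return suggestions
```

Because matching is by content rather than position, rearranged paragraphs are handled the same way as edited ones. The result is only ever a suggestion: a small diff can still flip the meaning, so a translator has to approve it.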

    That's it! When translators want to apply their changes we could push new pages but I'm not sure if this would be wise due to possible vandalism. Note that in terms of interactions (A) is blissfully unaware that it's dealing with anything other than English web pages (making web developers happy). (B) never knows what's going on except that it's the front end to translate some PO files. (C) is where we gather all the ingredients to make this work. If successful we have a new web page. If not the page doesn't get updated and Damian gets yelled at.

    Estimated Deliverable Timetable:

    5/23 - Official start of gsoc - I'll be finishing graduate work in mid April so I should be able to start much earlier.

    5/30 (1 week) - Set up a separate publicly available Pootle server and mirror of the Tor SVN repository where I can experiment. Visually check over the PO files generated by po4a for all web pages to make sure there are no obvious problems (if so it's best to contact the po4a team early...).

    6/13 (2 weeks) - Modify the make script to perform the build process mentioned above (none of the fancy bells or whistles like auto-translation, just snag the PO files from the translation server and use po4a to generate translated web pages).

    7/4 (3 weeks) - Use hooks in SVN to perform the basic commit process described above (don't add Pootle suggestions yet, just make sure that when a site's committed the changes are immediately visible in Pootle). I'm expecting some difficulties here after Roger's warning that Pootle might need a helping hand noticing the changes.

    7/11 (1 week) - Write a helper script for merging old translations into new PO files. This is necessary since po4a-gettextize relies on positioning, which means many translations will fail to merge. This is just blending entries from one PO file to another so it shouldn't be hard.

    7/13 - Midpoint evaluations

    7/18 (1 week) - Now for some of the fun stuff. Next I'd like to try making Pootle use old translations as suggestions for slightly changed text. This will likely require some tweaks to Pootle.

    7/25 (1 week) - Integrate automatic translation into PO files during the build process and check with translators to see just how bad this is. My experiences with auto-translation have been decent, but this feature might be disabled by default if feedback is negative...

    8/1 (1 week) - Write a script allowing us to use W3C validation during the build process. If it passes this will be a very good indicator that we won't have a problem committing the page. Of course it will be toggleable in case web developers want to skip this step. Great, now I'm an enabler... :(

    8/10 (bit over 1 week) - Iron out issues merging with the current system. I'm hoping we'll have been merging all along due to intermediate steps being preferable to the current system. I'd also like to get feature requests from translators and look into other projects, possibly Onion-coffee (though the level of interest expressed for this on IRC has not been encouraging). If the project runs over I'd of course be more than happy to continue my work!

  2. Point us to a code sample.

    Most of my past work is in Java, the best being a quizzing application to help in learning Japanese. It's available at: www.atagar.com/kanaQuizzer/

    Other substantial Java projects include:

    • Last year's gsoc project: www.atagar.com/misc/gsocBlog/
    • Keybinding chooser (open source API on SourceForge): http://www.atagar.com/keybindingUtil/
    • 3D and AI related applets for courses: www.atagar.com/applets/

    Python being a scripting language, most of my work with it is... well, short scripts. Here's something I wrote to parse an XML wordlist:
    www.atagar.com/transfer/tmp/vocabParser

    I don't particularly enjoy C/C++, but I've used it on occasion. Some code samples I wrote while TAing a beginning C course can be found at:
    www.atagar.com/misc/cpts122/lab/

  3. Why do you want to work with The Tor Project?

    My interest in Tor is partly technical and partly political. About a year ago I began developing an interest in computer security - a tad late since I'd already picked a thesis, but c'est la vie. Since then I've been picking up what I could from videos of blackhat/Defcon presentations, podcasts like Security Now, and wargames with a newly formed CSG (computer security group). I already had an interest in testing, and as a speaker at Google put it - security is simply taking a bug on a joy ride to see how far it'll go. In short, it's fun stuff.

    I also have strong libertarian beliefs. Be they liberal or conservative one doesn't need to go far to find individuals wishing to impose their views to 'make the world a better place'. Take Australia for instance - parliamentary systems provide a great deal of power to minority parties and now one spouting moral values is leading the country to censor the Internet 'for the children'. Taking for granted that these people will have power from time to time, the best defense for civil liberties is to deny centralized authority the power it needs to restrict personal freedoms. Privacy is a vital part of that.

    Stepping down from the soap box, I began looking into open source anonymity projects a few months ago. With a background in Java, Freenet looked interesting, but it's primarily a tool for countering censorship and it's a tad difficult to feel enthusiastic about a project you can't use yourself. Tor, however, I've had installed for over a year for cases where I need privacy or face especially uptight barriers to the Internet. It's a great project and I'd be thrilled to make it better.

  4. Tell us about your experiences in free software development environments.

    All the substantial projects I've done have been open source, but unfortunately I can't claim much in terms of experience with collaborative development. Last year I was a GSOC participant working on the Sip-Communicator. However, there wasn't any interest in working over IRC and with around a week delay on the mail list it was best to work alone. I didn't particularly enjoy that aspect of the experience and I'm hoping for a more lively community this time!

  5. Will you be working full-time on the project for the summer?

    Yup

  6. Will your project need more work and/or maintenance after the summer ends? What are the chances you will stick around and help out with that and other related projects?

    No software is perfect and I expect this project to be pretty error prone, so yes - it'll probably require maintenance. From what I've seen of the Tor community so far it's very likely I'd like to stick around afterward to become a core developer.

  7. What is your ideal approach to keeping everybody informed of your progress, problems, and questions over the course of the project?

    Last year I used a blog to provide updates of my progress (www.atagar.com/misc/gsocBlog/) which seemed to work well. The IRC channel seems the perfect place for quick questions and I'd use the mail list for anything more substantial (code examples and such).

  8. What school are you attending? What year are you, and what's your major/degree/focus? If you're part of a research group, which one?

    I'm going to Washington State University, about to finish my Masters in Computer Science. My thesis concerns the use of FPS (first person shooter) game engines to simulate real world environments for security threats (in our case we're analyzing the Port of Seattle for the smuggling of radiological devices). The lab's web site is pretty out of date, but it's available at:
    www.eecs.wsu.edu/sgl/

  9. Is there anything else we should know that will make us like your project more?

    I can juggle while singing Gilbert & Sullivan, and keep pet rats. Need I say more?