User:Haus/Hanzo: Difference between revisions

Revision as of 16:43, 27 March 2008

Hanzo is an experimental plug-in for jEdit built to partially automate the process of transforming {{Infobox Ship}} into {{Infobox Ship Begin}}. It's so named because The Bride was wreaking havoc with a Hattori Hanzo (Kill Bill) sword on TV while I was searching for a name for a Java class.

It has absolutely no other use in the universe, and the only way someone else could use it would be to basically do a bit-by-bit copy of my hard drive. It depends on about a zillion other packages. The interesting bit, however, is shown below in the section entitled "The interesting bit."

Hanzo has successfully translated about 10 infoboxes to date. I'm at work trying to connect better pipes from Wikipedia to the parser described below.

Feedback

If you're here, you probably saw an edit summary. I have a watch on the discussion page here. Feedback away.

Technical

Hanzo is a lexical analyzer written in Java with JFlex. To a large extent, it lives inside a jEdit environment. A single-purpose program, it just barely functions. It was written in three rather arduous days: about a day to write the lexer (twice, per Stallman's law), half a day to uninstall/reinstall/fix jEdit to work with wmjed, and a day and a half to do stuff like:

get communication from WP to the lexer and back
preserve UTF-8 characters
do automatic page loading
do local diffs
automate to a 1-click process
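
As an illustration of the "preserve UTF-8 characters" item, here is a minimal sketch of one common way to keep Java output in UTF-8 regardless of the platform's default encoding. It is not Hanzo's actual code; the class name Utf8OutputSketch, the helper utf8() and the sample text are made up for this example.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

// Hypothetical helper, not part of Hanzo: forces UTF-8 on an output stream.
class Utf8OutputSketch {

    /** Wraps a stream so everything printed to it is encoded as UTF-8. */
    static PrintStream utf8(OutputStream out) {
        try {
            // autoflush = true, explicit encoding instead of the platform default
            return new PrintStream(out, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every JVM is required to support UTF-8, so this should not happen.
            throw new IllegalStateException("UTF-8 not available", e);
        }
    }

    public static void main(String[] args) {
        PrintStream out = utf8(System.out);
        // \u00f8 is the letter o with stroke; the sample value is invented.
        out.println("Ship name = Troms\u00f8");
    }
}
```

Printing through a wrapper like this, rather than through a bare System.out, is one way to stop non-ASCII ship names from being mangled on their way back to the editor.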

It has one goal in life: to translate Ship-specific infoboxes.

Translating these infoboxes with regular expression search-and-replace seemed nuts to me. I couldn't bring myself to hack out code to do it. On the other hand, a small lexer with a dozen rules and four parse states seems to do it pretty nicely.

The interesting bit

Here's a (somewhat dated) version of the guts of the JFlex specification for parsing infoboxes:


...
%%
%{
  private int comment_count = 0;
  private String name;
  private String value;
%} 
%line
%char
%caseless
%unicode
%standalone
//%debug
%state SHIPBOX
%state NAME
%state VALUE


ALPHA=[A-Za-z]
DIGIT=[0-9]
NONNEWLINE_WHITE_SPACE_CHAR=[\ \t\b\012]
WHITE_SPACE_CHAR=[\n\ \t\b\012]
STRING_TEXT=(\\\"|[^\n\"]|\\{WHITE_SPACE_CHAR}+\\)*
COMMENT_TEXT=([^/*\n]|[^*\n]"/"[^*\n]|[^/\n]"*"[^/\n]|"*"[^/\n]|"/"[^*\n])*
LineTerminator = \r|\n|\r\n
InputCharacter = [^\r\n]
WhiteSpace     = {LineTerminator} | [ \t\f]

%% 
<YYINITIAL> {

"{{Infobox Ship"[|]*{WhiteSpace} |
"{{Ship table"[|]*{WhiteSpace} {
   yybegin(SHIPBOX);
   comment_count +=1;
   return (1); 
   }

[^\n]*[\n]* {
    System.out.println(yytext());  //printlns replaced to preserve UTF-8
    return (100);
   }
}


<SHIPBOX> {
"{{" { comment_count = comment_count + 1; }
[\|]*"}}" {
    comment_count = comment_count - 1;
    Utility.Assert(comment_count >= 0);
    if (comment_count == 0) {
        ShipBox.printnv(name,value);
        ShipBox.printbox();
        yybegin(YYINITIAL);
    }
}

\|    {
	if(name!=null && value!=null){
		ShipBox.printnv(name,value);
	} 
	yybegin(NAME);
}

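[^\|] {
    // Guard against value still being null, which would prepend the text "null".
    value = (value == null ? "" : value) + yytext();
}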
}

<NAME> {
[^=]*"=" {
	name = new String(yytext());
	yybegin(VALUE);
      }
}

<VALUE> {
[^\n\r]*[\n\r]+ {
	value = new String(yytext());
	yybegin(SHIPBOX);
      }
}
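
For readers without JFlex handy, the four states above can be approximated in plain Java. This is a rough sketch of the same idea, not what JFlex generates and not Hanzo's code: the class ShipBoxSketch, its parse() method and the sample infobox fields are all invented for illustration. Like the specification, it assumes each field sits on its own line and the closing braces are on a line of their own.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical hand-rolled version of the INITIAL/SHIPBOX/NAME/VALUE state machine.
class ShipBoxSketch {
    enum State { INITIAL, SHIPBOX, NAME, VALUE }

    private static final String OPENER = "{{Infobox Ship";

    /** Returns the name/value pairs of the first ship infobox found in text. */
    static Map<String, String> parse(String text) {
        Map<String, String> fields = new LinkedHashMap<>();
        State state = State.INITIAL;
        int depth = 0;                       // plays the role of comment_count
        String name = null;
        StringBuilder value = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            switch (state) {
                case INITIAL:                // skip everything until the opener
                    if (text.regionMatches(true, i, OPENER, 0, OPENER.length())) {
                        depth = 1;
                        state = State.SHIPBOX;
                        i += OPENER.length();
                    } else {
                        i++;
                    }
                    break;
                case SHIPBOX:
                    if (text.startsWith("{{", i)) {          // nested template opens
                        depth++;
                        value.append("{{");
                        i += 2;
                    } else if (text.startsWith("}}", i)) {   // template closes
                        depth--;
                        i += 2;
                        if (depth == 0) {                    // end of the infobox
                            flush(fields, name, value);
                            return fields;
                        }
                        value.append("}}");
                    } else if (c == '|') {                   // a new field starts
                        flush(fields, name, value);
                        name = null;
                        value.setLength(0);
                        state = State.NAME;
                        i++;
                    } else {                                 // continuation text
                        value.append(c);
                        i++;
                    }
                    break;
                case NAME: {                 // everything up to '=' is the name
                    int eq = text.indexOf('=', i);
                    if (eq < 0) {
                        return fields;       // malformed field: stop here
                    }
                    name = text.substring(i, eq).trim();
                    i = eq + 1;
                    state = State.VALUE;
                    break;
                }
                case VALUE: {                // the rest of the line is the value
                    int eol = i;
                    while (eol < text.length() && !isNewline(text.charAt(eol))) {
                        eol++;
                    }
                    value.append(text, i, eol).append('\n');
                    while (eol < text.length() && isNewline(text.charAt(eol))) {
                        eol++;
                    }
                    i = eol;
                    state = State.SHIPBOX;
                    break;
                }
            }
        }
        return fields;                       // unterminated box: completed pairs only
    }

    private static boolean isNewline(char c) {
        return c == '\n' || c == '\r';
    }

    private static void flush(Map<String, String> fields, String name, StringBuilder value) {
        if (name != null) {
            fields.put(name, value.toString().trim());
        }
    }

    public static void main(String[] args) {
        // Invented sample input; the field names are placeholders.
        String page = "Intro\n{{Infobox Ship\n|Ship name= HMS Example\n"
                    + "|Ship flag= {{flagicon|UK}}\n|Ship class= Frigate\n}}\nTrailing\n";
        System.out.println(parse(page));
    }
}
```

Because the VALUE state swallows a whole line at a time, a nested template such as {{flagicon|UK}} on the same line as its field never reaches the brace-counting logic, which is the same trick the specification relies on.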