User:Haus/Hanzo: Difference between revisions

From Wikipedia, the free encyclopedia
Revision as of 16:43, 27 March 2008

Hanzo is an experimental plug-in for jEdit built to partially automate the process of transforming {{Infobox Ship}} into {{Infobox Ship Begin}}. It's so named because The Bride was wreaking havoc with a Hattori Hanzo sword (Kill Bill) on TV while I was searching for a name for a Java class.

It has absolutely no other use in the universe, and the only way someone else could use it would be to basically do a bit-by-bit copy of my hard drive. It depends on about a zillion other packages. The interesting bit, however, is shown below in the section entitled "The interesting bit."

Hanzo has successfully translated about 10 infoboxes to date. I'm at work trying to connect better pipes from Wikipedia to the parser described below.

Feedback

If you're here, you probably saw an edit summary. I have a watch on the discussion page here. Feedback away.

Technical

Hanzo is a lexical analyzer written in Java with JFlex. To a large extent, it lives inside a jEdit environment. A single-purpose program, it just barely functions. It was written in three rather arduous days: about a day to write the lexer (twice, per Stallman's law), half a day to uninstall/reinstall/fix jEdit to work with wmjed, and a day and a half to do stuff like:

  • get communication from WP to the lexer and back
  • preserve UTF-8 characters
  • do automatic page loading
  • do local diffs
  • automate to a 1-click process
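
For the UTF-8 bullet above, one common way to keep non-ASCII characters intact in Java is to wrap the output stream in a `PrintStream` with an explicit encoding instead of relying on the platform default. This is only a sketch of that general technique; the class and method names (`Utf8Output`, `utf8`) are made up for illustration, and Hanzo's actual output path is not shown here.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

/** Illustrative helper (hypothetical name), not part of Hanzo. */
final class Utf8Output {

    /** Wraps a stream so that everything printed to it is encoded as UTF-8. */
    static PrintStream utf8(OutputStream out) {
        try {
            // autoFlush = true so println output appears immediately
            return new PrintStream(out, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every JVM is required to support UTF-8, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        PrintStream out = utf8(System.out);
        // \u00f6 is "o with diaeresis"; escaped so the source file stays ASCII
        out.println("G\u00f6theborg");
    }
}
```

A lexer action could then print through such a stream rather than calling `System.out.println` directly.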

It has one goal in life: to translate Ship-specific infoboxes.

Translating these infoboxes with regular expression search-and-replace seemed nuts to me. I couldn't bring myself to hack out code to do it. On the other hand, a small lexer with a dozen rules and four parse states seems to do it pretty nicely.
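
Part of what makes plain search-and-replace awkward is that templates nest: an infobox value can itself contain `{{...}}`, so finding the end of the box means counting braces rather than looking for the first `}}`. Here is a minimal Java sketch of that counting idea, using a hypothetical helper (`TemplateSpan.findEnd`) that is not part of Hanzo. It assumes the text at `start` begins with `{{`.

```java
/** Illustrative helper (hypothetical name), not part of Hanzo. */
final class TemplateSpan {

    /**
     * Returns the index just past the "}}" that closes the template
     * opening at start, or -1 if the braces never balance.
     */
    static int findEnd(String text, int start) {
        int depth = 0;
        int i = start;
        while (i < text.length() - 1) {
            if (text.startsWith("{{", i)) {
                depth++;            // a template opens (outer or nested)
                i += 2;
            } else if (text.startsWith("}}", i)) {
                depth--;            // a template closes
                i += 2;
                if (depth == 0) {
                    return i;       // the outermost template just closed
                }
            } else {
                i++;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String text = "{{Infobox Ship|name={{HMS|Victory}}|fate=museum}} tail";
        System.out.println(text.substring(0, findEnd(text, 0)));
    }
}
```

The `comment_count` field in the specification plays the same depth-counting role.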

The interesting bit

Here's a (somewhat dated) version of the guts of a flex specification for parsing infoboxes:



...
%%
%{
  private int comment_count = 0;
  private String name;
  private String value;
%} 
%line
%char
%caseless
%unicode
%standalone
//%debug
%state SHIPBOX
%state NAME
%state VALUE


ALPHA=[A-Za-z]
DIGIT=[0-9]
NONNEWLINE_WHITE_SPACE_CHAR=[\ \t\b\012]
WHITE_SPACE_CHAR=[\n\ \t\b\012]
STRING_TEXT=(\\\"|[^\n\"]|\\{WHITE_SPACE_CHAR}+\\)*
COMMENT_TEXT=([^/*\n]|[^*\n]"/"[^*\n]|[^/\n]"*"[^/\n]|"*"[^/\n]|"/"[^*\n])*
LineTerminator = \r|\n|\r\n
InputCharacter = [^\r\n]
WhiteSpace     = {LineTerminator} | [ \t\f]

%% 
<YYINITIAL> {

  "{{Infobox Ship"[|]*{WhiteSpace} |
  "{{Ship table"[|]*{WhiteSpace} {
      yybegin(SHIPBOX);
      comment_count += 1;
      return (1);
  }

  [^\n]*[\n]* {
      System.out.println(yytext());  //printlns replaced to preserve UTF-8
      return (100);
  }
}


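<SHIPBOX> {
  "{{" { comment_count = comment_count + 1; /* nested template opens */ }

  [\|]*"}}" {
      comment_count = comment_count - 1;
      Utility.Assert(comment_count >= 0);
      if (comment_count == 0) {
          // outermost template closed: flush the last pair and emit the box
          ShipBox.printnv(name, value);
          ShipBox.printbox();
          // reset so a later infobox does not reprint this pair
          name = null;
          value = null;
          yybegin(YYINITIAL);
      }
  }

  \| {
      if (name != null && value != null) {
          ShipBox.printnv(name, value);
      }
      yybegin(NAME);
  }

  [^\|] {
      // continuation text; guard against appending to a null value
      value = (value == null) ? yytext() : value + yytext();
  }
}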

<NAME> {
  [^=]*"=" {
      name = new String(yytext());
      yybegin(VALUE);
  }
}

<VALUE> {
  [^\n\r]*[\n\r]+ {
      value = new String(yytext());
      yybegin(SHIPBOX);
  }
}
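
The specification calls `ShipBox.printnv`, `ShipBox.printbox` and `Utility.Assert`, none of which are shown on this page. For anyone who wants to run the spec through JFlex, here are hypothetical stand-ins with the same names. Their behaviour is guessed from how the rules use them (collect name/value pairs, then emit the box) and is an assumption, not the real Hanzo code. In particular, the output format and the `render` helper are invented for illustration and do not perform the actual {{Infobox Ship Begin}} translation.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the ShipBox helper; not the original class. */
final class ShipBox {
    private static final List<String[]> pairs = new ArrayList<>();

    /** Records one name/value pair as matched by the NAME and VALUE rules. */
    static void printnv(String name, String value) {
        if (name == null || value == null) {
            return; // nothing was collected for this field
        }
        // The NAME rule's match ends with "=" and the VALUE rule's match ends
        // with line terminators, so strip both before storing.
        String cleanName = name.endsWith("=")
                ? name.substring(0, name.length() - 1)
                : name;
        pairs.add(new String[] { cleanName.trim(), value.trim() });
    }

    /** Renders the collected pairs, one per line, and clears the buffer. */
    static String render() {
        StringBuilder out = new StringBuilder();
        for (String[] pair : pairs) {
            out.append("| ").append(pair[0]).append(" = ").append(pair[1]).append('\n');
        }
        pairs.clear();
        return out.toString();
    }

    /** Prints the rendered pairs to standard output. */
    static void printbox() {
        System.out.print(render());
    }
}

/** Hypothetical stand-in for the Utility helper; not the original class. */
final class Utility {
    static void Assert(boolean condition) {
        if (!condition) {
            throw new IllegalStateException("lexer invariant violated");
        }
    }
}
```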