Introduction
============

This project is a PHP 5.2 to PHP 5.6 parser **written in PHP itself**.

What is this for?
-----------------

A parser is useful for [static analysis][0], code manipulation, and essentially any other
application that deals with code programmatically. A parser constructs an [Abstract Syntax Tree][1]
(AST) of the code and thus allows dealing with it in an abstract and robust way.

There are other ways of processing source code. One that PHP supports natively is using the
token stream generated by [`token_get_all`][2]. The token stream is much lower level than
the AST and thus has different applications: it also allows you to analyze the exact formatting of
a file. On the other hand, the token stream is much harder to deal with for more complex analysis.
For example, an AST abstracts away the fact that in PHP variables can be written as `$foo`, but also
as `$$bar`, `${'foobar'}` or even `${!${''}=barfoo()}`. You don't have to worry about recognizing
all the different syntaxes from a stream of tokens.

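To illustrate how low level the token stream is, the following minimal sketch lexes two of the variable syntaxes mentioned above using only PHP's built-in `token_get_all` and `token_name`. The helper `describeTokens` is a hypothetical name introduced for this example and is not part of this package.

```php
<?php
// Return a readable list of token names for a piece of PHP code.
// token_get_all() yields either a one-character string or an
// array of (token id, text, line number) for each token.
function describeTokens($code)
{
    $names = array();
    foreach (token_get_all($code) as $token) {
        if (is_array($token)) {
            // Skip whitespace so the comparison stays readable.
            if ($token[0] === T_WHITESPACE) {
                continue;
            }
            $names[] = token_name($token[0]);
        } else {
            // Single-character tokens such as ';' or '$' are plain strings.
            $names[] = $token;
        }
    }
    return $names;
}

$simple   = describeTokens('<?php $foo;');
$variable = describeTokens('<?php $$bar;');

echo implode(' ', $simple), "\n";
echo implode(' ', $variable), "\n";
```

The second stream contains an extra `$` token in front of `T_VARIABLE`, so code working directly on tokens has to recognize each such spelling itself, whereas an AST represents both as a variable node.
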
Another question is: why would I want to have a PHP parser *written in PHP*? Well, PHP might not be
a language especially suited for fast parsing, but processing the AST is much easier in PHP than it
would be in other, faster languages like C. Furthermore, the people most likely to want
programmatic PHP code analysis are PHP developers, not C developers.

What can it parse?
------------------

The parser uses a PHP 5.6 compliant grammar, which is backwards compatible with all PHP versions from PHP 5.2
upwards (and maybe older).

As the parser is based on the tokens returned by `token_get_all` (which is only able to lex the PHP
version it runs on), a wrapper for emulating new tokens from 5.3, 5.4, 5.5 and 5.6 is also provided.
This makes it possible to parse PHP 5.6 source code while running on PHP 5.3, for example. This emulation is very hacky and not
perfect, but it should work well on any sane code.

What output does it produce?
----------------------------

The parser produces an [Abstract Syntax Tree][1] (AST), also known as a node tree. What this looks like
can best be seen in an example. The program `<?php echo 'Hi', 'World';` will give you a node tree
roughly looking like this:

```
array(
    0: Stmt_Echo(
        exprs: array(
            0: Scalar_String(
                value: Hi
            )
            1: Scalar_String(
                value: World
            )
        )
    )
)
```

This matches the structure of the code: an echo statement, which takes two strings as expressions,
with the values `Hi` and `World`.

You can also see that the AST does not contain any whitespace information (but most comments are saved).
So using it for formatting analysis is not possible.

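To make the shape of the node tree shown above more concrete, here is a toy sketch that builds the same tree by hand and renders it in the same layout. It does not use this package at all: the `ToyNode` class and the `dumpTree` function are hypothetical stand-ins invented for this example, and the real node classes and dumper may differ.

```php
<?php
// Toy stand-in for a node: a type name plus named sub-nodes.
// ToyNode and dumpTree() are hypothetical, not part of this package.
class ToyNode
{
    public $type;
    public $subNodes;

    public function __construct($type, array $subNodes)
    {
        $this->type = $type;
        $this->subNodes = $subNodes;
    }
}

// Recursively render a node, an array of nodes, or a scalar value.
function dumpTree($value)
{
    if ($value instanceof ToyNode) {
        $label = $value->type;
        $children = $value->subNodes;
    } elseif (is_array($value)) {
        $label = 'array';
        $children = $value;
    } else {
        return (string) $value;
    }

    $out = $label . '(';
    foreach ($children as $key => $child) {
        // Indent nested output by four spaces per level.
        $out .= "\n    " . $key . ': ' . str_replace("\n", "\n    ", dumpTree($child));
    }
    return $out . "\n)";
}

// Hand-built tree for the program: echo 'Hi', 'World';
$tree = array(
    new ToyNode('Stmt_Echo', array(
        'exprs' => array(
            new ToyNode('Scalar_String', array('value' => 'Hi')),
            new ToyNode('Scalar_String', array('value' => 'World')),
        ),
    )),
);

echo dumpTree($tree), "\n";
```

Note that nothing in this structure records spaces, line breaks or quote style, which is why the tree cannot be used for formatting analysis.
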
What else can it do?
--------------------

Apart from the parser itself this package also bundles support for some other, related features:

 * Support for pretty printing, which is the act of converting an AST into PHP code. Please note
   that "pretty printing" does not imply that the output is especially pretty. It's just how it's
   called ;)
 * Support for serializing and unserializing the node tree to XML
 * Support for dumping the node tree in a human readable form (see the section above for an
   example of what the output looks like)
 * Infrastructure for traversing and changing the AST (node traverser and node visitors)
 * A node visitor for resolving namespaced names

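The idea behind traversing and changing a tree with a visitor can be sketched without the library. The example below is a deliberately simplified toy: it models the tree as nested arrays and uses a plain callback as the visitor. The function `traverseTree` and the array layout are assumptions made for illustration, and the package's own traverser and visitor classes are not shown here.

```php
<?php
// Toy traverser: walks nested arrays and lets a visitor callback replace leaves.
// traverseTree() and the array-based tree shape are hypothetical.
function traverseTree(array $node, $visitor)
{
    foreach ($node as $key => $child) {
        if (is_array($child)) {
            // Descend into sub-nodes.
            $node[$key] = traverseTree($child, $visitor);
        } else {
            // Leaves go to the visitor, which returns the (possibly changed) value.
            $node[$key] = call_user_func($visitor, $key, $child);
        }
    }
    return $node;
}

// Array-based stand-in for the tree of: echo 'Hi', 'World';
$tree = array(
    'type'  => 'Stmt_Echo',
    'exprs' => array(
        array('type' => 'Scalar_String', 'value' => 'Hi'),
        array('type' => 'Scalar_String', 'value' => 'World'),
    ),
);

// Visitor that upper-cases every string value and leaves everything else alone.
$upperCaseStrings = function ($key, $value) {
    return $key === 'value' ? strtoupper($value) : $value;
};

$changed = traverseTree($tree, $upperCaseStrings);
echo $changed['exprs'][0]['value'], ' ', $changed['exprs'][1]['value'], "\n";
```

Because the traversal returns a modified copy, the original `$tree` stays untouched; a pretty printer could then turn the changed tree back into code.
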
 [0]: http://en.wikipedia.org/wiki/Static_program_analysis
 [1]: http://en.wikipedia.org/wiki/Abstract_syntax_tree
 [2]: http://php.net/token_get_all