# Migrating From Nokogiri

If you're parsing XML/HTML documents using Ruby, chances are you're using
[Nokogiri][nokogiri] for this. This guide aims to make it easier to switch from
Nokogiri to Oga.

## Parsing Documents

In Nokogiri there are two de facto ways of parsing documents:

* `Nokogiri::XML()` for XML documents
* `Nokogiri::HTML()` for HTML documents

For example, to parse an XML document you'd use the following:

    Nokogiri::XML('<root>foo</root>')

Oga instead uses the following two methods:

* `Oga.parse_xml`
* `Oga.parse_html`

Their usage is similar:

    Oga.parse_xml('<root>foo</root>')

Nokogiri returns two distinct document classes depending on which method was
used to parse a document:

* `Nokogiri::XML::Document` for XML documents
* `Nokogiri::HTML::Document` for HTML documents

Oga, on the other hand, always returns an `Oga::XML::Document` instance. Oga
currently makes no distinction between XML and HTML documents other than at the
lexer level. This might change in the future if deemed required.

## Querying Documents

Nokogiri allows one to query documents/elements using both XPath expressions and
CSS selectors. In Nokogiri one queries a document as follows:

    document = Nokogiri::XML('<root><foo>bar</foo></root>')

    document.xpath('root/foo')
    document.css('root foo')

Oga currently only supports XPath expressions; CSS selectors will be added in
the near future. Querying documents works similarly to Nokogiri:

    document = Oga.parse_xml('<root><foo>bar</foo></root>')

    document.xpath('root/foo')

Nokogiri also allows you to query a document and return the first match, as
opposed to an entire node set, using the method `at`. In Nokogiri this method
can be used for both XPath expressions and CSS selectors. Oga has no such
method; instead it provides the following more dedicated methods:

* `at_xpath`: returns the first node matching an XPath expression

For example:

    document = Oga.parse_xml('<root><foo>bar</foo></root>')

    document.at_xpath('root/foo')

By using a dedicated method Oga doesn't have to try and guess what type of
expression you're using (XPath or CSS), so it can never mistake one for the
other.

## Retrieving Attribute Values

Nokogiri provides two methods for retrieving attributes and attribute values:

* `Nokogiri::XML::Node#attribute`
* `Nokogiri::XML::Node#attr`

The first method returns an instance of `Nokogiri::XML::Attr`; the second
method returns the attribute value as a `String`. This behaviour, especially
due to the names used, is extremely confusing.

Oga on the other hand provides the following two methods:

* `Oga::XML::Element#attribute` (aliased as `attr`)
* `Oga::XML::Element#get`

The first method returns an `Oga::XML::Attribute` instance; the second returns
the attribute value as a `String`. I deliberately chose `get` for getting a
value to remove the confusion of `attribute` vs `attr`. This also allows for
`attr` to simply be an alias of `attribute`.

As an example, this is how you'd get the value of a `class` attribute in
Nokogiri:

    document = Nokogiri::XML('<root class="foo"></root>')

    document.xpath('root').first.attr('class') # => "foo"

This is how you'd get the same value in Oga:

    document = Oga.parse_xml('<root class="foo"></root>')

    document.xpath('root').first.get('class') # => "foo"

## Modifying Documents

Modifying documents in Nokogiri is not as convenient as it perhaps could be. For
example, adding an element to a document is done as follows:

    document = Nokogiri::XML('<root></root>')
    root     = document.xpath('root').first

    name = Nokogiri::XML::Element.new('name', document)

    name.inner_html = 'Alice'

    root.add_child(name)

The annoying part here is that we have to pass a document into an Element's
constructor. As such, you cannot create elements without first creating a
document. Another thing is that Nokogiri has no method called `inner_text=`;
instead you have to use the method `inner_html=`.

In Oga you'd use the following:

    document = Oga.parse_xml('<root></root>')
    root     = document.xpath('root').first

    name = Oga::XML::Element.new(:name => 'name')

    name.inner_text = 'Alice'

    root.children << name

Adding attributes works similarly in both Nokogiri and Oga. For Nokogiri you'd
use the following:

    element.set_attribute('class', 'foo')

Alternatively you can do the following:

    element['class'] = 'foo'

In Oga you'd instead use the method `set`:

    element.set('class', 'foo')

This method automatically creates an attribute if it doesn't exist, including
the namespace if specified:

    element.set('foo:class', 'foo')

## Serializing Documents

Serializing the document back to XML works the same in both libraries: simply
call `to_xml` on a document or element and you'll get a String back containing
the XML. There is one key difference here though: Nokogiri does not return the
exact same output as it was given as input; for example, it adds an XML
declaration:

    Nokogiri::XML('<root></root>').to_xml # => "<?xml version=\"1.0\"?>\n<root/>\n"

Oga on the other hand does not do this:

    Oga.parse_xml('<root></root>').to_xml # => "<root></root>"

Oga also doesn't insert random newlines or other possibly unexpected (or
unwanted) data.

[nokogiri]: http://nokogiri.org/