Parsing HTML Pages using HTML::Parser
Written by Philip L Yuson
Who is this for
This article is for those who want to write Perl scripts to remove tags from an HTML file.
What you need to know
You need to know:
- Basic Perl scripting
- HTML tags
Introduction
There are times when you will need to read an HTML file and extract a field from that file. Perl has a module called HTML::Parser that simplifies this task.
HTML::Parser
This module reads an HTML file and lets you define actions to run when it encounters a start tag, the text between tags, or an end tag. You do this by defining subroutines that are executed when these events occur. The HTML::Parser documentation lists all the events that can happen during processing. This article covers only the start, text and end events.
You register the subroutine that handles an event in this format:

```perl
event_h => [\&handler, 'argspec']
```

- event_h is the name of the event followed by _h, for example start_h, text_h or end_h.
- handler is the name of the subroutine.
- argspec is a string naming the values to be passed to the subroutine. To pass the tag name to the subroutine, you specify the literal 'tag'. For an end tag, this value is prefixed with a slash, as in /title.

This will be clearer in the sample code.
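As a small sketch of the handler format described above, the following passes two values to a start handler instead of one. The names 'tagname' and 'attr' come from the HTML::Parser documentation; the subroutine name on_start, the array @seen and the sample link are made up for illustration.

```perl
use strict;
use warnings;
use HTML::Parser ();

# One record per start tag: its name plus a hash of its attributes.
my @seen;

sub on_start {
    # The arguments arrive in the order listed in the 'tagname, attr' string.
    my ($tagname, $attr) = @_;
    push @seen, { tag => $tagname, attr => $attr };
}

my $p = HTML::Parser->new(
    start_h => [ \&on_start, 'tagname, attr' ],
);

$p->parse('<a href="index.html" class="nav">Home</a>');
$p->eof;    # signal end of input

for my $rec (@seen) {
    print "Tag: $rec->{tag}\n";
    print "  $_ = $rec->{attr}{$_}\n" for sort keys %{ $rec->{attr} };
}
```

Listing several names in the string is how a single handler receives several values at once; here the attributes arrive as a hash reference keyed by attribute name.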
Sample code
The first thing to do is create an instance of the parser. When you create the instance, you can specify which subroutine handles processing for each event.
```perl
# Define module to use
use strict;
use warnings;
use HTML::Parser ();

# Create instance
my $p = HTML::Parser->new(
    start_h => [\&start_rtn, 'tag'],
    text_h  => [\&text_rtn,  'text'],
    end_h   => [\&end_rtn,   'tag'],
);

# Start parsing the following HTML string
$p->parse('
<HTML>
<HEAD>
<TITLE>Sample HTML Page</TITLE>
</HEAD>
<BODY>
Hello World
This is a test
</BODY>
</HTML>');

# Signal end of input so that any buffered text is flushed
$p->eof;

sub start_rtn {
    # Execute when start tag is encountered
    foreach (@_) {
        print "===\nStart: $_\n";
    }
}

sub text_rtn {
    # Execute when text is encountered
    foreach (@_) {
        print "\tText: $_\n";
    }
}

sub end_rtn {
    # Execute when the end tag is encountered
    foreach (@_) {
        print "End: $_\n";
    }
}
```
Result
Save this and run it. The result will be something like this:
```
        Text:

===
Start: html
        Text:

===
Start: head
        Text:

===
Start: title
        Text: Sample HTML Page
End: /title
        Text:

End: /head
        Text:

===
Start: body
        Text:
Hello World
This is a test

End: /body
        Text:

End: /html
```
Notice that the text subroutine is executed for every piece of text, including the whitespace between tags, which is why several Text: lines appear empty. Likewise, every time a start tag is encountered, start_rtn is executed.
What use is this then?
You can write routines that act only when a specific tag is encountered. You can also write routines that act on text only if it is inside a specific tag.
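One way to act on text only inside a specific tag is to set a flag in the start handler and clear it in the end handler. The sketch below uses this to pull out the page title. It is an illustration rather than the only approach; the variables $in_title and $title are hypothetical names, and 'tagname' is the documented argspec that gives the tag name without a leading slash.

```perl
use strict;
use warnings;
use HTML::Parser ();

my $in_title = 0;     # true while the parser is between <title> and </title>
my $title    = '';    # collects the text found inside the title

sub start_rtn {
    my ($tagname) = @_;
    $in_title = 1 if $tagname eq 'title';
}

sub text_rtn {
    my ($text) = @_;
    $title .= $text if $in_title;    # ignore text outside the title
}

sub end_rtn {
    my ($tagname) = @_;
    $in_title = 0 if $tagname eq 'title';
}

my $p = HTML::Parser->new(
    start_h => [ \&start_rtn, 'tagname' ],
    text_h  => [ \&text_rtn,  'text' ],
    end_h   => [ \&end_rtn,   'tagname' ],
);

$p->parse('<HTML><HEAD><TITLE>Sample HTML Page</TITLE></HEAD>'
        . '<BODY>Hello World</BODY></HTML>');
$p->eof;

print "Title: $title\n";    # prints "Title: Sample HTML Page"
```

The text is appended rather than assigned because the parser may deliver one run of text in several pieces.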
The sample code passes an HTML string to the parser. You can also pass a file to it by using the parse_file($file) method of the module.
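A minimal sketch of parse_file follows. To keep it self-contained it first writes a small temporary file with the core File::Temp module; in practice $file would be the path to an existing HTML file. The array @tags and the file contents are made up for the example.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use HTML::Parser ();

# Create a small HTML file to parse (a stand-in for a real page on disk).
my ($fh, $file) = tempfile(UNLINK => 1);
print {$fh} "<html><body><p>Hello</p><p>World</p></body></html>\n";
close $fh or die "Cannot close $file: $!";

# Record the name of every start tag in the order encountered.
my @tags;
my $p = HTML::Parser->new(
    start_h => [ sub { my ($tagname) = @_; push @tags, $tagname }, 'tagname' ],
);

# parse_file returns undef if the file cannot be opened.
$p->parse_file($file) or die "Cannot parse $file: $!";

print "Start tags: @tags\n";    # prints "Start tags: html body p p"
```

Unlike parse, parse_file signals end of input itself, so no separate call to eof is needed.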
For more information
To learn more about HTML::Parser, check the Perl documentation for the module, for example by running perldoc HTML::Parser.
Copyright: © 2012 Concept Solutions Corporation