Parsing MIME & HTML

Understanding an email message encoded with MIME can be very, very, very difficult. It can get frustrating due to the number of options and the different ways to do the actual encoding. Add to that the sometimes too-liberal interpretations of the relevant RFCs by email client designers, and you will begin to get the idea. This article will show you how this task can be laughably simple thanks to Perl's extensive bag of tricks, CPAN.

I started out with a simple and straightforward mission: fetch an email from a POP mailbox and display it on a 7-bit, text-only device. This article describes the different stages of a simple tool that accomplishes this task, written in Perl with a lot of help from CPAN modules. I hope it will be useful to other Perl folks who have a similar mission. Let's discuss each part of this task in turn, as we read through mfetch, the script I prepared as an example. Keep in mind that TIMTOWTDI (there is more than one way to do it).

Setting up the script

The first thing, as you know, is loading up all of the modules I will be using. I'm sure you already know strict and warnings. We'll see how we use the rest of the modules a bit later.

     1: #!/usr/bin/perl
     2:
     3: # This script is (c) 2002 Luis E. Muñoz, All Rights Reserved
     4: # This code can be used under the same terms as Perl itself. It comes
     5: # with absolutely NO WARRANTY. Use at your own risk.
     6:
     7: use strict;
     8: use warnings;
     9: use IO::File;
    10: use Net::POP3;
    11: use NetAddr::IP;
    12: use Getopt::Std;
    13: use MIME::Parser;
    14: use HTML::Parser;
    15: use Unicode::Map8;
    16: use MIME::WordDecoder;
    17:
    18: use vars qw($opt_s $opt_u $opt_p $opt_m $wd $e $map);
    19:
    20: getopts('s:u:p:m:');
    21:
    22: usage_die("-s server is required\n") unless $opt_s;
    23: usage_die("-u username is required\n") unless $opt_u;
    24: usage_die("-p password is required\n") unless $opt_p;
    25: usage_die("-m message is required\n") unless $opt_m;
    26:
    27: $opt_s = NetAddr::IP->new($opt_s)
    28:     or die "Cannot make sense of given server\n";

Note lines 27 and 28, where I use NetAddr::IP to convert whatever the user gave us through the -s option into an IP address. This is a very common use of this module, as its new() method will convert many common IP notations into an object I can later extract an IP address from. It will even perform name resolution for us if required. So far everything should look familiar, as a lot of scripts start out like this one.

It is worth noting that the error handling in lines 22-25 is not a brilliant example of good coding or documentation. It is much better practice to write your script's documentation in POD and use a module such as Pod::Usage to provide useful error messages to the user. At the very least, try to provide an informative usage message. You can see the usage_die() function if you download the complete script.
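To make that advice concrete, here is a minimal sketch of the Pod::Usage approach. The SYNOPSIS text is invented for illustration; a real script would exit with pod2usage() when an option is missing, but this sketch captures the generated usage text instead so it can run anywhere.

```perl
use strict;
use warnings;
use Pod::Usage;

=head1 SYNOPSIS

mfetch -s server -u user -p password -m message

=cut

# Capture the usage text instead of exiting (-exitval => 'NOEXIT'),
# writing it to an in-memory filehandle we control.
my $usage = '';
open my $out, '>', \$usage or die "cannot open string handle: $!";
pod2usage(-exitval => 'NOEXIT', -output => $out, -verbose => 0);
close $out;

print $usage;
```

In a real script you would simply call pod2usage("-s server is required") and let it print the SYNOPSIS and exit.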

Fetching a message via POP3

On to deeper waters. The first step in parsing a message is getting at the message itself. For this, I'll use Net::POP3, which implements the POP3 protocol described in RFC 1939. This is all done in the code below.

    30: my $pops = Net::POP3->new($opt_s->addr)
    31:     or die "Failed to connect to POP3 server: $!\n";
    32:
    33: $pops->login($opt_u, $opt_p)
    34:     or die "Authentication failed\n";
    35:
    36: my $fh = new_tmpfile IO::File
    37:     or die "Cannot create temporary file: $!\n";
    38:
    39: $pops->get($opt_m, $fh)
    40:     or die "No such message $opt_m\n";
    41:
    42: $pops->quit();
    43: $pops = undef;
    44:
    45: $fh->seek(0, SEEK_SET);

At line 30, a connection to the POP server is attempted. This is a TCP connection, in this case to port 110. If the connection succeeds, the USER and PASS commands are issued at line 33; these are the simplest form of authentication supported by the POP protocol. Your username and password are sent through the network without the protection of cryptography, so a bit of caution is in order.

Net::POP3 supports many operations defined in the POP protocol that allow for more complex actions, such as fetching the list of messages, unseen messages, etc. It can also fetch messages for us in a variety of ways. Since I want this script to be as lightweight as possible (i.e., to burn as little memory as possible), I fetch the message into a temporary on-disk file. The temporary file is nicely provided by the new_tmpfile method of IO::File in line 36, which returns a file handle to an already-deleted file. I can work on this file, and it will magically disappear when the script is finished.
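The new_tmpfile trick is worth a tiny self-contained demonstration (with made-up content standing in for a fetched message): write into the anonymous handle, rewind, and read back, just as mfetch does.

```perl
use strict;
use warnings;
use IO::File;
use Fcntl qw(SEEK_SET);

# new_tmpfile returns a read/write handle to an anonymous file that
# the operating system removes automatically when the handle goes away.
my $fh = IO::File->new_tmpfile
    or die "Cannot create temporary file: $!\n";

# Pretend this is the message Net::POP3's get() wrote for us.
print $fh "+OK message follows\r\nSubject: test\r\n";

# Rewind before reading back, exactly what mfetch does at line 45.
$fh->seek(0, SEEK_SET);
my $first = <$fh>;
print $first;
```

Because the file has no name and is already unlinked, there is nothing to clean up even if the script dies halfway through.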

Later, I instruct the Net::POP3 object to fetch the required message from the mail server and write it to the supplied filehandle using the get method, on line 39. After this, the connection is terminated gracefully by invoking quit and destroying the object. Destroying the object ensures that the TCP connection with the server is closed, freeing the resources held on the POP server as soon as possible. This is good programming practice for network clients.

The interaction mfetch requires with the POP server is really simple, so I'm not doing justice to Net::POP3. It provides a very complete implementation of the protocol, allowing for much more sophisticated applications.

Note that in line 45, I rewind the file so that the fetched message can be read back by the code that follows.

For this particular example, we could also have used Net::POP3Client, which provides a somewhat similar interface. The code would have looked more or less like the following fragment.

     1: my $pops = new Net::POP3Client(USER     => $opt_u,
     2:                                PASSWORD => $opt_p,
     3:                                HOST     => $opt_s->addr)
     4:     or die "Error connecting or logging in: $!\n";
     5:
     6: my $fh = new_tmpfile IO::File
     7:     or die "Cannot create temporary file: $!\n";
     8:
     9: $pops->HeadAndBodyToFile($fh, $opt_m)
    10:     or die "Cannot fetch message: $!\n";
    11:
    12: $pops->Close();

Parsing the MIME structure

Just as email travels inside a sort of envelope (the headers), complex messages that include attachments, and HTML messages in general, travel within a collection of MIME entities. You can think of these entities as containers that can transfer any kind of binary information through the email infrastructure, which in general does not know how to deal with 8-bit data. The code reproduced below takes care of parsing this MIME structure.

    47: my $mp = new MIME::Parser;
    48: $mp->ignore_errors(1);
    49: $mp->extract_uuencode(1);
    50:
    51: eval { $e = $mp->parse($fh); };
    52: my $error = ($@ || $mp->last_error);
    53:
    54: if ($error)
    55: {
    56:     $mp->filer->purge;  # Get rid of the temp files
    57:     die "Error parsing the message: $error\n";
    58: }

Perl has a wonderful class that understands this MIME encapsulation, returning a nice hierarchy of objects that represent the message. You access these facilities through the MIME::Parser class, part of the MIME-Tools bundle. MIME::Parser returns a hierarchy of MIME::Entity objects representing your message. The parser is so smart that if you pass it a non-MIME email, it will be returned to you as a single text/plain entity.
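To picture what the parser turns into a MIME::Entity tree, here is a minimal multipart message. The addresses, boundary string, and attachment bytes are invented for illustration; the parser would return one multipart/mixed entity with two leaf entities inside it.

```
From: someone@example.com
To: other@example.com
Subject: report
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="XYZZY"

--XYZZY
Content-Type: text/plain; charset="ISO-8859-1"
Content-Transfer-Encoding: quoted-printable

Se=F1or, the report is attached.
--XYZZY
Content-Type: application/pdf
Content-Transfer-Encoding: base64

JVBERi0xLjQK...
--XYZZY--
```

Each part carries its own headers, so each leaf can use a different character set and transfer encoding, which is why mfetch sets up a fresh decoder per part later on.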

MIME::Parser can be tweaked in many ways, as its documentation will show you. One point where this tuning might be important is the decoding process. Remember that I need to be as light in memory usage as possible. The default behavior of MIME::Parser uses temporary files for decoding the message. These temporary files can be spared, and core memory used instead, by invoking output_to_core(). Before doing this, note all the caveats cited in the module's documentation. The most important one is that if a 100 MB file ends up in your inbox, the whole thing will be slurped into RAM.

In line 47 I create the parser object. The call to ignore_errors() in line 48 is an attempt to make this parser as tolerant as possible. extract_uuencode() in line 49 automatically takes care of pieces of the email that are uuencoded, translating them back into a more readable form. The actual request to parse the message, available through reading the $fh filehandle, is in line 51. Note that it is enclosed in an eval block. I have to do this because the parser might throw an exception if certain errors are encountered. The eval allows me to catch this exception and react in a way that is sensible for this application. In this case, I want to be sure that any temporary file created by the parsing process is removed by a call to purge(), as seen in lines 56 and 57.

Setting up the HTML parser

Parsing HTML can be a tricky and tedious task. Thankfully, Perl has a number of nice ways to help you do this job. Excellent books such as The Perl Cookbook (from O'Reilly & Associates) have a couple of recipes that came very close to what I needed, especially recipe 20.5, "Converting HTML to ASCII", which I reproduce below.

     1: use HTML::TreeBuilder;
     2: use HTML::FormatText;
     3:
     4: $html = HTML::TreeBuilder->new();
     5: $html->parse($document);
     6:
     7: $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 50);
     8:
     9: $ascii = $formatter->format($html);

I did not want to use this recipe for two reasons: I needed fine-grained control over the HTML-to-ASCII conversion, and I wanted as little impact on resources as possible. I did a small benchmark that shows the performance difference between the two options while parsing a copy of one of my web articles. The result below shows that the custom parser explained later runs faster than the Cookbook's recipe. This does not mean that the recipe or the modules it uses are bad. It simply means that the recipe is doing a lot of additional work, which happens not to be useful for this particular task.

bash-2.05a$ ./mbench
Benchmark: timing 100 iterations of Cookbook's, Custom...
Cookbook's: 73 wallclock secs (52.82 usr +  0.00 sys = 52.82 CPU) @  1.89/s (n=100)
    Custom:  1 wallclock secs ( 1.17 usr +  0.00 sys =  1.17 CPU) @ 85.47/s (n=100)
              Rate Cookbook's  Custom
Cookbook's  1.89/s         --    -98%
Custom      85.5/s      4415%      --
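Numbers like these come from the core Benchmark module. The following is a simplified stand-in for the mbench script (the two subs here are trivial regex passes, not the real Cookbook and custom parsers, which would need the CPAN modules installed):

```perl
use strict;
use warnings;
use Benchmark qw(timethese cmpthese);

# Trivial stand-ins for the two strategies; the real mbench pits the
# Cookbook recipe (HTML::TreeBuilder + HTML::FormatText) against the
# custom HTML::Parser handlers on a real article.
my $doc = '<p>some text</p><img src="x.gif">' x 50;

my $results = timethese(500, {
    "Cookbook's" => sub { my @tags = $doc =~ /<(\w+)[^>]*>/g },
    "Custom"     => sub { my $n = () = $doc =~ /</g },
});

cmpthese($results);    # prints a rate table like the one above
```

timethese() returns a hash of Benchmark objects, and cmpthese() turns them into the percentage comparison table shown in the article.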

HTML::FormatText does an awesome job of converting HTML to plain text. Unfortunately, I have a set of guidelines to follow in the conversion, and they are not compatible with the output of this module. Additionally, HTML::TreeBuilder does an excellent job of parsing an HTML document, but produces an intermediate structure (the parse tree) that in my case wastes resources.

However, Perl has an excellent HTML parser in the HTML::Parser module. In this case, I chose to use this class to implement an event-driven parser, where tokens (syntactic elements) in the source document cause the parser to call functions I provide. This allowed me complete control over the translation while sparing the intermediate data structure.

Converting HTML to text is a lossy transformation: what comes out is not exactly equivalent to what went in. Pictures, text layout, style, and a few other information elements are lost. My needs required that I note the existence of images, along with a reasonably accurate rendition of the page's text, but nothing else. Remember that the target device can only display 7-bit text, on a very small and limited display. This piece of code sets up the parser to do what I need.

    62: my $parser = HTML::Parser->new
    63: (
    64:     api_version => 3,
    65:     default_h   => [ "" ],
    66:     start_h     => [ sub { print "[IMG ",
    67:                            d($_[1]->{alt}) || $_[1]->{src}, "]\n"
    68:                                if $_[0] eq 'img';
    69:                      }, "tagname, attr" ],
    70:     text_h      => [ sub { print d(shift); }, "dtext" ],
    71: ) or die "Cannot create HTML parser\n";
    72:
    73: $parser->ignore_elements(qw(script style));
    74: $parser->strict_comment(1);

Starting at line 62, I set up the HTML::Parser object that will help me do this. First, I tell it I want to use the latest (as of this writing) interface style, which provides more flexibility than earlier interfaces. On line 65, I tell the object that by default, parse events should do nothing. There are other ways to say this, but the one shown is the most efficient.

Lines 66 through 69 define a handler for the start event. This handler will be called each time an opening tag such as <a> or <img> is recognized in the source being parsed. Handlers are specified as a reference to an array whose first element tells the parser what to do and whose second element tells the parser what information to pass to the code. In this example, I supply a function that, for any img tag, will output a hopefully descriptive text composed from either the alt or the src attribute. I request that this handler be called with the name of the tag as the first argument and the list of attributes as further arguments, through the string "tagname, attr" found in line 69. The d() function will be explained a bit later; it has to do with decoding its argument.

The text event is triggered by anything between tags in the input text. I've set up a simpler handler for this event that merely prints out whatever is recognized. I also request that HTML entities such as &euro; or &ntilde; be decoded for me, through the string "dtext" on line 70. HTML entities are used to represent special characters outside the traditional ASCII range. In the interest of document accuracy, you should always use entities instead of directly placing 8-bit characters in the text.
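What "dtext" buys you can be illustrated with a stripped-down entity decoder. This is a tiny table-driven sketch, not HTML::Parser's actual implementation (which knows the full entity set):

```perl
use strict;
use warnings;

# A miniature version of what "dtext" delivers: named and numeric
# entities replaced by the characters they stand for. Only a handful
# of named entities are included here for illustration.
my %entity = (amp => '&', lt => '<', gt => '>', ntilde => "\xF1");

sub decode_entities_lite {
    my $text = shift;
    $text =~ s/&#(\d+);/chr($1)/ge;               # numeric: &#241;
    $text =~ s/&(\w+);/$entity{$1} \/\/ "&$1;"/ge; # named: &ntilde;
    return $text;
}

print decode_entities_lite('Mu&ntilde;oz &amp; Mu&#241;oz'), "\n";
```

Unknown named entities are left untouched, which is a reasonable fallback for a display-oriented tool.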

Some syntactic elements are used to enclose information that is not important for this application, such as <style>...</style> and <script>...</script>. I ask the parser to ignore those elements with the call to ignore_elements() at line 73. I also request the parser to follow strict comment syntax through the call to strict_comment() on line 74.

Setting up the Unicode mappings

MIME defines various ways to encode binary data, depending on the frequency of octets greater than 127. With relatively few high-bit octets, Quoted-Printable encoding is used; when many high-bit octets are present, Base64 encoding is used instead. The reason is that Quoted-Printable is slightly more readable but very inefficient in space, while Base64 is completely unreadable by standard humans but adds much less overhead to the size of encoded files. Often, message headers such as the sender's name are encoded using Quoted-Printable when they contain characters such as 'ñ'. These headers look like From: =?ISO-8859-1?Q?Luis_Mu=F1oz?= <some@body.org> and should be converted to From: Luis Muñoz <some@body.org>. In plain English, Quoted-Printable encoding is being used to make the extended ISO-8859-1 characters acceptable to any 7-bit transport such as email. In case you were wondering why all this fuss: many contemporary mail transport agents can properly handle message bodies that contain high-bit octets, but will choke on headers with binary data.
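The heart of what MIME::WordDecoder does for such headers can be sketched with the core MIME::QuotedPrint module. This is a deliberately simplified decoder: it handles a single Q-encoded word, ignores the charset it captures, and does not cover B (Base64) encoding or adjacent-word folding, all of which real RFC 2047 handling requires.

```perl
use strict;
use warnings;
use MIME::QuotedPrint qw(decode_qp);

# Decode one RFC 2047 "encoded word" like the From: header above.
# Q-encoding is Quoted-Printable plus one extra rule: underscores
# stand for spaces. The captured charset ($1) is ignored here.
sub decode_word {
    my $header = shift;
    $header =~ s{=\?([^?]+)\?Q\?([^?]*)\?=}{
        my $text = $2;
        $text =~ tr/_/ /;        # Q-encoding's space convention
        decode_qp($text);        # =F1 becomes the raw 0xF1 byte
    }gie;
    return $header;
}

print decode_word('=?ISO-8859-1?Q?Luis_Mu=F1oz?= <some@body.org>'), "\n";
```

The result still carries 8-bit ISO-8859-1 bytes, which is exactly why mfetch needs the Unicode mapping step described next.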

Lines 92 through 101 define setup_decoder(), which uses the headers contained in a MIME::Head object to set up a suitable decoder based on the MIME::WordDecoder class. This will translate instances of Quoted-Printable text to their high-bit equivalents. Note that I selected ISO-8859-1 as the default in case no proper character set can be identified. This was a sensible choice for me, as ISO-8859-1 covers Spanish, which happens to be my native language.

    92: sub setup_decoder
    93: {
    94:     my $head = shift;
    95:     if ($head->get('Content-Type')
    96:         and $head->get('Content-Type') =~ m!charset="([^\"]+)"!)
    97:     {
    98:         $wd = supported MIME::WordDecoder uc $1;
    99:     }
   100:     $wd = supported MIME::WordDecoder "ISO-8859-1" unless $wd;
   101: }

But this clever decoding is not enough: getting at the original high-bit characters is only half the job. I must recode those high characters into something usable by the 7-bit display device. So in line 76 I set up a mapping based on Unicode::Map8. This module can convert 8-bit characters such as ISO-8859-1 or ASCII into wider (Unicode) characters and then back into our chosen representation, ASCII, which only defines 7-bit characters. This means that any character that cannot be properly represented will be lost, which for our application is acceptable.

    76: $map = Unicode::Map8->new('ASCII')
    77:     or die "Cannot create character map\n";

The decoding and character mapping are then brought together at line 90, where I define the d() function, which simply invokes the adequate MIME decoding method, transforms the resulting string into Unicode via the to16() method, and then transforms it back into ASCII using to8() to ensure printable results on our device. Since I am allergic to warnings related to undef values, I make sure that decode() always gets a defined string to work with.

   90: sub d { $map->to8($map->to16($wd->decode(shift||''))); }

As you might notice if you try this code, the conversion is again lossy, because there are characters that do not exist in ASCII. You can experiment with the addpair() method of Unicode::Map8 to add custom character transformations (for instance, mapping an accented letter to its unaccented form, such as 'É' to 'E'). Another way to achieve this is to derive a class from Unicode::Map8 and implement the unmapped_to8 method to supply your own interpretation of the missing characters. Take a look at the module's documentation for more information.
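If Unicode::Map8 is not available on your system, the same lossy round-trip (8-bit charset to Unicode to ASCII, dropping what ASCII cannot hold) can be sketched with the core Encode module. This reproduces the behavior d() relies on, not Unicode::Map8's API:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# ISO-8859-1 bytes -> Unicode -> ASCII. The coderef is Encode's CHECK
# argument: it is called for each unmappable code point, and whatever
# it returns -- here the empty string -- replaces that character.
sub to_ascii_lossy {
    my $latin1  = shift;
    my $unicode = decode('ISO-8859-1', $latin1);
    return encode('ASCII', $unicode, sub { '' });
}

print to_ascii_lossy("Luis Mu\xF1oz"), "\n";   # the n-tilde is dropped
```

Returning something like '?' or a transliteration table lookup from the coderef would give you the equivalent of addpair() customizations.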

Starting the decode process

With all the pieces in place, all that's left is to traverse the hierarchy of entities that MIME::Parser produces after parsing a message. I implemented a very simple recursive function, decode_entities(), starting at line 103. It is recursive because recursion comes naturally as a way to handle trees such as those produced by MIME::Parser. At least to me.

   103: sub decode_entities
   104: {
   105:     my $ent = shift;
   106:
   107:     if (my @parts = $ent->parts)
   108:     {
   109:         decode_entities($_) for @parts;
   110:     }
   111:     elsif (my $body = $ent->bodyhandle)
   112:     {
   113:         my $type = $ent->head->mime_type;
   114:
   115:         setup_decoder($ent->head);
   116:
   117:         if ($type eq 'text/plain')
   118:         { print d($body->as_string); }
   119:         elsif ($type eq 'text/html')
   120:         { $parser->parse($body->as_string); }
   121:         else
   122:         { print "[Unhandled part of type $type]"; }
   123:     }
   124: }

The condition at line 107 asks if this part or entity contains other parts. If it does, it extracts them and invokes itself recursively to process each sub-part at line 109.

If this part is a leaf, its body is processed. Line 111 gets it as a MIME::Body object. On line 115 I set up a decoder for this part's encoding, and based on the type of the part, obtained at line 113, the code on lines 117 to 122 calls the proper handler.

To fire off the decoding process, I call decode_entities() with the result of the MIME parsing of the message, on line 86. This will invoke the HTML parser when needed and, in general, produce the output I am looking for in this example. After this processing is done, I make sure to wipe the temporary files created by MIME::Parser, on line 88. Note that if the message is not actually MIME encoded, MIME::Parser will arrange for you to receive a single part of type text/plain containing the whole message text, which is perfect for our application.

    86: decode_entities($e);
    87:
    88: $mp->filer->purge;

And that's about it

With these fewer than 130 lines of code, I can easily fetch and decode a message, as in the following example:

bash-2.05a$ ./mfetch -s pop.foo.bar -u myself \
-p very_secure_password -m 5

Date: Sat, 28 Dec 2002 20:14:37 -0400
From: root <root@foo.bar>
To: myself@foo.bar
Subject: This is the plain subject

This is a boring and plain message.

More complex MIME messages can also be decoded. Look at this example, where I dissect a dreaded piece of junk mail. Don't worry: I used head to spare you pages and pages of worthless image links.

bash-2.05a$ ./mfetch -s pop.foo.bar -u myself \
-p very_secure_password -m 2 | head -20


Date: Sun, 22 Dec 2002 23:22:25 -0400
From: Luis Muoz <lem@foo.bar>
To: Myself <myself@foo.bar>
Subject: Fwd: Get $860 Free - Come, Play, Have Fun!



Begin forwarded message:

> From: Cosmic Offers <munged@migada.com.INVALID>;
> Date: Sun Dec 22, 2002 20:59:43 America/Caracas
> To: spam@victim.net
> Subject: Get $860 Free - Come, Play, Have Fun!
>

>
[IMG http://www.migada.com/email/Flc_600_550_liberty_mailer_.gif]
[IMG http://www.migada.com/email/Flc_600_550_liberty_mail-02.gif]
[IMG http://www.migada.com/email/Flc_600_550_liberty_mail-03.gif]
[IMG http://www.migada.com/email/Flc_600_550_liberty_mail-04.gif]

If you're curious, please download the complete script and play with it a bit. I hope this tutorial and its companion script will be as helpful for you as they have been for me.



专业电子书 (Professional E-books)

A guide to consulting Chemical Abstracts (CA): the issue indexes, their content formats and arrangement, issue abstracts, volume (abstract) indexes, volume (auxiliary) indexes, the Index Guide, cumulative indexes, the Source Index, CA errata, the Chemical Substance Index, principles for selecting index headings, rules for consulting CA indexes and their cross-reference tables, worked lookup examples, and appendices. Compiled by 彭卿 in 1978; rather old, but still of reference value. Because of its format, the file is relatively large. Read the full article: http://www.tkk7.com/pyguru/archive/2005/02/18/1297.html

 

BioJava In Anger

A Quick-Start Guide

Introduction:

BioJava's design covers many aspects of bioinformatics; it is fairly large and complex, and can even be intimidating at times. Bioinformaticians who want to get to know this powerful toolkit quickly and put it to use are often left with a splitting headache when faced with its large pile of interfaces. This guide helps you use BioJava to develop 99% of the common programs, without having to master 99% of the BioJava interfaces to do it.

This guide follows the format of most programming quick references, using "How do I ...?" topics to help you use BioJava. Each topic provides source code you are likely to use frequently. The code can essentially be compiled and run on your machine as-is. I have commented the code in detail to make any obscure passages easier to understand.

"BioJava In Anger" is maintained by Mark Schreiber. Please send any suggestions and questions to the biojava mailing list; click here to subscribe to the list.

The examples in this guide were tested with BioJava 1.3 and Java 1.4.

The Chinese edition of this guide is by Wu Xin (Center of Bioinformatics, Peking University); for translation issues, please contact the mailing list or visit the BBS.


How do I ...?

Installation

> Install Java

> Install BioJava

Alphabets and symbols

> How do I get the DNA, RNA, or protein alphabet?

> How do I build a custom alphabet from my own symbols?

> How do I build a cross product alphabet, such as a codon alphabet?

> How do I break a cross product alphabet's symbols into their component symbols?

> How do I tell whether two alphabets or two symbols are equal?

> How do I make an ambiguous symbol, such as Y or R?

Basic sequence manipulation

> How do I create a sequence object from a string, and write it back out as a string?

> How do I get a subsequence from a sequence?

> How do I transcribe a DNA sequence to an RNA sequence?

> How do I get the complement of a DNA or RNA sequence?

> Sequences are immutable; how do I change a sequence's name?

> How do I edit a sequence or a SymbolList?

Translation

> How do I translate a DNA or RNA sequence or SymbolList to protein?

> How do I translate a single codon to a single amino acid?

> How do I use a non-standard translation table?

Sequence I/O

> How do I write sequences in FASTA format?

> How do I read a FASTA-format file?

> How do I read a GenBank/EMBL/SwissProt file?

> How do I extract sequences from GenBank/EMBL/SwissProt files and write them out in FASTA format?

> How do I turn ABI traces into BioJava sequences?

Annotation

> How do I list a sequence's annotations?

> How do I filter sequences by species (or by another annotation property)?

Locations and features

> How do I specify a point location?

> How do I specify a range location?

> How do I use a circular location?

> How do I create a feature?

> How do I filter features by type?

> How do I remove a feature?

BLAST and FASTA

> How do I create a BLAST parser?

> How do I create a FASTA parser?

> How do I extract information from the parsed results?

Counts and distributions

> How do I count the residues in a sequence?

> How do I compute the frequency of a symbol in a sequence?

> How do I turn counts into a distribution?

> How do I generate a random sequence from a distribution?

> How do I compute the entropy of a distribution?

> Is there a simple way to decide whether two distributions have the same weights?

> How do I create an order-N distribution over a custom alphabet?

> How do I write a distribution out as XML?

Weight matrices and dynamic programming

> How do I search for a motif with a weight matrix?

> How do I create a profile HMM?

> How do I build a custom hidden Markov model (HMM)?

User interfaces

> How do I display annotations and features as a tree?

> How do I display a sequence in a GUI?

> How do I display a sequence ruler?

> How do I display features?

OBDA

> How do I set up BioSQL?

Disclaimer:

The source code here was contributed by various authors. Although it has been tested by us, errors may still occur. All of the code is free to use, but we neither guarantee nor take responsibility for its correctness. Please test it yourself before using it.

Copyright:

The documents on this site belong to their contributors. If you want to use them in a publication, please ask on the biojava mailing list first. The source code is an open resource; you may use it free of charge if you agree with its statement.

Maintained by Wu Xin, CBI, Peking University, China, 2003

 



Bioperl

Bioperl recently reached version 1.0. First, about bioperl.org: the organization was formally founded in 1995, having existed as an informal group for many years before that. It has since grown into an international association of developers who build open-source Perl tools for bioinformatics, genomics, and life-science research.

The organization is supported and promoted by the Open Bioinformatics Foundation. Its partners include biojava.org, biopython.org, DAS, bioruby.org, biocorba.org, ENSEMBL, and EMBOSS.

The Bioperl server offers Perl-based modules, scripts, and web-connected software for the life sciences.

Bioperl has grown into a remarkable international free-software project, and its use in bioinformatics has accelerated research in bioinformatics, genomics, and the other life sciences. Bioperl 1.0 was released recently; the project took seven years, and the results are impressive. Bioperl 1.0 comprises 832 files and 93 scripts, is rich in functionality, and is entirely open source. It is a sharp tool for bioinformatics research. For details, visit www.bioperl.org.

As a Perl extension specialized for bioinformatics, Bioperl naturally inherits Perl's many strengths:

First, Perl's powerful regular-expression matching and string handling make this kind of work simple in a way no other language can match. Perl is very good at slicing, twisting, wringing, and mangling text files. Biological data is mostly text: species names, taxonomic relationships, annotations on genes or sequences, comments, catalogue lookups; even DNA sequences are text-like. Exchanging biological data that lives in text files with mutually incompatible formats is a real headache, and Perl's strength in this area solves quite a few of those problems.

Second, Perl is fault tolerant. Biological data is usually incomplete, and errors can creep in when the data is produced. A field may be omitted or left empty, or a field may be expected to appear several times (for example, when an experiment is repeated), or the data was entered by hand and therefore contains mistakes. Perl does not care whether a value is empty or contains strange characters. Regular expressions can be written to pick out and correct common mistakes. Of course, this flexibility can also work against you.

Also, Perl is component oriented. Perl encourages people to write their software as modules, whether as Perl library modules or in the classic Unix tool-oriented style. External programs can easily be integrated into a Perl program through pipes, system calls, or sockets. The dynamic loader introduced with Perl 5 even allows people to use C functions, or entire compiled libraries, inside the Perl interpreter. A recent result of this is that refined work from around the world is being collected into a set of modules called "bioPerl" (see The Perl Journal).

Furthermore, Perl is easy to write and fast to develop in. The interpreter does not make you declare all your function prototypes and data types; calling an undefined function only raises an error, and the debugger cooperates well with Emacs, giving you a comfortable, interactive development style.

Perl is also a good prototyping language. Because it is quick and dirty, building a prototype of a new algorithm in Perl often makes more sense than writing it straight away in a fast language that has to be compiled. Sometimes it turns out that Perl is already fast enough, and the program never needs to be ported; more often, one writes a small core in C, compiled as a dynamically loaded module or an external executable, and the rest of the program in Perl. An example of this can be seen at http://waldo.wi.mit.edu/ftp/distribution/software/rhmapper/.

One more important point: Perl is excellent for writing web CGI, and this matters more and more as laboratories publish their data on the web. My experience of using Perl in a genome-center environment has been praiseworthy from beginning to end. Still, I have found that Perl has its problems too. Its loose programming style leads to many errors that stricter languages would catch. For example, Perl lets you use a variable before a value has been assigned to it; this is a useful feature when you need it, but a disaster when you have simply mistyped an identifier. Likewise, it is easy to forget to declare a function's local variables, inadvertently clobbering global ones.

Finally, Perl falls short when it comes to building graphical user interfaces. Although to Unix diehards everything can be done from the command line, most end users do not agree: windows, menus, and bouncing icons have become required fashion.

Until recently, Perl's GUI support was immature. But thanks to the efforts of Nick Ing-Simmons, the integration of perlTk (pTk) makes Perl-driven user interfaces possible under the X Window System. My colleagues and I wrote several pTk-based applications for internet users at the MIT genome center, and it was a satisfying experience from start to finish. Other genome centers use pTk on a larger scale; in some places it has become a mainstay of production.

生物信息学 (Bioinformatics)

Authors: 黄英? (解放军306医院) and ?涛 (清华大学生物信息学研究所); reviewer: 孙之荣 (清华大学生物信息学研究所).

1 Overview

2 Biological databases and queries

  2.1 Gene and genome databases

  2.2 Protein databases

  2.3 Functional databases

  2.4 Other database resources

3 Sequence alignment and database searching

  3.1 Pairwise sequence alignment

  3.2 Multiple sequence alignment

4 Prediction and analysis of nucleic acid and protein structure and function

  4.1 Prediction methods for nucleic acid sequences

  4.2 Prediction methods for proteins

5 Molecular evolution

6 Genome sequence information analysis

  6.1 Genome sequence analysis tools

  6.2 Human and mouse physical maps and their use

  6.3 SNP identification

  6.4 Whole-genome comparison

  6.5 Applications of EST sequences

7 Functional genomics information analysis

  7.1 Analysis of large-scale gene-expression profiles

  7.2 Integrated genome-wide prediction of protein function

References

NAME

CGI::Carp - CGI routines for writing to the HTTPD (or other) error log


SYNOPSIS

    use CGI::Carp;

    croak "We're outta here!";
    confess "It was my fault: $!";
    carp "It was your fault!";
    warn "I'm confused";
    die "I'm dying.\n";


DESCRIPTION

CGI scripts have a nasty habit of leaving warning messages in the error logs that are neither time stamped nor fully identified. Tracking down the script that caused the error is a pain. This fixes that. Replace the usual

    use Carp;

with

    use CGI::Carp;

And the standard warn(), die(), croak(), confess() and carp() calls will automagically be replaced with functions that write out nicely time-stamped messages to the HTTP server error log.

For example:

   [Fri Nov 17 21:40:43 1995] test.pl: I'm confused at test.pl line 3.
   [Fri Nov 17 21:40:43 1995] test.pl: Got an error message: Permission denied.
   [Fri Nov 17 21:40:43 1995] test.pl: I'm dying.
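For reference, the timestamp in those log lines can be reproduced with the core POSIX module; this is an illustrative sketch, not CGI::Carp's actual code:

```perl
#!/usr/bin/perl
# Build a "[Fri Nov 17 21:40:43 1995]" style stamp with core POSIX::strftime.
use strict;
use warnings;
use POSIX qw(strftime);

my $stamp = strftime("[%a %b %e %H:%M:%S %Y]", localtime);
print "$stamp test.pl: I'm confused at test.pl line 3.\n";
```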


REDIRECTING ERROR MESSAGES

By default, error messages are sent to STDERR. Most HTTPD servers direct STDERR to the server's error log. Some applications may wish to keep private error logs, distinct from the server's error log, or they may wish to direct error messages to STDOUT so that the browser will receive them.

The carpout() function is provided for this purpose. Since carpout() is not exported by default, you must import it explicitly by saying

   use CGI::Carp qw(carpout);

The carpout() function requires one argument, which should be a reference to an open filehandle for writing errors. It should be called in a BEGIN block at the top of the CGI application so that compiler errors will be caught. Example:

   BEGIN {
       use CGI::Carp qw(carpout);
       open(LOG, ">>/usr/local/cgi-logs/mycgi-log") or
           die("Unable to open mycgi-log: $!\n");
       carpout(LOG);
   }

carpout() does not handle file locking on the log for you at this point.

The real STDERR is not closed -- it is moved to SAVEERR. Some servers, when dealing with CGI scripts, close their connection to the browser when the script closes STDOUT and STDERR. SAVEERR is used to prevent this from happening prematurely.
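The dup-then-redirect idea behind this can be sketched with plain filehandle operations. This is only an illustrative sketch (the filehandle names and the temp file are for the demo; CGI::Carp's internals differ):

```perl
#!/usr/bin/perl
# Keep a duplicate of the real STDERR before pointing STDERR at a
# private log file, in the spirit of CGI::Carp's SAVEERR trick.
use strict;
use warnings;
use File::Temp qw(tempfile);

my (undef, $logfile) = tempfile(UNLINK => 0);

open(my $saveerr, '>&', \*STDERR) or die "dup failed: $!";
open(STDERR, '>', $logfile)       or die "redirect failed: $!";

warn "recorded in the log\n";   # now lands in $logfile, not the terminal
close STDERR;

open(my $in, '<', $logfile) or die "reopen failed: $!";
print "log says: ", scalar <$in>;   # prints: log says: recorded in the log
```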

You can pass filehandles to carpout() in a variety of ways. The ``correct'' way according to Tom Christiansen is to pass a reference to a filehandle GLOB:

    carpout(\*LOG);

This looks weird to mere mortals however, so the following syntaxes are accepted as well:

    carpout(LOG);
    carpout(main::LOG);
    carpout(main'LOG);
    carpout(\LOG);
    carpout(\'main::LOG');

    ... and so on

FileHandle and other objects work as well.

Use of carpout() is not great for performance, so it is recommended for debugging purposes or for moderate-use applications. A future version of this module may delay redirecting STDERR until one of the CGI::Carp methods is called to prevent the performance hit.


MAKING PERL ERRORS APPEAR IN THE BROWSER WINDOW

If you want to send fatal (die, confess) errors to the browser, ask to import the special ``fatalsToBrowser'' subroutine:

    use CGI::Carp qw(fatalsToBrowser);
    die "Bad error here";

Fatal errors will now be echoed to the browser as well as to the log. CGI::Carp arranges to send a minimal HTTP header to the browser so that even errors that occur in the early compile phase will be seen. Nonfatal errors will still be directed to the log file only (unless redirected with carpout).
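A toy sketch of the idea (not CGI::Carp's actual implementation): a __DIE__ handler can assemble a response with a minimal header in front of the error text, which is why even early errors become visible in the browser.

```perl
#!/usr/bin/perl
# Illustrative only: collect a minimal HTTP response when die() fires.
use strict;
use warnings;

my $page = '';
$SIG{__DIE__} = sub {
    my ($err) = @_;
    $page .= "Content-Type: text/html\n\n";
    $page .= "<h1>Software error</h1>\n<pre>$err</pre>\n";
};

eval { die "Bad error here\n" };   # __DIE__ hooks run even inside eval
print $page;
```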


By default, the software error message is followed by a note to contact the Webmaster by e-mail with the time and date of the error. If this message is not to your liking, you can change it using the set_message() routine. This is not imported by default; you should import it on the use() line:

    use CGI::Carp qw(fatalsToBrowser set_message);
    set_message("It's not a bug, it's a feature!");

You may also pass in a code reference in order to create a custom error message. At run time, your code will be called with the text of the error message that caused the script to die. Example:

    use CGI::Carp qw(fatalsToBrowser set_message);
    BEGIN {
        sub handle_errors {
            my $msg = shift;
            print "<h1>Oh gosh</h1>";
            print "Got an error: $msg";
        }
        set_message(\&handle_errors);
    }

In order to correctly intercept compile-time errors, you should call set_message() from within a BEGIN{} block.


CHANGE LOG

1.05 carpout() added and minor corrections by Marc Hedlund <hedlund@best.com> on 11/26/95.

1.06 fatalsToBrowser() no longer aborts for fatal errors within eval() statements.

1.08 set_message() added and carpout() expanded to allow for FileHandle objects.

1.09 set_message() now allows users to pass a code REFERENCE for really custom error messages. croak and carp are now exported by default. Thanks to Gunther Birznieks for the patches.

1.10 Patch from Chris Dean (ctdean@cogit.com) to allow module to run correctly under mod_perl.


AUTHORS

Lincoln D. Stein <lstein@genome.wi.mit.edu> Feel free to redistribute this under the Perl Artistic License.


SEE ALSO

Carp, CGI::Base, CGI::BasePlus, CGI::Request, CGI::MiniSvr, CGI::Form, CGI::Response


DISCLAIMER

We are painfully aware that these documents may contain incorrect links and misformatted HTML. Such bugs lie in the automatic translation process that automatically created the hundreds and hundreds of separate documents that you find here. Please do not report link or formatting bugs, because we cannot fix per-document problems. The only bug reports that will help us are those that supply working patches to the installhtml or pod2html programs, or to the Pod::HTML module itself, for which I and the entire Perl community will shower you with thanks and praises.

If rather than formatting bugs, you encounter substantive content errors in these documents, such as mistakes in the explanations or code, please use the perlbug utility included with the Perl distribution.

--Tom Christiansen, Perl Documentation Compiler and Editor



pyguru 2005-02-18 05:27 Post a comment
]]>
Perl 5 by Examplehttp://www.tkk7.com/pyguru/archive/2005/02/18/1292.htmlpygurupyguruThu, 17 Feb 2005 19:56:00 GMThttp://www.tkk7.com/pyguru/archive/2005/02/18/1292.htmlhttp://www.tkk7.com/pyguru/comments/1292.htmlhttp://www.tkk7.com/pyguru/archive/2005/02/18/1292.html#Feedback0http://www.tkk7.com/pyguru/comments/commentRss/1292.htmlhttp://www.tkk7.com/pyguru/services/trackbacks/1292.htmlRead the full article

pyguru 2005-02-18 03:56 Post a comment
]]>
Perl: The Carp Modulehttp://www.tkk7.com/pyguru/archive/2005/02/18/1291.htmlpygurupyguruThu, 17 Feb 2005 19:49:00 GMThttp://www.tkk7.com/pyguru/archive/2005/02/18/1291.htmlhttp://www.tkk7.com/pyguru/comments/1291.htmlhttp://www.tkk7.com/pyguru/archive/2005/02/18/1291.html#Feedback0http://www.tkk7.com/pyguru/comments/commentRss/1291.htmlhttp://www.tkk7.com/pyguru/services/trackbacks/1291.html Example: The Carp Module

This useful little module lets you do a better job of analyzing runtime errors, like when your script can't open a file or when an unexpected input value is found. It defines the carp(), croak(), and confess() functions. These are similar to warn() and die(). However, instead of reporting the exact script line where the error occurred, the functions in this module display the line number that called the function that generated the error. Confused? So was I, until I did some experimenting. The results of that experimenting can be found in Listing 15.6.

Load the Carp module.
Invoke the strict pragma.
Start the Foo namespace.
Define the foo() function.
Call the carp() function.
Call the croak() function.
Switch to the main namespace.
Call the foo() function.

Listing 15.6  15LST06.PL-Using the carp() and croak() from the Carp Module
use Carp;
use strict;

package Foo;
sub foo {
    main::carp("carp called at line " . __LINE__ .
        ",\n but foo() was called");

    main::croak("croak called at line " . __LINE__ .
        ",\n but foo() was called");
}

package main;
Foo::foo();


This program displays:

carp called at line 9,
 but foo() was called at e.pl line 18
croak called at line 10,
 but foo() was called at e.pl line 18

This example uses a compiler symbol, __LINE__, to incorporate the current line number into the string passed to both carp() and croak(). This technique enables you to see both the line number where carp() and croak() were called and the line number where foo() was called.
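A minimal demonstration of the compiler symbol itself:

```perl
#!/usr/bin/perl
# __LINE__ is replaced at compile time with the current line number,
# which is how Listing 15.6 embeds its own position in the message.
use strict;
use warnings;

my $here = __LINE__;                          # this line's number
my $msg  = "carp called at line " . __LINE__; # one line further down
print "$msg\n";
```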

The Carp module also defines a confess() function, which is similar to croak() except that a function call history is also displayed. Listing 15.7 shows how this function can be used. The function declarations were placed after the foo() function call so that the program flow reads from top to bottom with no jumping around.

Load the Carp module.
Invoke the strict pragma.
Call foo().
Define foo().
Call bar().
Define bar().
Call baz().
Define baz().
Call confess().

Listing 15.7  15LST07.PL-Using confess() from the Carp Module
use Carp;
use strict;

foo();

sub foo {
    bar();
}

sub bar {
    baz();
}

sub baz {
    confess("I give up!");
}

This program displays:

I give up! at e.pl line 16
        main::baz called at e.pl line 12
        main::bar called at e.pl line 8
        main::foo called at e.pl line 5

This daisy-chain of function calls was done to show you how the function call history looks when displayed. The function call history is also called a stack trace. As each function is called, the address from which it is called gets placed on a stack. When the confess() function is called, the stack is unwound, or read. This lets Perl print the function call history.
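The same frame information that confess() walks is available from Perl's built-in caller(); here is a hand-rolled version for illustration (this is not Carp's real code):

```perl
#!/usr/bin/perl
# Build a stack trace by walking the call frames with caller(),
# mimicking the kind of listing confess() prints.
use strict;
use warnings;

sub trace {
    my @frames;
    my $depth = 0;
    while (my @frame = caller($depth++)) {
        my ($file, $line, $sub) = @frame[1, 2, 3];
        push @frames, "$sub called at $file line $line";
    }
    return @frames;
}

sub baz { return trace() }
sub bar { return baz() }
sub foo { return bar() }

print "$_\n" for foo();
```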



pyguru 2005-02-18 03:49 Post a comment
]]>
Add RSS feeds to your Web site with Perl XML::RSShttp://www.tkk7.com/pyguru/archive/2005/02/17/1268.htmlpygurupyguruWed, 16 Feb 2005 19:04:00 GMThttp://www.tkk7.com/pyguru/archive/2005/02/17/1268.htmlhttp://www.tkk7.com/pyguru/comments/1268.htmlhttp://www.tkk7.com/pyguru/archive/2005/02/17/1268.html#Feedback0http://www.tkk7.com/pyguru/comments/commentRss/1268.htmlhttp://www.tkk7.com/pyguru/services/trackbacks/1268.html Guest Contributor, TechRepublic
December 22, 2004
URL: http://www.builderau.com.au/architect/webservices/0,39024590,39171461,00.htm



Take advantage of the XML::RSS CPAN package, which is specifically designed to read and parse RSS feeds.

You've probably already heard of RSS, the XML-based format which allows Web sites to publish and syndicate the latest content on their site to all interested parties. RSS is a boon to the lazy Webmaster, because (s)he no longer has to manually update his or her Web site with new content.

Instead, all a Webmaster has to do is plug in an RSS client, point it to the appropriate Web sites, and sit back and let the site "update itself" with news, weather forecasts, stock market data, and software alerts. You've already seen, in previous articles, how you can use the ASP.NET platform to manually parse an RSS feed and extract information from it by searching for the appropriate elements. But I'm a UNIX guy, and I have something that's even better than ASP.NET. It's called Perl.

Installing XML::RSS
RSS parsing in Perl is usually handled by the XML::RSS CPAN package. Unlike ASP.NET, which comes with a generic XML parser and expects you to manually write RSS-parsing code, the XML::RSS package is specifically designed to read and parse RSS feeds. When you give XML::RSS an RSS feed, it converts the various <item>s in the feed into array elements, and exposes numerous methods and properties to access the data in the feed. XML::RSS currently supports versions 0.9, 0.91, and 1.0 of RSS.

Written entirely in Perl, XML::RSS isn't included with Perl by default, and you must install it from CPAN. Detailed installation instructions are provided in the download archive, but by far the simplest way to install it is to use the CPAN shell, as follows:

shell> perl -MCPAN -e shell
cpan> install XML::RSS

If you use the CPAN shell, dependencies will be automatically downloaded for you (unless you told the shell not to download dependent modules). If you manually download and install the module, you may need to download and install the XML::Parser module before XML::RSS can be installed. The examples in this tutorial also need the LWP::Simple package, so you should download and install that one too if you don't already have it.

Basic usage
For our example, we'll assume that you're interested in displaying the latest geek news from Slashdot on your site. The URL for Slashdot's RSS feed is http://www.slashdot.org/index.rss. The script in Listing A retrieves this feed, parses it, and turns it into a human-readable HTML page using XML::RSS:

Listing A

#!/usr/bin/perl

# import packages
use XML::RSS;
use LWP::Simple;

# initialize object
$rss = new XML::RSS();

# get RSS data
$raw = get('http://www.slashdot.org/index.rss');

# parse RSS feed
$rss->parse($raw);

# print HTML header and page
print "Content-Type: text/html\n\n";
print "<html><body>";
print "<h1>" . $rss->channel('title') . "</h1>";

# print titles and URLs of news items
foreach my $item (@{$rss->{'items'}}) {
    $title = $item->{'title'};
    $url   = $item->{'link'};
    print "<a href=\"$url\">$title</a><br/>\n";
}

# print footers
print "</body></html>";

Place the script in your Web server's cgi-bin/ directory. Remember to make it executable, and then browse to it using your Web browser. After a short wait for the RSS file to download, you should see something like Figure A.

Figure A: Slashdot RSS feed

How does the script in Listing A work? Well, the first task is to get the RSS feed from the remote system to the local one. This is accomplished with the LWP::Simple package, which simulates an HTTP client and opens up a network connection to the remote site to retrieve the RSS data. An XML::RSS object is created, and this raw data is then passed to it for processing.

The various elements of the RSS feed are converted into Perl structures, and a foreach() loop is used to iterate over the array of items. Each item contains properties representing the item name, URL and description; these properties are used to dynamically build a readable list of news items. Each time Slashdot updates its RSS feed, the list of items displayed by the script above will change automatically, with no manual intervention required.

The script in Listing A will work with other RSS feeds as well—simply alter the URL passed to the LWP's get() method, and watch as the list of items displayed by the script changes.


Here are some RSS feeds to get you started

Tip: Notice that the RSS channel name (and description) can be obtained with the object's channel() method, which accepts any one of three arguments (title, description or link) and returns the corresponding channel value.


Adding multiple sources and optimising performance
So that takes care of adding a feed to your Web site. But hey, why limit yourself to one when you can have many? Listing B, a revision of Listing A, sets up an array containing the names of many different RSS feeds, and iterates over the array to produce a page containing multiple channels of information.

Listing B

#!/usr/bin/perl

# import packages
use XML::RSS;

# local copies of several RSS feeds, refreshed periodically
@feeds = ('index.rss', 'freshmeat.rdf');

# print HTML header and page
print "Content-Type: text/html\n\n";
print "<html><body>";

foreach my $feed (@feeds) {
    # initialize object and parse the local RSS file
    $rss = new XML::RSS();
    $rss->parsefile($feed);

    print "<h1>" . $rss->channel('title') . "</h1>";

    # print titles and URLs of news items
    foreach my $item (@{$rss->{'items'}}) {
        $title = $item->{'title'};
        $url   = $item->{'link'};
        print "<a href=\"$url\">$title</a><br/>\n";
    }
}

# print footers
print "</body></html>";

Figure B shows you what it looks like.

Figure B: Several RSS feeds

You'll notice, if you're sharp-eyed, that Listing B uses the parsefile() method to read a local version of the RSS file, instead of using LWP to retrieve it from the remote site. This revision results in improved performance, because it does away with the need to generate an internal request for the RSS data source every time the script is executed. Fetching the RSS file on each script run not only causes things to go slow (because of the time taken to fetch the RSS file), but it's also inefficient; it's unlikely that the source RSS file will change on a minute-by-minute basis, and by fetching the same data over and over again, you're simply wasting bandwidth. A better solution is to retrieve the RSS data source once, save it to a local file, and use that local file to generate your page.

Depending on how often the source file gets updated, you can write a simple shell script to download a fresh copy of the file on a regular basis.

Here's an example of such a script:

#!/bin/bash
/bin/wget http://www.freshmeat.net/backend/fm.rdf -O freshmeat.rdf

This script uses the wget utility (included with most Linux distributions) to download and save the RSS file to disk. Add this to your system crontab, and set it to run on an hourly or daily basis.

If you find performance unacceptably low even after using local copies of RSS files, you can take things a step further, by generating a static HTML snapshot from the script above, and sending that to clients instead. To do this, comment out the line printing the "Content-Type" header in the script above and then run the script from the console, redirecting the output to an HTML file. Here's how:

$ ./rss.cgi > static.html

Now, simply serve this HTML file to your users. Since the file is a static file and not a script, no server-side processing takes place before the server transmits it to the client. You can run the command-line above from your crontab to regenerate the HTML file on a regular basis. Performance with a static file should be noticeably better than with a Perl script.

Looks easy? What are you waiting for—get out there and start hooking your site up to your favorite RSS news feeds.



pyguru 2005-02-17 03:04 Post a comment
]]>
Lilina: Building a personal portal with an RSS aggregator (Write once, publish anywhere)http://www.tkk7.com/pyguru/archive/2005/02/17/1267.htmlpygurupyguruWed, 16 Feb 2005 19:00:00 GMThttp://www.tkk7.com/pyguru/archive/2005/02/17/1267.htmlhttp://www.tkk7.com/pyguru/comments/1267.htmlhttp://www.tkk7.com/pyguru/archive/2005/02/17/1267.html#Feedback0http://www.tkk7.com/pyguru/comments/commentRss/1267.htmlhttp://www.tkk7.com/pyguru/services/trackbacks/1267.html
Lilina: Building a personal portal with an RSS aggregator (Write once, publish anywhere)

While searching recently for RSS parsing tools, I found MagpieRSS and Lilina, which is built on top of it. Lilina's main features:

1. Web-based RSS management (add, delete, OPML export); a back-end RSS cache (to avoid putting too much load on the source servers); and a ScriptLet, a del.icio.us-style "subscribe this" bookmarklet script.

2. Front-end publishing. I turned my own home page into a Lilina-published digest of the blogs of several friends I read regularly, which also saves me much of the work of updating my own pages. It requires PHP 4.3 with mbstring and iconv.

[Screenshot: lilina.png]

Open-source software's i18n support keeps getting better: PHP 4.3.x built with '--enable-mbstring' and '--with-iconv' does a decent job of publishing RSS in UTF-8 and other Chinese character sets side by side. Thanks go to Steve for the PHP transcoding and XML hacking work on MagpieRSS. As of now, "Add to My Yahoo!" still does not handle RSS subscriptions in the UTF-8 character set well.

I remember Wen Xin introducing the idea of a personal portal at a CNBlog workshop early this year. As RSS matures within CMS technology, more and more services let individual users build a portal around their own needs, which fits the Internet's trend toward decentralization; with "Add to My Yahoo!", for example, users can easily subscribe to news from many more data sources. Now imagine aggregating and republishing your del.icio.us bookmarks, your flickr photos, and your Yahoo! news through an RSS aggregator like this. How fast could that spread?

Just as software development achieved "write once, run anywhere" through middle platforms and virtual machines, with RSS/XML as the intermediate layer, information publishing achieves the same: write once, publish anywhere.

Installing Lilina requires PHP 4.3 or later with iconv and mbstring support, so check that your PHP build includes '--with-iconv'.

You also need a host that allows server-side scripts to make outbound requests to external servers, which 51.NET does not support. PowWeb's hosting has worked well for me; many of the needed packages come preinstalled:

iconv
iconv support enabled
iconv implementation unknown
iconv library version unknown

Directive Local Value Master Value
iconv.input_encoding ISO-8859-1 ISO-8859-1
iconv.internal_encoding ISO-8859-1 ISO-8859-1
iconv.output_encoding ISO-8859-1 ISO-8859-1

mbstring
Multibyte Support enabled
Japanese support enabled
Simplified chinese support enabled
Traditional chinese support enabled
Korean support enabled
Russian support enabled
Multibyte (japanese) regex support enabled

Unpack the installation archive (the downloaded file has a .gz extension but is actually a .tgz, so rename it first), upload it to the appropriate directory on the server, make sure the cache directory and the current directory are writable, then configure the parameters in conf.php and you are ready to go.

He Dong's suggestions to me:
1. In the right-hand column, the first "sources" list would look better with an icon, like the hobby and blogroll links.
2. A pile of search boxes there looks cluttered; keep just one and move the others to a second-level page.
3. Turn the contact information and the CC license into a single line or image in the right column, with the details on a second-level page; I doubt many people read that text closely.
4. If possible, localize the links in Lilina's header.

Some improvement plans:
1. Trim overlong summaries.
2. A grouping feature, so feeds can be output by group.

Changing the default display: Lilina shows articles from the last 7 days by default. To use a different time window, find

$TIMERANGE = ( $_REQUEST['hours'] ? $_REQUEST['hours']*3600 : 3600*24 ) ;

and change it.

RSS is a lightweight protocol that can pull together all of your resources: wiki, blog, and mail. Wherever you write from now on, anything with an RSS interface can be re-aggregated and republished in some way, which greatly improves the efficiency of personal knowledge management and of publishing and distribution.

I used to think of RSS as trivial: just another DTD, right? Only when I actually got into the parsers did I understand how important namespaces are. A good protocol should be like this: not that nothing could be added to it, but certainly nothing could be removed from it; and actually achieving that is very, very hard.

I will also try the Java parsers and extend them into the WebLucene project; see the Open Source RSS parser resources for Java.

I also found two packages for parsing RSS in Perl: XML::RSS::Parser::Lite and XML::RSS::Parser.

Sample code for XML::RSS::Parser::Lite:

#!/usr/bin/perl -w
# $Id$
# XML::RSS::Parser::Lite sample

use strict;
use XML::RSS::Parser::Lite;
use LWP::Simple;


my $xml = get("http://www.klogs.org/index.xml");
my $rp = new XML::RSS::Parser::Lite;
$rp->parse($xml);

# print blog header
print "<a href=\"".$rp->get('url')."\">" . $rp->get('title') . " - " . $rp->get('description') . "</a>\n";

# convert item to <li>
print "<ul>";
for (my $i = 0; $i < $rp->count(); $i++) {
my $it = $rp->get($i);
print "<li><a href=\"" . $it->get('url') . "\">" . $it->get('title') . "</a></li>\n";
}
print "</ul>";

Installation:
Requires SOAP-Lite.

Pros:
Very simple, and it can fetch remote feeds.

Cons:
It supports only the title, url, and description fields; there is no date field.

I plan to use it for a simple fetch-and-sync RSS service, so that everyone can republish the feeds they subscribe to.


Sample code for XML::RSS::Parser:
#!/usr/bin/perl -w
# $Id$
# XML::RSS::Parser sample with Iconv charset convert

use strict;
use XML::RSS::Parser;
use Text::Iconv;
my $converter = Text::Iconv->new("utf-8", "gbk");


my $p = new XML::RSS::Parser;
my $feed = $p->parsefile('index.xml');

# output some values
my $title = XML::RSS::Parser->ns_qualify('title',$feed->rss_namespace_uri);
# may cause error this line: print $feed->channel->children($title)->value."\n";
print "item count: ".$feed->item_count()."\n\n";
foreach my $i ( $feed->items ) {
    map { print $_->name.": ".$converter->convert($_->value)."\n" } $i->children;
    print "\n";
}

Pros:
It exposes the data field by field, giving a lower-level interface.

Cons:
It cannot parse a remote RSS feed directly; you must download the file first and then parse it.

2004-12-14:
Via a trackback on cnblog I learned about the Planet RSS aggregator.

Installing Planet: after unpacking, run `python planet.py examples/config.ini` in the directory, and the output directory will contain index.html built from the default sample feeds, along with opml.xml and rss.xml outputs (a nice touch).

I tried it with a few feeds: the UTF-8 ones were fine, but the GBK ones all came out garbled. The only charset handling in planetlib.py is the code below, so everything that is not UTF-8 gets treated as iso8859_1:
try:
    data = unicode(data, "utf8").encode("utf8")
    logging.debug("Encoding: UTF-8")
except UnicodeError:
    try:
        data = unicode(data, "iso8859_1").encode("utf8")
        logging.debug("Encoding: ISO-8859-1")
    except UnicodeError:
        data = unicode(data, "ascii", "replace").encode("utf8")
        logging.warn("Feed wasn't in UTF-8 or ISO-8859-1, replaced " +
                     "all non-ASCII characters.")
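For comparison, the same try-UTF-8-then-fall-back logic can be sketched in Perl with the core Encode module (illustrative only; Planet itself is Python):

```perl
#!/usr/bin/perl
# Try strict UTF-8 decoding first; fall back to ISO-8859-1 on failure,
# mirroring the fallback chain in planetlib.py.
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK);

sub to_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;   # decode with FB_CROAK may modify its input
    my $text = eval { decode('UTF-8', $copy, FB_CROAK) };
    $text = decode('ISO-8859-1', $bytes) unless defined $text;
    return encode('UTF-8', $text);
}

print to_utf8("caf\xc3\xa9"), "\n";   # already valid UTF-8, passes through
print to_utf8("caf\xe9"),     "\n";   # Latin-1 bytes, re-encoded to UTF-8
```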

I plan to study Python's unicode handling soon; it feels like a very clean language, with good try...catch machinery and logging.

On concerns about MagpieRSS performance:
The main difference between Planet and MagpieRSS is the caching model; on using caches to speed up Web services, see: cacheable CMS design.

As you can see, Lilina's cache walks the RSS files in the cache directory on every request, and if a cached file has expired it must also make a live request to the RSS source. So it cannot support many back-end subscriptions or heavy front-end concurrency; it generates too much I/O.

Planet is a back-end script: the feeds it subscribes to are periodically aggregated into static output files.

In fact, if you put a wget script in front of MagpieRSS that periodically dumps the output of index.php to index.html, and have visitors hit the index.html cache first, isn't that the same as Planet regenerating a static index.html every hour?

On virtual hosts that do not let you run your own server-side scripts, though, Planet simply cannot run.

For more on parsing GBK XML in PHP, see:
An analysis of UTF-8 and GBK RSS parsing in MagpieRSS

2004-12-19
As Isaac Mao put it at the SocialBrain 2005 workshop, "Blog is a 'Window', also could be a 'Bridge'": a blog is a window from a person or organization to the outside world, and RSS makes it easy to combine these windows into the bridges between them. With such an intermediate publishing layer, a blog moves beyond single-point publishing to P2P self-propagation, and the importance of RSS to online distribution becomes ever clearer.

Posted by chedong at December 11, 2004 12:34 AM Edit
Last Modified at December 19, 2004 04:40 PM


Comments

How do I change the default display from 7 days of news? Thanks.

Posted by: honren at December 12, 2004 10:20 PM

I have been using lilina for a while now:
http://news.yanfeng.org
with some slight UI changes. It would be great if you could improve it.

Posted by: mulberry at December 13, 2004 09:24 AM

Che Dong, haven't you noticed how slow your home page has become since you started using lilina? Let it go, or at least don't make it your front page; lilina's technology just isn't mature yet.

Posted by: kalen at December 16, 2004 10:33 AM

You might consider drupal.

Posted by: shunz at December 28, 2004 06:46 PM

You can try mine: http://blog.terac.com

It periodically fetches the blogs, takes the latest entries from each, sorts and aggregates them, generates static XML, and formats the display with XSL.

Posted by: andy at January 6, 2005 12:53 PM

Che Dong, this is not a good idea. The RSS is already out on the Web; aggregating it onto your page not only hurts the quality of your own pages but also confuses the search engines, producing exactly the "portal sites hurting authors' enthusiasm" effect you have denounced. Better not to aggregate!
pyguru 2005-02-17 03:00 Post a comment
]]>