
    Python, Java, Life, etc

    A blog of technology and life.

    Content syndication for the Web

    Level: Introductory


    Mike Olson (mike.olson@fourthought.com), Principal Consultant, Fourthought, Inc.
    Uche Ogbuji (uche.ogbuji@fourthought.com), Principal Consultant, Fourthought, Inc.

    13 Nov 2002

    RSS is one of the most successful XML services ever. Despite its chaotic roots, it has become the community standard for exchanging content information across Web sites. Python is an excellent tool for RSS processing, and Mike Olson and Uche Ogbuji introduce a couple of modules available for this purpose.

    RSS is an abbreviation with several expansions: "RDF Site Summary," "Really Simple Syndication," "Rich Site Summary," and perhaps others. Behind this confusion of names is an astonishing amount of politics for such a mundane technological area. RSS is a simple XML format for distributing summaries of content on Web sites. It can be used to share all sorts of information including, but not limited to, news flashes, Web site updates, event calendars, software updates, featured content collections, and items on Web-based auctions.

    RSS was created by Netscape in 1999 to allow content to be gathered from many sources into the Netcenter portal (which is now defunct). The UserLand community of Web enthusiasts became early supporters of RSS, and it soon became a very popular format. The popularity led to strains over how to improve RSS to make it even more broadly useful. This strain led to a fork in RSS development. One group chose an approach based on RDF, in order to take advantage of the great number of RDF tools and modules, and another chose a more stripped-down approach. The former is called RSS 1.0, and the latter RSS 0.91. Just last month the battle flared up again with a new version of the non-RDF variant of RSS, which its creators are calling "RSS 2.0."

    RSS 0.91 and 1.0 are very popular, and are used in numerous portals and Web logs. In fact, the blogging community is a great user of RSS, and RSS lies behind some of the most impressive networks of XML exchange in existence. These networks have grown organically, and are really the most successful networks of XML services anywhere. RSS is an XML service by virtue of being an exchange of XML information over an Internet protocol (the vast majority of RSS exchange is a simple HTTP GET of RSS documents). In this article, we introduce just a few of the many Python tools available for working with RSS. We don't provide a technical introduction to RSS, because you can find this in so many other articles (see Resources). We recommend that you first gain a basic familiarity with RSS, and that you understand XML. Understanding RDF is not required.

    [We consider RSS an 'XML service' rather than a 'Web service' due to the use of XML descriptions but the lack of use of WSDL. -- Editors]

    RSS.py
    Mark Nottingham's RSS.py is a Python library for RSS processing. It is very complete and well-written. It requires Python 2.2 and PyXML 0.7.1. Installation is easy; just download the Python file from Mark's home page and copy it to somewhere in your PYTHONPATH.

    Most users of RSS.py need only concern themselves with two classes it provides: CollectionChannel and TrackingChannel. The latter seems the more useful of the two. TrackingChannel is a data structure that contains all the RSS data indexed by the key of each item. CollectionChannel is a similar data structure, but organized more as RSS documents themselves are, with the top-level channel information pointing to the item details using hash values for the URLs. You will probably use the utility namespace declarations in the RSS.ns structure. Listing 1 is a simple script that downloads and parses an RSS feed for Python news, and prints out all the information from the various items in a simple listing.



    from RSS import ns, CollectionChannel, TrackingChannel

    #Create a tracking channel, which is a data structure that
    #Indexes RSS data by item URL
    tc = TrackingChannel()

    #Returns the RSSParser instance used, which can usually be ignored
    tc.parse("http://www.python.org/channews.rdf")

    RSS10_TITLE = (ns.rss10, 'title')
    RSS10_DESC = (ns.rss10, 'description')

    #You can also use tc.keys()
    items = tc.listItems()
    for item in items:
        #Each item is a (url, order_index) tuple
        url = item[0]
        print "RSS Item:", url
        #Get all the data for the item as a Python dictionary
        item_data = tc.getItem(item)
        print "Title:", item_data.get(RSS10_TITLE, "(none)")
        print "Description:", item_data.get(RSS10_DESC, "(none)")



    We start by creating a TrackingChannel instance, and then populate it with data parsed from the RSS feed at http://www.python.org/channews.rdf. RSS.py uses tuples as the property names for RSS data. This may seem an unusual approach to those not used to XML processing techniques, but it is actually a very useful way of being very precise about what was in the original RSS file. In effect, an RSS 0.91 title element is not considered to be equivalent to an RSS 1.0 one. There is enough data for the application to ignore this distinction, if it likes, by ignoring the namespace portion of each tuple; but the basic API is wedded to the syntax of the original RSS file, so that this information is not lost. In the code, we use this property data to gather all the items from the news feed for display. Notice that we are careful not to assume which properties any particular item might have. We retrieve properties using the safe form as seen in the code below.



    print "Title:", item_data.get(RSS10_TITLE, "(none)")

    This form supplies a default value when the property is not found. Compare it with the direct lookup in the following example, which raises a KeyError if the item lacks the property.



    print "Title:", item_data[RSS10_TITLE]
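    The difference between the two lookups, and the option of ignoring the namespace portion of each tuple, can be sketched without RSS.py at all. The block below is only an illustration, written for Python 3 rather than the Python 2 of the listings. The item_data dictionary is a hand-built stand-in for what tc.getItem() returns, its values are sample data, and get_by_local_name is a hypothetical helper that is not part of RSS.py.

```python
# Stand-in for the dictionary returned by tc.getItem(item). The keys mimic
# RSS.py's (namespace, local name) tuples; the values are sample data.
RSS10_NS = "http://purl.org/rss/1.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"

item_data = {
    (RSS10_NS, "title"): "Python 2.2.2b1",
    (RSS10_NS, "link"): "http://www.python.org/2.2.2/",
    (DC_NS, "creator"): "example author",
}


def get_by_local_name(data, local_name, default=None):
    """Hypothetical helper: return the first value whose key has the given
    local name, whatever namespace it was declared in."""
    for (namespace, name), value in data.items():
        if name == local_name:
            return value
    return default


# Safe form: a missing property yields the default.
safe_description = item_data.get((RSS10_NS, "description"), "(none)")

# Direct lookup: a missing property raises KeyError.
try:
    item_data[(RSS10_NS, "description")]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# Namespace-agnostic lookup: match on the local name only.
title_any_ns = get_by_local_name(item_data, "title", "(none)")

print("Description:", safe_description)
print("Direct lookup failed:", lookup_failed)
print("Title:", title_any_ns)
```

    Matching on the local name alone is convenient, but it discards exactly the distinction that the tuple keys preserve, so it suits display code better than code that must write the feed back out.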

    This precaution is necessary because you never know which elements an RSS feed will use. Listing 2 shows the output from Listing 1.



    $ python listing1.py
    RSS Item: http://www.python.org/2.2.2/
    Title: Python 2.2.2b1
    Description: (none)
    RSS Item: http://sf.net/projects/spambayes/
    Title: spambayes project
    Description: (none)
    RSS Item: http://www.mems-exchange.org/software/scgi/
    Title: scgi 0.5
    Description: (none)
    RSS Item: http://roundup.sourceforge.net/
    Title: Roundup 0.4.4
    Description: (none)
    RSS Item: http://www.pygame.org/
    Title: Pygame 1.5.3
    Description: (none)
    RSS Item: http://www.cosc.canterbury.ac.nz/~greg/python/Pyrex/
    Title: Pyrex 0.4.4.1
    Description: (none)
    RSS Item: http://www.tundraware.com/Software/hb/
    Title: hb 1.88
    Description: (none)
    RSS Item: http://www.tundraware.com/Software/abck/
    Title: abck 2.2
    Description: (none)
    RSS Item: http://www.terra.es/personal7/inigoserna/lfm/
    Title: lfm 0.9
    Description: (none)
    RSS Item: http://www.tundraware.com/Software/waccess/
    Title: waccess 2.0
    Description: (none)
    RSS Item: http://www.krause-software.de/jinsitu/
    Title: JinSitu 0.3
    Description: (none)
    RSS Item: http://www.alobbs.com/pykyra/
    Title: PyKyra 0.1.0
    Description: (none)
    RSS Item: http://www.havenrock.com/developer/treewidgets/index.html
    Title: TreeWidgets 1.0a1
    Description: (none)
    RSS Item: http://civil.sf.net/
    Title: Civil 0.80
    Description: (none)
    RSS Item: http://www.stackless.com/
    Title: Stackless Python Beta
    Description: (none)

    Of course, you should expect somewhat different output, because the news items will have changed by the time you try it. The RSS.py channel objects also provide methods for adding and modifying RSS information. You can write the result back to RSS 1.0 format using the output() method. Try this out by writing back out the information parsed in Listing 1. Kick off the script in interactive mode by running python -i listing1.py. At the resulting Python prompt, run the following example.



    >>> result = tc.output(items)
    >>> print result

    The result is an RSS 1.0 document, printed out. This requires RSS.py version 0.42 or more recent; earlier versions have a bug in the output() method.

    rssparser.py
    Mark Pilgrim offers another module for RSS file parsing. It doesn't provide all the features and options that RSS.py does, but it does offer a very liberal parser, which deals well with all the confusing diversity in the world of RSS. To quote from the rssparser.py page:

    You see, most RSS feeds suck. Invalid characters, unescaped ampersands (Blogger feeds), invalid entities (Radio feeds), unescaped and invalid HTML (The Register's feed most days). Or just a bastardized mix of RSS 0.9x elements with RSS 1.0 elements (Movable Type feeds).
    Then there are feeds, like Aaron's feed, which are too bleeding edge. He puts an excerpt in the description element but puts the full text in the content:encoded element (as CDATA). This is valid RSS 1.0, but nobody actually uses it (except Aaron), few news aggregators support it, and many parsers choke on it. Other parsers are confused by the new elements (guid) in RSS 0.94 (see Dave Winer's feed for an example). And then there's Jon Udell's feed, with the fullitem element that he just sort of made up.

    It's funny to consider this in the light of the fact that XML and Web services are supposed to increase interoperability. Anyway, rssparser.py is designed to deal with all the madness.

    Installing rssparser.py is also very easy. Download the Python file (see Resources), rename it from "rssparser.py.txt" to "rssparser.py", and copy it to a directory in your PYTHONPATH. We also suggest getting the optional timeoutsocket module, which improves the timeout behavior of socket operations in Python, and thus makes a failed RSS fetch less likely to stall the application thread.
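    On Python 2.3 and later, the standard library can cover a similar need without an extra module. The sketch below uses socket.setdefaulttimeout(), which is a standard-library call and not part of rssparser.py or timeoutsocket. The 10-second value is an arbitrary choice for illustration.

```python
import socket

# Apply a default timeout (in seconds) to every socket created from now on,
# including those opened by the standard URL-fetching modules.
# The value 10.0 is an arbitrary example, not a recommendation.
socket.setdefaulttimeout(10.0)

print(socket.getdefaulttimeout())
```

    Because the setting is process-wide, it is best made once at startup, before any feed is fetched.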

    Listing 3 is a script that is the equivalent of Listing 1, but using rssparser.py, rather than RSS.py.



    import rssparser
    #Parse the data, returns a tuple: (data for channels, data for items)
    channel, items = rssparser.parse("http://www.python.org/channews.rdf")

    for item in items:
        #Each item is a dictionary mapping properties to values
        print "RSS Item:", item.get('link', "(none)")
        print "Title:", item.get('title', "(none)")
        print "Description:", item.get('description', "(none)")



    As you can see, the code is much simpler. The trade-off between RSS.py and rssparser.py is largely that the former has more features, and maintains more syntactic information from the RSS feed. The latter is simpler, and a more forgiving parser (the RSS.py parser only accepts well-formed XML).

    The output should be the same as in Listing 2.
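    To see why a forgiving parser matters, here is a rough sketch that uses neither module. It relies on the standard library's xml.etree.ElementTree, which is a strict XML parser, and is written for Python 3 rather than the Python 2 of the listings. The feed is a pared-down, made-up RSS 1.0 document, and parse_feed and children_as_dict are hypothetical helpers that merely return data in the same (channel, items) shape as Listing 3.

```python
import xml.etree.ElementTree as ET

RSS10_NS = "http://purl.org/rss/1.0/"

# A pared-down, made-up RSS 1.0 document used only for this illustration.
SAMPLE_FEED = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/">
  <channel rdf:about="http://example.org/news.rdf">
    <title>Example news</title>
    <link>http://example.org/</link>
  </channel>
  <item rdf:about="http://example.org/a">
    <title>First item</title>
    <link>http://example.org/a</link>
  </item>
  <item rdf:about="http://example.org/b">
    <title>Tips &amp; tricks</title>
    <link>http://example.org/b</link>
    <description>Second item</description>
  </item>
</rdf:RDF>
"""


def children_as_dict(element):
    """Map each child's local name (namespace stripped) to its text."""
    result = {}
    for child in element:
        local_name = child.tag.split("}", 1)[-1]
        result[local_name] = (child.text or "").strip()
    return result


def parse_feed(text):
    """Hypothetical helper: return (channel dict, list of item dicts)."""
    root = ET.fromstring(text)
    channel = children_as_dict(root.find("{%s}channel" % RSS10_NS))
    items = [children_as_dict(item)
             for item in root.findall("{%s}item" % RSS10_NS)]
    return channel, items


channel, items = parse_feed(SAMPLE_FEED)
for item in items:
    print("RSS Item:", item.get("link", "(none)"))
    print("Title:", item.get("title", "(none)"))
    print("Description:", item.get("description", "(none)"))

# An unescaped ampersand, a common flaw in real feeds, makes the document
# ill-formed, and the strict parser refuses it outright.
broken_feed = SAMPLE_FEED.replace("&amp;", "&")
try:
    parse_feed(broken_feed)
    strict_parser_rejected = False
except ET.ParseError:
    strict_parser_rejected = True

print("Strict parser rejected the broken feed:", strict_parser_rejected)
```

    A single stray character is enough to lose the whole feed with a strict parser, which is the situation rssparser.py is designed to tolerate.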

    Conclusion
    There are many Python tools for RSS, and we don't have space to cover them all. Aaron Swartz's page of RSS tools is a good place to start looking if you want to explore other modules out there. RSS is easy to work with in Python, because of all the great modules available for it. The modules hide all the chaos brought about by the history and popularity of RSS. If your XML service needs mostly involve the exchange of descriptive information for Web sites, we highly recommend using the most successful XML service technology in use.

    Next month, we will explain how to use e-mail packages for Python for writing Web services over SMTP.

    Resources

    About the authors
    Mike Olson is a consultant and co-founder of Fourthought Inc., a software vendor and consultancy specializing in XML solutions for enterprise knowledge management applications. Fourthought develops 4Suite, an open source platform for XML middleware. You can contact Mr. Olson at mike.olson@fourthought.com.


    Uche Ogbuji is a consultant and co-founder of Fourthought Inc., a software vendor and consultancy specializing in XML solutions for enterprise knowledge management applications. Fourthought develops 4Suite, an open source platform for XML middleware. Mr. Ogbuji is a Computer Engineer and writer born in Nigeria, living and working in Boulder, Colorado, USA. You can contact Mr. Ogbuji at uche.ogbuji@fourthought.com.

    posted on 2005-02-17 02:48 by pyguru. Category: Build Website