Python libraries for web scraping
Published: 2019-06-21

philips
2011.10.07 8:21
  • Python is a simple yet powerful programming language. FMiner is developed in Python and uses PySide for its core scraping features. Besides PySide, Python has many libraries for web scraping (screen scraping); this article lists the common ones for web data extraction.
    Web scraping framework
    Scrapy:
    Scrapy is a fast, high-level web crawling and web scraping (screen scraping) framework used to crawl websites and to parse and extract structured data from their pages. It can be used for a wide range of purposes, such as data mining, automated testing, and site monitoring.
    Page downloading libraries
    urllib:
    urllib2:
    Both are part of Python's standard library and handle the general work of downloading web pages.
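    A minimal sketch of downloading a page with the standard library. It assumes Python 3, where the urllib2 functionality lives in urllib.request; the helper name fetch and the User-Agent string are illustrative choices, not something the original article specifies.

```python
from urllib.request import Request, urlopen


def fetch(url, user_agent="example-scraper/0.1", timeout=10):
    """Download a URL and return its body decoded to text.

    `fetch` and the default User-Agent are hypothetical names for this sketch.
    """
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=timeout) as response:
        # Use the charset declared in the response headers, else assume UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

    A call such as fetch("https://example.com/") returns the page source as a string, which can then be passed to one of the parsers described later.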
    PycURL:
    PycURL is a Python interface to libcurl. It can fetch objects identified by a URL from a Python program, much like the urllib module. Because its core, libcurl, is written in C, it is very fast and supports a large number of features.
    mechanize:
    mechanize provides stateful programmatic web browsing in Python. It can simulate a web browser, but it does not use a real browser engine and cannot execute JavaScript.
    twill:
    Twill is a simple language that lets users browse the web from a command-line interface. With twill you can navigate websites that use forms, cookies, and most standard web features. Twill supports automated web testing and has a simple Python interface.
    Page parser
    BeautifulSoup:
    Beautiful Soup is a Python HTML/XML parser designed for quick-turnaround projects such as screen scraping. It is very easy to use for small Python web scraping projects. Its selection API works much like jQuery.
    lxml:
    The lxml XML toolkit is a Pythonic binding for the C libraries libxml2 and libxslt. lxml.html can parse an HTML page into a DOM tree and select nodes from it using XPath. Early versions of FMiner used it as a core module, but it was replaced with PySide in order to handle pages that contain JavaScript.
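    The parse-then-select workflow can be sketched with the standard library's xml.etree.ElementTree as a stand-in. This is not lxml: ElementTree supports only a limited XPath subset and requires well-formed markup, whereas lxml.html is built to cope with real-world HTML. The sample page and the function name select_item_links are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page; real HTML is often not valid XML.
PAGE = """
<html>
  <body>
    <div class="item"><a href="/p/1">First post</a></div>
    <div class="item"><a href="/p/2">Second post</a></div>
    <div class="ad"><a href="/ad">Sponsored</a></div>
  </body>
</html>
"""


def select_item_links(markup):
    """Parse markup into a tree and return (href, text) for links in item divs."""
    root = ET.fromstring(markup)
    # XPath-style query: any <div class="item"> at any depth, then its <a> children.
    anchors = root.findall(".//div[@class='item']/a")
    return [(a.get("href"), a.text) for a in anchors]


links = select_item_links(PAGE)
```

    Here links holds the two item links and skips the advertisement, because the path expression filters on the class attribute.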
    re:
    re is the regular expression module in Python's standard library. You can use regular expressions to extract page contents, but writing them can be very complex.
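    As a rough illustration of regex-based extraction, the sketch below pulls link targets and link text out of a small HTML fragment. The pattern and the name extract_links are assumptions made for this example; the pattern only handles double-quoted href attributes and non-nested anchors, which hints at why this approach becomes complex on real pages.

```python
import re

# Matches <a ... href="..."> ... </a>; deliberately simple and easy to break.
LINK_RE = re.compile(
    r'<a\s+[^>]*?href="([^"]*)"[^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)


def extract_links(html):
    """Return a list of (href, text) pairs found in the HTML string."""
    return [(m.group(1), m.group(2).strip()) for m in LINK_RE.finditer(html)]


sample = '<p><a href="/a">First</a> <a class="x" href="/b">Second</a></p>'
pairs = extract_links(sample)
```

    For anything beyond small, predictable fragments, a real parser such as the ones listed above is usually the safer choice.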
    Browser core
    PyQt:
    PyQt is a set of Python bindings for Nokia's Qt application framework. It has been developed for a long time and is very mature. It includes the WebKit package, which can browse web pages and perform web extraction. It is available under the GNU GPL (v2 and v3) and a commercial license.
    PySide:
    The PySide project provides LGPL-licensed Python bindings for the Qt cross-platform application and UI framework. It also includes the WebKit package, and its LGPL licensing is why FMiner chose it.
    Pamie:
    Pamie stands for Python Automated Module For I.E.
    Its main use is testing websites by automating the Internet Explorer client through the Pamie scripting language. It uses the IE COM interface as its core and is aimed mainly at web testing; to use it for screen scraping, you need to do extra work to extract the page's content, and some JavaScript code is needed.

Reposted from: http://zunna.baihongyu.com/