Basic Information
Resource name: Python网络数据采集 2nd Edition.pdf
File size: 6.58 MB
File format: .pdf
Programming language: Python
Last updated: 2019-09-01
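Description
Python网络数据采集 (Web Scraping with Python) uses the concise, powerful Python language to introduce web scraping, and gives comprehensive guidance on collecting the many types of data found on the modern web. Part I covers the fundamentals of web scraping: how to use Python to request information from a web server, how to do basic processing of the server's response, and how to interact with websites by automated means. Part II covers how to use web crawlers to test websites, how to automate processes, and how to access the web in more ways.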
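As a rough illustration of the Part I workflow described above (request a page from a web server, then do basic processing of the response), here is a minimal sketch that uses only the Python standard library. It is not taken from the book. The names `fetch_html`, `parse_page`, and `TitleAndLinkParser`, the `example-scraper/0.1` user agent string, and the sample HTML are all hypothetical and exist only for this example.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class TitleAndLinkParser(HTMLParser):
    """Collects the <title> text and every href found on <a> tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:  # skip anchors that have no href attribute
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def fetch_html(url, timeout=10):
    """Request a page from a web server and return its body as text.

    Hypothetical helper: the User-Agent value is an assumption.
    """
    request = Request(url, headers={"User-Agent": "example-scraper/0.1"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def parse_page(html):
    """Basic processing of a response body: return (title, list of links)."""
    parser = TitleAndLinkParser()
    parser.feed(html)
    return parser.title.strip(), parser.links


# Parse a fixed sample so the sketch runs without network access.
SAMPLE_HTML = """<html><head><title> Example Page </title></head>
<body><a href="/about">About</a> <a href="https://example.com/">Home</a>
<a>no href</a></body></html>"""

title, links = parse_page(SAMPLE_HTML)
print(title)
print(links)
```

To try it against a live site, pass the result of `fetch_html(url)` to `parse_page`. The sample string is parsed here only so that the sketch runs offline.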
Table of Contents

  Structuring Crawlers
  Crawling Sites Through Search
  Crawling Sites Through Links
  Crawling Multiple Page Types
  Thinking About Web Crawler Models

5. Scrapy
  Installing Scrapy
  Initializing a New Spider
  Writing a Simple Scraper
  Spidering with Rules
  Creating Items
  Outputting Items
  The Item Pipeline
  Logging with Scrapy
  More Resources

6. Storing Data
  Media Files
  Storing Data to CSV
  MySQL
  Installing MySQL
  Some Basic Commands
  Integrating with Python
  Database Techniques and Good Practice
  “Six Degrees” in MySQL
  Email

Part II. Advanced Scraping

7. Reading Documents
  Document Encoding
  Text
  Text Encoding and the Global Internet
  CSV
  Reading CSV Files
  PDF
  Microsoft Word and .docx

8. Cleaning Your Dirty Data
  Cleaning in Code
  Data Normalization
  Cleaning After the Fact
  OpenRefine

9. Reading and Writing Natural Languages
  Summarizing Data
  Markov Models
  Six Degrees of Wikipedia: Conclusion
  Natural Language Toolkit
  Installation and Setup
  Statistical Analysis with NLTK
  Lexicographical Analysis with NLTK
  Additional Resources

10. Crawling Through Forms and Logins
  Python Requests Library
  Submitting a Basic Form
  Radio Buttons, Checkboxes, and Other Inputs
  Submitting Files and Images
  Handling Logins and Cookies
  HTTP Basic Access Authentication
  Other Form Problems

11. Scraping JavaScript
  A Brief Introduction to JavaScript
  Common JavaScript Libraries
  Ajax and Dynamic HTML
  Executing JavaScript in Python with Selenium
  Additional Selenium Webdrivers
  Handling Redirects
  A Final Note on JavaScript

12. Crawling Through APIs
  A Brief Introduction to APIs
  HTTP Methods and APIs
  More About API Responses
  Parsing JSON
  Undocumented APIs
  Finding Undocumented APIs
  Documenting Undocumented APIs
  Finding and Documenting APIs Automatically
  Combining APIs with Other Data Sources
  More About APIs

13. Image Processing and Text Recognition
  Overview of Libraries
  Pillow
  Tesseract
  NumPy
  Processing Well-Formatted Text
  Adjusting Images Automatically
  Scraping Text from Images on Websites
  Reading CAPTCHAs and Training Tesseract
  Training Tesseract
  Retrieving CAPTCHAs and Submitting Solutions

14. Avoiding Scraping Traps
  A Note on Ethics
  Looking Like a Human
  Adjust Your Headers
  Handling Cookies with JavaScript
  Timing Is Everything
  Common Form Security Features
  Hidden Input Field Values
  Avoiding Honeypots
  The Human Checklist

15. Testing Your Website with Scrapers
  An Introduction to Testing
  What Are Unit Tests?
  Python unittest
  Testing Wikipedia
  Testing with Selenium
  Interacting with the Site
  unittest or Selenium?

16. Web Crawling in Parallel
  Processes versus Threads
  Multithreaded Crawling
  Race Conditions and Queues
  The threading Module
  Multiprocess Crawling
  Multiprocess Crawling
  Communicating Between Processes
  Multiprocess Crawling: Another Approach

17. Scraping Remotely
  Why Use Remote Servers?
  Avoiding IP Address Blocking
  Portability and Extensibility
  Tor
  PySocks
  Remote Hosting
  Running from a Website-Hosting Account
  Running from the Cloud
  Additional Resources

18. The Legalities and Ethics of Web Scraping
  Trademarks, Copyrights, Patents, Oh My!
  Copyright Law
  Trespass to Chattels
  The Computer Fraud and Abuse Act
  robots.txt and Terms of Service
  Three Web Scrapers
  eBay versus Bidder’s Edge and Trespass to Chattels
  United States v. Auernheimer and The Computer Fraud and Abuse Act
  Field v. Google: Copyright and robots.txt
  Moving Forward

Index