Python网络数据采集 2nd Edition.pdf

基本信息

源码名称：Python网络数据采集 2nd Edition.pdf

源码大小：6.58M

文件格式：.pdf

开发语言：Python

更新时间：2019-09-01

友情提示：（无需注册或充值，赞助后即可获取资源下载链接）

嘿，亲！知识可是无价之宝呢，但咱这精心整理的资料也耗费了不少心血呀。小小地破费一下，绝对物超所值哦！如有下载和支付问题，请联系我们QQ(微信同号)：813200300

本次赞助数额为： 2 元　

源码介绍

Python网络数据采集采用简洁强大的Python语言，介绍了网络数据采集，并为采集新式网络中的各种数据类型提供了全面的指导。第1部分重点介绍网络数据采集的基本原理：如何用Python从网络服务器请求信息，如何对服务器的响应进行基本处理，以及如何以自动化手段与网站进行交互。第二部分介绍如何用网络爬虫测试网站，自动化处理，以及如何通过更多的方式接入网络。

Structuring Crawlers 58
Crawling Sites Through Search 58
Crawling Sites Through Links 61
Crawling Multiple Page Types 64
Thinking About Web Crawler Models 65
5. Scrapy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Installing Scrapy 67
Initializing a New Spider 68
Writing a Simple Scraper 69
Spidering with Rules 70
Creating Items 74
Outputting Items 76
The Item Pipeline 77
Logging with Scrapy 80
More Resources 80
6. Storing Data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Media Files 83
Storing Data to CSV 86
MySQL 88
Installing MySQL 89
Some Basic Commands 91
Integrating with Python 94
Database Techniques and Good Practice 97
“Six Degrees” in MySQL 100
Email 103
Part II. Advanced Scraping
7. Reading Documents. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
Document Encoding 107
Text 108
Text Encoding and the Global Internet 109
CSV 113
Reading CSV Files 113
PDF 115
Microsoft Word and .docx 117
8. Cleaning Your Dirty Data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
Cleaning in Code 121
iv | Table of Contents
Data Normalization 124
Cleaning After the Fact 126
OpenRefine 126
9. Reading and Writing Natural Languages. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
Summarizing Data 132
Markov Models 135
Six Degrees of Wikipedia: Conclusion 139
Natural Language Toolkit 142
Installation and Setup 142
Statistical Analysis with NLTK 143
Lexicographical Analysis with NLTK 145
Additional Resources 149
10. Crawling Through Forms and Logins. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
Python Requests Library 151
Submitting a Basic Form 152
Radio Buttons, Checkboxes, and Other Inputs 154
Submitting Files and Images 155
Handling Logins and Cookies 156
HTTP Basic Access Authentication 157
Other Form Problems 158
11. Scraping JavaScript. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
A Brief Introduction to JavaScript 162
Common JavaScript Libraries 163
Ajax and Dynamic HTML 165
Executing JavaScript in Python with Selenium 166
Additional Selenium Webdrivers 171
Handling Redirects 171
A Final Note on JavaScript 173
12. Crawling Through APIs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
A Brief Introduction to APIs 175
HTTP Methods and APIs 177
More About API Responses 178
Parsing JSON 179
Undocumented APIs 181
Finding Undocumented APIs 182
Documenting Undocumented APIs 184
Finding and Documenting APIs Automatically 184
Combining APIs with Other Data Sources 187
Table of Contents | v
More About APIs 190
13. Image Processing and Text Recognition. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
Overview of Libraries 194
Pillow 194
Tesseract 195
NumPy 197
Processing Well-Formatted Text 197
Adjusting Images Automatically 200
Scraping Text from Images on Websites 203
Reading CAPTCHAs and Training Tesseract 206
Training Tesseract 207
Retrieving CAPTCHAs and Submitting Solutions 211
14. Avoiding Scraping Traps. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
A Note on Ethics 215
Looking Like a Human 216
Adjust Your Headers 217
Handling Cookies with JavaScript 218
Timing Is Everything 220
Common Form Security Features 221
Hidden Input Field Values 221
Avoiding Honeypots 223
The Human Checklist 224
15. Testing Your Website with Scrapers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
An Introduction to Testing 227
What Are Unit Tests? 228
Python unittest 228
Testing Wikipedia 230
Testing with Selenium 233
Interacting with the Site 233
unittest or Selenium? 236
16. Web Crawling in Parallel. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
Processes versus Threads 239
Multithreaded Crawling 240
Race Conditions and Queues 242
The threading Module 245
Multiprocess Crawling 247
Multiprocess Crawling 249
Communicating Between Processes 251
vi | Table of Contents
Multiprocess Crawling—Another Approach 253
17. Scraping Remotely. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
Why Use Remote Servers? 255
Avoiding IP Address Blocking 256
Portability and Extensibility 257
Tor 257
PySocks 259
Remote Hosting 259
Running from a Website-Hosting Account 260
Running from the Cloud 261
Additional Resources 262
18. The Legalities and Ethics of Web Scraping. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
Trademarks, Copyrights, Patents, Oh My! 263
Copyright Law 264
Trespass to Chattels 266
The Computer Fraud and Abuse Act 268
robots.txt and Terms of Service 269
Three Web Scrapers 272
eBay versus Bidder’s Edge and Trespass to Chattels 272
United States v. Auernheimer and The Computer Fraud and Abuse Act 274
Field v. Google: Copyright and robots.txt 275
Moving Forward 276
Index. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279