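The program is fairly rough. It mainly covers the logic for processing LinkedIn page data and is offered as one possible approach; anyone who finds it useful can optimize it as needed.
LinkedIn is the world's largest professional networking site, a social network (SNS) aimed at business users.
I recently wrote a program that collects profile information, using Java and the Spring Boot framework. Because LinkedIn applies some anti-scraping measures, the returned data needs some processing. The main logic is as follows:

    package linkedin.service;
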
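Only the package declaration of the listing survives here, so the author's actual processing logic is not shown. As a rough illustration of the kind of post-processing mentioned above, the sketch below assumes (this is not confirmed by the post) that the page delivers its data as HTML-escaped JSON inside `<code>` elements. The class `EmbeddedJsonExtractor` and its methods are hypothetical names and are not part of the original project.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not the original linkedin.service code.
final class EmbeddedJsonExtractor {

    // Assumption: each payload sits inside a <code>...</code> element.
    private static final Pattern CODE_BLOCK =
            Pattern.compile("<code[^>]*>(.*?)</code>", Pattern.DOTALL);

    // Returns the unescaped, trimmed text content of every <code> element.
    static List<String> extract(String html) {
        List<String> payloads = new ArrayList<>();
        Matcher m = CODE_BLOCK.matcher(html);
        while (m.find()) {
            payloads.add(unescape(m.group(1).trim()));
        }
        return payloads;
    }

    // Reverses a handful of common HTML entities.
    // "&amp;" is handled last so "&amp;quot;" becomes "&quot;", not a quote.
    static String unescape(String s) {
        return s.replace("&quot;", "\"")
                .replace("&#39;", "'")
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        String html = "<html><code id=\"x\">{&quot;firstName&quot;:&quot;Roger&quot;}</code></html>";
        for (String json : extract(html)) {
            System.out.println(json);
        }
    }
}
```

Each extracted string could then be handed to a JSON parser to build something like the `LinkedInModel` shown in the result further down. A real page may need a proper HTML parser rather than a regular expression.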
Build

    mvn install -Dmaven.test.skip=true

Find the resulting jar in the project's target directory.
Run

    java -jar linkedin-1.0.jar param1 param2 param3

Note:
- param1: username
- param2: password
- param3: crawl entry point
All three parameters are required.
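Since all three arguments are mandatory, it can help to fail fast with a usage message when one is missing. The following is a minimal sketch of such a check; `ScraperArgs` and its field names are hypothetical and do not come from the original project.

```java
// Hypothetical argument holder; the original program's own handling is not shown in the post.
final class ScraperArgs {
    final String username;
    final String password;
    final String crawlPoint;

    private ScraperArgs(String username, String password, String crawlPoint) {
        this.username = username;
        this.password = password;
        this.crawlPoint = crawlPoint;
    }

    // Validates that the three required parameters are present and non-blank.
    static ScraperArgs parse(String[] args) {
        if (args == null || args.length < 3) {
            throw new IllegalArgumentException(
                    "Usage: java -jar linkedin-1.0.jar <username> <password> <crawl point>");
        }
        for (int i = 0; i < 3; i++) {
            if (args[i] == null || args[i].trim().isEmpty()) {
                throw new IllegalArgumentException("param" + (i + 1) + " must not be empty");
            }
        }
        return new ScraperArgs(args[0], args[1], args[2]);
    }

    public static void main(String[] args) {
        ScraperArgs parsed = parse(args);
        System.out.println("Starting crawl at " + parsed.crawlPoint + " as " + parsed.username);
    }
}
```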
Result

    抓取用户‘LinkedInModel{id=0, firstName='Roger', lastName='Gu', industryName='Mechanical or Industrial Engineering', headline='Greater China Finance Controller at Perkinelmer Instrument Shanghai Co., Ltd.', address='Shanghai City, China', educations='[{"schoolName":"Shanghai University of Finance and Economics"}]', experiences='[{"companyName":"KPMG Audit","title":"Assistant Manager","rangeDate":"1998-7至2003-11"},{"locationName":"Shanghai City, China","companyName":"Perkinelmer","description":"China finance head, overall finance responsibility","title":"Greater China Finance Controller","region":"urn:li:fs_region:(cn,8909)","rangeDate":"2011-1"},{"locationName":"Shanghai City, China","companyName":"Moog","description":"China finance head. Overall finance responsibility.","title":"China Finance Controller","region":"urn:li:fs_region:(cn,8909)","rangeDate":"2007-12至2010-7"},{"companyName":"Eaton","title":"Finance manager","rangeDate":"2004-2至2007-12"}]', skills='["Financial Analysis","US GAAP","SOX","internal control","Hyperion Enterprise","China region","Fortune 500","Accounts Receivable","costing","IFRS","Auditing","Financial Analysis","Manufacturing","contract review","US GAAP","Internal Controls","forecast","compliance","Sarbanes-Oxley Act","Cash Flow","SOX","internal control","Pricing","Budgets","legal","Treasury","Financial Integration","Due Diligence","cash flow management","Tax","corporate finance","Target Costing","Inventory Control","Contract Management","Forecasting","ERP","credit control","Financial Reporting","budgeting","working capital management"]', uniqueUrl='roger-gu-698a036', insertTime='null'}’数据

The output is shown verbatim. The surrounding log text 抓取用户‘…’数据 means "scraped data for user '…'", and 至 in rangeDate means "to".
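Afterword
Scraping is slow for three reasons: the servers are located overseas, the pages need extra processing to get around the anti-scraping measures, and the program runs on a single thread...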
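Of the three causes, the single-threaded design is the one the program itself can change. One possible direction, sketched below under assumptions, is to fetch several profiles concurrently on a fixed-size thread pool. `ParallelScraper`, `fetchAll`, and the `fetcher` function are hypothetical stand-ins for whatever per-profile call the real code makes, and the pool size would need to stay small, since more parallel requests may well trip the anti-scraping measures sooner.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical sketch; the original project is single-threaded.
final class ParallelScraper {

    // Applies fetcher to every id on a fixed-size pool; results keep input order.
    static <T> List<T> fetchAll(List<String> ids, Function<String, T> fetcher, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<T>> futures = new ArrayList<>();
            for (String id : ids) {
                Callable<T> task = () -> fetcher.apply(id);
                futures.add(pool.submit(task));
            }
            List<T> results = new ArrayList<>();
            for (Future<T> future : futures) {
                results.add(future.get()); // blocks until that task has finished
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for results", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A fetch task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Stand-in fetcher: a real one would request and parse a profile page.
        List<String> ids = Arrays.asList("profile-a", "profile-b", "profile-c");
        List<String> pages = fetchAll(ids, id -> "fetched:" + id, 2);
        pages.forEach(System.out::println);
    }
}
```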
The program is hosted on GitHub.