Topic 1: Spidering / Scraping Browser data

Evaristo Caraballo
1073 days ago

I am very new to programming and would like to evaluate implementing a web project in the future.

Currently the project would be based largely on Python (the language I am really focusing on). Some MySQL could be involved.

One of the components of the project involves spidering / scraping data from browsers: pages and blogs. I have found that this is not easy. Browser companies are against this type of activity, even though regulation of the topic is soft.

Does anybody know reliable ways of spidering without using APIs (which, as many on the web suggest, give very poor results) while staying within whatever is considered "legal"? I would like to avoid having the browser companies block my searches (apparently they have already blocked some, since I cannot get the full HTML of a search in Google).

I have the impression that Python is not the best tool for this (I am a fan, though). Is it that high-level languages are not appropriate for designing engines of this type?

In the end, what can I expect? I have the impression that any software option would require constant maintenance, because the companies keep making changes to block automated spidering/scraping implementations.

Evaristo

Wouter Tebbens
1067 days ago

Dear Evaristo,

Welcome to the FTA, and thanks for posting your question here!

I am not the most specialised person to help you here (and if you would like a large amount of feedback, you would be better off at Ask Slashdot).

But let's look at Python: the Python.org site has quite a few references to Python study books, for example the BeginnersGuide, which is available on the site, and a list of introductory books. Have you already tried one of these? And if so, what do you think of it? More advanced books, on specific applications, can be found here.

I saw this Python framework for scraping, http://scrapy.org/. Is it in any way helpful to you?
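
To give an idea of what spidering involves at its simplest, here is a minimal sketch that uses only the Python standard library, not Scrapy itself. It shows two basic steps: collecting the links on a page, and keeping only those that a site's robots.txt rules permit. The names `LinkCollector`, `allowed_links`, `example-bot`, and the sample page and rules are all made up for illustration; this is one possible approach under those assumptions, not a complete or recommended crawler.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser


class LinkCollector(HTMLParser):
    """Collect absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def allowed_links(html, base_url, robots_lines, user_agent="example-bot"):
    """Return the links found in html that the robots.txt rules permit."""
    rules = RobotFileParser()
    rules.parse(robots_lines)  # rules are passed in as lines; no network access
    collector = LinkCollector(base_url)
    collector.feed(html)
    return [url for url in collector.links if rules.can_fetch(user_agent, url)]


# Hypothetical robots.txt rules and page content, for illustration only.
SAMPLE_ROBOTS = ["User-agent: *", "Disallow: /private/"]
SAMPLE_HTML = '<a href="/blog/post-1">Post</a> <a href="/private/admin">Admin</a>'

if __name__ == "__main__":
    for url in allowed_links(SAMPLE_HTML, "http://example.com/", SAMPLE_ROBOTS):
        print(url)
```

In a real spider the robots.txt file would be downloaded from the site (for example with `RobotFileParser.set_url()` and `read()`), and requests would be spaced out, for instance with `time.sleep()`, so as not to overload the server.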

Best regards,

Wouter

Wouter Tebbens
1067 days ago

Evaristo,

Generally, one of the best ways to become skilled in a technology or programming language is to become an active participant in and contributor to a community project of your interest that uses the language. That's one of the great lessons we have learned from Free Software communities.

Now, scraping is very much used in the Open Data movement. In fact, in October there is an OpenGovernment Data Camp in Poland, http://ogdcamp.org/, where you can meet the projects and people behind it.

One of the organisations behind the OGD movement is the Open Knowledge Foundation in the UK. And they have been developing Python libraries (CKAN) for Open Data projects. This month they are looking for a Python Web Developer as you can read on their website.

I hope this is useful for you.

best,

Wouter
