Identify sites that go to "Access Denied" or "Verify you are a human" pages when loaded #51

katehausladen · 2023-06-21T15:19:57Z

I've noticed that some sites load a page that says some variation of "Access Denied" or "Verify you are a human." I think this is mostly caused by the VPN (i.e. the VPN IP address is blocked or recognized), but the crawler probably contributes, too. The set of affected sites varies from run to run, particularly in the "Verify you are a human" category.
We want to:

- identify these sites, since the crawl data will be inaccurate
- hopefully find a way to bypass at least some of these issues

Currently, I am identifying them by the title of the site (i.e. "Access Denied" and "Just a moment..." in the pictures below).

The sites I do identify are logged as either a "VerifyHumanError" or an "AccessDeniedError". The crawler won't restart after these errors, so it doesn't affect the overall flow.
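A minimal sketch of the title-based check described above, assuming the page title has already been read (for example with Selenium's `driver.getTitle()`). The helper name `classifyTitle` and the exact phrases are illustrative assumptions, not the crawler's actual code.

```javascript
// Hypothetical helper: map a page title to one of the two error names
// used in this issue, or null if the title looks like a normal page.
function classifyTitle(title) {
  const t = (title || "").toLowerCase();
  // "Just a moment..." is the title seen on the human-check pages here.
  if (t.includes("just a moment") || t.includes("verify you are a human")) {
    return "VerifyHumanError";
  }
  if (t.includes("access denied")) {
    return "AccessDeniedError";
  }
  return null;
}

console.log(classifyTitle("Just a moment..."));
console.log(classifyTitle("Access Denied"));
console.log(classifyTitle("Some ordinary page title"));
```

Because the check only looks at the title, a block page with an ordinary title would be missed; that is a limitation of this sketch.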
In the latest full crawl, the sites that asked to verify you were a human were:
VerifyHumanError: [
'https://www.legacy.com',
'https://www.ticketsatwork.com',
'https://www.fixya.com',
'https://www.cameo.com',
'https://www.cardinalcommerce.com',
'https://www.cinemark.com',
'https://www.securecafe.com',
'https://www.lordandtaylor.com',
'https://www.moneytalksnews.com',
'https://www.allegiantair.com',
'https://www.newspapers.com',
'https://www.rentcafe.com',
'https://www.babylonbee.com',
'https://www.fleetfarm.com',
'https://www.jegs.com',
'https://www.appurse.com',
'https://www.123-movies.com',
'https://www.camelcamelcamel.com',
'https://www.muscleandstrength.com']
And the sites with Access Denied were:
AccessDeniedError: [
'https://www.sprint.com',
'https://www.kroger.com',
'https://www.petsmart.com',
'https://www.tacobell.com',
'https://www.subway.com',
'https://www.zoosk.com',
'https://www.officedepot.com',
'https://www.hotwire.com',
'https://www.meijer.com',
'https://www.jcrew.com',
'https://www.backcountry.com',
'https://www.littlecaesars.com',
'https://www.fisglobal.com',
'https://www.fossil.com',
'https://www.flyertalk.com',
'https://www.citizensbank.com',
'https://www.earthlink.net',
'https://www.apartmentfinder.com',
'https://www.demandforce.com',
'https://www.pizzahut.com',
'https://www.t-mobile.com']
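The two lists above are grouped by error name. One way such a grouping could be built up during a crawl is sketched below; `recordError` is a hypothetical helper, and writing the result to a JSON file is an assumption about how the log is stored.

```javascript
// Hypothetical sketch: collect failed sites under their error name so the
// result can be serialized to a JSON log file.
function recordError(log, errorName, siteUrl) {
  if (!log[errorName]) {
    log[errorName] = [];
  }
  log[errorName].push(siteUrl);
  return log;
}

const errorLog = {};
recordError(errorLog, "VerifyHumanError", "https://www.legacy.com");
recordError(errorLog, "AccessDeniedError", "https://www.sprint.com");
recordError(errorLog, "VerifyHumanError", "https://www.cinemark.com");

// A crawler could write this string to disk after each site.
console.log(JSON.stringify(errorLog, null, 2));
```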

OliverWang13 · 2023-06-21T19:11:16Z

I've put some thought into this too and we could search for text like "Access Denied" or "Verify you are a human" by using a method similar to how we search for Do Not Sell links. Also, while we have access to @sophieeng's IP address, we could have her run a crawl and see if sites are still blocked.

katehausladen · 2023-06-21T20:13:31Z

some more examples:

katehausladen · 2023-06-22T15:40:11Z

I wrote some code to click the "Verify you are a human" button by Cloudflare (as in the cinemark.com example). When the button is clicked, the same captcha page loads again. Since clicking the button alone does not work, Cloudflare is most likely detecting that we are using Selenium. I haven't found any packages that would help with this for Selenium with Node.js and Firefox, and since it's only impacting ~20 sites, I'm not sure it's worth spending a ton of time trying to bypass. We can discuss this more at the meeting.

- Matching "access" and "denied" separately (for titles like "Access to this page has been denied.")
- Matching "error" and "service unavailable"
- Maximize browser screen size
- Clicking on Cloudflare captcha is commented in lines 143-152
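Matching the two words separately, as in the first item above, can be sketched with two lookaheads. The pattern and the name `looksBlocked` are illustrative assumptions, not the regex actually used in the crawler.

```javascript
// Illustrative pattern, not the crawler's actual regex.
// The two lookaheads require both whole words somewhere in the title, in any
// order; the remaining alternatives match the other phrases listed above.
const blockedTitle =
  /^(?=.*\baccess\b)(?=.*\bdenied\b)|\berror\b|service unavailable/i;

function looksBlocked(title) {
  return blockedTitle.test(title);
}

console.log(looksBlocked("Access to this page has been denied."));
console.log(looksBlocked("503 Service Unavailable"));
console.log(looksBlocked("Accessories and gifts"));
```

A bare "error" match is broad and could flag legitimate pages whose titles happen to contain that word, so a pattern like this trades some false positives for coverage.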

SebastianZimmeck · 2023-06-24T16:08:08Z

> The crawler won't restart after these errors, so it doesn't affect the overall flow.

The crawler continues crawling (as a clarification).

> impacting ~20 sites

That is out of about 1,800 sites in total, though; the 20 are specifically the Cloudflare human checks.

- instead of having VerifyHumanError, AccessDeniedError, SiteError, just have one HumanCheckError
- adding "pardon our interruption" to the list to check, as it occasionally occurs on realtor.com

katehausladen · 2023-06-30T15:24:47Z

We decided in the last meeting that we will not try to bypass any human checks. We will add new titles to the regex in the visit_site function in local-crawler.js as we see more sites like this. The readme is also updated to include this in section 4.
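The regex in `visit_site` is not shown in this issue, so the following is only a sketch of how a title regex could be kept easy to extend: the phrases live in a list and the regex is built from it. The list contents and the helper names are assumptions for illustration.

```javascript
// Hypothetical list of title fragments; new entries are appended as more
// human-check pages are observed.
const humanCheckTitles = [
  "just a moment",
  "verify you are a human",
  "access denied",
  "pardon our interruption",
];

// Escape regex metacharacters so each entry is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build one case-insensitive regex from the list.
function buildHumanCheckRegex(titles) {
  if (titles.length === 0) {
    return /(?!)/; // never matches, so an empty list flags nothing
  }
  return new RegExp(titles.map(escapeRegExp).join("|"), "i");
}

// Return the single error name used after the refactor, or null.
function checkTitle(title, regex) {
  return regex.test(title) ? "HumanCheckError" : null;
}

const humanCheckRegex = buildHumanCheckRegex(humanCheckTitles);
console.log(checkTitle("Pardon Our Interruption", humanCheckRegex));
console.log(checkTitle("Homes for sale", humanCheckRegex));
```

Adding a newly observed title then means appending one string to the list rather than editing the regex by hand.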

katehausladen self-assigned this Jun 21, 2023

katehausladen added a commit that referenced this issue Jun 21, 2023

put error catching into json file (issue #51

a68f5d2

now, errors will be logged in error-logging.json

SebastianZimmeck added the enhancement New feature label Jun 24, 2023

katehausladen added a commit that referenced this issue Jun 30, 2023

Adding Limitations section to README.md

ef37f44

describing updates from issue #51 in the readme

katehausladen closed this as completed Jun 30, 2023

JoeChampeau mentioned this issue Apr 19, 2024

Improve Crawler Error Handling privacy-tech-lab/privacy-pioneer-web-crawler#33

Closed

franciscawijaya mentioned this issue Jul 12, 2024

Deal with null entries #121

Open
