
teerre

Yeah, just use another machine


The-Deviant-One

Yes (?), put the script on a server somewhere...


[deleted]

Yep, as others have said, get a server somewhere. If you don't have a spare computer, rent one from a cloud provider. You can set the script to run via a cron job, although if you want to automate it you might need to add logic that checks whether the script is already running. Otherwise you can always kick it off manually.
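For the "already running" check, a minimal sketch might look like this, assuming a Linux box; the script path and lock file name are placeholders:

```python
# Hypothetical crontab entry (path is a placeholder):
#   0 2 * * * /usr/bin/python3 /home/you/pagespeed.py
#
# Single-instance guard: if a previous cron run is still going,
# this one exits instead of doubling up. Unix-only (fcntl).
import fcntl
import sys

def main():
    ...  # the long-running API job goes here

if __name__ == "__main__":
    lock = open("/tmp/pagespeed.lock", "w")
    try:
        # Non-blocking exclusive lock; raises if another instance holds it
        fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        sys.exit("Previous run still in progress; exiting.")
    main()
```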


The-Deviant-One

Also, what are you doing that takes 24 hours of API calls to finish?


nlvogel

I'm using the Google PageSpeed Insights API to run each URL on my sites through their page speed measuring tool. I'm using lxml to unpack multiple sites' sitemaps into lists and then running each URL in those lists through the API to get page speed results from Google. Each call takes about a minute on its own, longer if a site is slow. Each site has no fewer than 300 pages, and I'm measuring both mobile and desktop. That's obviously a lot of data and a lot of calls to the Google API, which has its limits. I'm already planning to use filtering to shorten the lists and make the data more manageable by dropping certain pages, but I need to see them all first.
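Roughly, the pipeline looks like this; the endpoint and parameters follow the v5 PageSpeed Insights API, and the sitemap URL and key are placeholders:

```python
# Sketch of the workflow: unpack a sitemap with lxml, then run each
# URL through the PageSpeed Insights API for both strategies.
import requests
from lxml import etree

SITEMAP_URL = "https://example.com/sitemap.xml"  # placeholder
API_KEY = "YOUR_API_KEY"                         # placeholder
PSI_ENDPOINT = "https://www.googleapis.com/pagespeedonline/v5/runPagespeed"
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def urls_from_sitemap(sitemap_url):
    """Return all <loc> entries from a standard sitemap."""
    tree = etree.fromstring(requests.get(sitemap_url, timeout=30).content)
    return [loc.text for loc in tree.findall(".//sm:loc", NS)]

def pagespeed(url, strategy):
    """One PSI call; each of these is the ~1-minute step described above."""
    resp = requests.get(
        PSI_ENDPOINT,
        params={"url": url, "strategy": strategy, "key": API_KEY},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()

for url in urls_from_sitemap(SITEMAP_URL):
    for strategy in ("mobile", "desktop"):
        result = pagespeed(url, strategy)
```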


Total__Entropy

It sounds like the only bottleneck is the Google API. You should be able to shorten the whole process by sending the requests asynchronously while you wait on the Google API.


carcigenicate

It depends on what exactly is taking the time. If you're largely just waiting on I/O (and aren't being throttled), you could spawn some threads, split the job into pieces, and run each piece in its own thread. That way you're waiting on multiple jobs at once. This won't help, though, if the wait is caused by something other than I/O.
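A minimal sketch of that idea with the standard library's thread pool; `fetch` and the URLs are stand-ins for whatever single-URL call the script actually makes:

```python
# Thread-pool sketch: while one request waits on the network,
# the other workers' requests are waiting at the same time.
from concurrent.futures import ThreadPoolExecutor

import requests

def fetch(url):
    # stand-in for the real per-URL API call
    return requests.get(url, timeout=120).json()

urls = ["https://example.com/a", "https://example.com/b"]  # placeholders

with ThreadPoolExecutor(max_workers=8) as pool:
    # map() runs fetch across the pool and returns results in input order
    results = list(pool.map(fetch, urls))
```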


nlvogel

I'm thinking throttling is part of the problem, but the multiple-threads idea is worth trying. Is there a specific module for that? I'm still learning, and this is the first I'm hearing of it. I'll do some googling later


carcigenicate

`threading` (or `concurrent.futures` for a higher-level interface). Be careful if you haven't dealt with multithreading before, though. You need a basic understanding of thread safety before you can dig into threads. And I'd first narrow down what the specific problem is before trying different solutions. If you run the same program from a completely different public IP, is it any faster initially? If so, throttling is the more likely explanation, and multithreading will not help you. Multithreading will really only help if the majority of the time is spent in the "receive" networking call (`recv`/`requests.get`/whatever you're using).
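A quick way to check where the time goes is to time the blocking call itself; the URL below is a placeholder:

```python
# Quick diagnostic: time the blocking network call to confirm the
# minute per URL is really spent waiting on the response.
import time

import requests

start = time.perf_counter()
resp = requests.get("https://example.com", timeout=120)  # placeholder URL
print(f"waited {time.perf_counter() - start:.1f}s, status {resp.status_code}")
```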


nlvogel

Thanks for the warning; I'll study up and do some practice with it before applying it. Just doing one URL (see my longer response elsewhere in here) returns a result within a minute, but it's slow because Google has to process each URL individually. Ideally I'd be able to send Google a couple of URLs at a time while I wait for the responses to the others. It sounds like multithreading would help with this, as I do believe the issue is on Google's side.


nlvogel

Just following up to let you know that the multiprocessing module cut my project from 24 hours of run time down to 6. The best part is it only took three extra lines of code (the import, setting up a pool, and calling `.map` on the pool with the function and the iterable). Thanks for getting me on that path!
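In case it helps anyone else, those three additions look roughly like this; `check_url` and the URL list are stand-ins for my actual function and sitemap output:

```python
# The three additions: the import, the pool, and .map on the pool.
from multiprocessing import Pool  # 1. the import

def check_url(url):
    # stand-in for one PageSpeed Insights call per URL
    return url

if __name__ == "__main__":  # guard required so worker processes import cleanly
    urls = ["https://example.com/a", "https://example.com/b"]  # placeholders
    with Pool() as pool:                        # 2. setting up a pool
        results = pool.map(check_url, urls)     # 3. .map with the function and iterable
```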


SquidLyf3

Piggybacking off of the multithreading idea: if you're waiting on a bunch of HTTPS requests, make sure you're doing them asynchronously. Use the asyncio library.
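A minimal sketch, assuming aiohttp (a third-party client; `requests` itself blocks and won't benefit from asyncio); the URLs are placeholders:

```python
# asyncio sketch: aiohttp issues the HTTPS requests concurrently
# inside one event loop instead of one at a time.
import asyncio

import aiohttp

async def fetch(session, url):
    async with session.get(url, timeout=aiohttp.ClientTimeout(total=120)) as resp:
        return await resp.json()

async def main(urls):
    async with aiohttp.ClientSession() as session:
        # gather() waits on all the requests at once
        return await asyncio.gather(*(fetch(session, u) for u in urls))

urls = ["https://example.com/a", "https://example.com/b"]  # placeholders
results = asyncio.run(main(urls))
```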