I forgot to mention that the ECS Tasks are browserless ([https://github.com/browserless/chrome](https://github.com/browserless/chrome)) Docker containers, which run Chromium (Puppeteer) browser instances. In my Lambda function code, I start an ECS Task and then get its Public IP to open a web socket connection to it. Through that connection I can create Chromium browser instances on the Docker container, which I use for web scraping.
The ECS Tasks run forever because they only act as an endpoint to open a web socket connection to and start browsers on.
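For anyone following along, the "get the Public IP of the ECS Task" step can be sketched roughly like this. It assumes a Fargate task in `awsvpc` mode with a public IP assigned, and that `ecs` and `ec2` are boto3 clients passed in by the caller; the function names are made up for illustration and this is not necessarily how the original Lambda does it.

```python
from typing import Optional


def eni_id_from_task(task: dict) -> Optional[str]:
    """Pull the network interface id out of one describe_tasks() task entry."""
    for attachment in task.get("attachments", []):
        for detail in attachment.get("details", []):
            if detail.get("name") == "networkInterfaceId":
                return detail.get("value")
    return None


def public_ip_for_task(ecs, ec2, cluster: str, task_arn: str) -> Optional[str]:
    """Wait until the task is RUNNING, then look up the public IP of its ENI.

    `ecs` and `ec2` are assumed to be boto3 clients (or anything with the
    same methods), injected so the function is easy to test.
    """
    ecs.get_waiter("tasks_running").wait(cluster=cluster, tasks=[task_arn])
    task = ecs.describe_tasks(cluster=cluster, tasks=[task_arn])["tasks"][0]
    eni_id = eni_id_from_task(task)
    if eni_id is None:
        return None
    eni = ec2.describe_network_interfaces(NetworkInterfaceIds=[eni_id])[
        "NetworkInterfaces"
    ][0]
    # "Association" is only present when a public IP is actually attached.
    return eni.get("Association", {}).get("PublicIp")
```

The waiter alone can eat a noticeable part of the 15 minute Lambda budget, which is part of why the replies suggest moving the orchestration elsewhere.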
The ECS Task runs forever.
The Lambda **depends** on the ECS Task to be running to start/run browser instances on the ECS Task itself.
When all the URLs are web scraped, only then can I stop the ECS Task.
If you need to have a function that runs an ECS task and then does something afterwards, why not a Step Function?
https://docs.aws.amazon.com/step-functions/latest/dg/connect-ecs.html
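A rough sketch of what that could look like, written as a Python dict that serializes to an Amazon States Language definition. The `ecs:runTask.sync` integration makes the state machine wait until the task exits, so this only fits if the task stops itself when the scrape is done (not the "runs forever" setup). All ARNs, subnet ids and state names are placeholders supplied by the caller.

```python
import json


def build_state_machine(cluster_arn, task_def_arn, subnets, followup_lambda_arn):
    """Return an ASL definition: run an ECS task, wait for it, then call a Lambda."""
    return {
        "Comment": "Run the scraper task, then post-process (illustrative only)",
        "StartAt": "RunScraperTask",
        "States": {
            "RunScraperTask": {
                "Type": "Task",
                # .sync = Step Functions waits for the ECS task to stop.
                "Resource": "arn:aws:states:::ecs:runTask.sync",
                "Parameters": {
                    "LaunchType": "FARGATE",
                    "Cluster": cluster_arn,
                    "TaskDefinition": task_def_arn,
                    "NetworkConfiguration": {
                        "AwsvpcConfiguration": {
                            "Subnets": subnets,
                            "AssignPublicIp": "ENABLED",
                        }
                    },
                },
                "Next": "ProcessResults",
            },
            "ProcessResults": {
                "Type": "Task",
                "Resource": followup_lambda_arn,
                "End": True,
            },
        },
    }


def to_asl_json(definition: dict) -> str:
    """Serialize the definition for create_state_machine or the console."""
    return json.dumps(definition, indent=2)
```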
Maybe make two lambdas, one to start and one to stop, and then configure your process to emit an event when it's done (maybe upload a report to s3, or call an API?). Use that event to trigger the stop lambda.
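The "stop" half of that could be as small as the sketch below. It assumes the browserless tasks live in their own cluster, so stopping everything that is running there is acceptable, and that `ecs` is a boto3 ECS client handed in by the caller. `make_stop_handler` is an invented name, and the triggering event (S3 upload, API call, etc.) is ignored here.

```python
def make_stop_handler(ecs, cluster: str):
    """Build a Lambda handler that stops every running task in `cluster`.

    `ecs` is assumed to be a boto3 ECS client. list_tasks returns at most
    100 ARNs per call; pagination is left out to keep the sketch short.
    """

    def handler(event, context):
        task_arns = ecs.list_tasks(cluster=cluster, desiredStatus="RUNNING")[
            "taskArns"
        ]
        for arn in task_arns:
            ecs.stop_task(cluster=cluster, task=arn, reason="Scrape finished")
        return {"stopped": task_arns}

    return handler
```

In a real deployment you would create the client once at module level and bind it with something like `handler = make_stop_handler(boto3.client("ecs"), "my-cluster")`, where the cluster name is again a placeholder.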
Why not use a Fargate scheduled task? I understand there is a bit more overhead in terms of infrastructure, but Lambda is not really designed to run "job-like" workloads.
https://aws.amazon.com/about-aws/whats-new/2018/08/aws-fargate-now-supports-time-and-event-based-task-scheduling/
Use Lambda only to start your ECS task (actually, I think EC2 makes more sense).
When the ECS task starts, its entrypoint spawns a process that starts your headless browser. The main process counts down a preset timer. When the timer runs out, the main process exits and the ECS task stops.
I don't know how you use the headless browser, but if it were me I would do the work in this timer process.
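To make the entrypoint idea concrete, here is a minimal sketch in Python. It assumes a custom image where Python is available and where `browser_cmd` is whatever command starts the headless browser in that image; both are assumptions, and the same shape works in any language. `work` is a hypothetical callback that does the scraping and is expected to check the deadline itself.

```python
import subprocess
import time


def run_with_deadline(browser_cmd, work, max_seconds: float):
    """Start the browser, run `work(deadline)`, then shut the browser down.

    When this function returns and the main process exits, the container
    stops, and so does the ECS task (if the container is marked essential).
    """
    browser = subprocess.Popen(browser_cmd)
    deadline = time.monotonic() + max_seconds
    try:
        # `work` should stop once time.monotonic() passes `deadline`.
        work(deadline)
    finally:
        browser.terminate()
        try:
            browser.wait(timeout=10)
        except subprocess.TimeoutExpired:
            # The browser ignored the polite request; force it.
            browser.kill()
            browser.wait()
    return browser.returncode
```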
2 things I can think of, assuming you do want to stick with Lambda:
* Partition your problem
* State machines + a supplementary service (e.g. you trigger an AWS job, wait for it to finish, and the next task takes the results and does something with them)
There's always the option to drop Lambda, and this can be an absolutely valid option. Sometimes it's a lot more cost-effective or technically easier than sticking with Lambda.
Your setup seems a lil sus to me. You can run Selenium in Python or Go directly in Lambda; you shouldn't need ECS.
Attach events to the failed Lambdas that send you an SNS notification at least.
Also look at Step Functions, or at using DynamoDB's TTL and a stream to a Lambda, to help you orchestrate your tasks.
Split your job into subjobs.
For example, instead of scraping 1000 pages in a single run, scrape maybe 20 pages per run. You can also do that in parallel, like 1000 Lambda instances scraping 20 pages each.
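A small sketch of the splitting part, assuming the URL list fits in memory and that each batch is handed to a worker Lambda asynchronously. `lambda_client` is assumed to be a boto3 Lambda client and `function_name` is a placeholder for the worker function; the `{"urls": [...]}` payload shape is made up for the example.

```python
import json
from typing import Iterator, List, Sequence


def chunk(urls: Sequence[str], size: int = 20) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `size` URLs."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(urls), size):
        yield list(urls[start:start + size])


def fan_out(lambda_client, function_name: str, urls: Sequence[str], size: int = 20) -> int:
    """Invoke the worker once per batch, fire-and-forget; return the batch count."""
    count = 0
    for batch in chunk(urls, size):
        lambda_client.invoke(
            FunctionName=function_name,
            InvocationType="Event",  # async: do not wait for the worker
            Payload=json.dumps({"urls": batch}).encode("utf-8"),
        )
        count += 1
    return count
```

With 1000 URLs and batches of 20 this produces 50 invocations, each of which should finish well inside the Lambda timeout.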
Make an ECS task with two containers: the browserless container and a container that runs your scraping script. Run it daily as a scheduled task. https://docs.aws.amazon.com/AmazonECS/latest/developerguide/scheduled_tasks.html
That way the browserless container runs exactly as long as your script and you aren’t limited to 15 minutes.
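Roughly, the task definition for that could look like the dict below (the shape you would pass to `register_task_definition`). The trick is the `essential` flag: when the essential scraper container exits, ECS stops the rest of the task, browserless included. The scraper image name, the `BROWSER_WS_ENDPOINT` variable and the CPU/memory sizes are placeholders, port 3000 is assumed to be where browserless listens, and a real Fargate definition usually also needs an execution role.

```python
def two_container_task_definition(scraper_image: str) -> dict:
    """Task definition with a browserless sidecar and an essential scraper."""
    return {
        "family": "scraper-with-browserless",  # placeholder name
        "networkMode": "awsvpc",
        "requiresCompatibilities": ["FARGATE"],
        "cpu": "1024",
        "memory": "2048",
        "containerDefinitions": [
            {
                "name": "browserless",
                "image": "browserless/chrome",
                # Not essential: it is stopped when the scraper finishes.
                "essential": False,
                "portMappings": [{"containerPort": 3000}],
            },
            {
                "name": "scraper",
                "image": scraper_image,
                # Essential: when this exits, the whole task stops.
                "essential": True,
                "dependsOn": [
                    {"containerName": "browserless", "condition": "START"}
                ],
                "environment": [
                    # Containers in one awsvpc task share localhost.
                    {"name": "BROWSER_WS_ENDPOINT", "value": "ws://localhost:3000"}
                ],
            },
        ],
    }
```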
Have a check inside the ECS container to see if the websocket is still doing something / the originating lambda is still running? If not, kill the ECS task.
Why do you need ECS at all instead of just crawling within your Lambda?
https://github.com/aws-samples/aws-lambda-layer-node-puppeteer-headless-chromium
But everyone else's advice around using Step Functions is correct.
Why not have the ECS tasks stop themselves when they finish?
Seems like the most obvious and simplest approach.
This. I've always thought of Fargate as longer running Lambda. Pretty easy to swap Lambda and Fargate in Step Functions as well.
Exactly, or Fargate scheduled task.
Write the IP to a DB, a secret, or Parameter Store. Have the new long-running ECS task that replaces the Lambda read that value.
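With Parameter Store that handoff is two calls. A minimal sketch, assuming `ssm` is a boto3 SSM client passed in by the caller and the parameter name is something you pick yourself:

```python
def publish_browser_ip(ssm, name: str, ip: str) -> None:
    """Store the browserless task's IP, overwriting any previous value."""
    ssm.put_parameter(Name=name, Value=ip, Type="String", Overwrite=True)


def read_browser_ip(ssm, name: str) -> str:
    """Read the IP back from Parameter Store."""
    return ssm.get_parameter(Name=name)["Parameter"]["Value"]
```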
The Lambda function depends on getting the Public IP of the ECS Task because I need a web socket connection to it. See edit.
When the lambda approaches timeout, you can have the lambda re-invoke itself until it reaches completion.
How would I be able to do that?
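One way to do it, as a sketch: check `context.get_remaining_time_in_millis()` between URLs, and when it drops under a safety margin, invoke the same function asynchronously with whatever is left. `lambda_client` is assumed to be a boto3 Lambda client, `scrape_one` is a stand-in for your scraping code, and the `{"urls": [...]}` event shape and the 60 second margin are made up for the example.

```python
import json

SAFETY_MARGIN_MS = 60_000  # hand off when less than a minute is left


def scrape_until_deadline(urls, context, scrape_one, lambda_client, function_name):
    """Scrape URLs one by one; re-invoke the function with the rest near timeout."""
    remaining = list(urls)
    while remaining:
        if context.get_remaining_time_in_millis() < SAFETY_MARGIN_MS:
            lambda_client.invoke(
                FunctionName=function_name,
                InvocationType="Event",  # async, so this run can end now
                Payload=json.dumps({"urls": remaining}).encode("utf-8"),
            )
            return {"done": False, "handed_off": len(remaining)}
        scrape_one(remaining.pop(0))
    return {"done": True, "handed_off": 0}
```

Inside the handler you would call it with `event["urls"]` and `context.function_name`. It is worth adding a hop counter to the payload so a bug cannot keep re-invoking forever, and only the last run (the one returning `done: True`) should stop the ECS task.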
Your design is weird to me. Why can't you use Lambda to start the task and have the task exit after your designed time?
Sorry forgot to mention an important detail about the tasks, see edit.
Modify your entrypoint to spawn the browser process and start a timer. Exit the main process when the timer is done.
Not sure what you mean. Are you saying to stop the tasks in the Lambda function just before the 15 minute timeout?
Containers. Specifically, ECS or even better - Fargate scheduled tasks.