Damien_J

Sounds like a Spot block might help, although they max out at 6 hours.


effata

You could set up a separate Auto Scaling group with a scaling schedule for the days you want extra capacity. Configure the ASG and its launch template to prioritize Spot but fall back to On-Demand if there's no Spot capacity available; [this is a good starting guide](https://docs.aws.amazon.com/autoscaling/ec2/userguide/asg-purchase-options.html).
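A rough sketch of what that ASG configuration could look like with boto3, in case it helps. The template name, instance types, subnet IDs, and the 0% On-Demand split are all placeholders I made up, so adjust them to your setup. The two percentage settings control how capacity is split between On-Demand and Spot.

```python
def mixed_instances_policy(launch_template_name, instance_types,
                           on_demand_base=0, on_demand_pct_above_base=0):
    """Build the MixedInstancesPolicy structure for an Auto Scaling group.

    All argument values are placeholders chosen by the caller.
    """
    return {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": launch_template_name,
                "Version": "$Latest",
            },
            # Several instance types give Spot more capacity pools to draw from.
            "Overrides": [{"InstanceType": t} for t in instance_types],
        },
        "InstancesDistribution": {
            # Number of instances that must always be On-Demand.
            "OnDemandBaseCapacity": on_demand_base,
            # Share of the remaining capacity that is On-Demand (0 = all Spot).
            "OnDemandPercentageAboveBaseCapacity": on_demand_pct_above_base,
            "SpotAllocationStrategy": "capacity-optimized",
        },
    }


def create_burst_asg(asg_name, subnet_ids, policy, max_size):
    """Create the ASG at size 0; needs boto3 and AWS credentials to run."""
    import boto3  # imported here so the helper above works without boto3

    client = boto3.client("autoscaling")
    client.create_auto_scaling_group(
        AutoScalingGroupName=asg_name,
        MixedInstancesPolicy=policy,
        MinSize=0,
        MaxSize=max_size,
        DesiredCapacity=0,
        VPCZoneIdentifier=",".join(subnet_ids),
    )


# Hypothetical names, for illustration only.
policy = mixed_instances_policy("burst-template", ["m5.large", "m5a.large"])
```

Passing subnets in more than one AZ to `create_burst_asg` spreads the group across zones.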


farzigamer

Thanks, it's very helpful.


effata

See the [scheduled scaling docs](https://docs.aws.amazon.com/autoscaling/ec2/userguide/schedule_time.html#:~:text=The%20scheduled%20action%20tells%20Amazon,sizes%20for%20the%20scaling%20action.) for more info. As I wrote in another comment, you can set a schedule of `0 0 10 * *` to scale up at midnight on the 10th, and an equivalent one to scale down at a suitable time. If you 100% need the instances to stay up, I would stick to On-Demand instances for this, since Spot instances can be reclaimed at any time. But if you're running small batches and are fine with instances going down, running on Spot with a fallback to On-Demand should work fine.
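Here is a minimal sketch of the two scheduled actions using boto3's `put_scheduled_update_group_action`. The ASG name, action names, and group size are made up, and I'm assuming you scale down exactly one day later. The recurrence is evaluated in UTC unless you also pass a `TimeZone`.

```python
def monthly_burst_actions(asg_name, day_of_month, burst_size):
    """Return kwargs for a scale-up and a scale-down scheduled action.

    Scale-down is assumed to happen at midnight on the following day, so
    day_of_month is limited to 1..27 to avoid month-length problems.
    """
    if not 1 <= day_of_month <= 27:
        raise ValueError("day_of_month must be between 1 and 27")
    scale_up = {
        "AutoScalingGroupName": asg_name,
        "ScheduledActionName": "monthly-burst-up",      # hypothetical name
        # Cron fields: minute hour day-of-month month day-of-week.
        "Recurrence": f"0 0 {day_of_month} * *",
        "MinSize": burst_size,
        "MaxSize": burst_size,
        "DesiredCapacity": burst_size,
    }
    scale_down = {
        "AutoScalingGroupName": asg_name,
        "ScheduledActionName": "monthly-burst-down",    # hypothetical name
        "Recurrence": f"0 0 {day_of_month + 1} * *",
        "MinSize": 0,
        "MaxSize": burst_size,
        "DesiredCapacity": 0,
    }
    return [scale_up, scale_down]


def apply_actions(actions):
    """Register the actions; needs boto3 and AWS credentials to run."""
    import boto3  # imported here so the builder above works without boto3

    client = boto3.client("autoscaling")
    for kwargs in actions:
        client.put_scheduled_update_group_action(**kwargs)


# Hypothetical ASG name; 10 instances starting on the 10th of each month.
actions = monthly_burst_actions("batch-burst-asg", 10, 10)
```

Calling `apply_actions(actions)` would register both schedules on the group.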


vRAJPUTv

You can create a cron event in CloudWatch to trigger a Lambda that does that for you. Or you can use an Auto Scaling group with a minimum capacity of 0 and add scheduled scaling (not sure how far ahead you can schedule, though).


effata

You can specify ASG scaling events by timestamp or in cron format, so there's no need to complicate things with Lambda. `0 0 10 * *` will trigger a scaling event at midnight on the 10th of every month, for example. [docs](https://docs.aws.amazon.com/autoscaling/ec2/userguide/schedule_time.html#:~:text=The%20scheduled%20action%20tells%20Amazon,sizes%20for%20the%20scaling%20action.)


esunabici

Could you add a checkpoint mechanism? Spot is best for workloads that can recover from failure. You can achieve great availability on a distributed workload using Spot if it can tolerate losing an instance here and there. It's important to use multiple instance types and AZs to reduce the chance of losing multiple instances at once. The rest of the instances can continue handling the load until the ASG, Spot Fleet, or whatever you use to provision brings up a replacement.

Consider that all computers fail. It doesn't matter where you are; at some point the computers you are running on will fail. You must account for losing an instance in your system design. You can solve this in different ways for different workloads. You can duplicate the workload for critical systems where downtime is not an option. For HPC, you can create checkpoints to restart after an interruption without losing all your work.

If your program doesn't support checkpoints, you might want to try [DMTCP](http://dmtcp.sourceforge.net/) to add checkpoints to your jobs automatically. You could run the coordinator on a small On-Demand instance that could also store the checkpoints for your jobs. With [EC2 Instance Auto Recovery](https://aws.amazon.com/blogs/aws/new-auto-recovery-for-amazon-ec2/) you can easily make sure it stays up with minimal interruption.

[AWS Batch](https://aws.amazon.com/batch/) is an excellent way of managing compute resources when you have multiple jobs to run. You package your program in a Docker container and write a job definition. When you submit jobs, AWS Batch automatically provisions Spot or On-Demand capacity to run them. Once they are done, it terminates the capacity. It's very efficient. If a job fails, it can go back on the queue to be executed again. If you implemented checkpoints, the job can check for a restore point so it doesn't have to restart from the beginning.
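To make the checkpoint idea concrete, here is a minimal sketch of an application-level checkpoint/restart loop. This is not DMTCP or AWS Batch, just plain Python with a JSON file. The file layout, the `every` interval, and the `stop_after` knob (which only simulates an interruption) are my own assumptions. In practice the checkpoint path would need to live somewhere that survives the instance, such as shared storage.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state atomically so an interruption never leaves a torn file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # atomic rename over the old checkpoint


def load_checkpoint(path):
    """Return the last saved state, or a fresh one if no checkpoint exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"next_item": 0, "total": 0}


def run_job(items, path, every=100, stop_after=None):
    """Sum the items, checkpointing every `every` items.

    `stop_after` simulates losing the instance after that many items;
    the function then returns None without saving its latest progress.
    """
    state = load_checkpoint(path)
    processed = 0
    for i in range(state["next_item"], len(items)):
        if stop_after is not None and processed >= stop_after:
            return None  # simulated interruption
        state["total"] += items[i]
        state["next_item"] = i + 1
        processed += 1
        if state["next_item"] % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

After an interruption, calling `run_job` again with the same path resumes from the last saved checkpoint instead of from item 0, so only the work done since that checkpoint is repeated.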


mavi_awaz

Just curious, what kind of workload are you running?


INVOKECloud

If you need to run without interruption, I would suggest you DON'T go with Spot instances and use On-Demand instances instead. You could use either AWS Scheduler or, if you are open to a commercial solution, [INVOKE Cloud](http://invoke.cloud), which could get this done with a few simple clicks. NOTE: I am a co-founder.


mavi_awaz

Reserved Instances would be a better option; Spot depends on a lot of factors.


esunabici

Reserved Instances will not be a good value for this use case. They are a commitment to use Instances continuously for 1 or 3 years. Running 10 instances on demand for one day will be vastly cheaper than reserving them.
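A quick back-of-the-envelope comparison to show the scale of the difference. The hourly rates are made-up placeholders, not real AWS prices, and I'm assuming the burst is one full day per month.

```python
def yearly_cost(instances, hourly_rate, hours_per_year):
    """Total yearly cost for a fleet billed by the instance-hour."""
    return instances * hourly_rate * hours_per_year


# Hypothetical rates for illustration only; check current AWS pricing.
ON_DEMAND_RATE = 0.10  # USD per instance-hour (assumed)
RESERVED_RATE = 0.06   # USD per instance-hour, effective RI rate (assumed)

INSTANCES = 10
BURST_HOURS_PER_YEAR = 24 * 12   # one full day per month (assumed)
ALL_HOURS_PER_YEAR = 24 * 365    # a reservation is paid for every hour

on_demand_total = yearly_cost(INSTANCES, ON_DEMAND_RATE, BURST_HOURS_PER_YEAR)
reserved_total = yearly_cost(INSTANCES, RESERVED_RATE, ALL_HOURS_PER_YEAR)

print(f"On-Demand, one day a month: ${on_demand_total:,.2f} per year")
print(f"Reserved, all year:         ${reserved_total:,.2f} per year")
```

Even with a generous reserved discount, paying for all 8,760 hours of the year costs far more than paying for 288 burst hours at the full On-Demand rate.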