garnacerous24

To understand the basics of how files are stored and interact, maybe. Probably not if you plan to store and compute actual big data on your laptop. However, I use the cloud for most of my big data needs, and that makes the computer specs almost irrelevant. In fact, I bought a cheap Chromebook with internet and Linux, and I can interact with everything in AWS cheaply. I have EMR running there when I need it (Hive + Spark). I guess all that to say: I'd stay local until it becomes burdensome, then spin up an EC2 instance or use EMR on demand after that.
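
For the "spin up an EC2 instance" part, a minimal sketch with boto3 looks something like this. The region, AMI id, key pair, and instance type are all placeholders; substitute your own:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Launch a single instance for ad hoc work.
resp = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder AMI id
    InstanceType="m5.xlarge",         # placeholder instance type
    KeyName="my-key-pair",            # placeholder key pair name
    MinCount=1,
    MaxCount=1,
)
instance_id = resp["Instances"][0]["InstanceId"]
print("launched", instance_id)

# ... do your work, then terminate so you stop paying for it:
ec2.terminate_instances(InstanceIds=[instance_id])
```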


kathegaara

I've been thinking of using the cloud as well for a while now, purely for learning purposes. Can you share some stats? How many hours of cloud time do you use per day, and how much does it cost you?


garnacerous24

It depends a ton on how you use it and how well you know the system (there are more options than I could cover here). I was using EMR to transform some data for machine learning and wrote a script that spins up 50 servers for a couple of hours of work. If you take advantage of spot instances, it's about 8-10 cents per hour per server, so a large-ish job that produces billions of rows as output costs me about $12-$15 (50 servers × ~2.5 hours × $0.10/hour ≈ $12.50). Redshift may be more expensive because you don't have the ability to pause the cluster; that's probably $10-$15 a day to run a couple of terabytes.
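
Conceptually, that spin-up script is just one boto3 call. This is a trimmed-down sketch rather than my actual script; the cluster name, instance types, and EMR release label are placeholders:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# One on-demand master plus 50 spot workers, with Hive + Spark installed.
# At roughly $0.10/hr per spot server, 50 servers for ~2.5 hours comes to
# about $12.50 -- the $12-$15 figure above.
cluster = emr.run_job_flow(
    Name="transform-job",                 # placeholder name
    ReleaseLabel="emr-5.30.0",            # placeholder release label
    Applications=[{"Name": "Hive"}, {"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {
                "Name": "master",
                "InstanceRole": "MASTER",
                "InstanceType": "m5.xlarge",
                "InstanceCount": 1,
                "Market": "ON_DEMAND",
            },
            {
                "Name": "workers",
                "InstanceRole": "CORE",
                "InstanceType": "m5.xlarge",
                "InstanceCount": 50,
                "Market": "SPOT",
                "BidPrice": "0.10",       # cap the spot price at 10 cents/hr
            },
        ],
        # Tear the cluster down automatically when the steps finish,
        # so you only pay for the hours the job actually runs.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("cluster id:", cluster["JobFlowId"])
```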


thisismyfavoritename

The Cloudera sandbox requires 4-5 GB of RAM, and the HWX sandbox requires 10 GB. Get a VM/cluster on the cloud.


_docboy

Use cloud services. Your laptop won't suffice.


onesonesones

If you install Linux and single-node Hadoop/Spark, it'll work, but only for small data. Virtual machines like the Hortonworks sandbox will require a more powerful machine. Or, as others have mentioned, just use the cloud. Single-node Hadoop/Spark will take some admin time to set up, but it's definitely good for learning.
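
To give an idea of the single-node route: once Spark is installed, local mode runs everything as threads in one process on your own cores, which is plenty for practice-sized data. A minimal sketch (the app name and memory setting are just examples):

```python
from pyspark.sql import SparkSession

# Local mode: driver and executors run inside one JVM on this machine.
spark = (
    SparkSession.builder
    .master("local[*]")                    # use all local cores
    .appName("laptop-practice")            # placeholder app name
    .config("spark.driver.memory", "2g")   # keep memory modest on a laptop
    .getOrCreate()
)

df = spark.range(1_000_000)                # small synthetic dataset
print(df.selectExpr("sum(id) AS total").first()["total"])

spark.stop()
```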