This topic was automatically generated from Slack. You can find the original thread here.
Joshua Gantt : Hi gang. I'm using Pipedream to build an archive utility. I need to save social media posts. The biggest hurdle is downloading media (hosted on AWS) from a URL, and I was hoping to get some thoughts and suggestions.
What's the best approach for downloading the media from a URL and saving the file output?
What is the best output format for archiving a social-media-style feed for portability (JSON, CSV, mbox, Maildir, ZIP)? I was thinking an easy solution would be to just write to a Google Sheet.
Are there rate limits, and is there an easy way to throttle? (e.g., if I have 100 posts and I need 50 of them to run async for saving)
Thanks for your help.
Dylan Sather (Pipedream) : Hi, I had just a couple of questions for you:
Are you asking for suggestions on services to use for storing media long-term (like AWS S3), or are you asking just about how to save the file in your Pipedream workflow while you’re processing the data?
If I were building this, I would try to figure out how the data will be consumed. Is it primarily humans reviewing it (Google Sheets might be great in that case), or are programs reading / transforming the data? If the latter, something like JSON might make sense. You could store that JSON in a database like MongoDB so that you can run queries based on properties (e.g. show me all the posts that contain the phrase "star wars").
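For example, a minimal sketch of that kind of query, assuming the posts are stored as JSON documents in a hypothetical `archive.posts` collection with a `text` field (all three names are placeholders):

```javascript
// Minimal sketch, not Pipedream-specific: querying archived posts in MongoDB.
const { MongoClient } = require("mongodb");

async function findStarWarsPosts() {
  const client = await MongoClient.connect(process.env.MONGODB_URI);
  try {
    const posts = client.db("archive").collection("posts");
    // All posts whose text contains the phrase "star wars" (case-insensitive)
    return await posts.find({ text: /star wars/i }).toArray();
  } finally {
    await client.close();
  }
}
```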
Joshua Gantt : Hi. Thanks, this was all helpful. Mostly just trying to figure out where to start. I think I remember seeing some docs on using a /tmp folder, so I'll look into that some more.
I guess my issue is determining consumption. It's meant to be friendly and human-readable, but I was more concerned with portability. I was going to spin up a self-hosted front end for viewing.
I'll check out the throttling and concurrency as well. Not sure just yet on the Pipedream workflows, but I've got webhook-to-Google-Sheets so far.
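For the download step, a minimal sketch of that /tmp approach, assuming axios for the HTTP request (a `mediaUrl` field on the incoming post is a hypothetical example):

```javascript
// Sketch: stream a media file from a URL to disk. In a Pipedream code
// step, /tmp is the writable directory to target.
const axios = require("axios");
const fs = require("fs");

async function downloadMedia(url, destPath) {
  const response = await axios({ url, method: "GET", responseType: "stream" });
  await new Promise((resolve, reject) => {
    const writer = fs.createWriteStream(destPath);
    response.data.pipe(writer); // stream the body to disk without buffering it all in memory
    writer.on("finish", resolve);
    writer.on("error", reject);
  });
  return destPath;
}

// e.g. await downloadMedia(post.mediaUrl, "/tmp/post-media.jpg");
```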
Dylan Sather (Pipedream) : The concurrency / throttling docs actually use Google Sheets as an example. You'll want to limit the concurrency to 1 on that workflow; otherwise two requests might try to write to the Google Sheet at the same time, and one overwrites the other.
You can also apply concurrency and rate limiting together. For example, if you need to also rate limit the execution of the workflow to 50 requests per minute, you can add that on top of the concurrency rule you have so that:
• Only one request will run at a time.
• No more than 50 requests will run per minute.
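Those limits are configured in the workflow's settings rather than in code. If you also need to throttle on the sending side (e.g., when fanning 100 archived posts out to the webhook yourself), a minimal sketch using the p-limit npm package (v3 for CommonJS require; the endpoint URL is a placeholder):

```javascript
// Sketch: sender-side throttling, complementing Pipedream's own
// concurrency/rate-limit settings on the receiving workflow.
const pLimit = require("p-limit");
const axios = require("axios");

const limit = pLimit(1); // at most one request in flight at a time
const WEBHOOK_URL = "https://YOUR-ENDPOINT.m.pipedream.net"; // placeholder

async function sendPosts(posts) {
  // Queue all posts, but p-limit ensures they POST one at a time
  await Promise.all(posts.map((post) => limit(() => axios.post(WEBHOOK_URL, post))));
}
```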
Ashutosh Saboo : Hi, just came across this thread. Did you manage to build this workflow eventually? Would be super interested in seeing the workflow if you could share the public link (of course, if it's possible for you). I was planning to build such a workflow myself some time back, but for archiving tweets specifically, so your workflow would be a key component of what I was thinking of.
Joshua Gantt : Thank you so much. I really appreciate your help. I will be working on this as a side project, so I won't dive into it straightaway. But I will be happy to share my workflow. It isn't Twitter-specific, but the app I'm trying to archive has a similar data schema. I'll try to remember to add it to the share-your-work channel.