#+ORG2BLOG

Posts of, to, and from Bits4Waves’ Blog.

Fetch the data from the shortcodes

Hello!

Here at Bits4Waves things got a special detour yesterday. Dealing with all this data turned out to be really brain-teasing! This resulted in a lot—literally dozens—of ideas and also questions. So yesterday there was a “pause-and-assess” moment to develop some tools to deal with all this info. (If you like Emacs and Org, you may love what happened!)

But back to the matter at hand: the #100daysofpractice dataset! After acquiring the shortcodes, the next natural step would be to explore the data. We would like to know, for instance, how they are spread through time, and some information about the practictioners, along with a bunch of important details about music practice! But before that, we need to effectively use the shortcodes to gather this data. This is the objective for today!

The idea is simple: we will go through the list of shortcodes, and, give the shortocodes one at a time to instaloader, asking for it to analyze the given shortcode and return the corresponding data (instaloader is a Python API that interfaces with Instagram).

Let’s start! First, lets open the file to read the shortcodes:

#!/usr/bin/env python

import fileinput

for line in fileinput.input('../shortcodes/shortcodes-uniq.txt'):
    print(line, end='')
    if (fileinput.lineno() == 10): break

Now, let’s fetch the username for the post. To do this for the first shorcode we’ll run:

#!/usr/bin/env python

import instaloader

shortcode = '008-CMh_h-'
I = instaloader.Instaloader()
post = instaloader.Post.from_shortcode(I.context, shortcode)
print(post.owner_profile.username)

OK, now that we know how to grab the profile info, we can create a simple python script that will receive a shortcode and return the corresponding username. This script will be called from a shell script, which will fetch the posts from the profile. This approach may seem counterintuitive at first, because we could do everything ourselves from inside the python script. It seems better to do this way—call instaloader from the shell script—because it is what worked best in the past, in terms of reliability. Let’s get to it, then:

#!/usr/bin/env python

import argparse
import instaloader

parser = argparse.ArgumentParser()
parser.add_argument('shortcode')
args = parser.parse_args()

I = instaloader.Instaloader()
post = instaloader.Post.from_shortcode(I.context, args.shortcode)
print(post.owner_profile.username)

Now, we’ll create a shell script to fetch the username for each one of the shortcodes, and then fetch the data:

#!/bin/bash

PROJECT=~/sci/100daysofpractice-dataset
PYTHON=$PROJECT/venv/bin/python
SRC=$PROJECT/src
GET_USERNAME="$PYTHON $SRC/get-username.py"
SHORTCODES=$PROJECT/shortcodes/shortcodes-uniq.txt
PROFILES=$PROJECT/profiles
CSV=$PROFILES/shortcode-username.csv

while read SHORTCODE; do
    USERNAME=$($GET_USERNAME $SHORTCODE)
    PAIR=$SHORTCODE,$USERNAME
    echo $PAIR
    echo $PAIR >> $CSV
    instaloader $USERNAME
done <$SHORTCODES

This file will then download the necessary data!

Let’s Make(file) it more replicable!

Hello! I’m Eglur, and this is my first authored post here (all the previous ones were also written by me, but this is the first with my name, so I felt it was appropriate to introduce myself). Nice to meet you! How have you been?

Now where were we…? Oh yes, replicability! Here at Bits4Waves we take Sciencing very seriously! The most important premise is that we have to have massive fun in the process! But this does not mean that we shouldn’t follow some guidelines…

One very important component of the Scientific Method is that of replicability: other people may be able to obtain the same results if they apply the same methodology. The course Principles, Statistical and Computational Tools for Reproducible Data Science gives a very useful practical tip: avoid doing things manually; instead, create scripts for everything.

We’ve been running some commands manually, and now is a good time to create a “script” for them. Specifically, we’ll create a Makefile to generate the shortcodes.

The Makefile needs a recipe to obtain each target. Let’s review how we got each one of the targets:

  1. shortcodes-orig.txt: used a dedicated Makefile
  2. shortcodes-sort.txt: ran commands manually
  3. shortcodes-uniq.txt: ran commands manually
  4. shortcodes-test.txt: ran commands manually

Nothing like getting some perspective, huh?! It looks like we started things well, and… derailed a little bit afterwards. Nothing to worry, though! Let’s fix it right away!

As we already have a Makefile, it seems natural to use it—we just have to include the remaining targets—shortcodes-sort.txt, shortcodes-test.txt, and shortcodes-uniq.txt.

But some things changed after the Makefile was created—the plot thickens:

  • the shortcode- files earned the right to have their own folder shortcodes/
  • the original file was renamed from shortcode.txt to shortcode-orig.txt (because OCD, that’s why :-).

Therefore, we’ll have to account for these changes while dealing with reconciling past, present and near future.

Practically, we should have the Makefile in its proper context. Let’s move it to the shortcodes/ folder: (We’ll not use a script for this, but document it here, because this is a structural change, that should really be done once—meaning, it doesn’t deserve a script of its own… Please share your thoughts in the comments below!)

PROJECT=~/sci/100daysofpractice-dataset
pushd $PROJECT
git mv Makefile shortcodes/

We have to make some accomodations for the new place inside the Makefile. First, it needs the correct Python virtual environment. Let’s get the appropriate command for that.

PYTHON=../venv/bin/python

Now, the command inside the Makefile is not correct, we need to fix it:

instaloader --login ${IG_USER} --no-profile-pic --no-pictures --no-videos --no-captions "#100daysofpractice"

To get the shortcodes, we used the script get-shortcodes.py. Let’s fix that:

SRC=../src

and

$(PYTHON) $(SRC)/get-shortcodes.py

The script get-shortcodes.py is not currently accomodating for he OCD, as it creates the file shortcodes.txt instead of shortcodes-orig.txt:

import instaloader import time import os

I = instaloader.Instaloader()
I.interactive_login(os.getenv('IG_USER'))
query = instaloader.Hashtag.from_name(I.context, '100daysofpractice')
k = 1
for post in query.get_all_posts():
    print(k)
    shortcode = post.shortcode
    print(shortcode)
    with open('shortcodes.txt', 'a') as file_object:
        file_object.write(shortcode + '\n')
    time.sleep(1)
    k += 1

Let’s fix that…

import instaloader
import time
import os

I = instaloader.Instaloader()
I.interactive_login(os.getenv('IG_USER'))
query = instaloader.Hashtag.from_name(I.context, '100daysofpractice')
k = 1
for post in query.get_all_posts():
    print(k)
    shortcode = post.shortcode
    print(shortcode)
    with open('shortcodes-orig.txt', 'a') as file_object:
        file_object.write(shortcode + '\n')
    time.sleep(1)
    k += 1

Done!

And I think we’ll call it a day! My dinner is getting colder here LOL

See you soon! Take care!

Let’s Make(file) it more replicable! Part 2

Hello back! So, to make things more replicable, yesterday we worked on the Makefile. Today we’ll continue this work!

In retrospect, I realized it’s not the best practice to set the Python virtual environment in the Makefile, as not every user may want to do that (or maybe not specifically that way.) Let’s start fixing that:

ifndef IG_USER
$(error IG_USER is not set)
endif

PYTHON=../venv/bin/python    # <-- we'll change this
SRC=../src

all: shortcodes

shortcodes:
	$(PYTHON) $(SRC)/get-shortcodes.py

clean:
	rm -rf \#100daysofpractice/

becomes

ifndef IG_USER
$(error IG_USER is not set)
endif

PYTHON=python
SRC=../src

all: shortcodes

shortcodes:
	$(PYTHON) $(SRC)/get-shortcodes.py

clean:
	rm -rf \#100daysofpractice/

It would also be nice to

  • [X] use variables for filenames
  • [X] change the name of the target to reflect the output filename
  • [X] add target for test file; get it directly from the original
  • [X] add target for sort file
  • [X] add target for uniq file
  • [X] update the clean target to reflect the changes

Something very important is to update the instructions on how to obtain the several shortcode files.

  • [X] Use Org format for the main README
  • [X] have the instructions in the main README
  • [X] add description of the shortcodes
  • [X] add quick installation instructions
  • [X] add a usage example
  • [X] do not set environment variables for Instagram user and password

Well, that took a while… If you want to see the results, they’re (mainly) in the project’s main README file. You can also open the hood and look at the list of commits for all that was done!

All I can say is that it was really humbling to test all the changes below in a clean new clone… But now I’m pretty sure that anyone can replicate the process, which is great for Science! Cheesy, but true LOL!

See ya!

On the shoulders of giants

This expression got fame with Newton in the 1600’s, but it had been used already as early as the 1100’s¹. Here at Bits4Waves we usually don’t immediately dismiss ideas that linger for 1000 years or so—we try to learn from them, if possible! That’s why today’s activity is so gratifying…

We’ve been collecting shortcodes of the posts with the hashtag #100daysofpractice. There are 600k in total, but we could get only 50k (12 times less!).

The process to obtain them used the Python library instaloader, and it was breaking at the 50k mark.

After sharing the issue on instaloader‘s Github, one of the developers was kind enough to help. Applying some advanced wizardry, he cooked a new script using ideas and codes from related opened issues and SHAZAM: we now have 250k shortcodes! It breaks at this point, and I communicated the fact. Let’s hope it’s solvable!

Meanwhile, we have work to do:

  • [X] update the code with the new script

About the code, I had created a new branch for the new script, giving the script also a new different name. As it worked better than the previous version, it could simply replace that one. Let’s do this:

SRC=~/sci/100daysofpractice-dataset/src
pushd $SRC

git -C $SRC rm get-shortcodes.py
git -C $SRC mv get-hashtag.py get-shortcodes.py

This takes care of the renaming. Now we have to check to see if everything can work well with the new script. Let’s start from the beginning: the Makefile.

ifndef IG_USER
$(error IG_USER is not set)
endif

PYTHON=python
SRC=../src
SHORTCODES_ORIG=shortcodes-orig.txt
SHORTCODES_TEST=shortcodes-test.txt
SHORTCODES_SORT=shortcodes-sort.txt
SHORTCODES_UNIQ=shortcodes-uniq.txt

all: shortcodes-orig shortcodes-test shortcodes-sort shortcodes-uniq

shortcodes-orig:
	$(PYTHON) $(SRC)/get-shortcodes.py

shortcodes-test: $(SHORTCODES_ORIG)
	head --lines=10 $(SHORTCODES_ORIG) > $(SHORTCODES_TEST)

shortcodes-sort: $(SHORTCODES_ORIG)
	sort $(SHORTCODES_ORIG) > $(SHORTCODES_SORT)

shortcodes-uniq: $(SHORTCODES_SORT)
	uniq $(SHORTCODES_SORT) > $(SHORTCODES_UNIQ)

clean:
	rm -rf $(SHORTCODES_ORIG) $(SHORTCODES_TEST) $(SHORTCODES_SORT) $(SHORTCODES_UNIQ)

First, let’s fix some issues with Makefile:

  • [X] a fundamental problem with the Makefile: the targets must have the file extension!
  • [X] fix: typos in targets’ names
  • [X] create a link for the final file at the end
  • [X] add variable for link to final file

ifndef IG_USER
$(error IG_USER is not set)
endif

PYTHON=python
SRC=../src
GET_SHORTCODES_PY=$(SRC)/get-shortcodes.py
SHORTCODES_ORIG=shortcodes-orig.txt
SHORTCODES_TEST=shortcodes-test.txt
SHORTCODES_SORT=shortcodes-sort.txt
SHORTCODES_UNIQ=shortcodes-uniq.txt
SHORTCODES_LINK=shortcodes.txt
OBJECTS = $(SHORTCODES_ORIG) $(SHORTCODES_TEST) $(SHORTCODES_SORT) $(SHORTCODES_UNIQ)

all: $(OBJECTS)

$(SHORTCODES_ORIG): $(GET_SHORTCODES_PY)
	$(PYTHON) $(GET_SHORTCODES_PY)

$(SHORTCODES_TEST): $(SHORTCODES_ORIG)
	head --lines=10 $(SHORTCODES_ORIG) > $(SHORTCODES_TEST)

$(SHORTCODES_SORT): $(SHORTCODES_ORIG)
	sort $(SHORTCODES_ORIG) > $(SHORTCODES_SORT)

$(SHORTCODES_UNIQ): $(SHORTCODES_SORT)
	uniq $(SHORTCODES_SORT) > $(SHORTCODES_UNIQ)
	ln --symbolic $(SHORTCODES_UNIQ) $(SHORTCODES_LINK)

clean:
	rm -rf $(OBJECTS) $(SHORTCODES_LINK)

Much better!

Now, it would be nice to

  • [X] unify the old and new shortcodes into a single file

Done!

Finally, let’s make use of all the wizardry we got access to, and try and continue downloading from 250k onwards.

We’ll manually change the session file to make total_index point to 250k. OK, that’s done! Now let’s make it and wait for the results!

See ya!

#100daysofpractice: 450k shortcodes (and post data!)

Hello!

After a slight change in the strategy used to obtain the shortcodes, I was able to fetch way more data from the post, including

  • user who posted it
  • date when it was posted
  • number of likes, and
  • hashtags (completes post caption),

among others. This data will have to be anonymized before it is released, so some information (like the user who posted it) will have to be processed (e.g., replace each actual username for a randomly generated unique code).

But before that, there’s some housekeeping to do. This time I could get the “resuming” function of the script working. But it was somewhat confusing.

It all began when I saw that the code got the shortcodes with post.shortcode. I wondered: what else could be there—without having to make a new request to Instagram (and possibly get a time out)? Then I found out that Python’s function vars could give all the data currently in post. Executing postdict = vars(post) put all this data in the dictionary postdict.

I then needed a way to save it to a file in a convenient format—like JSON. Running json.dumps(postdict) gave a JSON string extracted from the post’s dictionary. The key _context from postdict had to be removed because json couldn’t parse it. It contains the string representation for the instance of the post. This is internal Python code data, and not data about the post, so it can be safely ignored.

In the end, the change was from this line:

print(post.shortcode, file=file)

to these ones:

postdict = vars(post)
del postdict['_context'] # json can't process this key (and we don't need it)
print(json.dumps(postdict), file=file)

These changes felt like deserving a new script, so =100daysofpractice-dataset/src/get-posts.py was created, along with 100daysofpractice-dataset/posts/Makefile (new folder too!).

I started the process using 100daysofpractice-dataset/posts/Makefile. It broke (Error 400) after successfully fetching 300k out of 600k posts. In the past, I wasn’t able to resume the process—meaning restart it and fetch from the 300,001 post onwards. But this time I wanted to gave it a try.

To be able to follow the whole process, I ran the script using the debugger pdb inside Emacs. From inside 100daysofpractice-dataset/src/ I ran M-x pdb then ../venv/bin/python -m pdb get-posts.py.

As I had ran it before using the makefile in 100daysofpractice-dataset/posts/Makefile, the file containing the resume data was there. This time I would ran the debugger from inside another folder, 100daysofpractice-dataset/src/, so I copied the resume file there.

At first run, it didn’t resume from where it broke. Restarting and following the steps, I could see that it was looking for a resume file with a different name from the one I copied there. The difference was in the “magic” part of the filename, that was obtained from the code:

format_path=lambda magic: f"resume_info_{magic}.json.xz",

The actual resume file had the name resume_info_dsk7_D2b.json.xz: its “magical” part was dsk7_D2b. The program expected the magical part jQQhVmW0. I renamed the resume file to resume_info_jQQhVmW0.json.xz, so it would contain the expected magical part. Then the program accepted it and started the resuming routine.

The program, when not resuming, accesses the shortcodes roughly by date in descending order. This usually means that the first posts it accesses are from the current day.

For a quick check, I looked at the first post that the resuming routing accessed, and it was from the same day as the last one that it accessed before breaking. This suggested that it was indeed resuming the process, and not starting it from beginning.

After downloading 60k posts, the program ended with the following output:

Iteration complete, deleted resume information file resume_info_jQQhVmW0.json.xz.

“Iteration complete”: as it was iterating through the iterator containing the posts, this suggested that it had processed all the posts with the hashtag #100daysofpractice. But if you had been looking at the numbers, maybe you got the same feeling: it doesn’t add up!

I just went to the Instagram website and inserting “#100daysofpractice” in the search bar. Instagram then says that there are 614,795 posts with the hashtag #100daysofpractice. The program downloaded 449,851 posts. The difference is 164,944. Maybe Instagram only counts the hashtags at the time of posting, not editing and removing a hashtag or deleting a post? I asked the developers if they know anything about it.

Meanwhile, I want to explore the data. First thing I want to know is how it spreads over time. This may shed some light and maybe help clarify if the possibly missing 160k posts are due to problems in the download process.

I expect to see a smooth spreading over time. If there are some “gaps” in certain periods, this could suggest that the program skipped those periods and didn’t download those posts.

But this will be on a post of its own!

See ya!

Dataset release: 450,000 Instagram posts with the hashtag #100daysofpractice

whose most prominent words are 100daysofpractice, music, practice, and violin.

Hello! I’m glad to announce the release of 100daysofpractice-dataset!

Below you’ll find the description of the dataset.

Data from Instagram posts with the hashtag #100daysofpractice.

The file posts.zip (40 MB) contains data from 450,000 Instagram posts with the hashtag #100daysofpractice.

posts.zip contains two files:

  • posts.csv, which contains the posts data, and
  • metadata.txt, which contains the details about its generation.

posts.csv is in the CSV format (everything quoted with ", separated by ,). The fields therein and a short explanation are:

  • post-id: post’s unique ID
  • shortcode: a short string that can be used to access the post in a web browser (see below for instructions)
  • taken_at_timestamp: the date when it was posted
  • owner-id: a unique ID representing the user who posted it; this data was anonymized for privacy reasons, therefore this is not the real user ID
  • is_video: 1 if it is a video, 0 otherwise
  • edge_liked_by-count: number of likes
  • edge_media_to_comment-count: number of comments
  • video_view_count: number of views
  • comments_disabled: 1 if comments were disabled, 0 otherwise
  • __typename: GraphImage if it is an image post, GraphVideo for video one, or GraphSidecar for a post with more than one media
  • hashtags: hashtags from the comments

To access the post in a web browser using a shortcode, just paste it after https://www.instagram.com/p/. For instance the first post with the hashtag #100daysofpractice has the shortcode BTrwiUuh8vV. Hence you may access it with the link https://www.instagram.com/p/BTrwiUuh8vV. It was posted by the creator of the hashtag, @violincase, the violin virtuosa Hilary Hahn.

Create your website with WordPress.com
Get started
%d bloggers like this: