The most critical features of a new social network for users fed up with Musk and Twitter are as follows:
- Import Twitter's archive.zip file
- Easy as possible to sign up
- Similar if not identical user features
Less critical, but definitely helpful, features of the platform:
- Ethically monetised and moderated
- Make use of AI to help identify problematic content
- Blue tick with the use of Onfido or SMART identity services
In this post, we'll focus on the first feature: importing Twitter's archive.zip file.
If you'd rather not read my waffle and get straight to the juicy stuff, here is the repo I published this script to:
lukeocodes / twitter-archive-extractor
Turn the data in your Twitter archive download into JSON.
Twitter Archive Extractor
This Go program extracts and processes JavaScript files from a ZIP archive, specifically targeting files in the /data directory. It replaces certain window. assignments with var data = and outputs the JSON representation of the processed data.
How It Works
- The program opens a ZIP file specified as a command-line argument.
- It scans for JavaScript files within the /data directory.
- For each JavaScript file, it replaces any window. assignments (e.g., window.__THAR_CONFIG = {) with var data =.
- It then uses the goja JavaScript interpreter to execute the modified script and extract the data variable.
- The extracted data is marshaled into JSON format and output to the console.
Todo
- Select a destination for export
- separate files? via an ORM to a database?
- Release to go.dev
Prerequisites
Ensure you have Go installed on your system.
Setup
- Clone this repository or download the source files.
- Navigate to the project directory.
…
The file
Twitter haven't made your data all that easy to obtain. It's great that they give you access to it (legally, they have to), but the format is crap.
It actually comes as a mini web archive, with all your data stuck inside JavaScript files. It's more of a web app than a convenient store of data.
When you open up the Your archive.html file, you get something like this:
Note: I made the decision pretty early on to build using Next.js for the site, and Go and GraphQL for the backend.
So, what do you do when your data isn't structured data?
Well, you parse it.
Creating a basic Go script
Head on over to the official docs on how to get started with Go, and set up your project directory.
We're going to hack this process together. It seems to be one of the most important features for attracting people who feel too attached to TwitterX.
The first step is to create a main.go file. In this file we'll GO (hah) and do some STUFF:
- os.Args: This is a slice that holds command-line arguments. os.Args[0] is the program's name, and os.Args[1] is the first argument passed to the program.
- Argument check: The function checks if at least one argument is provided. If not, it prints a message asking for a path.
- run function: This function simply prints the path passed to it, for now.
package main
import (
"fmt"
"os"
)
func run(path string) {
fmt.Println("Path:", path)
}
func main() {
if len(os.Args) < 2 {
fmt.Println("Please provide a path as an argument.")
return
}
path := os.Args[1]
run(path)
}
At every step, we'll run the file like so:
go run main.go twitter.zip
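If everything is wired up, you should just see the path echoed back:
Path: twitter.zip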
If you don't have a Twitter archive export, create a simple manifest.js file and give it the following JavaScript.
window.__THAR_CONFIG = {
  "userInfo" : {
    "accountId" : "1234567890",
    "userName" : "lukeocodes",
    "displayName" : "Luke ✨"
  },
};
Compress that into the twitter.zip file that we'll use throughout. Keep manifest.js inside a data/ directory in the zip (so the entry is data/manifest.js), because that's where we'll be looking later.
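If you don't have a zip tool to hand, here's a quick throwaway Go program (my own helper, not part of the extractor) that packs manifest.js into twitter.zip under data/:

package main

import (
    "archive/zip"
    "log"
    "os"
)

func main() {
    // Read the sample manifest.js we just created
    js, err := os.ReadFile("manifest.js")
    if err != nil {
        log.Fatal(err)
    }

    // Create the zip and put the file under data/, where we'll be looking later
    out, err := os.Create("twitter.zip")
    if err != nil {
        log.Fatal(err)
    }
    defer out.Close()

    zw := zip.NewWriter(out)
    w, err := zw.Create("data/manifest.js")
    if err != nil {
        log.Fatal(err)
    }
    if _, err := w.Write(js); err != nil {
        log.Fatal(err)
    }
    if err := zw.Close(); err != nil {
        log.Fatal(err)
    }
}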
Read a Zip file
The next step is to read the contents of the zip file. We want to do this as efficiently as possible, and keep the time data spends extracted on disk to a minimum.
There are also plenty of files in the zip that don't need to be extracted at all.
We'll edit the main.go file:
- Opening the ZIP file: The zip.OpenReader() function is used to open the ZIP file specified by path.
- Iterating through the files: The function loops over each file in the ZIP archive using r.File, which is a slice of zip.File. The Name property of each file is printed.
package main
import (
"archive/zip"
"fmt"
"log"
"os"
)
func run(path string) {
// Open the zip file
r, err := zip.OpenReader(path)
if err != nil {
log.Fatal(err)
}
defer r.Close()
// Iterate through the files in the zip archive
fmt.Println("Files in the zip archive:")
for _, f := range r.File {
fmt.Println(f.Name)
}
}
func main() {
// Example usage
if len(os.Args) < 2 {
log.Fatal("Please provide the path to the zip file as an argument.")
}
path := os.Args[1]
run(path)
}
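Run against the test twitter.zip from earlier and the listing is short and sweet; a real archive prints a lot more:
Files in the zip archive:
data/manifest.js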
JS only! We're hunting structured data
This archive file is seriously unhelpful. We want to check for just .js files, and only in the /data directory.
- Opening the ZIP file: The ZIP file is opened using zip.OpenReader().
- Checking the /data directory: The program iterates through the files in the ZIP archive. It uses strings.HasPrefix(f.Name, "data/") to check if the file resides in the /data directory.
- Finding .js files: The program also checks if the file has a .js extension using filepath.Ext(f.Name).
- Reading and printing contents: If a .js file is found in the /data directory, the program reads and prints its contents.
package main
import (
"archive/zip"
"fmt"
"io/ioutil"
"log"
"os"
"path/filepath"
"strings"
)
func readFile(file *zip.File) {
// Open the file inside the zip
rc, err := file.Open()
if err != nil {
log.Fatal(err)
}
defer rc.Close()
// Read the contents of the file
contents, err := ioutil.ReadAll(rc) // deprecated? :/
if err != nil {
log.Fatal(err)
}
// Print the contents
fmt.Printf("Contents of %s:\n", file.Name)
fmt.Println(string(contents))
}
func run(path string) {
// Open the zip file
r, err := zip.OpenReader(path)
if err != nil {
log.Fatal(err)
}
defer r.Close()
// Iterate through the files in the zip archive
fmt.Println("JavaScript files in the zip archive:")
for _, f := range r.File {
// Use filepath.Ext to check the file extension
if strings.HasPrefix(f.Name, "data/") && strings.ToLower(filepath.Ext(f.Name)) == ".js" {
readFile(f)
return // Exit after processing the first .js file so we don't end up printing a gazillion lines when testing
}
}
}
func main() {
// Example usage
if len(os.Args) < 2 {
log.Fatal("Please provide the path to the zip file as an argument.")
}
path := os.Args[1]
run(path)
}
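Quick aside on that "deprecated? :/" comment: ioutil.ReadAll was indeed deprecated in Go 1.16 in favour of io.ReadAll, which behaves identically. If you'd rather keep your tooling quiet, the swap inside readFile is tiny:

// import "io" instead of "io/ioutil", then:
contents, err := io.ReadAll(rc) // same behaviour as ioutil.ReadAll
if err != nil {
    log.Fatal(err)
}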
Parse the JS! We want that data
We've found the structured data. Now we need to parse it. The good news is there are existing packages for using JavaScript inside Go. We'll be using goja.
If you're on this section, already familiar with goja, and you've seen the contents of these files, you may see we're going to have errors in our future.
Install goja:
go get github.com/dop251/goja
Now we're going to edit the main.go file to do the following:
- Parsing with goja: The goja.New() function creates a new JavaScript runtime, and vm.RunString() runs the JavaScript code within that runtime.
- Handle errors in parsing.
package main
import (
"archive/zip"
"fmt"
"io/ioutil"
"log"
"os"
"path/filepath"
"strings"
"github.com/dop251/goja"
)
func readFile(file *zip.File) {
// Open the file inside the zip
rc, err := file.Open()
if err != nil {
log.Fatal(err)
}
defer rc.Close()
// Read the contents of the file
contents, err := ioutil.ReadAll(rc) // deprecated? :/
if err != nil {
log.Fatal(err)
}
// Parse the JavaScript file using goja
vm := goja.New()
_, err = vm.RunString(string(contents))
if err != nil {
log.Fatalf("Error parsing JS file: %v", err)
}
fmt.Printf("Parsed JavaScript file: %s\n", file.Name)
}
func run(path string) {
// Open the zip file
r, err := zip.OpenReader(path)
if err != nil {
log.Fatal(err)
}
defer r.Close()
// Iterate through the files in the zip archive
fmt.Println("JavaScript files in the zip archive:")
for _, f := range r.File {
// Use filepath.Ext to check the file extension
if strings.HasPrefix(f.Name, "data/") && strings.ToLower(filepath.Ext(f.Name)) == ".js" {
readFile(f)
return // Exit after processing the first .js file so we don't end up printing a gazillion lines when testing
}
}
}
func main() {
// Example usage
if len(os.Args) < 2 {
log.Fatal("Please provide the path to the zip file as an argument.")
}
path := os.Args[1]
run(path)
}
SURPRISE. window is not defined might be a familiar error. Basically, goja gives us a bare ECMAScript runtime. window is browser context and sadly unavailable.
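Here's a minimal reproduction, just to show it isn't our zip handling at fault:

package main

import (
    "fmt"

    "github.com/dop251/goja"
)

func main() {
    vm := goja.New()

    // No browser here, so there's no global window object to assign properties to
    _, err := vm.RunString(`window.__THAR_CONFIG = { "userInfo": {} };`)

    // Prints a ReferenceError telling us window is not defined
    fmt.Println(err)
}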
ACTUALLY Parse the JS
I went through a few issues at this point, including not being able to return data because it's a top-level JS file.
Long story short, we need to modify the contents of the files before loading them into the runtime.
Let's modify the main.go file:
- reConfig: A regex that matches any assignment of the form window.someVariable = { and replaces it with var data = {.
- reArray: A regex that matches any nested assignment of the form window.someObject.someNested.someArray = [ and replaces it with var data = [.
- Extracting data: After running the script, we use vm.Get("data") to retrieve the value of the data variable from the JavaScript context.
package main
import (
"archive/zip"
"fmt"
"io/ioutil"
"log"
"os"
"path/filepath"
"regexp"
"strings"
"github.com/dop251/goja"
)
func readFile(file *zip.File) {
// Open the file inside the zip
rc, err := file.Open()
if err != nil {
log.Fatal(err)
}
defer rc.Close()
// Read the contents of the file
contents, err := ioutil.ReadAll(rc)
if err != nil {
log.Fatal(err)
}
// Regular expressions to replace specific patterns
reConfig := regexp.MustCompile(`window\.\w+\s*=\s*{`)
reArray := regexp.MustCompile(`window\.\w+\.\w+\.\w+\s*=\s*\[`)
// Replace patterns in the content
processedContents := reConfig.ReplaceAllStringFunc(string(contents), func(s string) string {
return "var data = {"
})
processedContents = reArray.ReplaceAllStringFunc(processedContents, func(s string) string {
return "var data = ["
})
// Parse the JavaScript file using goja
vm := goja.New()
_, err = vm.RunString(processedContents)
if err != nil {
log.Fatalf("Error parsing JS file: %v", err)
}
// Retrieve the value of the 'data' variable from the JavaScript context
value := vm.Get("data")
if value == nil {
log.Fatalf("No data variable found in the JS file")
}
// Output the parsed data
fmt.Printf("Processed JavaScript file: %s\n", file.Name)
fmt.Printf("Data extracted: %v\n", value.Export())
}
func run(path string) {
// Open the zip file
r, err := zip.OpenReader(path)
if err != nil {
log.Fatal(err)
}
defer r.Close()
// Iterate through the files in the zip archive
for _, f := range r.File {
// Check if the file is in the /data directory and has a .js extension
if strings.HasPrefix(f.Name, "data/") && strings.ToLower(filepath.Ext(f.Name)) == ".js" {
readFile(f)
return // Exit after processing the first .js file so we don't end up printing a gazillion lines when testing
}
}
}
func main() {
// Example usage
if len(os.Args) < 2 {
log.Fatal("Please provide the path to the zip file as an argument.")
}
path := os.Args[1]
run(path)
}
Hurrah. Assuming I didn't muck up the copypaste into this post, you should now see a rather ugly print of the struct data from Go.
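With the test twitter.zip from earlier, that ugly print looks something like this (Go's default formatting for the exported map):
Processed JavaScript file: data/manifest.js
Data extracted: map[userInfo:map[accountId:1234567890 displayName:Luke ✨ userName:lukeocodes]]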
JSON would be nice
Edit the main.go file to marshal the output to JSON.
- Use value.Export() to get the data out of the goja value as a Go-native type.
- Use json.MarshalIndent() for pretty-printed JSON (use json.Marshal if you want to minify the output).
package main
import (
"archive/zip"
"encoding/json"
"fmt"
"io/ioutil"
"log"
"os"
"path/filepath"
"regexp"
"strings"
"github.com/dop251/goja"
)
func readFile(file *zip.File) {
// Open the file inside the zip
rc, err := file.Open()
if err != nil {
log.Fatal(err)
}
defer rc.Close()
// Read the contents of the file
contents, err := ioutil.ReadAll(rc) // deprecated :/
if err != nil {
log.Fatal(err)
}
// Regular expressions to replace specific patterns
reConfig := regexp.MustCompile(`window\.\w+\s*=\s*{`)
reArray := regexp.MustCompile(`window\.\w+\.\w+\.\w+\s*=\s*\[`)
// Replace patterns in the content
processedContents := reConfig.ReplaceAllStringFunc(string(contents), func(s string) string {
return "var data = {"
})
processedContents = reArray.ReplaceAllStringFunc(processedContents, func(s string) string {
return "var data = ["
})
// Parse the JavaScript file using goja
vm := goja.New()
_, err = vm.RunString(processedContents)
if err != nil {
log.Fatalf("Error parsing JS file: %v", err)
}
// Retrieve the value of the 'data' variable from the JavaScript context
value := vm.Get("data")
if value == nil {
log.Fatalf("No data variable found in the JS file")
}
// Convert the data to a Go-native type
data := value.Export()
// Marshal the Go-native type to JSON
jsonData, err := json.MarshalIndent(data, "", " ")
if err != nil {
log.Fatalf("Error marshalling data to JSON: %v", err)
}
// Output the JSON data
fmt.Println(string(jsonData))
}
func run(zipFilePath string) {
// Open the zip file
r, err := zip.OpenReader(zipFilePath)
if err != nil {
log.Fatal(err)
}
defer r.Close()
// Iterate through the files in the zip archive
for _, f := range r.File {
// Check if the file is in the /data directory and has a .js extension
if strings.HasPrefix(f.Name, "data/") && strings.ToLower(filepath.Ext(f.Name)) == ".js" {
readFile(f)
return // Exit after processing the first .js file
}
}
}
func main() {
// Example usage
if len(os.Args) < 2 {
log.Fatal("Please provide the path to the zip file as an argument.")
}
zipFilePath := os.Args[1]
run(zipFilePath)
}
That's it!
go run main.go twitter.zip
{
 "userInfo": {
  "accountId": "1234567890",
  "displayName": "Luke ✨",
  "userName": "lukeocodes"
 }
}
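The JSON goes to stdout, so if you want to keep it, redirect it into a file (pick any name you like):
go run main.go twitter.zip > manifest.json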
Open source
I'll be open sourcing a lot of this work so that others who want to parse the data from the archive can store it how they like.
Top comments (3)
I'd say these are all part of the problem with Twitter in the first place. Monetisation cannot be ethical. AI emphasises any bias it doesn't literally introduce. And blue tick status, whether paid for, assigned by nepotism, or based on the perceived "notability" of the person, encourages class division.
There are plenty of alternatives to Twitter out there right now, some of which are more equal than others, sure, but we don't need Friends Reunited any more. We don't need Geocities. We don't need MySpace.
We can make something better.
Yes, but I am no Elon Musk.
Ethically monetised
non-profit
Ethically moderated
community moderation tools like community notes, but better
Make use of AI to help identify problematic content
They don't use AI for this. AI can be used very effectively to support moderation, by flagging potentially problematic content before users report it.
Blue tick with the use of Onfido or SMART identity services
Blue ticks for reach are gross. Blue ticks free to everyone, to verify someone is who they say they are, keep people responsible for their actions/posts. I don't suggest forcing people to externally identify with their ID, because that is also gross and not equitable.
this is great. would love to hear your thoughts on warpcast and web3 alternatives if you have any!