The most critical features of a new social network for users fed up with Musk and Twitter are as follows:
- Import Twitter's archive.zip file
- Easy as possible to sign up
- Similar if not identical user features
Less critical, but definitely helpful, features of the platform:
- Ethically monetised and moderated
- Make use of AI to help identify problematic content
- Blue tick with the use of Onfido or SMART identity services
In this post, we'll focus on the first feature: importing Twitter's archive.zip file.
If you'd rather not read my waffle and get straight to the juicy stuff, here is the repo I published this script to:
lukeocodes / twitter-archive-extractor
Turn the data in your Twitter archive download into JSON.
Twitter Archive Extractor
This Go program extracts and processes JavaScript files from a ZIP archive, specifically targeting files in the /data directory. It replaces certain window. assignments with var data = and outputs the JSON representation of the processed data.
How It Works
- The program opens a ZIP file specified as a command-line argument.
- It scans for JavaScript files within the /data directory.
- For each JavaScript file, it replaces any window. assignments (e.g., window.__THAR_CONFIG = {) with var data =.
- It then uses the goja JavaScript interpreter to execute the modified script and extract the data variable.
- The extracted data is marshaled into JSON format and output to the console.
Todo
- Select a destination for export
- separate files? via an ORM to a database?
- Release to go.dev
Prerequisites
Ensure you have Go installed on your system.
Setup
- Clone this repository or download the source files.
- Navigate to the project directory.
…
The file
Twitter haven't made your data all that easy to obtain. It's great that they give you access to it (legally, they have to), but the format is crap.
It actually comes as a mini web archive, with all your data stuck inside JavaScript files. It's more of a web app than a convenient store of data.
When you open up the Your archive.html file, you get something like this:
Note: I made the decision pretty early on to build using Next.js for the site, and Go and GraphQL for the backend.
So, what do you do when your data isn't structured data?
Well, you parse it.
Creating a basic Go script
Head on over to the official docs on how to get started with Go, and set up your project directory.
We're going to hack this process together. It seems to be one of the most important features for attracting people who feel too attached to TwitterX.
The first step is to create a main.go file. In this file we'll GO (hah) and do some STUFF:
- os.Args: This is a slice that holds command-line arguments. os.Args[0] is the program's name, and os.Args[1] is the first argument passed to the program.
- Argument check: The function checks if at least one argument is provided. If not, it prints a message asking for a path.
- run function: This function simply prints the path passed to it, for now.
package main
import (
"fmt"
"os"
)
func run(path string) {
fmt.Println("Path:", path)
}
func main() {
if len(os.Args) < 2 {
fmt.Println("Please provide a path as an argument.")
return
}
path := os.Args[1]
run(path)
}
At every step, we'll run the file like so:
go run main.go twitter.zip
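If everything is wired up, you should just see the path echoed back:
Path: twitter.zip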
If you don't have a Twitter archive export, create a simple manifest.js file and give it the following JavaScript.
window.__THAR_CONFIG = {
  "userInfo" : {
    "accountId" : "1234567890",
    "userName" : "lukeocodes",
    "displayName" : "Luke ✨"
  },
};
Compress that into the twitter.zip file that we'll use throughout. Keep manifest.js inside a data/ directory in the zip (so the entry is data/manifest.js), because that's where we'll be looking later.
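If you don't have a zip tool to hand, here's a quick throwaway Go program (my own helper, not part of the extractor) that packs manifest.js into twitter.zip under data/:

package main

import (
    "archive/zip"
    "log"
    "os"
)

func main() {
    // Read the sample manifest.js we just created
    js, err := os.ReadFile("manifest.js")
    if err != nil {
        log.Fatal(err)
    }

    // Create the zip and put the file under data/, where we'll be looking later
    out, err := os.Create("twitter.zip")
    if err != nil {
        log.Fatal(err)
    }
    defer out.Close()

    zw := zip.NewWriter(out)
    w, err := zw.Create("data/manifest.js")
    if err != nil {
        log.Fatal(err)
    }
    if _, err := w.Write(js); err != nil {
        log.Fatal(err)
    }
    if err := zw.Close(); err != nil {
        log.Fatal(err)
    }
}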
Read a Zip file
The next step is to read the contents of the zip file. We want to do this as efficiently as possible, and keep the time data spends extracted on disk to a minimum.
There are also plenty of files in the zip that don't need to be extracted at all.
We'll edit the main.go file:
- Opening the ZIP file: The zip.OpenReader() function is used to open the ZIP file specified by path.
- Iterating through the files: The function loops over each file in the ZIP archive using r.File, which is a slice of zip.File. The Name property of each file is printed.
package main
import (
"archive/zip"
"fmt"
"log"
"os"
)
func run(path string) {
// Open the zip file
r, err := zip.OpenReader(path)
if err != nil {
log.Fatal(err)
}
defer r.Close()
// Iterate through the files in the zip archive
fmt.Println("Files in the zip archive:")
for _, f := range r.File {
fmt.Println(f.Name)
}
}
func main() {
// Example usage
if len(os.Args) < 2 {
log.Fatal("Please provide the path to the zip file as an argument.")
}
path := os.Args[1]
run(path)
}
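Run against the test twitter.zip from earlier and the listing is short and sweet; a real archive prints a lot more:
Files in the zip archive:
data/manifest.js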
JS only! We're hunting structured data
This archive file is seriously unhelpful. We want to check for just .js files, and only in the /data directory.
- Opening the ZIP file: The ZIP file is opened using zip.OpenReader().
- Checking the /data directory: The program iterates through the files in the ZIP archive. It uses strings.HasPrefix(f.Name, "data/") to check if the file resides in the /data directory.
- Finding .js files: The program also checks if the file has a .js extension using filepath.Ext(f.Name).
- Reading and printing contents: If a .js file is found in the /data directory, the program reads and prints its contents.
package main
import (
"archive/zip"
"fmt"
"io/ioutil"
"log"
"os"
"path/filepath"
"strings"
)
func readFile(file *zip.File) {
// Open the file inside the zip
rc, err := file.Open()
if err != nil {
log.Fatal(err)
}
defer rc.Close()
// Read the contents of the file
contents, err := ioutil.ReadAll(rc) // deprecated? :/
if err != nil {
log.Fatal(err)
}
// Print the contents
fmt.Printf("Contents of %s:\n", file.Name)
fmt.Println(string(contents))
}
func run(path string) {
// Open the zip file
r, err := zip.OpenReader(path)
if err != nil {
log.Fatal(err)
}
defer r.Close()
// Iterate through the files in the zip archive
fmt.Println("JavaScript files in the zip archive:")
for _, f := range r.File {
// Use filepath.Ext to check the file extension
if strings.HasPrefix(f.Name, "data/") && strings.ToLower(filepath.Ext(f.Name)) == ".js" {
readFile(f)
return // Exit after processing the first .js file so we don't end up printing a gazillion lines when testing
}
}
}
func main() {
// Example usage
if len(os.Args) < 2 {
log.Fatal("Please provide the path to the zip file as an argument.")
}
path := os.Args[1]
run(path)
}
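Quick aside on that "deprecated? :/" comment: ioutil.ReadAll was indeed deprecated in Go 1.16 in favour of io.ReadAll, which behaves identically. If you'd rather keep your tooling quiet, the swap inside readFile is tiny:

// import "io" instead of "io/ioutil", then:
contents, err := io.ReadAll(rc) // same behaviour as ioutil.ReadAll
if err != nil {
    log.Fatal(err)
}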
Parse the JS! We want that data
We've found the structured data. Now we need to parse it. The good news is there are existing packages for using JavaScript inside Go. We'll be using goja.
If you're on this section, already familiar with goja, and you've seen the contents of these files, you may see we're going to have errors in our future.
Install goja:
go get github.com/dop251/goja
Now we're going to edit the main.go file to do the following:
- Parsing with goja: The goja.New() function creates a new JavaScript runtime, and vm.RunString() runs the JavaScript code within that runtime.
- Handle errors in parsing.
package main
import (
"archive/zip"
"fmt"
"io/ioutil"
"log"
"os"
"path/filepath"
"strings"
"github.com/dop251/goja"
)
func readFile(file *zip.File) {
// Open the file inside the zip
rc, err := file.Open()
if err != nil {
log.Fatal(err)
}
defer rc.Close()
// Read the contents of the file
contents, err := ioutil.ReadAll(rc) // deprecated? :/
if err != nil {
log.Fatal(err)
}
// Parse the JavaScript file using goja
vm := goja.New()
_, err = vm.RunString(string(contents))
if err != nil {
log.Fatalf("Error parsing JS file: %v", err)
}
fmt.Printf("Parsed JavaScript file: %s\n", file.Name)
}
func run(path string) {
// Open the zip file
r, err := zip.OpenReader(path)
if err != nil {
log.Fatal(err)
}
defer r.Close()
// Iterate through the files in the zip archive
fmt.Println("JavaScript files in the zip archive:")
for _, f := range r.File {
// Use filepath.Ext to check the file extension
if strings.HasPrefix(f.Name, "data/") && strings.ToLower(filepath.Ext(f.Name)) == ".js" {
readFile(f)
return // Exit after processing the first .js file so we don't end up printing a gazillion lines when testing
}
}
}
func main() {
// Example usage
if len(os.Args) < 2 {
log.Fatal("Please provide the path to the zip file as an argument.")
}
path := os.Args[1]
run(path)
}
SURPRISE. window is not defined might be a familiar error. Basically, goja gives us a bare ECMAScript runtime. window is browser context and sadly unavailable.
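Here's a minimal reproduction, just to show it isn't our zip handling at fault:

package main

import (
    "fmt"

    "github.com/dop251/goja"
)

func main() {
    vm := goja.New()

    // No browser here, so there's no global window object to assign properties to
    _, err := vm.RunString(`window.__THAR_CONFIG = { "userInfo": {} };`)

    // Prints a ReferenceError telling us window is not defined
    fmt.Println(err)
}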
ACTUALLY Parse the JS
I went through a few issues at this point, including not being able to return data because it's a top-level JS file.
Long story short, we need to modify the contents of the files before loading them into the runtime.
Let's modify the main.go file:
- reConfig: A regex that matches any assignment of the form window.someVariable = { and replaces it with var data = {.
- reArray: A regex that matches any nested assignment of the form window.someObject.someNested.someArray = [ and replaces it with var data = [.
- Extracting data: After running the script, we use vm.Get("data") to retrieve the value of the data variable from the JavaScript context.
package main
import (
"archive/zip"
"fmt"
"io/ioutil"
"log"
"os"
"path/filepath"
"regexp"
"strings"
"github.com/dop251/goja"
)
func readFile(file *zip.File) {
// Open the file inside the zip
rc, err := file.Open()
if err != nil {
log.Fatal(err)
}
defer rc.Close()
// Read the contents of the file
contents, err := ioutil.ReadAll(rc)
if err != nil {
log.Fatal(err)
}
// Regular expressions to replace specific patterns
reConfig := regexp.MustCompile(`window\.\w+\s*=\s*{`)
reArray := regexp.MustCompile(`window\.\w+\.\w+\.\w+\s*=\s*\[`)
// Replace patterns in the content
processedContents := reConfig.ReplaceAllStringFunc(string(contents), func(s string) string {
return "var data = {"
})
processedContents = reArray.ReplaceAllStringFunc(processedContents, func(s string) string {
return "var data = ["
})
// Parse the JavaScript file using goja
vm := goja.New()
_, err = vm.RunString(processedContents)
if err != nil {
log.Fatalf("Error parsing JS file: %v", err)
}
// Retrieve the value of the 'data' variable from the JavaScript context
value := vm.Get("data")
if value == nil {
log.Fatalf("No data variable found in the JS file")
}
// Output the parsed data
fmt.Printf("Processed JavaScript file: %s\n", file.Name)
fmt.Printf("Data extracted: %v\n", value.Export())
}
func run(path string) {
// Open the zip file
r, err := zip.OpenReader(path)
if err != nil {
log.Fatal(err)
}
defer r.Close()
// Iterate through the files in the zip archive
for _, f := range r.File {
// Check if the file is in the /data directory and has a .js extension
if strings.HasPrefix(f.Name, "data/") && strings.ToLower(filepath.Ext(f.Name)) == ".js" {
readFile(f)
return // Exit after processing the first .js file so we don't end up printing a gazillion lines when testing
}
}
}
func main() {
// Example usage
if len(os.Args) < 2 {
log.Fatal("Please provide the path to the zip file as an argument.")
}
path := os.Args[1]
run(path)
}
Hurrah. Assuming I didn't muck up the copypaste into this post, you should now see a rather ugly print of the struct data from Go.
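With the test twitter.zip from earlier, that ugly print looks something like this (Go's default formatting for the exported map):
Processed JavaScript file: data/manifest.js
Data extracted: map[userInfo:map[accountId:1234567890 displayName:Luke ✨ userName:lukeocodes]]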
JSON would be nice
Edit the main.go file to marshal the output to JSON.
- Use value.Export() to get the data out of the goja value as a Go-native type.
- Use json.MarshalIndent() for pretty-printed JSON (use json.Marshal if you want to minify the output).
package main
import (
"archive/zip"
"encoding/json"
"fmt"
"io/ioutil"
"log"
"os"
"path/filepath"
"regexp"
"strings"
"github.com/dop251/goja"
)
func readFile(file *zip.File) {
// Open the file inside the zip
rc, err := file.Open()
if err != nil {
log.Fatal(err)
}
defer rc.Close()
// Read the contents of the file
contents, err := ioutil.ReadAll(rc) // deprecated :/
if err != nil {
log.Fatal(err)
}
// Regular expressions to replace specific patterns
reConfig := regexp.MustCompile(`window\.\w+\s*=\s*{`)
reArray := regexp.MustCompile(`window\.\w+\.\w+\.\w+\s*=\s*\[`)
// Replace patterns in the content
processedContents := reConfig.ReplaceAllStringFunc(string(contents), func(s string) string {
return "var data = {"
})
processedContents = reArray.ReplaceAllStringFunc(processedContents, func(s string) string {
return "var data = ["
})
// Parse the JavaScript file using goja
vm := goja.New()
_, err = vm.RunString(processedContents)
if err != nil {
log.Fatalf("Error parsing JS file: %v", err)
}
// Retrieve the value of the 'data' variable from the JavaScript context
value := vm.Get("data")
if value == nil {
log.Fatalf("No data variable found in the JS file")
}
// Convert the data to a Go-native type
data := value.Export()
// Marshal the Go-native type to JSON
jsonData, err := json.MarshalIndent(data, "", " ")
if err != nil {
log.Fatalf("Error marshalling data to JSON: %v", err)
}
// Output the JSON data
fmt.Println(string(jsonData))
}
func run(zipFilePath string) {
// Open the zip file
r, err := zip.OpenReader(zipFilePath)
if err != nil {
log.Fatal(err)
}
defer r.Close()
// Iterate through the files in the zip archive
for _, f := range r.File {
// Check if the file is in the /data directory and has a .js extension
if strings.HasPrefix(f.Name, "data/") && strings.ToLower(filepath.Ext(f.Name)) == ".js" {
readFile(f)
return // Exit after processing the first .js file
}
}
}
func main() {
// Example usage
if len(os.Args) < 2 {
log.Fatal("Please provide the path to the zip file as an argument.")
}
zipFilePath := os.Args[1]
run(zipFilePath)
}
That's it!
go run main.go twitter.zip
{
 "userInfo": {
  "accountId": "1234567890",
  "displayName": "Luke ✨",
  "userName": "lukeocodes"
 }
}
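The JSON goes to stdout, so if you want to keep it, redirect it into a file (pick any name you like):
go run main.go twitter.zip > manifest.json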
Open source
I'll be open sourcing a lot of this work so that others who want to parse the data from the archive can store it how they like.
Top comments (3)
I'd say these are all part of the problem with Twitter in the first place. Monetisation cannot be ethical. AI emphasises any bias it doesn't literally introduce. And blue tick status, whether paid for, assigned by nepotism, or based on the perceived "notability" of the person, encourages class division.
There are plenty of alternatives to Twitter out there right now, some of which are more equal than others, sure, but we don't need Friends Reunited any more. We don't need Geocities. We don't need MySpace.
We can make something better.
Yes, but I am no Elon Musk.
Ethically monetised
non-profit
Ethically moderated
community moderation tools like community notes, but better
Make use of AI to help identify problematic content
They don't use AI for this. AI can be used very effectively to support moderation, by flagging potentially problematic content before users report it.
Blue tick with the use of Onfido or SMART identity services
Blue ticks for reach are gross. Blue ticks free to everyone, to verify someone is who they say they are, keep people responsible for their actions/posts. I don't suggest forcing people to externally identify with their ID, because that is also gross and not equitable.
this is great. would love to hear your thoughts on warpcast and web3 alternatives if you have any!