2022-06-29 13:42:12 +00:00
|
|
|
import os
|
|
|
|
import json
|
|
|
|
import numpy
|
|
|
|
|
|
|
|
from .text_tokenizer import TextTokenizer
|
|
|
|
|
2022-06-30 10:43:10 +00:00
|
|
|
class MinDalleBase:
    """Shared base for min-DALL·E model wrappers.

    Locates the pretrained checkpoint directory for the chosen model
    size, loads the BPE vocab/merges files, and converts prompt text
    into the fixed (2, 64) int32 token layout the model expects.
    """

    def __init__(self, is_mega: bool):
        """Resolve the model directory and build the text tokenizer.

        :param is_mega: if True use the 'mega' checkpoint, else 'mini'.

        Reads ``vocab.json`` and ``merges.txt`` from
        ``pretrained/dalle_bart_{mini|mega}``.
        """
        self.is_mega = is_mega
        model_name = 'dalle_bart_{}'.format('mega' if is_mega else 'mini')
        self.model_path = os.path.join('pretrained', model_name)

        print("reading files from {}".format(self.model_path))
        vocab_path = os.path.join(self.model_path, 'vocab.json')
        merges_path = os.path.join(self.model_path, 'merges.txt')

        with open(vocab_path, 'r', encoding='utf8') as f:
            vocab = json.load(f)
        with open(merges_path, 'r', encoding='utf8') as f:
            # The first line of merges.txt is a version header and the
            # file ends with a trailing newline — drop both ([1:-1]).
            merges = f.read().split("\n")[1:-1]

        self.tokenizer = TextTokenizer(vocab, merges)

    def tokenize_text(self, text: str) -> numpy.ndarray:
        """Tokenize *text* into the model's (2, 64) int32 input array.

        Row 0 carries only the first and last tokens (presumably the
        BOS/EOS sentinels produced by the tokenizer — the padded prompt
        row); row 1 carries the full token sequence.  Unused positions
        keep the pad value 1.

        :param text: prompt string to tokenize.
        :return: numpy array of shape (2, 64), dtype int32.
        """
        print("tokenizing text")
        tokens = self.tokenizer.tokenize(text)
        print("text tokens", tokens)
        # Fix: a long prompt can yield more than 64 tokens, which would
        # make the row-1 assignment below raise ValueError (shape
        # mismatch against the 64-wide row).  Clamp to the model's
        # 64-token context instead of crashing.
        if len(tokens) > 64:
            tokens = tokens[:64]
        text_tokens = numpy.ones((2, 64), dtype=numpy.int32)
        text_tokens[0, :2] = [tokens[0], tokens[-1]]
        text_tokens[1, :len(tokens)] = tokens
        return text_tokens
|