MongoDB Sync Issues with MeiliSearch

Using meilisync

Thanks to the strong support from yzqzss

tl;dr: I forked and significantly modified the program; see

Broken Frontend#

First, I tried using his Admin Console, https://github.com/long2ice/meilisync-admin

There is only an AMD64 image, no ARM build. Well, who doesn't have an x86_64 machine? So I ran it on a machine that doesn't host the database. At the time I didn't realize what it actually means to pull data from machine A onto machine B and then shove it into machine C; in short, I was being foolish.

Download the image, run it... wait, why does a database sync admin panel also need MySQL and Redis URLs? I didn't understand, but I configured them anyway.

Then...

  1. No initial account (ref #7)

    The solution was to manually write the email and password into the database, which also meant creating the bcrypt hash by hand (see the sketch after this list).

  2. Creating a MongoDB data source fails with Unknown option user (#11)

    The reason is that different databases expect different parameters in the backend configuration file. The web form only passes the PostgreSQL-style user field, and the backend shoves it straight through without translating it, which blows up the sync program.

    I worked around it by modifying the request packet and replaying it.

    I originally wrote a fix, but then found that the parameter is correct for PostgreSQL, so... whatever, I felt it wasn't something I could untangle. And honestly, would anyone still want to use this after seeing all of that... right?

  3. After setting everything up, it still wouldn't run... the backend produced the following error (a ton of output trimmed):

    2025-04-11 21:38:42.156 | INFO     | uvicorn.protocols.http.httptools_impl:send:496 - 10.0.1.1:64000 - "POST /api/sync HTTP/1.1" 500
    ERROR:    Exception in ASGI application
    Traceback (most recent call last):
      File "/meilisync_admin/meilisync_admin/models.py", line 64, in meili_client
        self.meilisearch.api_url,
        ^^^^^^^^^^^^^^^^^^^^^^^^
    AttributeError: 'QuerySet' object has no attribute 'api_url'
    

    Uh... it seems like it's not just a simple configuration file issue...
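
For reference, the bcrypt hash mentioned in item 1 can be generated with a few lines of Python. This is a minimal sketch, assuming the bcrypt package is available; it makes no assumptions about meilisync-admin's actual table layout, so where the hash ends up is up to you.

# Minimal sketch: generate a bcrypt hash for the manually created admin account.
# The password below is a placeholder; store the printed hash in the account's
# password column of whatever table meilisync-admin uses.
import bcrypt

password = "change-me".encode("utf-8")
hashed = bcrypt.hashpw(password, bcrypt.gensalt())
print(hashed.decode("utf-8"))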

So I switched to the CLI, ruling out the possibility that the demonic Admin Console was the source of the problem, thinking this was the beginning of the end; it turned out to be only the end of the beginning.

Delayed Docker#

I found the actual program for synchronization, https://github.com/long2ice/meilisync

I still gave Docker a try, since it is the "Recommended" method, using the compose file from his README:

version: "3"
services:
  meilisync:
    image: long2ice/meilisync
    volumes:
      - ./config.yml:/meilisync/config.yml
    restart: always

The pulled image had issues.

I ran into TypeError: 'async for' requires an object with __aiter__ method, got list (#94), and there was also TypeError: 'async for' requires an object with __aiter__ method, got coroutine (#76).

Hmm, so I need to use the dev version.

Ah, right, as mentioned above, the MongoDB user field is username, which differs from the template. Who knows how I managed to guess that it was username back then.

After a few back-and-forths with the configuration file, I didn't want to deal with Docker anymore, so I switched to the local CLI.

Local Python#

During the local CLI phase, everything seemed to be going in a good direction.

The few issues: although there is a pip install meilisync[mongo] extra for MongoDB, installing just that is not enough; running any command will bluntly tell you that something is missing. You have to pip install meilisync[all] to get everything.

There was also a small zsh gotcha: zsh treats the square brackets as a glob pattern, so the extras spec has to be quoted, e.g. pip install 'meilisync[all]'.

$ pip install meilisync[mongo]
zsh: no matches found: meilisync[mongo]

With the config finally written, everything seemed to be going in a good direction... or was it?

Exploding Progress#

As mentioned in the replies to #17, when MongoDB is the data source, progress.json may not be generated automatically, which leads to a pile of TypeError: meilisync.progress.file.File.set() argument after ** must be a mapping, not NoneType.

The workaround is to touch progress.json first and then write example content into it, such as:

{"resume_token": {"_data": "8267FBA647000000022B042C0100296E5A10046F963A9EB7AB4D14B8CF191E8E5E8D67463C6F7065726174696F6E54797065003C696E736572740046646F63756D656E744B65790046645F6964006467FBA6470D168B18625CC73E000004"}}

Heavy Logs#

After testing and finding no major issues, I turned off debug in the configuration file, started it under nohup, and went off to do other things, until a disk space alarm dragged me back to the shell. Syncing the database to a new location obviously takes space, and I was prepared for that. What I really didn't expect was that this disk would fill up first, growing even faster than the machine hosting the Meili database: the harmless-looking synchronizer had dumped 6 GB of logs in my face.

That can't be right; I clearly wrote debug: false in the configuration file...

I dug into the huge log and found that it recorded the full content of every single synchronized document in plain text...

It was actually caused by the default plugin instance.

In the configuration file, there was the following content:

debug: false
plugins:
  - meilisync.plugin.Plugin

That plugins entry points to https://github.com/long2ice/meilisync/blob/dev/meilisync/plugin.py:

class Plugin:
    is_global = False

    async def pre_event(self, event: Event):
        logger.debug(f"pre_event: {event}, is_global: {self.is_global}")
        return event

    async def post_event(self, event: Event):
        logger.debug(f"post_event: {event}, is_global: {self.is_global}")
        return event

This plugin writes a debug log line for every event, regardless of what the debug setting in the configuration file says; that flag does not control loguru's log level.

There are three solutions:

  1. Do not reference this plugin

  2. Modify the plugin content (a sketch follows further below)

  3. Change the global log level:

    Meilisync uses loguru. According to its documentation you can set the log level, and per its environment variable documentation you can set LOGURU_LEVEL, which accepts the values shown in the table below:

    | Level name | Severity value | Logger method     |
    | ---------- | -------------- | ----------------- |
    | TRACE      | 5              | logger.trace()    |
    | DEBUG      | 10             | logger.debug()    |
    | INFO       | 20             | logger.info()     |
    | SUCCESS    | 25             | logger.success()  |
    | WARNING    | 30             | logger.warning()  |
    | ERROR      | 40             | logger.error()    |
    | CRITICAL   | 50             | logger.critical() |

    Then set the environment variable:

    On Unix:

    export LOGURU_LEVEL=INFO
    

    On Windows:

    PowerShell

    $env:LOGURU_LEVEL="INFO"
    

    CMD

    set LOGURU_LEVEL=INFO
    

After that, I saved a ton of space...
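
For completeness, option 2 could look roughly like the following: a drop-in plugin that keeps the same hooks as the stock one but skips the per-event logging. This is a sketch, not code from the project; save it in a module of your own and point the plugins entry in config.yml at it instead of meilisync.plugin.Plugin.

# Sketch of option 2: a quiet replacement for the default plugin.
# It keeps the pre_event/post_event hooks but never logs event contents.
class QuietPlugin:
    is_global = False

    async def pre_event(self, event):
        # pass the event through untouched, without logging it
        return event

    async def post_event(self, event):
        return event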

Somewhere in all these debugging sessions, I migrated this service from machine B to machine C, where Meilisearch itself lives. This later proved to improve speed significantly.

Index Concerns#

My data has its own id field, but it caused various issues in actual use, so I still use _id as the primary key.

Hands-on Script Modification#

Corrected Types#

It seemed everything was running normally for a while, but the program just died. After several checks, I found it was TypeError: Object of type ObjectId is not JSON serializable. At this point, the progress was about 1,140,000 records.

There was also a GitHub Issue, #16, stating it was "fixed." I checked the local code, and it indeed contained the fix. However, it seemed to still appear in #102, and there was no response this time.

The hardest part was not fixing the code but reproducing the error. Because progress resets after an error, every attempt started from the beginning, and it took 20 minutes to reach the failure point, so several hours went into this... The CPU was also pegged the whole time... I should be thankful I wasn't on some small provider's machine that would have tripped a breaker...

By the way, this is when I started using sentry.io. It's odd that the author specifically left a sentry.io hook in a synchronization tool, but it turned out to be genuinely useful. Maybe the author knew there would be bugs all over the place?

Local Modifications#

So I implemented detection and repair myself. At first I tried to do it in that plugin, but I couldn't work out how to get at the content, so I decided to just hack the source! I added extra type checks. I didn't want to keep reinstalling with pip, so I edited the files in site-packages directly, which was quick and effective.

Not long after the ObjectId fix, things ran cleanly for more than half an hour and progress crept forward. I was just getting ready to sleep when another error hit, this time Object of type datetime is not JSON serializable, similar to #31. I added the same kind of check. This one was at around record 5,270,000, roughly a third of the way through. With that fixed the sync could continue, and once it was past the halfway mark I went to sleep with peace of mind.
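
The type checks I patched in were essentially a normalization pass like the one below; this is a sketch of the idea, not the actual diff I made against site-packages:

# Sketch: before handing a document to Meilisearch, stringify the values its
# JSON encoder cannot handle (bson.ObjectId and datetime).
from datetime import datetime

from bson import ObjectId


def normalize(doc: dict) -> dict:
    out = {}
    for key, value in doc.items():
        if isinstance(value, ObjectId):
            out[key] = str(value)
        elif isinstance(value, datetime):
            out[key] = value.isoformat()
        elif isinstance(value, dict):
            out[key] = normalize(value)  # recurse into nested documents
        else:
            out[key] = value
    return out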

Then I woke up to a thunderclap: at about two-thirds of the way through, it hit Client error '408 Request Timeout' for url 'http://127.0.0.1:7700/tasks/xxxx'. There was really nothing to be done: too many items were being submitted for indexing, Meilisearch couldn't keep up, and tasks piled up until everything went boom. The odd thing is that this was supposedly fixed, as mentioned in #13, yet for some reason it still blew up. I checked how far behind it was and found that one hour of running produced about half an hour of backlog... I should be thankful I didn't also run into the "Too many open files" issue...
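
For reference, the size of that backlog can be checked by polling Meilisearch's /tasks endpoint and filtering on unfinished statuses. A minimal sketch; the URL and API key are placeholders for your own instance:

# Minimal sketch: count how many indexing tasks are still waiting.
import requests

resp = requests.get(
    "http://127.0.0.1:7700/tasks",
    params={"statuses": "enqueued,processing"},
    headers={"Authorization": "Bearer REDACTED_API_KEY"},
)
resp.raise_for_status()
print(f"{resp.json()['total']} tasks still enqueued or processing")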

Indexing was slow... wait, come to think of it, the indexes shouldn't even be built this eagerly in the first place!

Delayed Indexing#

I decided to try delaying it and discovered a terrifying fact: when creating an index, meilisync does not specify any field settings at all. Every field in a document therefore ends up displayed and searchable, which consumes a lot of resources and is a terrible waste. We absolutely don't need to index all the content from the very beginning. Instead, while synchronizing data from the remote source, no index settings should be applied at all; only after everything from the source has been inserted should the indexing be configured, and the config file should let you specify which fields are indexed and how (searchable, sortable, filterable, none).
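
For context, applying that kind of per-field configuration afterwards boils down to a settings update on the index. A sketch with the official meilisearch Python client; the index name, API key and field lists are placeholders, not my actual meilisync changes:

# Sketch: apply searchable/filterable/sortable attributes after the bulk sync,
# using the official `meilisearch` Python client. All names are placeholders.
import meilisearch

client = meilisearch.Client("http://127.0.0.1:7700", "REDACTED_API_KEY")
index = client.index("your_index")

index.update_settings({
    "searchableAttributes": ["name", "title", "content"],
    "filterableAttributes": ["id", "fid", "parent", "userid"],
    "sortableAttributes": ["id", "now", "parent"],
})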

So I wrote it. The current logic is that no index settings are applied while data is being synced from the remote, which made the sync something like ten thousand times faster.

Then I wrote the part that applies the index settings after synchronization by updating the Meilisearch settings. Everything looked perfectly normal, until I woke up and the indexes were still not there.

That shouldn't happen. On closer inspection, the settings update task had indeed been submitted, but it blew up halfway through:

Index `nmbxd`: internal: MDB_TXN_FULL: Transaction has too many dirty pages - transaction too big.

Finally, it wasn't a meilisync issue!

Memory Optimization#

After checking more carefully, I found it hadn't actually been running for hours at all: it ran for a few minutes before the transaction got too big, then kept retrying with smaller and smaller batches until there was nothing left to try.

Training!

After being reminded by @yzqzss, the textbook implementation would be to set the attributes first and then sync the data, following best practice, but that might bring the 408 back. Solving the 408 properly would mean implementing a queue, and I was too lazy to write more code.

Later, I found a flag that reduces memory usage during indexing (see https://github.com/meilisearch/meilisearch/issues/3603), --experimental-reduce-indexing-memory-usage, and it really did work wonders: it succeeded in one go.

As for updates, running meilisync refresh is sufficient.

After that, I wrote a systemd timer to synchronize regularly, as follows.

# /etc/systemd/system/meilisync.timer
[Unit]
Description=Run meilisync refresh nmbxd weekly on Monday at 5 AM

[Timer]
# Run every Monday at 5:00 AM local time
OnCalendar=Mon *-*-* 05:00:00
Persistent=false

[Install]
WantedBy=timers.target
# /etc/systemd/system/meilisync.service
[Unit]
Description=Meilisync Refresh nmbxd
#After=network.target

[Service]
Type=oneshot
WorkingDirectory=/path/to/meilisync/config/
Environment='LOGURU_LEVEL=DEBUG' 
ExecStart=/etc/meilisync/meilisync/bin/meilisync refresh

Configuration file /path/to/meilisync/config/config.yml:

debug: false
progress:
  type: file
source:
  type: mongo
  host: REDACTED
  port: REDACTED
  username: 'REDACTED'
  password: 'REDACTED'
  database: REDACTED
meilisearch:
  api_url: http://127.0.0.1:REDACTED
  api_key: REDACTED
  insert_size: 10000
  insert_interval: 10
sync:
  - table: REDACTED
    index: REDACTED
    full: true
    pk: _id
    attributes:
      id: [filterable, sortable]
      fid: [filterable]
      img: [filterable]
      ext: [filterable, sortable]
      now: [filterable, sortable]
      name: [searchable]
      title: [searchable]
      content: [searchable]
      parent: [filterable, sortable]
      type: [filterable]
      userid: [filterable]
sentry:
  dsn: ''
  environment: 'production'